STATE

It works on my computer. I will write a script to run it on the whole American National Corpus and check the results, the memory usage and the type of vectors produced. I talked with the administrators, and the package should be running on the server within the next few days.

General

  • Developed at UCLA.
  • A set of Java libraries.
  • It is not finished, but it is not dead code either.
  • There is rich documentation on the algorithms and the implementation.
  • Since it is a collection of algorithms, we have to decide which ones we actually need.
  • "The focus of this framework is to ease the development of new algorithms and the comparison against existing models." (Jurgens, Stevens).
  • "Each word space algorithms is designed to run as a stand alone program and also to be used as a library class." (Jurgens, Stevens).
  • The library supports word-document vectors.
  • The authors state that it can build more than one context vector for a single word, depending on its sense (e.g. "bank" as an institution and "Bank" as a bench, i.e. the German "Sitzgelegenheit" :-))
  • "Libraries provide support for converting between multiple matrix formats, enabling interaction with external matrix-based program".
  • It supports SVD and randomized projections.
  • Judging from the plots, most of the algorithms seem to scale linearly.
  • The package consists of four types of tools:
    • A library (implementation) of commonly used algorithms in semantic spaces.
    • Tools for building semantic models
    • Evaluation tools (e.g. TOEFL test for synonyms).
    • Interaction tools (e.g. queries, etc.).
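To make the notion of word-document vectors from the list above concrete, here is a minimal sketch in plain Java. It does not use any S-Space class: the class name WordDocumentSketch, its methods and the three toy documents are assumptions made only for illustration. Each word is mapped to a vector with one count per document, and two words are compared with cosine similarity.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy word-document space in plain Java (hypothetical, not the S-Space API). */
final class WordDocumentSketch {

    /** Maps each word to a vector whose i-th entry counts its occurrences in document i. */
    static Map<String, double[]> build(List<String> documents) {
        Map<String, double[]> space = new LinkedHashMap<>();
        for (int doc = 0; doc < documents.size(); doc++) {
            // Naive tokenisation: lower-case and split on whitespace.
            for (String token : documents.get(doc).toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                space.computeIfAbsent(token, t -> new double[documents.size()])[doc]++;
            }
        }
        return space;
    }

    /** Cosine similarity of two equally long vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the bank lends money",
                "money is kept in the bank",
                "we sat on the river shore");
        Map<String, double[]> space = build(docs);
        // "bank" and "money" occur in the same documents, "bank" and "river" never do.
        System.out.println(cosine(space.get("bank"), space.get("money")));
        System.out.println(cosine(space.get("bank"), space.get("river")));
    }
}
```

In a real run the counts would usually be re-weighted (e.g. tf-idf) and reduced with SVD or a random projection before comparing words; the sketch leaves both steps out.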

Installation

  • Required Software
    • svn (Subversion). It can be installed with apt-get:
      sudo apt-get install subversion
  • To install the package, go to a target directory. The authors recommend the following command:
    svn checkout http://airhead-research.googlecode.com/svn/trunk/sspace sspace-read-only
  • A new directory should have been created. Go to that directory and run
    ant

    Ant is part of the Apache project and is used to build Java projects. It automatically detects the file build.xml and builds from it. I explained here how to install Ant.

Technical Issues

Testing

  • The S-Space package supports reading and writing several matrix file formats. Among those supported are
    • SVDLIBC text, sparse text, binary and sparse binary
    • Matlab and Octave dense text and sparse text formats
    • CLUTO sparse text
  • The package provides a user interface, i.e., a class for using the S-Space package from the terminal.
  • The package provides utilities to process 'raw text', which means that these utilities presuppose corpus pre-processing! The user can select how the text files are structured, e.g. as a single data file, as separate files, etc. Check here for a short tutorial.
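As an illustration of one of the matrix formats listed above, the following sketch writes a small dense matrix in the SVDLIBC sparse text layout. It is plain Java and not the S-Space matrix I/O code; the class name SvdlibcSparseTextSketch and its method are hypothetical. The layout assumed here follows my reading of the SVDLIBC documentation: a header line with the number of rows, columns and non-zero values, then for each column the number of its non-zero entries followed by one "row value" line per entry.

```java
/** Writes a dense matrix as SVDLIBC sparse text (hypothetical helper, not the S-Space API). */
final class SvdlibcSparseTextSketch {

    static String toSparseText(double[][] matrix) {
        int rows = matrix.length;
        int cols = rows == 0 ? 0 : matrix[0].length;
        int nonZero = 0;
        StringBuilder body = new StringBuilder();
        // The format is column-major: each column lists its own non-zero entries.
        for (int c = 0; c < cols; c++) {
            StringBuilder column = new StringBuilder();
            int inColumn = 0;
            for (int r = 0; r < rows; r++) {
                if (matrix[r][c] != 0) {
                    column.append(r).append(' ').append(matrix[r][c]).append('\n');
                    inColumn++;
                }
            }
            body.append(inColumn).append('\n').append(column);
            nonZero += inColumn;
        }
        // Header line: rows, columns, total number of non-zero values.
        return rows + " " + cols + " " + nonZero + "\n" + body;
    }

    public static void main(String[] args) {
        double[][] matrix = {
            {2.3, 0.0, 4.2},
            {0.0, 1.3, 2.2},
            {3.8, 0.0, 0.5},
            {0.0, 0.0, 0.0},
        };
        System.out.print(toSparseText(matrix));
    }
}
```

For the 4 x 3 matrix in main, the header line is "4 3 6", and the first column contributes the lines "2", "0 2.3" and "2 3.8".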

Eduardo Aponte 2010/11/16 10:38