  * To install the package, go to a target directory. The authors recommend using the following command: <file bash>svn checkout http://airhead-research.googlecode.com/svn/trunk/sspace sspace-read-only</file>
  * A new directory should have been created. Go into that directory and run the command <file bash>ant</file>. Ant is part of the Apache project and is used to build Java libraries. It will automatically detect the file build.xml and build from it. I explained [[rewSemVector|here]] how to install ant.
  * If you want to make direct use of the .jar files, you will also need to run the corresponding build command (see the combined sketch below).
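
The installation steps above, combined into a single shell session. This is only a sketch built from the commands listed here; it assumes the checkout succeeds and that the Ant build file sits in the root of the checked-out directory:
<file bash>
# Check out the sources (the directory name sspace-read-only is set by this command)
svn checkout http://airhead-research.googlecode.com/svn/trunk/sspace sspace-read-only

# Build the library with Ant from the checkout root
cd sspace-read-only
ant
</file>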
  
==== Testing ====
  * The package provides a user interface, i.e., a class to use the S-Space package from the terminal.
  * The package provides utilities to process 'raw text', meaning that these utilities presuppose corpus pre-processing! The user may select how the text files are structured, e.g., a single string data file, separate files, etc. (a small example is sketched below). Check [[http://code.google.com/p/airhead-research/wiki/DocumentParsing|here]] for a short tutorial.
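
As a concrete illustration of the single-file layout used in the trials below (one document per line, gzip-compressed like the wp500 corpus), here is a minimal sketch; the file name test_corpus.txt.gz is a made-up example, not part of the package:
<file bash>
# Build a tiny test corpus with one document per line, then compress it
printf '%s\n' \
  "the first toy document about semantic spaces" \
  "a second toy document about random indexing" > test_corpus.txt
gzip test_corpus.txt

# Each line counts as one document
zcat test_corpus.txt.gz | wc -l
</file>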

 --- //[[eapontep@uos.de|Eduardo Aponte]] 2010/11/16 10:38//

=== First Trial ===
  * The trials I am performing now are based only on the pre-built .jar files, executing the programs from the command line, i.e., without modifying any class.
  * I began with a very simple trial on the whole corpus using LSA without threads. As expected, after 20 minutes the process finished with a memory error (a possible mitigation is sketched below).
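
Since the failure was a plain out-of-memory error, an obvious mitigation is to raise the JVM heap limit with the standard -Xmx option of the java launcher. A minimal sketch: the 4g value is an arbitrary guess, and the tool arguments are simply those of the command shown in the second trial below:
<file bash>
# Give the JVM up to 4 GB of heap; the remaining arguments are those of the second trial below
java -Xmx4g -jar /net/data/CL/projects/wordspace/software_tests/sPackage/sspace-read-only/bin/lsa.jar \
  -dwp500_articles_hw.latin1.txt.gz -X200 -t10 -v -n100 results/firstTry.sspace
</file>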

=== Second Trial ===

  * I ran the following command on the corpus: <file bash>java -jar /net/data/CL/projects/wordspace/software_tests/sPackage/sspace-read-only/bin/lsa.jar -dwp500_articles_hw.latin1.txt.gz  -X200 -t10 -v -n100 results/firstTry.sspace</file> This command should read 200 documents from the corpus (the first 200 lines), use 10 threads (I do not know how the work is distributed), and apply SVD (the default) with 100 dimensions. I did not monitor memory, although I enabled verbose terminal output.
    * Reassuringly, I got the following:<file bash>FINE: Processed all 200 documents in 0.271 total seconds</file>
    * However, as expected, the process got stuck during SVD:<file bash>Nov 21, 2010 12:44:10 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
 +INFO: reducing to 100 dimensions
 +Nov 21, 2010 12:44:10 PM edu.ucla.sspace.matrix.MatrixIO matlabToSvdlibcSparseBinary
 +INFO: Converting from Matlab double values to SVDLIBC float values; possible loss of precision</file>
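
Memory was not monitored during these runs; a lightweight way to watch the java process while the SVD step hangs is to poll its resident set size with standard tools. A minimal sketch, assuming GNU ps on Linux and a single running java process:
<file bash>
# Poll PID, resident memory (KB) and elapsed time of every running java process every 30 s
while true; do
  ps -C java -o pid,rss,etime,args --no-headers
  sleep 30
done
</file>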

=== Trials with LSA ===

I performed a number of trials with LSA. These trials were intended to measure the time and memory consumed by the different SVD algorithms compatible with the LSA implementation (a measurement sketch follows the list). Available from the command line are:

  * SVDLIBC
  * Matlab
  * GNU Octave
  * JAMA
  * COLT

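To record wall-clock time and peak memory per run in a comparable way, each command can be wrapped in GNU time. A minimal sketch: the log file name timing_svdlibc.log is chosen here only for illustration, and the tool arguments are those of the second trial above:
<file bash>
# GNU time with -v reports elapsed time and maximum resident set size; -o writes the report to a file
/usr/bin/time -v -o timing_svdlibc.log \
  java -jar /net/data/CL/projects/wordspace/software_tests/sPackage/sspace-read-only/bin/lsa.jar \
  -dwp500_articles_hw.latin1.txt.gz -X200 -t10 -v -n100 results/firstTry.sspace
</file>
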
I did not make an extensive review of the implementations but rather started testing each algorithm. As previous results showed, using the default algorithm (SVDLIBC) generated strange results: the angular distance between vectors was extremely low among close neighbors. Two possible reasons were identified. Either the number of dimensions was too low in relation to the number of documents, in which case performing SVD would collapse the distances and create an extremely dense vector space; or there was a bug in the implementation. My supposition is that, since the implementation requires a conversion pipeline between the internal matrix format and SVDLIBC, a loss of precision caused the problem. If that were the case, selecting Matlab for the SVD should solve it (because the Matlab format and the internal format of the LSA implementation are identical).

I performed a test with 30000 documents and 200 dimensions with SVDLIBC, MATLAB, OCTAVE, and COLT. The results were in part disappointing because, with the exception of MATLAB, all algorithms ran out of memory (in particular, the pipeline between LSA and the SVD algorithm ran out of heap memory).

  * SVDLIBC: ran out of memory after 6459 seconds.
  * Matlab: after 5083 seconds it returned a 450 MB .sspace file.
  * GNU Octave: ran out of memory after 7624 seconds.

{{:software:stats.png|Statistics 1}}

Visual inspection suggests that the problems regarding the density of the vector space are solved by using MATLAB as the default algorithm.

{{:software:statmatlab.png|}}

Finally, I compared the scalability of Random Indexing and LSA (using SVDLIBC with 100 dimensions):

{{:software:stats3.png|}}

It is clear that LSA can hardly handle large corpora. Although the results for Random Indexing are different, they suggest a similar conclusion.

I wrote a simple script that automatically documents the results of every experiment. It can be found under the name "myScript.sh" in the corpora directory. The results are written to the directory statistics. A Python script automatically generates a Graphviz representation in the directory vizImages; since these files are intended to be rendered with twopi, they carry the corresponding extension (a rendering sketch follows).
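
To turn those Graphviz files into images, they can be rendered with the twopi layout engine. A minimal sketch, assuming the generated files end in .twopi; the PNG output names are chosen here only for illustration:
<file bash>
# Render every generated graph in vizImages as a PNG using the twopi layout
for f in vizImages/*.twopi; do
  twopi -Tpng "$f" -o "${f%.twopi}.png"
done
</file>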