Differences

This shows you the differences between two versions of the page.

software:rewinfomap [2010/12/05 21:03]
eapontep [Testing]
software:rewinfomap [2010/12/07 11:52]
eapontep
Line 29: Line 29:
 The first step in building a model is to choose a directory where the models will be created. This is done by setting an environment variable:<file bash>INFOMAP_WORKING_DIR=/home/jrandom/infomap_models
 export INFOMAP_WORKING_DIR</file>
- 
 Afterwards, build the model. Infomap accepts two input formats: a single file in which documents are separated by XML markers, or a set of files in which every file contains exactly one document. I decided to use the second option. In that case the input is a file listing the name of each file that contains a document.<file bash>infomap-build -m /usr/local/share/corpora/manyNames.txt many_01</file>
 +Remember to add Infomap to your PATH variable. The installation includes a manual of all the applications available. 
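A minimal sketch of the PATH setup, assuming the Infomap binaries were installed under the hypothetical directory /usr/local/infomap/bin (adjust this to wherever your installation actually placed infomap-build and associate):

```shell
# Hypothetical install location; replace with your actual Infomap bin directory.
INFOMAP_BIN=/usr/local/infomap/bin

# Prepend it to PATH only if it is not already there.
case ":$PATH:" in
  *":$INFOMAP_BIN:"*) ;;
  *) PATH="$INFOMAP_BIN:$PATH" ;;
esac
export PATH

# Report whether the tools are now reachable.
command -v infomap-build >/dev/null 2>&1 || echo "infomap-build not found in $INFOMAP_BIN"
```

Putting these lines in your shell start-up file makes the setting persistent across sessions.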
 +In the corpora directory, you will find a simple Python script for building a corpus from a file in which every line is a document. Afterwards I used the following command:<file bash>infomap-build -m /net/data/CL/projects/wordspace/software_tests/corpora/infoCorpus/directory.txt firstModel</file>
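The script itself is not reproduced here. As a rough illustration of what such a conversion does, the following shell sketch splits a file with one document per line into one file per document. The input file name, the output directory, and the document_N.txt naming are assumptions made for illustration, and a temporary directory with toy data is used so the sketch runs anywhere.

```shell
# Temporary locations so the sketch runs anywhere; in practice these would be
# your own input file and corpus directory (both hypothetical here).
work_dir="$(mktemp -d)"
input_file="$work_dir/all_documents.txt"
corpus_dir="$work_dir/infoCorpus"
mkdir -p "$corpus_dir"

# Toy input: one document per line.
printf '%s\n' 'first toy document' 'second toy document' 'third toy document' > "$input_file"

# Write line N of the input to document_N.txt.
n=0
while IFS= read -r line || [ -n "$line" ]; do
  n=$((n + 1))
  printf '%s\n' "$line" > "$corpus_dir/document_$n.txt"
done < "$input_file"

echo "wrote $n documents to $corpus_dir"
```

The `|| [ -n "$line" ]` part keeps the last document even when the input file lacks a trailing newline.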
 +In order to change the default configuration of the model, you would need to change the file: ??. I ran tests only with the default configuration (including reduction to 100 dimensions). 'directory.txt' is a file containing the name of every document file in the directory where the corpus is saved. Although the manual doesn't specify which markers should be used, putting every file name on its own line works. The option '-m' (or '-sf' for a single file) specifies the type of corpus. Finally, 'firstModel' is the name of the model created in 'INFOMAP_WORKING_DIR'.
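A listing file such as 'directory.txt' can be generated rather than written by hand. The sketch below is one way to do it, assuming the documents are named document_*.txt and that full paths, one per line, are acceptable (the manual does not say, so treat both as assumptions). The temporary directory and toy files stand in for the real corpus directory.

```shell
# Stand-in corpus directory with three toy documents (hypothetical data).
corpus_dir="$(mktemp -d)"
for i in 1 2 3; do
  printf 'toy document %s\n' "$i" > "$corpus_dir/document_$i.txt"
done

# One file name per line; the name pattern keeps directory.txt itself out of the list.
find "$corpus_dir" -type f -name 'document_*.txt' | sort > "$corpus_dir/directory.txt"

cat "$corpus_dir/directory.txt"
```

The resulting file would then be passed to infomap-build with the '-m' option, as in the command above.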
 +Two tests were run, and the resulting models are available on the server. The first, firstModel, used approximately 30000 documents (minus corrupted documents) from the Wiki Corpus. Constructing the model took less than five minutes, and the resulting directory occupies 65 MB.
 +
 +{{:software:vizinfo1.png|}}
 +
 +A second test was conducted, again with the Wiki Corpus, this time with 200000 documents. Constructing the model took less than 10 minutes. The resulting directory occupies 312 MB.
 +
 +{{:software:vizinfo2.png|}}
  
-Remember to add infomap to your PATH variable.
 +In order to access the models, the standard command is<file bash>associate [<options>] <model> <word></file>
 +Among the options, it is possible to obtain a word vector, the nearest neighbors of a word, or the word-document vector. Consider:<file bash>associate -m <pathToTargetModel> -d -i d -n 10 document_100.txt 
 +document_100.txt:1.000000 
 +document_80694.txt:0.925041 
 +document_162763.txt:0.919077 
 +document_95383.txt:0.917450 
 +document_176694.txt:0.915522 
 +document_155572.txt:0.914388 
 +document_197410.txt:0.912332 
 +document_101202.txt:0.909776 
 +document_144550.txt:0.909703 
 +document_164895.txt:0.908825 
 +</file> 
 +This command retrieves information from the model in <pathToTargetModel>. In particular, the output should be 10 (-n 10) documents (-d), and the input should be a document (-i d). The input is the document 'document_100.txt' (on the server, you will find the document in './infoCorpus'). After performing <file bash>associate -m <pathToTargetModel> -w -i d -n 10 document_100.txt</file> 
 +i.e., looking for words instead of documents close to 'document_100.txt', the result was:<file bash>seemingly:0.731527 
 +angry:0.699753 
 +kid:0.693340 
 +girlfriend:0.676348 
 +jake:0.658571 
 +boyfriend:0.656249 
 +scare:0.652340 
 +vicious:0.651290 
 +feel:0.649538 
 +bizarre:0.643888</file> 
 +This turned out to be the entry for Kubrick's film "A Clockwork Orange" :-).
  
-In corpora directory, you will find simple py script for building corpora from a file where every line is a document. Afterwards I used the following command:<file bash>infomap-build -m /net/data/CL/projects/wordspace/software_tests/corpora/infoCorpus/directory.txt firstModel</file>
 +An interesting option provided by Infomap is to install a model. This option is preferred for final results, which should be available to several users. According to the manual, installing a model amounts to little more than moving a selected set of files from a non-installed model directory to a directory available system-wide. This option is intended to keep intermediate and final results apart.
-directory.txt is a file contaning the name of every file contaning a document. +
-{{:software:vizinfo.twopi.png|Using 30000 documents}}+