Differences

This shows you the differences between two versions of the page.

software:rewinfomap [2010/12/05 21:14]
eapontep [Testing]
software:rewinfomap [2010/12/07 11:58] (current)
eapontep
Line 29: Line 29:
 The first step in building a model is to choose a directory where the models will be created. This is done by setting an environment variable:<file bash>INFOMAP_WORKING_DIR=/home/jrandom/infomap_models
 export INFOMAP_WORKING_DIR</file>
- 
 Afterwards, build the model. Infomap accepts two formats: a single file where documents are divided by XML markers, or a set of files where every file contains exactly one document. I decided to use the second option. As input, there should be a file listing the name of every file that contains a document.<file bash>infomap-build -m /usr/local/share/corpora/manyNames.txt many_01</file>
 +Remember to add Infomap to your PATH variable. The installation includes a manual of all the applications available. 
 +In the corpora directory, you will find a simple Python script for building a corpus from a file where every line is a document. Afterwards I used the following command:<file bash>infomap-build -m /net/data/CL/projects/wordspace/software_tests/corpora/infoCorpus/directory.txt firstModel</file>
 +In order to change the default configuration of the model, you would need to change the file: ??. I ran tests only with the default configuration (including reduction to 100 dimensions). 'directory.txt' is a file containing the name of every document file in the directory where the corpus is saved. Although the manual doesn't specify what markers should be used, listing every file name on its own line works. The option '-m' (or '-sf' for a single file) specifies the type of corpus. Finally, 'firstModel' is the name of the model created in 'INFOMAP_WORKING_DIR'.
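The corpus preparation described above can be sketched in shell. This is only an illustration, not the Python script shipped in the corpora directory: the function name `build_corpus`, the `document_N.txt` naming, and the choice to write each file's path (rather than its bare name) into `directory.txt` are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: split a file holding one document per line into
# one file per document, and write a listing with one entry per line.
# Assumption: the listing holds the path of each document file.
build_corpus() {
    input=$1      # file with one document per line
    out_dir=$2    # directory that receives the document_N.txt files
    mkdir -p "$out_dir" || return 1
    n=0
    # The trailing test keeps a last line that lacks a final newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        # Write the document itself ...
        printf '%s\n' "$line" > "$out_dir/document_$n.txt"
        # ... and emit its path, collected into directory.txt below.
        printf '%s\n' "$out_dir/document_$n.txt"
    done < "$input" > "$out_dir/directory.txt"
}
```

With such a listing in place, the model would be built as shown above, passing the generated 'directory.txt' to 'infomap-build -m'.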
 +Two tests were run and the resulting models are available on the server. The first, firstModel, used approximately 30000 documents (minus corrupted documents) from the Wiki Corpus. Constructing the model took less than five minutes and the resulting directory occupies 65 MB.
 +
 +{{:software:vizinfo1.png|}}
 +
 +A second test was conducted, again with the Wiki Corpus, this time with 200000 documents. Constructing the model took less than 10 minutes. The resulting directory occupies 312 MB.
  
-Remember to add infomap to your PATH variable.
 +{{:software:vizinfo2.png|}}
  
-In corpora directory, you will find a simple py script for building a corpora from a file where every line is a document. Afterwards I used the following command:<file bash>infomap-build -m /net/data/CL/projects/wordspace/software_tests/corpora/infoCorpus/directory.txt firstModel</file>
-directory.txt is a file contaning the name of every file contaning a document
-{{:software:vizinfo1.twopi.png|}}
 +In order to access the models, the standard command is<file bash>associate [<options>] <model> <word></file>
 +Among the options, it is possible to obtain a word vector, the nearest neighbors of a word, or the word-document vector. Consider:<file bash>associate -m <pathToTargetModel> -d -i d -n 10 document_100.txt
 +document_100.txt:1.000000
 +document_80694.txt:0.925041 
 +document_162763.txt:0.919077 
 +document_95383.txt:0.917450 
 +document_176694.txt:0.915522 
 +document_155572.txt:0.914388 
 +document_197410.txt:0.912332 
 +document_101202.txt:0.909776 
 +document_144550.txt:0.909703 
 +document_164895.txt:0.908825 
 +</file> 
 +This command queries the model in <pathToTargetModel>: the output is 10 (-n 10) documents (-d), and the input is a document (-i d), namely 'document_100.txt'. (On the server, the document can be found in './infoCorpus'.) After running <file bash>associate -m <pathToTargetModel> -w -i d -n 10 document_100.txt</file>
 +i.e., looking for words instead of documents close to 'document_100.txt', the result was:<file bash>seemingly:0.731527
 +angry:0.699753 
 +kid:0.693340 
 +girlfriend:0.676348 
 +jake:0.658571 
 +boyfriend:0.656249 
 +scare:0.652340 
 +vicious:0.651290 
 +feel:0.649538 
 +bizarre:0.643888</file> 
 +This turned out to be the entry for Kubrick's film "A Clockwork Orange" :-). The most closely related document corresponds to the film  [[http://www.youtube.com/watch?v=tcSMDqXT52s|"Pretty in Pink"]]
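Since the results above come back as 'name:score' lines, they are easy to post-process. The following is a minimal sketch, assuming the output is written to standard output in exactly that form and that names contain no colon; `filter_neighbours` is a hypothetical helper, not part of Infomap.

```shell
#!/bin/sh
# Hypothetical helper: keep only the neighbours whose similarity is at
# least the given threshold. Reads "name:score" lines on stdin and
# prints "name<TAB>score".
# Assumption: names contain no ':' so the score is the second field.
filter_neighbours() {
    awk -F: -v min="$1" '$2 + 0 >= min + 0 { print $1 "\t" $2 }'
}
```

A possible use, reusing only the options shown above, would be: associate -m <pathToTargetModel> -w -i d -n 10 document_100.txt | filter_neighbours 0.65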
 +An interesting option provided by Infomap is to install a model. This option is preferred for final results, which should be available to several users. According to the manual, installing a model amounts to little more than moving a selected set of files from a non-installed model directory to a directory available system-wide. This option is intended to keep intermediate and final results apart.