Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
|
software:rewinfomap [2010/10/31 12:28] eapontep created |
software:rewinfomap [2010/12/07 11:58] (current) eapontep |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| This page is under construction! | This page is under construction! | ||
| + | |||
| + | |||
| + | ==== General ==== | ||
| * **Infomap NLP Software: No longer in development. The authors recommend using SemanticVectors instead!!!** | * **Infomap NLP Software: No longer in development. The authors recommend using SemanticVectors instead!!!** | ||
| Line 11: | Line 14: | ||
| --- // | --- // | ||
| + | |||
| + | ==== Installation ==== | ||
| + | |||
| + | * Before installing Infomap, you have to install the gdbm libraries on your computer. This can be quite challenging. Below I document the installation process I followed. | ||
| + | - As a first step, download the latest version of gdbm. | ||
| + | - Untar the .gz file and go into the created directory. | ||
| + | - Try: <file bash> | ||
| + | - The last command overwrote all the libtool-related files in the directory. Now you can run <file bash> | ||
| + | - You might also have problems with the ANSI C headers. To solve this problem< | ||
| + | |||
| + | |||
| + | ==== Testing ==== | ||
| + | |||
| + | The first step in building a model is to choose a directory where the models will be created. This is done by setting an environment variable <file bash> | ||
| + | export INFOMAP_WORKING_DIR</ | ||
| + | Afterwards, build the model. Infomap accepts two input formats: a single file in which documents are separated by XML markers, or a set of files in which every file contains exactly one document. I decided to use the second option. As input, there should be a file listing the names of the files, each of which contains one document.< | ||
| + | Remember to add Infomap to your PATH variable. The installation includes a manual of all the applications available. | ||
| + | In the corpora directory, you will find a simple Python script for building a corpus from a file in which every line is a document. Afterwards I used the following command:< | ||
| + | To change the default configuration of the model, you would need to change the file: ??. I ran tests only with the default configuration (including reduction to 100 dimensions). ' | ||
| + | Two tests were run, and the resulting models are available on the server. The first, firstModel, used approximately 30000 documents (minus corrupted documents) from the Wiki Corpus. Constructing the model took less than five minutes, and the resulting directory occupies 65 MB. | ||
| + | |||
| + | {{: | ||
| + | |||
| + | A second test was conducted again with the Wiki-Corpus, | ||
| + | |||
| + | {{: | ||
| + | |||
| + | In order to access the models, the standard command is<file bash> | ||
| + | Among the options, it is possible to obtain a word vector, the nearest neighbors of a word, or the word-document vector. Consider:< | ||
| + | document_100.txt: | ||
| + | document_80694.txt: | ||
| + | document_162763.txt: | ||
| + | document_95383.txt: | ||
| + | document_176694.txt: | ||
| + | document_155572.txt: | ||
| + | document_197410.txt: | ||
| + | document_101202.txt: | ||
| + | document_144550.txt: | ||
| + | document_164895.txt: | ||
| + | </ | ||
| + | This command retrieves the information from the model in < | ||
| + | i.e., looking for words instead of documents close to ' | ||
| + | angry: | ||
| + | kid: | ||
| + | girlfriend: | ||
| + | jake: | ||
| + | boyfriend: | ||
| + | scare: | ||
| + | vicious: | ||
| + | feel: | ||
| + | bizarre: | ||
| + | This turned out to be the entry for Kubrick's film "A Clockwork Orange" | ||
| + | An interesting option provided by Infomap is to install a model. This option is preferred for final results, which should be available to several users. According to the manual, installing a model amounts to little more than moving a selected number of files from a non-installed model directory to a directory available system-wide. This option is intended to keep intermediate and final results apart. | ||
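
The idea behind installing a model, as described above, can be sketched in a few lines of bash. This is only an illustration of "copy a selected set of files to a shared directory"; it is not Infomap's own install procedure. The function name `install_model`, both directory paths, and the file names passed to it are assumptions made for the example. The actual list of files to move is given in the manual.

```bash
#!/usr/bin/env bash
# Sketch only: copies a chosen subset of files from a built model directory
# to a shared directory. This is NOT Infomap's install tool, and the file
# names passed in are placeholders, not the real list from the manual.

install_model() {
    local src_dir=$1    # non-installed model directory (hypothetical path)
    local dest_dir=$2   # system-wide target directory (hypothetical path)
    shift 2             # remaining arguments: names of the files to copy

    if [ ! -d "$src_dir" ]; then
        echo "source model directory not found: $src_dir" >&2
        return 1
    fi

    # Create the shared directory if it does not exist yet.
    mkdir -p "$dest_dir" || return 1

    local name
    for name in "$@"; do
        # Stop at the first missing file so a partial install is noticed.
        if [ ! -f "$src_dir/$name" ]; then
            echo "missing model file: $src_dir/$name" >&2
            return 1
        fi
        # -p keeps timestamps and permissions of the original files.
        cp -p "$src_dir/$name" "$dest_dir/" || return 1
    done
}
```

A call might look like `install_model "$INFOMAP_WORKING_DIR/firstModel" /path/to/shared/firstModel fileA fileB`, where `fileA` and `fileB` stand in for the files named in the manual. Copying instead of moving leaves the working directory intact, so the intermediate results stay separate from the installed ones.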