software:rewinfomap

This page is under construction!

General

Infomap NLP Software: no longer under development. The authors recommend using SemanticVectors instead.
- Uses Latent Semantic Analysis
- The implementation is in C.
- Documentation: http://infomap-nlp.sourceforge.net/doc/
- Infomap is intended to build "language models" and to perform information retrieval tasks on such models.
- Simple input format.
- You might need the gdbm libraries. I had trouble installing these libraries on my laptop. At present it is not working.
- The documentation includes installation instructions, an algorithm description and an implementation guide.

--- Eduardo Aponte (eapontep@uos.de) 2010/10/31 12:28

Installation

Before installing Infomap you have to install the gdbm libraries on your computer. This can be quite challenging. In the following I document the installation process I followed.

- As a first step, download the latest version of gdbm.
- Untar the .gz file and go into the created directory.
- Try to configure the package:

```
./configure
```

This command tries to configure the program to your system specifications. It is highly likely that this process fails. The most likely reason is that the installed version of a system tool called libtool is not compatible. To check your version of this program (in Ubuntu):

```
apt-cache policy libtool
```

I presuppose you have libtool installed on your computer. You probably have a newer version of libtool than the one presupposed by the gdbm package. The solution I found was to run:

```
autoconf -f -oconfigure
```

This command overwrote all the libtool-related files in the directory. Now you can run

```
make
```

safely. If you obtain the following error (which is actually highly unlikely)

```
checking build system type... Invalid configuration `x86_64-unknown-linux-gnu': machine `x86_64-unknown' not recognized
```

you will need to deceive the program. Add before any command:
```
linux32
```

You might also have problems with the ANSI C headers. To solve this problem, run:
```
sudo apt-get install libc6-dev
```

Testing

The first step in building a model is to choose the directory where the models will be created. This is done by setting an environment variable:

```
INFOMAP_WORKING_DIR=/home/jrandom/infomap_models
export INFOMAP_WORKING_DIR
```

Afterwards, build the model. Infomap accepts two input formats: a single file in which documents are separated by XML markers, or a set of files in which every file contains exactly one document. I decided to use the second option. In that case the input is a file listing the names of the files that contain the documents:

```
infomap-build -m /usr/local/share/corpora/manyNames.txt many_01
```
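The two steps above (setting INFOMAP_WORKING_DIR, then calling infomap-build) can also be driven from a script. The following is a minimal Python sketch, assuming infomap-build is already on the PATH. The helper names build_command and build_model are hypothetical, and only the -m option shown above is used.

```python
import os
import subprocess


def build_command(corpus_list, model_name):
    """Return the infomap-build argument list for a multi-file corpus (-m)."""
    return ["infomap-build", "-m", corpus_list, model_name]


def build_model(corpus_list, model_name, working_dir):
    """Run infomap-build with INFOMAP_WORKING_DIR set for this call only."""
    # Assumption: creating the working directory up front is harmless.
    os.makedirs(working_dir, exist_ok=True)
    # Copy the current environment and add the variable Infomap reads.
    env = dict(os.environ, INFOMAP_WORKING_DIR=working_dir)
    # check=True raises CalledProcessError if infomap-build exits non-zero.
    subprocess.run(build_command(corpus_list, model_name), env=env, check=True)
```

With the paths from the example above, the call would be `build_model("/usr/local/share/corpora/manyNames.txt", "many_01", "/home/jrandom/infomap_models")`. Passing the variable through `env` keeps it local to the build instead of changing the shell configuration.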

Remember to add Infomap to your PATH variable. The installation includes a manual for all the available applications. In the corpora directory you will find a simple Python script for building a corpus from a file in which every line is a document. After running it, I used the following command:

```
infomap-build -m /net/data/CL/projects/wordspace/software_tests/corpora/infoCorpus/directory.txt firstModel
```
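The corpus preparation mentioned above (one document per line, split into one file per document, plus a listing file for -m) can be sketched as follows. This is not the script from the corpora directory, only an illustration. The function name split_corpus, the document_N.txt naming and the choice to list bare file names rather than full paths are assumptions.

```python
import os


def split_corpus(source_path, out_dir, listing_name="directory.txt"):
    """Write each non-empty line of source_path to its own file in out_dir.

    Returns the path of the listing file, which names one document per line.
    """
    os.makedirs(out_dir, exist_ok=True)
    names = []
    with open(source_path, encoding="utf-8") as source:
        for line in source:
            text = line.strip()
            if not text:
                continue  # skip blank lines rather than create empty documents
            name = f"document_{len(names)}.txt"
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as doc:
                doc.write(text + "\n")
            names.append(name)
    listing_path = os.path.join(out_dir, listing_name)
    with open(listing_path, "w", encoding="utf-8") as listing:
        # One file name per line. Bare names are an assumption; Infomap may
        # need full paths depending on where infomap-build is run from.
        listing.writelines(name + "\n" for name in names)
    return listing_path
```

The returned path is what would be passed to `infomap-build -m`.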

In order to change the default configuration of the model, you would need to change the file: ??. I ran tests only with the default configuration (including reduction to 100 dimensions).

'directory.txt' is a file containing the name of every document file in the directory where the corpus is saved. Although the manual does not specify which separators should be used, putting every file name on its own line works. The option '-m' (or '-sf' for single file) specifies the type of corpus. Finally, 'firstModel' is the name of the model created in 'INFOMAP_WORKING_DIR'.

Two tests were run, and the resulting models are available on the server. The first, firstModel, used approximately 30000 documents (minus corrupted documents) from the Wiki Corpus. Constructing the model took less than five minutes, and the resulting directory occupies 65 MB.

A second test was conducted, again with the Wiki Corpus, this time with 200000 documents. Constructing the model took less than 10 minutes. The resulting directory occupies 312 MB.

In order to access the models, the standard command is:

```
associate [<options>] <model> <word>
```

Among the options, it is possible to obtain a word vector, the nearest neighbors of a word, or the word-document vector. Consider:

```
associate -m <pathToTargetModel> -d -i d -n 10 document_100.txt
document_100.txt:1.000000
document_80694.txt:0.925041
document_162763.txt:0.919077
document_95383.txt:0.917450
document_176694.txt:0.915522
document_155572.txt:0.914388
document_197410.txt:0.912332
document_101202.txt:0.909776
document_144550.txt:0.909703
document_164895.txt:0.908825
```

This command retrieves the information from the model in <pathToTargetModel>. In particular, the output should be 10 (-n 10) documents (-d), and the input is interpreted as a document (-i d), here 'document_100.txt'. (On the server you will find the document in './infoCorpus'.) After performing

```
associate -m <pathToTargetModel> -w -i d -n 10 document_100.txt
```

i.e., looking for words instead of documents close to 'document_100.txt', the result was:

```
seemingly:0.731527
angry:0.699753
kid:0.693340
girlfriend:0.676348
jake:0.658571
boyfriend:0.656249
scare:0.652340
vicious:0.651290
feel:0.649538
bizarre:0.643888
```
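For further processing, the `name:score` lines printed by associate can be read into a list of pairs. The following is a minimal Python sketch, assuming the output has exactly the format shown in the listings above. The function name parse_associate_output is hypothetical.

```python
def parse_associate_output(text):
    """Parse 'name:score' lines into a list of (name, score) tuples."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        # Split on the last colon so names containing ':' stay intact.
        name, sep, score = line.rpartition(":")
        if not sep:
            continue  # blank line or a line without a score
        try:
            results.append((name, float(score)))
        except ValueError:
            continue  # text after the colon is not a number
    return results


# A few lines copied from the word query above.
sample = """seemingly:0.731527
angry:0.699753
kid:0.693340"""

for word, score in parse_associate_output(sample):
    print(f"{word}\t{score:.3f}")
```

Lines that do not match the expected format are skipped silently, which is convenient for quick inspection but would hide a changed output format.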

The query document, 'document_100.txt', turned out to be the entry for Kubrick's film "A Clockwork Orange". The most closely related document corresponds to the film "Pretty in Pink".

An interesting option provided by Infomap is to install a model. This option is preferred for final results that should be available to several users. According to the manual, installing a model amounts to little more than moving a selected set of files from a non-installed model directory to a directory available system-wide. This option is intended to keep intermediate and final results apart.