
Gensim Vector Space Modelling

Introduction

Gensim is a Python module for vector space modelling, and it is extremely neat. It implements an incremental stochastic singular value decomposition algorithm that can be computed on distributed computer networks.

Within the IKW network, I set up a local environment with which you can experiment.

Simple use case:

from gensim import corpora, models, similarities
 
# load corpus from List-Of-Words file (each line is a document, first line states number of documents)
# Doesn't have to be bzipped, but it CAN :-)
corpus = corpora.LowCorpus('/net/data/CL/projects/wordspace/software_tests/corpora/editedCorpusLOW.bz2')
 
# corpus is now represented as a sparse bag-of-words matrix. Loading also creates a dictionary that maps
# word IDs to words, which can be used like this:
# >>> corpus.id2word[2923]
 
# Transform this corpus using Latent Semantic Indexing - we want 200 topics.
lsi = models.LsiModel(corpus, numTopics=200, id2word=corpus.id2word)
 
# Done. We could now query for different topics, e.g.
 
lsi.printTopic(10, topN=5)
 
# would yield
# '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
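For illustration, the List-of-Words format mentioned in the comments above (first line: number of documents; every further line: one document as whitespace-separated words) can be parsed with plain Python. This is an illustrative sketch of the file format, not gensim's own loader:

```python
from collections import Counter

def parse_low(lines):
    """Parse a List-of-Words corpus: the first line gives the document count,
    every following line is one document as whitespace-separated words.
    Returns (id2word, corpus), where corpus is a list of sparse
    bag-of-words vectors, i.e. lists of (word_id, count) pairs."""
    lines = iter(lines)
    num_docs = int(next(lines))
    word2id, corpus = {}, []
    for _ in range(num_docs):
        words = next(lines).split()
        counts = Counter(word2id.setdefault(w, len(word2id)) for w in words)
        corpus.append(sorted(counts.items()))
    id2word = {i: w for w, i in word2id.items()}
    return id2word, corpus

# Toy corpus with two documents
id2word, corpus = parse_low(["2", "category algebra category", "functor operator"])
```

The sparse representation is what makes large corpora tractable: each document only stores the words it actually contains.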

Benchmark

I used the EditedCorpora file to benchmark performance. The file was converted into the List-of-Words format (by simply inserting a line containing the number of documents at the top). Afterwards, Latent Semantic Indexing was performed on the corpus. As the algorithm used for singular value decomposition is incremental, the memory load is constant and can be controlled by passing a chunks parameter to the constructor of the LSI model. This parameter controls how many documents are loaded into RAM at once; the default is 20000. Larger chunks speed things up, but also require more RAM. In distributed mode, this is also the number of documents passed to each worker over the network, so network transmission speed has to be factored into the choice of chunk size. For the following experiments, a chunk size of 1000 documents was used.
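The memory trade-off described above can be illustrated with a plain-Python chunking helper (an illustrative sketch, not gensim's internal code): only `chunks` documents are ever materialized in RAM at once, no matter how large the document stream is.

```python
def iter_chunks(docs, chunks=20000):
    """Yield lists of at most `chunks` documents from a (possibly huge)
    document stream, so memory use stays bounded by the chunk size."""
    buf = []
    for doc in docs:
        buf.append(doc)
        if len(buf) == chunks:
            yield buf
            buf = []
    if buf:
        yield buf  # final, possibly smaller chunk

# A stream of 2500 toy documents, processed with the chunk size used
# in the experiments above
batches = list(iter_chunks(range(2500), chunks=1000))
```

With chunks=1000, a 2500-document stream is processed as two full batches and one remainder batch.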

Loading the Corpus

Loading the corpus and transforming it into sparse vectors takes almost exactly 23 minutes on Quickie.

Single Mode
Distributed Mode

Please refer to the advanced usage section for details on how to set up Gensim in distributed mode. For testing the distributed mode of the algorithm, twelve 2.54 GHz dual-core boxes with 4 GB RAM each were used as workers, with one worker per core, totalling 24 workers.

Local GenSim Environment

Using Gensim within the IKW

Within the IKW network, there is a local installation of gensim (which in turn depends on NumPy >= 1.4, whereas the current version on IKW machines is 1.3, and on Pyro >= 4.1), residing in

/net/data/CL/projects/wordspace/gensim_local

To use it, change into this directory and run

source bin/activate
python

This loads the libraries installed locally in this directory and starts the Python interpreter. To stop using the virtual environment, simply run

deactivate
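Since the system-wide NumPy (1.3) is older than the 1.4 that gensim needs, it can be worth confirming which version the activated environment actually picks up. A small stdlib-only sketch for comparing dotted version strings (the version numbers are the ones stated above):

```python
def version_tuple(v):
    """Turn a dotted version string like '1.4.1' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def satisfies(installed, required):
    """True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)

# The IKW system NumPy (1.3) is too old; a virtualenv-installed 1.4.x is fine:
ok_system = satisfies("1.3", "1.4")
ok_local = satisfies("1.4.1", "1.4")
```

Inside the activated environment, you would compare `numpy.__version__` against "1.4" the same way.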

Installing GenSim

In theory, you can copy the entire gensim_local folder to your machine and run it from there, but I won't guarantee that this works in practice. Fortunately, it is easy to create such a local environment yourself:

# Skip this step if you already have virtualenv
easy_install virtualenv
# If you don't have sudo rights, install virtualenv locally:
# mkdir ~/opt
# easy_install --install-dir ~/opt virtualenv
 
# Create a clean virtual environment
virtualenv --no-site-packages myVirtualEnv
 
# And activate it
cd myVirtualEnv
source bin/activate
 
# Now, using pip, install the other stuff
bin/pip install numpy
bin/pip install gensim[distributed]
bin/pip install Pyro4

Advanced Usage

Distributed Mode

Theory

The distributed mode works using the Pyro4 library. The idea behind this library is that you can instantiate Python objects remotely, then forget that you instantiated them remotely and just work with them like normal objects. These objects will, however, consume the resources of their remote host. Plus, you don't have to rewrite a single line of code if you want to use them locally. Most of Pyro's functionality is neatly wrapped by Gensim and works out of the box. Of course, there are a few things you'll have to take care of:

  1. Tell Gensim which computers to use. We'll do this by running a little script on each computer on the network that we want to work for us. This script creates workers which can be enslaved. I wrapped the script in another script which automatically creates enough workers to match the number of CPUs in the box you're running it on; each worker will occupy one CPU.
  2. Workers will easily get lost in the network, hence we'll need a name server that keeps track of all the workers we have. Of course, there's a script for that, too. Behind the scenes, workers communicate with the name server over TCP/IP.
  3. Furthermore, our enslaved workers may be good at maths, but they behave like little children when it comes to sharing data between them. Therefore, we need a dispatcher that distributes the data to all workers evenly and handles the feedback from the workers. Our algorithm hands a task to the dispatcher, the dispatcher breaks it into chunks and gives them to the workers (using the name server to locate them), the workers report back, and we get our results.
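The division of labour described above can be sketched in plain Python. This is only an illustration of the roles involved (no actual networking; in the real setup, Pyro4 handles the remote calls, and summing a chunk stands in for a partial SVD update):

```python
class NameServer:
    """Keeps track of registered workers so the dispatcher can locate them."""
    def __init__(self):
        self.workers = []
    def register(self, worker):
        self.workers.append(worker)

class Worker:
    """Good at maths: processes one chunk of documents at a time."""
    def process(self, chunk):
        return sum(chunk)  # stand-in for the real per-chunk computation

class Dispatcher:
    """Splits a job into chunks, hands them to registered workers
    round-robin, and merges the results they report back."""
    def __init__(self, ns):
        self.ns = ns
    def run(self, data, chunks):
        results = []
        for i in range(0, len(data), chunks):
            worker = self.ns.workers[(i // chunks) % len(self.ns.workers)]
            results.append(worker.process(data[i:i + chunks]))
        return sum(results)  # merge the partial results

ns = NameServer()
for _ in range(4):  # e.g. four workers registered with the name server
    ns.register(Worker())
total = Dispatcher(ns).run(list(range(10)), chunks=3)
```

The algorithm only ever talks to the dispatcher; which worker handled which chunk is invisible to it, exactly as in the Pyro-backed setup.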

Practice

In the local Gensim folder, I prepared four scripts to put the procedure described above into practice:

  • run_nameserver.sh
  • create_worker.sh
  • run_dispatcher.sh
  • clean_up.sh

As neither the name server nor the dispatcher requires a lot of resources, we can easily run them on our local machine (however, remember the data flow is data server → dispatcher → worker, so keep it tight). For running the workers, we have three options:

  1. Log into all desired machines manually and run the worker script
  2. Use a cluster ssh call to do this job
  3. Submit the worker script as jobs to the grid engine

The advantage of 2) is that it is somewhat easier than using the grid; however, make sure you're not clogging someone else's workstation - the grid engine takes care of distributing the workload evenly.

CSSH

Method 2) works as follows:

CSSH needs a lot of screen real estate...

$ # Run nameserver
$ cd /net/data/CL/projects/wordspace/gensim_local
$ ./run_nameserver.sh
$ #  You may want to open separate terminals / tabs / screens for name server, dispatcher and cluster ssh.
$ cssh dolly01 dolly02 dolly03 dolly04

This will open a window from which you can send the same commands to all four machines (dolly01 - dolly04 in our case). A list of all nodes on the grid can be found on Ganglia. Log in and perform the following:

$ cd /net/data/CL/projects/wordspace/gensim_local
$ ./create_worker.sh

This will create between one and eight workers, depending on the number of CPUs in the machine. Back on your local box, type

$ ./run_dispatcher.sh

Done, all set up. Now, start your favourite interactive Python shell and work with Gensim as introduced above. But this time, run the LSI in distributed mode:

lsi = models.LsiModel(corpus, numTopics=200, id2word=corpus.id2word, distributed=True, chunks=2000)

That's all there is to it! Play around with the chunk size to maximize speed. After you're done, make sure to clean up after yourself by killing all slaves, the dispatcher and the name server. This brutal processocide can be performed efficiently and hygienically by running

$ ./clean_up.sh

on your local box and all boxes containing workers (do so with your cssh terminal).