Gensim Vector Space Modelling

Introduction

Gensim1) is a Python module for vector space modeling, and it is extremely neat. It implements an incremental stochastic singular value decomposition algorithm that can be computed on distributed computer networks2).

Usage

Within the IKW network, I set up a local environment with which you can experiment.

Simple use case
from gensim import corpora, models, similarities
 
# load corpus from List-Of-Words file (each line is a document, first line states number of documents)
# Doesn't have to be bzipped, but it CAN :-)
corpus = corpora.LowCorpus('/net/data/CL/projects/wordspace/software_tests/corpora/editedCorpusLOW.bz2')
 
# The corpus is now represented as a sparse bag-of-words matrix. Loading it also created a dictionary that maps
# IDs of words to words, which can be used like this:
# >>> corpus.id2word[2923]
 
# Transform this corpus using Latent Semantic Indexing - we want 200 topics.
lsi = models.LsiModel(corpus, numTopics=200, id2word=corpus.id2word)
 
# Done. We could now inspect individual topics, e.g.
 
lsi.printTopic(10, topN=5)
 
# would yield
# '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
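
To make the List-of-Words layout and the sparse bag-of-words representation from the comments above more concrete, here is a minimal pure-Python sketch. It does not use Gensim and is not how Gensim implements its corpus reader; `parse_low` is a hypothetical helper, the sample documents are made up, and the `id2word` mapping it builds only imitates the role of `corpus.id2word`.

```python
from collections import Counter


def parse_low(lines):
    """Parse List-of-Words text into sparse bag-of-words vectors.

    Hypothetical helper for illustration only. The first line holds the
    number of documents; every following line is one whitespace-separated
    document.
    """
    it = iter(lines)
    n_docs = int(next(it).strip())
    word2id = {}
    corpus = []
    for line in it:
        tokens = line.split()
        # Assign integer IDs in order of first appearance.
        for token in tokens:
            if token not in word2id:
                word2id[token] = len(word2id)
        counts = Counter(tokens)
        # One sparse vector per document: sorted (word_id, count) pairs.
        corpus.append(sorted((word2id[w], c) for w, c in counts.items()))
    if len(corpus) != n_docs:
        raise ValueError("header says %d documents, found %d" % (n_docs, len(corpus)))
    id2word = {i: w for w, i in word2id.items()}
    return corpus, id2word


# Made-up two-document sample in List-of-Words layout.
sample = ["2", "category functor category", "algebra operator functor"]
bow, id2word = parse_low(sample)
print(bow[0])      # [(0, 2), (1, 1)]
print(id2word[0])  # category
```

Only words that actually occur in a document appear in its vector, which is what keeps the matrix sparse.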

Benchmark

I used the EditedCorpora file to benchmark performance. The file was converted to the List-of-Words format simply by inserting a line containing the number of documents at the top. Latent Semantic Indexing was then performed on the corpus.

Because the algorithm used for singular value decomposition is incremental, the memory load is constant and can be controlled by passing a chunks parameter to the constructor of the LSI model. This parameter controls how many documents are loaded into RAM at once; the default is 20000. Larger chunks speed things up but also require more RAM. In distributed mode, this is the number of documents passed to the workers over the network, so network transmission speed has to be factored in when choosing the chunk size. For the following experiments, a chunk size of 1000 documents was used.
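
The effect of the chunk size on memory use can be sketched without Gensim. The snippet below is an illustration under assumptions, not Gensim's own code: `iter_chunks` is a hypothetical helper and the 2500 dummy documents are invented. It streams documents from an iterable and holds at most one chunk in memory at a time.

```python
from itertools import islice


def iter_chunks(documents, chunk_size):
    """Yield successive lists of at most chunk_size documents.

    Hypothetical helper: only one chunk is materialised at a time, so
    memory use depends on chunk_size, not on the size of the corpus.
    """
    it = iter(documents)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for a streamed corpus: 2500 dummy documents, produced lazily.
stream = ([(i, 1)] for i in range(2500))
sizes = [len(chunk) for chunk in iter_chunks(stream, 1000)]
print(sizes)  # [1000, 1000, 500]
```

With a chunk size of 1000, as in the experiments, each step of an incremental algorithm (or each worker in distributed mode) would receive at most 1000 documents at once.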

Loading the Corpus

Loading the corpus and transforming it into sparse vectors takes almost exactly 23 minutes on Quickie.

Single Mode

Distributed Mode

Please refer to the advanced usage page for details on how to set up Gensim in distributed mode. To test the distributed mode of the algorithm, twelve dual-core boxes (2.54 GHz, 4 GB RAM each) were used as workers, with one worker per core, for a total of 24 workers.