Gensim Vector Space Modelling

Introduction

Gensim1) is a Python module for vector space modeling, and it is extremely neat. It implements an incremental stochastic singular value decomposition algorithm that can be run on a distributed network of computers2).

= Usage =

Within the IKW network, I set up a local environment that you can use to experiment.

= Simple use case =

from gensim import corpora, models, similarities
 
# load corpus from List-Of-Words file (each line is a document, first line states number of documents)
# Doesn't have to be bzipped, but it CAN :-)
corpus = corpora.LowCorpus('/net/data/CL/projects/wordspace/software_tests/corpora/editedCorpusLOW.bz2')
 
# The corpus is now represented as a sparse bag-of-words matrix. Loading it also creates a dictionary
# that maps word IDs to words, which can be used like this:
# >>> corpus.id2word[2923]
 
# Transform this corpus using Latent Semantic Indexing - we want 200 topics.
lsi = models.LsiModel(corpus, numTopics=200, id2word=corpus.id2word)
 
# Done. We can now query individual topics, e.g. the five highest-weighted words of topic 10:
 
lsi.printTopic(10, topN=5)
 
# would yield
# '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
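
To make the "sparse bag of words" representation and the ID-to-word dictionary mentioned in the comments above more concrete, here is a minimal pure-Python sketch. It is not Gensim's implementation; the helper names `build_dictionary` and `to_bow` and the toy documents are made up for illustration only.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign an integer ID to every distinct word, in order of first appearance."""
    word2id = {}
    for doc in documents:
        for word in doc.split():
            word2id.setdefault(word, len(word2id))
    return word2id


def to_bow(doc, word2id):
    """Return one document as a sorted list of (word_id, count) pairs."""
    counts = Counter(word2id[w] for w in doc.split() if w in word2id)
    return sorted(counts.items())


# Toy corpus (hypothetical data, one document per string).
docs = ["category functor category", "algebra operator"]
word2id = build_dictionary(docs)
id2word = {i: w for w, i in word2id.items()}  # plays the role of corpus.id2word
bows = [to_bow(d, word2id) for d in docs]
```

Only words that actually occur in a document are stored, which is why the representation stays small even for a large vocabulary.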
Benchmark

I used the EditedCorpora file to benchmark performance. The file was converted to the List-of-Words format (by simply inserting a line containing the number of documents at the top). Loading the corpus and transforming it into sparse vectors takes approx. 24 minutes on Quickie.
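
The conversion into the List-of-Words format described above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the script that was actually used for the benchmark; the function names `to_low_format` and `write_low_file` are hypothetical, and it assumes the input already has one document per line.

```python
import bz2


def to_low_format(documents):
    """Return List-of-Words text: a count line, then one document per line."""
    # Normalise whitespace and skip empty documents so the count stays accurate.
    docs = [" ".join(doc.split()) for doc in documents if doc.strip()]
    return "\n".join([str(len(docs))] + docs) + "\n"


def write_low_file(documents, path):
    """Write the documents in List-of-Words format, bzip2-compressed if path ends in .bz2."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "wt", encoding="utf-8") as handle:
        handle.write(to_low_format(documents))


# Toy example (hypothetical data).
example = to_low_format(["category functor", "algebra   operator algebra"])
```

The resulting file could then be passed to `corpora.LowCorpus` as in the use case above.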