Gensim1) is a Python module for vector space modeling, and it is extremely neat. It implements an incremental stochastic singular value decomposition algorithm that can be computed on distributed computer networks2). Within the IKW network, I have set up a local environment with which you can experiment.

Simple use case:

from gensim import corpora, models, similarities
 
# load corpus from a List-Of-Words file (each line is a document; the first line states the number of documents)
# Doesn't have to be bzipped, but it CAN :-)
corpus = corpora.LowCorpus('/net/data/CL/projects/wordspace/software_tests/corpora/editedCorpusLOW.bz2')
 
# corpus is now represented as a sparse bag-of-words matrix. Loading also created a dictionary that maps
# word IDs to words, which can be used like this:
# >>> corpus.id2word[2923]
 
# Transform this corpus using Latent Semantic Indexing - we want 200 topics.
lsi = models.LsiModel(corpus, numTopics=200, id2word=corpus.id2word)
 
# Done. We could now query for individual topics, e.g.
 
lsi.printTopic(10, topN=5)
 
# would yield
# '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
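The similarities module imported above can then be used to rank documents by cosine similarity in the reduced LSI space. A minimal sketch, assuming the standard MatrixSimilarity index and simply reusing the first corpus document as the query (the query construction is illustrative only; adapt it to your own data, and note that parameter spellings may differ between Gensim versions):

from gensim import similarities
 
# Build an in-memory cosine-similarity index over the corpus, projected into LSI space
index = similarities.MatrixSimilarity(lsi[corpus])
 
# Use the first document of the corpus as a query (illustrative only)
query_bow = next(iter(corpus))
query_lsi = lsi[query_bow]
 
# Similarities of the query against every document; print the ten best matches
sims = index[query_lsi]
print(sorted(enumerate(sims), key=lambda item: -item[1])[:10])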