Gensim Vector Space Modelling
Introduction
Gensim1) is a Python module for vector space modeling, and it is extremely neat. It implements an incremental stochastic singular value decomposition algorithm that can be computed on distributed computer networks2).
= Usage =
Within the IKW network, I set up a local environment with which you can experiment.
= Simple use case =
from gensim import corpora, models, similarities

# Load the corpus from a List-of-Words file (each line is a document,
# the first line states the number of documents).
# It doesn't have to be bzipped, but it CAN be :-)
corpus = corpora.LowCorpus('/net/data/CL/projects/wordspace/software_tests/corpora/editedCorpusLOW.bz2')

# The corpus is now represented as a sparse bag-of-words matrix.
# Loading it also created a dictionary that maps word IDs to words,
# which can be used like this:
# >>> corpus.id2word[2923]

# Transform this corpus using Latent Semantic Indexing - we want 200 topics.
lsi = models.LsiModel(corpus, numTopics=200, id2word=corpus.id2word)

# Done. We could now query individual topics, e.g.
lsi.printTopic(10, topN=5)
# would yield
# '-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
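To make the sparse bag-of-words representation mentioned in the comments more concrete, here is a minimal pure-Python sketch. It is not Gensim's implementation: the helper names `build_id2word` and `to_bow` and the toy documents are hypothetical, and Gensim may assign word IDs in a different order. It only illustrates the idea that each document becomes a short list of (word ID, count) pairs, with a dictionary mapping IDs back to words.

```python
from collections import Counter


def build_id2word(documents):
    """Assign an integer ID to each distinct word, in order of first appearance.

    Hypothetical helper for illustration; not part of Gensim.
    """
    word2id = {}
    for doc in documents:
        for word in doc:
            if word not in word2id:
                word2id[word] = len(word2id)
    # Invert the mapping so IDs can be translated back to words.
    id2word = {word_id: word for word, word_id in word2id.items()}
    return word2id, id2word


def to_bow(doc, word2id):
    """Return one document as a sorted list of (word_id, count) pairs."""
    counts = Counter(word2id[word] for word in doc if word in word2id)
    return sorted(counts.items())


# Toy documents (made up for this sketch), already split into words.
documents = [
    ["category", "functor", "category"],
    ["algebra", "operator", "functor"],
]

word2id, id2word = build_id2word(documents)
bow_corpus = [to_bow(doc, word2id) for doc in documents]

for bow in bow_corpus:
    # Print each sparse vector with the IDs resolved back to words.
    print([(id2word[word_id], count) for word_id, count in bow])
```

Only words that actually occur in a document are stored, which is what keeps the matrix sparse even when the vocabulary is large.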
Benchmark
I used the EditedCorpora file to benchmark performance. The file was converted into the List-of-Words format (by just inserting a line containing the number of documents at the top). Loading the corpus and transforming it into sparse vectors takes approx. 24 minutes on Quickie.
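The conversion into the List-of-Words format described here can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the function name `to_low_format` and the file names are hypothetical, and it assumes that the input holds one document per line and that blank lines can be skipped.

```python
import os
import tempfile


def to_low_format(src_path, dst_path):
    """Write src_path to dst_path with the document count as the first line.

    Hypothetical helper; assumes one document per line and skips blank lines.
    """
    with open(src_path, encoding="utf-8") as src:
        documents = [line.rstrip("\n") for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        # The List-of-Words header: number of documents that follow.
        dst.write(f"{len(documents)}\n")
        for doc in documents:
            dst.write(doc + "\n")
    return len(documents)


# Small demonstration on a temporary file with two made-up documents.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "plain.txt")
    dst = os.path.join(tmp, "plain.low")
    with open(src, "w", encoding="utf-8") as f:
        f.write("category functor algebra\noperator algebra\n")
    n_docs = to_low_format(src, dst)
    with open(dst, encoding="utf-8") as f:
        low_lines = f.read().splitlines()

print(n_docs, low_lines)
```

This version reads all documents into memory, which is fine for a sketch; for a very large corpus one would rather count the lines in a first pass and stream the documents in a second.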