Differences
This shows you the differences between two versions of the page.
software:gensim [2010/12/03 17:26] maebert |
software:gensim [2010/12/03 17:57] maebert |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ==== Gensim Vector Space Modelling ==== | + | ====== Gensim Vector Space Modelling |
- | === Introduction === | + | ===== Introduction |
- | Gensim(( | + | [[http:// |
- | == Usage == | + | Within the IKW network, I set up a [[#Local GenSim Environment|local environment]] with which you can experiment. |
- | Within the IKW network, I set up an [[LocalGensimEnvironment|local environment]] with which you can experiment. | + | Simple use case: |
- | + | ||
- | == Simple use case == | + | |
<code python> | <code python> | ||
Line 34: | Line 32: | ||
</ | </ | ||
- | === Benchmark === | + | ===== Benchmark |
I used the EditedCorpora File to benchmark performance. The File was edited into the List-of-Words-Format (by just inserting a line containing the number of documents at the top). Afterwards Latent Semantic Indexing was performed on the corpus. As the algorithm used for singular value decomposition is incremental, | I used the EditedCorpora File to benchmark performance. The File was edited into the List-of-Words-Format (by just inserting a line containing the number of documents at the top). Afterwards Latent Semantic Indexing was performed on the corpus. As the algorithm used for singular value decomposition is incremental, | ||
Line 46: | Line 44: | ||
== Distributed Mode == | == Distributed Mode == | ||
- | Please refer to [[AdvancedGensimUsage| advanced usage]] | + | Please refer to the [[#Advanced Usage| advanced usage]] |
+ | |||
+ | |||
+ | ====== Local GenSim Environment ====== | ||
+ | |||
+ | ===== Using Gensim within the IKW ===== | ||
+ | |||
+ | Within the IKW network, there is a local installation of gensim. Gensim depends on Numpy >= 1.4 (the current version on IKW machines is 1.3) and Pyro >= 4.1. The installation resides in | ||
+ | |||
+ | / | ||
+ | |||
+ | To use it, change into this directory and run | ||
+ | |||
+ | <code Bash> | ||
+ | source bin/ | ||
+ | python | ||
+ | </ | ||
+ | |||
+ | This loads the libraries installed locally in this directory and starts the python interpreter. To stop using the virtual environment, | ||
+ | |||
+ | deactivate | ||
+ | |||
+ | ===== Installing Gensim ===== | ||
+ | |||
+ | In theory, you can copy the entire gensim_local folder to your machine and run it from there, but I can't guarantee that this works in practice. It is easy, however, to create such a local environment yourself: | ||
+ | |||
+ | <code Bash> | ||
+ | # Skip this step if you already have virtualenv | ||
+ | easy_install virtualenv | ||
+ | # If you don't have sudo rights, install virtualenv locally: | ||
+ | # mkdir ~/opt | ||
+ | # easy_install --install-dir ~/opt virtualenv | ||
+ | |||
+ | # Create a clean virtual environment | ||
+ | virtualenv --no-site-packages myVirtualEnv | ||
+ | |||
+ | # And activate it | ||
+ | cd myVirtualEnv | ||
+ | source bin/ | ||
+ | |||
+ | # Now, using pip, install the other stuff | ||
+ | bin/pip install numpy | ||
+ | bin/pip install gensim[distributed] | ||
+ | bin/pip install Pyro4 | ||
+ | </ | ||
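Once the environment is active, the version requirements mentioned earlier (Numpy >= 1.4, Pyro >= 4.1) can be checked from Python. The following is only a minimal sketch: the helper names `version_tuple` and `meets_minimum` are made up for this illustration and are not part of gensim, Numpy or Pyro, and only the leading numeric part of each version component is compared.

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component without a leading number
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric components only)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "1.4" and "1.4.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

Inside the environment, `meets_minimum(numpy.__version__, "1.4")` should then be true, whereas a Numpy 1.3 installation would give false.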
+ | |||
+ | |||
+ | ====== Advanced Usage ====== | ||
+ | |||
+ | ===== Distributed Mode ===== | ||
+ | |||
+ | ==== Theory ==== | ||
+ | |||
+ | The distributed mode works using the [[http:// | ||
+ | |||
+ | - Tell Gensim which computers to use. We'll do this by running a little script on each computer on the network that we want to work for us. This script creates workers that can then be put to work. I wrapped the script in another script that automatically creates enough workers to match the number of CPUs in the box you're using; each worker occupies one CPU. | ||
+ | - Workers easily get lost in the network, so we need a name server that keeps track of all the workers we have. Of course, there' | ||
+ | - Furthermore, | ||
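The "one worker per CPU" idea from the first step can be sketched roughly as follows. This is not the wrapper script mentioned above, only an illustration. It assumes that a worker is started with `python -m gensim.models.lsi_worker`, as the gensim documentation on distributed LSI describes; the function names `worker_commands` and `launch_workers` are hypothetical.

```python
import multiprocessing
import subprocess
import sys

# Assumption: the module that runs a single gensim LSI worker process.
WORKER_MODULE = "gensim.models.lsi_worker"


def worker_commands(n_workers=None, module=WORKER_MODULE):
    """Build one command line per worker, defaulting to one worker per CPU."""
    if n_workers is None:
        n_workers = multiprocessing.cpu_count()
    return [[sys.executable, "-m", module] for _ in range(n_workers)]


def launch_workers(commands):
    """Start every worker as a background process and return the handles."""
    return [subprocess.Popen(command) for command in commands]
```

Calling `launch_workers(worker_commands())` on a machine would start the workers there. Nothing is started on import, so the command list can be inspected first.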
+ | |||
+ | ==== Practice ==== | ||