software:gensim [2010/12/03 18:21] maebert

====== Gensim Vector Space Modelling ======

===== Introduction =====

[[http://|Gensim]] is a Python framework for Vector Space Modelling.

Within the IKW network, I set up a [[#Local GenSim Environment|local environment]] with which you can experiment.

Simple use case:
<code python>
# Minimal sketch of gensim usage; the documents below are made up
# for illustration.
from gensim import corpora, models

documents = [["human", "computer", "interaction"],
             ["survey", "of", "user", "computer", "systems"]]

# Map words to integer ids and build a sparse bag-of-words corpus.
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Fit a small LSI model on the corpus.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
print(lsi[corpus[0]])
</code>
===== Benchmark =====

I used the EditedCorpora file to benchmark performance. The file was converted into the List-of-Words format (by simply inserting a line containing the number of documents at the top).
== Loading the Corpus ==

Loading the corpus and transforming it into sparse vectors takes almost exactly 23 minutes on Quickie.
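The format change described above is simple enough to sketch in plain Python; the filename and word lists below are made up for illustration (if memory serves, gensim itself ships a reader for this layout, corpora.LowCorpus):

<code python>
# Sketch of the List-of-Words format: the first line holds the document
# count, every following line is one document as space-separated words.
# (Filename and contents are made up for illustration.)
import os
import tempfile

docs = [["human", "computer", "interaction"],
        ["graph", "minors", "survey"]]

path = os.path.join(tempfile.mkdtemp(), "corpus.low")
with open(path, "w") as f:
    f.write("%d\n" % len(docs))           # header: number of documents
    for doc in docs:
        f.write(" ".join(doc) + "\n")     # one document per line

# Reading it back: the header tells us how many documents to expect.
with open(path) as f:
    n_docs = int(f.readline())
    parsed = [f.readline().split() for _ in range(n_docs)]

print(n_docs)     # 2
print(parsed[0])  # ['human', 'computer', 'interaction']
</code>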
== Single Mode ==

== Distributed Mode ==

Please refer to the [[#Advanced Usage|advanced usage]] section for details on how to set up Gensim in distributed mode. For testing the distributed mode of the algorithm, twelve 2.54 GHz dual-core boxes with 4 GB RAM each were used as workers, with one worker per core, totaling 24 workers.

====== Local GenSim Environment ======

===== Using Gensim within the IKW =====

Within the IKW network, there is a local installation of gensim (which in turn depends on Numpy >= 1.4 (the current version on IKW machines is 1.3) and Pyro >= 4.1), residing in

  /

To use it, change into this directory and run

<code Bash>
source bin/activate
python
</code>

This loads the libraries installed locally in this directory and starts the python interpreter. To stop using the virtual environment, run

  deactivate

===== Installing GenSim =====

In theory, you can copy the entire gensim_local folder to your machine and run it from there; however, I won't guarantee that this works in practice. In any case, it is easy to create such a local environment yourself:

<code Bash>
# Skip this step if you already have virtualenv
easy_install virtualenv
# If you don't have sudo rights, install virtualenv locally:
# mkdir ~/opt
# easy_install --install-dir ~/opt virtualenv

# Create a clean virtual environment
virtualenv --no-site-packages myVirtualEnv

# And activate it
cd myVirtualEnv
source bin/activate

# Now, using pip, install the other stuff
bin/pip install numpy
bin/pip install gensim[distributed]
bin/pip install Pyro4
</code>

====== Advanced Usage ======

===== Distributed Mode =====

==== Theory ====

The distributed mode works using the [[http://|Pyro]] library. In a nutshell, we need three things:

  - Tell Gensim which computers to use. We'll do this by running a little script on each computer on the network that we want to work for us. This script will create workers which can be enslaved. I wrapped the script into another script which automatically creates enough workers to match the number of CPUs in the box you're using; each worker will hog one CPU.
  - Workers will easily get lost in the network, hence we'll need a name server that keeps track of all the workers we have. Of course, there's a script for that as well.
  - Furthermore, we need a dispatcher that hands jobs out to the workers and collects the results.
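The division of labour described above can be illustrated with a toy name server and dispatcher in plain Python; this is only a conceptual sketch with made-up names, not Pyro's or gensim's actual API:

<code python>
# Toy illustration of the three roles above (made-up names, not Pyro's API):
# a name server that tracks workers, and a dispatcher that hands out jobs.

name_server = {}  # worker name -> worker object

def register(name, worker):
    """Workers announce themselves so the dispatcher can find them."""
    name_server[name] = worker

def dispatch(jobs):
    """Hand jobs out to registered workers round-robin, collect results."""
    workers = list(name_server.values())
    results = []
    for i, job in enumerate(jobs):
        worker = workers[i % len(workers)]
        results.append(worker(job))
    return results

# Two "workers" that just square their input.
register("worker-1", lambda x: x * x)
register("worker-2", lambda x: x * x)

print(dispatch([1, 2, 3, 4]))  # [1, 4, 9, 16]
</code>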
==== Practice ====

In the local Gensim folder, I prepared four scripts to put the procedure described above into practice:

  * run_nameserver.sh
  * create_worker.sh
  * run_dispatcher.sh
  * clean_up.sh

As neither the name server nor the dispatcher will require a lot of resources, we can easily run them on our local machine (however, remember the data flow is data server -> dispatcher -> worker, so keep it tight). For running the workers, we have three options:

  - Log into all desired machines manually and run the worker script
  - Use a [[http://|cluster ssh]] session to run the script on several machines at once
  - Use the [[https://|grid engine]] to start the workers as jobs

The advantage of 2) is that it is somewhat easier than using the grid; however, make sure you're not clogging someone else's workstation. The grid engine, by contrast, takes care of distributing the work load evenly.

=== CSSH ===

<code Bash>
$ # Run nameserver
$ cd /
$ ./run_nameserver.sh
$ # You may want to open separate terminals / tabs / screens for name server, dispatcher and cluster ssh.
$ cssh dolly01 dolly02 dolly03 dolly04
</code>

This will open a window with which you can send the same commands to all four machines (dolly01 - dolly04 in our case). A list of all nodes on the grid can be found on [[https://|this page]]. In the cssh window, run

<code Bash>
$ cd /
$ ./create_worker.sh
</code>

This will create between one and eight workers, depending on the number of CPUs in each machine. Back on your local box, type
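How create_worker.sh decides on the worker count isn't shown on this page; one plausible sketch, counting CPU cores and capping at eight (the script body is my assumption, not the actual script):

<code Bash>
# Hypothetical sketch of how such a wrapper might look; this is NOT the
# actual contents of create_worker.sh.
ncpus=$(grep -c ^processor /proc/cpuinfo)
nworkers=$(( ncpus > 8 ? 8 : ncpus ))
for i in $(seq 1 "$nworkers"); do
    # the real script would start a gensim worker here; we only print it
    echo "would start worker $i of $nworkers"
done
</code>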
<code Bash>
$ ./run_dispatcher.sh
</code>

Done, all set up. Now start your favourite interactive Python shell and work with Gensim as introduced [[#Introduction|above]], passing distributed=True:

<code Python>
lsi = models.LsiModel(corpus, distributed=True)
</code>

That's all there is to it! Play around with the chunk size to maximize the speed. After you're done, make sure to clean up behind you by killing all slaves, the dispatcher and the name server. This brutal processocide can be efficiently and hygienically performed by running

<code Bash>
$ ./clean_up.sh
</code>
on your local box and all boxes containing workers (do so with your cssh terminal).
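
The contents of clean_up.sh aren't reproduced here; a hypothetical sketch of the idea (the process names are my assumption, not taken from the actual script):

<code Bash>
# Hypothetical sketch (not the actual clean_up.sh): cleaning up boils down
# to killing worker, dispatcher and name server processes on each box.
for pattern in lsi_worker lsi_dispatcher Pyro4.naming; do
    echo "pkill -f $pattern"    # printed rather than executed in this sketch
done
</code>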