Differences

This shows you the differences between two versions of the page.

software:gensim [2010/12/03 17:57]
maebert
software:gensim [2010/12/06 14:46]
maebert [Practise]
Line 2: Line 2:
  
 ===== Introduction ===== ===== Introduction =====
- 
  
 [[http://nlp.fi.muni.cz/projekty/gensim/index.html|Gensim]] is a Python module for vector space modeling, and it is extremely neat. It implements an incremental stochastic singular value decomposition algorithm that may be computed on distributed computer networks((Halko, N. and Martinsson, P.G. and Tropp, J.A., 2009: [[http://arxiv.org/pdf/0909.4061|Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions]])).  [[http://nlp.fi.muni.cz/projekty/gensim/index.html|Gensim]] is a Python module for vector space modeling, and it is extremely neat. It implements an incremental stochastic singular value decomposition algorithm that may be computed on distributed computer networks((Halko, N. and Martinsson, P.G. and Tropp, J.A., 2009: [[http://arxiv.org/pdf/0909.4061|Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions]])). 
Line 99: Line 98:
 The distributed mode works using the [[http://www.xs4all.nl/~irmen/pyro4/|Pyro4]] Library. The idea behind this library is that you can instantiate python objects remotely and then forget that you have instantiated them remotely and just work with them like normal objects. However, these objects will eat their own local resources. Plus, you don't have to rewrite a single line of code if you want to use them locally. Most of Pyro's functionality is neatly wrapped by Gensim and works off the shelf. Of course, there are a few things you'll have to take care of: The distributed mode works using the [[http://www.xs4all.nl/~irmen/pyro4/|Pyro4]] Library. The idea behind this library is that you can instantiate python objects remotely and then forget that you have instantiated them remotely and just work with them like normal objects. However, these objects will eat their own local resources. Plus, you don't have to rewrite a single line of code if you want to use them locally. Most of Pyro's functionality is neatly wrapped by Gensim and works off the shelf. Of course, there are a few things you'll have to take care of:
  
-- Tell Gensim which computers to use. We'll do this by running a little script on each computer on the network that we want to work for us. This script will create workers which can be enslaved. I wrapped the script into another script which automatically creates enough workers to match the number of CPUs in the box you're using; each worker will hog one CPU. +  - Tell Gensim which computers to use. We'll do this by running a little script on each computer on the network that we want to work for us. This script will create workers which can be enslaved. I wrapped the script into another script which automatically creates enough workers to match the number of CPUs in the box you're using; each worker will hog one CPU. 
-- Workers will easily get lost in the network, hence we'll need a name server that keeps track of all the workers we have. Of course, there's a script for that, too. Behind the scenes, workers will communicate over TCP/IP with the nameserver an +  - Workers will easily get lost in the network, hence we'll need a name server that keeps track of all the workers we have. Of course, there's a script for that, too. Behind the scenes, workers will communicate over TCP/IP with the nameserver an 
-- Furthermore, our enslaved workers may be good at maths, but behave like little children when it comes to sharing the data between them. Therefore, we will need a dispatcher that distributes the data evenly among all workers and handles the feedback from them. So, our algorithm as such will give a task to the dispatcher, which breaks it into chunks and hands them to the workers (using the name server to locate them); they report back, and we get our results.+  - Furthermore, our enslaved workers may be good at maths, but behave like little children when it comes to sharing the data between them. Therefore, we will need a dispatcher that distributes the data evenly among all workers and handles the feedback from them. So, our algorithm as such will give a task to the dispatcher, which breaks it into chunks and hands them to the workers (using the name server to locate them); they report back, and we get our results.
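
The division of labour described in the list above can be sketched in a few lines of plain Python. This is only a toy model, not Gensim's or Pyro's actual code: ''ToyWorker'', ''split_into_chunks'' and ''dispatch'' are made-up names, the workers live in the same process instead of on remote machines, and the round-robin hand-out is an assumption chosen for simplicity.

```python
from itertools import islice


def split_into_chunks(documents, chunk_size):
    """Yield successive lists of at most chunk_size documents."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


class ToyWorker:
    """Hypothetical stand-in for a remote worker; it only counts tokens."""

    def __init__(self, name):
        self.name = name
        self.tokens_seen = 0

    def process(self, chunk):
        self.tokens_seen += sum(len(doc) for doc in chunk)


def dispatch(documents, workers, chunk_size):
    """Hand the chunks to the workers in round-robin order; return the job count."""
    jobs = 0
    for chunk in split_into_chunks(documents, chunk_size):
        workers[jobs % len(workers)].process(chunk)
        jobs += 1
    return jobs


corpus = [["human", "computer"], ["graph", "trees", "minors"],
          ["survey"], ["user", "time"], ["system"]]
workers = [ToyWorker("worker-1"), ToyWorker("worker-2")]
n_jobs = dispatch(corpus, workers, chunk_size=2)
```

In the real setup the name server takes the place of the ''workers'' list: the dispatcher asks it where the workers are instead of holding them in a local variable.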
  
 ==== Practise ==== ==== Practise ====
  
 +In the local Gensim folder, I prepared four scripts to put the procedure described above into practice:
 +
 +  * run_nameserver.sh
 +  * create_worker.sh
 +  * run_dispatcher.sh
 +  * clean_up.sh
 +
 +As neither the name server nor the dispatcher requires a lot of resources, we can easily run them on our local machine (however, remember that the data flows data server -> dispatcher -> worker, so keep these machines close together on the network). For running the workers, we have three options:
 +
 +  - Log into all desired machines manually and run the worker script
 +  - Use a [[http://cssh.sourceforge.net/docs/cssh_man.html|cluster ssh call]] to do this job
 +  - Use the [[https://doc.ikw.uni-osnabrueck.de/content/using-ikw-grid|IKW Grid Engine]]
 +
 + The advantage of option 2 is that it is somewhat easier than using the grid. However, make sure you're not clogging someone else's workstation; the grid engine, by contrast, takes care of distributing the workload evenly. 
 +
 +=== CSSH ===
 +
 +Method 2) would be as follows:
 +
 +{{ :software:12_cssh_jobs.png?500|CSSH needs a lot of screen real estate...}}
 +
 +<code Bash>
 +$ # Run nameserver
 +$ cd /net/data/CL/projects/wordspace/gensim_local
 +$ ./run_nameserver.sh
 +$ #  You may want to open separate terminals / tabs / screens for name server, dispatcher and cluster ssh.
 +$ cssh dolly01 dolly02 dolly03 dolly04
 +</code>
 +
 +This will open a window with which you can send the same commands to all four machines (dolly01 - 04 in our case). A list of all nodes on the grid can be found on [[https://ganglia.ikw.uni-osnabrueck.de/|Ganglia]]. Log in, and perform the following:
 +
 +<code Bash>
 +$ cd /net/data/CL/projects/wordspace/gensim_local
 +$ ./create_worker.sh
 +</code>
 +
 +This will create between one and eight workers, depending on the number of CPUs in each machine. Back on your local box, type
 +
 +<code Bash>
 +$ ./run_dispatcher.sh
 +</code>
 +
 +Done, all set up. Now, start your favourite interactive Python shell and work with Gensim as introduced [[#Introduction|above]]. But this time, run the LSI in distributed mode:
 +
 +<code Python>
 +from gensim import models
 +lsi = models.LsiModel(corpus, numTopics=200, id2word=corpus.id2word, distributed=True, chunks=2000)
 +</code>
 +
 +That's all there is to it! Play around with the chunk size to maximize the speed. After you're done, make sure to clean up after yourself by killing all workers, the dispatcher and the name server. This brutal processocide can be efficiently and hygienically performed by running 
 +
 +<code Bash>
 +$ ./clean_up.sh
 +</code>
 +
 +on your local box and all boxes containing workers (do so with your cssh terminal). 
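
Coming back to the chunk size mentioned above: one way to "play around" with it systematically is to time the same model build for a handful of sizes and pick the fastest. The helper below is a rough sketch under assumptions: ''time_chunk_sizes'', ''fastest'' and ''fake_build'' are hypothetical names, and ''fake_build'' merely sleeps so that the example runs without a cluster. With the workers up, you would pass a function that makes the distributed ''LsiModel'' call with the given chunk size instead.

```python
import time


def time_chunk_sizes(build_model, chunk_sizes):
    """Return {chunk_size: seconds} for one call of build_model per size."""
    timings = {}
    for size in chunk_sizes:
        start = time.perf_counter()
        build_model(size)
        timings[size] = time.perf_counter() - start
    return timings


def fastest(timings):
    """Return the chunk size with the smallest measured time."""
    return min(timings, key=timings.get)


def fake_build(chunk_size):
    """Hypothetical placeholder for the real training call; it only sleeps."""
    time.sleep(0.001)


timings = time_chunk_sizes(fake_build, [500, 1000, 2000])
best = fastest(timings)
```

A single run per size is noisy, especially on shared machines, so treat the result as a hint and repeat the measurement before settling on a value.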
 +
 +=== Use Screens! ===
 +
 +A word of warning: using this method it's very easy to, well, lose your processes in the endless depths of the network. It is therefore advisable to open a [[http://news.softpedia.com/news/GNU-Screen-Tutorial-44274.shtml|screen]] on your host machine (e.g. quickie) before you start working. Simple tutorial:
 +
 +  screen
 +  
 +This opens a screen. You may now create new virtual terminals with <CTRL + a><c> and switch between them with <CTRL + a><[0-9]>. To leave your screen _without terminating it_, type <CTRL + a><d>. You are back on your terminal now. You may log out from this machine, go home, log in again. The screen will still be there, waiting for you. List available screens with
 +
 +  $ screen -ls
 +  There is a screen on:
 +      15500.ttys001.beta (Detached)
 +  1 Socket in /var/folders/zz/zzzivhrRnAmviuee++0-sU++US6/-Tmp-/.screen.
 +
 +Get back to your screen with
 +
 +  screen -r 15500
 +
 +So, if your connection is lost, if you want to continue working from somewhere else, or if somebody shuts down your computer while you are on a coffee break, the connections to your enslaved army of workers will still be there.