DSM Software and Data Sets
Off-the-shelf software packages for DSM
Python
- Gensim – high-performance topic modelling
- Vecto – a new framework for count & predict models
- DISSECT – easy-to-use package developed by the COMPOSES project
R
- wordspace – user-friendly DSM exploration
Java
- Semantic Vectors – scalable implementation based on random indexing (review)
- JoBimText – with support for distributed processing
C/C++
- Infomap NLP – classical LSA-style DSM (review)
- FastText – state-of-the-art neural word embeddings
Other
- SenseClusters – distributional clustering in Perl
- Text to Matrix Generator (TMG) – text mining with NMF in MATLAB
If you know of other useful off-the-shelf packages missing from this list, please drop me a line.
Precompiled DSMs
Evaluation tasks
Useful corpora
- The Westbury Lab at the University of Alberta provides a preprocessed (cleaned) Wikipedia corpus from an April 2010 dump. The WaCky initiative offers WaCkypedia, a dependency-parsed Wikipedia corpus from a 2009 dump. Both corpora cover only the English Wikipedia.