Courses and Tutorials on DSM
Software for the course
Practical examples and exercises for these courses and tutorials are based on the user-friendly software package wordspace for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:
- Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
  - sparsesvd
  - iotools
  - tm (optional)
  - quanteda (optional)
  - Rcpp (needed on Linux only)
- Install the wordspace package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here:
  - download a suitable version of the package for your platform
  - in the RStudio installer, select “Install from: Package Archive File”
- During the course, you will be asked to install a further package with additional evaluation tasks (wordspaceEval) from a password-protected Web page:
  - download a suitable version and select “Install from: Package Archive File” in RStudio
- Download the sample data files listed below
- Download one or more of the pre-compiled DSMs listed below
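If you prefer working from the R console rather than the RStudio installer, the same setup can be scripted with `install.packages()`. This is only a sketch of the steps above; the file path in the last line is a placeholder for whatever archive you are given:

```r
# Install the required CRAN packages from the R console
# (equivalent to using the RStudio / R GUI package installer).
install.packages(c("sparsesvd", "iotools"))
install.packages(c("tm", "quanteda"))   # optional
install.packages("Rcpp")                # needed on Linux only

# The wordspace package itself, from CRAN:
install.packages("wordspace")

# If you were asked to use a newer package archive instead, install the
# downloaded file directly (placeholder path shown here):
# install.packages("/path/to/wordspace_X.Y.tar.gz", repos = NULL)
```

Setting `repos = NULL` tells `install.packages()` to treat its first argument as a local archive file rather than a CRAN package name, which is the console equivalent of “Install from: Package Archive File”.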
Example data sets
- verb_dep.txt.gz (21.6 MB)
- adj_noun_tokens.txt.gz (8.3 MB)
- delta_de_termdoc.txt.gz (18.4 MB)
- potter_l2r2.txt.gz (51.3 MB)
- potter_lemmas.txt.gz (1.1 MB)
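These files are gzip-compressed plain-text co-occurrence data, which base R can read directly. The toy file below is an assumption for illustration, with tab-separated target / feature / count columns; the actual column layout varies between the course files:

```r
# Sketch: write and re-read a toy gzip-compressed co-occurrence file.
# Column layout (target / feature / count) is assumed for illustration.
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, "w")
writeLines(c("walk\tdog\t12", "walk\tstreet\t7", "read\tbook\t25"), con)
close(con)

# read.delim() decompresses .gz files transparently
triples <- read.delim(tmp, header = FALSE,
                      col.names = c("target", "feature", "count"))
nrow(triples)        # 3
sum(triples$count)   # 44
```

For actual course work, the wordspace package provides functions for loading such triple files directly into DSM objects, so base-R parsing as above is only needed for a quick look at the raw data.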
Pre-compiled DSMs
Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file, which can be loaded into R with the command load("model.rda") and creates an object with the same name (model).
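This load() behaviour can be demonstrated with base R alone; here a toy matrix stands in for a real pre-compiled model:

```r
# Demonstration of load() semantics: save() stores an object under its
# name, and load() later restores it under that same name, returning
# the restored names as a character vector.
model <- matrix(1:4, nrow = 2)            # stand-in for a DSM object
path  <- file.path(tempdir(), "model.rda")
save(model, file = path)

rm(model)                                 # simulate a fresh R session
restored <- load(path)                    # restores the object
print(restored)                           # "model"
exists("model")                           # TRUE again
```

Note that load() always recreates the object under the name it was saved with, regardless of the .rda file name, so renaming a downloaded file does not rename the model object inside it.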
DSMs based on the English Wikipedia
These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
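The Caron power-scaling step can be illustrated in a few lines of base R: after an SVD $M = U \Sigma V^T$, the reduced target vectors are $U \Sigma^P$, so $P = 0$ keeps only the orthonormal left singular vectors and all latent dimensions receive equal weight. The matrix below is a toy stand-in; the real models start from 50,000+ dimensional sparse matrices:

```r
# Caron power scaling on a toy co-occurrence matrix: the reduced
# vectors are U %*% diag(sigma^P). With P = 0 the singular values
# drop out, so every latent dimension is weighted equally.
set.seed(42)
M   <- matrix(rpois(40, lambda = 3), nrow = 8)  # toy 8 x 5 matrix
dec <- svd(M)

caron <- function(P) dec$u %*% diag(dec$d ^ P)
V1 <- caron(1)   # standard SVD projection U * Sigma
V0 <- caron(0)   # P = 0: equalized latent dimensions (V0 equals U)

# With P = 0 each latent dimension carries unit "energy", because the
# columns of U are orthonormal:
colSums(V0^2)    # all numerically equal to 1
```

With $P = 1$, by contrast, the column norms are the singular values themselves, so the first latent dimensions dominate; $P = 0$ removes that imbalance, which is what the description above means by equalization.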
- dependency-filtered: WP500_DepFilter_Lemma.rda (31.1 MB) – 500 latent SVD dimensions: WP500_DepFilter_Lemma_svd500.rda (179.3 MB)
- dependency-structured: WP500_DepStruct_Lemma.rda (31.6 MB) – 500 latent SVD dimensions: WP500_DepStruct_Lemma_svd500.rda (180.3 MB)
- L2/R2 surface span: WP500_Win2_Lemma.rda (51.8 MB) – 500 latent SVD dimensions: WP500_Win2_Lemma_svd500.rda (177.1 MB)
- L5/R5 surface span: WP500_Win5_Lemma.rda (103.9 MB) – 500 latent SVD dimensions: WP500_Win5_Lemma_svd500.rda (179.9 MB)
- L30/R30 surface span: WP500_Win30_Lemma.rda (311.4 MB) – 500 latent SVD dimensions: WP500_Win30_Lemma_svd500.rda (182.8 MB)
- term-document model: WP500_TermDoc_Lemma.rda (105.1 MB) – 500 latent SVD dimensions: WP500_TermDoc_Lemma_svd500.rda (162.5 MB)
- type contexts (L1+R1): WP500_Ctype_L1R1_Lemma.rda (55.8 MB) – 500 latent SVD dimensions: WP500_Ctype_L1R1_Lemma_svd500.rda (157.0 MB)
- type contexts (L2+R2): WP500_Ctype_L2R2_Lemma.rda (33.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2_Lemma_svd500.rda (64.3 MB)
- type contexts (L2+R2 POS tags): WP500_Ctype_L2R2pos_Lemma.rda (56.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2pos_Lemma_svd500.rda (175.3 MB)
- word forms L2/R2: WP500_Win2_Word.rda (63.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_svd500.rda (185.5 MB)
- word forms L2/R2 with non-lemmatized features: WP500_Win2_Word_WF.rda (68.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_WF_svd500.rda (185.9 MB)
Neural word embeddings
Some publicly available pre-trained neural embeddings, converted into .rda
format for use with the wordspace
package.
- word2vec: GoogleNews300_wf200k.rda (129.2 MiB)
Web interfaces
- Web interface for several pre-trained Infomap models (CIMeC, U Trento)