====== Courses and Tutorials on DSM ======
===== Software for the course =====
Practical examples and exercises for these courses and tutorials are based on the user-friendly **wordspace** package for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:
  - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
    * ''sparsesvd''
    * ''wordspace''
    * optional: ''tm'', ''quanteda'', ''Rtsne'', ''shiny''
    * ''Rcpp'' (needed on Linux only)
  - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page:
    * download a suitable version and select "Install from: Package Archive File" in RStudio
  - Download the sample data files listed below.
  - Download one or more of the pre-compiled DSMs listed below.
  - Install the ''wordspace'' package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here:
    * download a suitable version of the package for your platform
    * in the RStudio installer, select "Install from: Package Archive File"
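If you prefer to work from the R console rather than the RStudio installer, the CRAN packages can be installed directly; a minimal sketch (package names taken from the list above):

```r
# Required packages from CRAN
install.packages(c("sparsesvd", "wordspace"))

# Optional packages used in some examples and exercises
install.packages(c("tm", "quanteda", "Rtsne", "shiny"))
```

A locally downloaded package archive (e.g. for ''wordspaceEval'') can likewise be installed from the console with ''install.packages("path/to/archive", repos = NULL)''.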
===== Example data sets =====
  * ''verb_dep.txt.gz'' (21.6 MB)
  * ''adj_noun_tokens.txt.gz'' (8.3 MB)
  * ''delta_de_termdoc.txt.gz'' (18.4 MB)
  * ''potter_l2r2.txt.gz'' (51.3 MB)
  * ''potter_lemmas.txt.gz'' (1.1 MB)

===== Pre-compiled DSMs =====

Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file, which can be loaded into R with the command ''load("model.rda")'' and creates an object with the same name (''model'').

==== DSMs based on the English Wikipedia ====

These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron //P// = 0 (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.

  * dependency-filtered: ''WP500_DepFilter_Lemma.rda'' (31.1 MB) – 500 latent SVD dimensions: ''WP500_DepFilter_Lemma_svd500.rda'' (179.3 MB)
  * dependency-structured: ''WP500_DepStruct_Lemma.rda'' (31.6 MB) – 500 latent SVD dimensions: ''WP500_DepStruct_Lemma_svd500.rda'' (180.3 MB)
  * L2/R2 surface span: ''WP500_Win2_Lemma.rda'' (51.8 MB) – 500 latent SVD dimensions: ''WP500_Win2_Lemma_svd500.rda'' (177.1 MB)
  * L5/R5 surface span: ''WP500_Win5_Lemma.rda'' (103.9 MB) – 500 latent SVD dimensions: ''WP500_Win5_Lemma_svd500.rda'' (179.9 MB)
  * L30/R30 surface span: ''WP500_Win30_Lemma.rda'' (311.4 MB) – 500 latent SVD dimensions: ''WP500_Win30_Lemma_svd500.rda'' (182.8 MB)
  * term-document model: ''WP500_TermDoc_Lemma.rda'' (105.1 MB) – 500 latent SVD dimensions: ''WP500_TermDoc_Lemma_svd500.rda'' (162.5 MB)
  * type contexts (L1+R1): ''WP500_Ctype_L1R1_Lemma.rda'' (55.8 MB) – 500 latent SVD dimensions: ''WP500_Ctype_L1R1_Lemma_svd500.rda'' (157.0 MB)
  * type contexts (L2+R2): ''WP500_Ctype_L2R2_Lemma.rda'' (33.1 MB) – 500 latent SVD dimensions: ''WP500_Ctype_L2R2_Lemma_svd500.rda'' (64.3 MB)
  * type contexts (L2+R2 POS tags): ''WP500_Ctype_L2R2pos_Lemma.rda'' (56.1 MB) – 500 latent SVD dimensions: ''WP500_Ctype_L2R2pos_Lemma_svd500.rda'' (175.3 MB)
  * word forms L2/R2: ''WP500_Win2_Word.rda'' (63.9 MB) – 500 latent SVD dimensions: ''WP500_Win2_Word_svd500.rda'' (185.5 MB)
  * word forms L2/R2 with non-lemmatized features: ''WP500_Win2_Word_WF.rda'' (68.9 MB) – 500 latent SVD dimensions: ''WP500_Win2_Word_WF_svd500.rda'' (185.9 MB)

==== Neural word embeddings ====

Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package.

  * word2vec: ''GoogleNews300_wf200k.rda'' (129.2 MiB)
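As a quick sanity check after downloading one of the models, it can be loaded and queried for nearest neighbours with the ''wordspace'' package. A sketch, assuming ''WP500_Win2_Lemma_svd500.rda'' is in the working directory; the exact target format (here a lemma with POS suffix such as ''book_N'') is an assumption and may differ between models:

```r
library(wordspace)

# Loading the .rda file creates an object with the same name as the file
load("WP500_Win2_Lemma_svd500.rda")

# Nearest neighbours of a target in the model
# (target format "lemma_POS" is an assumption; check rownames() of the object)
nearest.neighbours(WP500_Win2_Lemma_svd500, "book_N", n = 10)
```

If the target is not found, inspect ''head(rownames(WP500_Win2_Lemma_svd500))'' to see how targets are labelled in that particular model.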
===== Web interfaces =====

  * Web interface for several pre-trained Infomap models (CIMeC, U Trento)
  * Explore word2vec embeddings (FAU Erlangen-Nürnberg)
  * Explore DSMs based on Wikipedia (FAU Erlangen-Nürnberg)