====== Courses and Tutorials on DSM ======
===== Software for the course =====
Practical examples and exercises for these courses and tutorials are based on the user-friendly **wordspace** package for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:
  - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
    * ''sparsesvd''
    * ''wordspace''
    * optional: ''tm'', ''quanteda'', ''Rtsne'', ''shiny''
    * ''Rcpp'' (needed on Linux only)
  - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page:
    * download a suitable version and select "Install from: Package Archive File" in RStudio
  - Download the sample data files listed below.
  - Download one or more of the pre-compiled DSMs listed below.
  - Install the ''wordspace'' package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here:
    * download a suitable version of the package for your platform
    * in the RStudio installer, select "Install from: Package Archive File"
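If you prefer to work from the R console rather than the RStudio installer, the CRAN packages can be installed directly; a minimal sketch (package names taken from the list above):

```r
# Required packages from CRAN
install.packages(c("sparsesvd", "wordspace"))

# Optional packages used in some examples and exercises
install.packages(c("tm", "quanteda", "Rtsne", "shiny"))
```

A locally downloaded package archive (e.g. for ''wordspaceEval'') can likewise be installed from the console with ''install.packages("path/to/archive", repos = NULL)''.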
===== Example data sets =====
  * ''verb_dep.txt.gz'' (21.6 MB)
  * ''adj_noun_tokens.txt.gz'' (8.3 MB)
  * ''delta_de_termdoc.txt.gz'' (18.4 MB)
  * ''potter_l2r2.txt.gz'' (51.3 MB)
  * ''potter_lemmas.txt.gz'' (1.1 MB)

===== Pre-compiled DSMs =====

Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file, which can be loaded into R with the command ''load("model.rda")'' and creates an object with the same name (''model'').

==== DSMs based on the English Wikipedia ====

These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron //P// = 0 (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.

  * dependency-filtered: ''WP500_DepFilter_Lemma.rda'' (31.1 MB) – 500 latent SVD dimensions: ''WP500_DepFilter_Lemma_svd500.rda'' (179.3 MB)
  * dependency-structured: ''WP500_DepStruct_Lemma.rda'' (31.6 MB) – 500 latent SVD dimensions: ''WP500_DepStruct_Lemma_svd500.rda'' (180.3 MB)
  * L2/R2 surface span: ''WP500_Win2_Lemma.rda'' (51.8 MB) – 500 latent SVD dimensions: ''WP500_Win2_Lemma_svd500.rda'' (177.1 MB)
  * L5/R5 surface span: ''WP500_Win5_Lemma.rda'' (103.9 MB) – 500 latent SVD dimensions: ''WP500_Win5_Lemma_svd500.rda'' (179.9 MB)
  * L30/R30 surface span: ''WP500_Win30_Lemma.rda'' (311.4 MB) – 500 latent SVD dimensions: ''WP500_Win30_Lemma_svd500.rda'' (182.8 MB)
  * term-document model: ''WP500_TermDoc_Lemma.rda'' (105.1 MB) – 500 latent SVD dimensions: ''WP500_TermDoc_Lemma_svd500.rda'' (162.5 MB)
  * type contexts (L1+R1): ''WP500_Ctype_L1R1_Lemma.rda'' (55.8 MB) – 500 latent SVD dimensions: ''WP500_Ctype_L1R1_Lemma_svd500.rda'' (157.0 MB)
  * type contexts (L2+R2): ''WP500_Ctype_L2R2_Lemma.rda'' (33.1 MB) – 500 latent SVD dimensions: ''WP500_Ctype_L2R2_Lemma_svd500.rda'' (64.3 MB)
  * type contexts (L2+R2 POS tags): ''WP500_Ctype_L2R2pos_Lemma.rda'' (56.1 MB) – 500 latent SVD dimensions: ''WP500_Ctype_L2R2pos_Lemma_svd500.rda'' (175.3 MB)
  * word forms L2/R2: ''WP500_Win2_Word.rda'' (63.9 MB) – 500 latent SVD dimensions: ''WP500_Win2_Word_svd500.rda'' (185.5 MB)
  * word forms L2/R2 with non-lemmatized features: ''WP500_Win2_Word_WF.rda'' (68.9 MB) – 500 latent SVD dimensions: ''WP500_Win2_Word_WF_svd500.rda'' (185.9 MB)

==== Neural word embeddings ====

Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package.

  * word2vec: ''GoogleNews300_wf200k.rda'' (129.2 MiB)
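As a quick sanity check after downloading one of the models, it can be loaded and queried for nearest neighbours with the ''wordspace'' package. A sketch, assuming ''WP500_Win2_Lemma_svd500.rda'' is in the working directory; the exact target format (here a lemma with POS suffix such as ''book_N'') is an assumption and may differ between models:

```r
library(wordspace)

# Loading the .rda file creates an object with the same name as the file
load("WP500_Win2_Lemma_svd500.rda")

# Nearest neighbours of a target in the model
# (target format "lemma_POS" is an assumption; check rownames() of the object)
nearest.neighbours(WP500_Win2_Lemma_svd500, "book_N", n = 10)
```

If the target is not found, inspect ''head(rownames(WP500_Win2_Lemma_svd500))'' to see how targets are labelled in that particular model.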
===== Web interfaces =====

  * Web interface for several pre-trained Infomap models (CIMeC, U Trento)
  * Explore word2vec embeddings (FAU Erlangen-Nürnberg)
  * Explore DSMs based on Wikipedia (FAU Erlangen-Nürnberg)