Courses and Tutorials on DSM


Software for the course

Practical examples and exercises for these courses and tutorials are based on the user-friendly software package wordspace for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:

  1. Install up-to-date versions of R and the RStudio GUI
  2. Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
    • sparsesvd
    • wordspace
    • optional: tm, quanteda, Rtsne, shiny
  3. During the course, you will be asked to install a further package with additional evaluation tasks (wordspaceEval) from a password-protected Web page:
    • wordspaceEval v0.1: Source / Linux / MacOS / Windows (login required)
    • download a suitable version and select “Install from: Package Archive File” in RStudio
  4. Download the sample data files listed below
  5. Download one or more of the pre-compiled DSMs listed below
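The CRAN installation in step 2 can also be done directly from the R console instead of the RStudio installer. A minimal sketch (the package names are taken from the list above):

```r
## install the required packages from CRAN
install.packages(c("sparsesvd", "wordspace"))

## optional extras used in some exercises
install.packages(c("tm", "quanteda", "Rtsne", "shiny"))

## verify that wordspace loads correctly
library(wordspace)
```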

Getting the latest & greatest

During the course, you may be asked to install a new version of wordspace that hasn't been submitted to CRAN yet. In this case, please follow these instructions:

  1. Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
    • sparsesvd
    • iotools
    • Rcpp (needed on Linux only)
  2. Download an appropriate version of the package for your platform
  3. In the RStudio installer, select “Install from: Package Archive File”
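Equivalently, a downloaded package archive can be installed from the R console with `install.packages()` and `repos = NULL`. The filenames below are illustrative only; substitute the file you actually downloaded:

```r
## illustrative filenames -- use the actual archive you downloaded
install.packages("wordspace.tar.gz", repos = NULL, type = "source")  # Linux / source package
install.packages("wordspace.tgz", repos = NULL)                      # macOS binary package
install.packages("wordspace.zip", repos = NULL)                      # Windows binary package
```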

You can also check the wordspace homepage for new releases and installation instructions.

Example data sets

Pre-compiled DSMs

Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file, which can be loaded into R with the command load("model.rda") and creates an object with the same name (model).
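For example (the filename and object name `WP500_svd` below are illustrative, not an actual download):

```r
library(wordspace)

## loading the .rda file creates an object named after the model
load("WP500_svd.rda")

## quick sanity check: nearest neighbours of a target word in the model
nearest.neighbours(WP500_svd, "book", n = 10)
```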

DSMs based on the English Wikipedia

These models were compiled from WP500, a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron P = 0 (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
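The scoring and projection pipeline described above can be sketched with wordspace functions. This is a sketch under assumptions: it uses the small demo matrix `DSM_TermTerm` bundled with the package in place of the actual WP500 data, and the number of latent dimensions is chosen to fit that toy example, not the settings used for the distributed models.

```r
library(wordspace)

## toy stand-in for the WP500 co-occurrence data
M <- DSM_TermTerm

## sparse simple-ll association scores, log transform, L2 row normalization
M <- dsm.score(M, score = "simple-ll", sparse = TRUE,
               transform = "log", normalize = TRUE, method = "euclidean")

## SVD projection with Caron power scaling P = 0 (equalized latent
## dimensions); note that the projected vectors are not re-normalized
S <- dsm.projection(M, method = "svd", n = 2, power = 0)
```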

Neural word embeddings

Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.

Web interfaces