Courses and Tutorials on DSM

Software for the course

Practical examples and exercises for these courses and tutorials are based on the user-friendly software package wordspace for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:

  1. Install up-to-date versions of R and the RStudio GUI
  2. Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
    • sparsesvd
    • iotools
    • tm (optional)
    • quanteda (optional)
    • Rcpp (needed on Linux only)
  3. Install the wordspace package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here:
    • wordspace v0.2-0: Source/Linux, MacOS, Windows
    • download a suitable version of the package for your platform
    • in the RStudio installer, select “Install from: Package Archive File”
  4. During the course, you will be asked to install a further package with additional evaluation tasks (wordspaceEval) from a password-protected Web page:
    • wordspaceEval v0.1: Source/Linux, MacOS, Windows (login required)
    • download a suitable version and select “Install from: Package Archive File” in RStudio
  5. Download the sample data files listed below
  6. Download one or more of the pre-compiled DSMs listed below
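
Steps 2 and 3 can also be carried out from the R console rather than through the RStudio installer. The following is a minimal sketch, assuming the packages are fetched from your default CRAN mirror. The helper name missing_pkgs is made up for illustration, and the path in the final comment is a placeholder for whichever archive file you actually downloaded.

```r
# Packages named in step 2, plus wordspace itself (step 3).
required <- c("sparsesvd", "iotools", "wordspace")
optional <- c("tm", "quanteda")

# Rcpp is listed as needed on Linux only.
if (Sys.info()[["sysname"]] == "Linux") {
  required <- c("Rcpp", required)
}

# Hypothetical helper: return the packages that are not installed yet.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Only install when run interactively, so that sourcing this script
# non-interactively never triggers downloads.
if (interactive()) {
  to_install <- missing_pkgs(c(required, optional))
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
}

# To install a downloaded package archive instead of the CRAN version
# (the path below is a placeholder, not a real file name):
# install.packages("path/to/downloaded_package_file", repos = NULL)
```

Passing repos = NULL tells install.packages() to treat its argument as a local file, which corresponds to the "Install from: Package Archive File" option in RStudio.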

Example data sets

Pre-compiled DSMs

Online access (Web interfaces)

Off-the-shelf packages for DSM

Downloads

Data sets

  • Verb + object noun co-occurrences (tokens) extracted from the British National Corpus: bnc_vobj_filtered.txt.gz (15 MB)
  • A 5-million word corpus of Harry Potter fan fiction in lemma_pos format (pre-cleaned): potter_tokens.txt.gz (8.9 MB)
  • NEW: DSM for 34,150 English nouns from the 2-billion-word ukWaC corpus: ukwac_vobj_S_svd.rda (158 MB)
    • based on verb-object co-occurrences; features are 3,371 frequent verbs, scored with log-scaled t-score and reduced to 300 SVD dimensions
    • nearest-neighbour demo with visualisation: neighbour_demo.R
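
Once the ukWaC model has been downloaded, it can be explored from the R console. The sketch below assumes that the .rda file sits in the current working directory, that it stores a single matrix with the nouns as row labels, and that "book" is one of those labels. None of this is confirmed on this page, so the code checks before it queries. The helper load_first_object is a made-up name; nearest.neighbours() is the neighbour search function provided by wordspace.

```r
# Hypothetical helper: load an .rda file into a private environment and
# return the first object stored in it, whatever that object is called.
load_first_object <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)
  get(obj_names[1], envir = env)
}

dsm_file <- "ukwac_vobj_S_svd.rda"  # assumed location: working directory
target   <- "book"                  # assumed to be a row label of the model

if (file.exists(dsm_file)) {
  M <- load_first_object(dsm_file)
  print(dim(M))  # number of rows (nouns) and columns (SVD dimensions)

  if (target %in% rownames(M)) {
    # Ten nearest neighbours of the target noun.
    print(wordspace::nearest.neighbours(M, target, n = 10))
  } else {
    message("'", target, "' is not a row label; inspect head(rownames(M)).")
  }
} else {
  message("Download ", dsm_file, " first and place it in ", getwd())
}
```

Loading into a separate environment avoids having to know the name under which the model was saved, and keeps it from overwriting objects in your workspace.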