Courses and Tutorials on DSM
ESSLLI '09 – NAACL-HLT 2010 – ESSLLI '16 & '18 – Software & data sets – Bibliography
Software for the course
Practical examples and exercises for these courses and tutorials are based on the user-friendly software package wordspace for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:
- Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
  - sparsesvd
  - iotools
  - tm (optional)
  - quanteda (optional)
  - Rcpp (needed on Linux only)
- Install the wordspace package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here:
  - download a suitable version of the package for your platform
  - in the RStudio installer, select “Install from: Package Archive File”
- During the course, you will be asked to install a further package with additional evaluation tasks (wordspaceEval) from a password-protected Web page:
  - download a suitable version and select “Install from: Package Archive File” in RStudio
- Download the sample data files listed below
- Download one or more of the pre-compiled DSMs listed below
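If you prefer the R console to the graphical installer, the package setup described above can be sketched roughly as follows. This is a minimal sketch, not part of the original instructions: it assumes an internet connection and a configured CRAN mirror, and the archive path in the last step is a placeholder for whatever file you actually downloaded.

```r
# Helper packages named in the list above.
# tm and quanteda are optional; Rcpp is only needed on Linux.
pkgs <- c("sparsesvd", "iotools", "tm", "quanteda", "Rcpp")

# Install only the packages that are not yet in the local library.
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Install wordspace itself from CRAN.
install.packages("wordspace")

# Alternative: install from a downloaded package archive file.
# The path below is a placeholder (an assumption), so the call is commented out.
# install.packages("path/to/downloaded_package_archive", repos = NULL)
```

Setting `repos = NULL` tells `install.packages()` to treat its first argument as a local file rather than a package name, which corresponds to the “Install from: Package Archive File” option in RStudio.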
Example data sets
- verb_dep.txt.gz (21.6 MB)
- adj_noun_tokens.txt.gz (8.3 MB)
- delta_de_termdoc.txt.gz (18.4 MB)
- potter_l2r2.txt.gz (51.3 MB)
- potter_lemmas.txt.gz (1.1 MB)
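Before loading one of these files into a DSM, it can help to look at its first few lines without decompressing it on disk. The helper below is an illustrative sketch: `peek_gz` is a hypothetical name, and UTF-8 encoding is an assumption, since the page does not document the file formats.

```r
# Return the first n lines of a gzip-compressed text file.
# peek_gz is a hypothetical helper; the UTF-8 encoding is an assumption.
peek_gz <- function(path, n = 5) {
  con <- gzfile(path, open = "rt", encoding = "UTF-8")
  on.exit(close(con))  # close the connection even if reading fails
  readLines(con, n = n)
}

# Inspect a downloaded data set if it is in the working directory.
if (file.exists("verb_dep.txt.gz")) {
  print(peek_gz("verb_dep.txt.gz"))
}
```

Once the column layout is known, a matching reader can be chosen, for example one of the import functions provided by wordspace.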
Pre-compiled DSMs
Online access (Web interfaces)
- Web interface for several pre-trained Infomap models (CIMeC, U Trento)
- Explore a German LSA space (CogSci, U Osnabrück)
Off-the-shelf packages for DSM
- HiDEx, the High-Dimensional Explorer
- S-Space Package (work in progress)
- Wordspaces (interactive exploration)
- Divisi (semantic networks, tensors & SVD in Python)
Downloads
Data sets
- Verb + object noun co-occurrences (tokens) extracted from the British National Corpus: bnc_vobj_filtered.txt.gz (15 MB)
- A 5-million word corpus of Harry Potter fan fiction in lemma_pos format (pre-cleaned): potter_tokens.txt.gz (8.9 MB)
- NEW: DSM for 34,150 English nouns from the 2-billion-word ukWaC corpus: ukwac_vobj_S_svd.rda (158 MB)
  - verb-object co-occurrences; features are 3,371 frequent verbs; log-scaled t-score; 300 SVD dimensions
  - nearest-neighbour demo with visualisation: neighbour_demo.R
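As a rough illustration of how the pre-compiled model might be used, the sketch below loads the .rda file and looks up nearest neighbours with wordspace. It is not the content of neighbour_demo.R. It assumes that the file sits in the working directory, that the first object stored in it is a matrix with nouns as row names, and that "book" is among those nouns; none of this is stated on the page.

```r
library(wordspace)

rda_file <- "ukwac_vobj_S_svd.rda"

if (file.exists(rda_file)) {
  # load() returns the names of the objects it restores, so the
  # object name does not have to be known in advance.
  obj_names <- load(rda_file)
  print(obj_names)

  # Assumption: the first restored object is the DSM matrix.
  M <- get(obj_names[1])
  print(dim(M))

  # Assumption: "book" is one of the target nouns; otherwise fall
  # back to the first row label.
  term <- if ("book" %in% rownames(M)) "book" else rownames(M)[1]
  print(nearest.neighbours(M, term, n = 10))
}
```

`nearest.neighbours()` returns the `n` closest rows together with their distances to the query term; smaller values mean more similar nouns.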