ESSLLI 2009 – NAACL-HLT 2010 – ESSLLI '16 & '18 – ESSLLI 2021 – Software & data sets – Bibliography
Practical examples and exercises for these courses and tutorials are based on the user-friendly software package wordspace for the interactive statistical computing environment R. If you want to follow along, please bring your own laptop and set up the required software as follows:
sparsesvd (v0.2)
wordspace (v0.2-6)
e1071, rsparse, Rtsne, uwot
tm, quanteda, data.table, wordcloud, shiny, spacyr, udpipe, coreNLP (don't worry if some of these fail to install)
NMF (also install BiocManager, then run command BiocManager::install("Biobase"))
wordspaceEval) from a password-protected Web page:
Most of our hands-on examples work reasonably well in a standard R installation, even on a moderately powerful laptop computer. However, if you intend to work on real-life tasks and process large DSMs, it is important to enable multi-threaded computation in R. Since DSMs build on matrix operations, a multi-threaded linear algebra library (“BLAS”) is key.
sudo apt install libopenblas-dev
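To see whether R has actually picked up an optimized BLAS, the following sketch prints the libraries the current session is linked against and times a dense matrix product. It uses only base R (extSoftVersion() and La_library() require R 3.4.0 or newer); the matrix size is an arbitrary choice for illustration, and timings will vary between machines.

```r
# Report which BLAS / LAPACK libraries this R session is linked against
print(extSoftVersion()["BLAS"])
print(La_library())

# Rough timing of a dense matrix product; an optimized multi-threaded BLAS
# typically finishes noticeably faster than the reference BLAS
set.seed(1)
A <- matrix(rnorm(500 * 500), nrow = 500)  # arbitrary test size
timing <- system.time(B <- crossprod(A))   # computes t(A) %*% A
print(timing)
```

If the reported BLAS path still points to the reference implementation, R may need to be restarted or the system's default BLAS switched before the new library takes effect.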
verb_dep.txt.gz (21.6 MB)
adj_noun_tokens.txt.gz (8.3 MB)
delta_de_termdoc.txt.gz (18.4 MB)
potter_l2r2.txt.gz (51.3 MB)
potter_lemmas.txt.gz (1.1 MB)
VSS.txt (37 kB)
Pre-compiled DSMs for use with the wordspace package for R. Each model is contained in an .rda file, which can be loaded into R with the command load("model.rda") and creates an object with the same name (model).
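The loading behaviour described above can be tried without downloading anything. This minimal sketch first writes a tiny stand-in matrix to a temporary model.rda file (the toy data and file location are hypothetical, not one of the real models) and then restores it with load(), which returns the name of the object it created.

```r
# Hypothetical toy matrix standing in for a downloaded DSM
model <- matrix(c(1, 0, 2, 3), nrow = 2,
                dimnames = list(c("cat", "dog"), c("f1", "f2")))
rda_file <- file.path(tempdir(), "model.rda")
save(model, file = rda_file)
rm(model)  # make sure the object really comes from the file

# load() restores the object under its original name
# and (invisibly) returns that name as a character vector
loaded <- load(rda_file)
print(loaded)

# Fetch the object without hard-coding its name
M <- get(loaded[1])
print(dim(M))
```

Capturing the return value of load() is convenient for the long file names used here, since the restored object can then be assigned to a shorter variable name.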
These models were compiled from WP500, a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
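The effect of power scaling with $P = 0$ can be illustrated on a toy matrix in base R. This is a sketch under stated assumptions, not the procedure used to build the models: the counts are random, log(x + 1) is only a stand-in for log-transformed simple-ll scores, and the number of latent dimensions k is chosen arbitrarily. With a decomposition $M = U \Sigma V^T$, the reduced vectors are taken to be the rows of $U_k \Sigma_k^P$, so $P = 0$ leaves just $U_k$.

```r
set.seed(42)
# Toy co-occurrence counts (6 targets x 10 features), all strictly positive
M <- matrix(rpois(6 * 10, lambda = 3) + 1L, nrow = 6)

# Stand-in scoring: plain log transform (NOT simple-ll), then L2-normalize rows
S <- log(M + 1)
S <- S / sqrt(rowSums(S^2))

# Truncated SVD keeping k latent dimensions
k <- 3
dec <- svd(S, nu = k, nv = 0)

# Power scaling: weight each latent dimension by its singular value to the power P
P <- 0
reduced <- dec$u %*% diag(dec$d[seq_len(k)]^P, nrow = k)

# With P = 0 all dimensions carry equal weight (orthonormal columns) ...
print(round(crossprod(reduced), 10))
# ... but the row vectors are in general no longer of unit length
print(sqrt(rowSums(reduced^2)))
```

Because the reduced vectors are not re-normalized, distance measures that are sensitive to vector length may behave differently from angle-based measures such as cosine similarity.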
WP500_DepFilter_Lemma.rda (31.1 MB) – 500 latent SVD dimensions: WP500_DepFilter_Lemma_svd500.rda (179.3 MB)
WP500_DepStruct_Lemma.rda (31.6 MB) – 500 latent SVD dimensions: WP500_DepStruct_Lemma_svd500.rda (180.3 MB)
WP500_Win2_Lemma.rda (51.8 MB) – 500 latent SVD dimensions: WP500_Win2_Lemma_svd500.rda (177.1 MB)
WP500_Win5_Lemma.rda (103.9 MB) – 500 latent SVD dimensions: WP500_Win5_Lemma_svd500.rda (179.9 MB)
WP500_Win30_Lemma.rda (311.4 MB) – 500 latent SVD dimensions: WP500_Win30_Lemma_svd500.rda (182.8 MB)
WP500_TermDoc_Lemma.rda (105.1 MB) – 500 latent SVD dimensions: WP500_TermDoc_Lemma_svd500.rda (162.5 MB)
WP500_Ctype_L1R1_Lemma.rda (55.8 MB) – 500 latent SVD dimensions: WP500_Ctype_L1R1_Lemma_svd500.rda (157.0 MB)
WP500_Ctype_L2R2_Lemma.rda (33.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2_Lemma_svd500.rda (64.3 MB)
WP500_Ctype_L2R2pos_Lemma.rda (56.1 MB) – 500 latent SVD dimensions: WP500_Ctype_L2R2pos_Lemma_svd500.rda (175.3 MB)
WP500_Win2_Word.rda (63.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_svd500.rda (185.5 MB)
WP500_Win2_Word_WF.rda (68.9 MB) – 500 latent SVD dimensions: WP500_Win2_Word_WF_svd500.rda (185.9 MB)
Some publicly available pre-trained neural embeddings, converted into .rda format for use with the wordspace package.
GoogleNews300_wf200k.rda (129.2 MiB)