Differences

This shows you the differences between two versions of the page.

--- course:material [2018/07/26 11:29]
schtepf [Off-the-shelf packages for DSM]
+++ course:material [2021/08/11 16:12]
schtepf [Example data sets]
@@ Line 1: / Line 1: @@
 ====== Courses and Tutorials on DSM  ======
-[[course:esslli2009:start|ESSLLI '09]] –
+[[course:esslli2009:start|ESSLLI 2009]] –
 [[course:acl2010:start|NAACL-HLT 2010]] –
 [[course:esslli2018:start|ESSLLI '16 & '18]] –
+[[course:esslli2021:start|ESSLLI 2021]] –
 **Software & data sets** –
 [[course:bibliography|Bibliography]]
@@ Line 12: / Line 13: @@
 Practical examples and exercises for these courses and tutorials are based on the user-friendly software package [[http://wordspace.r-forge.r-project.org/|wordspace]] for the interactive statistical computing environment [[http://www.r-project.org/|R]].  If you want to follow along, please bring your own laptop and set up the required software as follows:
-  - Install up-to-date versions of [[https://cran.r-project.org/banner.shtml|R]] and the [[https://www.rstudio.com/products/rstudio/download/#download|RStudio]] GUI
+  - Install up-to-date versions of [[https://cran.r-project.org/banner.shtml|R]] (4.0 or newer) and the [[https://www.rstudio.com/products/rstudio/download/#download|RStudio]] GUI
   - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
-    * ''sparsesvd''
+    * ''sparsesvd'' (v0.2)
-    * ''iotools''
+    * ''wordspace'' (v0.2-6)
-    * ''tm'' (optional)
+    * recommended: ''e1071'', ''rsparse'', ''Rtsne'', ''uwot''
-    * ''quanteda'' (optional)
+    * optional: ''tm'', ''quanteda'', ''data.table'', ''wordcloud'', ''shiny'', ''spacyr'', ''udpipe'', ''coreNLP'' (don't worry if some of these fail to install)
-    * ''Rcpp'' (needed on Linux only)
+    * optional: ''NMF'' (also install ''BiocManager'', then run the command ''BiocManager::install("Biobase")'')
-  - Install the ''wordspace'' package itself.  It is available from CRAN through the standard installer, but you may be asked to use the latest version available here:
-    * ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]]
-    * download a suitable version of the package for your platform
-    * in the RStudio installer, select “Install from: Package Archive File”
   - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page:
-    * ''wordspaceEval'' v0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required)
+    * ''wordspaceEval'' v0.2: [[http://www.collocations.de/data/protected/wordspaceEval_0.2.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.2.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.2.zip|Windows]] (login required)
+    * if you are stuck with R v3.x, please use the older package version 0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required)
     * download a suitable version and select “Install from: Package Archive File” in RStudio
   - Download the sample data files listed below
   - Download one or more of the pre-compiled DSMs listed below
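The CRAN part of the setup steps above can also be scripted from the R console instead of using the GUI installer. The following is a minimal sketch, not an official installer script: the variable names and the helper ''missing_packages()'' are made up for illustration, and the optional packages as well as ''wordspaceEval'' (which comes from a password-protected page) are deliberately left out.

```r
# Check the R version recommended above (4.0 or newer)
if (getRversion() < "4.0.0") {
  warning("R 4.0 or newer is recommended for the current package versions")
}

# Package lists taken from the instructions above
required    <- c("sparsesvd", "wordspace")
recommended <- c("e1071", "rsparse", "Rtsne", "uwot")

# Hypothetical helper: which of the given packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(c(required, recommended))
print(to_install)

# Only install when run interactively (e.g. in the RStudio console),
# so that sourcing this script in batch mode never triggers downloads
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Running the script a second time should print an empty character vector once everything is installed.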
+===== Scaling R to large data sets =====
+Most of our hands-on examples work reasonably well in a standard R installation, even on a moderately powerful laptop computer.
+However, if you intend to work on real-life tasks and process large DSMs, it is important to enable multi-threaded computation
+in R. Since DSMs build on matrix operations, a multi-threaded linear algebra library (“BLAS”) is key.
+  - On Linux, it should be sufficient to install the OpenBLAS package, e.g. on Ubuntu: ''sudo apt install libopenblas-dev''
+  - On macOS, follow [[https://groups.google.com/g/r-sig-mac/c/YN6uNYCIZK0|these instructions]] to enable the vecLib BLAS built into macOS.  You may also want to [[https://mac.r-project.org/openmp/|enable OpenMP]] for an additional speed boost on expensive distance metrics (but this is less important).
+  - On Windows, you can try installing [[https://mran.microsoft.com/open|Microsoft R Open]] or search the Web for alternative solutions.
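To see whether a change of BLAS library has taken effect, it may help to ask R which libraries it is linked against and to time a dense matrix product before and after. This is only a rough sketch using base R; the matrix size is an arbitrary choice for illustration, and the timing is not a proper benchmark.

```r
# Which BLAS / LAPACK libraries is this R session using?
# (available in R 3.4 and newer; the BLAS entry may be empty on some platforms)
blas_info   <- extSoftVersion()["BLAS"]
lapack_info <- La_library()
cat("BLAS:  ", blas_info, "\n")
cat("LAPACK:", lapack_info, "\n")

# Rough timing of a dense cross-product t(M) %*% M,
# a typical BLAS-bound operation
set.seed(1)
n <- 1000                      # arbitrary size, adjust to your machine
M <- matrix(rnorm(n * n), nrow = n)
timing <- system.time(G <- crossprod(M))
print(timing)
```

With a multi-threaded BLAS, the ''elapsed'' time is typically well below the ''user'' time, since the work is spread across several cores.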
+<!-- doesn't apply at the moment --
+==== Getting the latest & greatest ====
+During the course, you may be asked to install a new version of ''wordspace'' that hasn't been submitted to CRAN yet.  In this case, please follow these instructions:
+  - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
+    * ''sparsesvd''
+    * ''iotools''
+    * ''Rcpp'' (needed on Linux only)
+  - Download an appropriate version of the package for your platform:
+    * ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]]
+  - In the RStudio installer, select “Install from: Package Archive File”
+You can also check the [[http://wordspace.r-forge.r-project.org/|wordspace homepage]] for new releases and installation instructions.
+-->
 ===== Example data sets =====
@@ Line 36: / Line 63: @@
   * ''[[http://www.collocations.de/data/potter_l2r2.txt.gz|potter_l2r2.txt.gz]]'' (51.3 MB)
   * ''[[http://www.collocations.de/data/potter_lemmas.txt.gz|potter_lemmas.txt.gz]]'' (1.1 MB)
+  * ''[[http://www.collocations.de/data/VSS.txt|VSS.txt]]'' (37 kB)
 ===== Pre-compiled DSMs =====
-Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file, and can be loaded into R with the command ''load("model.rda")''.
+Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file, which can be loaded into R with the command ''load("model.rda")'' and creates an object with the same name (''model'').
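The behaviour of ''load()'' can be tried out without downloading anything. The sketch below writes a small stand-in matrix to a temporary ''model.rda'' file and loads it again; the toy object and file path are assumptions made for illustration, not one of the pre-compiled DSMs.

```r
# Toy stand-in for a DSM: a 2 x 3 matrix with row and column labels
model <- matrix(1:6, nrow = 2,
                dimnames = list(c("cat", "dog"), paste0("dim", 1:3)))

# Save it under the name "model" in an .rda file, then remove it
rda_file <- file.path(tempdir(), "model.rda")
save(model, file = rda_file)
rm(model)

# load() restores the object under its original name and
# (invisibly) returns the names of all objects it created
loaded_names <- load(rda_file)
print(loaded_names)
print(dim(model))
```

Capturing the return value of ''load()'' is a convenient way to find out what an unfamiliar ''.rda'' file has added to the workspace.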
 ==== DSMs based on the English Wikipedia ====
-These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions.
+These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
-  * dependency-filtered: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (30.4 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (175.9 MB)
+  * dependency-filtered: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (31.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (179.3 MB)
-  * dependency-structured: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (30.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (176.8 MB)
+  * dependency-structured: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (31.6 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (180.3 MB)
-  * L2/R2 surface span: ''[[http://www.collocations.de/data/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (50.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (173.7 MB)
+  * L2/R2 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (51.8 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (177.1 MB)
-  * L5/R5 surface span: ''[[http://www.collocations.de/data/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (99.3 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (176.5 MB)
+  * L5/R5 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (103.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (179.9 MB)
-  * L30/R30 surface span: ''[[http://www.collocations.de/data/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (295.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (179.5 MB)
+  * L30/R30 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (311.4 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (182.8 MB)
-  * term-document model: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (101.3 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (158.7 MB)
+  * term-document model: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (105.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (162.5 MB)
-  * type contexts (L1+R1): ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (153.9 MB)
+  * type contexts (L1+R1): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.8 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (157.0 MB)
-  * type contexts (L2+R2 POS tags): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (55.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (172.2 MB)
+  * type contexts (L2+R2): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]'' (33.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]'' (64.3 MB)
-  * word forms L2/R2: ''[[http://www.collocations.de/data/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (61.6 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (182.0 MB)
+  * type contexts (L2+R2 POS tags): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (56.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (175.3 MB)
-  * word forms L2/R2 with non-lemmatized features: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (65.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (182.5 MB)
+  * word forms L2/R2: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (63.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (185.5 MB)
+  * word forms L2/R2 with non-lemmatized features: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (68.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (185.9 MB)
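The effect of power scaling on the latent SVD dimensions can be illustrated in base R. This is a sketch on a small random matrix, not the code used to build the models above (which would normally go through ''wordspace'' functions such as ''dsm.projection()''); the helper ''power_scale()'' and all variable names are hypothetical.

```r
set.seed(42)

# Toy stand-in for a scored co-occurrence matrix (20 targets x 8 features)
M <- matrix(rnorm(20 * 8), nrow = 20)
M <- M / sqrt(rowSums(M^2))          # L2-normalize the row vectors

# Truncated SVD: M is approximated by U %*% diag(sigma) %*% t(V)
k     <- 3
dec   <- svd(M, nu = k, nv = k)
sigma <- dec$d[1:k]

# Hypothetical helper: scale latent dimension j by sigma[j]^P
power_scale <- function(U, sigma, P) sweep(U, 2, sigma^P, `*`)

proj_P1 <- power_scale(dec$u, sigma, 1)  # ordinary projection, equals M %*% V
proj_P0 <- power_scale(dec$u, sigma, 0)  # P = 0: all dimensions equalized

# With P = 0 every latent dimension has the same overall weight ...
print(colSums(proj_P0^2))
# ... but the row vectors are no longer unit length (not re-normalized)
print(range(sqrt(rowSums(proj_P0^2))))
```

Because the reduced vectors are not re-normalized, similarity measures that depend on vector length (such as Euclidean distance) will give different results from angular measures on these models.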
 ==== Neural word embeddings ====
@@ Line 60: / Line 89: @@
 Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package.
-  * word2vec: ''[[http://www.collocations.de/data/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB)
+  * word2vec: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB)
 ===== Web interfaces =====
-  * Web interface for several pre-trained [[http://clic.cimec.unitn.it/infomap-query/|Infomap models]] (CIMeC, U Trento)
+  * Web interface for several pre-trained **[[http://clic.cimec.unitn.it/infomap-query/|Infomap models]]** (CIMeC, U Trento)
+  * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/word2vec/|word2vec embeddings]]** (FAU Erlangen-Nürnberg)
-===== Some other off-the-shelf packages for DSM =====
+  * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/WP500/|DSMs based on Wikipedia]]** (FAU Erlangen-Nürnberg)
-**Python**
-  * [[https://radimrehurek.com/gensim/|Gensim]] – high-performance topic modelling
-  * [[http://clic.cimec.unitn.it/composes/toolkit/|DISSECT]] – easy-to-use package developed by the COMPOSES project
-  * [[https://pypi.org/project/Divisi/|Divisi]] – semantic networks, tensors & SVD
-**Java**
-  * [[https://github.com/semanticvectors/semanticvectors/wiki|Semantic Vectors]] – scalable implementation based on random indexing
-  * [[https://github.com/fozziethebeat/S-Space|S-Space]] package
-**C/C++**
-  * [[http://infomap-nlp.sourceforge.net/|Infomap NLP]] – classical LSA-style DSM
-  * [[http://www.psych.ualberta.ca/~westburylab/downloads/HiDEx.download.html|HiDEx]], the High-Dimensional Explorer
-  * [[https://github.com/facebookresearch/fastText|FastText]] – state-of-the-art neural word embeddings
-**Perl**
-  * [[http://senseclusters.sourceforge.net/|SenseClusters]]
