Differences

This shows you the differences between two versions of the page.

course:material [2018/07/26 11:29]
schtepf [Off-the-shelf packages for DSM]
course:material [2021/08/11 16:12]
schtepf [Example data sets]
Line 1: Line 1:
 ====== Courses and Tutorials on DSM  ====== ====== Courses and Tutorials on DSM  ======
  
-[[course:esslli2009:start|ESSLLI '09]] –+[[course:esslli2009:start|ESSLLI 2009]] –
 [[course:acl2010:start|NAACL-HLT 2010]] – [[course:acl2010:start|NAACL-HLT 2010]] –
 [[course:esslli2018:start|ESSLLI '16 & '18]] – [[course:esslli2018:start|ESSLLI '16 & '18]] –
 +[[course:esslli2021:start|ESSLLI 2021]] –
 **Software & data sets** – **Software & data sets** –
 [[course:bibliography|Bibliography]] [[course:bibliography|Bibliography]]
Line 12: Line 13:
 Practical examples and exercises for these courses and tutorials are based on the user-friendly software package [[http://wordspace.r-forge.r-project.org/|wordspace]] for the interactive statistical computing environment [[http://www.r-project.org/|R]].  If you want to follow along, please bring your own laptop and set up the required software as follows: Practical examples and exercises for these courses and tutorials are based on the user-friendly software package [[http://wordspace.r-forge.r-project.org/|wordspace]] for the interactive statistical computing environment [[http://www.r-project.org/|R]].  If you want to follow along, please bring your own laptop and set up the required software as follows:
  
-  - Install up-to-date versions of [[https://cran.r-project.org/banner.shtml|R]] and the [[https://www.rstudio.com/products/rstudio/download/#download|RStudio]] GUI+  - Install up-to-date versions of [[https://cran.r-project.org/banner.shtml|R]] (4.0 or newer) and the [[https://www.rstudio.com/products/rstudio/download/#download|RStudio]] GUI
   - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:    - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: 
-    * ''sparsesvd'' +    * ''sparsesvd'' (v0.2) 
-    * ''iotools'' +    * ''wordspace'' (v0.2-6) 
-    * ''tm'' (optional) +    * recommended: ''e1071'', ''rsparse'', ''Rtsne'', ''uwot'' 
-    * ''quanteda'' (optional) +    * optional: ''tm'', ''quanteda'', ''data.table'', ''wordcloud'', ''shiny'', ''spacyr'', ''udpipe'', ''coreNLP'' (don't worry if some of these fail to install) 
-    * ''Rcpp'' (needed on Linux only) +    * optional: ''NMF'' (also install ''BiocManager'', then run command ''BiocManager::install("Biobase")'')
-  - Install the ''wordspace'' package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here: +
-    * ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]] +
-    * download a suitable version of the package for your platform +
-    * in the RStudio installer, select “Install from: Package Archive File”+
   - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page:   - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page:
-    * ''wordspaceEval'' v0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required)+    * ''wordspaceEval'' v0.2: [[http://www.collocations.de/data/protected/wordspaceEval_0.2.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.2.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.2.zip|Windows]] (login required) 
 +    * if you are stuck with R v3.x, please use the older package version 0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required)
     * download a suitable version and select “Install from: Package Archive File” in RStudio     * download a suitable version and select “Install from: Package Archive File” in RStudio
   - Download the sample data files listed below   - Download the sample data files listed below
   - Download one or more of the pre-compiled DSMs listed below   - Download one or more of the pre-compiled DSMs listed below
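
The CRAN part of the setup can also be checked from the R console. The following is only a minimal sketch: the helper name ''missing_packages'' is made up for this illustration, the package list simply repeats the required and recommended packages named in the revised list, and the actual installation call is left commented out because it needs network access and a configured CRAN mirror.

```r
# Required and recommended CRAN packages named in the list above
wanted <- c("sparsesvd", "wordspace", "e1071", "rsparse", "Rtsne", "uwot")

# Hypothetical helper: which of the wanted packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

todo <- missing_packages(wanted)
if (length(todo) > 0) {
  message("Not installed yet: ", paste(todo, collapse = ", "))
  # Uncomment to install from CRAN (needs network access):
  # install.packages(todo)
} else {
  message("All listed packages are already installed.")
}
```

Running the check first avoids re-installing packages that are already present.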
 +
 +===== Scaling R to large data sets =====
 +
 +Most of our hands-on examples work reasonably well in a standard R installation, even on a moderately powerful laptop computer.
 +However, if you intend to work on real-life tasks and process large DSMs, it is important to enable multi-threaded computation
 +in R. Since DSMs build on matrix operations, a multi-threaded linear algebra library (“BLAS”) is key.
 +
 +  - In Linux, it should be sufficient to install the OpenBLAS package, e.g. in Ubuntu: ''sudo apt install libopenblas-dev''
 +  - In MacOS, follow [[https://groups.google.com/g/r-sig-mac/c/YN6uNYCIZK0|these instructions]] to enable the VecLib BLAS built into MacOS.  You may also want to [[https://mac.r-project.org/openmp/|enable OpenMP]] for an additional speed boost on expensive distance metrics (but this is less important).
 +  - In Windows, you can try installing [[https://mran.microsoft.com/open|Microsoft R Open]] or do a Web search for alternative solutions.
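
To see whether a different BLAS is actually in use, it may help to inspect the session and time a dense matrix product before and after the change. This is a rough sketch, not a proper benchmark: the matrix size is an arbitrary choice, and the BLAS/LAPACK lines in the ''sessionInfo()'' output are only available in reasonably recent R versions.

```r
# Which BLAS / LAPACK libraries is this R session linked against?
# (recent R versions list them in the sessionInfo() output)
print(sessionInfo())

# Rough timing of a dense cross-product, which is dominated by BLAS calls.
# The matrix size is arbitrary; increase it for a clearer signal.
set.seed(42)
n <- 500
A <- matrix(rnorm(n * n), nrow = n)
timing <- system.time(B <- crossprod(A))  # computes t(A) %*% A
print(timing)
```

With a multi-threaded BLAS, the "user" time typically exceeds the "elapsed" time for larger matrices, because several cores work in parallel.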
 +
 +
 +<!-- doesn't apply at the moment -- 
 +
 +==== Getting the latest & greatest ====
 +
 +During the course, you may be asked to install a new version of ''wordspace'' that hasn't been submitted to CRAN yet.  In this case, please follow these instructions:
 +
 +  - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: 
 +    * ''sparsesvd''
 +    * ''iotools''
 +    * ''Rcpp'' (needed on Linux only)
 +  - Download an appropriate version of the package for your platform:
 +    * ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]]
 +  - In the RStudio installer, select “Install from: Package Archive File”
 +
 +You can also check the [[http://wordspace.r-forge.r-project.org/|wordspace homepage]] for new releases and installation instructions.
 +
 +-->
  
 ===== Example data sets ===== ===== Example data sets =====
Line 36: Line 63:
   * ''[[http://www.collocations.de/data/potter_l2r2.txt.gz|potter_l2r2.txt.gz]]'' (51.3 MB)   * ''[[http://www.collocations.de/data/potter_l2r2.txt.gz|potter_l2r2.txt.gz]]'' (51.3 MB)
   * ''[[http://www.collocations.de/data/potter_lemmas.txt.gz|potter_lemmas.txt.gz]]'' (1.1 MB)    * ''[[http://www.collocations.de/data/potter_lemmas.txt.gz|potter_lemmas.txt.gz]]'' (1.1 MB) 
 +  * ''[[http://www.collocations.de/data/VSS.txt|VSS.txt]]'' (37 kB)
  
 ===== Pre-compiled DSMs ===== ===== Pre-compiled DSMs =====
  
-Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file, and can be loaded into R with the command ''load("model.rda")''.+Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file, which can be loaded into R with the command ''load("model.rda")'' and creates an object with the same name (''model'').
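
The naming behaviour of ''load()'' can be tried out without downloading anything. In the sketch below, a small matrix saved to a temporary file stands in for a real DSM; the object and file names are placeholders chosen for this illustration.

```r
# Stand-in for a downloaded model: save a small object to a temporary .rda file
model <- matrix(1:6, nrow = 2,
                dimnames = list(c("cat", "dog"), c("f1", "f2", "f3")))
rda_file <- file.path(tempdir(), "model.rda")
save(model, file = rda_file)
rm(model)

# load() restores the object under its original name and
# (invisibly) returns the names of all restored objects
loaded <- load(rda_file)
print(loaded)      # "model"
print(dim(model))  # 2 3
```

Capturing the return value of ''load()'' is a convenient way to find out what an unfamiliar ''.rda'' file has added to the workspace.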
  
 ==== DSMs based on the English Wikipedia ==== ==== DSMs based on the English Wikipedia ====
  
-These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions.+These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
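
What power scaling with $P = 0$ does to the latent dimensions can be illustrated in base R. This is a sketch on a random toy matrix, not the procedure used to build the models above: the helper ''svd_project'' is hypothetical and uses the dense ''svd()'' function rather than a sparse SVD. The ''wordspace'' package is understood to offer a comparable ''power'' option in its projection function, but please check the package documentation.

```r
# Toy matrix standing in for a scored co-occurrence matrix
set.seed(1)
M <- matrix(rnorm(6 * 4), nrow = 6)
M <- M / sqrt(rowSums(M^2))   # L2-normalise the row vectors

# Hypothetical helper: project onto the first k latent dimensions,
# scaling dimension i by sigma_i^P (power scaling)
svd_project <- function(M, k, P = 1) {
  dec <- svd(M, nu = k, nv = 0)
  sweep(dec$u, 2, dec$d[1:k]^P, `*`)
}

X1 <- svd_project(M, k = 3, P = 1)  # ordinary SVD projection: U * Sigma
X0 <- svd_project(M, k = 3, P = 0)  # equalised dimensions: just U

print(colSums(X1^2))        # squared singular values: dimensions differ in weight
print(colSums(X0^2))        # all 1: every latent dimension has the same weight
print(sqrt(rowSums(X0^2)))  # row norms of the reduced vectors (not re-normalised)
```

Since the reduced vectors are not re-normalized, their lengths generally differ, which matters for distance measures that are sensitive to vector length.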
  
-  * dependency-filtered: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (30.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (175.MB) +  * dependency-filtered: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (31.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (179.MB) 
-  * dependency-structured: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (30.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (176.MB) +  * dependency-structured: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (31.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (180.MB) 
-  * L2/R2 surface span: ''[[http://www.collocations.de/data/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (50.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (173.MB) +  * L2/R2 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (51.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (177.MB) 
-  * L5/R5 surface span: ''[[http://www.collocations.de/data/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (99.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (176.MB) +  * L5/R5 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (103.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (179.MB) 
-  * L30/R30 surface span: ''[[http://www.collocations.de/data/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (295.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (179.MB) +  * L30/R30 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (311.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (182.MB) 
-  * term-document model: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (101.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (158.MB) +  * term-document model: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (105.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (162.MB) 
-  * type contexts (L1+R1): ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (153.MB) +  * type contexts (L1+R1): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (157.0 MB) 
-  * type contexts (L2+R2 POS tags): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (55.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (172.MB) +  * type contexts (L2+R2): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]'' (33.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]'' (64.3 MB) 
-  * word forms L2/R2: ''[[http://www.collocations.de/data/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (61.MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (182.MB) +  * type contexts (L2+R2 POS tags): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (56.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (175.MB) 
-  * word forms L2/R2 with non-lemmatized features: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (65.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (182.MB)+  * word forms L2/R2: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (63.MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (185.MB) 
 +  * word forms L2/R2 with non-lemmatized features: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (68.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (185.MB)
  
 ==== Neural word embeddings ==== ==== Neural word embeddings ====
Line 60: Line 89:
 Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package. Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package.
  
-  * word2vec: ''[[http://www.collocations.de/data/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB) +  * word2vec: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB) 
  
 ===== Web interfaces ===== ===== Web interfaces =====
  
-  * Web interface for several pre-trained [[http://clic.cimec.unitn.it/infomap-query/|Infomap models]] (CIMeC, U Trento) +  * Web interface for several pre-trained **[[http://clic.cimec.unitn.it/infomap-query/|Infomap models]]** (CIMeC, U Trento) 
- +  * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/word2vec/|word2vec embeddings]]** (FAU Erlangen-Nürnberg) 
-===== Some other off-the-shelf packages for DSM ===== +  * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/WP500/|DSMs based on Wikipedia]]** (FAU Erlangen-Nürnberg)
- +
-**Python** +
-  * [[https://radimrehurek.com/gensim/|Gensim]] – high-performance topic modelling +
-  * [[http://clic.cimec.unitn.it/composes/toolkit/|DISSECT]] – easy-to-use package developed by the COMPOSES project +
-  * [[https://pypi.org/project/Divisi/|Divisi]] – semantic networks, tensors & SVD +
- +
-**Java** +
-  * [[https://github.com/semanticvectors/semanticvectors/wiki|Semantic Vectors]] – scalable implementation based on random indexing +
-  * [[https://github.com/fozziethebeat/S-Space|S-Space]] package +
- +
-**C/C++** +
-  * [[http://infomap-nlp.sourceforge.net/|Infomap NLP]] – classical LSA-style DSM +
-  * [[http://www.psych.ualberta.ca/~westburylab/downloads/HiDEx.download.html|HiDEx]], the High-Dimensional Explorer +
-  * [[https://github.com/facebookresearch/fastText|FastText]] – state-of-the-art neural word embeddings +
- +
-**Perl** +
-  * [[http://senseclusters.sourceforge.net/|SenseClusters]] +
- +