Differences

This shows you the differences between two versions of the page.

course:material [2018/08/02 09:38]
schtepf [Pre-compiled DSMs]
course:material [2019/07/17 16:59] (current)
schtepf
Line 14: Line 14:
   - Install up-to-date versions of [[https://​cran.r-project.org/​banner.shtml|R]] and the [[https://​www.rstudio.com/​products/​rstudio/​download/#​download|RStudio]] GUI   - Install up-to-date versions of [[https://​cran.r-project.org/​banner.shtml|R]] and the [[https://​www.rstudio.com/​products/​rstudio/​download/#​download|RStudio]] GUI
   - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: ​   - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: ​
-    * ''​sparsesvd''​ +    * ''​sparsesvd'' ​(v0.2) 
-    * ''​iotools''​ +    * ''​wordspace'' ​(v0.2-5) 
-    * ''tm'' (optional) +    * optional: ''tm'', ''quanteda'', ''Rtsne'', ''uwot'', ''wordcloud'', ''shiny'', ''corpustools'', ''spacyr'', ''udpipe'' (don't worry if some of these fail to install)
-    * ''​quanteda'' ​(optional) +
-    * ''​Rcpp'' ​(needed on Linux only) +
-  - Install the ''wordspace'' package itself.  It is available from CRAN through the standard installer, but you may be asked to use the latest version available here: +
-    * ''​wordspace'' ​v0.2-0: [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tar.gz|Source/​Linux]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tgz|MacOS]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.zip|Windows]] +
-    * download a suitable version ​of the package for your platform +
-    * in the RStudio installer, select “Install from: Package Archive File”+
   - During the course, you will be asked to install a further package with additional evaluation tasks (''​wordspaceEval''​) from a password-protected Web page:   - During the course, you will be asked to install a further package with additional evaluation tasks (''​wordspaceEval''​) from a password-protected Web page:
     * ''​wordspaceEval''​ v0.1: [[http://​www.collocations.de/​data/​protected/​wordspaceEval_0.1.tar.gz|Source/​Linux]] – [[http://​www.collocations.de/​data/​protected/​wordspaceEval_0.1.tgz|MacOS]] – [[http://​www.collocations.de/​data/​protected/​wordspaceEval_0.1.zip|Windows]] (login required)     * ''​wordspaceEval''​ v0.1: [[http://​www.collocations.de/​data/​protected/​wordspaceEval_0.1.tar.gz|Source/​Linux]] – [[http://​www.collocations.de/​data/​protected/​wordspaceEval_0.1.tgz|MacOS]] – [[http://​www.collocations.de/​data/​protected/​wordspaceEval_0.1.zip|Windows]] (login required)
Line 28: Line 22:
   - Download the sample data files listed below   - Download the sample data files listed below
   - Download one or more of the pre-compiled DSMs listed below   - Download one or more of the pre-compiled DSMs listed below
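
For those who prefer the R console to the graphical installer, the package installation steps can be sketched roughly as follows. This is a minimal sketch, not part of the original instructions: the helper names ''missing_packages'' and ''install_course_packages'' are hypothetical, and the CRAN mirror URL is an assumption that you may want to change. By default the function only reports what is missing and installs nothing.

```r
# Package names taken from the newer revision of the list above
required <- c("sparsesvd", "wordspace")
optional <- c("tm", "quanteda", "Rtsne", "uwot", "wordcloud",
              "shiny", "corpustools", "spacyr", "udpipe")

# Hypothetical helper: which of the given packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install whatever is missing.
# dry_run = TRUE (the default) only reports, it does not install anything.
# The mirror URL is an assumption; any CRAN mirror should work.
install_course_packages <- function(dry_run = TRUE,
                                    repos = "https://cloud.r-project.org") {
  todo_required <- missing_packages(required)
  todo_optional <- missing_packages(optional)
  if (!dry_run) {
    if (length(todo_required) > 0) {
      install.packages(todo_required, repos = repos)
    }
    # Optional packages are installed one at a time, so that a failure
    # in one of them does not prevent the others from being installed
    for (pkg in todo_optional) {
      try(install.packages(pkg, repos = repos), silent = TRUE)
    }
  }
  invisible(list(required = todo_required, optional = todo_optional))
}

# Report what is still missing without installing anything
status <- install_course_packages(dry_run = TRUE)
print(status)
```

Calling ''install_course_packages(dry_run = FALSE)'' would then attempt the actual installation. The password-protected ''wordspaceEval'' package is deliberately left out of this sketch, since it is not available from CRAN.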
 +
 +/* -- doesn'​t apply at the moment -- 
 +==== Getting the latest & greatest ====
 +
 +During the course, you may be asked to install a new version of ''​wordspace''​ that hasn't been submitted to CRAN yet.  In this case, please follow these instructions:​
 +
 +  - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: ​
 +    * ''​sparsesvd''​
 +    * ''​iotools''​
 +    * ''​Rcpp''​ (needed on Linux only)
 +  - Download an appropriate version of the package for your platform:
 +    * ''​wordspace''​ v0.2-0: [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tar.gz|Source/​Linux]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tgz|MacOS]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.zip|Windows]]
 +  - In the RStudio installer, select “Install from: Package Archive File”
 +
 +You can also check the [[http://​wordspace.r-forge.r-project.org/​|wordspace homepage]] for new releases and installation instructions.
 +
 +*/
  
 ===== Example data sets ===== ===== Example data sets =====
Line 45: Line 56:
 These models were compiled from ''​WP500'',​ a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized. ​ These models were compiled from ''​WP500'',​ a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized. ​
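
To illustrate what power scaling with $P = 0$ means, here is a toy sketch in base R. It is not the pipeline used to build the models above: the random matrix merely stands in for a scored co-occurrence matrix (the simple-ll scoring and log transformation are not reproduced), and the function name ''power_scaled_svd'' is hypothetical. The sketch assumes that power scaling multiplies each latent dimension by its singular value raised to the power $P$.

```r
set.seed(42)

# Toy stand-in for a scored co-occurrence matrix (20 targets x 8 features)
M <- matrix(rnorm(20 * 8), nrow = 20)

# L2-normalize the row vectors
M <- M / sqrt(rowSums(M^2))

# Hypothetical helper: project into n latent SVD dimensions and scale
# each dimension by its singular value raised to the power P
power_scaled_svd <- function(M, n, P = 1) {
  dec <- svd(M, nu = n, nv = 0)
  sweep(dec$u, 2, dec$d[seq_len(n)]^P, `*`)
}

X1 <- power_scaled_svd(M, n = 3, P = 1)  # ordinary SVD projection
X0 <- power_scaled_svd(M, n = 3, P = 0)  # equalized latent dimensions

# With P = 0 every latent dimension has the same Euclidean norm ...
col_norms <- sqrt(colSums(X0^2))
print(col_norms)

# ... but the reduced row vectors are not unit length unless re-normalized
row_norms <- sqrt(rowSums(X0^2))
print(range(row_norms))
```

In this sketch the columns of ''X0'' all have norm 1, whereas the columns of ''X1'' have norms equal to the corresponding singular values. The row norms of ''X0'' differ from 1, which is why the note about missing re-normalization matters if you rely on Euclidean distances between the reduced vectors.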
  
-  * dependency-filtered:​ ''​[[http://​www.collocations.de/​data/​WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]''​ (31.1 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]''​ (179.3 MB) +  * dependency-filtered:​ ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]''​ (31.1 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]''​ (179.3 MB) 
-  * dependency-structured:​ ''​[[http://​www.collocations.de/​data/​WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]''​ (31.6 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]''​ (180.3 MB) +  * dependency-structured:​ ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]''​ (31.6 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]''​ (180.3 MB) 
-  * L2/R2 surface span: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]''​ (51.8 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]''​ (177.1 MB) +  * L2/R2 surface span: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]''​ (51.8 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]''​ (177.1 MB) 
-  * L5/R5 surface span: ''​[[http://​www.collocations.de/​data/​WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]''​ (103.9 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]''​ (179.9 MB) +  * L5/R5 surface span: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]''​ (103.9 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]''​ (179.9 MB) 
-  * L30/R30 surface span: ''​[[http://​www.collocations.de/​data/​WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]''​ (311.4 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]''​ (182.8 MB) +  * L30/R30 surface span: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]''​ (311.4 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]''​ (182.8 MB) 
-  * term-document model: ''​[[http://​www.collocations.de/​data/​WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]''​ (105.1 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]''​ (162.5 MB) +  * term-document model: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]''​ (105.1 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]''​ (162.5 MB) 
-  * type contexts (L1+R1): ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]''​ (55.8 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]''​ (157.0 MB) +  * type contexts (L1+R1): ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]''​ (55.8 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]''​ (157.0 MB) 
-  * type contexts (L2+R2): ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]''​ (33.1 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]''​ (64.3 MB) +  * type contexts (L2+R2): ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]''​ (33.1 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]''​ (64.3 MB) 
-  * type contexts (L2+R2 POS tags): ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]''​ (56.1 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]''​ (175.3 MB) +  * type contexts (L2+R2 POS tags): ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]''​ (56.1 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]''​ (175.3 MB) 
-  * word forms L2/R2: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word.rda|WP500_Win2_Word.rda]]''​ (63.9 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]''​ (185.5 MB) +  * word forms L2/R2: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word.rda|WP500_Win2_Word.rda]]''​ (63.9 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]''​ (185.5 MB) 
-  * word forms L2/R2 with non-lemmatized features: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]''​ (68.9 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]''​ (185.9 MB)+  * word forms L2/R2 with non-lemmatized features: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]''​ (68.9 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]''​ (185.9 MB)
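
Once downloaded, an ''.rda'' file can be read into R with ''load()''. The sketch below is a self-contained illustration that uses a temporary toy file instead of one of the models above; the object name ''toy_model'' is hypothetical. Loading into a separate environment is one way to find out which object names a file contains without overwriting anything in your workspace.

```r
# Create a small toy .rda file so that the example runs without a download
toy_model <- matrix(1:6, nrow = 2,
                    dimnames = list(c("cat", "dog"), c("d1", "d2", "d3")))
rda_file <- tempfile(fileext = ".rda")
save(toy_model, file = rda_file)

# Load the file into a fresh environment; load() returns the names
# of the objects that were restored
env <- new.env()
loaded_names <- load(rda_file, envir = env)
print(loaded_names)

# Retrieve the first object by name and inspect it
model <- get(loaded_names[1], envir = env)
print(dim(model))
```

For one of the pre-compiled models, you would replace ''rda_file'' with the path of the downloaded file, e.g. ''WP500_DepFilter_Lemma_svd500.rda'' in your working directory.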
  
 ==== Neural word embeddings ==== ==== Neural word embeddings ====
Line 61: Line 72:
 Some publicly available pre-trained neural embeddings, converted into ''​.rda''​ format for use with the ''​wordspace''​ package. Some publicly available pre-trained neural embeddings, converted into ''​.rda''​ format for use with the ''​wordspace''​ package.
  
-  * word2vec: ''​[[http://​www.collocations.de/​data/​GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]''​ (129.2 MiB) +  * word2vec: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]''​ (129.2 MiB) 
  
 ===== Web interfaces ===== ===== Web interfaces =====
  
-  * Web interface for several pre-trained [[http://clic.cimec.unitn.it/infomap-query/|Infomap models]] (CIMeC, U Trento) +  * Web interface for several pre-trained **[[http://clic.cimec.unitn.it/infomap-query/|Infomap models]]** (CIMeC, U Trento)
 +  * Explore **[[https://​corpora.linguistik.uni-erlangen.de/​shiny/​wordspace/​word2vec/​|word2vec embeddings]]** (FAU Erlangen-Nürnberg) 
 +  * Explore **[[https://​corpora.linguistik.uni-erlangen.de/​shiny/​wordspace/​WP500/​|DSMs based on Wikipedia]]** (FAU Erlangen-Nürnberg)