Differences

course:material [2018/08/06 12:21]
schtepf [Software for the course]
course:material [2019/05/17 09:49]
schtepf [Software for the course]
Line 16:
     * ''sparsesvd''
     * ''wordspace''
-    * optional: ''tm'', ''quanteda'', ''Rtsne'', ''shiny''
+    * optional: ''tm'', ''quanteda'', ''Rtsne'', ''uwot'', ''wordcloud'', ''shiny'', ''corpustools'', ''spacyr'', ''udpipe'' (don't worry if some of these fail to install)
   - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page:
     * ''wordspaceEval'' v0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required)
Line 23:
   - Download one or more of the pre-compiled DSMs listed below
 
-  - Install the ''wordspace'' package itself.  It is available from CRAN through the standard installer, but you may be asked to use the latest version available here:
-    * ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]]
-    * download a suitable version of the package for your platform
-    * in the RStudio installer, select “Install from: Package Archive File”
+==== Getting the latest & greatest ====
+
+During the course, you may be asked to install a new version of ''wordspace'' that hasn't been submitted to CRAN yet.  In this case, please follow these instructions:
 
   - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
     * ''sparsesvd''
-    * ''wordspace''
-    * ''word
-    * ''tm'' (optional)
-    * ''quanteda'' (optional)
+    * ''iotools''
     * ''Rcpp'' (needed on Linux only)
+  - Download an appropriate version of the package for your platform:
+    * ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]]
+  - In the RStudio installer, select “Install from: Package Archive File”
 
+You can also check the [[http://wordspace.r-forge.r-project.org/|wordspace homepage]] for new releases and installation instructions.
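
For readers who prefer the R console to the RStudio installer, the same steps can be sketched in code. This is only an illustration: the helper names ''wordspace_archive'' and ''install_wordspace_archive'' are made up here, and the sketch assumes the archive for your platform has already been downloaded into the working directory.

```r
# Sketch only: helper names are illustrative, not part of wordspace.

# Pick the archive file name matching the current platform.
wordspace_archive <- function(sysname = Sys.info()[["sysname"]], version = "0.2-0") {
  ext <- switch(sysname, Windows = ".zip", Darwin = ".tgz", ".tar.gz")
  paste0("wordspace_", version, ext)
}

# Install the CRAN prerequisites, then the downloaded archive from a local file.
install_wordspace_archive <- function(file = wordspace_archive()) {
  deps <- c("sparsesvd", "iotools")
  if (Sys.info()[["sysname"]] == "Linux") deps <- c(deps, "Rcpp")
  install.packages(deps)
  # repos = NULL makes install.packages() treat `file` as a local archive
  if (grepl("\\.tar\\.gz$", file)) {
    install.packages(file, repos = NULL, type = "source")
  } else {
    install.packages(file, repos = NULL)
  }
}
```

Calling ''install_wordspace_archive()'' without arguments should pick the file name for the current platform; pass an explicit path if the archive was saved elsewhere.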
  
 ===== Example data sets =====
Line 53:
 These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
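
To make the power-scaling step concrete, here is a minimal sketch in base R. It is not the procedure used to build the models (those rely on sparse SVD over much larger matrices); the function name ''power_scale_svd'' and the toy matrix are assumptions for illustration. The reduced vectors are taken as $U \Sigma^P$, so $P = 0$ drops the singular values and gives every latent dimension the same scale.

```r
# Sketch only: Caron-style power scaling of an SVD projection, using base R.
# M is a small dense score matrix, k the number of latent dimensions.
power_scale_svd <- function(M, k, P = 0) {
  s <- svd(M, nu = k, nv = 0)                 # first k left singular vectors
  s$u %*% diag(s$d[seq_len(k)]^P, nrow = k)   # rows are reduced vectors U * Sigma^P
}

set.seed(42)
M <- matrix(rnorm(40), nrow = 8)              # toy data, not a real DSM
X0 <- power_scale_svd(M, k = 3, P = 0)        # equalized latent dimensions
X1 <- power_scale_svd(M, k = 3, P = 1)        # ordinary SVD projection
```

With ''P = 0'' the columns of ''X0'' are orthonormal, whereas the column norms of ''X1'' equal the first three singular values. The ''wordspace'' function ''dsm.projection()'' appears to offer the same scaling through its ''power'' argument; check its help page before relying on that.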
  
-  * dependency-filtered: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (31.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (179.3 MB)
-  * dependency-structured: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (31.6 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (180.3 MB)
-  * L2/R2 surface span: ''[[http://www.collocations.de/data/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (51.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (177.1 MB)
-  * L5/R5 surface span: ''[[http://www.collocations.de/data/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (103.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (179.9 MB)
-  * L30/R30 surface span: ''[[http://www.collocations.de/data/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (311.4 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (182.8 MB)
-  * term-document model: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (105.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (162.5 MB)
-  * type contexts (L1+R1): ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (157.0 MB)
-  * type contexts (L2+R2): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]'' (33.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]'' (64.3 MB)
-  * type contexts (L2+R2 POS tags): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (56.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (175.3 MB)
-  * word forms L2/R2: ''[[http://www.collocations.de/data/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (63.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (185.5 MB)
-  * word forms L2/R2 with non-lemmatized features: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (68.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (185.9 MB)
+  * dependency-filtered: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (31.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (179.3 MB)
+  * dependency-structured: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (31.6 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (180.3 MB)
+  * L2/R2 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (51.8 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (177.1 MB)
+  * L5/R5 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (103.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (179.9 MB)
+  * L30/R30 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (311.4 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (182.8 MB)
+  * term-document model: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (105.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (162.5 MB)
+  * type contexts (L1+R1): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.8 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (157.0 MB)
+  * type contexts (L2+R2): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]'' (33.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]'' (64.3 MB)
+  * type contexts (L2+R2 POS tags): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (56.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (175.3 MB)
+  * word forms L2/R2: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (63.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (185.5 MB)
+  * word forms L2/R2 with non-lemmatized features: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (68.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (185.9 MB)
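
The files listed above can also be fetched and loaded from within R. The following is a rough sketch: the helper names ''dsm_url'' and ''fetch_dsm'' are hypothetical, the base URL is the one used in the new links, and the code makes no assumption about the object name stored inside each ''.rda'' file (it simply returns the first object that ''load()'' restores).

```r
# Sketch only: helper names are illustrative, not part of wordspace.
base_url <- "http://corpora.linguistik.uni-erlangen.de/data/wordspace/"

# Build the download URL for a model name such as "WP500_DepFilter_Lemma".
dsm_url <- function(model, svd = FALSE) {
  paste0(base_url, model, if (svd) "_svd500" else "", ".rda")
}

# Download the file (unless already present) and return the first object in it.
fetch_dsm <- function(model, svd = FALSE, dir = ".") {
  url <- dsm_url(model, svd)
  dest <- file.path(dir, basename(url))
  if (!file.exists(dest)) download.file(url, dest, mode = "wb")
  env <- new.env()
  objs <- load(dest, envir = env)   # load() returns the names of restored objects
  get(objs[1], envir = env)
}
```

For example, ''M <- fetch_dsm("WP500_DepFilter_Lemma", svd = TRUE)'' would download the 179.3 MB file on first use. The result can then be passed to ''nearest.neighbours()'' from ''wordspace''; inspect ''rownames(M)'' first to see how the target terms are spelled.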
  
 ==== Neural word embeddings ====
Line 69:
 Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package.
 
-  * word2vec: ''[[http://www.collocations.de/data/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB)
+  * word2vec: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB)
 
 ===== Web interfaces =====