Differences

This shows you the differences between two versions of the page.

course:material [2018/07/26 11:06]
schtepf [Online access (Web interfaces)]
course:material [2019/07/17 16:59] (current)
schtepf
Line 14: Line 14:
   - Install up-to-date versions of [[https://cran.r-project.org/banner.shtml|R]] and the [[https://www.rstudio.com/products/rstudio/download/#download|RStudio]] GUI
   - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:
-    * ''​sparsesvd''​ +    * ''​sparsesvd'' ​(v0.2) 
-    * ''​iotools''​ +    * ''​wordspace'' ​(v0.2-5) 
-    * ''tm'' (optional) +    * optional: ''tm'', ''quanteda'', ''Rtsne'', ''uwot'', ''wordcloud'', ''shiny'', ''corpustools'', ''spacyr'', ''udpipe'' (don't worry if some of these fail to install)
-    * ''​quanteda'' ​(optional) +
-    * ''​Rcpp'' ​(needed on Linux only) +
-  - Install the ''wordspace'' package itself.  It is available from CRAN through the standard installer, but you may be asked to use the latest version available here: +
-    * ''​wordspace'' ​v0.2-0: [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tar.gz|Source/​Linux]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tgz|MacOS]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.zip|Windows]] +
-    * download a suitable version ​of the package for your platform +
-    * in the RStudio installer, select “Install from: Package Archive File”+
   - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page:
     * ''wordspaceEval'' v0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required)
Line 28: Line 22:
   - Download the sample data files listed below
   - Download one or more of the pre-compiled DSMs listed below
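
The package installation steps in the list above can also be run from the R console instead of the RStudio installer dialog. The following is a minimal sketch under stated assumptions, not part of the original instructions: the helper name ''missing_packages'' is hypothetical, and the package names are copied from the revised list.

```r
# Required and optional packages, as named in the installation list
required <- c("sparsesvd", "wordspace")
optional <- c("tm", "quanteda", "Rtsne", "uwot", "wordcloud",
              "shiny", "corpustools", "spacyr", "udpipe")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Only contact CRAN in an interactive session
if (interactive()) {
  need <- missing_packages(required)
  if (length(need) > 0) install.packages(need)

  # Optional packages are installed one by one, so that a problem
  # with one of them does not stop the others
  for (pkg in missing_packages(optional)) {
    tryCatch(
      install.packages(pkg),
      error = function(e) message("skipping ", pkg, ": ", conditionMessage(e))
    )
  }
}
```

Installing the optional packages individually matches the note that some of them may fail to install: a failure is reported, and the loop moves on to the next package.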
 +
 +/* -- doesn'​t apply at the moment -- 
 +==== Getting the latest & greatest ====
 +
 +During the course, you may be asked to install a new version of ''​wordspace''​ that hasn't been submitted to CRAN yet.  In this case, please follow these instructions:​
 +
 +  - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: ​
 +    * ''​sparsesvd''​
 +    * ''​iotools''​
 +    * ''​Rcpp''​ (needed on Linux only)
 +  - Download an appropriate version of the package for your platform:
 +    * ''​wordspace''​ v0.2-0: [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tar.gz|Source/​Linux]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.tgz|MacOS]] – [[http://​wordspace.r-forge.r-project.org/​downloads/​wordspace_0.2-0.zip|Windows]]
 +  - In the RStudio installer, select “Install from: Package Archive File”
 +
 +You can also check the [[http://​wordspace.r-forge.r-project.org/​|wordspace homepage]] for new releases and installation instructions.
 +
 +*/
  
 ===== Example data sets =====
Line 39: Line 50:
 ===== Pre-compiled DSMs =====
  
-Pre-compiled DSMs for use with the ''​wordspace''​ package for R. Each model is contained in an ''​.rda''​ file, and can be loaded into R with the command ''​load("​model.rda"​)''​.+Pre-compiled DSMs for use with the ''​wordspace''​ package for R. Each model is contained in an ''​.rda''​ file, which can be loaded into R with the command ''​load("​model.rda"​)'' ​and creates an object with the same name (''​model''​).
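
How ''load()'' restores an object under its saved name can be tried out without downloading anything. The sketch below uses a toy matrix as a stand-in for a real model; the file is written to a temporary directory, and the object name ''model'' simply mirrors the example file name used above.

```r
# Toy stand-in for a pre-compiled model: a small matrix named `model`
model <- matrix(1:6, nrow = 2,
                dimnames = list(c("cat", "dog"), c("f1", "f2", "f3")))

# Save it to a temporary .rda file, then remove it from the workspace
rda_file <- file.path(tempdir(), "model.rda")
save(model, file = rda_file)
rm(model)

# load() restores the object under its saved name and
# returns the names of all restored objects as a character vector
loaded <- load(rda_file)
print(loaded)

# The restored object is available again under its original name
print(dim(model))
```

Capturing the return value of ''load()'' is a convenient way to find out which object a downloaded ''.rda'' file has created.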
  
 ==== DSMs based on the English Wikipedia ====
  
-These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. +These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized.
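
The effect of power scaling with $P = 0$ can be illustrated in base R. This is a sketch on a small random matrix, not the procedure used to build the models: the simple-ll scoring step is left out, the matrix size and the number of latent dimensions are arbitrary, and all variable names are chosen for illustration only.

```r
set.seed(42)

# Toy stand-in for a scored co-occurrence matrix (6 targets x 5 features)
M <- matrix(rnorm(30), nrow = 6,
            dimnames = list(paste0("w", 1:6), paste0("f", 1:5)))

# L2-normalize the row vectors before the SVD
M <- M / sqrt(rowSums(M^2))

k <- 3   # number of latent dimensions to keep (arbitrary here)
P <- 0   # Caron's power-scaling exponent
dec <- svd(M, nu = k, nv = 0)

# Latent coordinates U * Sigma^P; with P = 0 every dimension gets weight 1
latent <- sweep(dec$u, 2, dec$d[seq_len(k)]^P, "*")
rownames(latent) <- rownames(M)

# Each latent dimension now has the same (unit) length ...
dim_norms <- sqrt(colSums(latent^2))

# ... but the row vectors are in general no longer unit length
row_norms <- sqrt(rowSums(latent^2))

# Re-normalize the rows if unit-length vectors are needed downstream
latent_norm <- latent / row_norms
```

Because the reduced vectors are not re-normalized, a user who wants unit-length row vectors would need a final step like the last line of the sketch.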
  
- +  ​* dependency-filtered:​ ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]''​ (31.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]''​ (179.MB) 
-  ​* dependency-filtered:​ ''​[[http://​www.collocations.de/​data/​WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]''​ (30.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]''​ (175.MB) +  * dependency-structured:​ ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]''​ (31.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]''​ (180.MB) 
-  * dependency-structured:​ ''​[[http://​www.collocations.de/​data/​WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]''​ (30.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]''​ (176.MB) +  * L2/R2 surface span: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]''​ (51.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]''​ (177.MB) 
-  * L2/R2 surface span: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]''​ (50.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]''​ (173.MB) +  * L5/R5 surface span: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]''​ (103.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]''​ (179.MB) 
-  * L5/R5 surface span: ''​[[http://​www.collocations.de/​data/​WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]''​ (99.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]''​ (176.MB) +  * L30/R30 surface span: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]''​ (311.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]''​ (182.MB) 
-  * L30/R30 surface span: ''​[[http://​www.collocations.de/​data/​WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]''​ (295.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]''​ (179.MB) +  * term-document model: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]''​ (105.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]''​ (162.MB) 
-  * term-document model: ''​[[http://​www.collocations.de/​data/​WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]''​ (101.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]''​ (158.MB) +  * type contexts (L1+R1): ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]''​ (55.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]''​ (157.0 MB) 
-  * type contexts (L1+R1): ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]''​ (55.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]''​ (153.MB) +  * type contexts (L2+R2): ''​[[http://​corpora.linguistik.uni-erlangen.de/​data/​wordspace/​WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]''​ (33.1 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/​data/​wordspace/​WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]''​ (64.3 MB) 
-  * type contexts (L2+R2 POS tags): ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]''​ (55.1 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]''​ (172.MB) +  * type contexts (L2+R2 POS tags): ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]''​ (56.1 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]''​ (175.MB) 
-  * word forms L2/R2: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word.rda|WP500_Win2_Word.rda]]''​ (61.MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]''​ (182.MB) +  * word forms L2/R2: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word.rda|WP500_Win2_Word.rda]]''​ (63.MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]''​ (185.MB) 
-  * word forms L2/R2 with non-lemmatized features: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]''​ (65.9 MB) – 500 latent SVD dimensions: ''​[[http://​www.collocations.de/​data/​WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]''​ (182.MB)+  * word forms L2/R2 with non-lemmatized features: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]''​ (68.9 MB) – 500 latent SVD dimensions: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]''​ (185.MB)
  
 ==== Neural word embeddings ====
Line 61: Line 72:
 Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package.
  
-  * word2vec: ''​[[http://​www.collocations.de/​data/​GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]''​ (129.2 MiB) +  * word2vec: ''​[[http://​corpora.linguistik.uni-erlangen.de/data/wordspace/​GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]''​ (129.2 MiB) 
  
 ===== Web interfaces =====
  
-  * Web interface for several pre-trained [[http://​clic.cimec.unitn.it/​infomap-query/​|Infomap models]] (CIMeC, U Trento) +  * Web interface for several pre-trained ​**[[http://​clic.cimec.unitn.it/​infomap-query/​|Infomap models]]** (CIMeC, U Trento) 
- +  * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/word2vec/|word2vec embeddings]]** (FAU Erlangen-Nürnberg)
-===== Off-the-shelf packages for DSM ===== +  * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/WP500/|DSMs based on Wikipedia]]** (FAU Erlangen-Nürnberg)
- +
-  * [[http://​infomap-nlp.sourceforge.net/​|Infomap NLP]] +
-  ​* [[http://www.psych.ualberta.ca/​~westburylab/​downloads/​HiDEx.download.html|HiDEx]],​ the High-Dimensional Explorer +
-  * [[http://​code.google.com/p/semanticvectors|Semantic Vectors]] +
-  * [[http://​senseclusters.sourceforge.net/|SenseClusters]] +
-  * [[http://code.google.com/p/airhead-research/|S-Space Package]] (work in progress) +
-  * [[http://​code.google.com/​p/​wordspaces/​|Wordspaces]] (interactive exploration) +
-  ​* [[http://divisi.media.mit.edu/|Divisi]] (semantic networks, tensors & SVD in Python) +
- +
- +
- +
-===== Downloads ===== +
- +
-==== Data sets ==== +
- +
-  * Verb + object noun co-occurrences (tokens) extracted from the British National Corpus: [[http://www.collocations.de/data/​bnc_vobj_filtered.txt.gz|bnc_vobj_filtered.txt.gz]] (15 MB) +
- +
-  * A 5-million word corpus of Harry Potter fan fiction in //lemma//''_''//pos// format (pre-cleaned): [[http://www.collocations.de/data/potter_tokens.txt.gz|potter_tokens.txt.gz]] (8.9 MB) +
  
-  * **NEW:** DSM for 34,150 English nouns from 2-billion-word ukWaC corpus: [[http://​www.collocations.de/​data/​ukwac_vobj_S_svd.rda|ukwac_vobj_S_svd.rda]] (158 MB) 
-    * verb-object co-occurrences,​ features are 3,371 frequent verbs, log-scaled t-score, 300 SVD dimensions 
-    * nearest-neighbour demo with visualisation:​ [[http://​wordspace.collocations.de/​lib/​exe/​fetch.php/​course:​neighbour_demo.r|neighbour_demo.R]]