course:material [2018/07/26 11:00] schtepf [Pre-compiled DSMs] |
course:material [2018/08/06 12:21] schtepf [Software for the course] |
- Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: | - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: |
* ''sparsesvd'' | * ''sparsesvd'' |
* ''iotools'' | * ''wordspace'' |
* ''tm'' (optional) | * optional: ''tm'', ''quanteda'', ''Rtsne'', ''shiny'' |
* ''quanteda'' (optional) | |
* ''Rcpp'' (needed on Linux only) | |
- Install the ''wordspace'' package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here: | |
* ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]] | |
* download a suitable version of the package for your platform | |
* in the RStudio installer, select “Install from: Package Archive File” | |
- During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page: | - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page: |
* ''wordspaceEval'' v0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required) | * ''wordspaceEval'' v0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required) |
- Download the sample data files listed below | - Download the sample data files listed below |
- Download one or more of the pre-compiled DSMs listed below | - Download one or more of the pre-compiled DSMs listed below |
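For readers who prefer the R console to the GUI installer, the package check can be sketched as follows. The package names are taken from the newer revision of the list above; the helper ''missing_packages'' is a hypothetical name introduced here for illustration and is not part of the course material.

```r
# Console alternative to the GUI installer (a sketch, not the official course instructions).
required <- c("sparsesvd", "wordspace")
optional <- c("tm", "quanteda", "Rtsne", "shiny")

# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Report what still needs to be installed, without touching the network.
to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Still missing: ", paste(to_install, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Calling ''install.packages(to_install)'' afterwards would fetch the missing packages from CRAN; the same check can be repeated with ''optional''.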
| |
| /* |
| - Install the ''wordspace'' package itself. It is available from CRAN through the standard installer, but you may be asked to use the latest version available here: |
| * ''wordspace'' v0.2-0: [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tar.gz|Source/Linux]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.tgz|MacOS]] – [[http://wordspace.r-forge.r-project.org/downloads/wordspace_0.2-0.zip|Windows]] |
| * download a suitable version of the package for your platform |
| * in the RStudio installer, select “Install from: Package Archive File” |
| |
| - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: |
| * ''sparsesvd'' |
| * ''wordspace'' |
| * ''word |
| * ''tm'' (optional) |
| * ''quanteda'' (optional) |
| * ''Rcpp'' (needed on Linux only) |
| */ |
| |
===== Example data sets ===== | ===== Example data sets ===== |
===== Pre-compiled DSMs ===== | ===== Pre-compiled DSMs ===== |
| |
Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file, and can be loaded into R with the command ''load("model.rda")''. | Pre-compiled DSMs for use with the ''wordspace'' package for R. Each model is contained in an ''.rda'' file; loading it into R with the command ''load("model.rda")'' creates an object with the same name (''model''). |
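The behaviour of ''load()'' can be illustrated with a small self-contained sketch. The toy matrix below is only a stand-in for a real downloaded model (an assumption made so that the example runs without any download); with an actual file, only the ''load()'' call is needed.

```r
# Stand-in for a downloaded model file: save a small object named `model`.
model <- matrix(1:6, nrow = 2,
                dimnames = list(c("cat", "dog"), c("f1", "f2", "f3")))
rda_file <- file.path(tempdir(), "model.rda")
save(model, file = rda_file)
rm(model)  # remove the object so that load() has to restore it

# load() restores objects under their saved names and
# (invisibly) returns a character vector of those names.
loaded_names <- load(rda_file)
print(loaded_names)
print(dim(model))
```

Capturing the return value of ''load()'' is a convenient way to find out which object a downloaded ''.rda'' file has created.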
| |
==== DSMs based on the English Wikipedia ==== | ==== DSMs based on the English Wikipedia ==== |
| |
These models were compiled from ''WP500'', a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. | These models were compiled from ''WP500'', a 200-million-word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized. |
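To make the effect of Caron power scaling concrete, here is a minimal base-R sketch on a random toy matrix. It is not the pipeline used to build the models above: the matrix, its size, and the helper names ''l2_normalize'' and ''power_scale'' are assumptions for illustration only. Each latent dimension is weighted by its singular value raised to the power $P$, so $P = 1$ gives the plain SVD projection and $P = 0$ equalizes the dimensions.

```r
set.seed(42)
M <- matrix(rnorm(20 * 8), nrow = 20)  # toy stand-in for a scored DSM matrix

# L2-normalize the row vectors (hypothetical helper).
l2_normalize <- function(X) X / sqrt(rowSums(X^2))
M <- l2_normalize(M)

k <- 3                                  # number of latent dimensions to keep
dec <- svd(M, nu = k, nv = k)
sigma <- dec$d[1:k]

# Caron power scaling: weight latent dimension j by sigma_j^P (hypothetical helper).
power_scale <- function(U, sigma, P) sweep(U, 2, sigma^P, `*`)

reduced_p1 <- power_scale(dec$u, sigma, P = 1)  # plain SVD projection
reduced_p0 <- power_scale(dec$u, sigma, P = 0)  # equalized latent dimensions

# With P = 0, all latent dimensions have the same overall scale ...
print(sqrt(colSums(reduced_p0^2)))
# ... but the row vectors are generally not of unit length.
print(range(sqrt(rowSums(reduced_p0^2))))
```

Because the reduced vectors are not re-normalized, applying ''l2_normalize'' to them is a sensible extra step when unit-length rows are needed, for example when inner products are meant to be read as cosine similarities.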
| |
| |
* dependency-filtered: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (30.4 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (175.9 MB) | |
* dependency-structured: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (30.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (176.8 MB) | |
* L2/R2 surface span: ''[[http://www.collocations.de/data/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (50.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (173.7 MB) | |
* L5/R5 surface span: ''[[http://www.collocations.de/data/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (99.3 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (176.5 MB) | |
* L30/R30 surface span: ''[[http://www.collocations.de/data/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (295.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (179.5 MB) | |
* term-document model: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (101.3 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (158.7 MB) | |
* type contexts (L1+R1): ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (153.9 MB) | |
* type contexts (L2+R2 POS tags): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (55.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (172.2 MB) | |
* word forms L2/R2: ''[[http://www.collocations.de/data/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (61.6 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (182.0 MB) | |
* word forms L2/R2 with non-lemmatized features: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (65.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (182.5 MB) | |
| |
===== Online access (Web interfaces) ===== | |
| |
* Web interface for several pre-trained [[http://clic.cimec.unitn.it/infomap-query/|Infomap models]] (CIMeC, U Trento) | |
* Explore a [[http://www.cogsci.uni-osnabrueck.de/~korpora/ws/cgi-bin/HIT/LSA_NN.perl|German LSA space]] (CogSci, U Osnabrück) | |
| |
===== Off-the-shelf packages for DSM ===== | |
| |
* [[http://infomap-nlp.sourceforge.net/|Infomap NLP]] | |
* [[http://www.psych.ualberta.ca/~westburylab/downloads/HiDEx.download.html|HiDEx]], the High-Dimensional Explorer | |
* [[http://code.google.com/p/semanticvectors|Semantic Vectors]] | |
* [[http://senseclusters.sourceforge.net/|SenseClusters]] | |
* [[http://code.google.com/p/airhead-research/|S-Space Package]] (work in progress) | |
* [[http://code.google.com/p/wordspaces/|Wordspaces]] (interactive exploration) | |
* [[http://divisi.media.mit.edu/|Divisi]] (semantic networks, tensors & SVD in Python) | |
| |
| * dependency-filtered: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (31.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (179.3 MB) |
| * dependency-structured: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (31.6 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (180.3 MB) |
| * L2/R2 surface span: ''[[http://www.collocations.de/data/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (51.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (177.1 MB) |
| * L5/R5 surface span: ''[[http://www.collocations.de/data/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (103.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (179.9 MB) |
| * L30/R30 surface span: ''[[http://www.collocations.de/data/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (311.4 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (182.8 MB) |
| * term-document model: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (105.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (162.5 MB) |
| * type contexts (L1+R1): ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (157.0 MB) |
| * type contexts (L2+R2): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]'' (33.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]'' (64.3 MB) |
| * type contexts (L2+R2 POS tags): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (56.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (175.3 MB) |
| * word forms L2/R2: ''[[http://www.collocations.de/data/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (63.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (185.5 MB) |
| * word forms L2/R2 with non-lemmatized features: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (68.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (185.9 MB) |
| |
| ==== Neural word embeddings ==== |
| |
===== Downloads ===== | Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package. |
| |
==== Data sets ==== | * word2vec: ''[[http://www.collocations.de/data/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB) |
| |
* Verb + object noun co-occurrences (tokens) extracted from the British National Corpus: [[http://www.collocations.de/data/bnc_vobj_filtered.txt.gz|bnc_vobj_filtered.txt.gz]] (15 MB) | ===== Web interfaces ===== |
| |
* A 5-million word corpus of Harry Potter fan fiction in //lemma//''_''//pos// format (pre-cleaned): [[http://www.collocations.de/data/potter_tokens.txt.gz|potter_tokens.txt.gz]] (8.9 MB) | * Web interface for several pre-trained **[[http://clic.cimec.unitn.it/infomap-query/|Infomap models]]** (CIMeC, U Trento) |
| * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/word2vec/|word2vec embeddings]]** (FAU Erlangen-Nürnberg) |
| * Explore **[[https://corpora.linguistik.uni-erlangen.de/shiny/wordspace/WP500/|DSMs based on Wikipedia]]** (FAU Erlangen-Nürnberg) |
| |
* **NEW:** DSM for 34,150 English nouns from the 2-billion-word ukWaC corpus: [[http://www.collocations.de/data/ukwac_vobj_S_svd.rda|ukwac_vobj_S_svd.rda]] (158 MB) | |
* verb-object co-occurrences, features are 3,371 frequent verbs, log-scaled t-score, 300 SVD dimensions | |
* nearest-neighbour demo with visualisation: [[http://wordspace.collocations.de/lib/exe/fetch.php/course:neighbour_demo.r|neighbour_demo.R]] | |