course:material [2018/08/06 12:26] schtepf [Software for the course] |
course:material [2022/08/07 18:46] schtepf [Software for the course] |
====== Courses and Tutorials on DSM ====== | ====== Courses and Tutorials on DSM ====== |
| |
[[course:esslli2009:start|ESSLLI '09]] – | [[course:esslli2009:start|ESSLLI 2009]] – |
[[course:acl2010:start|NAACL-HLT 2010]] – | [[course:acl2010:start|NAACL-HLT 2010]] – |
[[course:esslli2018:start|ESSLLI '16 & '18]] – | [[course:esslli2018:start|ESSLLI '16 & '18]] – |
| [[course:esslli2021:start|ESSLLI 2021]] – |
**Software & data sets** – | **Software & data sets** – |
[[course:bibliography|Bibliography]] | [[course:bibliography|Bibliography]] |
Practical examples and exercises for these courses and tutorials are based on the user-friendly software package [[http://wordspace.r-forge.r-project.org/|wordspace]] for the interactive statistical computing environment [[http://www.r-project.org/|R]]. If you want to follow along, please bring your own laptop and set up the required software as follows: | Practical examples and exercises for these courses and tutorials are based on the user-friendly software package [[http://wordspace.r-forge.r-project.org/|wordspace]] for the interactive statistical computing environment [[http://www.r-project.org/|R]]. If you want to follow along, please bring your own laptop and set up the required software as follows: |
| |
- Install up-to-date versions of [[https://cran.r-project.org/banner.shtml|R]] and the [[https://www.rstudio.com/products/rstudio/download/#download|RStudio]] GUI | - Install up-to-date versions of [[https://cran.r-project.org/banner.shtml|R]] (4.0 or newer) and the [[https://www.rstudio.com/products/rstudio/download/#download|RStudio]] GUI |
- Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: | - Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive: |
* ''sparsesvd'' | * ''sparsesvd'' (v0.2) |
* ''wordspace'' | * ''wordspace'' (v0.2-6) |
* optional: ''tm'', ''quanteda'', ''Rtsne'', ''shiny'' | * recommended: ''e1071'', ''rsparse'', ''Rtsne'', ''uwot'' |
| * optional: ''tm'', ''quanteda'', ''data.table'', ''wordcloud'', ''shiny'', ''spacyr'', ''udpipe'', ''coreNLP'' (don't worry if some of these fail to install) |
| * optional: ''NMF'' (also install ''BiocManager'', then run command ''BiocManager::install("Biobase")'') |
- During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page: | - During the course, you will be asked to install a further package with additional evaluation tasks (''wordspaceEval'') from a password-protected Web page: |
* ''wordspaceEval'' v0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required) | * ''wordspaceEval'' v0.2: [[http://www.collocations.de/data/protected/wordspaceEval_0.2.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.2.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.2.zip|Windows]] (login required) |
| * if you are stuck with R v3.x, please use the older package version 0.1: [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tar.gz|Source/Linux]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.tgz|MacOS]] – [[http://www.collocations.de/data/protected/wordspaceEval_0.1.zip|Windows]] (login required) |
* download a suitable version and select “Install from: Package Archive File” in RStudio | * download a suitable version and select “Install from: Package Archive File” in RStudio |
- Download the sample data files listed below | - Download the sample data files listed below |
- Download one or more of the pre-compiled DSMs listed below | - Download one or more of the pre-compiled DSMs listed below |
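The package installation step in the list above can also be carried out from the R console instead of the graphical installer. The following is a minimal sketch, not part of the original instructions: the helper ''missing_packages()'' is a hypothetical name introduced here, and the package lists are copied from the newer revision of the page.

```r
# Packages named in the setup instructions (newer revision of the page).
required    <- c("sparsesvd", "wordspace")
recommended <- c("e1071", "rsparse", "Rtsne", "uwot")

# Hypothetical helper: return those names in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(c(required, recommended))
print(to_install)

# Uncomment to install from CRAN (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that the snippet can be run safely to see what is missing before anything is downloaded.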
| |
| ===== Scaling R to large data sets ===== |
| |
| Most of our hands-on examples work reasonably well in a standard R installation, even on a moderately powerful laptop computer. |
| However, if you intend to work on real-life tasks and process large DSMs, it is important to enable multi-threaded computation |
| in R. Since DSMs build on matrix operations, a multi-threaded linear algebra library (“BLAS”) is key. |
| |
| - In Linux, it should be sufficient to install the OpenBLAS package, e.g. in Ubuntu: ''sudo apt install libopenblas-dev'' |
| - In MacOS, follow [[https://groups.google.com/g/r-sig-mac/c/YN6uNYCIZK0|these instructions]] to enable the VecLib BLAS built into MacOS. You may also want to [[https://mac.r-project.org/openmp/|enable OpenMP]] for an additional speed boost on expensive distance metrics (but this is less important). |
| - In Windows, you can try installing [[https://mran.microsoft.com/open|Microsoft R Open]] or do a Web search for alternative solutions. |
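To check whether a faster BLAS is actually in use, it can help to ask R which libraries it is linked against and to time a dense matrix product before and after the change. This is only a rough sketch: ''time_matmul()'' is a hypothetical helper, the matrix size is arbitrary, and the reported paths differ between platforms (the BLAS entry may be empty on some systems).

```r
# Report which BLAS / LAPACK libraries this R session is linked against.
# extSoftVersion() and La_library() are part of base R.
blas_path   <- unname(extSoftVersion()["BLAS"])
lapack_path <- La_library()
cat("BLAS:  ", blas_path, "\n")
cat("LAPACK:", lapack_path, "\n")

# Hypothetical helper: elapsed seconds for one dense n x n matrix product.
time_matmul <- function(n = 1000) {
  A <- matrix(rnorm(n * n), n, n)
  unname(system.time(A %*% A)["elapsed"])
}

cat("elapsed seconds:", time_matmul(1000), "\n")
```

With a multi-threaded BLAS the elapsed time should drop noticeably compared to the reference BLAS shipped with R, although the exact speed-up depends on the machine.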
| |
| |
| <!-- doesn't apply at the moment -- |
| |
==== Getting the latest & greatest ==== | ==== Getting the latest & greatest ==== |
| |
You can also check the [[http://wordspace.r-forge.r-project.org/|wordspace homepage]] for new releases and installation instructions. | You can also check the [[http://wordspace.r-forge.r-project.org/|wordspace homepage]] for new releases and installation instructions. |
| |
| --> |
| |
===== Example data sets ===== | ===== Example data sets ===== |
* ''[[http://www.collocations.de/data/potter_l2r2.txt.gz|potter_l2r2.txt.gz]]'' (51.3 MB) | * ''[[http://www.collocations.de/data/potter_l2r2.txt.gz|potter_l2r2.txt.gz]]'' (51.3 MB) |
* ''[[http://www.collocations.de/data/potter_lemmas.txt.gz|potter_lemmas.txt.gz]]'' (1.1 MB) | * ''[[http://www.collocations.de/data/potter_lemmas.txt.gz|potter_lemmas.txt.gz]]'' (1.1 MB) |
| * ''[[http://www.collocations.de/data/VSS.txt|VSS.txt]]'' (37 kB) |
| |
===== Pre-compiled DSMs ===== | ===== Pre-compiled DSMs ===== |
These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized. | These models were compiled from ''WP500'', a 200-million word subset of the Wackypedia corpus comprising the first 500 words of each article. Each model covers a vocabulary of the 50,000 most frequent content words (lemmatized) in the corpus and has at least 50,000 feature dimensions. The latent SVD dimensions are based on log-transformed sparse simple-ll scores with L2-normalization. Power scaling with Caron $P = 0$ (i.e. equalization of the latent dimensions) has been applied, but the reduced vectors are not re-normalized. |
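To illustrate what power scaling with $P = 0$ means, here is a small sketch using base R's ''svd()'' on a random toy matrix. It does not use the actual WP500 data and is not the code that built the models. Given the decomposition $M = U \Sigma V^T$, the latent vectors are taken as the first $k$ columns of $U \Sigma^P$; the function name ''latent_vectors()'' is made up for this example.

```r
set.seed(42)
M <- matrix(rnorm(20 * 8), nrow = 20, ncol = 8)  # toy matrix: 20 "words" x 8 features

# Hypothetical helper: first k latent dimensions, singular values raised to power P.
latent_vectors <- function(M, k, P = 1) {
  s <- svd(M, nu = k, nv = 0)
  s$u %*% diag(s$d[1:k]^P, nrow = k)
}

Z1 <- latent_vectors(M, k = 3, P = 1)  # standard SVD projection
Z0 <- latent_vectors(M, k = 3, P = 0)  # equalized latent dimensions

# Sum of squares per latent dimension: decreasing for P = 1, all equal for P = 0.
print(colSums(Z1^2))
print(colSums(Z0^2))

# Row vectors are generally not of unit length after the projection.
print(range(sqrt(rowSums(Z0^2))))
```

In ''wordspace'' itself this setting appears to correspond to the ''power'' argument of ''dsm.projection()''; please check the package documentation before relying on that.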
| |
* dependency-filtered: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (31.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (179.3 MB) | * dependency-filtered: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma.rda|WP500_DepFilter_Lemma.rda]]'' (31.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepFilter_Lemma_svd500.rda|WP500_DepFilter_Lemma_svd500.rda]]'' (179.3 MB) |
* dependency-structured: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (31.6 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (180.3 MB) | * dependency-structured: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma.rda|WP500_DepStruct_Lemma.rda]]'' (31.6 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_DepStruct_Lemma_svd500.rda|WP500_DepStruct_Lemma_svd500.rda]]'' (180.3 MB) |
* L2/R2 surface span: ''[[http://www.collocations.de/data/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (51.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (177.1 MB) | * L2/R2 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma.rda|WP500_Win2_Lemma.rda]]'' (51.8 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Lemma_svd500.rda|WP500_Win2_Lemma_svd500.rda]]'' (177.1 MB) |
* L5/R5 surface span: ''[[http://www.collocations.de/data/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (103.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (179.9 MB) | * L5/R5 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma.rda|WP500_Win5_Lemma.rda]]'' (103.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win5_Lemma_svd500.rda|WP500_Win5_Lemma_svd500.rda]]'' (179.9 MB) |
* L30/R30 surface span: ''[[http://www.collocations.de/data/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (311.4 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (182.8 MB) | * L30/R30 surface span: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma.rda|WP500_Win30_Lemma.rda]]'' (311.4 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win30_Lemma_svd500.rda|WP500_Win30_Lemma_svd500.rda]]'' (182.8 MB) |
* term-document model: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (105.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (162.5 MB) | * term-document model: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma.rda|WP500_TermDoc_Lemma.rda]]'' (105.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_TermDoc_Lemma_svd500.rda|WP500_TermDoc_Lemma_svd500.rda]]'' (162.5 MB) |
* type contexts (L1+R1): ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.8 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (157.0 MB) | * type contexts (L1+R1): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma.rda|WP500_Ctype_L1R1_Lemma.rda]]'' (55.8 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L1R1_Lemma_svd500.rda|WP500_Ctype_L1R1_Lemma_svd500.rda]]'' (157.0 MB) |
* type contexts (L2+R2): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]'' (33.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]'' (64.3 MB) | * type contexts (L2+R2): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma.rda|WP500_Ctype_L2R2_Lemma.rda]]'' (33.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2_Lemma_svd500.rda|WP500_Ctype_L2R2_Lemma_svd500.rda]]'' (64.3 MB) |
* type contexts (L2+R2 POS tags): ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (56.1 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (175.3 MB) | * type contexts (L2+R2 POS tags): ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma.rda|WP500_Ctype_L2R2pos_Lemma.rda]]'' (56.1 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Ctype_L2R2pos_Lemma_svd500.rda|WP500_Ctype_L2R2pos_Lemma_svd500.rda]]'' (175.3 MB) |
* word forms L2/R2: ''[[http://www.collocations.de/data/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (63.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (185.5 MB) | * word forms L2/R2: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word.rda|WP500_Win2_Word.rda]]'' (63.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_svd500.rda|WP500_Win2_Word_svd500.rda]]'' (185.5 MB) |
* word forms L2/R2 with non-lemmatized features: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (68.9 MB) – 500 latent SVD dimensions: ''[[http://www.collocations.de/data/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (185.9 MB) | * word forms L2/R2 with non-lemmatized features: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF.rda|WP500_Win2_Word_WF.rda]]'' (68.9 MB) – 500 latent SVD dimensions: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/WP500_Win2_Word_WF_svd500.rda|WP500_Win2_Word_WF_svd500.rda]]'' (185.9 MB) |
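The ''.rda'' files listed above are ordinary R data files, so they can be read with ''load()''. Since the names of the objects stored in each file are not stated here, the sketch below loads a file into a private environment and reports what it contains. ''load_rda()'' is a hypothetical helper, and the demonstration uses a toy matrix in a temporary file rather than one of the real models.

```r
# Hypothetical helper: load an .rda file into a private environment and
# return the names of the stored objects together with that environment.
load_rda <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)
  list(names = obj_names, env = env)
}

# Self-contained demonstration with a toy matrix written to a temporary file.
toy <- matrix(1:6, nrow = 2,
              dimnames = list(c("cat", "dog"), c("f1", "f2", "f3")))
tmp <- tempfile(fileext = ".rda")
save(toy, file = tmp)

res <- load_rda(tmp)
print(res$names)
M <- get(res$names[1], envir = res$env)
print(dim(M))

# With a downloaded model and the wordspace package installed, one could try:
# library(wordspace)
# res <- load_rda("WP500_DepFilter_Lemma_svd500.rda")
# M <- get(res$names[1], envir = res$env)
# nearest.neighbours(M, "dog_N", n = 10)  # the term format "dog_N" is an assumption
```

Loading into a separate environment avoids overwriting objects in the workspace when several models are compared in one session.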
| |
==== Neural word embeddings ==== | ==== Neural word embeddings ==== |
Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package. | Some publicly available pre-trained neural embeddings, converted into ''.rda'' format for use with the ''wordspace'' package. |
| |
* word2vec: ''[[http://www.collocations.de/data/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB) | * word2vec: ''[[http://corpora.linguistik.uni-erlangen.de/data/wordspace/GoogleNews300_wf200k.rda|GoogleNews300_wf200k.rda]]'' (129.2 MiB) |
| |
===== Web interfaces ===== | ===== Web interfaces ===== |