Distributional Semantic Models (NAACL-HLT 2010)
Tutorial at the NAACL-HLT 2010 Conference, Los Angeles, 1 June 2010
Distributional semantic models (DSMs) – also known as “word space” or “distributional similarity” models – are based on the assumption that the meaning of a word can (at least to a certain extent) be inferred from its usage, i.e. its distribution in text. These models therefore build high-dimensional vector representations of words through a statistical analysis of the contexts in which they occur.
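To make the basic idea concrete, the following minimal sketch counts co-occurrences in a toy corpus and compares two words by the cosine of their count vectors. It is an illustration only, not the tutorial's R implementation: the corpus, the window size of 2, and the function names `cooccurrence_vectors` and `cosine` are all assumptions made for this example, and real DSMs typically add weighting and dimensionality reduction on top of raw counts.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus (an assumption for illustration; real DSMs use large corpora).
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog ate the bone",
    "the cat ate the mouse",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of counts of its context words.

    A context word is any other token within `window` positions of the
    target word inside the same sentence.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_bone = cosine(vectors["cat"], vectors["bone"])
print(f"cat ~ dog:  {sim_cat_dog:.3f}")
print(f"cat ~ bone: {sim_cat_bone:.3f}")
```

Each row of the resulting table is the distributional vector of one word; words that occur in similar contexts end up with vectors that point in similar directions.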
Since the seminal papers of Landauer & Dumais (1997) and Schütze (1998), DSMs have been an active area of research in computational linguistics. Amongst many other tasks, they have been applied to solving the TOEFL synonym test, automatic thesaurus construction, identification of translation equivalents, word sense induction and discrimination, POS induction, identification of analogical relations, PP attachment disambiguation, semantic classification, as well as the prediction of fMRI and EEG data (see bibliography). Recent years have seen renewed and rapidly growing interest in distributional approaches, as shown by the series of workshops on DSM held at Context 2007, ESSLLI 2008, EACL 2009, CogSci 2009, NAACL-HLT 2010, ACL 2010 and ESSLLI 2010 (links).
This tutorial is targeted both at participants who are new to the field and need a comprehensive overview of DSM techniques and applications, and at experienced scientists who want to get up to speed on current directions in DSM research. Its main goals are to
- introduce the most common DSM architectures and their parameters, as well as prototypical applications;
- equip participants with the mathematical techniques needed for the implementation of DSMs, in particular those of matrix algebra;
- illustrate visualisation techniques and mathematical arguments that help in understanding the high-dimensional DSM vector spaces and making sense of key operations such as SVD dimensionality reduction; and
- provide an overview of current research on DSMs, available software, evaluation tasks and future trends.
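As a brief reminder of what SVD dimensionality reduction does, the operation can be written as follows. The notation is an assumption made for this sketch and is not taken from the tutorial materials: M stands for the word-by-context co-occurrence matrix and k for the number of latent dimensions retained.

```latex
% Singular value decomposition of the co-occurrence matrix M (n words x m contexts)
M = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Rank-k truncation: keep only the k largest singular values and their vectors
M_k = U_k \Sigma_k V_k^{\top}

% Reduced k-dimensional word vectors: the rows of
U_k \Sigma_k = M V_k
```

Among all matrices of rank at most k, M_k is the closest to M in the Frobenius norm, which is why the rows of U_k Σ_k are commonly used as compact word vectors.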
An implementation of all methods presented in the tutorial will be made available on this Web site, based on the open-source statistical programming language R. With its sophisticated visualisation and data analysis features and an enormous choice of add-on packages, R provides an excellent “toy laboratory” for DSM research and is even powerful enough for mid-sized applications.
The tutorial is based on joint work with Alessandro Lenci and Marco Baroni.