Task 1: Correlation with free association norms

Overview and main goals

In psychology, free associations are the first words that come to the mind of a native speaker when he or she is presented with a stimulus word, presumably retrieved from associative memory. It is tempting to make a connection between such free associations and the statistical association patterns of words in the linguistic experience of speakers, including both first-order associations (collocations) and higher-order associations (distributional similarity). The misleading terminological resemblance between the two concepts is not the only reason, though:

  • Neither free associations nor statistical associations can be linked directly to a specific linguistic phenomenon (such as multiword expressions or a particular semantic relation), and both are often considered epiphenomena in linguistic theory (which is based on categorial distinctions and symbolic models).
  • It is quite plausible to assume that associative memory reflects salient statistical association patterns in the experience of a person. For free associations between words, the predominant factor should be linguistic experience (although associations between non-linguistic concepts will certainly play a role as well).

In the shared task, we wish to find out to what extent free associations can be explained and predicted by statistical association measures computed from corpus data. The scientific goals of this experiment are twofold:

  1. Improve our understanding of free associations. In particular, we are interested in the interplay between first-order and higher-order statistical associations in human associative memory (e.g. bear evokes the hypernym animal and the property brown, but mouse evokes the compound mouse trap). In future shared tasks, we will also attempt to model the asymmetry of many free associations (e.g. bowler strongly evokes hat, but not vice versa).
  2. Evaluate free associations as a straightforward "baseline" interpretation of distributional similarity. If word space proves to be a good model of human associative memory, then we should perhaps focus more on the relation between such free associations and theoretical linguistic categories rather than studying the linguistic aspects of word space models directly. 1)

In order to address these questions, we propose the three subtasks described below. Note that ideally the same word space model should be used for all subtasks, although its similarity scores etc. will be interpreted in different ways, of course. Participants are specifically encouraged to combine first-order statistical associations (see www.collocations.de/AM) with their word space model and to discuss the respective contribution made by each type of association.

Data preparation

Association norms

Psychologists measure free association with so-called association norms: Native speakers are presented with stimulus words and are asked to write down the first word that comes to mind for each stimulus. The degree of free association between a stimulus (S) and response (R) is then quantified by the percentage of test subjects who produced R when presented with S. The data sets for this task are based on a large, freely available database of English association norms, the Edinburgh Associative Thesaurus (http://www.eat.rl.ac.uk/). 2)
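As a minimal sketch of this definition, the following Python snippet computes association strengths from a list of raw responses. The function name and the response data are invented for illustration and are not taken from the EAT.

```python
from collections import Counter


def association_strengths(responses):
    """Map each response word to the proportion of subjects who produced it."""
    counts = Counter(responses)
    total = len(responses)
    return {word: n / total for word, n in counts.items()}


# Hypothetical responses of 10 test subjects to the stimulus "bowler"
# (invented numbers, for illustration only).
bowler_responses = ["hat"] * 7 + ["cricket"] * 2 + ["ball"]
strengths = association_strengths(bowler_responses)
print(strengths)
```

In this invented example, 7 of 10 subjects responded with hat, so the pair (bowler, hat) would receive an association strength of 0.7.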

  • Kiss, G.R., Armstrong, C., Milroy, R., and Piper, J. (1973). An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W. and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: Edinburgh University Press.


Stimulus (cue) and response (target) words in the EAT database were normalised to lowercase, and multiword units (i.e. words containing blanks) were discarded. Both cues and targets seem to be partly lemmatised base forms (headwords), partly inflected forms (mostly plurals), and no part-of-speech distinctions are made (so the entry light may refer to noun, adjective or verb and was probably interpreted and used in all three meanings by test subjects). Automatic normalisation of inflected forms or identification of parts of speech was not feasible, but we have made an effort to exclude word pairs containing inflected forms from the data sets.

In order to make sure that word space models have sufficient information for each word pair, only common English words were accepted as cues and targets. For operationalisation, common words were defined as headwords that occur in at least 50 different documents in the British National Corpus (BNC), XML Edition. This threshold was phrased in terms of document frequencies to avoid genre- and domain-specific words in the data sets, so that the choice of base corpus for the word space models should be less critical.

Data sets & tasks

ZIP-archive with data sets for all subtasks: free_association_tasks.zip

All files are TAB-delimited tables in ASCII text format with a single header row, so they can easily be loaded into R (with read.delim()) and most spreadsheet programs. Standard columns are cue (stimulus headword) and target (response headword); the other columns are specific to each task and are described below.
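For participants who do not work in R, the same tables can be read with a few lines of Python. This is only a sketch: the helper name read_table is hypothetical, and the sample row is invented to stand in for a real file.

```python
import csv
import io


def read_table(handle):
    """Read a TAB-delimited table with one header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Invented stand-in for a data file with the columns cue, target and assoc.
sample = io.StringIO("cue\ttarget\tassoc\nbowler\that\t0.70\n")
rows = read_table(sample)
print(rows[0]["cue"], rows[0]["target"], float(rows[0]["assoc"]))

# For a real file from the archive, one would open it first, e.g.:
#   with open("FA/correlation_train.tbl", newline="", encoding="ascii") as fh:
#       rows = read_table(fh)
```

All values are read as strings, so numeric columns have to be converted explicitly.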

For each task, separate training and test sets are provided. Training sets are small and can be used to adapt parameters of the word space models or the formula used to predict free association strength from the statistical association data. We recommend using the test set for evaluation only!

Note that these tasks are mainly aimed at surface-level word space models that are not restricted to specific parts of speech and will typically not make use of syntactic features (except in a very generic way). Words have to be lemmatised (reduced to base forms) and normalised to lower case in order to match the entries in the data sets, though.

1. Discrimination

Files: FA/discrimination_train.tbl (3 x 20 pairs), FA/discrimination_test.tbl (3 x 100 pairs)


  • cue = stimulus word
  • target = response word
  • type = FIRST, HAPAX, or RANDOM

The task here is to discriminate between strongly associated and non-associated cue-target pairs, with a further subdivision of the second group into plausible and random pairs. Training and test data were randomly sampled from three pools:

  • FIRST: frequent first responses (given by more than 50% of test subjects) as strongly associated pairs
  • HAPAX: cue-target pairs that were produced by a single test subject; there is obviously no substantial association, but the target must be a plausible response (at least under certain circumstances)
  • RANDOM: random combinations of headwords from the EAT that were never produced as a cue-target pair (in any direction); most of these will likely be very implausible combinations
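The three pool definitions can be summarised in a small Python sketch. The function pool_label and its arguments are hypothetical; the actual sampling procedure used to build the data sets may differ in detail.

```python
def pool_label(count, n_subjects):
    """Assign a cue-target pair to a sampling pool, or None if it fits none.

    count      -- number of subjects who gave the target as response to the cue
    n_subjects -- number of subjects who were presented with the cue
    """
    if count == 0:
        # Never produced; the task additionally requires that the pair
        # was not produced in the reverse direction either.
        return "RANDOM"
    if count == 1:
        return "HAPAX"
    if count / n_subjects > 0.5:
        return "FIRST"
    return None  # intermediate association strength: not sampled


print(pool_label(70, 100), pool_label(1, 100), pool_label(0, 100))
```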

The main goal of this task is discrimination between the FIRST category (strongly associated pairs) and the other two categories. A further discrimination between HAPAX and RANDOM can be attempted, but is expected to be much more difficult.

Evaluation should report classification accuracy on the test set after parameter tuning on the training set. Note that the baseline accuracy for the main classification task is 66.7% (all pairs classified as non-associated, which is correct for the two thirds of the pairs that are HAPAX or RANDOM). Post-hoc analysis might consider the influence of different parameter settings and first-order/higher-order combinations on the test set.
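The accuracy computation and the baseline can be sketched in Python as follows. The function names are hypothetical, and the gold labels are generated to mirror the 3 x 100 layout of the test set rather than read from the actual file.

```python
def accuracy(gold, predicted):
    """Proportion of items for which the prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def is_associated(pair_type):
    """Main task: FIRST counts as associated, HAPAX and RANDOM do not."""
    return pair_type == "FIRST"


# Stand-in for the type column of the test set (3 x 100 pairs).
gold_types = ["FIRST"] * 100 + ["HAPAX"] * 100 + ["RANDOM"] * 100
gold = [is_associated(t) for t in gold_types]

# Baseline: classify every pair as non-associated.
baseline = accuracy(gold, [False] * len(gold))
print(round(100 * baseline, 1))
```

The baseline classifier is right on the 200 HAPAX and RANDOM pairs and wrong on the 100 FIRST pairs, i.e. 200/300.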

2. Correlation

Files: FA/correlation_train.tbl (40 pairs), FA/correlation_test.tbl (240 pairs)


  • cue = stimulus headword
  • target = response headword
  • assoc = (forward) association strength of pair = proportion of response target for stimulus cue

Here, the task is to predict free association strength for a given list of cue-target pairs, quantified by the proportion of test subjects that gave target as a response to the stimulus cue. Association strength therefore ranges from 0 to 1 (the highest value in the EAT is .91). Pairs in the training and test set have been selected by stratified sampling so that association strength is uniformly distributed across the full range (values above 0.7 have been pooled).

The predictor will typically be a nonlinear function of first-order and higher-order statistical association, whose parameters can be tuned on the training set. Evaluation should report linear correlation (Pearson) and rank correlation (Kendall) between predictions and the gold standard. Participants are encouraged to produce scatterplots and explore nonlinear correlations, although the predictor function should ideally remove such nonlinearities.
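Both correlation coefficients can be computed without external libraries, as in the Python sketch below. The task description does not say which variant of Kendall's coefficient is intended; tau-b (which corrects for ties) is used here as an assumption. The gold and predicted values are invented.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b variant, O(n^2) pair enumeration)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: ignored by tau-b
            if dx == 0:
                ties_x += 1  # tied in x only
            elif dy == 0:
                ties_y += 1  # tied in y only
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + ties_x) * (untied + ties_y)
    )


# Invented gold-standard strengths and model predictions.
gold = [0.1, 0.3, 0.5, 0.7]
predicted = [1.0, 2.0, 4.0, 3.0]
print(pearson(gold, predicted), kendall_tau_b(gold, predicted))
```

If SciPy is available, scipy.stats.pearsonr and scipy.stats.kendalltau provide the same statistics and scale better than the quadratic loop used here.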

3. Response prediction

Files: FA/prediction_dev.tbl (50 cues), FA/prediction_test.tbl (200 cues)


  • cue = stimulus headword
  • target = most frequent response
  • a1 = association strength of this response (for information and post-hoc analysis)
  • a2 = association strength of second response (for information and post-hoc analysis)

In this task, models have to predict the most frequent responses for a given list of stimulus words. This task is presumably much harder than the correlation task, since the model has to choose from a very large set of possible response words (which are not narrowed down to the set of responses found in the EAT for each stimulus!). Cues were randomly selected from entries in the EAT database that have a clearly preferred response, operationalised in the following way: the association strength of the dominant response must be >= .4, and at least three times as high as that of the second response.

Because of the difficulty of this task, evaluation will be relatively lenient. Word space models can suggest up to 100 response candidates for each cue, and the score of the model is the average rank of the correct response (if the correct response is not among the suggested candidates, it is assigned rank 100 regardless of the number of suggestions). Further evaluation in terms of how many cues have the correct response among the first k candidates is encouraged.
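The scoring rule can be sketched in Python as follows. This is not the evaluation script from the package; function names and the toy predictions are hypothetical, and ranks are assumed to start at 1.

```python
MAX_CANDIDATES = 100  # at most 100 candidates per cue are considered


def response_rank(candidates, correct):
    """1-based rank of the correct response, or 100 if it is not suggested."""
    top = candidates[:MAX_CANDIDATES]
    if correct in top:
        return top.index(correct) + 1
    return MAX_CANDIDATES


def average_rank(predictions, gold):
    """Mean rank of the correct response over all cues in gold."""
    ranks = [response_rank(predictions.get(cue, []), target)
             for cue, target in gold.items()]
    return sum(ranks) / len(ranks)


def hits_at_k(predictions, gold, k):
    """Number of cues whose correct response is among the first k candidates."""
    return sum(target in predictions.get(cue, [])[:k]
               for cue, target in gold.items())


# Invented toy data: cue -> most frequent response, cue -> ranked candidates.
gold = {"bowler": "hat", "mouse": "trap"}
predictions = {"bowler": ["hat", "cricket"], "mouse": ["cat", "cheese"]}
print(average_rank(predictions, gold), hits_at_k(predictions, gold, 1))
```

In the toy data, the correct response for bowler is ranked first and the one for mouse is missing, giving an average rank of (1 + 100) / 2.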

Ancillary data: First-order associations

Database of first-order statistical associations: lexsem08_first_order_associations.ds.gz (gzip-compressed, 5.8 MB)

This database contains lemmatised surface collocates of all cue words used in the free associations task, extracted from the British National Corpus with a span size of 5 words (left & right) and limited by sentence boundaries. Collocates were only included if they cooccur at least f=5 times with the cue word and show significant evidence for a positive statistical association (p < .001, one-sided log-likelihood test). First-order association is quantified by four well-known association measures with distinct mathematical properties, viz. log-likelihood, t-score, MI and Dice. See Evert (2008) for terminology and further information.

The database is provided in the .ds.gz format used by the UCS toolkit. It is a simple TAB-delimited ASCII table with a single header row, and can easily be read into R (using read.delim) or a spreadsheet program such as Excel after decompression with gzip. The table contains the following variables (columns):

l1                 cue word (lemmatised)
l2                 collocate (lemmatised)
f                  cooccurrence frequency
f1                 marginal frequency of cue word
f2                 marginal frequency of collocate
N                  sample size (cooccurrence tokens)
am.log.likelihood  log-likelihood association score
am.t.score         t-score association score
am.MI              MI (pointwise mutual information) score
am.Dice            Dice coefficient association score
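To illustrate how the four measures relate to the frequency columns, the Python sketch below recomputes them from f, f1, f2 and N using common textbook definitions based on a 2x2 contingency table. These formulas are an assumption: the scores stored in the database may differ in details such as the logarithm base used for MI or the sign convention of log-likelihood. The function name and the example frequencies are invented.

```python
import math


def association_scores(f, f1, f2, N):
    """Textbook association measures from cooccurrence and marginal counts."""
    expected = f1 * f2 / N  # expected cooccurrence frequency under independence

    # Observed and expected cells of the 2x2 contingency table.
    observed = [f, f1 - f, f2 - f, N - f1 - f2 + f]
    expected_cells = [
        f1 * f2 / N,
        f1 * (N - f2) / N,
        (N - f1) * f2 / N,
        (N - f1) * (N - f2) / N,
    ]
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected_cells) if o > 0
    )

    return {
        "log.likelihood": log_likelihood,
        "t.score": (f - expected) / math.sqrt(f),
        "MI": math.log2(f / expected),  # base 2 is an assumption
        "Dice": 2 * f / (f1 + f2),
    }


# Invented frequencies, for illustration only.
scores = association_scores(f=10, f1=100, f2=200, N=10000)
print(scores)
```

After decompression, or directly via gzip.open(path, "rt"), the table itself can be read with csv.DictReader using TAB as delimiter.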


Since our focus is not on competition, each team will be responsible for evaluating their own model and reporting the results in their paper submission, following the recommendations in the task descriptions above. Participants are strongly encouraged to make the full model output available for download to allow further analysis and discussion by other researchers.

NB: A bug in the script eval_task3.perl was fixed on March 29. If you downloaded the evaluation package before that date, please download it again.

Evaluation package: eval_package_free_association.zip

  • sample output generated by FOO model 3)
  • sample evaluation scripts written in R and Perl
  • includes complete implementation of FOO model

1) We fully expect a negative answer here, and this is certainly the desirable outcome for many researchers. However, it will be interesting to see how close the relation between word space and associative memory really is.
2) We also considered using the USF Free Association Database (http://w3.usf.edu/FreeAssociation), but found it more difficult to adapt to our purposes. One reason is that hapax responses (those generated only by a single subject) were originally excluded from the database and are now available only in separate files with a different format. More information on the USF database can be found in: Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms.
3) FOO = First-Order associations Only