Differences

This shows you the differences between two versions of the page.

data:correlation_with_free_association_norms [2008/01/23 17:37]
schtepf
data:correlation_with_free_association_norms [2008/06/23 22:19]
schtepf
Line 1: Line 1:
-====== Correlation of statistical distribution and human free associations ======+====== Correlation of the statistical distribution of words with human free associations ====== 
 + 
  
  
Line 11: Line 13:
 In the shared task, we wish to find out to what extent free associations can be explained and predicted by statistical association measures computed from corpus data.  The scientific goals of this experiment are twofold: In the shared task, we wish to find out to what extent free associations can be explained and predicted by statistical association measures computed from corpus data.  The scientific goals of this experiment are twofold:
  
-  - **Improve our understanding of free associations.**  In particular, we are interested in the interplay between **first-order and higher-order statistical associations** in human associative memory (e.g. //bear// evokes the hypernym //insect// and //brown//, but //mouse// evokes the compound //mouse trap//).  In future shared tasks, we will also attempt to model the **asymmetry** of many free associations (e.g. //bowler// strongly evokes //hat//, but not vice versa).+  - **Improve our understanding of free associations.**  In particular, we are interested in the interplay between **first-order and higher-order statistical associations** in human associative memory (e.g. //bear// evokes the hypernym //animal// and the property //brown//, but //mouse// evokes the compound //mouse trap//).  In future shared tasks, we will also attempt to model the **asymmetry** of many free associations (e.g. //bowler// strongly evokes //hat//, but not vice versa).
   - **Evaluate free associations as a straightforward "baseline" interpretation of distributional similarity.**  If word space proves to be a good **model of human associative memory**, then we should perhaps focus more on the relation between such free associations and theoretical linguistic categories rather than studying the linguistic aspects of word space models directly.  ((We fully expect a negative answer here, and this is certainly the desirable outcome for many researchers. However, it will be interesting to see how close the relation between word space and associative memory really is.))   - **Evaluate free associations as a straightforward "baseline" interpretation of distributional similarity.**  If word space proves to be a good **model of human associative memory**, then we should perhaps focus more on the relation between such free associations and theoretical linguistic categories rather than studying the linguistic aspects of word space models directly.  ((We fully expect a negative answer here, and this is certainly the desirable outcome for many researchers. However, it will be interesting to see how close the relation between word space and associative memory really is.))
  
Line 17: Line 19:
  
 ===== Data preparation ===== ===== Data preparation =====
 +
  
  
Line 22: Line 25:
  
 Psychologists measure free association with so-called **association norms**:  Native speakers are presented with stimulus words and are asked to write down the first word that comes to mind for each stimulus.  The degree of free association between a stimulus (//S//) and response (//R//) is then quantified by the percentage of test subjects who produced //R// when presented with //S//.  The data sets for this task are based on a large, freely available database of English association norms, the **Edinburgh Associative Thesaurus** ([[http://www.eat.rl.ac.uk/]]). Psychologists measure free association with so-called **association norms**:  Native speakers are presented with stimulus words and are asked to write down the first word that comes to mind for each stimulus.  The degree of free association between a stimulus (//S//) and response (//R//) is then quantified by the percentage of test subjects who produced //R// when presented with //S//.  The data sets for this task are based on a large, freely available database of English association norms, the **Edinburgh Associative Thesaurus** ([[http://www.eat.rl.ac.uk/]]).
-((We also considered using the **USF Free Association Database** ([[http://www.usf.edu/FreeAssociation]]), but it was not suitable for our purposes due to the exclusion of hapax responses.  More information on the USF database can be found in: Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998).  //The University of South Florida word association, rhyme, and word fragment norms.//))+((We also considered using the **USF Free Association Database** ([[http://w3.usf.edu/FreeAssociation]]), but found it more difficult to adapt to our purposes.  One reason is that hapax responses (those generated only by a single subject) were originally excluded from the database and are now available only in separate files with a different format.  More information on the USF database can be found in: Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998).  //The University of South Florida word association, rhyme, and word fragment norms.//))
  
   * Kiss, G.R., Armstrong, C., Milroy, R., and Piper, J. (1973).  An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W. and Hamilton-Smith, N. (Eds.), //The Computer and Literary Studies//. Edinburgh: Edinburgh University Press.   * Kiss, G.R., Armstrong, C., Milroy, R., and Piper, J. (1973).  An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W. and Hamilton-Smith, N. (Eds.), //The Computer and Literary Studies//. Edinburgh: Edinburgh University Press.
Line 35: Line 38:
  
  
-===== Data sets & tasks ===== 
  
-ZIP-archive with data sets for all subtasks: **coming soon** 
  
-All files are TAB-delimited tables in ASCII text format with a single header row, so they can easily be loaded into [[http://www.r-project.org/|R]] (with ''read.delim()'') and most spreadsheet programs.  Standard columns are ''cue'' (stimulus headword) and ''target'' (response headword); the other columns are specific for each task and are described below. 
  
-For each task, separate training and test sets are provided. Training sets are small and can be used to adapt parameters of the word space models or the formula used to predict free association strength from the statistical association data. //No development or tuning on the evaluation sets is allowed!//+===== Data sets & tasks =====
  
 +ZIP-archive with data sets for all subtasks: {{data:free_association_tasks.zip}}
  
 +All files are TAB-delimited tables in ASCII text format with a single header row, so they can easily be loaded into [[http://www.r-project.org/|R]] (with ''read.delim()'') and most spreadsheet programs.  Standard columns are ''cue'' (stimulus headword) and ''target'' (response headword); the other columns are specific for each task and are described below.
  
 +For each task, separate training and test sets are provided. Training sets are small and can be used to adapt parameters of the word space models or the formula used to predict free association strength from the statistical association data. //We recommend using the test set for evaluation only!//
 +
 +Note that these tasks are mainly aimed at //surface-level// word space models that are not restricted to specific parts of speech and will typically not make use of syntactic features (except in a very generic way).  Words have to be lemmatised (reduced to base forms) and normalised to lower case in order to match the entries in the data sets, though.
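
The file format and the normalisation requirement described above can be sketched in Python as follows. This is only an illustration: the sample rows, the ''assoc'' values and the helper names ''load_pairs'' and ''normalise'' are invented here, and lemmatisation is not covered because it needs an external tool.

```python
import csv
import io

# Hypothetical sample in the described format (TAB-delimited, one header row).
# These rows are invented for illustration and are NOT taken from the task data.
SAMPLE = "cue\ttarget\tassoc\nbowler\that\t0.5\nmouse\ttrap\t0.25\n"


def load_pairs(handle):
    """Read a TAB-delimited table with a single header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def normalise(word):
    """Lower-case a word form so it can match the data set entries.

    Lemmatisation (reduction to base forms) is NOT done here.
    """
    return word.strip().lower()


pairs = load_pairs(io.StringIO(SAMPLE))

# Hypothetical word forms as they might come out of a word space model.
model_words = ["Bowler", "HAT", "Mouse"]
known = {normalise(w) for w in model_words}

# Keep only the pairs whose cue and target are both covered by the model.
covered = [p for p in pairs if p["cue"] in known and p["target"] in known]
print(len(pairs), len(covered))
```

With a real data file, ''io.StringIO(SAMPLE)'' would be replaced by an open file handle.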
  
 ==== 1. Discrimination ==== ==== 1. Discrimination ====
Line 63: Line 68:
  
 Evaluation should report classification accuracy on the test set after parameter tuning on the training set.  Note that the baseline accuracy for the main classification task is 66.6% (all pairs classified as non-associated).  Post-hoc analysis might consider the influence of different parameter settings and first-order/higher-order combinations on the test set. Evaluation should report classification accuracy on the test set after parameter tuning on the training set.  Note that the baseline accuracy for the main classification task is 66.6% (all pairs classified as non-associated).  Post-hoc analysis might consider the influence of different parameter settings and first-order/higher-order combinations on the test set.
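
As a minimal sketch of the accuracy computation and the majority baseline mentioned above: the label names and the toy gold standard below are assumptions made for illustration, chosen only so that two thirds of the pairs are non-associated.

```python
def accuracy(gold, predicted):
    """Proportion of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical gold labels: one third associated, two thirds non-associated.
gold = ["associated", "non-associated", "non-associated"] * 4

# Baseline: classify every pair as non-associated.
baseline = ["non-associated"] * len(gold)

print(round(accuracy(gold, baseline), 3))  # 0.667
```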
 +
 +
  
  
Line 74: Line 81:
   * ''cue'' = stimulus headword   * ''cue'' = stimulus headword
   * ''target'' = response headword   * ''target'' = response headword
-  * ''assoc'' = (forward) association strength of pair = proportion of responses //target// for stimulus //cue//+  * ''assoc'' = (forward) association strength of pair = proportion of response //target// for stimulus //cue//
  
 Here, the task is to predict free association strength for a given list of cue-target pairs, quantified by the proportion of test subjects that gave //target// as a response to the stimulus //cue//.  Association strength therefore ranges from 0 to 1 (the highest value in the EAT is .91).  Pairs in the training and test set have been selected by stratified sampling so that association strength is uniformly distributed across the full range (values above 0.7 have been pooled). Here, the task is to predict free association strength for a given list of cue-target pairs, quantified by the proportion of test subjects that gave //target// as a response to the stimulus //cue//.  Association strength therefore ranges from 0 to 1 (the highest value in the EAT is .91).  Pairs in the training and test set have been selected by stratified sampling so that association strength is uniformly distributed across the full range (values above 0.7 have been pooled).
  
-The predictor will typically be a nonlinear function of first-order and higher-order statistical association, whose parameters can be tuned on the training set. Evaluation should report //linear correlation// (Pearson) and //rank correlation// (Kendall) between predictions and the gold standard. Participants are encouraged to produce scatterplots and explore nonlinear correlations, although the predictor function is expected to remove such nonlinearities.+The predictor will typically be a nonlinear function of first-order and higher-order statistical association, whose parameters can be tuned on the training set. Evaluation should report //linear correlation// (Pearson) and //rank correlation// (Kendall) between predictions and the gold standard. Participants are encouraged to produce scatterplots and explore nonlinear correlations, although the predictor function should ideally remove such nonlinearities. 
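
The two correlation coefficients named above could be computed along the following lines. This is a plain-Python sketch rather than the evaluation script of the task; the Kendall variant shown is tau-b (which corrects for ties), and that choice is an assumption, as are the toy values.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b), computed over all pairs in O(n^2)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes nothing
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# Hypothetical gold-standard strengths and model predictions.
gold = [0.1, 0.3, 0.5, 0.7]
pred = [0.2, 0.1, 0.6, 0.9]
print(pearson(gold, pred), kendall_tau_b(gold, pred))
```

For larger data sets, an established statistics package would be preferable to the quadratic loop used here.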
 + 
 + 
 + 
  
 ==== 3. Response prediction ==== ==== 3. Response prediction ====
  
-In this subtask, models have to predict the most frequent free associations of native speakers for a given list of stimulus words.  This task is presumably much harder than the correlation task, since the model has to choose from a very large set of possible response words (which are not narrowed down to the set of responses observed in psychological experiments).  For this reason, evaluation will be relatively lenient:+Files: ''FA/prediction_dev.tbl'' (50 cues), ''FA/prediction_test.tbl'' (200 cues)
  
-  Participants suggest approxresponse candidates for each stimulus word+Format: 
-  - The model predictions are accepted as correct if at least one of the candidates belongs to the most frequent responses in the gold standard (these will comprise 1 to 3 dominant response words).+  * ''cue'' = stimulus headword 
 +  * ''target'' = most frequent response 
 +  * ''a1'' = association strength of this response (for information and post-hoc analysis) 
 +  * ''a2'' = association strength of second response (for information and post-hoc analysis) 
 + 
 +In this task, models have to predict the most frequent responses for a given list of stimulus words.  This task is presumably much harder than the correlation task, since the model has to choose from a very large set of possible response words (which are not narrowed down to the set of responses found in the EAT for each stimulus!).  Cues were randomly selected from entries in the EAT database that have a clearly preferred response, operationalised in the following way: the association strength of the dominant response must be >= .4, and at least three times as high as that of the second response.
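
The selection criterion for cues can be written as a small predicate. The function name and the example values are invented for illustration; ''a1'' and ''a2'' follow the column names in the file format above.

```python
def has_dominant_response(a1, a2, min_strength=0.4, min_ratio=3.0):
    """Check the 'clearly preferred response' criterion.

    a1: association strength of the most frequent response
    a2: association strength of the second response
    """
    return a1 >= min_strength and a1 >= min_ratio * a2


# Hypothetical (a1, a2) values, not taken from the EAT.
print(has_dominant_response(0.6, 0.1))   # strong and well separated
print(has_dominant_response(0.45, 0.3))  # strong, but second response too close
print(has_dominant_response(0.3, 0.05))  # well separated, but too weak
```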
 + 
 +Because of the difficulty of this task, evaluation will be relatively lenient.  Word space models can suggest up to 100 response candidates for each cue, and the score of the model is the //average rank of the correct response// (if the correct response is not among the suggested candidates, it is assigned rank 100 regardless of the number of suggestions).  Further evaluation in terms of how many cues have the correct response among the first //k// candidates is encouraged.  
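
The scoring rule described above might be implemented roughly as follows. This is a sketch with hypothetical function names and toy data, not the evaluation script distributed with the task.

```python
MAX_RANK = 100  # rank assigned when the correct response is not suggested


def rank_of_correct(candidates, correct, max_rank=MAX_RANK):
    """1-based rank of the correct response among at most max_rank candidates."""
    for rank, candidate in enumerate(candidates[:max_rank], start=1):
        if candidate == correct:
            return rank
    return max_rank


def average_rank(predictions, gold):
    """Mean rank of the correct response over all cues in the gold standard."""
    ranks = [rank_of_correct(predictions.get(cue, []), target)
             for cue, target in gold.items()]
    return sum(ranks) / len(ranks)


def hits_at_k(predictions, gold, k):
    """Number of cues whose correct response is among the first k candidates."""
    return sum(1 for cue, target in gold.items()
               if target in predictions.get(cue, [])[:k])


# Hypothetical gold standard and model output.
gold = {"bowler": "hat", "mouse": "trap"}
predictions = {"bowler": ["cricket", "hat"], "mouse": ["cat", "cheese"]}

print(average_rank(predictions, gold))  # 51.0
print(hits_at_k(predictions, gold, 5))  # 1
```

Lower average ranks are better; a model that never suggests the correct response scores 100.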
 + 
 + 
 +===== Ancillary data: First-order associations ===== 
 + 
 +Database of first-order statistical associations: {{data:lexsem08_first_order_associations.ds.gz}} (gzip-compressed, 5.8 MB) 
 + 
 +This database contains lemmatised surface collocates of all cue words used in the free associations task, extracted from the British National Corpus with a span size of 5 words (left & right) and limited by sentence boundaries.  Collocates were only included if they cooccur at least //f=5// times with the cue word and show significant evidence for a positive statistical association (//p < .001//, one-sided log-likelihood test).  First-order association is quantified by four well-known association measures with distinct mathematical properties, viz. //log-likelihood//, //t-score//, //MI// and //Dice//. See [[http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf|Evert (2008)]] for terminology and further information. 
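
For orientation, the four measures can be sketched from the frequency signature (//f//, //f1//, //f2//, //N//) using the usual 2x2 contingency-table definitions. Whether the database uses exactly these variants (for example base-2 logarithms for MI, or an unsigned log-likelihood score) is an assumption here, and the frequencies in the example are invented.

```python
import math


def association_scores(f, f1, f2, n):
    """Association scores from cooccurrence frequency f, marginals f1, f2
    and sample size n. Requires f > 0."""
    # Observed contingency table.
    observed = [f, f1 - f, f2 - f, n - f1 - f2 + f]
    # Expected frequencies under independence.
    expected = [f1 * f2 / n, f1 * (n - f2) / n,
                (n - f1) * f2 / n, (n - f1) * (n - f2) / n]
    e11 = expected[0]

    def term(o, e):
        # By convention, a cell with o = 0 contributes nothing.
        return o * math.log(o / e) if o > 0 else 0.0

    return {
        "log.likelihood": 2 * sum(term(o, e) for o, e in zip(observed, expected)),
        "t.score": (f - e11) / math.sqrt(f),
        "MI": math.log2(f / e11),   # assumption: base-2 logarithm
        "Dice": 2 * f / (f1 + f2),
    }


# Hypothetical frequency signature, not taken from the database.
scores = association_scores(f=10, f1=100, f2=200, n=10000)
for name, value in scores.items():
    print(name, round(value, 4))
```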
 + 
 +The database is provided in the ''.ds.gz'' format used by the [[http://www.collocations.de/software.html#UCS|UCS toolkit]].  It is a simple TAB-delimited ASCII table with a single header row, and can easily be read into [[http://www.r-project.org/|R]] (using ''read.delim'') or a spreadsheet program such as Excel after decompression with ''gzip''.  The table contains the following variables (columns): 
 + 
 +  l1          cue word (lemmatised) 
 +  l2          collocate (lemmatised) 
 +  f           cooccurrence frequency 
 +  f1          marginal frequency of cue word 
 +  f2          marginal frequency of collocate 
 +  N           sample size (cooccurrence tokens) 
 +  am.log.likelihood  log-likelihood association score 
 +  am.t.score  t-score association score 
 +  am.MI       MI (pointwise mutual information) score 
 +  am.Dice     Dice coefficient association score
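
A table in this format can also be read without decompressing it first. The sketch below assumes the file is exactly as described above (gzip-compressed, TAB-delimited, one header row); the helper name ''read_ds_gz'' and the in-memory sample row are invented, and the ''am.*'' columns are left out of the sample for brevity.

```python
import csv
import gzip
import io


def read_ds_gz(source):
    """Read a gzip-compressed TAB-delimited table into a list of dicts.

    source may be a file name or a binary file object.
    """
    with gzip.open(source, mode="rt", encoding="ascii", newline="") as text:
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


# Hypothetical one-row table, compressed in memory to stand in for the real file.
table = "l1\tl2\tf\tf1\tf2\tN\nbowler\that\t12\t150\t900\t100000\n"
compressed = io.BytesIO(gzip.compress(table.encode("ascii")))

rows = read_ds_gz(compressed)
# All values are read as strings; convert the frequency columns explicitly.
print(rows[0]["l1"], rows[0]["l2"], int(rows[0]["f"]))
```

For the real database, ''read_ds_gz'' would be called with the path of the downloaded file instead of the in-memory buffer.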
  
-If a model achieves high precision at this level, then further analysis e.g. by taking the rank of the "correct" candidate into account should be performed. 
  
  
 ===== Evaluation ===== ===== Evaluation =====
  
-Evaluation will be carried out by comparison of model predictions with our gold standard on the test sets.  Since our focus is not on competition, each team will be responsible for evaluating their own model and reporting the results in their paper submission.  Participants are strongly encouraged to make model predictions available for downloads to allow further analysis and discussion by other researchers.+Since our focus is not on competition, each team will be responsible for evaluating their own model and reporting the results in their paper submission, following the recommendations in the task descriptions above.  Participants are strongly encouraged to make the full model output available for download to allow further analysis and discussion by other researchers. 
 + 
 +**NB: bug in script eval_task3.perl fixed as of March 29: if you downloaded earlier, please re-download**  
 + 
 +Evaluation package: {{data:eval_package_free_association.zip}}
  
-In order to ensure comparability of the results, we will provide [[http://www.r-project.org/|R]] and [[http://www.perl.org/|Perl]] scripts for a basic evaluation of each subtask, together with detailed instructions and examples.+  * sample output generated by FOO model ((**F**irst-**O**rder associations **O**nly)) 
 +  * sample evaluation scripts written in [[http://www.r-project.org/|R]] and [[http://www.perl.org/|Perl]] 
 +  * includes complete implementation of FOO model