====== Task 3: Comparison with Speaker-Generated Properties ======

==== Introduction ====

From a cognitive point of view, there is little doubt that salient properties of a concept are an important part of its "meaning", and subjects show a remarkable degree of agreement in tasks that require enumerating the typical properties of a concept: a dog //barks, has a tail, is a pet, etc.//

Psychologists have been collecting "feature norms", i.e., speaker-generated lists of concepts described in terms of properties, for decades now.

A particularly large and well-articulated list was recently made publicly available by McRae and colleagues:

McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, Instruments, & Computers, 37, 547-559.

The list can be obtained as described [[http://www.creelab.org/downloads|here]].

==== Task Operationalization ====

We operationalize the property generation task as follows.

We focus on the same set of 44 concepts used in the [[http://wordspace.collocations.de/doku.php/data:concrete_nouns_categorization|concrete noun categorization task]].

For each target concept, we pick the top 10 properties from the McRae norms (ranked by number of subjects that produced them) and use them as the gold standard set for that concept. Given the ranked output of a model, we compute precision for each concept with respect to this gold standard, at various n-best thresholds, and we average precision across the 44 concepts. We limit ourselves to the top 10 human-generated properties of each concept since, for about 10% of the target concepts, the norms only contain 10 properties (for one concept, //snail//, the norms list 9 properties).
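
As a rough illustration of the scoring scheme (ignoring, for now, the property expansions and duplicate handling described below), here is a minimal sketch in Python; ''gold'' and ''ranked'' are hypothetical dictionaries mapping each of the 44 concepts to its gold-standard property set and to the model's ranked property list, respectively. The actual scoring is performed by the Perl evaluation script distributed below.

<code python>
# Minimal sketch of the scoring scheme; expansion sets and duplicate
# handling are ignored here (see the sketches further down).
# `gold` and `ranked` are hypothetical stand-ins for the real data.

def average_precision_at_n(gold, ranked, n):
    precisions = []
    for concept, gold_props in gold.items():
        top_n = ranked[concept][:n]                        # n-best candidates
        hits = sum(1 for prop in top_n if prop in gold_props)
        precisions.append(hits / n)
    return sum(precisions) / len(precisions)               # mean over concepts

# e.g. report precision at the thresholds used by the evaluation script:
# for n in (10, 20, 30):
#     print(n, average_precision_at_n(gold, ranked, n))
</code>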

The provided evaluation script, by default, reports average precision at the 10-, 20- and 30-best thresholds.

=== Property Expansion ===

The properties in the norms database are expressed by phrases such as //tastes sweet// or //is loud//, resulting from manual normalization of the subjects' responses (McRae et al. 2005, p. 551). Thus, we face two problems when determining whether a property generated by a model matches a property in the norms: First, all word space models we are aware of produce single orthographic //words// as properties, and these have to be matched against the //phrases// in the norms. Second, we need to undo the normalization of McRae and colleagues, so that, say, //loud//, //noise// and //noisy// will all be counted as matches against the property //is loud//.

We dealt with these issues by generating an "expansion set" for each of the top 10 properties of each of the 44 target concepts, i.e., a list of single-word expressions that seemed plausible ways to express the relevant property. The expansion set was prepared by first extracting from WordNet the synonyms of the word constituting the last element of a property phrase (//red// in //is red//), and then filtering out irrelevant synonyms by hand while adding other potential matches, including inflectional and derivational variants (//leg// for //legs// and //transport// for //transportation//, respectively), as well as other semantic neighbours or closely related entities (//lives on water// was expanded to //aquatic, lake, ocean, river, sea, water//).
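
For readers who want to reproduce a similar (though not identical) first step, a rough sketch of the WordNet synonym lookup using NLTK is given below; the manual filtering and hand-added variants described above are not automated, and the function name is only an illustrative choice.

<code python>
# Rough sketch: gather WordNet synonyms of the last word of a property
# phrase (e.g. "red" in "is red") as raw candidates for an expansion set.
# Requires NLTK with the WordNet data installed; manual curation as
# described in the text is not reproduced here.
from nltk.corpus import wordnet as wn

def raw_expansion_candidates(property_phrase):
    head = property_phrase.split()[-1]            # last element of the phrase
    candidates = {head}
    for synset in wn.synsets(head):
        for lemma in synset.lemmas():
            candidates.add(lemma.name().replace("_", " ").lower())
    return sorted(candidates)

# raw_expansion_candidates("is loud") returns "loud" plus its WordNet synonyms.
</code>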

While we recognize the somewhat subjective nature of the expansion operation, we have no reason to think that matching against the expanded set introduces a bias in favour of or against any specific model.

When evaluating against the expansion set, there is the possibility that a model will match a property more than once (e.g., matching both //transport// and //transportation//). In these cases, we count the top match, and we ignore the lower ones (i.e., lower matches are not treated as hits, but they do not contribute to the n-best count either).
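
A minimal sketch of one way this rule might be implemented follows; ''expansions'' and ''ranked_props'' are hypothetical stand-ins, and the distributed Perl script remains the authoritative implementation.

<code python>
# Rough reading of the duplicate-handling rule: a candidate matching a
# not-yet-matched gold property is a hit and fills an n-best slot; a
# candidate matching nothing is a miss but still fills a slot; a candidate
# that only repeats an already matched property is skipped entirely.
# `expansions` maps each gold property (of one concept) to its expansion set.

def precision_at_n(ranked_props, expansions, n):
    matched = set()      # gold properties matched so far
    hits = slots = 0
    for prop in ranked_props:
        if slots == n:
            break
        fresh = [g for g, exp in expansions.items()
                 if prop in exp and g not in matched]
        repeat = any(prop in exp for g, exp in expansions.items() if g in matched)
        if fresh:                 # hit: count it once, use up a slot
            matched.add(fresh[0])
            hits += 1
            slots += 1
        elif repeat:              # duplicate match: skip, no slot used
            continue
        else:                     # miss: no hit, but the slot is used
            slots += 1
    return hits / n
</code>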

==== Gold standard and evaluation script ====

**NB: on March 7, we made a small correction to the property expansion file; if you downloaded the archive before this date, please download it again.**

This {{propgen.tar.gz|archive}} contains the gold standard (with property expansions as described above) and an evaluation script that computes average precision at various n-best thresholds.

Detailed information about the script can be accessed by running it with the ''-h'' option:

''evaluate-against-expanded-props.pl -h | more''

In short, if you can organize the output of your model in a file, say ''output.txt'', with lines in the format:

''concept property score''

then you can run the evaluation (against the gold standard set with expansions generated as described above) as:

''evaluate-against-expanded-props.pl expanded-props.txt output.txt''
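
For illustration only (the concepts, properties and scores below are made up), the first lines of such an output file might look like:

<code>
dog bark 24.7
dog tail 19.2
dog leash 17.8
cat whiskers 31.0
</code>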

We provide this script to have a common benchmark when comparing models, but we also encourage you to explore McRae et al.'s database for other possible ways to evaluate the models.