====== Task 3: Comparison with Speaker-Generated Properties ======

==== Introduction ====

From a cognitive point of view, there is little doubt that salient properties of a concept are an important part of its "meaning".

Psychologists have been collecting "norms" of speaker-generated properties for decades, asking subjects to list the characteristics that come to mind when they are presented with a target concept.

A particularly large and well-articulated list was recently made publicly available by McRae and colleagues:

McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4), 547-559.

The list can be obtained as described [[http://...|here]].

==== Task Operationalization ====

We operationalize the property generation task as follows.

We focus on the same set of 44 concepts used in the [[http://...|categorization task]].

For each target concept, we pick the top 10 properties from the McRae norms (ranked by the number of subjects that produced them) and use them as the gold standard set for that concept. Given the ranked output of a model, we compute precision for each concept with respect to this gold standard, at various n-best thresholds, and we average precision across the 44 concepts. We limit ourselves to the top 10 human-generated properties of each concept since, for about 10% of the target concepts, the norms only contain 10 properties (for one concept, //snail//, the norms list 9 properties).

The provided evaluation script, by default, reports average precision at the 10-, 20- and 30-best thresholds.
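
To make the scoring concrete, here is a minimal sketch of precision at an n-best threshold, averaged across concepts. It is not the official evaluation script: it assumes toy data structures and plain string matching, whereas the script additionally handles the property expansions described below.

<code python>
# Minimal sketch of the scoring scheme, not the official evaluation script.
# gold_standard maps each concept to its top-10 gold properties; model
# output is a ranked list of candidate properties per concept.

def precision_at_n(ranked, gold, n):
    """Fraction of the top-n ranked candidates that hit a gold property."""
    return sum(1 for cand in ranked[:n] if cand in gold) / n

def average_precision_at_n(model_output, gold_standard, n):
    """Mean precision at n over all concepts (44 in this task)."""
    scores = [precision_at_n(model_output[c], gold_standard[c], n)
              for c in gold_standard]
    return sum(scores) / len(scores)

# Toy example with a single concept:
gold_standard = {"snail": {"slow", "shell", "slimy"}}
model_output = {"snail": ["slow", "small", "shell", "garden"]}
for n in (10, 20, 30):  # the script's default thresholds
    print(n, average_precision_at_n(model_output, gold_standard, n))
</code>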

=== Property Expansion ===

The properties in the norms database are expressed by phrases such as //tastes sweet// or //is loud//, resulting from manual normalization of the subjects' responses. This raises two problems when determining whether a property generated by a model matches a property in the norms: First, all word space models we are aware of produce single orthographic //words// as properties, and these have to be matched against the //phrases// in the norms. Second, we need to undo the normalization of McRae and colleagues, so that, say, //loud//, //noise// and //noisy// will all be counted as matches against the property //is loud//.

We dealt with these issues by generating an "expansion set" of the top 10 properties of each of the 44 target concepts, i.e., a list of single-word expressions that seemed plausible ways to express the relevant property. The expansion set was prepared by first extracting from WordNet the synonyms of the words that constituted the last element of a property phrase (//red// in //is red//), and then filtering out irrelevant synonyms by hand while adding other potential matches, including inflectional variants (e.g., //legs// and //leg//) and closely related entities (//lives on water// was expanded to //aquatic, lake, ocean, river, sea, water//).
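
The automatic first step of this procedure can be sketched as follows. The sketch uses NLTK's WordNet interface, which is our assumption for illustration (only the use of WordNet itself is described above, and the real expansion was finished by hand):

<code python>
# Rough sketch of the automatic first step of the expansion procedure,
# using NLTK's WordNet interface (an assumption made for this example).
# Requires: nltk.download("wordnet"). Output still needs manual filtering.
from nltk.corpus import wordnet as wn

def candidate_expansions(property_phrase):
    """WordNet synonyms of the last word of a property phrase."""
    head = property_phrase.split()[-1]      # e.g., "red" in "is red"
    candidates = {head}
    for synset in wn.synsets(head):
        for lemma in synset.lemma_names():
            if "_" not in lemma:            # keep single-word expressions
                candidates.add(lemma.lower())
    return candidates

print(sorted(candidate_expansions("is red")))
</code>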

While we recognize the somewhat subjective nature of the expansion operation, we have no reason to think that matching against the expanded set introduces a bias in favour of or against any specific model.

When evaluating against the expansion set, there is the possibility that a model will match a property more than once (e.g., matching both //water// and //sea// against //lives on water//). In such cases, we count only the highest-ranked match, and we ignore the lower ones (i.e., lower matches are not treated as hits, but they do not contribute to the n-best count either).
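
Under this rule, scoring against the expansions can be sketched like so, again with hypothetical data structures rather than the official script's internals:

<code python>
# Sketch of the highest-match-only rule: once a gold property has been
# credited, further candidates that hit only that property are skipped
# entirely (not hits, and not counted against the n-best threshold).

def precision_with_expansions(ranked, expansions, n):
    """expansions: gold property -> set of acceptable single-word matches."""
    hits, used = 0, 0
    credited = set()                  # gold properties already matched
    for cand in ranked:
        if used == n:                 # n-best threshold reached
            break
        new_hit = next((p for p, exp in expansions.items()
                        if cand in exp and p not in credited), None)
        if new_hit is not None:
            credited.add(new_hit)
            hits += 1
            used += 1
        elif any(cand in exp for exp in expansions.values()):
            continue                  # a lower match: silently ignored
        else:
            used += 1                 # a miss still fills an n-best slot
    return hits / n

expansions = {"lives on water": {"aquatic", "lake", "ocean", "river",
                                 "sea", "water"},
              "is loud": {"loud", "noise", "noisy"}}
# "sea" repeats an already-credited property, so it neither scores nor
# uses up an n-best slot: the result is 2 hits out of 3 slots.
print(precision_with_expansions(["water", "sea", "loud", "tree"],
                                expansions, n=3))
</code>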

==== Gold standard and evaluation script ====

**NB: ON MARCH 7, WE MADE A SMALL CORRECTION TO THE PROPERTY EXPANSION FILE; IF YOU DOWNLOADED THE ARCHIVE BEFORE THIS DATE, PLEASE DOWNLOAD IT AGAIN**

This {{data:esslli2008:...|archive}} contains the gold standard files (with the property expansions described above) and the evaluation script.

Detailed information about the script can be accessed by running it with its help option:

''...''

In short, if you can organize the output of your model in a file, say ''model_output.txt'', in the format:

''...''

then you can run the evaluation (against the gold standard set with expansions generated as described above) as:

''...''
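
As a purely illustrative example of preparing such a file (the actual format expected by the script is the one shown in its help output, not necessarily the tab-separated layout assumed here):

<code python>
# Illustrative only: writing ranked property candidates to a file. The
# "concept<TAB>property" layout is an assumption made for this example;
# check the evaluation script's help output for the real input format.

ranked_properties = {                     # toy output of a word space model
    "apple": ["red", "fruit", "tree"],
    "snail": ["slow", "shell", "slimy"],
}

with open("model_output.txt", "w") as out:
    for concept, props in ranked_properties.items():
        for prop in props:                # one candidate per line, best first
            out.write(concept + "\t" + prop + "\n")
</code>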

We provide this script to have a common benchmark when comparing models, but we also encourage you to explore McRae et al.'s database for other possible ways to evaluate the models.

Back to [[data:esslli2008:...]]

Back to [[Start]]