====== Task 3: Comparison with Speaker-Generated Properties ======

==== Introduction ====

From a cognitive point of view, there is little doubt that salient properties of a concept are an important part of its "meaning", and subjects show a remarkable degree of agreement in tasks that require enumerating the typical properties of a concept: a dog //barks, has a tail, is a pet, etc.//

Psychologists have been collecting "feature norms", i.e., speaker-generated lists of concepts described in terms of properties, for decades now.

A particularly large and well-articulated list was recently made publicly available by McRae and colleagues:

McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, Instruments, & Computers, 37, 547-559.

The list can be obtained as described [[http://www.creelab.org/downloads|here]].

==== Task Operationalization ====

We operationalize the property generation task as follows.

We focus on the same set of 44 concepts used in the [[http://wordspace.collocations.de/doku.php/data:concrete_nouns_categorization|concrete noun categorization task]].

For each target concept, we pick the top 10 properties from the McRae norms (ranked by number of subjects that produced them) and use them as the gold standard set for that concept. Given the ranked output of a model, we compute precision for each concept with respect to this gold standard, at various n-best thresholds, and we average precision across the 44 concepts. We limit ourselves to the top 10 human-generated properties of each concept since, for about 10% of the target concepts, the norms only contain 10 properties (for one concept, //snail//, the norms list 9 properties).
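
As a rough illustration of the scoring scheme (ignoring, for now, the property expansions and duplicate handling described below), here is a minimal sketch in Python; ''gold'' and ''ranked'' are hypothetical dictionaries mapping each of the 44 concepts to its gold-standard property set and to the model's ranked property list, respectively. The actual scoring is performed by the Perl evaluation script distributed below.

<code python>
# Minimal sketch of the scoring scheme; expansion sets and duplicate
# handling are ignored here (see the sketches further down).
# `gold` and `ranked` are hypothetical stand-ins for the real data.

def average_precision_at_n(gold, ranked, n):
    precisions = []
    for concept, gold_props in gold.items():
        top_n = ranked[concept][:n]                        # n-best candidates
        hits = sum(1 for prop in top_n if prop in gold_props)
        precisions.append(hits / n)
    return sum(precisions) / len(precisions)               # mean over concepts

# e.g. report precision at the thresholds used by the evaluation script:
# for n in (10, 20, 30):
#     print(n, average_precision_at_n(gold, ranked, n))
</code>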

The provided evaluation script, by default, reports average precision at the 10-, 20- and 30-best thresholds.

=== Property Expansion ===

The properties in the norms database are expressed by phrases such as //tastes sweet// or //is loud//, resulting from manual normalization of the subjects' responses (McRae et al. 2005, p. 551). Thus, we face two problems when determining whether a property generated by a model matches a property in the norms: First, all word space models we are aware of produce single orthographic //words// as properties, and these have to be matched against the //phrases// in the norms. Second, we need to undo the normalization of McRae and colleagues, so that, say, //loud//, //noise// and //noisy// will all be counted as matches against the property //is loud//.

We dealt with these issues by generating an "expansion set" for each of the top 10 properties of each of the 44 target concepts, i.e., a list of single-word expressions that seemed plausible ways to express the relevant property. The expansion set was prepared by first extracting from WordNet the synonyms of the word constituting the last element of a property phrase (//red// in //is red//), and then filtering out irrelevant synonyms by hand while adding other potential matches, including inflectional and derivational variants (//leg// for //legs// and //transport// for //transportation//, respectively), as well as other semantic neighbours or closely related entities (//lives on water// was expanded to //aquatic, lake, ocean, river, sea, water//).
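
For readers who want to reproduce a similar (though not identical) first step, a rough sketch of the WordNet synonym lookup using NLTK is given below; the manual filtering and hand-added variants described above are not automated, and the function name is only an illustrative choice.

<code python>
# Rough sketch: gather WordNet synonyms of the last word of a property
# phrase (e.g. "red" in "is red") as raw candidates for an expansion set.
# Requires NLTK with the WordNet data installed; manual curation as
# described in the text is not reproduced here.
from nltk.corpus import wordnet as wn

def raw_expansion_candidates(property_phrase):
    head = property_phrase.split()[-1]            # last element of the phrase
    candidates = {head}
    for synset in wn.synsets(head):
        for lemma in synset.lemmas():
            candidates.add(lemma.name().replace("_", " ").lower())
    return sorted(candidates)

# raw_expansion_candidates("is loud") returns "loud" plus its WordNet synonyms.
</code>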

While we recognize the somewhat subjective nature of the expansion operation, we have no reason to think that matching against the expanded set introduces a bias in favour of or against any specific model.

When evaluating against the expansion set, there is the possibility that a model will match a property more than once (e.g., matching both //transport// and //transportation//). In these cases, we count the top match, and we ignore the lower ones (i.e., lower matches are not treated as hits, but they do not contribute to the n-best count either).
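
A minimal sketch of one way this rule might be implemented follows; ''expansions'' and ''ranked_props'' are hypothetical stand-ins, and the distributed Perl script remains the authoritative implementation.

<code python>
# Rough reading of the duplicate-handling rule: a candidate matching a
# not-yet-matched gold property is a hit and fills an n-best slot; a
# candidate matching nothing is a miss but still fills a slot; a candidate
# that only repeats an already matched property is skipped entirely.
# `expansions` maps each gold property (of one concept) to its expansion set.

def precision_at_n(ranked_props, expansions, n):
    matched = set()      # gold properties matched so far
    hits = slots = 0
    for prop in ranked_props:
        if slots == n:
            break
        fresh = [g for g, exp in expansions.items()
                 if prop in exp and g not in matched]
        repeat = any(prop in exp for g, exp in expansions.items() if g in matched)
        if fresh:                 # hit: count it once, use up a slot
            matched.add(fresh[0])
            hits += 1
            slots += 1
        elif repeat:              # duplicate match: skip, no slot used
            continue
        else:                     # miss: no hit, but the slot is used
            slots += 1
    return hits / n
</code>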

==== Gold standard and evaluation script ====

**NB: on March 7, we made a small correction to the property expansion file; if you downloaded the archive before this date, please download it again.**

This {{propgen.tar.gz|archive}} contains the gold standard (with property expansions as described above) and an evaluation script that computes average precision at various n-best thresholds.

Detailed information about the script can be accessed by running it with the ''-h'' option:

''evaluate-against-expanded-props.pl -h | more''

In short, if you can organize the output of your model in a file, say ''output.txt'', with lines in the format:

''concept property score''

then you can run the evaluation (against the gold standard set with expansions generated as described above) as:

''evaluate-against-expanded-props.pl expanded-props.txt output.txt''
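
For illustration only (the concepts, properties and scores below are made up), the first lines of such an output file might look like:

<code>
dog bark 24.7
dog tail 19.2
dog leash 17.8
cat whiskers 31.0
</code>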

We provide this script to have a common benchmark when comparing models, but we also encourage you to explore McRae et al.'s database for other possible ways to evaluate the models.