Task 3: Comparison with Speaker-Generated Properties

Introduction

From a cognitive point of view, there is little doubt that salient properties of a concept are an important part of its “meaning”, and subjects show a remarkable degree of agreement in tasks that require enumerating the typical properties of a concept: a dog barks, has a tail, is a pet, etc.

Psychologists have been collecting “feature norms”, i.e., speaker-generated lists of concepts described in terms of properties, for decades now.

A particularly large and well-articulated list was recently made publicly available by McRae and colleagues:

McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547-559.

The list can be obtained as described here.

Task Operationalization

We operationalize the property generation task as follows.

We focus on the same set of 44 concepts used in the concrete noun categorization task.

For each target concept, we pick the top 10 properties from the McRae norms (ranked by number of subjects that produced them) and use them as the gold standard set for that concept. Given the ranked output of a model, we compute precision for each concept with respect to this gold standard, at various n-best thresholds, and we average precision across the 44 concepts. We limit ourselves to the top 10 human-generated properties of each concept since, for about 10% of the target concepts, the norms only contain 10 properties (for one concept, snail, the norms list 9 properties).
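The averaging step can be sketched as follows. This is only an illustration, not the provided evaluation script: the function names and the toy data are hypothetical, gold properties are treated as plain strings (the phrase-versus-word matching problem is ignored here), and precision at n is assumed to be the number of hits among the top n items divided by n.

```python
from typing import Dict, List, Set


def precision_at_n(ranked: List[str], gold: Set[str], n: int) -> float:
    """Hits among the top-n ranked properties, divided by n (assumed definition)."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for prop in ranked[:n] if prop in gold)
    return hits / n


def average_precision_at_n(
    outputs: Dict[str, List[str]], golds: Dict[str, Set[str]], n: int
) -> float:
    """Mean of precision_at_n over every concept in the gold standard."""
    scores = [
        # A concept missing from the model output simply scores 0.
        precision_at_n(outputs.get(concept, []), gold, n)
        for concept, gold in golds.items()
    ]
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
golds = {"dog": {"barks", "has_a_tail", "is_a_pet"}, "snail": {"is_slow"}}
outputs = {
    "dog": ["barks", "runs", "is_a_pet", "blue"],
    "snail": ["fast", "is_slow"],
}

for n in (2, 4):
    print(n, average_precision_at_n(outputs, golds, n))
```

Under this definition, with 10 gold properties per concept, precision at the 20-best threshold cannot exceed 0.5, and at the 30-best threshold it cannot exceed about 0.33.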

The provided evaluation script, by default, reports average precision at the 10-, 20- and 30-best thresholds.

Property Expansion

The properties in the norms database are expressed by phrases such as tastes sweet or is loud, resulting from manual normalization of the subjects' responses (McRae et al. 2005, p. 551). Thus, we face two problems when determining whether a property generated by a model matches a property in the norms. First, all word space models we are aware of produce single orthographic words as properties, and these have to be matched against the phrases in the norms. Second, we need to undo the normalization of McRae and colleagues, so that, say, loud, noise and noisy will all be counted as matches against the property is loud.

We dealt with these issues by generating an “expansion set” for each of the top 10 properties of each of the 44 target concepts, i.e., a list of single-word expressions that seemed plausible ways to express the relevant property. The expansion set was prepared by first extracting from WordNet the synonyms of the word that constituted the last element of a property phrase (red in is red), and then filtering out irrelevant synonyms by hand while adding other potential matches, including inflectional and derivational variants (leg for legs and transport for transportation, respectively), as well as other semantic neighbours or closely related entities (lives on water was expanded to aquatic, lake, ocean, river, sea, water).
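One way to picture how an expansion set is used is as a lookup from single words back to the normalized property phrase. The sketch below is a hypothetical illustration: the in-memory layout, the function names and the pairing of these properties with the concept duck are all assumptions (the layout of the distributed expansion file is not described here), and lowercasing the model's word before lookup is a choice made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory expansion sets, reusing word lists quoted in the text.
# The concept "duck" is chosen only for illustration.
EXPANSIONS: Dict[str, Dict[str, List[str]]] = {
    "duck": {
        "lives on water": ["aquatic", "lake", "ocean", "river", "sea", "water"],
        "is loud": ["loud", "noise", "noisy"],
    },
}


def build_index(
    expansions: Dict[str, Dict[str, List[str]]]
) -> Dict[str, Dict[str, str]]:
    """Invert the expansion sets: concept -> (single word -> property phrase)."""
    index: Dict[str, Dict[str, str]] = {}
    for concept, props in expansions.items():
        word_to_prop = index.setdefault(concept, {})
        for prop, words in props.items():
            for word in words:
                # If two properties share a word, the first one listed wins
                # (an assumption of this sketch).
                word_to_prop.setdefault(word, prop)
    return index


def match_property(
    index: Dict[str, Dict[str, str]], concept: str, word: str
) -> Optional[str]:
    """Return the gold property matched by a model-generated word, if any."""
    return index.get(concept, {}).get(word.lower())


index = build_index(EXPANSIONS)
print(match_property(index, "duck", "noisy"))
print(match_property(index, "duck", "green"))
```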

While we recognize the somewhat subjective nature of the expansion operation, we have no reason to think that matching against the expanded set introduces a bias in favour of or against any specific model.

When evaluating against the expansion set, there is the possibility that a model will match a property more than once (e.g., matching both transport and transportation). In these cases, we count the top match, and we ignore the lower ones (i.e., lower matches are not treated as hits, but they do not contribute to the n-best count either).
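The counting rule for repeated matches can be sketched as below. This is an assumption-laden illustration rather than the logic of the provided script: the function name, the property phrases and the ranked list are hypothetical. The point it shows is that a lower-ranked repeat of an already matched property is skipped entirely, so the next item in the ranking takes its place in the n-best window.

```python
from typing import Dict, List


def hits_at_n(ranked_words: List[str], word_to_prop: Dict[str, str], n: int) -> int:
    """Count distinct gold properties matched within the n-best window.

    A word that matches an already matched property is skipped: it is not
    a hit, and it does not use up one of the n slots.
    """
    seen = set()
    counted = 0  # items that occupy a slot in the n-best window
    hits = 0
    for word in ranked_words:
        if counted == n:
            break
        prop = word_to_prop.get(word)
        if prop is not None and prop in seen:
            continue  # lower-ranked repeat: ignored altogether
        counted += 1
        if prop is not None:
            seen.add(prop)
            hits += 1
    return hits


# Hypothetical expansion words for one concept, invented for illustration.
word_to_prop = {
    "transport": "used for transportation",
    "transportation": "used for transportation",
    "wheels": "has wheels",
}
ranked = ["transport", "wheels", "transportation", "road", "fast"]

# With n = 3 the window holds transport, wheels and road;
# transportation is skipped as a repeat of an earlier match.
print(hits_at_n(ranked, word_to_prop, 3))
```

Dividing the returned count by n would then give precision at that threshold, under the assumed definition of precision as hits over n.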

Gold Standard and Evaluation Script

NB: On March 7, we made a small correction to the property expansion file; if you downloaded the archive before this date, please download it again.

This archive contains the gold standard (with property expansions as described above) and an evaluation script that computes average precision at various n-best thresholds.

Detailed information about the script can be accessed by running it with the -h option:

evaluate-against-expanded-props.pl -h | more

In short, if you can organize the output of your model in a file, say output.txt, with lines in the format:

concept property score

then you can run the evaluation (against the gold standard set with expansions generated as described above) as:

evaluate-against-expanded-props.pl expanded-props.txt output.txt
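A file in the concept property score format could be produced along the following lines. This is a minimal sketch under stated assumptions: the function names and the toy scores are hypothetical, the tab separator and the four-decimal score format are guesses, and it is not known from this description whether the script expects the lines to be pre-sorted, so the -h output should be checked for the exact requirements.

```python
from typing import Dict, List, Tuple


def format_output(scores: Dict[str, List[Tuple[str, float]]]) -> List[str]:
    """Build 'concept property score' lines, best-scoring property first."""
    lines = []
    for concept in sorted(scores):
        ranked = sorted(scores[concept], key=lambda pair: pair[1], reverse=True)
        for prop, score in ranked:
            # Tab separation is an assumption of this sketch.
            lines.append(f"{concept}\t{prop}\t{score:.4f}")
    return lines


def write_output(
    scores: Dict[str, List[Tuple[str, float]]], path: str = "output.txt"
) -> None:
    """Write the formatted lines to a file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(format_output(scores)) + "\n")


# Toy model scores, invented for illustration only.
toy_scores = {"dog": [("tail", 0.5), ("bark", 0.9)]}
print("\n".join(format_output(toy_scores)))
```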

We provide this script as a common benchmark for comparing models, but we also encourage you to explore McRae et al.'s database for other possible ways to evaluate the models.