ROMANSEVAL    
Test words
   

Composition of word list

The list is the same as the one used in the ARCADE exercise on word alignment between parallel texts. This will enable a comparison of the problems posed by monolingual sense disambiguation and translation alignment. The number of test words and the number of test contexts were determined according to feasibility constraints. 60 words will be submitted to the systems, comprising: 
  • 20 nouns,
  • 20 adjectives,
  • 20 verbs.
Each of the words appears, on average, in ca. 60 different contexts in the test corpus. This yields 3724 different contexts altogether. 

See example (the word chosen is not part of the actual test words): 
 
 

 

Selection procedure

Overview

The choice of test words is particularly difficult. Words should not be chosen according to intuition: intuition proves wrong in many cases where semantics is concerned, and there is a strong chance that experimenters will pick either special cases or, on the contrary, trivial ones, so that the selection fails to reflect the real difficulty of the WSD and translation spotting tasks. Unbiased selection criteria are not easy to find. For example, frequency alone is not a good criterion, since it has been repeatedly noted since the fifties that words tend to be mostly monosemic in a given text or domain. Random selection according to frequency criteria would therefore yield a very large proportion of uninteresting words for a test based on probing a small number of words. Another possibility would be to choose the test words according to their number of senses and/or translations in a given dictionary, but in this case it is very likely that most of these senses/translations do not appear in the test corpus.  

We therefore chose a selection process based on judgements by human informants of the polysemy of words in the test corpus. A subset of 200 words per part of speech was first selected on frequency criteria, and then submitted to a panel of informants who were asked to judge whether the words were polysemic in the corpus. It was important to keep the entire process as cheap as possible in terms of manual labour.   
 

Step 1 (segmentation)

The corpus was word-segmented, and three subsets of word forms were automatically extracted, corresponding respectively to nouns, adjectives and verbs that are not POS-ambiguous in a large dictionary (the Multext French dictionary, comprising 350,000 word forms). Restricting the selection to unambiguous forms eliminates the need for POS tagging of the corpus (and the corresponding costly hand-validation). 
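
The extraction in this step can be sketched as follows. This is a minimal illustration, not the tool actually used: the lexicon is assumed to be available as a mapping from word form to the set of POS tags it can take, and the tag names ("N", "A", "V"), the function name and the toy data are all hypothetical.

```python
from collections import Counter

def unambiguous_forms(tokens, lexicon, wanted=("N", "A", "V")):
    """Count corpus frequencies of forms that have exactly one POS in the lexicon.

    Returns {pos: Counter(form -> frequency)} for the wanted POS tags only.
    Forms missing from the lexicon, or listed with several tags, are dropped.
    """
    counts = Counter(tokens)
    subsets = {pos: Counter() for pos in wanted}
    for form, freq in counts.items():
        tags = lexicon.get(form, set())
        if len(tags) == 1:
            (pos,) = tags  # the single possible tag
            if pos in subsets:
                subsets[pos][form] = freq
    return subsets

# Toy lexicon and corpus (hypothetical data, for illustration only).
lexicon = {
    "maison": {"N"},
    "porte": {"N", "V"},   # POS-ambiguous, so it is excluded
    "lente": {"A"},
    "mangeons": {"V"},
}
tokens = "maison lente mangeons porte maison".split()
subsets = unambiguous_forms(tokens, lexicon)
```

Because ambiguous forms are simply discarded, no tagger is run and nothing has to be hand-validated, at the price of losing some candidate words.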
 

Step 2 (frequency slice)

We decided to avoid the problem of context selection, which creates biases and problems of its own, by choosing word forms with comparable frequencies in the corpus, around the desired number of 50, so that all the contexts of each test word can be tested. In each of the three POS subsets, a 200-word frequency slice was therefore chosen such that the mean frequency of the slice is 50. When two different morphological forms of the same word appeared in the frequency slice, they were pooled together, provided that their combined frequency did not exceed the limits of the frequency slice. 
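
One possible way to pick such a slice is sketched below. The text does not say how the slice was actually computed; this sketch assumes that word forms are ranked by frequency and that the window of consecutive ranks whose mean frequency is closest to the target is retained. The function name and parameters are hypothetical, and the pooling of morphological forms is not covered.

```python
def frequency_slice(freqs, size=200, target_mean=50.0):
    """Return the `size` consecutive forms (ranked by frequency) whose
    mean frequency is closest to `target_mean`.

    `freqs` maps word form -> corpus frequency.
    """
    # Rank by frequency, breaking ties alphabetically for reproducibility.
    ranked = sorted(freqs.items(), key=lambda kv: (kv[1], kv[0]))
    if len(ranked) < size:
        raise ValueError("not enough word forms for the requested slice size")
    values = [f for _, f in ranked]

    # Slide a fixed-size window over the ranking, updating the sum in O(1).
    window_sum = sum(values[:size])
    best_start = 0
    best_gap = abs(window_sum / size - target_mean)
    for start in range(1, len(ranked) - size + 1):
        window_sum += values[start + size - 1] - values[start - 1]
        gap = abs(window_sum / size - target_mean)
        if gap < best_gap:
            best_start, best_gap = start, gap
    return ranked[best_start:best_start + size]

# Toy example: 100 hypothetical forms with frequencies 1..100, slice of 5.
toy_freqs = {f"w{i}": i for i in range(1, 101)}
toy_slice = frequency_slice(toy_freqs, size=5, target_mean=50.0)
```

In the toy example the selected forms have frequencies 48 to 52, whose mean is exactly 50.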
 

Step 3 (human judgement)

Concordance lines were printed for each of the 600 word forms (i.e. a total of around 3000 contexts), and manually checked to eliminate a few undesirable cases (auxiliary verbs, POS ambiguities not recorded in the lexicon). Each concordance set was fitted onto a single page, which was presented to six informants who were unaware of the final goal. The question put to them was "According to you, does the word X have one sense or several senses in the following contexts?", and they were invited to tick the corresponding box or a "don't know" box. 

Somewhat to our surprise, none of the informants found the task difficult, and the rate of "don't know" responses was particularly low (4.05%). However, the agreement rate between pairs of informants was also low (ranging from 64.2% to 79.3%). Altogether, full agreement on polysemy was achieved on only 4.5% of the words. Conversely, 40.8% of the words were judged as having only one sense, the rest receiving mixed judgements. 
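
The kind of figures reported here can be computed as in the sketch below. The exact definition of the agreement rate is not given in the text; this sketch assumes that two informants agree on a word when they tick the same box (with "don't know" treated as a response like any other), and it encodes responses as 1 = several senses, 0 = don't know, -1 = one sense. The function names and the toy data are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(responses):
    """Share of words on which each pair of informants gave the same response.

    `responses` is a list of rows, one per word, each row holding one
    response per informant. Returns {(i, j): rate} for every pair i < j.
    """
    n_informants = len(responses[0])
    rates = {}
    for a, b in combinations(range(n_informants), 2):
        same = sum(1 for row in responses if row[a] == row[b])
        rates[(a, b)] = same / len(responses)
    return rates

def unanimous_share(responses, value):
    """Share of words for which every informant gave the response `value`."""
    hits = sum(1 for row in responses if all(r == value for r in row))
    return hits / len(responses)

# Toy data: 4 words judged by 3 informants (hypothetical responses).
toy = [
    (1, 1, 1),     # unanimous: several senses
    (1, -1, 1),    # mixed
    (-1, -1, -1),  # unanimous: one sense
    (0, -1, -1),   # one "don't know"
]
rates = pairwise_agreement(toy)
all_polysemous = unanimous_share(toy, 1)
all_monosemous = unanimous_share(toy, -1)
```

With six informants, `pairwise_agreement` returns 15 rates, and the range quoted above would correspond to the lowest and highest of them.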
 

Step 4 (translation tagging)

A score was then attributed to each word by summing up the responses (1=several senses, 0=don't know, -1=one sense). In each of the three POS subsets, the 20 words with the highest score were selected as test words for ROMANSEVAL and ARCADE. Every occurrence is being aligned with its translation by a human annotator working from scratch (and not using the output of an alignment program, which could bias her judgment).
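
The scoring rule of this step can be sketched as follows. Only the encoding of the responses and the summation come from the text; the alphabetical tie-breaking, the function name and the toy judgements are assumptions made for the sake of a runnable example.

```python
def select_test_words(judgements, n=20):
    """Rank words by the sum of their informant responses and keep the top `n`.

    `judgements` maps word -> sequence of responses, encoded as
    1 = several senses, 0 = don't know, -1 = one sense.
    Ties are broken alphabetically (an assumption, not stated in the text).
    """
    scores = {word: sum(resp) for word, resp in judgements.items()}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n], scores

# Toy data: three hypothetical words judged by six informants.
toy_judgements = {
    "word_a": (1, 1, 1, 1, 1, 1),        # score 6
    "word_b": (1, 1, 0, -1, 1, 1),       # score 3
    "word_c": (-1, -1, -1, -1, -1, -1),  # score -6
}
selected, scores = select_test_words(toy_judgements, n=2)
```

With six informants the score ranges from -6 (unanimously one sense) to 6 (unanimously several senses), so the words retained are those most consistently judged polysemic in the corpus.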