ARCADE    
Word track - test words
 

Composition of word list

The list is the same as the one used in the ROMANSEVAL exercise on word sense disambiguation (WSD). This will enable a comparison of the problems posed by monolingual sense disambiguation and translation alignment. The numbers of test words and test contexts were determined according to feasibility constraints. Sixty French words will be submitted to the systems, comprising:
  • 20 nouns,
  • 20 adjectives,
  • 20 verbs.
Each word appears, on average, in about 60 different contexts in the test corpus. This yields 3724 different word-translation alignments.

The French words and their translations can occasionally be part of multi-token expressions, which causes many difficulties both in constituting the reference corpus and in defining the evaluation metrics. These difficulties will be discussed on the ARCADE list in order to find the most reasonable solutions, and a discussion phase will be necessary after the results have been processed. The reference corpus will then be made available to the participants, who will be able to question some of the alignments, and the results will be revised accordingly.

Example (the word chosen is not part of the test words):
 
 
French: La fédération de Russie, en qualité d'État doté d'armes nucléaires, est partie au traité de non-prolifération des armes nucléaires.
English: The Russian Federation as a nuclear weapon State is a party to the Treaty on the Non-Proliferation of Nuclear Weapons.

French: Le service hongrois de la protection de la nature a fait savoir que, le 2 juin 1992, un groupe de pêcheurs croates en armes a franchi la frontière hongroise, a pénétré dans une réserve zoologique et y a tué deux mille cormorans.
English: The Hungarian nature conservation authorities announced that on 2 June 1992 a group of armed Croatian fishermen had crossed the border into Hungary, entered a nature reserve and killed two thousand cormorants.

French: L'article 12 parle de «. . . détention d'une arme à feu pendant un voyage . . ».
English: Article 12 does talk of `. . . the possession of a firearm during a journey'.

French: la Commission considère que la détention de munitions pendant un voyage intracommunautaires obéit aux mêmes règles que la détention des armes auxquelles les munitions sont destinées.
English: the Commission considers that the possession of ammunition during an intraCommunity journey is subject to the same rules as the possession of the firearms for which the ammunition is intended.

French: Selon eux, ce qu'il est convenu d'appeler la «taxe verte» sur les carburants pourrait constituer une arme dans la lutte contre le réchauffement de la planète.
English: The experts believe that a `green tax' on fuel could be an effective instrument in combating global warming.
 
 

 

Selection procedure

Overview

The choice of test words is particularly difficult. Words should not be chosen by intuition: intuition proves wrong in many cases where semantics is concerned, and experimenters are likely to pick either special cases or, on the contrary, trivial ones, so that the selection does not correctly reflect the real difficulty of the WSD and translation spotting tasks. Unbiased selection criteria are not easy to find. For example, frequency alone is not a good criterion, since it has been repeatedly noted since the 1950s that words tend to be mostly monosemic in a given text or domain. Random selection according to frequency criteria would therefore yield a very large proportion of uninteresting words for a test based on probing a small number of words. Another possibility would be to choose the test words according to their number of senses and/or translations in a given dictionary, but in that case it is very likely that most of these senses or translations would not appear in the test corpus.

We therefore chose a selection process based on judgements by human informants of the polysemy of words in the test corpus. A subset of 200 words per part of speech was first selected on frequency criteria, and then submitted to a panel of informants who were asked to judge whether the words were polysemic in the corpus. It is important that the entire process be as cheap as possible in terms of manual labour.
 

Step 1 (segmentation)

The corpus was word-segmented, and three subsets of word forms were automatically extracted corresponding respectively to nouns, adjectives and verbs that are not POS ambiguous in a large dictionary (the Multext French dictionary, comprising 350,000 word forms), in order to eliminate the need for POS tagging of the corpus (and the corresponding costly hand-validation). 
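The filtering in this step can be sketched as follows. This is only an illustration: the lexicon is represented here as hypothetical (word form, POS) pairs, the toy entries are invented for the example, and the actual format of the Multext French dictionary is not assumed.

```python
from collections import defaultdict

# The three parts of speech retained for the test (labels are illustrative).
TARGET_POS = {"noun", "adjective", "verb"}

def unambiguous_forms(lexicon_entries):
    """Map each target POS to the set of word forms that carry only that POS."""
    pos_by_form = defaultdict(set)
    for form, pos in lexicon_entries:
        pos_by_form[form.lower()].add(pos)

    subsets = {pos: set() for pos in TARGET_POS}
    for form, tags in pos_by_form.items():
        if len(tags) == 1:              # exactly one POS recorded: not ambiguous
            (pos,) = tags
            if pos in subsets:          # ignore forms outside the three categories
                subsets[pos].add(form)
    return subsets

# Toy lexicon (hypothetical entries, not taken from the Multext dictionary).
toy_lexicon = [
    ("arme", "noun"), ("arme", "verb"),            # two POS: discarded
    ("cormoran", "noun"),
    ("hongroise", "adjective"), ("hongroise", "noun"),
    ("franchi", "verb"),
    ("zoologique", "adjective"),
    ("pendant", "preposition"),                    # not a target category
]
subsets = unambiguous_forms(toy_lexicon)
```

Only the corpus word forms found in one of the three resulting sets would be kept, so no POS tagger has to be run on the corpus.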
 

Step 2 (frequency slice)

We decided to avoid the problem of context selection, which creates biases and problems of its own, by choosing word forms with comparable frequencies in the corpus, around the desired number of 60, so that all the contexts of each test word can be tested. In each of the three POS subsets, a 200-word frequency slice was therefore chosen such that the mean frequency of the slice is 60. When two different morphological forms of the same word appeared in the frequency slice, they were pooled together, provided that their combined frequency did not exceed the limits of the frequency slice.
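One possible way to pick such a slice is sketched below. It assumes that a slice is a contiguous run of word forms in frequency order and that the best slice is the one whose mean frequency is closest to the target; the text does not specify either point. The function name and the toy data are hypothetical, and the pooling of morphological forms is not covered.

```python
def frequency_slice(freqs, size, target_mean):
    """Return the contiguous run of `size` (form, frequency) pairs, in
    frequency order, whose mean frequency is closest to `target_mean`."""
    ranked = sorted(freqs.items(), key=lambda kv: (kv[1], kv[0]))
    if len(ranked) < size:
        raise ValueError("not enough word forms for a slice of this size")

    # Slide a window over the ranked list, updating its sum incrementally.
    window_sum = sum(f for _, f in ranked[:size])
    best_start, best_gap = 0, abs(window_sum / size - target_mean)
    for start in range(1, len(ranked) - size + 1):
        window_sum += ranked[start + size - 1][1] - ranked[start - 1][1]
        gap = abs(window_sum / size - target_mean)
        if gap < best_gap:
            best_start, best_gap = start, gap
    return ranked[best_start:best_start + size]

# Toy data: twelve forms with frequencies 10, 20, ..., 120.
toy_freqs = {f"w{i:02d}": 10 * i for i in range(1, 13)}
toy_slice = frequency_slice(toy_freqs, size=3, target_mean=60)
toy_mean = sum(f for _, f in toy_slice) / len(toy_slice)
```

With the figures given in the text, the call would use `size=200` and `target_mean=60` on each of the three POS subsets.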
 

Step 3 (human judgement)

Concordance lines were printed for each of the 600 word forms (i.e. a total of around 3000 contexts), and manually checked to eliminate a few undesirable cases (auxiliary verbs, POS ambiguities not recorded in the lexicon). Each concordance set was fitted on a page, which was presented to six informants who were unaware of the final goal. They were asked "According to you, does the word X have one sense or several senses in the following contexts?", and were invited to tick the corresponding box or a "don't know" box.

Somewhat to our surprise, none of the informants found the task difficult, and the rate of "don't know" responses was particularly low (4.05%). However, the agreement rate between pairs of informants was also low (ranging from 64.2% to 79.3%). Altogether, full agreement on polysemy was achieved on only 4.5% of the words. Conversely, 40.8% of the words were judged as having only one sense; the rest received mixed judgements.
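A pairwise agreement rate of this kind could be computed as in the sketch below. It assumes that the rate is simply the share of words on which two informants ticked the same box; the text does not state the exact formula, and the answer labels, informant names and toy judgements are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(judgements):
    """judgements: {informant: {word: "one" | "several" | "dont_know"}}.
    Returns {(a, b): share of commonly judged words with identical answers}."""
    rates = {}
    for a, b in combinations(sorted(judgements), 2):
        shared = judgements[a].keys() & judgements[b].keys()
        same = sum(judgements[a][w] == judgements[b][w] for w in shared)
        rates[(a, b)] = same / len(shared) if shared else float("nan")
    return rates

# Toy judgements from three informants on four words.
toy_judgements = {
    "A": {"w1": "several", "w2": "one", "w3": "one", "w4": "several"},
    "B": {"w1": "several", "w2": "one", "w3": "several", "w4": "several"},
    "C": {"w1": "one", "w2": "one", "w3": "several", "w4": "dont_know"},
}
rates = pairwise_agreement(toy_judgements)
```

The lowest and highest values in `rates` correspond to the kind of range reported above for the six informants.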
 

Step 4 (translation tagging)

A score was then attributed to each word by summing up the responses (1=several senses, 0=don't know, -1=one sense). In each part-of-speech category, the 20 words with the highest score were selected as test words for ROMANSEVAL and ARCADE. Every occurrence is being aligned with its translation by a human annotator starting from scratch (and not using the output of an alignment program, which could bias the annotator's judgement).
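The scoring rule can be illustrated with a short sketch. The weights are those given above; the answer labels, function names and toy responses are hypothetical, and the alphabetical tie-breaking is an assumption, since the text does not say how ties were resolved.

```python
# Weights from the text: several senses = 1, don't know = 0, one sense = -1.
SCORE = {"several": 1, "dont_know": 0, "one": -1}

def polysemy_scores(responses):
    """responses: {word: list of informant answers}. Returns {word: summed score}."""
    return {word: sum(SCORE[a] for a in answers)
            for word, answers in responses.items()}

def top_words(scores, n):
    """Return the n highest-scoring words (ties broken alphabetically, an assumption)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]

# Toy responses from six informants for four invented words.
toy_responses = {
    "alpha": ["several"] * 6,
    "beta": ["several", "several", "several", "one", "dont_know", "several"],
    "gamma": ["one"] * 6,
    "delta": ["several", "one", "several", "one", "dont_know", "dont_know"],
}
scores = polysemy_scores(toy_responses)
selected = top_words(scores, 2)
```

With six informants, scores range from -6 (unanimously one sense) to 6 (unanimously several senses).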