ROMANSEVAL    
Interannotator agreement
   

French

The French corpus was annotated by 6 judges in parallel and agreement was computed according to several measures: 
 

Full agreement among the six annotators 

Two variants were computed: 
 
 
Min: counts agreement when judges agree on all senses proposed for a given context
Max: counts agreement when judges agree on at least one of the senses proposed for a given context
  
Of course, these measures are biased by the number of judges: they tend to decrease asymptotically to zero as the number of judges increases, if only because of cumulative errors. However, it is still striking that for some words (correct, historique, économie, comprendre) there was full agreement on none of the roughly sixty contexts for that word! 

Note that there is not much difference between the min and max measures, apart from a few words (sûr, comprendre, importer). 
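The two variants above can be illustrated with a minimal Python sketch. It assumes that each judge's answer for a context is a set of sense labels, and it reads "agree on at least one of the senses" as "at least one sense is common to all judges"; both the data layout and that reading are assumptions, as are all names and the toy data (three judges instead of six, for brevity).

```python
from typing import Callable, FrozenSet, Sequence

# One set of sense labels per judge, for a single context (assumed layout).
Judgments = Sequence[FrozenSet[str]]


def full_agreement_min(judgments: Judgments) -> bool:
    """Min variant: every judge proposed exactly the same set of senses."""
    first = judgments[0]
    return all(j == first for j in judgments)


def full_agreement_max(judgments: Judgments) -> bool:
    """Max variant (assumed reading): at least one sense is shared by all judges."""
    common = set(judgments[0]).intersection(*judgments[1:])
    return len(common) > 0


def agreement_rate(contexts: Sequence[Judgments],
                   agree: Callable[[Judgments], bool]) -> float:
    """Proportion of contexts on which the judges agree under the given variant."""
    return sum(1 for c in contexts if agree(c)) / len(contexts)


# Hypothetical toy data: 4 contexts, 3 judges each.
TOY_CONTEXTS = [
    [frozenset({"1"}), frozenset({"1"}), frozenset({"1"})],
    [frozenset({"1", "2"}), frozenset({"1"}), frozenset({"1", "3"})],
    [frozenset({"1"}), frozenset({"2"}), frozenset({"1"})],
    [frozenset({"2", "3"}), frozenset({"2", "3"}), frozenset({"2", "3"})],
]

if __name__ == "__main__":
    print(agreement_rate(TOY_CONTEXTS, full_agreement_min))  # 0.5
    print(agreement_rate(TOY_CONTEXTS, full_agreement_max))  # 0.75
```

On the toy data, the second context counts as agreement only under the max variant, which is why the max rate can never be lower than the min rate.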
 

Pairwise agreement  

This measure is preferable, since it is not biased by the number of judges as the previous one is. Three variants were computed: 
 
 
Min: counts agreement when the two judges agree on all senses proposed for a given context
Max: counts agreement when the two judges agree on at least one of the senses proposed for a given context
Weighted: accounts for partial agreement using the Dice coefficient, where A and B are the sets of senses proposed by the two judges:
 
2 * |A ∩ B| / (|A| + |B|)
  
Again, there is not much difference between the measures, apart from a few words, interestingly enough not exactly the same as before (chef, comprendre, connaître). 
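As a rough illustration of the three pairwise variants, the sketch below averages a per-pair score over every pair of judges and every context. The averaging scheme, the handling of two empty sense sets, the function names and the toy data are all assumptions made for the example, not details taken from the original computation.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Sequence

Senses = FrozenSet[str]


def pair_min(a: Senses, b: Senses) -> float:
    """Min variant: the two judges proposed exactly the same senses."""
    return 1.0 if a == b else 0.0


def pair_max(a: Senses, b: Senses) -> float:
    """Max variant: the two judges share at least one sense."""
    return 1.0 if a & b else 0.0


def pair_dice(a: Senses, b: Senses) -> float:
    """Weighted variant: Dice coefficient 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumption: two empty answers count as full agreement
    return 2 * len(a & b) / (len(a) + len(b))


def pairwise_agreement(contexts: Sequence[Sequence[Senses]],
                       score: Callable[[Senses, Senses], float]) -> float:
    """Mean score over all judge pairs in all contexts (assumed averaging)."""
    scores = [score(a, b)
              for judgments in contexts
              for a, b in combinations(judgments, 2)]
    return sum(scores) / len(scores)


# Hypothetical toy data: 4 contexts, 3 judges each.
TOY_CONTEXTS = [
    [frozenset({"1"}), frozenset({"1"}), frozenset({"1"})],
    [frozenset({"1", "2"}), frozenset({"1"}), frozenset({"1", "3"})],
    [frozenset({"1"}), frozenset({"2"}), frozenset({"1"})],
    [frozenset({"2", "3"}), frozenset({"2", "3"}), frozenset({"2", "3"})],
]

if __name__ == "__main__":
    for name, fn in [("min", pair_min), ("max", pair_max), ("dice", pair_dice)]:
        print(name, round(pairwise_agreement(TOY_CONTEXTS, fn), 4))
```

For any pair of answers the Dice score lies between the min and max scores, so the weighted measure always falls between the other two.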
 

Agreement corrected for chance 

The measures above are not completely satisfactory, because they do not allow observed agreement to be compared with the agreement that would be obtained by pure chance. The kappa statistic (Cohen, 1960) enables such a comparison. It is computed as 
 

(observed agreement - chance agreement) / (1 - chance agreement)

In our case, the kappa statistic was computed on the weighted pairwise measure using the kappa extension for partial agreement proposed in Cohen (1968). This coefficient ranges between 0 when agreement is no better than chance and 1 when there is perfect agreement (it can also become negative in case of systematic disagreement). 
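The chance correction can be sketched in a few lines of Python. Note that this is the plain kappa of Cohen (1960) for two judges who each give a single label per context, with chance agreement estimated from each judge's label frequencies; it is not the weighted extension of Cohen (1968) used for the results reported here. Function names and example labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def kappa(observed: float, chance: float) -> float:
    """(observed agreement - chance agreement) / (1 - chance agreement).

    Undefined (division by zero) when chance agreement equals 1.
    """
    return (observed - chance) / (1.0 - chance)


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Unweighted kappa for two judges, one label per context (simplified case)."""
    n = len(labels_a)
    # Observed agreement: proportion of contexts with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both judges pick the same label
    # if each draws independently from their own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return kappa(observed, chance)


if __name__ == "__main__":
    # Hypothetical labels: observed = 0.75, chance = 0.5, so kappa = 0.5.
    print(cohen_kappa(["1", "1", "2", "2"], ["1", "1", "2", "1"]))
    # Systematic disagreement yields a negative value.
    print(cohen_kappa(["1", "2"], ["2", "1"]))
```

The first example shows how a raw agreement of 75% shrinks to 0.5 once the 50% agreement expected by chance is discounted.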

It is interesting to note that kappa ranges between 0.92 and 0.01. In other words, for some words there is no more agreement than would be expected by chance! 

The kappa per category is as follows: 
 
 
Adjectives 0.41
Nouns 0.46
Verbs 0.41
These values are low, and indicate an enormous amount of disagreement between judges. 
 

The detailed results are available for each word. 

The tables also give the average number of senses per judge and per context (column Nsen). 

 

 

References

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46. 

Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220. 
 

 

Acknowledgements

I would like to thank Rebecca Bruce and Jean Carletta for interesting discussions on interannotator agreement, and my student Corinne Jean for her help on the computations.