| ROMANSEVAL
Interannotator agreement |
| The French corpus was annotated
by 6 judges in parallel and agreement was computed according to several
measures:
Full agreement among the six annotators. Two variants were computed:
Of course, these measures are biased by the number of judges: they tend to decrease asymptotically to zero as the number of judges increases, if only because of cumulative errors. However, it is still striking that for some words (correct, historique, économie, comprendre) there was full agreement on none of the sixty or so contexts for that word! Note that there is not much
difference between the min and max measures, apart from a
few words (sûr, comprendre, importer).
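As an illustration only, here is a minimal Python sketch of a full-agreement rate. It assumes that each judge assigns a set of sense labels to each context, and it uses two hypothetical variants: a strict one (all sets identical) and a lenient one (all sets share at least one sense). These may not match the min and max measures used above. The function name, the `strict` flag and the toy data are invented for the example.

```python
def full_agreement_rate(contexts, strict=True):
    """Fraction of contexts on which all judges agree.

    contexts: list of contexts, each a list holding one set of sense
              labels per judge (hypothetical data layout).
    strict=True  -> every judge gave exactly the same set of senses.
    strict=False -> the judges' sets have at least one sense in common.
    """
    agreed = 0
    for answers in contexts:
        if strict:
            ok = all(a == answers[0] for a in answers)
        else:
            common = set(answers[0]).intersection(*answers[1:])
            ok = len(common) > 0
        agreed += int(ok)
    return agreed / len(contexts)


# Toy data: 4 contexts, 3 judges (not taken from the corpus).
contexts = [
    [{"1"}, {"1"}, {"1"}],        # identical answers
    [{"1", "2"}, {"1"}, {"1"}],   # overlapping but not identical
    [{"1"}, {"2"}, {"1"}],        # one judge disagrees
    [{"3"}, {"3"}, {"3"}],        # identical answers
]

print(full_agreement_rate(contexts, strict=True))   # 0.5
print(full_agreement_rate(contexts, strict=False))  # 0.75
```

A single dissenting judge is enough to make a context count as a disagreement, which is why this kind of measure can only go down as judges are added.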
Pairwise agreement. This measure is preferable,
since it is not biased in the way the previous one is. Three variants were computed:
Again, there is not much difference between the measures, apart from a few words, interestingly enough not exactly the same as before (chef, comprendre, connaître).
Agreement corrected for chance. The measures above are not
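completely satisfactory, because they do not allow observed agreement
to be compared with the agreement that would be obtained by pure chance. The kappa
statistic (Cohen, 1960) enables such a comparison. It is computed
as
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
where $P_o$ is the observed proportion of agreement and $P_e$ is the proportion of agreement expected by chance.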
In our case, the kappa statistic was computed on the weighted pairwise measure, using the extension of kappa to partial agreement proposed in Cohen (1968). This coefficient ranges from 0, when agreement is no better than chance, to 1, when there is perfect agreement (it can also become negative in case of systematic disagreement). It is interesting to note that kappa ranges from 0.92 down to 0.01. In other words, for some words there is no more agreement than chance! The kappa per category is
as follows:
The detailed results are available for each word:
The tables also give the average number of senses per judge and per context (column Nsen).
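To make the difference between raw pairwise agreement and chance-corrected agreement concrete, here is a minimal Python sketch. It is a simplification, not the computation used for the results above: it assumes one sense label per judge and per context, and it uses the unweighted kappa of Cohen (1960) averaged over pairs of judges, rather than the weighted extension of Cohen (1968). Kappa is taken as (p_o - p_e) / (1 - p_e), with p_o the observed agreement and p_e the agreement expected by chance from each judge's label frequencies. All function names and data are invented for the example.

```python
from collections import Counter
from itertools import combinations


def observed_agreement(a, b):
    """Proportion of contexts on which two judges chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen (1960) kappa for two judges, one label per context."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: sum over labels of the product of the two
    # judges' marginal proportions for that label.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both judges used one identical label
    return (p_o - p_e) / (1.0 - p_e)


def mean_pairwise(judges, measure):
    """Average a two-judge measure over all pairs of judges."""
    pairs = list(combinations(judges, 2))
    return sum(measure(a, b) for a, b in pairs) / len(pairs)


# Toy data: 3 judges, 4 contexts, 2 senses (not taken from the corpus).
judge_a = ["1", "2", "1", "2"]
judge_b = ["1", "1", "2", "2"]
judge_c = ["1", "2", "1", "2"]

print(observed_agreement(judge_a, judge_b))  # 0.5
print(cohen_kappa(judge_a, judge_b))         # 0.0
print(mean_pairwise([judge_a, judge_b, judge_c], observed_agreement))
print(mean_pairwise([judge_a, judge_b, judge_c], cohen_kappa))
```

In the toy data, judges A and B agree on half of the contexts, yet their kappa is 0: with two equally frequent senses, that level of agreement is exactly what chance alone would produce.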
|
| Cohen, J. (1960). A coefficient
of agreement for nominal scales. Educational and Psychological Measurement,
20, 37-46.
Cohen, J. (1968). Weighted
kappa: nominal scale agreement with provision for scaled disagreement or
partial credit. Psychological Bulletin, 70(4), 213-220.
|
| I would like to thank Rebecca
Bruce and Jean Carletta for interesting discussions on interannotator agreement,
and my student Corinne Jean for her help on the computations.
|