ARCADE
Sentence track metrics
The first campaign used four different metrics, all based on precision and recall but computed at different levels of granularity. We can use the same metrics in the second campaign, which will enable direct comparison with the new systems and show any improvement in the systems that participated in the first phase. Of course, discussion is open (and has already started on the discussion list), and additional measures can be computed.
In this document, we first propose a formal definition of parallel text alignment. Based on that definition, the usual notions of recall and precision can be used to evaluate the quality of a given alignment with respect to a reference. However, recall and precision can be computed at various levels of granularity: an alignment at a given level (e.g. sentences) can be measured in terms of units of a lower level (e.g. words, characters). Such a finer-grained measure is less sensitive to segmentation problems, and can be used to weight errors according to the number of sub-units they span. The ideas presented here largely result from the discussions among participants during the first campaign.
This definition is a generalisation, proposed by J. Véronis, of a definition given by P. Isabelle and M. Simard in an internal report (Isabelle and Simard, 1996).
If we consider a text S and its translation T as two sets of segments S = {s1, s2, ..., sn} and T = {t1, t2, ..., tm}, an alignment A between S and T can be defined as a subset of the Cartesian product 2^S X 2^T, where 2^S and 2^T are respectively the sets of all subsets of S and of T. The triple (S, T, A) will be called a bitext. Each of the elements (ordered pairs) of the alignment will be called a bisegment.

This definition is fairly general. In addition to common alignment types (such as those produced by the Gale-Church method), it allows inversions, overlaps and discontinuous matches. The segments can be linguistic units (e.g. paragraphs, sentences, words, characters) or be arbitrary. In the evaluation described here, segments were sentences, and all segments were supposed to be contiguous, yielding monotonic alignments.

For instance, let us consider the following alignment, which will serve as the reference alignment in the subsequent examples:
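The definition above can also be written down directly as data. The following Python sketch is only an illustration: the names `bisegment`, `is_valid_alignment` and `A_ref` are hypothetical, and the concrete reference alignment is an assumption, namely the one implied by the figures quoted in the later examples ({s1} aligned with {t1}, and {s2} aligned with {t2, t3}).

```python
# Segments are identified by name. A bisegment is an ordered pair made of a
# (possibly empty) set of source segments and a (possibly empty) set of
# target segments, i.e. an element of 2^S X 2^T.
S = ["s1", "s2"]
T = ["t1", "t2", "t3"]


def bisegment(src, tgt):
    """Build a hashable bisegment from two collections of segment names."""
    return (frozenset(src), frozenset(tgt))


# Assumed reference alignment (not taken verbatim from the source).
A_ref = {
    bisegment(["s1"], ["t1"]),
    bisegment(["s2"], ["t2", "t3"]),
}


def is_valid_alignment(alignment, source, target):
    """Check that every bisegment only uses segments of the bitext."""
    source_set, target_set = set(source), set(target)
    return all(src <= source_set and tgt <= target_set
               for src, tgt in alignment)


print(is_valid_alignment(A_ref, S, T))
```

Using frozensets keeps bisegments hashable, so an alignment can itself be a Python set and be compared with another alignment by plain set intersection.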
Let us consider a bitext (S, T, Ar) and a proposed alignment A. The recall of alignment A with respect to the reference Ar is defined as:

recall = |Ar inter A| / |Ar|

("inter" is the set-theoretical intersection). It represents the proportion of bisegments of the reference Ar that are found in A. The silence corresponds to 1 - recall.

The precision of alignment A with respect to the reference Ar is defined as:

precision = |Ar inter A| / |A|

It represents the proportion of bisegments in A that are correct with respect to the total of those proposed. The noise corresponds to 1 - precision.

We will also use the F-measure (Van Rijsbergen, 1979), which combines recall and precision in a single efficiency measure (it is the harmonic mean of precision and recall):

F = 2 * (recall * precision) / (recall + precision)
We note that recall and precision of alignment A with respect to Ar are 1/2 = 0.50 and 1/3 = 0.33 respectively. The F-measure is 0.40.

Improving recall and improving precision are antagonistic goals: efforts to improve one often result in degrading the other. Depending on the application, different trade-offs can be sought. For example, if the bisegments are used to automatically generate a bilingual dictionary, maximising precision (i.e. omitting uncertain pairs) is likely to be the preferred option.

Recall and precision as defined above are rather severe. They do not take into account the fact that some bisegments may be partially correct. In the previous example, the bisegment ({s2}, {t3}) does not belong to the reference, but can be considered partially correct: t3 does match a part of s2. To take partial correctness into account, we need to compute recall and precision at the sentence level instead of the alignment level. Assuming that A = {a1, a2, ..., am} and Ar = {ar1, ar2, ..., arn}, with ai = (as,i, at,i) and arj = (ars,j, art,j), we can derive the following sentence-to-sentence alignments:

A' = union over i of (as,i X at,i)
Ar' = union over j of (ars,j X art,j)

Sentence-level recall and precision can thus be defined in the following way:

recall = |Ar' inter A'| / |Ar'|
precision = |Ar' inter A'| / |A'|

In the example above, sentence-level recall and precision are therefore 2/3 = 0.67 and 1 respectively, to be compared to the alignment-level recall and precision, 0.50 and 0.33 respectively. The F-measure becomes 0.80 instead of 0.40.
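The two levels of measurement can be checked numerically. The Python sketch below is illustrative only: the helper names (`bisegment`, `prf`, `sentence_pairs`) are hypothetical, and the reference and proposed alignments are assumptions chosen to be consistent with the values quoted above, since the example alignments themselves are not reproduced in this text.

```python
def bisegment(src, tgt):
    """A bisegment: (set of source sentences, set of target sentences)."""
    return (frozenset(src), frozenset(tgt))


def prf(reference, proposed):
    """Recall, precision and F-measure of `proposed` against `reference`.

    Both arguments are sets; the same function serves at the alignment
    level (sets of bisegments) and at the sentence level (sets of pairs).
    """
    common = len(reference & proposed)
    recall = common / len(reference) if reference else 0.0
    precision = common / len(proposed) if proposed else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


def sentence_pairs(alignment):
    """Expand each bisegment into the Cartesian product of its two sides."""
    return {(s, t) for src, tgt in alignment for s in src for t in tgt}


# Assumed example alignments (consistent with the figures in the text).
A_ref = {bisegment(["s1"], ["t1"]), bisegment(["s2"], ["t2", "t3"])}
A = {
    bisegment(["s1"], ["t1"]),
    bisegment([], ["t2"]),        # t2 left unaligned
    bisegment(["s2"], ["t3"]),    # partially correct bisegment
}

alignment_level = prf(A_ref, A)
sentence_level = prf(sentence_pairs(A_ref), sentence_pairs(A))
print("alignment level:", [round(v, 2) for v in alignment_level])
print("sentence level: ", [round(v, 2) for v in sentence_level])
```

A bisegment with an empty side contributes no sentence pairs, which is why the unaligned t2 lowers sentence-level recall but not sentence-level precision.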
In the definitions above, the sentence is the unit of granularity used for the computation of recall and precision at both levels. This results in two difficulties. First, the measures are very sensitive to sentence segmentation errors. Secondly, they do not reflect the seriousness of misalignments: it seems that errors involving short sentences should be penalised less than errors involving longer ones, at least from an application perspective.
These problems can be avoided by taking advantage of the fact that a unit of a given granularity (e.g. a sentence) can always be seen as a (possibly discontinuous) sequence of units of smaller granularity (e.g. characters). Thus, when an alignment A is compared to a reference alignment Ar using the recall and precision measures computed at the character level, the values obtained decrease with the quantity of text (i.e. the number of characters) in the misaligned sentences rather than with the number of these misaligned sentences. For instance, in the example used above, we would have:

At the word level:
|A'| = 5*6 + 0*5 + 11*6 = 96
|Ar' inter A'| = 96
recall = 96/140 = 0.69

At the character level:
|A'| = 29*27 + 0*28 + 60*24 = 2223
|Ar' inter A'| = 2223
recall = 2223/3903 = 0.57
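The character-level figures can be reproduced by weighting each sentence pair by the product of the lengths of its two sentences. The sketch below is an illustration under assumptions: `weighted_size` is a hypothetical helper, the sentence pairs are those of the assumed example (reference pairs s1-t1, s2-t2, s2-t3; proposed pairs s1-t1, s2-t3), and the assignment of the character counts 29, 60, 27, 28 and 24 to s1, s2, t1, t2 and t3 is inferred from the products quoted above.

```python
def weighted_size(pairs, src_len, tgt_len):
    """Number of (source unit, target unit) pairs spanned by sentence pairs.

    Each sentence pair (s, t) counts for len(s) * len(t) sub-unit pairs.
    """
    return sum(src_len[s] * tgt_len[t] for s, t in pairs)


# Assumed sentence-level alignments of the running example.
ref_pairs = {("s1", "t1"), ("s2", "t2"), ("s2", "t3")}
hyp_pairs = {("s1", "t1"), ("s2", "t3")}

# Character counts inferred from the products quoted in the text.
src_chars = {"s1": 29, "s2": 60}
tgt_chars = {"t1": 27, "t2": 28, "t3": 24}

common = weighted_size(ref_pairs & hyp_pairs, src_chars, tgt_chars)
ref_size = weighted_size(ref_pairs, src_chars, tgt_chars)
hyp_size = weighted_size(hyp_pairs, src_chars, tgt_chars)

recall = common / ref_size
precision = common / hyp_size
print(common, ref_size, round(recall, 2), round(precision, 2))
```

With these lengths the missing pair s2-t2 accounts for 60*28 = 1680 of the 3903 reference character pairs, so the error weighs more heavily than at the sentence level, where it is one pair out of three.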
Isabelle, P., Simard, M. (1996). Propositions pour la représentation et l'évaluation des alignements de textes parallèles dans l'ARC A2. Rapport technique, CITI, Laval, Canada.
Van Rijsbergen, C. J. (1979). Information Retrieval. 2nd edition, London, Butterworths.