ARCADE   
Sentence track metrics
 
 

Introduction

The first campaign used four different metrics, all based on precision and recall but computed at different levels of granularity. The same metrics can be used in the second campaign, which will allow direct comparison with the new systems and will show any improvement in the systems that participated in the first phase. Of course, the discussion is open (and has already started on the discussion list), and additional measures can be computed. 

In this document, we first propose a formal definition of parallel text alignment. Based on that definition, the usual notions of recall and precision can be used to evaluate the quality of a given alignment with respect to a reference. However, recall and precision can be computed at various levels of granularity: an alignment at a given level (e.g. sentences) can be measured in terms of units of a lower level (e.g. words, characters). Such a finer-grained measure is less sensitive to segmentation problems, and can be used to weight errors according to the number of sub-units they span. 

The ideas presented here largely result from the discussions among participants during the first campaign.

 

Formal definition

This definition is a generalisation, proposed by J. Véronis, of a definition given by P. Isabelle and M. Simard in an internal report (Isabelle and Simard, 1996). 

If we consider a text S and its translation T as two sets of segments S = {s1, s2, ..., sn} and T = {t1, t2, ..., tm}, an alignment A between S and T can be defined as a subset of the Cartesian product 2^S X 2^T, where 2^S and 2^T are respectively the sets of all subsets of S and of T. The triple (S, T, A) will be called a bitext. Each of the elements (ordered pairs) of the alignment will be called a bisegment. 

This definition is fairly general. In addition to common alignment types (such as those produced by the Gale-Church method), it allows inversions, overlaps and discontinuous matches. The segments can be linguistic units (e.g. paragraphs, sentences, words, characters) or arbitrary units. In the evaluation described here, the segments were sentences, and all segments were assumed to be contiguous, yielding monotonic alignments. 

For instance, let us consider the following alignment, which will serve as the reference alignment in the subsequent examples: 
 
[s1] Ceci est la phrase numéro un. [t1] This is the first sentence.
[s2] Ceci est la phrase numéro deux,  qui ressemble à la première.  [t2] This is the second sentence. [t3] It looks like the first.
The formal representation of this alignment is: 

Ar = { ({s1}, {t1}), ({s2}, {t2, t3}) }
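
As an illustration, the reference alignment above can be written down directly as a set of pairs of subsets. The following is only a minimal sketch: the helper names bisegment and is_alignment are hypothetical, and representing segments by string labels is an assumption made for the example.

```python
def bisegment(source, target):
    """Build a bisegment as a hashable pair of frozensets of segment labels."""
    return (frozenset(source), frozenset(target))


def is_alignment(alignment, source_segments, target_segments):
    """Check that every bisegment is an element of 2^S x 2^T."""
    return all(
        src <= source_segments and tgt <= target_segments
        for src, tgt in alignment
    )


# Segment labels of the example bitext (assumed names).
S = {"s1", "s2"}
T = {"t1", "t2", "t3"}

# Ar = { ({s1}, {t1}), ({s2}, {t2, t3}) }
reference = {
    bisegment({"s1"}, {"t1"}),
    bisegment({"s2"}, {"t2", "t3"}),
}

print(is_alignment(reference, S, T))
```

Frozensets are used here so that bisegments can themselves be members of a set, which keeps the later set-theoretical operations (intersection, union) one-liners.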
 
 

Recall and precision

Let us consider a bitext (S, T, Ar), and a proposed alignment A. The recall of alignment A with respect to the reference Ar is defined as:
recall = |A inter Ar| / |Ar|.

("inter" is the set-theoretical intersection). 

It represents the proportion of the bisegments of the reference Ar that are found in A. The silence corresponds to 1 - recall.

The precision of alignment A with respect to the reference Ar is defined as:

precision = |A inter Ar| / |A|.

It represents the proportion of the bisegments proposed in A that are correct with respect to the reference Ar. The noise corresponds to 1 - precision.

We will also use the F-measure (Van Rijsbergen, 1979), which combines recall and precision in a single efficiency measure (it is the harmonic mean of precision and recall):

F = 2 * (recall * precision) / (recall + precision)
 
 
For example, let us consider the following alignment A proposed by a system:

[s1] Ceci est la phrase numéro un. [t1] This is the first sentence.
  [t2] This is the second sentence. 
[s2] Ceci est la phrase numéro deux,  qui ressemble à la première.  [t3] It looks like the first.
The formal representation of this alignment is: 
A = { ({s1}, {t1}), ({}, {t2}), ({s2}, {t3}) }

We note that: 

A inter Ar = { ({s1}, {t1}) } 

Recall and precision of alignment A with respect to Ar are 1/2 = 0.50 and 1/3 = 0.33 respectively. The F-measure is 0.40. 
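
The figures above can be checked with a few lines of code. This is a sketch under assumptions rather than a reference implementation: the names bisegment and alignment_scores are hypothetical, segments are represented by string labels, and the convention of returning 0 for an empty alignment is our own choice.

```python
def bisegment(source, target):
    """Build a bisegment as a hashable pair of frozensets of segment labels."""
    return (frozenset(source), frozenset(target))


def alignment_scores(proposed, reference):
    """Return (recall, precision, F) computed over whole bisegments."""
    correct = len(proposed & reference)  # |A inter Ar|
    # Returning 0.0 for an empty alignment is an assumed convention.
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(proposed) if proposed else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


# Ar = { ({s1}, {t1}), ({s2}, {t2, t3}) }
reference = {
    bisegment({"s1"}, {"t1"}),
    bisegment({"s2"}, {"t2", "t3"}),
}
# A = { ({s1}, {t1}), ({}, {t2}), ({s2}, {t3}) }
proposed = {
    bisegment({"s1"}, {"t1"}),
    bisegment(set(), {"t2"}),
    bisegment({"s2"}, {"t3"}),
}

recall, precision, f_measure = alignment_scores(proposed, reference)
print(f"recall={recall:.2f} precision={precision:.2f} F={f_measure:.2f}")
# prints: recall=0.50 precision=0.33 F=0.40
```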

Improving recall and improving precision are antagonistic goals: efforts to improve one often result in degrading the other. Depending on the application, different trade-offs can be sought. For example, if the bisegments are used to automatically generate a bilingual dictionary, maximising precision (i.e. omitting uncertain pairs) is likely to be the preferred option. 

Recall and precision as defined above are rather severe. They do not take into account the fact that some bisegments could be partially correct. In the previous example, the bisegment ({s2}, {t3}) does not belong to the reference, but can be considered as partially correct: t3 does match a part of s2. To take partial correctness into account, we need to compute recall and precision at the sentence level instead of the alignment level.

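Assuming that A = {a1, a2, ..., am} and Ar = {ar1, ar2, ..., arn}, with ai = (as,i, at,i) and arj = (ar s,j, ar t,j), we can derive the following sentence-to-sentence alignments: 

A' = union_i (as,i X at,i)
Ar' = union_j (ar s,j X ar t,j)

("union" is the set-theoretical union and X the Cartesian product, so that each bisegment contributes every pair made of one of its source sentences and one of its target sentences.)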

Sentence-level recall and precision can thus be defined in the following way : 

recall = | A' inter Ar' | / | Ar' |
precision = | A' inter Ar' | / | A' |

In the example above : 

Ar' = { (s1, t1), (s2, t2) , (s2, t3) }
A' = { (s1, t1), (s2, t3) }

Sentence-level recall and precision on this example are therefore 2/3 = 0.67 and 1 respectively, to be compared to the alignment-level recall and precision, 0.50 and 0.33 respectively. The F-measure becomes 0.80 instead of 0.40. 
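
The sentence-level computation can be sketched as follows. The function names sentence_pairs and sentence_scores are hypothetical, segments are represented by string labels, and returning 0 for an empty alignment is an assumed convention. Note that a bisegment with an empty side, such as ({}, {t2}), contributes no sentence pair.

```python
from itertools import product


def sentence_pairs(alignment):
    """Expand an alignment into the union of the products source x target."""
    pairs = set()
    for source, target in alignment:
        pairs.update(product(source, target))
    return pairs


def sentence_scores(proposed, reference):
    """Return (recall, precision, F) computed over sentence pairs."""
    a_pairs = sentence_pairs(proposed)   # A'
    r_pairs = sentence_pairs(reference)  # Ar'
    common = len(a_pairs & r_pairs)
    # Returning 0.0 for an empty alignment is an assumed convention.
    recall = common / len(r_pairs) if r_pairs else 0.0
    precision = common / len(a_pairs) if a_pairs else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


# Bisegments are pairs of frozensets of segment labels.
reference = {
    (frozenset({"s1"}), frozenset({"t1"})),
    (frozenset({"s2"}), frozenset({"t2", "t3"})),
}
proposed = {
    (frozenset({"s1"}), frozenset({"t1"})),
    (frozenset(), frozenset({"t2"})),
    (frozenset({"s2"}), frozenset({"t3"})),
}

recall, precision, f_measure = sentence_scores(proposed, reference)
print(sorted(sentence_pairs(proposed)))  # [('s1', 't1'), ('s2', 't3')]
print(recall, precision, f_measure)      # recall is 2/3, precision is 1.0
```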

Granularity

In the definitions above, the sentence is the unit of granularity used for the computation of recall and precision at both levels. This results in two difficulties. First, the measures are very sensitive to sentence segmentation errors. Secondly, they do not reflect the seriousness of misalignments: it seems that errors involving short sentences should be penalised less than errors involving longer ones, at least from the point of view of applications. 

These problems can be avoided by taking advantage of the fact that a unit of a given granularity (e.g. sentence) can always be seen as a (possibly discontinuous) sequence of units of smaller granularity (e.g. character).  

Thus, when an alignment A is compared to a reference alignment Ar using recall and precision computed at the character level, the values obtained reflect the quantity of text (i.e. the number of characters) in the misaligned sentences rather than the number of misaligned sentences. Each sentence pair (s, t) then counts for the product of the lengths of s and t, measured in the chosen unit. For instance, in the example used above, we would have at sentence level:

  • using word granularity: 
|Ar'| = 5*6 + 11*10 = 140
|A'| = 5*6 + 0*5 + 11*5 = 85
|Ar' inter A'| = 85

recall = 85/140 = 0.61
precision = 1
F = 0.76

  • using character granularity (including spaces): 
|Ar'| = 29*27 + 60*52 = 3903
|A'| = 29*27 + 0*28 + 60*24 = 2223
|Ar' inter A'| = 2223

recall = 2223/3903 = 0.57
precision = 1
F = 0.73
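
The character-granularity figures can be reproduced with the sketch below, in which each sentence pair is weighted by the product of the lengths of its two sentences. The names TEXT, char_length and weighted_scores are hypothetical, and the Unicode normalisation step is an assumption added so that accented letters count as one character each. A function counting words could be passed in place of char_length to obtain the word-granularity variant.

```python
import unicodedata
from itertools import product

# Text of each segment of the example bitext.
TEXT = {
    "s1": "Ceci est la phrase numéro un.",
    "s2": "Ceci est la phrase numéro deux, qui ressemble à la première.",
    "t1": "This is the first sentence.",
    "t2": "This is the second sentence.",
    "t3": "It looks like the first.",
}


def char_length(segment):
    """Number of characters (spaces included) in a segment, NFC-normalised."""
    return len(unicodedata.normalize("NFC", TEXT[segment]))


def sentence_pairs(alignment):
    """Expand an alignment into the union of the products source x target."""
    pairs = set()
    for source, target in alignment:
        pairs.update(product(source, target))
    return pairs


def weighted_scores(proposed, reference, length):
    """Return (recall, precision, F), weighting each pair by length(s) * length(t)."""
    def mass(pairs):
        return sum(length(s) * length(t) for s, t in pairs)

    a_pairs, r_pairs = sentence_pairs(proposed), sentence_pairs(reference)
    common = mass(a_pairs & r_pairs)
    # Returning 0.0 when an alignment has no mass is an assumed convention.
    recall = common / mass(r_pairs) if r_pairs else 0.0
    precision = common / mass(a_pairs) if a_pairs else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


reference = {
    (frozenset({"s1"}), frozenset({"t1"})),
    (frozenset({"s2"}), frozenset({"t2", "t3"})),
}
proposed = {
    (frozenset({"s1"}), frozenset({"t1"})),
    (frozenset(), frozenset({"t2"})),
    (frozenset({"s2"}), frozenset({"t3"})),
}

recall, precision, f_measure = weighted_scores(proposed, reference, char_length)
print(f"recall={recall:.2f} precision={precision:.2f} F={f_measure:.2f}")
# prints: recall=0.57 precision=1.00 F=0.73
```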

References

Isabelle, P., Simard, M. (1996). Propositions pour la représentation et l'évaluation des alignements de textes parallèles dans l'ARC A2. Rapport technique, CITI, Laval, Canada. 

Van Rijsbergen, C. J. (1979). Information Retrieval. 2nd edition, London, Butterworths.