ARCADE    
Sentence track - Overview
 

Introduction

In the first round (1998) of the second campaign, it is suggested, for feasibility reasons, to adopt the same corpus and protocol as in the first campaign. Since new systems are participating, this will enable us to compare all systems and to check whether the systems that competed in the first campaign have improved (of course, the two groups will be evaluated separately). It is expected that, during 1998, the corpus providers will develop new corpora to be used in the second round (1999).

Procedure

The sentence track will take place in several steps, according to a schedule that will be updated as we go along: 
  • Step 1: the raw corpus will be distributed to participants well in advance, so that they can understand the formats, interface their systems, and tune and train them.
  • Step 2: a dry run will take place in order to check the procedures and evaluation programs.
  • Step 3: the participants will return the aligned corpus to the coordinator in the agreed format.
  • Step 4: the proposed alignments will be evaluated and the results returned on the discussion list.
  • Step 5: the results will be discussed on the list and at the SENSEVAL workshop (2-4 September).
  • Step 6: a longer discussion and analysis of results will take place in the fall, with the goal of publishing the results and planning the second round.

Evaluation method

The discussion that started on the list and the one that took place in the first campaign showed that evaluation of parallel text alignment is by no means a simple task, even at the sentence level. Again, feasibility constraints (time and human resources) partially drive what can be practically done -- as opposed to what would be theoretically perfect. However, as in any competition, we must make every effort to ensure fairness and openness of the evaluation process.

The first point that must be stressed is that the idea of "competition" is only a pretext for collectively doing an interesting piece of scientific work and improving our systems. The final ranking of systems (if any such ranking is possible) is not very important. It was noted during the discussions that it is extremely difficult to compare systems with different goals and different resources. For example, a system which does an extremely accurate alignment on one language pair only, and another one which does a rougher job but on any language pair, are both "good", each in a different sense.

The discussion seemed to move toward agreement on several ideas. [I hope that this correctly reflects what has been said so far; please correct me if not. We can update the page as the discussion goes along -- JV]
 

  • Various metrics can be used and compared. There is no need to rank the systems according to one single final score. Instead, it seems that an array of measures will more accurately reflect the behavior of the various systems. Of course, it was pointed out during the discussion that every system can become a "winner" according to some metric, or, in other words, that everybody can propose an esoteric measure that will make their system the best according to that measure. This is true, but we can probably trust peer evaluation and scientific discussion to lead to a consensual array of "reasonable" measures. And why not have several "winners"?
  • A blind quantitative evaluation is not enough, whatever the metrics or combination of metrics we use. The quantitative evaluation must be complemented by a precise description of the systems in terms of resources used, application context, internal principles and overall capabilities (beyond the narrow one being tested). We could ask participants to fill in a specification sheet for their systems, which could result in a comparison chart that would let us understand at a glance what is being compared with what. Such a specification sheet could be built collectively through discussion on the list (obvious items are: type and size of resources used, language pairs accepted, etc.).
  • There should be an adjudication phase during which the participants can discuss the results, raise objections, point out errors, question the metrics, and so on. We will not have much time before the SENSEVAL workshop for such a discussion, but the workshop itself will hopefully be an occasion for discussion, and we will have all fall to analyse the results before we make them more broadly public.
  • In any case, it seems clear that results in isolation do not make sense, and are of interest only if they are accompanied by a detailed discussion explaining the observed efficiency and the context, resources, algorithms and so on.
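
To make the idea of an "array of measures" concrete, here is a minimal sketch in Python. It is only an illustration: this page does not fix any metric, and the choice of precision, recall and F-measure, the two granularities (whole alignments versus individual sentence pairs), the data representation and all function names are assumptions made for the example.

```python
from itertools import product

# Assumed representation: an alignment links a tuple of source sentence ids
# to a tuple of target sentence ids, e.g. (("s2",), ("t2", "t3")) is a 1:2
# alignment. An empty tuple stands for an insertion or an omission.


def prf(proposed, reference):
    """Return (precision, recall, F-measure) for two sets of items."""
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def sentence_pairs(alignments):
    """Expand each n:m alignment into its individual sentence-to-sentence pairs."""
    return {pair for src, tgt in alignments for pair in product(src, tgt)}


def measures(proposed, reference):
    """Score one system at two granularities instead of with a single number."""
    return {
        "alignment": prf(set(proposed), set(reference)),
        "sentence_pair": prf(sentence_pairs(proposed), sentence_pairs(reference)),
    }


# Hypothetical data: the system splits a 1:2 alignment into a 1:1 and a 0:1.
reference = {(("s1",), ("t1",)), (("s2",), ("t2", "t3"))}
proposed = {(("s1",), ("t1",)), (("s2",), ("t2",)), ((), ("t3",))}
scores = measures(proposed, reference)
```

On this toy data the alignment-level measures are harsh (precision 1/3, recall 1/2, F 0.4), because only one proposed alignment matches the reference exactly, whereas the sentence-pair measures give partial credit (precision 1.0, recall 2/3, F 0.8). The same output thus looks quite different depending on the measure, which is why reporting several of them side by side seems more informative than a single score.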