|
The discussion that started
on the list and the one that took place in the first campaign showed that
evaluation of parallel text alignment is by no means a simple task, even
at the sentence level. Again, feasibility constraints (time and human resources)
partially drive what can be practically done -- as opposed to what would
be theoretically perfect. However, as in any competition, we must make
every effort to ensure fairness and openness of the evaluation process.
The first point that must
be stressed is that the idea of "competition" is only a pretext for collectively
doing an interesting piece of scientific work and improving our systems. The
final ranking of systems (if any such ranking is possible) is not very
important. It was noted during the discussions that it is extremely difficult
to compare systems with different goals and different resources. For example,
a system which does an extremely accurate alignment on one language pair
only, and another one which does a rougher job but on any language pair,
are both "good", each in a different sense.
The discussion seemed to
move toward agreement on several ideas.
[I hope that this correctly reflects what has been said so far; please
correct me if not. We can update the page as the discussion goes along
-- JV]
-
Various metrics
can be used and compared. There is no need to rank the systems according
to one single final score. Instead, it seems that an array of measures
will more accurately reflect the behavior of the various systems. Of course,
it was pointed out during the discussion that every system can become a
"winner" according to some metric, or in other terms, that everybody can
propose an esoteric measure that will make their system the best according
to that measure. This is true, but we can probably trust peer evaluation
and scientific discussion to lead to a consensual array of "reasonable"
measures. And why not have several "winners"?
-
A blind quantitative evaluation
is not enough, whatever the metrics or combination of metrics we can
use. The quantitative evaluation must be complemented by a precise description
of the systems in terms of resources used, applicative context, internal
principles and overall capabilities (beyond the narrow one being tested).
We could ask participants to fill in a specification sheet, which could yield
a comparison chart showing at a glance what is being compared with
what. Such a specification sheet could be built collectively
through discussion on the list (obvious items are: type and size of resources
used, language pairs accepted, etc.).
-
There should be an adjudication
phase during which the participants can discuss the results, raise
objections, point out errors, question the metrics, and so on. We will
not have much time before the SENSEVAL workshop for such a discussion,
but the workshop itself will hopefully be an occasion for discussion, and
we will have all fall to analyse the results before we make them more broadly
public.
-
In any case, it seems clear
that results in isolation do not make sense, and are of interest only if
they are accompanied by a detailed discussion explaining the observed
efficiency and describing the context, resources, algorithms and so on.
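
The idea of reporting an array of measures instead of one final score could be sketched as follows. This is only an illustration, not an agreed protocol: the choice of precision, recall and F-measure over sentence links, the encoding of a link as a pair of sentence numbers, and the names `measures` and `best_per_measure` are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: a link pairs a source sentence number
# with a target sentence number.
Link = Tuple[int, int]


def measures(reference: Set[Link], proposed: Set[Link]) -> Dict[str, float]:
    """Return several measures for one system (assumed metrics, not agreed ones)."""
    correct = len(reference & proposed)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


def best_per_measure(table: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """For each measure, list the system(s) with the highest value."""
    names = next(iter(table.values())).keys()
    winners: Dict[str, List[str]] = {}
    for name in names:
        top = max(scores[name] for scores in table.values())
        winners[name] = sorted(
            system for system, scores in table.items() if scores[name] == top
        )
    return winners


# Toy data, invented for the example.
reference = {(1, 1), (2, 2), (3, 3), (4, 4)}
table = {
    # Cautious system: few links, all of them correct.
    "A": measures(reference, {(1, 1), (2, 2)}),
    # Broader system: more links found, but also more errors.
    "B": measures(reference, {(1, 1), (2, 2), (3, 3), (4, 5), (5, 4), (6, 6)}),
}
winners = best_per_measure(table)
```

With these toy data, system A comes first on precision and system B on recall, so the table has more than one "winner", which is exactly the situation discussed above.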
|