| |
The corpus will
be the same as the one used in the first campaign, at
least for the first round (1998), for several reasons:
- this will
enable direct comparison between the systems that
competed in the first campaign and the new ones
- this will
enable the systems of the first campaign to check
whether they improved (which after all is the
main goal of the exercise)
- from a
practical point of view, we simply do not have the
time and resources to build another corpus within
the next few months.
We know that the
corpus has some imperfections and mistakes. They should
not prevent us from running the first round, but of
course, they should be taken into account in the
interpretation of results.
We hope that the
corpus providers will be able to revise the current
corpus, and add more (types of) texts for the second
round.
The corpus has
two main parts:
JOC
|
The JOC
corpus is composed of records of questions and
answers regarding European Community matters. The
data is regularly published as one section of the
C Series of the Official Journal of the European
Community in all official languages (previously
nine). This corpus contains written questions
asked by members of the European Parliament on a
wide variety of topics and corresponding answers
from the European Commission in 9 parallel
versions. The corpus, which was collected and
prepared within the MLCC-MULTEXT projects, covers
the year 1993 and totals approximately 10.2
million words (ca. 1.1 million words per
language). The part used for the
sentence track is composed of one fifth of the
French and English parts (ca. 200000 words per
language).
This
corpus is provided by LPL.
|
BAF
|
BAF is
also a French-English bitext of about 400000
words per language. It contains four sub-sets of
texts:
| INST |
Four institutional
texts (including a representative excerpt
of the Hansard corpus, which consists of
transcriptions of parliamentary debates)
for a total size close to 300000 words
per language. |
| SCIENCE |
Five scientific
articles of about 50000 words per
language each. |
| TECH |
Technical
documentation with 39328 English words
and 46828 French words. It contains a large
glossary sorted in a different order in
each language. |
| VERNE |
Jules Verne's
novel De la Terre à la Lune
(40161 English words vs. 53181
French words). This corpus is very
interesting because the translations are
sometimes divergent (only 75% of the
alignment patterns are 1-1). In fact, it is
not even clear whether the English version
is really a translation of the French one
or whether it was translated from an
abridged version (many segments are
missing from the English version). |
This corpus
is provided by RALI.
|
All texts are
segmented into sentences.
|
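The word counts quoted for the TECH and VERNE sub-sets give a rough idea of how far the two language versions diverge in length. The following is a minimal sketch, not part of the campaign tools: it only computes the French-to-English word ratio from the figures given above, and the names `WORD_COUNTS` and `french_to_english_ratio` are illustrative assumptions.

```python
# Word counts quoted in the sub-set descriptions above: (English, French).
WORD_COUNTS = {
    "TECH": (39328, 46828),
    "VERNE": (40161, 53181),
}


def french_to_english_ratio(english_words, french_words):
    """Return the number of French words per English word."""
    return french_words / english_words


# Ratio for each sub-set whose counts are given for both languages.
ratios = {
    name: french_to_english_ratio(en, fr)
    for name, (en, fr) in WORD_COUNTS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} French words per English word")
```

The higher ratio for VERNE is consistent with the remark that many segments are missing from the English version, although a length ratio alone cannot show where the two texts diverge.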