ARCADE      
Sentence track - Corpus

 

Composition

  The corpus will be the same as the one used in the first campaign, at least for the first round (1998), for several reasons:   
  • this will enable direct comparison between the systems that competed in the first campaign and the new ones
  • this will enable the systems of the first campaign to check whether they improved (which after all is the main goal of the exercise)
  • from a practical point of view, we simply do not have the time and resources to build another corpus within the next few months.

We know that the corpus has some imperfections and mistakes. They should not prevent us from running the first round, but of course, they should be taken into account in the interpretation of results.  

We hope that the corpus providers will be able to revise the current corpus, and add more (types of) texts for the second round.  

The corpus has two main parts:  
  
  

JOC 
 
The JOC corpus is composed of records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament on a wide variety of topics, and the corresponding answers from the European Commission, in 9 parallel versions. It covers the year 1993 and was collected and prepared within the MLCC-MULTEXT projects. Its total size is approximately 10.2 million words (ca. 1.1 million words per language).

The part used for the sentence track is composed of one fifth of the French and English parts (ca. 200000 words per language).  

This corpus is provided by LPL.
 

BAF 
 
BAF is also a French-English bitext of about 400000 words per language. It contains four sub-sets of texts:
  
  • INST: Four institutional texts (including a representative excerpt of the Hansard corpus, which consists of transcriptions of parliamentary debates), for a total size close to 300000 words per language.
  • SCIENCE: Five scientific articles of about 50000 words per language each.
  • TECH: A technical document with 39328 English words for 46828 French ones. It contains a large glossary that is sorted in a different order in each language.
  • VERNE: Jules Verne's novel De la terre à la lune (40161 English words vs 53181 French words). This text is very interesting because the translations are sometimes divergent (only 75% of the sentence alignments follow a 1-1 pattern). In fact, it is not even clear whether the English version is really a translation of the French one or whether it was translated from an abridged version (many segments are missing in the English version).

  
This corpus is provided by RALI.

   
 All texts are segmented into sentences.

Format

  The corpus is SGML-encoded according to the Corpus Encoding Standard developed in the Multext project, using the CesAna DTD for the encoding of linguistic annotation. 

Roughly speaking, the documents are structured as follows: 
 
 

<!DOCTYPE CESANA PUBLIC "-//CES//DTD cesAna//EN" > 
<CESANA VERSION="1.12"> 
<CHUNKLIST> 
  <CHUNK ID="C1"> 
    <PAR ID="C1P1"> 
      <S ID="C1P1S1"> 
        This is the first sentence of the first paragraph 
        of the first division. 
      </S> 
      <S ID="C1P1S2"> 
        This is the second sentence of the first paragraph 
        of the first division. 
      </S> 
   </PAR> 
   <PAR ID="C1P2"> 
     <S ID="C1P2S1"> 
       This is the first sentence of the second paragraph 
       of the first division. 
     </S> 
   </PAR> 
  ... 
  </CHUNK> 
  <CHUNK ID="C2"> 
    <PAR ID="C2P1"> 
      <S ID="C2P1S1"> 
        This is the first sentence of the first paragraph 
        of the second division. 
... 
</CHUNKLIST> 
</CESANA>

In other words, each document is a set of divisions (<CHUNK>), each of which contains a series of paragraphs (<PAR>), which in turn contain a series of sentences (<S>). 

Note that: 

  • Each sentence has an identifier (e.g. C1P1S1) which will be used in the alignment.
  • There is no internal markup within the sentences. 
  • There are no sentence embeddings.
  • Each line contains either markup (tags) or data (plain text).
  • Indentation is used above only for the sake of clarity; it does not occur in the real files.
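
Because each line holds either markup or data and sentences carry no internal markup, the sentence identifiers and their text can be read with a simple line-based pass. The following Python sketch is only an illustration and is not part of the campaign's tools: the function names iter_sentences and split_id are hypothetical, the C<n>P<n>S<n> identifier pattern is assumed from the example above, and SGML entities are left undecoded.

```python
import re
from typing import Iterable, Iterator, Tuple

# Opening and closing sentence tags, e.g. <S ID="C1P1S1"> and </S>.
# Matching is case-insensitive as a precaution.
S_OPEN = re.compile(r'<S\s+ID="([^"]+)"\s*>', re.IGNORECASE)
S_CLOSE = re.compile(r'</S\s*>', re.IGNORECASE)

# Assumed identifier shape, taken from the sample (C1P1S1).
ID_PATTERN = re.compile(r'C(\d+)P(\d+)S(\d+)')


def iter_sentences(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sentence_id, text) pairs from the lines of one document."""
    current_id = None
    parts = []
    for raw in lines:
        line = raw.strip()  # tolerate indentation, though real files have none
        if not line:
            continue
        opening = S_OPEN.fullmatch(line)
        if opening:
            current_id = opening.group(1)
            parts = []
        elif S_CLOSE.fullmatch(line):
            if current_id is not None:
                yield current_id, " ".join(parts)
            current_id = None
        elif current_id is not None:
            # Inside a sentence every line is data (no internal markup).
            parts.append(line)


def split_id(sentence_id: str) -> Tuple[int, int, int]:
    """Split an identifier such as C1P2S1 into (chunk, paragraph, sentence)."""
    match = ID_PATTERN.fullmatch(sentence_id)
    if match is None:
        raise ValueError(f"unexpected sentence identifier: {sentence_id!r}")
    chunk, par, sent = match.groups()
    return int(chunk), int(par), int(sent)


SAMPLE = """<CHUNKLIST>
<CHUNK ID="C1">
<PAR ID="C1P1">
<S ID="C1P1S1">
This is the first sentence of the first paragraph
of the first division.
</S>
</PAR>
<PAR ID="C1P2">
<S ID="C1P2S1">
This is the first sentence of the second paragraph
of the first division.
</S>
</PAR>
</CHUNK>
</CHUNKLIST>"""

if __name__ == "__main__":
    for sid, text in iter_sentences(SAMPLE.splitlines()):
        print(sid, split_id(sid), text)
```

Data lines belonging to one sentence are joined with a single space, which is a choice made for this sketch; an aligner that needs the original line breaks would keep the lines separate instead.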

 

Sample

   A sample of the JOC corpus is available: 

Availability

 
The corpus is ready, and the non-aligned version will be distributed by ftp to the participants. A password will be sent to each participant as soon as the appropriate licence agreement with the European Language Resources Association (ELRA) is signed. You can preview what the agreement will look like by glancing at the standard ELRA licences. 

The reference aligned corpus will be made available to the participants after they have sent their results.