Multext - Document MSG 1. MtSeg/Resources. Last modified

MtSeg: Resources


The resource files

tbl.classes.xx

tbl.punct.xx & tbl.mergepunct.xx

tbl.abbrev.xx & tbl.mergeabbrev.xx

tbl.split.xx

tbl.merge.xx

tbl.date.xx

tbl.digit.xx

tbl.enum.xx

tbl.sent.xx


The resource files

Some of the subtools in the segmenter require certain language-specific information in order to accomplish their tasks. For maximum flexibility and to retain language-independence, all such information is provided directly to the subtools via external resource files. All resource files have names in the following form:

tbl.nnnnn.xx

where 'nnnnn' is replaced by a specific file name identifying the contents, and 'xx' is the two-letter ISO standard code [ISO 639:1988] for the language:

bg  Bulgarian
cs  Czech
de  German
en  English
es  Spanish
et  Estonian
fr  French
hu  Hungarian
it  Italian
nl  Dutch
ro  Romanian
sl  Slovenian
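
The naming convention above can be checked mechanically. The following is a minimal sketch, not part of MtSeg: the helper name parse_resource_name is hypothetical, and it assumes that the 'nnnnn' part consists of letters only, as in all the file names listed in this document.

```python
import re

# Language codes listed in this document.
LANGUAGES = {
    "bg": "Bulgarian", "cs": "Czech", "de": "German", "en": "English",
    "es": "Spanish", "et": "Estonian", "fr": "French", "hu": "Hungarian",
    "it": "Italian", "nl": "Dutch", "ro": "Romanian", "sl": "Slovenian",
}

# Assumption: the middle part of the name contains letters only.
_NAME_RE = re.compile(r"tbl\.([A-Za-z]+)\.([a-z]{2})")


def parse_resource_name(filename):
    """Return (contents, language_code) for a name such as 'tbl.abbrev.fr'."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a resource file name: {filename!r}")
    contents, code = match.groups()
    if code not in LANGUAGES:
        raise ValueError(f"unknown language code: {code!r}")
    return contents, code
```

For instance, parse_resource_name("tbl.abbrev.fr") returns the pair ("abbrev", "fr"), while a name with an unlisted language code raises ValueError.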

Find below the descriptions of all the resource files used by the segmenter, together with a short sample of the contents of each for French.

The segmenter's resource files can be created in a boot-strapping process, since the output of the segmenter provides its idea of what is a token, what is an abbreviation, etc. By checking the output of the segmenter you will verify the entries you have already provided, and see others that should be added to the file.

Note that in the samples we have used SGML entities for characters. You should do the same in constructing your files.

Should you have any problems or questions concerning the files, contact Philippe Di Cristo.



tbl.classes.xx

The purpose of this file is to define the exact names used by the segmenter subtools to refer to classes, in order to enable re-definition by the users. Each class is assigned a label.

For example:

ABBREVIATION Abbr TOKEN

means that all abbreviations encountered will be annotated by 'Abbr' and that an abbreviation is also a token.

This file is in a table format with tabulations as field separators:

<CLASS NAME>TAB<CLASS LABEL>TAB<CHILDOF>

Where

CLASS NAME
name of the class
CLASS LABEL
output code for a class
CHILDOF
name of class which is one level above the assigned class
(ex: "TITLE Title ABBREVIATION" means that the class TITLE belongs to the class ABBREVIATION)
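
To illustrate the table format, here is a minimal sketch of how such a file could be read and how the CHILDOF column yields a class hierarchy. The function names are hypothetical, and the label "Tok" for TOKEN and the empty CHILDOF field for a top-level class are assumptions made for the example only.

```python
# Sample lines in the <CLASS NAME>TAB<CLASS LABEL>TAB<CHILDOF> format.
# Assumption: a top-level class has an empty CHILDOF field.
SAMPLE_CLASSES = (
    "TOKEN\tTok\t\n"
    "ABBREVIATION\tAbbr\tTOKEN\n"
    "TITLE\tTitle\tABBREVIATION\n"
)


def parse_classes(text):
    """Map each class name to a (label, parent) pair."""
    classes = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        name, label = fields[0], fields[1]
        parent = fields[2] if len(fields) > 2 and fields[2] else None
        classes[name] = (label, parent)
    return classes


def ancestors(classes, name):
    """List the classes above `name`, nearest first."""
    chain = []
    parent = classes[name][1]
    # Stop at the top of the hierarchy, at an undefined class, or on a cycle.
    while parent is not None and parent in classes and parent not in chain:
        chain.append(parent)
        parent = classes[parent][1]
    return chain
```

With the sample above, ancestors(parse_classes(SAMPLE_CLASSES), "TITLE") gives ["ABBREVIATION", "TOKEN"]: a title is an abbreviation, which is in turn a token.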

Example: tbl.classes.fr (French resource file)



tbl.punct.xx & tbl.mergepunct.xx

tbl.punct.xx

This file contains the definition of those characters and character configurations (not containing internal spaces) that are to be considered as punctuation. They are defined in a regular expression format and each one is assigned the associated class.

For example:

PERIOD \. TERM_PUNCT?

This entry assigns the punctuation mark "." the symbolic name "PERIOD" and the class "TERM_PUNCT?" (i.e., possibly terminal punctuation).
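
As an illustration of how such an entry can be used, the sketch below loads entries of this shape and looks up a string. It assumes tab-separated fields and Python's regular expression syntax, neither of which is confirmed by this document for the real files; the function names are hypothetical.

```python
import re

# One entry in the form: symbolic name, regular expression, class.
# Assumption: the three fields are separated by tabs.
SAMPLE_PUNCT = "PERIOD\t\\.\tTERM_PUNCT?\n"


def load_regex_table(text):
    """Return a list of (name, compiled_regex, class_name) entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, pattern, class_name = line.split("\t")[:3]
        entries.append((name, re.compile(pattern), class_name))
    return entries


def classify(entries, string):
    """Return (name, class_name) of the first entry matching the whole string."""
    for name, regex, class_name in entries:
        if regex.fullmatch(string):
            return name, class_name
    return None
```

Here classify(load_regex_table(SAMPLE_PUNCT), ".") returns ("PERIOD", "TERM_PUNCT?"), and a string that matches no entry returns None.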

This file also classifies each type of punctuation by means of the following MANDATORY entries:

MT_COMPOSED_PUNCT
punctuation composed of more than one character
(e.g. "...", "<<"). No space is allowed between the characters.
 
MT_BREAKING_PUNCT
punctuation which always forms a boundary (whether or not separated from surrounding characters by spaces).
e.g. "look;said" will be split into "look" + ";" + "said".
 
MT_INTERNAL_PUNCT
punctuation which may be internal to a token, such as "-" "," (not surrounded by spaces) and as such should not be split off from the token.
 
MT_NON_BREAKING_LEFT_PUNCT
punctuation which may appear at the left of a token, but which should not be split from the token.
 
MT_NON_BREAKING_RIGHT_PUNCT
punctuation which may appear at the right of a token, but which should not be split from the token.

Example: tbl.punct.fr (French resource file)

tbl.mergepunct.xx

This file defines punctuation sequences that contain space characters. It is in the same format as the file 'tbl.merge.xx'.

Example: tbl.mergepunct.fr (French resource file)

This file contains the entry

._._. PUNCT ELLIPSIS

As a result, the following sequence of tokens (generated by the subtool mtsegspace)

PTERM_P .
PTERM_P .
PTERM_P .

will be re-combined to become

PTERM_P ._._.


tbl.abbrev.xx & tbl.mergeabbrev.xx

tbl.abbrev.xx

This file contains abbreviations. It is used by the module of the segmenter called mtsegabbrev.

The segmenter works by first isolating strings in the text which are separated by blanks or tabs, and then isolating unambiguous punctuation and other graphically unambiguous elements (the modules mtsegspace and mtsegpunct). Each string is associated with the class name TOKEN for "token" (except for punctuation, which is explicitly identified as such with the class name PUNCTUATION), and each classname-string pair is put in a separate record. For example, given the text:

Mme. Steiner, elle est d'Allemagne.

the output of these two modules is:

TOK Mme.
TOK Steiner
PUNCT ,
TOK elle
TOK est
TOK d'Allemagne.

(In the actual output of the two modules, a location reference appears on the right of each line. This is eliminated here for simplicity.)

The period is left attached to "Allemagne" because it cannot be determined at this point whether this is the period terminating an abbreviation or not.

The next step (module mtsegabbrev) looks up all words ending with a period in the file tbl.abbrev.xx. Any string consisting of capital letters only, possibly with internal periods and with a final period (e.g., U.S.A.), is automatically assigned the class ABBREVIATION, whether or not it appears in the file. For abbreviations that do appear in the file, mtsegabbrev uses the information in the file to assign a new class (replacing TOKEN). Otherwise, the final period is separated, put in a second record and given its own punctuation class (PTERM_P in the output below). Thus the example above, after treatment by mtsegabbrev, becomes:

TIT Mme.
TOK Steiner
PUNCT ,
TOK elle
TOK est
TOK d'Allemagne
PTERM_P .
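
The behaviour just described can be sketched as follows. This is only an illustration under stated assumptions, not the actual mtsegabbrev code: the table is a small hypothetical stand-in for tbl.abbrev.fr that stores output labels directly, the capital-letter test covers ASCII letters only, and location references are omitted.

```python
import re

# Hypothetical stand-in for tbl.abbrev.fr, mapping abbreviations to labels.
ABBREV_TABLE = {"Mme.": "TIT"}

# Capital letters only, optional internal periods, final period (e.g. U.S.A.).
CAPS_ABBREV = re.compile(r"[A-Z]+(?:\.[A-Z]+)*\.")


def abbrev_step(records, table):
    """Reclassify abbreviations and split off other final periods."""
    out = []
    for label, text in records:
        if label != "TOK" or len(text) < 2 or not text.endswith("."):
            out.append((label, text))          # nothing to decide here
        elif text in table:
            out.append((table[text], text))    # class taken from the file
        elif CAPS_ABBREV.fullmatch(text):
            out.append(("ABBREV", text))       # automatic abbreviation
        else:
            out.append((label, text[:-1]))     # ordinary word ...
            out.append(("PTERM_P", "."))       # ... plus a separate period
    return out


RECORDS = [
    ("TOK", "Mme."), ("TOK", "Steiner"), ("PUNCT", ","),
    ("TOK", "elle"), ("TOK", "est"), ("TOK", "d'Allemagne."),
]
```

Calling abbrev_step(RECORDS, ABBREV_TABLE) reproduces the seven records shown above, with "Mme." relabelled TIT and the final period of "d'Allemagne." split off as PTERM_P.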

There are four classes of abbreviations:

ABBREVIATION
an abbreviation in which the final period could also function as end-of-sentence
 
NB_ABBREVIATION
non-breaking abbreviation, i.e. a word which is never abbreviated when it appears at the end of a sentence. Thus the period is unambiguous.
 
TITLE
an abbreviation which is a title, thus followed by a proper name.
 
INITIAL
an abbreviation which consists of a single upper-case letter.

The file comprises 3 columns:

  1. abbreviation
  2. class
  3. comment (optional)

Note:
The classes defined so far are listed in the file called tbl.classes.xx (specific to your own language).

Example: tbl.abbrev.fr (French resource file)

Note that abbreviations like "U.S.A." are not necessary, since any string consisting entirely of capital letters is automatically assigned the class ABBREVIATION.

TOK Dr.

becomes

TIT Dr.

tbl.mergeabbrev.xx

This file is identical in format to tbl.abbrev.xx. It contains abbreviations that include spaces (with each space replaced by an underscore).

The file contains 3 columns:

  1. abbreviation declaration
  2. class
  3. comment (optional)

For example, if this file contains the entry

A._T. ABBREVIATION

then "A. T." is defined as an abbreviation, and the following sequence (generated by the subtool mtsegspace)

TOK A.
TOK T.

will be recombined to become

ABBREV A._T.


tbl.split.xx

This file contains the "clitics" of the language in question. These are not real clitics; for our purposes, a clitic is anything separated by a hyphen ("-", as in "est-il") or an apostrophe ("'", as in "l'avion"). Each one is assigned to one of two classes: PROCLITIC (for items appearing at the beginning of the string) or ENCLITIC (for items appearing at the end of the string). These class names are associated with the identified tokens in the output of the mtsegsplit module of the segmenter.

When the segmenter (specifically, the module mtsegsplit) encounters a token containing an apostrophe or hyphen, it consults the file tbl.split.xx (where "xx" is the language code, e.g. tbl.split.fr for French) to determine whether or not the token should be split into two (or more) separate tokens. For example, if mtsegsplit receives the following input,

TOK est-il

it will check the file tbl.split.fr and see that "-il" is identified as an enclitic. The result will be the output of two tokens:

TOK est
ENC -il

The file tbl.split.xx also includes words containing instances of the identified enclitics and proclitics that should NOT be split into separate tokens--for example, "d'abord" in French, since in this case the "d'" is not due to an elision of "de" and "abord", and is therefore not a separate grammatical item. Thus although "d'" is identified as a proclitic in the tbl.split.fr file, the segmenter will leave "d'abord" as a single token, and change its class name from TOKEN to COMPOUND. Note that the process of isolating clitics is iterated, so that a string such as "a-t-il" is analyzed to be

TOK a
ENC -t
ENC -il
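
The iterated splitting and the exception list can be sketched as below. This is an illustration only, not the mtsegsplit implementation: the clitic table is a small hypothetical stand-in for tbl.split.fr, the label "PROC" for proclitics is an assumption (only "ENC" appears in this document), and proclitics are assumed to end in an apostrophe while enclitics start with a hyphen.

```python
# Hypothetical stand-in for tbl.split.fr: clitic -> output label.
CLITICS = {"-il": "ENC", "-t": "ENC", "d'": "PROC", "l'": "PROC"}

# Words that contain a clitic-like string but must stay whole.
EXCEPTIONS = {"d'abord"}


def split_clitics(text, clitics=CLITICS, exceptions=EXCEPTIONS):
    """Split clitics off a token, repeating until nothing more applies."""
    if text in exceptions:
        return [("COMPOUND", text)]
    front, back = [], []
    changed = True
    while changed:
        changed = False
        for clitic, label in clitics.items():
            if len(text) <= len(clitic):
                continue  # never leave an empty token behind
            if clitic.endswith("'") and text.startswith(clitic):
                front.append((label, clitic))      # proclitic at the start
                text = text[len(clitic):]
                changed = True
            elif clitic.startswith("-") and text.endswith(clitic):
                back.insert(0, (label, clitic))    # enclitic at the end
                text = text[:-len(clitic)]
                changed = True
    return front + [("TOK", text)] + back
```

With these tables, split_clitics("a-t-il") yields TOK a, ENC -t, ENC -il as shown above, and split_clitics("d'abord") returns the single record COMPOUND d'abord.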

The file tbl.split.xx comprises 2 columns, separated by tabs:

  1. clitic declaration
  2. class

Example: tbl.split.fr (French resource file)



tbl.merge.xx

In other instances, words which are separated by blanks, and which were therefore previously split into separate tokens by mtsegspace, should be regarded as a single token comprising a compound word. The module mtsegmerge uses the file tbl.merge.xx to determine which tokens to recombine into compounds. So, the input

TOK après
TOK que

becomes

COMPOUND après_que
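
A minimal sketch of this recombination follows. It is not the mtsegmerge code: the table is a hypothetical stand-in that stores output labels directly, and preferring the longest matching sequence is an assumption made for the example. The same idea applies to the tbl.mergepunct.xx and tbl.mergeabbrev.xx files, which share this format.

```python
# Hypothetical stand-in for the merge tables: entry -> output label.
MERGE_TABLE = {"après_que": "COMPOUND", "A._T.": "ABBREV"}


def merge_records(records, table, max_len=4):
    """Recombine adjacent records whose underscore-joined text is in the table."""
    out = []
    i = 0
    while i < len(records):
        # Assumption: try the longest candidate sequence first.
        for n in range(min(max_len, len(records) - i), 1, -1):
            key = "_".join(text for _, text in records[i:i + n])
            if key in table:
                out.append((table[key], key))
                i += n
                break
        else:
            out.append(records[i])  # no entry starts here; keep the record
            i += 1
    return out
```

For example, merge_records([("TOK", "après"), ("TOK", "que")], MERGE_TABLE) returns [("COMPOUND", "après_que")].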

All the fixed grammatical words (multiword conjunctions, adverbs etc.) should be there, especially if the text is to be subsequently processed by a statistical tagger. For this purpose it is less clear that N+N, Adj+N, N+prep+N sequences, which constitute an open lexicon, should be included. Most of these sequences have some derivational behavior, which is beyond the scope of this subtool (e.g. "pomme de terre", "pommes de terre").

Note that it is possible to define a new class "COMPOUND?" which can be used to mark ambiguous sequences such as "alors que" (= "alors que" or "alors" "que"):

  1. il s'aperçut alors que le bus était parti
    (alors que -> "alors" (adverb) "que" (conjunction))
  2. il lui parlait alors que le bus arrivait
    (alors que -> "alors que" (conjunctive phrase))

The file has 3 columns:

  1. compound, with spaces represented by underscore (same format as in the lexicon)
  2. class name
  3. comment (optional)

Example: tbl.merge.fr (French resource file)



tbl.date.xx

This file is used by the segmenter's tool 'mtsegregex' to detect and recombine a date.

The date format is given with a regular expression in the mandatory entry 'MT_DATE'.

Space characters must be replaced by an underscore in the expression.

The file has 3 columns:

  1. a symbolic name
  2. a regular expression
  3. a class name

Example: According to the file tbl.date.fr (French resource file), the sequence

TOK 10
TOK Juillet
TOK 1997

will be recombined to become

DATE 10_Juillet_1997
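
The recombination of a date can be sketched as follows. The regular expression is a hypothetical MT_DATE entry written for this example (the actual expression in tbl.date.fr is not reproduced here), and the sliding-window matching is an assumption about how mtsegregex could proceed, not a description of its implementation.

```python
import re

# Hypothetical MT_DATE expression; spaces are written as underscores.
MT_DATE = re.compile(r"[0-9]{1,2}_(?:Janvier|Juin|Juillet)_[0-9]{4}")


def recombine_dates(records, max_len=3):
    """Replace runs of records that form a date by a single DATE record."""
    out = []
    i = 0
    while i < len(records):
        for n in range(min(max_len, len(records) - i), 0, -1):
            candidate = "_".join(text for _, text in records[i:i + n])
            if MT_DATE.fullmatch(candidate):
                out.append(("DATE", candidate))
                i += n
                break
        else:
            out.append(records[i])  # no date starts at this record
            i += 1
    return out
```

Applied to the records TOK 10, TOK Juillet, TOK 1997, this returns the single record DATE 10_Juillet_1997, leaving any surrounding records unchanged.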

tbl.digit.xx

This file defines the digital number format using regular expressions.
This resource file uses the same format as the punctuation table.

There is only one mandatory entry, MT_DIGITAL, which assigns digital numbers to the class DIGITAL.

Note that in the file, the space character inside a digital number is replaced by an underscore.

The file has 3 columns:

  1. a symbolic name
  2. a regular expression
  3. a class name

Example: tbl.digit.fr (French resource file)

If the file indicates that composite numbers can contain internal spaces and commas, then the following sequence (generated by the subtool mtsegspace)

TOK 1
TOK 342,
TOK 625
TOK 56

will be recombined to become

DIG 1_342,625_56


tbl.enum.xx

This file defines the sequences that mark enumerations, that is, sequences such as "1." or "(a)" used to number the items of a list. It is used to detect, and sometimes recombine, such sequences and to assign them the class ENUMERATION.

Note that enumerations are detected only when they appear at the beginning of a paragraph or a chunk (this is set by the directive MT_CLASS_BEFORE).

In this file, many different kinds of enumerations can be defined, using regular expressions:

BEFORE[0-n]
characters that can occur before the number/character of the enumeration.
e.g.:
BEFORE0 \( for the character "("
BEFORE1 \[ for the character "["
NUMBER[0-n]
characters that may make up the enumeration itself.
e.g.:
NUMBER0 [0-9]+ defines numbers like 1, 2, 3, ...
NUMBER1 [0-9]\.([0-9]\.)* defines numbers like 1.2.1 ...
AFTER[0-n]
characters that can follow the number/character of the enumeration.
e.g.:
AFTER0 \) for the character ")"
AFTER1 \] for the character "]"

Then we have:

MT_ENUM
which defines the full structure of an enumeration in a regular expression format.

Only MT_ENUM is mandatory.

Example: tbl.enum.fr (French resource file)

[CHUNK <div>
TOK -1.

becomes

[CHUNK <div>
ENUM -1.
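
The following sketch shows one way the BEFORE, NUMBER and AFTER pieces could combine into a full enumeration pattern, and how detection can be limited to the start of a chunk. It rests on several assumptions: the way MT_ENUM refers to the other entries in the real file is not shown in this document, the "-" and "\." alternatives are added here only so that the "-1." example matches, and treating "[CHUNK" as a record label is a simplification.

```python
import re

# Hypothetical entries, written as Python regular expressions.
BEFORE = [r"\(", r"\[", r"-"]
NUMBER = [r"[0-9]+", r"[0-9]\.(?:[0-9]\.)*"]
AFTER = [r"\)", r"\]", r"\."]


def alternation(patterns):
    """Combine several patterns into one non-capturing alternative group."""
    return "(?:" + "|".join(patterns) + ")"


# Assumed overall structure: optional BEFORE, a NUMBER, optional AFTER.
MT_ENUM = re.compile(
    alternation(BEFORE) + "?" + alternation(NUMBER) + alternation(AFTER) + "?"
)


def mark_enumerations(records):
    """Relabel a token as ENUM when it directly follows a chunk opening."""
    out = []
    at_start = False
    for label, text in records:
        if at_start and label == "TOK" and MT_ENUM.fullmatch(text):
            out.append(("ENUM", text))
        else:
            out.append((label, text))
        at_start = label == "[CHUNK"  # only the next record is at the start
    return out
```

With these definitions, the records [CHUNK <div> followed by TOK -1. come out as [CHUNK <div> followed by ENUM -1., while the same token elsewhere in a chunk keeps its TOK label.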


tbl.sent.xx

Example: tbl.sent.fr (French resource file)



Copyright © Centre National de la Recherche Scientifique, 1996.