MtSeg: Resources
Some of the subtools in the segmenter require certain language-specific information in order to accomplish their tasks. For maximum flexibility and to retain language-independence, all such information is provided directly to the subtools via external resource files. All resource files have names in the following form:
tbl.nnnnn.xx
where 'nnnnn' is replaced by a specific file name identifying the contents, and 'xx' is the two-letter ISO standard code [ISO 639:1988] for the language:
| bg | Bulgarian |
| cs | Czech |
| de | German |
| en | English |
| es | Spanish |
| et | Estonian |
| fr | French |
| hu | Hungarian |
| it | Italian |
| nl | Dutch |
| ro | Romanian |
| sl | Slovenian |
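As a small illustration of the naming scheme, the sketch below builds a resource file name from a contents identifier and a language code. The helper `resource_file_name` is hypothetical and not part of the segmenter; it only assumes the `tbl.nnnnn.xx` pattern and the language codes listed above.

```python
# Language codes listed in the table above.
LANGUAGE_CODES = {"bg", "cs", "de", "en", "es", "et",
                  "fr", "hu", "it", "nl", "ro", "sl"}


def resource_file_name(contents: str, lang: str) -> str:
    """Build a name of the form tbl.nnnnn.xx (hypothetical helper)."""
    if lang not in LANGUAGE_CODES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"tbl.{contents}.{lang}"


print(resource_file_name("abbrev", "fr"))  # tbl.abbrev.fr
```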
Find below the descriptions of all the resource files used by the segmenter, together with a short sample of the contents of each for French.
The segmenter's resource files can be created in a boot-strapping process, since the output of the segmenter provides its idea of what is a token, what is an abbreviation, etc. By checking the output of the segmenter you will verify the entries you have already provided, and see others that should be added to the file.
Note that in the samples we have used SGML entities for characters. You should do the same in constructing your files.
Should you have any problems or questions concerning the files, contact Philippe Di Cristo.
The purpose of this file is to define the exact names used by the segmenter subtools to refer to classes, in order to enable re-definition by the users. Each class is assigned a label.
For example:
| ABBREVIATION | Abbr | TOKEN |
means that all abbreviations encountered will be annotated by 'Abbr' and that an abbreviation is also a token.
This file is in a table format with tabulations as field separators:
<CLASS NAME>TAB<CLASS LABEL>TAB<CHILDOF>
Where
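The three-field, tab-separated layout can be read with a few lines of code. The following is a minimal sketch, not the segmenter's own loader: the names `ClassEntry` and `parse_class_table` are hypothetical, and treating a missing third field as empty is an assumption.

```python
from typing import NamedTuple


class ClassEntry(NamedTuple):
    name: str      # class name used by the subtools
    label: str     # label written in the segmenter output
    child_of: str  # parent class; empty if the field is absent (assumption)


def parse_class_table(text: str) -> dict[str, ClassEntry]:
    """Parse lines of the form <CLASS NAME>TAB<CLASS LABEL>TAB<CHILDOF>."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        fields += [""] * (3 - len(fields))  # pad short lines
        name, label, child_of = (f.strip() for f in fields[:3])
        entries[name] = ClassEntry(name, label, child_of)
    return entries


table = parse_class_table("ABBREVIATION\tAbbr\tTOKEN\n")
print(table["ABBREVIATION"].label)     # Abbr
print(table["ABBREVIATION"].child_of)  # TOKEN
```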
This file contains the definition of those characters and character configurations (not containing internal spaces) that are to be considered as punctuation. They are defined in a regular expression format and each one is assigned the associated class.
For example:
| PERIOD | \. | TERM_PUNCT? |
This entry defines the punctuation "." under the symbolic name "PERIOD" and assigns it the class "TERM_PUNCT?" (i.e., possibly terminal punctuation).
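To illustrate how such an entry might be applied, the sketch below matches a string against a list of (name, expression, class) triples. It uses Python's `re` module, whose syntax may differ from the regular expression dialect the segmenter actually accepts; `PUNCT_ENTRIES` and `classify_punct` are hypothetical names.

```python
import re

# One entry modelled on the sample above: name, regular expression, class.
PUNCT_ENTRIES = [("PERIOD", r"\.", "TERM_PUNCT?")]


def classify_punct(string: str):
    """Return (name, class) for the first entry whose expression
    matches the whole string, or None if no entry matches."""
    for name, pattern, cls in PUNCT_ENTRIES:
        if re.fullmatch(pattern, string):
            return name, cls
    return None


print(classify_punct("."))  # ('PERIOD', 'TERM_PUNCT?')
print(classify_punct("a"))  # None
```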
This file also classifies each type of punctuation by the MANDATORY entries:
Example: tbl.punct.fr (French resource file)
This file defines punctuation sequences that contain space characters. It is in the same format as the file 'tbl.merge.xx'.
Example: tbl.mergepunct.fr (French resource file)
This file contains the entry
| ._._. | PUNCT | ELLIPSIS |
As a result, the following sequence of tokens (generated by the subtool mtsegspace)
| PTERM_P | . |
| PTERM_P | . |
| PTERM_P | . |
will be re-combined to become
| PTERM_P | ._._. |
This file contains abbreviations. It is used by the module of the segmenter called mtsegabbrev.
The segmenter works by first isolating strings in the text which are separated by blanks or tabs, and then isolating unambiguous punctuation and other graphically unambiguous elements (the modules mtsegspace and mtsegpunct). Each string is associated with the class name TOKEN for "token" (except for punctuation, which is explicitly identified as such with the class name PUNCTUATION), and each classname-string pair is put in a separate record. For example, given the text:
Mme. Steiner, elle est d'Allemagne.
the output of these two modules is:
| TOK | Mme. |
| TOK | Steiner |
| PUNCT | , |
| TOK | elle |
| TOK | est |
| TOK | d'Allemagne. |
(In the actual output of the two modules, a location reference appears on the right of each line. This is eliminated here for simplicity.)
The period is left attached to "Allemagne" because it cannot be determined at this point whether this is the period terminating an abbreviation or not.
The next step (module mtsegabbrev) will look up all words ending with a period in the file tbl.abbrev.xx. Any string consisting of capital letters only, possibly with internal periods and with a final period (e.g., U.S.A.) is automatically assigned the class ABBREVIATION, whether or not it appears in the file. For abbreviations that do appear in the file, mtsegabbrev uses the information in the file to assign a new class (replacing TOKEN). Otherwise, the final period is separated and put in a second record and given the class assigned to. Thus the example above after treatment by mtsegabbrev becomes:
| TIT | Mme. |
| TOK | Steiner |
| PUNCT | , |
| TOK | elle |
| TOK | est |
| TOK | d'Allemagne |
| PTERM_P | . |
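The abbreviation step described above can be sketched as follows. This is an illustration under stated assumptions, not the mtsegabbrev implementation: the table `ABBREVS` is a hypothetical one-entry excerpt, the all-capitals test covers ASCII letters only, the file lookup is tried before the all-capitals rule, and the labels `ABBREV` and `PTERM_P` are taken from the sample outputs in this document.

```python
import re

# Hypothetical excerpt of tbl.abbrev.fr: abbreviation -> class label.
ABBREVS = {"Mme.": "TIT"}

# Capital letters, optional internal periods, and a final period (e.g. U.S.A.).
ALL_CAPS = re.compile(r"[A-Z]+(?:\.[A-Z]+)*\.")


def apply_abbrev(records):
    """Reclassify abbreviations; split off any other final period."""
    out = []
    for cls, string in records:
        if cls != "TOK" or not string.endswith(".") or len(string) == 1:
            out.append((cls, string))          # nothing to do
        elif string in ABBREVS:
            out.append((ABBREVS[string], string))
        elif ALL_CAPS.fullmatch(string):
            out.append(("ABBREV", string))     # automatic rule
        else:
            out.append((cls, string[:-1]))     # detach the period
            out.append(("PTERM_P", "."))
    return out


records = [("TOK", "Mme."), ("TOK", "Steiner"), ("PUNCT", ","),
           ("TOK", "elle"), ("TOK", "est"), ("TOK", "d'Allemagne.")]
for cls, string in apply_abbrev(records):
    print(cls, string)
```

Run on the example records, the sketch reproduces the table above: "Mme." becomes TIT and the period after "d'Allemagne" is placed in its own record.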
There are four classes of abbreviations:
The file comprises 3 columns:
Example: tbl.abbrev.fr (French resource file)
Note that abbreviations like "U.S.A." are not necessary, since any string consisting entirely of capital letters is automatically assigned the class ABBREVIATION.
| TOK | Dr. |
becomes
| TIT | Dr. |
This file is identical in format to tbl.abbrev.fr. It contains space-composed abbreviations (with the space replaced by underscore).
The file contains 3 columns:
For example, if this file contains the entry
| A._T. | ABBREVIATION |
then "A. T." is defined as an abbreviation, and the following sequence (generated by the subtool mtsegspace)
| TOK | A. |
| TOK | T. |
will be recombined to become
| ABBREV | A._T. |
This file contains the "clitics" of the language under consideration. These are not clitics in the strict linguistic sense; for our purposes, a clitic is anything separated from the rest of a string by a hyphen ("-", as in "est-il") or an apostrophe ("'", as in "l'avion"). Each one is assigned to one of two classes: PROCLITIC, for items appearing at the beginning of the string, or ENCLITIC, for items appearing at the end of the string. These class names are associated with the identified tokens in the output of the mtsegsplit module of the segmenter.
When the segmenter (specifically, the module mtsegsplit) encounters a token containing an apostrophe or hyphen, it consults the file tbl.split.xx (where "xx" is the language code; for French, the file is tbl.split.fr) to determine whether or not the token should be split into two (or more) separate tokens. For example, if mtsegsplit receives the following input,
| TOK | est-il |
it will check the file tbl.split.fr and see that "-il" is identified as an enclitic. The result will be the output of two tokens:
| TOK | est |
| ENC | -il |
The file tbl.split.xx also includes words containing instances of the identified enclitics and proclitics that should NOT be split into separate tokens--for example, "d'abord" in French, since in this case the "d'" is not due to an elision of "de" and "abord", and is therefore not a separate grammatical item. Thus although "d'" is identified as a proclitic in the tbl.split.fr file, the segmenter will leave "d'abord" as a single token, and change its class name from TOKEN to COMPOUND. Note that the process of isolating clitics is iterated, so that a string such as "a-t-il" is analyzed to be
| TOK | a |
| ENC | -t |
| ENC | -il |
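The iterated splitting, together with the list of exceptions, can be sketched as below. The three tables are hypothetical excerpts of tbl.split.fr, the label `PROC` for proclitics is an assumption (only `ENC` appears in the samples here), and the real mtsegsplit may proceed differently.

```python
# Hypothetical excerpts of tbl.split.fr.
PROCLITICS = ("d'", "l'")
ENCLITICS = ("-il", "-t")
NO_SPLIT = {"d'abord"}  # words that must stay in one piece


def split_clitics(string):
    """Peel clitics off both ends of a token until none is left."""
    if string in NO_SPLIT:
        return [("COMPOUND", string)]
    front, back = [], []
    changed = True
    while changed:
        changed = False
        for clitic in PROCLITICS:
            if string.startswith(clitic) and len(string) > len(clitic):
                front.append(("PROC", clitic))
                string = string[len(clitic):]
                changed = True
        for clitic in ENCLITICS:
            if string.endswith(clitic) and len(string) > len(clitic):
                back.insert(0, ("ENC", clitic))  # keep surface order
                string = string[:-len(clitic)]
                changed = True
    return front + [("TOK", string)] + back


print(split_clitics("a-t-il"))
# [('TOK', 'a'), ('ENC', '-t'), ('ENC', '-il')]
print(split_clitics("d'abord"))
# [('COMPOUND', "d'abord")]
```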
The file tbl.split.xx comprises 2 columns, separated by tabs:
Example: tbl.split.fr (French resource file)
In other instances, words which are separated by blanks, and which were therefore previously split into separate tokens by mtsegspace, should be regarded as a single token comprising a compound word. The module mtsegmerge uses the file tbl.merge.xx to determine which tokens to recombine into compounds. So, the input
| TOK | après |
| TOK | que |
becomes
| COMPOUND | après_que |
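A minimal sketch of this recombination follows. It assumes that entries are keyed by the underscore-joined sequence and that the longest matching sequence wins; neither detail is confirmed here, and `MERGE` is a hypothetical excerpt of tbl.merge.fr.

```python
# Hypothetical excerpt of tbl.merge.fr: joined sequence -> class label.
MERGE = {"après_que": "COMPOUND", "alors_que": "COMPOUND?"}

# Longest entry, counted in tokens.
MAX_LEN = max(key.count("_") + 1 for key in MERGE)


def merge_compounds(records):
    """Recombine adjacent records whose joined strings are listed in MERGE."""
    out, i = [], 0
    while i < len(records):
        # Try the longest window first (an assumption of this sketch).
        for n in range(min(MAX_LEN, len(records) - i), 1, -1):
            key = "_".join(string for _, string in records[i:i + n])
            if key in MERGE:
                out.append((MERGE[key], key))
                i += n
                break
        else:
            out.append(records[i])  # no compound starts here
            i += 1
    return out


print(merge_compounds([("TOK", "après"), ("TOK", "que")]))
# [('COMPOUND', 'après_que')]
```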
All the fixed grammatical words (multiword conjunctions, adverbs, etc.) should be listed in this file, especially if the text is to be subsequently processed by a statistical tagger. It is less clear that N+N, Adj+N, and N+prep+N sequences, which constitute an open lexicon, should be included for this purpose. Most of these sequences show some derivational behavior, which is beyond the scope of this subtool (e.g. "pomme de terre", "pommes de terre").
Note that it is possible to define a new class "COMPOUND?" which can be used to mark ambiguous sequences such as "alors que" (= "alors que" or "alors" "que"):
The file has 3 columns:
Example: tbl.merge.fr (French resource file)
This file is used by the segmenter's tool 'mtsegregex' to detect and recombine a date.
The date format is given with a regular expression in the mandatory entry 'MT_DATE'.
Space characters must be replaced by underscores in the expression.
The file has 3 columns:
Example: According to the file tbl.date.fr (French resource file), the sequence
| TOK | 10 |
| TOK | Juillet |
| TOK | 1997 |
will be recombined to become
| DATE | 10_Juillet_1997 |
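The sketch below shows one way a date expression with underscores in place of spaces could drive the recombination. The expression bound to `MT_DATE` is a hypothetical stand-in for the real entry in tbl.date.fr, the fixed three-token window is a simplification, and Python's `re` syntax may differ from the one mtsegregex expects.

```python
import re

MONTHS = ("Janvier|Février|Mars|Avril|Mai|Juin|Juillet|Août|"
          "Septembre|Octobre|Novembre|Décembre")

# Hypothetical MT_DATE expression: day, month name, four-digit year,
# with "_" standing for the spaces between tokens.
MT_DATE = re.compile(rf"\d{{1,2}}_(?:{MONTHS})_\d{{4}}")


def merge_dates(records, width=3):
    """Join each window of `width` records with "_" and test it as a date."""
    out, i = [], 0
    while i < len(records):
        window = records[i:i + width]
        candidate = "_".join(string for _, string in window)
        if len(window) == width and MT_DATE.fullmatch(candidate):
            out.append(("DATE", candidate))
            i += width
        else:
            out.append(records[i])
            i += 1
    return out


print(merge_dates([("TOK", "10"), ("TOK", "Juillet"), ("TOK", "1997")]))
# [('DATE', '10_Juillet_1997')]
```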
This file defines the format of digital numbers using regular expressions.
This resource file uses the same format as the punctuation table.
There is only one mandatory entry, MT_DIGITAL, which assigns digital numbers to the class DIGITAL.
Note that in the file, the space character inside a digital number is replaced by an underscore.
The file has 3 columns:
Example: tbl.digit.fr (French resource file)
If the file indicates that composite numbers can contain internal spaces and commas, then the following sequence (generated by the subtool mtsegspace)
| TOK | 1 |
| TOK | 342, |
| TOK | 625 |
| TOK | 56 |
will be recombined to become
| DIG | 1_342,625_56 |
This file contains the definition of those sequences comprising enumerations, that is, sequences such as "1." or "(a)" used to number the items of a list. It is used to detect, and sometimes recombine, such sequences and to assign them the class ENUMERATION.
Note that enumerations are detected only when they appear at the beginning of a paragraph or a chunk (this is set by the directive MT_CLASS_BEFORE).
In this file, many different kinds of enumerations can be defined, using regular expressions:
Then we have:
Only MT_ENUM is mandatory.
Example: tbl.enum.fr (French resource file)
| [CHUNK | <div> |
| TOK | -1. |
becomes
| [CHUNK | <div> |
| ENUM | -1. |
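The position constraint can be illustrated with a short sketch: a token is relabelled only if it matches the enumeration expression and the preceding record opens a chunk. The expression bound to `MT_ENUM` and the contents of `CLASS_BEFORE` are hypothetical; the real values come from tbl.enum.xx and the MT_CLASS_BEFORE directive, and a paragraph-opening label would presumably be listed as well.

```python
import re

# Hypothetical MT_ENUM expression covering "1.", "-1.", "a)" and "(a)".
MT_ENUM = re.compile(r"-?\d+\.|\(?[a-z]\)")

# Labels after which an enumeration may appear (assumed from the example).
CLASS_BEFORE = {"[CHUNK"}


def mark_enumerations(records):
    """Relabel TOK records as ENUM when they directly follow a chunk start."""
    out = []
    previous = None
    for cls, string in records:
        if cls == "TOK" and previous in CLASS_BEFORE and MT_ENUM.fullmatch(string):
            cls = "ENUM"
        out.append((cls, string))
        previous = cls
    return out


print(mark_enumerations([("[CHUNK", "<div>"), ("TOK", "-1.")]))
# [('[CHUNK', '<div>'), ('ENUM', '-1.')]
```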
Example: tbl.sent.fr (French resource file)