Customizing MtRecode
Adding a new character set involves up to three operations:
- writing a mapping table for this character set, mapping its characters to ISO 10646;
- adding the appropriate external entities in the external entities mapping table: since this table is used as a "fallback", for example when translating to poorer character sets, you must make sure that all characters in the character set you are defining are in the external entity table;
- adding the lines concerning those characters in the reference table if you want to use the -upper, -lower or -unacc options on those characters.
If you add character sets using only characters that already exist among the character sets provided, you do not need to worry about the last step.
It is possible to completely replace the default SGML entities provided with MtRecode with another, custom, type of entities using another kind of notation. In that case the external entity mapping table must be replaced by another one with the same format.
You can look at the default external entities table and modify it: names can be changed, additional entities can be added. When you have finished the modifications, you can give it to the tool with the -fext option.
Note: the resources provided in the package are provided only as examples, so any modifications you make to them have no effect on the tool. However, we strongly recommend that you look at them when creating new tables.
Mapping tables are used to give the translation of characters into ISO 10646 for each character set.
The format is as follows:
    ...
    0xc8 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
    0xc9 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
    0xca 0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    0xcb 0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
    ...
The three fields are:
- the code of the character in the set (in hexadecimal: 0xhh) or its associated entity;
- the code of the character in ISO 10646 (in hexadecimal: 0xhhhh);
- the description of the character (this field is optional).
Notes:
- the hexadecimal codes are not case sensitive;
- the order of the lines does not matter;
- be careful: if you do not define all the characters of the set (128 or 256), do not use one of the undefined codes. For example, in ISO 8859-6 (Arabic), there is no entry for the codes 'ae' to 'ba';
- you can create a table in which several characters of the set are associated with a single ISO-10646 code. This is useful when you want several characters to be translated into the same one, for example if you want the letters 'é', 'è', 'ê' and 'ë' to be associated with the letter 'e'.
A mapping table with the same format is also used to define the external entities. Example (SGML entities provided by default):
    ...
    È 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
    É 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
    Ê 0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    Ë 0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
    ...

Note: The entity names are case sensitive.
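As an illustration of the three-field format described above, the following Python sketch reads mapping-table lines into tuples. It is not part of MtRecode: the function name `parse_mapping_table` is hypothetical, and the sketch assumes that fields are separated by whitespace and that a first field not starting with `0x` is an entity.

```python
def parse_mapping_table(text):
    """Return a list of (source, iso10646_code, description) tuples.

    Assumptions (not confirmed by the MtRecode documentation): fields are
    separated by whitespace, and a first field that does not start with
    '0x' is kept verbatim as an entity.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, elision marks and whole-line comments.
        if not line or line == "..." or line.startswith("#"):
            continue
        fields = line.split(None, 2)
        if len(fields) < 2:
            raise ValueError(f"expected at least two fields: {raw!r}")
        source, code = fields[0], fields[1]
        # The optional third field is the description, introduced by '#'.
        description = fields[2].lstrip("#").strip() if len(fields) == 3 else None
        # Hexadecimal codes are not case sensitive; int() accepts either case.
        if source.lower().startswith("0x"):
            source = int(source, 16)
        entries.append((source, int(code, 16), description))
    return entries


sample = """\
0xc8 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
0xC9 0x00c9
E2 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
"""
for entry in parse_mapping_table(sample):
    print(entry)
```

Keeping the first field as either an integer or a string mirrors the rule that it holds "the code of the character in the set or its associated entity".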
The reference table is used to check the lexical order, the properties of each character, and the relations among characters. All the characters are referenced by their ISO 10646 code (in hexadecimal).
The format of the reference table is as follows:
    0045 0045 0045 0065 0045 10010000000000 LATIN CAPITAL LETTER E
    0065 0065 0045 0065 0065 01010000000000 LATIN SMALL LETTER E
    00C8 0045 00C8 00E8 0045 10000001000000 LATIN CAPITAL LETTER E WITH GRAVE
    00E8 0065 00C8 00E8 0065 01000001000000 LATIN SMALL LETTER E WITH GRAVE
    00C9 0045 00C9 00E9 0045 10000001000000 LATIN CAPITAL LETTER E WITH ACUTE
    00E9 0065 00C9 00E9 0065 01000001000000 LATIN SMALL LETTER E WITH ACUTE
    00CA 0045 00CA 00EA 0045 10000001000000 LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    00EA 0065 00CA 00EA 0065 01000001000000 LATIN SMALL LETTER E WITH CIRCUMFLEX
    00CB 0045 00CB 00EB 0045 10000001000000 LATIN CAPITAL LETTER E WITH DIAERESIS
    00EB 0065 00CB 00EB 0065 01000001000000 LATIN SMALL LETTER E WITH DIAERESIS

The order of the lines in the table gives the lexical order: the table begins with control codes, followed by various symbols (mathematical, design, ...). The alphabetic letters (uppercase and lowercase, accented and unaccented) are located at the end of the table, in the following order (this can be changed by the user):
- an uppercase letter comes before the corresponding lowercase letter;
- an accented character comes after its corresponding unaccented letter. For example, the letter 'é' must be placed (in French) after 'e' but before 'f'.
Each line of the table has seven fields separated by tabs (lines that begin with '#' are comments and are not loaded):
- The character (its CODE)
The content of this field is the ISO-10646 code for the character itself. The position of a character in the list gives its position in the lexical order, which can be modified.
- The associated UPPER character
This field gives the associated capitalized letter, or, if the character is already uppercase or is a symbol, the same character appears in this field. For example, the uppercase equivalent of 'E' is 'E'; for 'é', it is 'É'.
- The associated LOWER character
This field gives the associated lowercase letter, or, if the character is already lowercase or is a symbol, the same character appears in this field. For example, the lowercase equivalent of 'E' is 'e'; for 'é', it is 'é'.
- The associated unaccented character (UNACC)
This field gives the associated unaccented letter or, if the character is unaccented, the same character appears in this field. For example, the unaccented equivalent of 'E' is 'E'; for 'é', it is 'e'.
- The associated FEATURES
Note: this field is not used by the current version of MtRecode.
A set of features is associated with each character. It is encoded as a field of binary values in which each column corresponds to a particular feature: if the value in a column is '1', the feature applies to the character; if it is '0', it does not. The features are as follows:
- upper : character is an uppercase letter
- lower : character is a lowercase letter
- digit : character is a digit
- xdigit : character is a hexadecimal digit
- space : character is a space
- punct : character is a punctuation mark
- cntrl : character is a control character
- accented : character is an accented letter
- blank : character is a blank
- diacritical : character is a diacritic
- num|graph : character is a numeric or graphical character
- publishing : character is a publishing character
- maths : character is a mathematical symbol
- phonetic : character is a phonetic symbol
- The ISO 10646/Unicode description of the character
Note: this field is not used by the current version of MtRecode.
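The FEATURES field can be decoded with a few lines of Python. This is a sketch, not MtRecode code: the name `decode_features` is hypothetical, and it assumes that the columns read left to right in the order in which the features are listed above. That assumption agrees with the sample lines, where 'E' is flagged as upper and xdigit and 'È' as upper and accented.

```python
# Feature names in the order they are listed in this document.
# Assumption: this is also the left-to-right column order of the field.
FEATURES = (
    "upper", "lower", "digit", "xdigit", "space", "punct", "cntrl",
    "accented", "blank", "diacritical", "num|graph", "publishing",
    "maths", "phonetic",
)


def decode_features(bits):
    """Return the names of the features whose column is set to '1'."""
    if len(bits) != len(FEATURES) or set(bits) - {"0", "1"}:
        raise ValueError(f"expected {len(FEATURES)} binary digits, got {bits!r}")
    return [name for name, bit in zip(FEATURES, bits) if bit == "1"]


print(decode_features("10010000000000"))  # LATIN CAPITAL LETTER E
print(decode_features("01000001000000"))  # LATIN SMALL LETTER E WITH GRAVE
```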
Suppose that, instead of the ISO-8859-1 character set, you want to use a special set (new_set) for writing French texts with seven-bit characters.
We can create the following table and save it as ex_mtrecode_new_table in the testsrc directory.
    #
    A2 0x00C0 #LATIN CAPITAL LETTER A WITH GRAVE
    A3 0x00C2 #LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    A4 0x00C4 #LATIN CAPITAL LETTER A WITH DIAERESIS
    C5 0x00C7 #LATIN CAPITAL LETTER C WITH CEDILLA
    E2 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
    E1 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
    E3 0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    E4 0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
    I3 0x00CE #LATIN CAPITAL LETTER I WITH CIRCUMFLEX
    I4 0x00CF #LATIN CAPITAL LETTER I WITH DIAERESIS
    O3 0x00D4 #LATIN CAPITAL LETTER O WITH CIRCUMFLEX
    O4 0x00D6 #LATIN CAPITAL LETTER O WITH DIAERESIS
    U3 0x00DB #LATIN CAPITAL LETTER U WITH CIRCUMFLEX
    U4 0x00DC #LATIN CAPITAL LETTER U WITH DIAERESIS
    a2 0x00E0 #LATIN SMALL LETTER A WITH GRAVE
    a3 0x00E2 #LATIN SMALL LETTER A WITH CIRCUMFLEX
    a4 0x00E4 #LATIN SMALL LETTER A WITH DIAERESIS
    c5 0x00E7 #LATIN SMALL LETTER C WITH CEDILLA
    e2 0x00E8 #LATIN SMALL LETTER E WITH GRAVE
    e1 0x00E9 #LATIN SMALL LETTER E WITH ACUTE
    e3 0x00EA #LATIN SMALL LETTER E WITH CIRCUMFLEX
    e4 0x00EB #LATIN SMALL LETTER E WITH DIAERESIS
    i3 0x00EE #LATIN SMALL LETTER I WITH CIRCUMFLEX
    i4 0x00EF #LATIN SMALL LETTER I WITH DIAERESIS
    o3 0x00F4 #LATIN SMALL LETTER O WITH CIRCUMFLEX
    o4 0x00F6 #LATIN SMALL LETTER O WITH DIAERESIS
    u3 0x00FB #LATIN SMALL LETTER U WITH CIRCUMFLEX
    u4 0x00FC #LATIN SMALL LETTER U WITH DIAERESIS
The first field contains the entity which represents the character, the second field contains its code in the ISO-10646 set, and the third field contains the Unicode name for the character (this field is optional). This table is incomplete and is loaded (by default) over the pre-defined table ISO-646:
    %% mtrecode -in latin1 -out ../testsrc/ex_mtstr_new_table
    Et des chapeaux de bergères
    Défendent notre fraîcheur
    Et nos robes - si légères -
    Sont d'une extrême blancheur ;

    Et des chapeaux de berge2res
    De1fendent notre frai3cheur
    Et nos robes - si le1ge2res -
    Sont d'une extre3me blancheur ;
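The substitution shown in this example can be imitated in a few lines of Python. This is only an illustrative sketch, not how MtRecode is implemented: the names `NEW_SET` and `encode_with_entities` are hypothetical, the dictionary holds only a few lines of the table above, and characters absent from it are assumed to pass through unchanged, standing in for the underlying ISO-646 table.

```python
# A few lines of the new_set table, keyed by ISO 10646 code.
NEW_SET = {
    0x00C9: "E1",  # LATIN CAPITAL LETTER E WITH ACUTE
    0x00E8: "e2",  # LATIN SMALL LETTER E WITH GRAVE
    0x00E9: "e1",  # LATIN SMALL LETTER E WITH ACUTE
    0x00EA: "e3",  # LATIN SMALL LETTER E WITH CIRCUMFLEX
    0x00EE: "i3",  # LATIN SMALL LETTER I WITH CIRCUMFLEX
}


def encode_with_entities(text, table):
    """Replace each character found in the table by its entity.

    Characters without an entry are copied as they are (assumption: they
    are covered by the base table the new table is loaded over).
    """
    return "".join(table.get(ord(ch), ch) for ch in text)


print(encode_with_entities("Défendent notre fraîcheur", NEW_SET))
```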
You can create a table in which several characters are associated with a single code in ISO-10646. For example, the following table can be used to translate every accented letter into its unaccented form:
    0x41 0x0041 #LATIN CAPITAL LETTER A
    0x41 0x00C0 #LATIN CAPITAL LETTER A WITH GRAVE
    0x41 0x00C1 #LATIN CAPITAL LETTER A WITH ACUTE
    0x41 0x00C2 #LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    0x41 0x00C3 #LATIN CAPITAL LETTER A WITH TILDE
    0x41 0x00C4 #LATIN CAPITAL LETTER A WITH DIAERESIS
    0x41 0x00C5 #LATIN CAPITAL LETTER A WITH RING ABOVE
    0x45 0x0045 #LATIN CAPITAL LETTER E
    0x45 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
    0x45 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
    0x45 0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    0x45 0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
    0x43 0x0043 #LATIN CAPITAL LETTER C
    0x43 0x00C7 #LATIN CAPITAL LETTER C WITH CEDILLA
    0x49 0x0049 #LATIN CAPITAL LETTER I
    0x49 0x00CC #LATIN CAPITAL LETTER I WITH GRAVE
    0x49 0x00CD #LATIN CAPITAL LETTER I WITH ACUTE
    0x49 0x00CE #LATIN CAPITAL LETTER I WITH CIRCUMFLEX
    0x49 0x00CF #LATIN CAPITAL LETTER I WITH DIAERESIS
    0x4e 0x004E #LATIN CAPITAL LETTER N
    0x4e 0x00D1 #LATIN CAPITAL LETTER N WITH TILDE
    0x4f 0x004F #LATIN CAPITAL LETTER O
    0x4f 0x00D2 #LATIN CAPITAL LETTER O WITH GRAVE
    0x4f 0x00D3 #LATIN CAPITAL LETTER O WITH ACUTE
    0x4f 0x00D4 #LATIN CAPITAL LETTER O WITH CIRCUMFLEX
    0x4f 0x00D5 #LATIN CAPITAL LETTER O WITH TILDE
    0x4f 0x00D6 #LATIN CAPITAL LETTER O WITH DIAERESIS
    0x4f 0x00D8 #LATIN CAPITAL LETTER O WITH STROKE
    0x55 0x0055 #LATIN CAPITAL LETTER U
    0x55 0x00D9 #LATIN CAPITAL LETTER U WITH GRAVE
    0x55 0x00DA #LATIN CAPITAL LETTER U WITH ACUTE
    0x55 0x00DB #LATIN CAPITAL LETTER U WITH CIRCUMFLEX
    0x55 0x00DC #LATIN CAPITAL LETTER U WITH DIAERESIS
    0x59 0x0059 #LATIN CAPITAL LETTER Y
    0x59 0x00DD #LATIN CAPITAL LETTER Y WITH ACUTE
    0x61 0x0061 #LATIN SMALL LETTER A
    0x61 0x00E0 #LATIN SMALL LETTER A WITH GRAVE
    0x61 0x00E1 #LATIN SMALL LETTER A WITH ACUTE
    0x61 0x00E2 #LATIN SMALL LETTER A WITH CIRCUMFLEX
    0x61 0x00E3 #LATIN SMALL LETTER A WITH TILDE
    0x61 0x00E4 #LATIN SMALL LETTER A WITH DIAERESIS
    0x61 0x00E5 #LATIN SMALL LETTER A WITH RING ABOVE
    0x63 0x0063 #LATIN SMALL LETTER C
    0x63 0x00E7 #LATIN SMALL LETTER C WITH CEDILLA
    0x65 0x0065 #LATIN SMALL LETTER E
    0x65 0x00E8 #LATIN SMALL LETTER E WITH GRAVE
    0x65 0x00E9 #LATIN SMALL LETTER E WITH ACUTE
    0x65 0x00EA #LATIN SMALL LETTER E WITH CIRCUMFLEX
    0x65 0x00EB #LATIN SMALL LETTER E WITH DIAERESIS
    0x69 0x0069 #LATIN SMALL LETTER I
    0x69 0x00EC #LATIN SMALL LETTER I WITH GRAVE
    0x69 0x00ED #LATIN SMALL LETTER I WITH ACUTE
    0x69 0x00EE #LATIN SMALL LETTER I WITH CIRCUMFLEX
    0x69 0x00EF #LATIN SMALL LETTER I WITH DIAERESIS
    0x6e 0x006E #LATIN SMALL LETTER N
    0x6e 0x00F1 #LATIN SMALL LETTER N WITH TILDE
    0x6f 0x006F #LATIN SMALL LETTER O
    0x6f 0x00F2 #LATIN SMALL LETTER O WITH GRAVE
    0x6f 0x00F3 #LATIN SMALL LETTER O WITH ACUTE
    0x6f 0x00F4 #LATIN SMALL LETTER O WITH CIRCUMFLEX
    0x6f 0x00F5 #LATIN SMALL LETTER O WITH TILDE
    0x6f 0x00F6 #LATIN SMALL LETTER O WITH DIAERESIS
    0x6f 0x00F8 #LATIN SMALL LETTER O WITH STROKE
    0x75 0x0075 #LATIN SMALL LETTER U
    0x75 0x00F9 #LATIN SMALL LETTER U WITH GRAVE
    0x75 0x00FA #LATIN SMALL LETTER U WITH ACUTE
    0x75 0x00FB #LATIN SMALL LETTER U WITH CIRCUMFLEX
    0x75 0x00FC #LATIN SMALL LETTER U WITH DIAERESIS
    0x79 0x0079 #LATIN SMALL LETTER Y
    0x79 0x00FD #LATIN SMALL LETTER Y WITH ACUTE
    0x79 0x00FF #LATIN SMALL LETTER Y WITH DIAERESIS
The code 0x65 appears five times in this table, each time with a different ISO-10646 code. These ISO-10646 codes correspond to the characters 'e', 'è', 'é', 'ê' and 'ë'; the repeated code 0x65 is the character 'e' in ISO-8859-1.
Therefore, if you load this table over the ISO-646 character set, several ISO-10646 codes are associated with a single character in the set.
    mtrecode -in latin1 -out ../testsrc/ex_mtstr_spec_table
    Et des chapeaux de bergères
    Défendent notre fraîcheur
    Et nos robes - si légères -
    Sont d'une extrême blancheur ;

    Et des chapeaux de bergeres
    Defendent notre fraicheur
    Et nos robes - si legeres -
    Sont d'une extreme blancheur ;

The accented letters are translated to the associated unaccented letters.
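A many-to-one table of this kind works in one direction only: several ISO-10646 codes lead to the same code in the set, so the original accents cannot be recovered from the output. The Python sketch below illustrates this with the five 0x65 lines of the table. The names `ROWS`, `group_by_set_code` and `to_set_bytes` are hypothetical, and the fallback to plain seven-bit codes is an assumption standing in for the ISO-646 base table; this is not MtRecode's own code.

```python
from collections import defaultdict

# (code in the set, ISO 10646 code) pairs: the 0x65 lines of the table.
ROWS = [
    (0x65, 0x0065),  # e
    (0x65, 0x00E8),  # e with grave
    (0x65, 0x00E9),  # e with acute
    (0x65, 0x00EA),  # e with circumflex
    (0x65, 0x00EB),  # e with diaeresis
]


def group_by_set_code(rows):
    """Collect, for each code in the set, every ISO 10646 code mapped to it."""
    groups = defaultdict(list)
    for set_code, iso_code in rows:
        groups[set_code].append(iso_code)
    return dict(groups)


def to_set_bytes(text, rows):
    """Translate text into bytes of the target set.

    Characters missing from the table are kept only if they are seven-bit
    codes (assumption: they come from the base table).
    """
    lookup = {iso_code: set_code for set_code, iso_code in rows}
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code in lookup:
            out.append(lookup[code])
        elif code < 0x80:
            out.append(code)
        else:
            raise ValueError(f"no mapping for U+{code:04X}")
    return bytes(out)


print(len(group_by_set_code(ROWS)[0x65]))
print(to_set_bytes("si légères", ROWS))
```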