MULTEXT - Document MRC 1. MtRecode/Customizing.





Customizing MtRecode









What can be modified?

MtRecode can be customized in three ways to fit particular purposes:


Adding character sets

Adding a new character set involves two operations:

If you add character sets using only characters that already exist among the character sets provided, you do not need to worry about the last step.


Changing the external entities

The default SGML entities provided with MtRecode can be replaced entirely by custom entities that use a different notation. In that case, the external entity mapping table must be replaced by another one in the same format.

You can take the default external entities table as a starting point and modify it: names can be changed and additional entities can be added. When you have finished the modifications, pass the table to the tool with the -fext option.

Note: the resources provided in the package are provided only as examples, so modifying them has no effect on the tool. However, we strongly recommend that you look at them when creating new tables.


Table formats

Default tables are provided, but you can create your own. Copies of the default tables are in the directory where all the data is installed: tbl.char.iso_8859_10 for the ISO-8859-10 set, tbl.char.easy_french.src for the Easy French set and tbl.char.sgml.entities for the external entities table. Please look at them if you wish to make your own tables. The format of the tables is simple, but it must be respected.


Mapping tables

A mapping table gives, for one character set, the translation of each character into ISO 10646.

The format is as follows:

...
0xc8 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
0xc9 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
0xca 0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
0xcb 0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
...

The three fields are:

Notes:

A mapping table with the same format is also used to define the external entities. Example (SGML entities provided by default):

...
&Egrave;  0x00C8  #LATIN CAPITAL LETTER E WITH GRAVE
&Eacute;  0x00C9  #LATIN CAPITAL LETTER E WITH ACUTE
&Ecirc;   0x00CA  #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
&Euml;    0x00CB  #LATIN CAPITAL LETTER E WITH DIAERESIS
...

Note: The entity names are case sensitive.
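To make the mapping table format concrete, here is a minimal Python sketch of a reader for it. It is not part of MtRecode: the function name parse_mapping_table is hypothetical, and the sketch assumes that fields are separated by whitespace, that the first '#' on a line starts the comment, and that lines consisting of "..." are elisions to skip. The first field is kept as an uninterpreted string, so the same reader accepts either a code or an entity name, with its case preserved.

```python
def parse_mapping_table(text):
    """Parse mapping-table text into (first_field, iso_code, comment) tuples.

    first_field is kept as the raw string (a code or an entity name),
    iso_code is converted to an int, comment is the text after '#'.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, elisions and whole-line comments.
        if not line or line == "..." or line.startswith("#"):
            continue
        # Assumption: the first '#' on the line starts the comment field.
        body, _, comment = line.partition("#")
        fields = body.split()
        if len(fields) < 2:
            raise ValueError(f"expected two fields before the comment: {raw!r}")
        first_field, iso_hex = fields[0], fields[1]
        # int() with base 16 accepts the leading "0x" prefix.
        entries.append((first_field, int(iso_hex, 16), comment.strip()))
    return entries


sample = """\
...
0xc8 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
0xc9 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
...
"""
table = parse_mapping_table(sample)
for first_field, iso_code, name in table:
    print(first_field, hex(iso_code), name)
```

Because the comment field is optional, a line with only two fields is accepted and gets an empty comment.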


Reference table

This table is used as a reference to check the lexical order, the properties of each character, and the relations among them. All the characters are referenced by their ISO 10646 code (in hexadecimal).

The format of the reference table is as follows:
0045 0045 0045 0065 0045 10010000000000 LATIN CAPITAL LETTER E
0065 0065 0045 0065 0065 01010000000000 LATIN SMALL LETTER E
00C8 0045 00C8 00E8 0045 10000001000000 LATIN CAPITAL LETTER E WITH GRAVE
00E8 0065 00C8 00E8 0065 01000001000000 LATIN SMALL LETTER E WITH GRAVE
00C9 0045 00C9 00E9 0045 10000001000000 LATIN CAPITAL LETTER E WITH ACUTE
00E9 0065 00C9 00E9 0065 01000001000000 LATIN SMALL LETTER E WITH ACUTE
00CA 0045 00CA 00EA 0045 10000001000000 LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00EA 0065 00CA 00EA 0065 01000001000000 LATIN SMALL LETTER E WITH CIRCUMFLEX
00CB 0045 00CB 00EB 0045 10000001000000 LATIN CAPITAL LETTER E WITH DIAERESIS
00EB 0065 00CB 00EB 0065 01000001000000 LATIN SMALL LETTER E WITH DIAERESIS

The order of the lines in the table gives the lexical order: the table begins with control codes, followed by various symbols (mathematical, design, ...). The alphabetic letters (uppercase and lowercase, accented and unaccented) are located at the end of the table, in the following order (this can be changed by the user):

Each line of the table has seven tab-separated fields (lines that begin with '#' are comments and are not loaded):
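As an illustration of the rule that line order gives lexical order, the following Python sketch loads a few reference table lines and ranks characters by their position in the table. It is only a sketch under stated assumptions: the names load_reference_table and lexical_sort are hypothetical, the fields are split on any whitespace (not only tabs) so that the sample lines can be pasted as shown, and the seven fields are kept as plain strings without interpreting them.

```python
def load_reference_table(text):
    """Return (records, rank).

    records: list of 7-field tuples, in table order.
    rank: maps the ISO 10646 code in the first field (as an int)
          to the position of its line, i.e. its lexical rank.
    """
    records = []
    rank = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Lines beginning with '#' are comments and are not loaded.
        if not line or line.startswith("#"):
            continue
        # At most 6 splits, so the 7th field (the name) keeps its spaces.
        fields = line.split(None, 6)
        if len(fields) != 7:
            raise ValueError(f"expected seven fields: {raw!r}")
        rank[int(fields[0], 16)] = len(records)
        records.append(tuple(fields))
    return records, rank


sample = """\
# comment lines are skipped
0045 0045 0045 0065 0045 10010000000000 LATIN CAPITAL LETTER E
0065 0065 0045 0065 0065 01010000000000 LATIN SMALL LETTER E
00C8 0045 00C8 00E8 0045 10000001000000 LATIN CAPITAL LETTER E WITH GRAVE
00E8 0065 00C8 00E8 0065 01000001000000 LATIN SMALL LETTER E WITH GRAVE
"""
records, rank = load_reference_table(sample)


def lexical_sort(chars):
    """Sort characters by their line position in the loaded table."""
    return "".join(sorted(chars, key=lambda ch: rank[ord(ch)]))


print(lexical_sort("\u00e8Ee\u00c8"))
```

Reordering the lines of the sample changes the result of lexical_sort, which is how a user-defined order would take effect in this sketch.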


Example with a new character set

Defining variable width character sets

Suppose that, instead of the ISO-8859-1 character set, you want to use a special set (new_set) that writes French text with 7-bit characters only.
We can create the following table and save it as ex_mtrecode_new_table in the testsrc directory.
#
A2	0x00C0	#LATIN CAPITAL LETTER A WITH GRAVE
A3	0x00C2	#LATIN CAPITAL LETTER A WITH CIRCUMFLEX
A4	0x00C4	#LATIN CAPITAL LETTER A WITH DIAERESIS
C5	0x00C7	#LATIN CAPITAL LETTER C WITH CEDILLA
E2	0x00C8	#LATIN CAPITAL LETTER E WITH GRAVE
E1	0x00C9	#LATIN CAPITAL LETTER E WITH ACUTE
E3	0x00CA	#LATIN CAPITAL LETTER E WITH CIRCUMFLEX
E4	0x00CB	#LATIN CAPITAL LETTER E WITH DIAERESIS
I3	0x00CE	#LATIN CAPITAL LETTER I WITH CIRCUMFLEX
I4	0x00CF	#LATIN CAPITAL LETTER I WITH DIAERESIS
O3	0x00D4	#LATIN CAPITAL LETTER O WITH CIRCUMFLEX
O4	0x00D6	#LATIN CAPITAL LETTER O WITH DIAERESIS
U3	0x00DB	#LATIN CAPITAL LETTER U WITH CIRCUMFLEX
U4	0x00DC	#LATIN CAPITAL LETTER U WITH DIAERESIS
a2	0x00E0	#LATIN SMALL LETTER A WITH GRAVE
a3	0x00E2	#LATIN SMALL LETTER A WITH CIRCUMFLEX
a4	0x00E4	#LATIN SMALL LETTER A WITH DIAERESIS
c5	0x00E7	#LATIN SMALL LETTER C WITH CEDILLA
e2	0x00E8	#LATIN SMALL LETTER E WITH GRAVE
e1	0x00E9	#LATIN SMALL LETTER E WITH ACUTE
e3	0x00EA	#LATIN SMALL LETTER E WITH CIRCUMFLEX
e4	0x00EB	#LATIN SMALL LETTER E WITH DIAERESIS
i3	0x00EE	#LATIN SMALL LETTER I WITH CIRCUMFLEX
i4	0x00EF	#LATIN SMALL LETTER I WITH DIAERESIS
o3	0x00F4	#LATIN SMALL LETTER O WITH CIRCUMFLEX
o4	0x00F6	#LATIN SMALL LETTER O WITH DIAERESIS
u3	0x00FB	#LATIN SMALL LETTER U WITH CIRCUMFLEX
u4	0x00FC	#LATIN SMALL LETTER U WITH DIAERESIS

The first field contains the entity which represents the character, the second field contains its code in the ISO-10646 set, and the third field contains the Unicode name for the character (this field is optional). The table is incomplete on its own; it is loaded (by default) over the predefined ISO-646 table:

%% mtrecode -in latin1 -out ../testsrc/ex_mtstr_new_table
Et des chapeaux de bergères
Défendent notre fraîcheur
Et nos robes - si légères -
Sont d'une extrême blancheur ;

Et des chapeaux de berge2res
De1fendent notre frai3cheur
Et nos robes - si le1ge2res -
Sont d'une extre3me blancheur ;
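The conversion shown above can be imitated with a short Python sketch. It does not reproduce how MtRecode works internally; it only assumes that each character is looked up by its ISO 10646 code in the new table, and that 7-bit characters absent from the table pass through unchanged (standing in for the ISO-646 base table). The names NEW_SET and encode are hypothetical, and only the table rows needed for the sample text are included.

```python
# ISO 10646 code -> sequence in the new 7-bit set (subset of the table above).
NEW_SET = {
    0x00E8: "e2",  # LATIN SMALL LETTER E WITH GRAVE
    0x00E9: "e1",  # LATIN SMALL LETTER E WITH ACUTE
    0x00EA: "e3",  # LATIN SMALL LETTER E WITH CIRCUMFLEX
    0x00EE: "i3",  # LATIN SMALL LETTER I WITH CIRCUMFLEX
}


def encode(text, table):
    """Replace each character by its table sequence, if it has one."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in table:
            out.append(table[code])
        elif code < 0x80:
            # Assumption: plain 7-bit characters come from the base table.
            out.append(ch)
        else:
            raise ValueError(f"no mapping for U+{code:04X}")
    return "".join(out)


for line in (
    "Et des chapeaux de berg\u00e8res",
    "D\u00e9fendent notre fra\u00eecheur",
):
    print(encode(line, NEW_SET))
```

Each accented letter becomes two characters in the output, which is what makes the set variable width.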

Creating a many-to-one mapping

You can create a table where several ISO-10646 codes are associated with a single code in the character set. For example, the following table can be used to translate every accented letter into its unaccented form:
0x41	0x0041	#LATIN CAPITAL LETTER A
0x41	0x00C0	#LATIN CAPITAL LETTER A WITH GRAVE
0x41	0x00C1	#LATIN CAPITAL LETTER A WITH ACUTE
0x41	0x00C2	#LATIN CAPITAL LETTER A WITH CIRCUMFLEX
0x41	0x00C3	#LATIN CAPITAL LETTER A WITH TILDE
0x41	0x00C4	#LATIN CAPITAL LETTER A WITH DIAERESIS
0x41	0x00C5	#LATIN CAPITAL LETTER A WITH RING ABOVE
0x45	0x0045	#LATIN CAPITAL LETTER E
0x45	0x00C8	#LATIN CAPITAL LETTER E WITH GRAVE
0x45	0x00C9	#LATIN CAPITAL LETTER E WITH ACUTE
0x45	0x00CA	#LATIN CAPITAL LETTER E WITH CIRCUMFLEX
0x45	0x00CB	#LATIN CAPITAL LETTER E WITH DIAERESIS
0x43	0x0043	#LATIN CAPITAL LETTER C
0x43	0x00C7	#LATIN CAPITAL LETTER C WITH CEDILLA
0x49	0x0049	#LATIN CAPITAL LETTER I
0x49	0x00CC	#LATIN CAPITAL LETTER I WITH GRAVE
0x49	0x00CD	#LATIN CAPITAL LETTER I WITH ACUTE
0x49	0x00CE	#LATIN CAPITAL LETTER I WITH CIRCUMFLEX
0x49	0x00CF	#LATIN CAPITAL LETTER I WITH DIAERESIS
0x4e	0x004E	#LATIN CAPITAL LETTER N
0x4e	0x00D1	#LATIN CAPITAL LETTER N WITH TILDE
0x4f	0x004F	#LATIN CAPITAL LETTER O
0x4f	0x00D2	#LATIN CAPITAL LETTER O WITH GRAVE
0x4f	0x00D3	#LATIN CAPITAL LETTER O WITH ACUTE
0x4f	0x00D4	#LATIN CAPITAL LETTER O WITH CIRCUMFLEX
0x4f	0x00D5	#LATIN CAPITAL LETTER O WITH TILDE
0x4f	0x00D6	#LATIN CAPITAL LETTER O WITH DIAERESIS
0x4f	0x00D8	#LATIN CAPITAL LETTER O WITH STROKE
0x55	0x0055	#LATIN CAPITAL LETTER U
0x55	0x00D9	#LATIN CAPITAL LETTER U WITH GRAVE
0x55	0x00DA	#LATIN CAPITAL LETTER U WITH ACUTE
0x55	0x00DB	#LATIN CAPITAL LETTER U WITH CIRCUMFLEX
0x55	0x00DC	#LATIN CAPITAL LETTER U WITH DIAERESIS
0x59	0x0059	#LATIN CAPITAL LETTER Y
0x59	0x00DD	#LATIN CAPITAL LETTER Y WITH ACUTE
0x61	0x0061	#LATIN SMALL LETTER A
0x61	0x00E0	#LATIN SMALL LETTER A WITH GRAVE
0x61	0x00E1	#LATIN SMALL LETTER A WITH ACUTE
0x61	0x00E2	#LATIN SMALL LETTER A WITH CIRCUMFLEX
0x61	0x00E3	#LATIN SMALL LETTER A WITH TILDE
0x61	0x00E4	#LATIN SMALL LETTER A WITH DIAERESIS
0x61	0x00E5	#LATIN SMALL LETTER A WITH RING ABOVE
0x63	0x0063	#LATIN SMALL LETTER C
0x63	0x00E7	#LATIN SMALL LETTER C WITH CEDILLA
0x65	0x0065	#LATIN SMALL LETTER E
0x65	0x00E8	#LATIN SMALL LETTER E WITH GRAVE
0x65	0x00E9	#LATIN SMALL LETTER E WITH ACUTE
0x65	0x00EA	#LATIN SMALL LETTER E WITH CIRCUMFLEX
0x65	0x00EB	#LATIN SMALL LETTER E WITH DIAERESIS
0x69	0x0069	#LATIN SMALL LETTER I
0x69	0x00EC	#LATIN SMALL LETTER I WITH GRAVE
0x69	0x00ED	#LATIN SMALL LETTER I WITH ACUTE
0x69	0x00EE	#LATIN SMALL LETTER I WITH CIRCUMFLEX
0x69	0x00EF	#LATIN SMALL LETTER I WITH DIAERESIS
0x6e	0x006E	#LATIN SMALL LETTER N
0x6e	0x00F1	#LATIN SMALL LETTER N WITH TILDE
0x6f	0x006F	#LATIN SMALL LETTER O
0x6f	0x00F2	#LATIN SMALL LETTER O WITH GRAVE
0x6f	0x00F3	#LATIN SMALL LETTER O WITH ACUTE
0x6f	0x00F4	#LATIN SMALL LETTER O WITH CIRCUMFLEX
0x6f	0x00F5	#LATIN SMALL LETTER O WITH TILDE
0x6f	0x00F6	#LATIN SMALL LETTER O WITH DIAERESIS
0x6f	0x00F8	#LATIN SMALL LETTER O WITH STROKE
0x75	0x0075	#LATIN SMALL LETTER U
0x75	0x00F9	#LATIN SMALL LETTER U WITH GRAVE
0x75	0x00FA	#LATIN SMALL LETTER U WITH ACUTE
0x75	0x00FB	#LATIN SMALL LETTER U WITH CIRCUMFLEX
0x75	0x00FC	#LATIN SMALL LETTER U WITH DIAERESIS
0x79	0x0079	#LATIN SMALL LETTER Y
0x79	0x00FD	#LATIN SMALL LETTER Y WITH ACUTE
0x79	0x00FF	#LATIN SMALL LETTER Y WITH DIAERESIS

The code 0x65 appears five times in this table, each time with a different ISO-10646 code. These ISO-10646 codes correspond to the characters 'e', 'è', 'é', 'ê' and 'ë'; the repeated code 0x65 is the character 'e' in ISO-8859-1.
Therefore, if you load this table over the ISO-646 character set, several ISO-10646 codes are associated with a single character in the set.

mtrecode -in latin1 -out ../testsrc/ex_mtstr_spec_table
Et des chapeaux de bergères
Défendent notre fraîcheur
Et nos robes - si légères -
Sont d'une extrême blancheur ;

Et des chapeaux de bergeres
Defendent notre fraicheur
Et nos robes - si legeres -
Sont d'une extreme blancheur ;

The accented letters are translated to the associated unaccented letters.
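The many-to-one relation can be sketched in Python as follows. This is an illustration under assumptions, not MtRecode's implementation: ROWS holds only a few (set code, ISO 10646 code) pairs taken from the table above, the names to_set, from_set and strip_accents are hypothetical, and characters that are absent from ROWS are passed through unchanged.

```python
from collections import defaultdict

# (code in the character set, ISO 10646 code), a subset of the table above.
ROWS = [
    (0x65, 0x0065), (0x65, 0x00E8), (0x65, 0x00E9),
    (0x65, 0x00EA), (0x65, 0x00EB),
    (0x69, 0x0069), (0x69, 0x00EE),
]

# ISO 10646 -> set code: each ISO code has exactly one target.
to_set = {iso: code for code, iso in ROWS}

# Set code -> ISO 10646 codes: one set code has several sources.
from_set = defaultdict(list)
for code, iso in ROWS:
    from_set[code].append(iso)


def strip_accents(text):
    """Map each character to its set code; unknown characters pass through."""
    return "".join(chr(to_set.get(ord(ch), ord(ch))) for ch in text)


print(strip_accents("D\u00e9fendent notre fra\u00eecheur"))
print([hex(iso) for iso in from_set[0x65]])
```

In this sketch the conversion toward the character set is unambiguous, while the reverse lookup for 0x65 yields five candidate ISO-10646 codes, so the accents cannot be recovered from the converted text.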




Copyright © Centre National de la Recherche Scientifique, 1996.