MULTEXT - Document MRC 1. MtRecode/Overview.





MtRecode: overview






Purpose

MtRecode is a program for translation between various character sets, developed in the framework of the MULTEXT project.

Its main features are as follows:


- It understands SGML entities in the input and translates each one to the corresponding character, if that character exists in the target character set:

Example:
mtrecode -in latin1 -out ascii



- It uses SGML entities as a fallback in the output when exact translation is not possible:

Example:
mtrecode -in latin1 -out ascii



- It handles both fixed-width and variable-width character sets.

Fixed-width character sets are character sets in which all characters are coded using the same number of bytes (e.g., one byte, as in ISO 646 and ISO 8859-1). Variable-width character sets are character sets in which characters can be coded using one or more bytes.

Example:
mtrecode -in latin1 -out easy_french

The example above shows translation to EasyFrench, which is a variable-width character set: it uses one byte to encode e, but two bytes (e') to encode the character "e with acute accent".


- It can easily be customized, to add new character sets or to change the default entities, by adding or modifying the resource tables.




Principles

MtRecode is based on the MULTEXT MtStr multilingual string library, and uses ISO 10646 as the pivot for all translations. Each input character (or entity) is first translated into its ISO 10646 code, and is then mapped into the target character set. If the appropriate character does not exist in the target character set, an SGML entity is output instead. The user can disable this entity fallback mechanism.

MtRecode supports both fixed-width and variable-width character sets. Fixed-width character sets are character sets in which all characters are coded using the same number of bytes (e.g., one byte for ISO 646 and ISO 8859-1). Variable-width character sets are character sets in which characters can be coded using one or several bytes. For example, the EasyFrench character set uses one byte to encode e, but two bytes to encode the character "e with acute accent": e'.
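Reading a variable-width character set can be illustrated with a longest-match scan. The following is only a minimal sketch, not MtRecode's actual code: the table `MULTI` and the function `decode_variable_width` are hypothetical names, and the only multi-byte sequence included is the e' case mentioned above. All other bytes are assumed to be plain ASCII.

```python
# Hypothetical EasyFrench-like table: byte sequence -> ISO 10646 code.
# Only the e' -> U+00E9 pair is taken from the text; a real table has more.
MULTI = {b"e'": 0x00E9}
MAX_LEN = max(len(seq) for seq in MULTI)


def decode_variable_width(data):
    """Return ISO 10646 codes, trying the longest byte sequence first."""
    codes = []
    pos = 0
    while pos < len(data):
        for width in range(MAX_LEN, 1, -1):
            chunk = data[pos:pos + width]
            if chunk in MULTI:
                codes.append(MULTI[chunk])
                pos += width
                break
        else:
            # No multi-byte sequence starts here: treat the byte as ASCII,
            # whose codes coincide with ISO 10646.
            codes.append(data[pos])
            pos += 1
    return codes


codes = decode_variable_width(b"e'te'")
```

Trying the longest sequence first is what lets e followed by ' be read as one character while a lone e is still read as a single byte.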

Mapping tables

In order to provide uniform treatment of characters and strings regardless of the character set used, all characters are first mapped into their ISO 10646 code. For this purpose the library uses a set of mapping tables that provide the translation of characters into ISO 10646 for each character set.

Note that the format of the character set mapping tables is the same as the one used by Unicode. You can download additional mapping tables from the Unicode site.

Example (ISO 8859-1):
...
0xc8 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
0xc9 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
0xca 0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
0xcb 0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
...

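The three fields are: the code of the character in the given character set (in hexadecimal), the corresponding ISO 10646 code, and a comment giving the name of the character.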
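A table in this format is easy to read programmatically. The sketch below is an illustration only: `parse_mapping_table` is a hypothetical helper, and the column interpretation (source code, ISO 10646 code, trailing comment) is inferred from the ISO 8859-1 sample above.

```python
SAMPLE = """\
0xc8 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
0xc9 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
0xca 0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
0xcb 0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
"""


def parse_mapping_table(text):
    """Map each character-set code to its ISO 10646 code."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-only lines
        source, target, *_comment = line.split(None, 2)
        table[int(source, 16)] = int(target, 16)
    return table


to_pivot = parse_mapping_table(SAMPLE)
# Inverting the table gives the direction needed for output.
from_pivot = {iso: code for code, iso in to_pivot.items()}
```

The inverted dictionary assumes each ISO 10646 code appears only once in the table, which holds for the sample.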

Example (SGML entities):
...
&Egrave; 0x00C8 #LATIN CAPITAL LETTER E WITH GRAVE
&Eacute; 0x00C9 #LATIN CAPITAL LETTER E WITH ACUTE
&Ecirc;  0x00CA #LATIN CAPITAL LETTER E WITH CIRCUMFLEX
&Euml;   0x00CB #LATIN CAPITAL LETTER E WITH DIAERESIS
...


The reference table

MtRecode also uses a reference table, which indicates character relations for potentially all characters in ISO 10646. In the current version, this table contains only properties and relations for the union of all characters in the character sets supported. It is possible to have one such table per language and/or application, since the properties, lexical order, and relations may change between languages and/or applications.

Example:
...
0045 0045 0045 0065 0045 10010000000000 LATIN CAPITAL LETTER E
0065 0065 0045 0065 0065 01010000000000 LATIN SMALL LETTER E
00C8 0045 00C8 00E8 0045 10000001000000 LATIN CAPITAL LETTER E WITH GRAVE
00E8 0065 00C8 00E8 0065 01000001000000 LATIN SMALL LETTER E WITH GRAVE
00C9 0045 00C9 00E9 0045 10000001000000 LATIN CAPITAL LETTER E WITH ACUTE
00E9 0065 00C9 00E9 0065 01000001000000 LATIN SMALL LETTER E WITH ACUTE
...

This table has seven fields:

The last two fields are not used by mtrecode.
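The reference table rows can be read in the same spirit. This sketch rests on an assumption that should be checked against the actual table documentation: judging from the sample rows alone, the first four columns look like the character's own code, its unaccented base letter, its uppercase form, and its lowercase form. The names `parse_reference_table` and `to_upper` are hypothetical, and the fifth column is deliberately left uninterpreted.

```python
SAMPLE = """\
0045 0045 0045 0065 0045 10010000000000 LATIN CAPITAL LETTER E
0065 0065 0045 0065 0065 01010000000000 LATIN SMALL LETTER E
00C8 0045 00C8 00E8 0045 10000001000000 LATIN CAPITAL LETTER E WITH GRAVE
00E8 0065 00C8 00E8 0065 01000001000000 LATIN SMALL LETTER E WITH GRAVE
00C9 0045 00C9 00E9 0045 10000001000000 LATIN CAPITAL LETTER E WITH ACUTE
00E9 0065 00C9 00E9 0065 01000001000000 LATIN SMALL LETTER E WITH ACUTE
"""


def parse_reference_table(text):
    """Index rows by ISO 10646 code (column meanings are assumed)."""
    rows = {}
    for line in text.splitlines():
        fields = line.split(None, 6)
        if len(fields) < 7:
            continue  # not a complete seven-field row
        code, base, upper, lower = (int(f, 16) for f in fields[:4])
        rows[code] = {"base": base, "upper": upper, "lower": lower}
    return rows


def to_upper(code, rows):
    """Return the uppercase counterpart, or the code itself if unknown."""
    return rows[code]["upper"] if code in rows else code


REFERENCE = parse_reference_table(SAMPLE)
```

Under that reading, `to_upper(0x00E9, REFERENCE)` yields 0x00C9, which is the kind of lookup an uppercasing option would need.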


MtRecode in operation

The MtRecode tool uses four tables simultaneously: the Input table, the Output table, the External table, and the Reference table.

The Input table describes the character set used to read the input, and the Output table the character set used to print the output. These tables may also contain entities, such as SGML entities. The External table contains entities only; it is used primarily as a "backup", providing a means to read and print characters that are not represented in the table of the current character set (Input or Output).

Note that the default tables provided with the package are hard-coded in the C code. User-defined tables must be provided in the format described above.

First, the tool checks each character in the string passed to it, which may be either an external entity (and therefore present in the External table) or a regular character (and therefore present in the Input table). In either case, the tables provide the ISO 10646 code associated with the character.
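The input step described above might be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: `INPUT_TABLE`, `EXTERNAL_TABLE` and `read_codes` are hypothetical names, the Input table is ISO 8859-1 (whose codes coincide with the first 256 ISO 10646 codes), and the External table holds just two standard SGML entities.

```python
import re

# Assumed Input table: ISO 8859-1, identical to ISO 10646 for codes 0-255.
INPUT_TABLE = {byte: byte for byte in range(256)}
# Assumed External table: a tiny subset of SGML entities.
EXTERNAL_TABLE = {b"&eacute;": 0x00E9, b"&Eacute;": 0x00C9}

ENTITY = re.compile(rb"&[A-Za-z][A-Za-z0-9]*;")


def read_codes(data):
    """Translate input bytes into a list of ISO 10646 codes."""
    codes = []
    pos = 0
    while pos < len(data):
        match = ENTITY.match(data, pos)
        if match and match.group() in EXTERNAL_TABLE:
            # A known external entity: take its code from the External table.
            codes.append(EXTERNAL_TABLE[match.group()])
            pos = match.end()
        else:
            # A regular character: look it up in the Input table.
            codes.append(INPUT_TABLE[data[pos]])
            pos += 1
    return codes


pivot = read_codes(b"&eacute;t\xe9")
```

In this sketch an unrecognized entity is simply read character by character; the source does not say how MtRecode treats that case.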

Some options require processing using the Reference Table as a next step. For example, at this point the -upper() option consults the Reference Table to find the uppercase counterpart of each input character.

Finally, the reverse process is applied to output the desired result. The Output Table is first consulted with the ISO 10646 code of the output character as the key. If the character is present in that table, the table provides the mapping to the current character set and the character is displayed. Otherwise, if the character does not belong to the current character set, the External Entity Table is consulted to determine the appropriate external entity, which is then output.
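The output step can be sketched in the same way. Again the names (`OUTPUT_TABLE`, `EXTERNAL_ENTITIES`, `write_codes`) are hypothetical, the Output table is assumed to be plain ASCII, and the `?` written when no mapping is available is a placeholder chosen for this sketch; the source does not specify what MtRecode does in that case.

```python
# Assumed Output table: ISO 10646 code -> ASCII byte (codes 0-127 only).
OUTPUT_TABLE = {code: code for code in range(0x80)}
# Assumed External table, keyed by ISO 10646 code for the output direction.
EXTERNAL_ENTITIES = {0x00E9: b"&eacute;", 0x00C9: b"&Eacute;"}


def write_codes(codes, use_entities=True):
    """Translate ISO 10646 codes into bytes of the output character set."""
    out = bytearray()
    for code in codes:
        if code in OUTPUT_TABLE:
            # The character exists in the output character set.
            out.append(OUTPUT_TABLE[code])
        elif use_entities and code in EXTERNAL_ENTITIES:
            # Fall back to the external entity for this character.
            out += EXTERNAL_ENTITIES[code]
        else:
            # Placeholder for this sketch only (behavior not given in source).
            out += b"?"
    return bytes(out)


result = write_codes([0x00E9, 0x74, 0x00E9])
```

The `use_entities` flag stands in for the user's ability to disable the entity fallback; it is not the name of an actual MtRecode option.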







Copyright © Centre National de la Recherche Scientifique, 1996.