MULTEXT/EAGLES - Document LSD 2. Part 1-1. Version 0.5. Last modified 28 April 1996.





GLOSIX Part 1-1. Characters







Overview

The character confusion

There is an enormous number of character sets, many of which are partially or wholly overlapping. There exists, for example, a long list of existing and developing ISO standards (see for example van Wingen, 1995); in addition, there is a vast proliferation of vendor-specific character sets. There are also multiple standard organizations, within each of which often multiple committees are devoted to character sets. In addition, each character set may have variants (e.g. the ISO 646 national variants) and the same character set can even have different names (e.g., iso-ir-6, ANSI X3.4-1986, ISO 646.IRV:1991, ASCII, ISO 646-US, US-ASCII).
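
As an illustration of the naming problem, the following minimal sketch asks a codec registry whether several of the names listed above denote the same character set. It assumes Python's standard `codecs` module; its alias table is an implementation detail of that library, not part of these recommendations, and the spellings used are those the table is assumed to accept.

```python
import codecs

# Several of the names under which the same 7-bit character set is known.
NAMES = ["iso-ir-6", "ANSI_X3.4-1986", "ISO646-US", "US-ASCII", "ASCII"]

def canonical_names(names):
    """Map each name to the canonical codec name the registry resolves it to."""
    return {name: codecs.lookup(name).name for name in names}

resolved = canonical_names(NAMES)
for alias, canonical in resolved.items():
    print(f"{alias:16} -> {canonical}")
```

All five names should resolve to a single codec, which is exactly the situation a user has no easy way of discovering from the names alone.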

The result is enormous confusion within the user community, and few instances where a clear choice among character sets is evident (to get the flavor of the confusion, see the attempt to list and categorize the various characters and character sets in RFC-1345).

To make matters worse, there is also confusion among concepts, most often between the notions of fonts and characters. For instance, on most hardware it is necessary to change fonts in order to change character sets (which is in fact a cause of the confusion in the minds of many users).

Fortunately, a new character set is under development, resulting from a merging of efforts within ISO and the Unicode consortium to develop a single, universal character set (UCS). A first part of this character set has been approved as ISO 10646. This standard encodes each character in four bytes.

However:

Although there is little doubt that this standard will eventually become the basis for character representation, its full specification and implementation are far enough away that, for present purposes, it is necessary to provide a temporary solution that is as compatible as possible with UCS, together with support and guidance for the eventual migration to UCS. In addition, Unicode has developed a working framework, including precise definitions of basic concepts, which we adopt here.

The character/glyph model

One of the most important contributions of Unicode's "Basic Principles" is a clear distinction between characters and glyphs, which we adopt:

A principal notion embodied by Unicode is the distinction between characters and glyphs. In loose terms, a glyph is a visual depiction of a character. On the other hand, a character has no inherent image. This distinction is in contrast to the common understanding which equates these two concepts. For example, a font is often described as being comprised of "characters." In contrast, according to the view adopted by Unicode, a font contains "glyphs," not "characters."
An important consequence of this conceptual distinction is that a character set should not encode glyphs, and, furthermore, that the number of extant glyphs is much larger than the number of extant characters which should be encoded.
In practice, the distinction between these two concepts is often difficult to maintain, if, for no other reason, because such a distinction has rarely been made in the past. The result of this historical confusion between the two concepts is that many existing character sets encode "glyphs," and, for reasons of backward compatibility, these "glyphs" now appear in Unicode as "characters." An example of such a glyph is the roman ligature glyphs 'ff' and 'fi'.
The conceptual problems raised by this distinction were recognized both by the designers of Unicode and by ISO in its development of ISO/IEC 10646 and other standards. In order to help clarify this area, ISO asked the US National Body to develop a draft Character/Glyph Operational Model.
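
The ligature case mentioned in the quotation can be observed directly. The sketch below is only an illustration: it assumes Python's standard `unicodedata` module and the compatibility decomposition (NFKD) defined by later versions of the Unicode standard, neither of which is part of the quoted text. It shows that the "fi" ligature is encoded as a character in its own right, yet decomposes into the two characters it depicts.

```python
import unicodedata

LIGATURE_FI = "\ufb01"  # U+FB01, a glyph encoded as a character for compatibility

def decompose_compatibility(text):
    """Replace compatibility characters (e.g. ligatures) by their plain equivalents."""
    return unicodedata.normalize("NFKD", text)

print(unicodedata.name(LIGATURE_FI))
print(decompose_compatibility("de" + LIGATURE_FI + "ne"))
```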

A detailed discussion of the model can be found in the above-mentioned document, from which we quote the following summary:


An ideal characterization of characters and glyphs and their relationship may be stated as follows:
1. A character conveys distinctions in meaning or sounds. A character has no intrinsic appearance.
2. A glyph conveys distinctions in form. A glyph has no intrinsic meaning.
3. One or more characters may be depicted by one or more glyph representations (instances of an abstract glyph) in a possibly context dependent fashion.
The relationship between character codes and glyph identifiers may be one-to-one, one-to-many, many-to-one, or many-to-many.

Definitions

This document uses the following set of definitions, drawn from the Unicode and Internationalization Glossary:

Character
(1) an element of a computer character set; (2) an element of an alphabet; (3) an element of the Han script.
Character Set
A collection of elements used to organize, control, or represent information. Such information can be classified as either formal, functional, or a combination of both form and function. Certain types of information are normally excluded from such representation; for example, directly perceived information such as pictures, sounds, texture, etc.; in contrast, the information which is represented by character sets can normally be said to be symbolic in nature.
Code Element
A unit of character encoding referring to both the numeric code value and the character which the code value represents.
Coded Character Set
A character set in which each character is assigned a numeric code value. Frequently abbreviated as character set when the context is sufficient to determine what is intended.
Font
A collection of glyphs used for the visual depiction of character data. A font is often associated with a set of parameters, e.g., size, posture, weight, serifness, etc., which, when set to particular values, generate a collection of imagable glyphs.
Glyph
An abstract form which represents one or more glyph images, and which is used to visually depict encoded character data. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing.
Script
A collection of symbols used to represent textual information in a writing system.
We will also use the following definitions:
Local character set
The character set used by a given application for the purposes of data creation, editing, processing, etc. Several local character sets can be in use on a given machine or system. The local character set is not necessarily the same as the character set that will be used for data distribution and/or interchange.
Blind interchange
Interchange of data where the parties have not necessarily made any prior agreement on data format and transfer protocol.
Trusted interchange
Interchange of data where the parties make a prior agreement on data format and transfer protocol.

Acronyms and abbreviations

UCS
The Universal Multiple-Octet Coded Character Set standard known as ISO/IEC 10646-1 and its extensions to come.
Unicode
Depending on the context, "Unicode" will refer to (1) the Unicode Consortium or (2) the 16-bit character set defined by "The Unicode Standard, Version 1.1". This character set is identical with the character repertoire and coding of the international standard ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2; Subset=300; Implementation Level=3.
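
To make the difference between the four-octet form of UCS and the 16-bit form (UCS-2) concrete, here is a small sketch. It relies on an assumption: Python offers no codec named UCS-2 or UCS-4, so the `utf-32-be` and `utf-16-be` codecs are used as stand-ins, which produce the same octets for characters of the basic multilingual plane. The function name `code_units` is ours.

```python
def code_units(char):
    """Return the 4-octet and 2-octet big-endian forms of one BMP character."""
    if len(char) != 1 or ord(char) > 0xFFFF:
        raise ValueError("expected a single character of the basic multilingual plane")
    four_octets = char.encode("utf-32-be")  # stand-in for the four-octet UCS form
    two_octets = char.encode("utf-16-be")   # stand-in for UCS-2 (BMP characters only)
    return four_octets, two_octets

ucs4, ucs2 = code_units("é")
print(ucs4.hex(), ucs2.hex())
```

In both forms the code value of "é" is the same number (E9 hexadecimal) as in ISO 8859-1; only the number of octets differs.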

Local character sets

General principles

UCS is a means by which a single character set will suffice to encode all the world's languages. However, at present very few programs and operating systems support the multi-byte code used by UCS. Only 8-bit character sets are generally supported.

Our recommendation has the merit of being reasonably compatible with UCS, thus facilitating future migration to that standard:

Limitations

These recommendations do not provide for Asian languages, including Chinese, Japanese, and Korean. Independent standards have been developed for these languages. [Help needed]

Base character sets

1. ISO 8859

Use the ISO 8859-X series for all the following scripts: Arabic, Cyrillic, Greek, Hebrew, Latin.

From "ISO 8859-1 National Character Set FAQ" (Gschwind, 1995):

ISO 8859-X character sets use the characters 0xa0 through 0xff to represent national characters, while the characters in the 0x20-0x7f range are those used in the US-ASCII (ISO 646) character set. Thus, ASCII text is a proper subset of all ISO 8859-X character sets.
The characters 0x80 through 0x9f are earmarked as extended control characters, and are not used for encoding characters. These characters are not currently used to specify anything. A practical reason for this is interoperability with 7 bit devices (or when the 8th bit gets stripped by faulty software). Devices would then interpret the character as some control character and put the device in an undefined state. (When the 8th bit gets stripped from the characters at 0xa0 to 0xff, a wrong character is represented, but this cannot change the state of a terminal or other device.)
This character set is also used by AmigaDOS, MS-Windows, VMS (DEC MCS is practically equivalent to ISO 8859-1) and (practically all) UNIX implementations.
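
The layout described in the quotation can be sketched in a few lines. This is an illustration only: the function names are ours, and the region boundaries simply follow the ranges given above (with the usual C0/C1 labels for the two control areas).

```python
def classify(byte):
    """Name the region of an ISO 8859-X code table that a byte value falls in."""
    if byte < 0x20:
        return "control (C0)"
    if byte <= 0x7F:
        return "US-ASCII range"
    if byte <= 0x9F:
        return "extended control (C1)"
    return "national characters"

def strip_eighth_bit(data):
    """Simulate faulty 7-bit software that clears the top bit of every byte."""
    return bytes(b & 0x7F for b in data)

original = "é".encode("iso8859_1")    # one byte, 0xE9
damaged = strip_eighth_bit(original)  # 0x69, which is ASCII "i"
print(classify(original[0]), "->", classify(damaged[0]), repr(damaged.decode("ascii")))
```

As the quotation states, stripping the eighth bit from a national character yields a wrong but harmless printable character, never a control code.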

The following is a rough list of the languages accommodated in the ISO 8859 series. See also the graphic representation of the code tables.

ISO-8859-1 - Latin 1
Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

ISO-8859-2 - Latin 2
Latin-written Slavic and Central European languages: Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovene.

ISO-8859-3 - Latin 3
Esperanto, Galician, Maltese, and Turkish.

ISO-8859-4 - Latin 4
Scandinavia/Baltic (mostly covered by 8859-1 also): Estonian, Latvian, and Lithuanian. It is an incomplete predecessor of Latin 6.

ISO-8859-5 - Cyrillic
Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian.

ISO-8859-6 - Arabic
Non-accented Arabic.

ISO-8859-7 - Modern Greek
Greek.

ISO-8859-8 - Hebrew
Non-accented Hebrew.

ISO-8859-9 - Latin 5
Same as 8859-1, except that Turkish letters replace the Icelandic ones.

ISO-8859-10 - Latin 6
For Lappish/Nordic/Eskimo languages. Adds the last Inuit (Greenlandic) and Sami (Lappish) letters that were missing in Latin 4, so as to cover the entire Nordic area.

An exact list of languages accommodated by each character set would be useful [Help needed]. A list of characters used by a large number of languages is provided in "Characters and character sets for various languages" (Alvestrand, 1995).
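
Pending such a list, differences between two parts of the series can at least be checked mechanically. The sketch below compares Latin 1 and Latin 5 position by position; it assumes that the `iso8859_1` and `iso8859_9` codecs shipped with Python are faithful to the standards, and the function name is ours.

```python
def differing_positions(codec_a, codec_b):
    """List (byte, char_a, char_b) where two single-byte codecs disagree in 0xA0-0xFF."""
    diffs = []
    for byte in range(0xA0, 0x100):
        raw = bytes([byte])
        a = raw.decode(codec_a, errors="replace")
        b = raw.decode(codec_b, errors="replace")
        if a != b:
            diffs.append((byte, a, b))
    return diffs

latin1_vs_latin5 = differing_positions("iso8859_1", "iso8859_9")
for byte, a, b in latin1_vs_latin5:
    print(f"0x{byte:02X}: {a} (Latin 1) / {b} (Latin 5)")
```

The same function can be applied to any other pair of 8-bit sets, for example to see how far Latin 4 and Latin 1 overlap.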

Shortcomings

The ISO 8859 series lacks the ligatures for Dutch ij and French oe, the ,,German`` quotation marks, and several other characters. The rationale for omitting the ligatures is that they are not characters but typographical artifacts, each representing two characters (e.g., "oe" represented by [oe] in French). However, there is a strong counter-argument: because such ligatures are lexically distinct from the corresponding two-letter combinations (cf. "moelle" vs. "c[oe]ur"), a display engine cannot recover them automatically, and they must therefore be treated as characters in their own right. Note that, for apparently similar reasons, the ligature "æ" is (somewhat inconsistently) included in ISO 8859-1.

There are also Bulgarian and Ukrainian characters missing from ISO 8859-5.
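
Whether a given character is available in a base character set is easy to test. A minimal sketch, assuming Python's `iso8859_1` codec and a helper name of our own choosing:

```python
def encodable(char, codec="iso8859_1"):
    """Return True if the character has a code in the given character set."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# "æ" is in Latin 1; the "oe" ligature and the low German quotation mark are not.
for char in ("æ", "œ", "„"):
    print(repr(char), encodable(char))
```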

2. International Phonetic Alphabet

[THIS SECTION IS UNDER DEVELOPMENT]

SGML entities

No matter how large or comprehensive the character set, there may exist the need to represent additional characters. In such cases we recommend the use of ISO entities, if they exist. In the rare instance where no standard entity is pre-defined, a new one can be defined by the encoder to serve the purpose.

Note: SGML entities should NOT be used to replace characters of the base character set (this applies to local character sets only, not blind interchange).
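
The rule can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it uses the entity names of Python's `html.entities` table as a stand-in for the ISO entity sets (the HTML names were derived from them, but the two are not guaranteed to coincide), takes ISO 8859-1 as the base character set, and does not escape "&" characters already present in the text.

```python
from html.entities import codepoint2name

def to_base_set(text, codec="iso8859_1"):
    """Encode text in the base character set, using named entities only for
    characters that the base set cannot represent."""
    out = bytearray()
    for char in text:
        try:
            out += char.encode(codec)  # base characters are never replaced
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(char))
            if name is None:
                raise ValueError(f"no entity known for U+{ord(char):04X}")
            out += f"&{name};".encode("ascii")
    return bytes(out)

print(to_base_set("cœur élégant"))
```

Note that "é" remains a single 8-bit character, in line with the note above, while the ligature, which is absent from the base set, becomes an entity reference.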

Character set switching within SGML documents

When different character sets are mixed in a single document, three alternative methods can be used (possibly in conjunction): These implicit methods are useful when there is a systematic mapping between tags and character sets (e.g., a list of words in one character set, with their translations in another).

The MULTEXT/EAGLES Corpus Encoding Standard provides global lang and wsd attributes, as well as appropriate mechanisms to document, in the header, the correspondences between languages or tags and particular character sets.

Note that the language tagging mechanism will still be valid with UCS. "Unicode characters do not specify the language of the text they represent; that is, they are completely language neutral. If the language of a character or character string must be known to accomplish a particular type of process (e.g. language sensitive collation), then a higher-level protocol must be used to specify the language." [from Unicode's "Basic Principles"].

Character set switching other than within SGML documents

In instances other than SGML documents, we recommend using the ISO 2022 character set switching mechanism, i.e. the <SI> ("shift-in") and <SO> ("shift-out") control characters.
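
A reader of such a stream might be sketched as below. Several assumptions are made for the sake of illustration: the default set is US-ASCII, the alternate set is the right-hand half of ISO 8859-7 (Greek), and this pairing is hard-wired, whereas full ISO 2022 uses escape sequences to designate the sets. The function name is ours.

```python
SO = 0x0E  # shift-out: switch to the alternate set
SI = 0x0F  # shift-in: return to the default set

def decode_shifted(data, alternate_codec="iso8859_7"):
    """Decode a 7-bit stream in which SO/SI toggle between ASCII and the
    right-hand half of an 8-bit character set (an assumed pairing)."""
    shifted = False
    chars = []
    for byte in data:
        if byte == SO:
            shifted = True
        elif byte == SI:
            shifted = False
        elif shifted and 0x20 <= byte <= 0x7F:
            # Map the 7-bit code onto the upper half of the alternate set.
            chars.append(bytes([byte | 0x80]).decode(alternate_codec, errors="replace"))
        else:
            chars.append(bytes([byte]).decode("ascii", errors="replace"))
    return "".join(chars)

print(decode_shifted(b"abc \x0eabc\x0f abc"))
```

The same three bytes "abc" are read as Latin letters outside the shifted region and as Greek letters inside it.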


Character sets for interchange

Data interchange is becoming increasingly reliable, thanks to major international standardization efforts such as the Internet effort. For example, TCP/IP and many network applications (e.g., ftp, WWW) are "8-bit clean". In addition, recent standards have been proposed to guarantee delivery by automatically packing and unpacking data as required. Even where these standards are not yet implemented, files can be safely transferred by using universally available encoding programs such as `uuencode'.

Therefore, we recommend that all data be distributed using the recommendations above for character sets. In the case of blind interchange, data should be encoded using `uuencode'.
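
What this encoding achieves can be shown with a short sketch. It assumes Python's standard `binascii` module, whose `b2a_uu` function produces one body line of uuencoded output for at most 45 input bytes; the "begin" and "end" lines written by the `uuencode' program itself are omitted, and the helper names are ours.

```python
import binascii

def uu_lines(data):
    """Encode binary data as uuencode body lines (45 input bytes per line)."""
    return [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]

def uu_decode(lines):
    """Reverse uu_lines()."""
    return b"".join(binascii.a2b_uu(line) for line in lines)

payload = "Être ou ne pas être".encode("iso8859_1")  # contains 8-bit characters
encoded = uu_lines(payload)
print(encoded)
```

Every byte of the encoded lines is a 7-bit code, so the data survives a channel that is not 8-bit clean and can be restored exactly by the receiver.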

In the case where the receiver is not GLOSIX compliant, there remains the potential problem of different character sets at each end of the interchange transaction, even if transmission is 8-bit clean. In this case, files should be encoded in the recommended 8-bit formats, with the burden on the receiver to recode the files. Again, freely available programs such as GNU's `recode' can re-encode the data where necessary.
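
The receiver's task amounts to the following sketch. It is not the `recode' program itself: it assumes Python's built-in codecs, and the choice of code page 850 as the receiver's local character set is purely hypothetical.

```python
def recode(data, source, target):
    """Re-encode bytes from one character set to another; fails loudly if a
    character has no equivalent in the target set."""
    return data.decode(source).encode(target)

received = b"caf\xe9"  # "café" as distributed, in ISO 8859-1
local = recode(received, "iso8859_1", "cp850")
print(local)
```

Raising an error on characters that are missing from the target set is a deliberate choice here: a silent substitution would corrupt the data without any signal.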


Transliteration

Transliteration should not be used to replace appropriate character sets, either for local processing or interchange.

Transliteration schemes should be used only in the following instances:

We recommend the use of the character mnemonics from the ISO committee draft (CD) of the POSIX.2 standard, also listed in RFC-1345. There exist a number of other standard transliteration schemes; however, they have serious shortcomings, such as


String operations

In addition to specifying standard character sets, it is necessary to provide precise specifications for string operations in order to ensure portability and compatibility.

The variable behavior of string operations is a common source of non-portability. It is particularly problematic because in most cases there is no signal of an error; rather, the results of the same function differ from machine to machine. For example, it is not uncommon to see different results from a given sort operation when it is run on different systems, even when those systems use the same character set.
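
The sorting problem is easily reproduced. In the sketch below the same list is sorted twice over the same character data: once by raw code value, and once with a simplistic key of our own that drops diacritics and ignores case. This key is an assumption made for illustration; it is not the collation rule of any particular language or system.

```python
import unicodedata

def fold_key(word):
    """A simplistic collation key: drop diacritics, then ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

words = ["zebra", "Äpfel", "apple", "Zug"]
by_code_value = sorted(words)
by_folded_key = sorted(words, key=fold_key)
print(by_code_value)
print(by_folded_key)
```

Neither call reports an error, and both results are "sorted"; two systems that differ only in this respect will silently disagree.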

String operations include:

Recently developed standards, such as ISO C (ISO/IEC 9899:1990, also called "ANSI C") and POSIX (ISO/IEC 9945-2:1993), have improved the portability of string operations (see "GLOSIX Part 2-1. C"). However, there is no publicly available library that properly handles, in a parameterizable way, the functions cited above for the character sets and languages covered by the GLOSIX recommendations.

The development of GLOSIX specifications for string operations will require:


Migration towards UCS

[THIS SECTION IS UNDER DEVELOPMENT]

Standards

Recommended

ANSI X3.159-1989
American National Standard for Information Systems -- Programming Language -- C, American National Standards Institute.
ISO 639:1988
Code for the representation of names of languages
ISO 639-2:1995
Code for the representation of names of languages--Alpha-3 code
ISO 646.IRV:1991
Information Processing -- ISO 7-bit coded character set for information interchange [=ANSI X3.4-1986]
ISO 2022:1986
Information processing - ISO 7-bit and 8-bit coded character sets - Code extension techniques
ISO 3166:1993
Codes for the representation of names of countries
ISO-8859
Information Processing -- 8-bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2, ISO 8859-2, 1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988. Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5: Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6: Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7: Latin/Greek alphabet, ISO 8859-7, 1987. Part 8: Latin/Hebrew alphabet, ISO 8859-8, 1988. Part 9: Latin alphabet No. 5, ISO 8859-9, 1990.
ISO 8879:1986
Information Processing--Text and Office Systems--Standard Generalized Markup Language (SGML).
ISO 9899:1990
The C programming language, International Organization for Standardization.
[This is the international version of, and identical to, ANSI X3.159-1989]
ISO/IEC 9945-2:1993
POSIX Shell and Utilities
[Character mnemonics also available in RFC-1345].
ISO/IEC 10646-1
Information technology - Character sets and information coding -Universal multiple-octet coded character set - Part 1 - Architecture and basic multilingual plane
UNICODE 1.1
"The Unicode Standard, Version 1.1": Version 1.0, Volume 1 (ISBN 0-201-56788-1), Version 1.0, Volume 2 (ISBN 0-201-60845-6), and "Unicode Technical Report #4, The Unicode Standard, Version 1.1" (available from The Unicode Consortium, and soon to be published by Addison-Wesley).
[This character set is identical with the character repertoire and coding of the international standard ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2; Subset=300; Implementation Level=3.]

Other normative references


ISO 2375 registration
"International Register of Coded Character Sets to be Used With Escape Sequences", European Computer Manufacturers Association (ECMA), Rue du Rhone 114, CH-1204 Geneve, Switzerland, December 1990.
RFC-1345
Simonsen, K. (1992). Character Mnemonics & Character Sets, RFC 1345, Rationel Almen Planlaegning, June 1992.
<URL:ftp://ds.internic.net/rfc/rfc1345.txt>
RFC-1521
Borenstein N., and N. Freed, "MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies", RFC 1521, Bellcore, Innosoft, September 1993.
<URL:ftp://ds.internic.net/rfc/rfc1521.txt>
RFC-1522
Moore, K., "Representation of Non-Ascii Text in Internet Message Headers" RFC 1522, University of Tennessee, September 1993.
<URL:ftp://ds.internic.net/rfc/rfc1522.txt>
RFC-1642
Goldsmith, D., Davis, M. (1994) UTF-7 Encoding Form, RFC 1642, Taligent, Inc., July 1994.
<URL:ftp://ds.internic.net/rfc/rfc1642.txt>
<URL:http://www.stonehand.com/unicode/standard/utf7.html>
ISO/IEC 6429:1992
Information processing -- Control functions for 7-bit and 8-bit coded character sets
ISO 9:1986
Transliteration of Slavic Cyrillic characters into Latin characters
ISO 233:1984
Transliteration of Arabic characters into Latin characters
ISO 233-2:1993
Transliteration of Arabic characters into Latin characters--Part 2: simplified transliteration
ISO 259:1984
Transliteration of Hebrew characters into Latin characters
ISO R 843:1968
Transliteration of Greek characters into Latin characters
ISO 3602:1989
Romanization of Japanese (kana script)
ISO 7098:1991
Romanization of Chinese

General bibliography

This list does not include standards (see Standards above).


Alvestrand, H. T. (1995). Characters and character sets for various languages

<URL:http://domen.uninett.no/~hta/ietf/lang-chars.txt>

"There is a need to have a source of information about the characters that are used in various languages. No such information is currently readily available on the net. This document attempts to fill that void."


Connolly, D. (1995) Character Set Considered Harmful, Internet-draft [draft-ietf-html-charset-harmful-00.txt], Internet Engineering Task Force (IETF), HTML working group (HTML-WG).

<URL:http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html>


Czyborra, R. (1994). The ISO 8859 series.

<URL:http://www.cs.tu-berlin.de/user/czyborra/charsets/>


Gschwind, M. K. (1994). ISO 8859-1 National Character Set FAQ

<URL:ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/FAQ-ISO-8859-1>


Gschwind, M. K. (1994). More about Internationalization

<URL:http://www.vlsivie.tuwien.ac.at/mike/i18n.html>


Gschwind, M. K. (1994). Programming for Internationalization

<URL:ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/i18n-programming>


Ide, N., Véronis, J. (Eds.) (1995). The Text Encoding Initiative: background and context. Kluwer Academic Publishers, Dordrecht, 342p. [reprinted from Triple special issue of Computers and the Humanities, 29, no 1/2/3, with an original bibliography]

Kuhn, M. (1994). Frequently Asked Questions about International Standards with Answers.

FAQ posted monthly to the USENET groups comp.protocols.iso, comp.std.misc and comp.std.internat, and archived on many FAQ servers, such as:

<URL:ftp://ftp.inria.fr/faq/comp.std.internat/Standards_FAQ>


Severance, Ch. (1995). Overview of the IEEE P1003.0 Guide to the POSIX Open Software Environment

<URL:http://wxweb.msu.edu/~crs/posix/p10030/intro.htm>


Sperberg-McQueen, C.M., Burnard, L. (Eds.) (1994). Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative, Chicago and Oxford.

<URL:http://etext.virginia.edu/TEI.html>


Unicode (1994). The Unicode Standard

<URL:http://www.stonehand.com/unicode/standard.html>


Unicode (1994). Frequently asked questions [about Unicode]

<URL:http://www.stonehand.com/unicode/faq.html>


Unicode (1994). Basic Principles

<URL:http://www.stonehand.com/unicode/standard/principles.html>


Unicode (1994). Character/Glyph Operational Model.

<URL:http://www.stonehand.com/unicode/standard/cgmodel.html>


Unicode (1994). Language Coding Using ISO/IEC 6429

<URL:http://www.stonehand.com/unicode/standard/tc304.html>



Copyright (c) Centre National de la Recherche Scientifique, 1995-1996.