MULTEXT/EAGLES -
Document LSD 2. Part 1-1. Version 0.5. Last modified 28 April 1996.
GLOSIX Part 1-1. Characters
|
Contents
| Back to LSD2 Table of Contents
|
There is an enormous number of character sets, many of which are partially or
wholly overlapping. There exists, for example, a long list of existing and
developing ISO standards (see for example van Wingen, 1995); in addition, there
is a vast proliferation of vendor-specific character sets. There are also
multiple standards organizations, many of which have several committees
devoted to character sets. In addition, each character set may have
variants (e.g. the ISO 646 national variants) and the same character set can
even have different names (e.g., iso-ir-6, ANSI X3.4-1986, ISO 646.IRV:1991,
ASCII, ISO 646-US, US-ASCII).
The result is enormous confusion within the user community, and few instances
where a clear choice among character sets is evident (to get the flavor of the
confusion, see the attempt to list and categorize the various characters and
character sets in
RFC-1345).
To make matters worse, there is also confusion among concepts, most often
between the notions of fonts and characters. For instance, on most
hardware it is necessary to change fonts in order to change character sets
(which is in fact a cause of the confusion in the minds of many users).
Fortunately, a new character set is under development, resulting from a merging
of efforts within ISO and the
Unicode
consortium to develop a single, universal character set (UCS). A first part
of this character set has been approved as ISO 10646. This standard encodes
each character in four bytes.
However:
- controversy still exists over some details of the scheme;
- the standard is not complete;
- some languages are not yet covered;
- the standard is not yet supported in practice via appropriate
software.
Although there is little doubt that this standard will eventually
become the basis for character representation, its full specification and
implementation are far enough away that, for present purposes, it is necessary
to provide a temporary solution that is as compatible as possible with UCS,
together with support and guidance for the eventual migration to UCS. In
addition, Unicode has developed a working framework, including precise
definitions for basic concepts, which we adopt here.
One of the most important contributions of Unicode's
"Basic Principles" is a clear distinction between characters and
glyphs, which we adopt:
A principal notion embodied by Unicode is the distinction between
characters and glyphs. In loose terms, a glyph is a visual
depiction of a character. On the other hand, a character has no inherent image.
This distinction is in contrast to the common understanding which equates these
two concepts. For example, a font is often described as being comprised of
"characters." In contrast, according to the view adopted by Unicode, a font
contains "glyphs," not "characters."
An important consequence of this conceptual distinction is that a
character set should not encode glyphs, and, furthermore, that the number of
extant glyphs is much larger than the number of extant characters which should
be encoded.
In practice, the distinction between these two concepts is often
difficult to maintain, if, for no other reason, because such a distinction has
rarely been made in the past. The result of this historical confusion between
the two concepts is that many existing character sets encode "glyphs," and, for
reasons of backward compatibility, these "glyphs" now appear in Unicode as
"characters." An example of such a glyph is the roman ligature glyphs 'ff' and
'fi'.
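As an illustration only (it is not part of the Unicode text quoted above), the following Python sketch uses the standard unicodedata module to show that the 'fi' ligature is present in Unicode as a character in its own right, and that compatibility normalization maps it back to the two characters it depicts. The choice of Python and of the NFKC normalization form is our assumption; the quoted principles prescribe neither.

```python
import unicodedata

# The 'fi' ligature, a glyph that is encoded as a single character.
LIGATURE_FI = "\ufb01"


def expand_compatibility_glyphs(text):
    """Replace glyph-like compatibility characters by the characters they depict."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(unicodedata.name(LIGATURE_FI))
    print(expand_compatibility_glyphs("de" + LIGATURE_FI + "ne"))
```

A rendering engine is free to go the other way and select a ligature glyph for the character sequence "f", "i" without any ligature character appearing in the data.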
The conceptual problems raised by this distinction were recognized
both by the designers of Unicode and by ISO in its development of ISO/IEC 10646
and other standards. In order to help clarify this area, ISO asked the US
National Body to develop a draft
Character/Glyph
Operational Model.
A detailed discussion of the model can be found in the above-mentioned
document, from which we quote the following summary:
An ideal characterization of characters and glyphs and their
relationship may be stated as follows:
1. A character conveys distinctions in meaning or sounds. A
character has no intrinsic appearance.
2. A glyph conveys distinctions in form. A glyph has no intrinsic
meaning.
3. One or more characters may be depicted by one or more glyph
representations (instances of an abstract glyph) in a possibly context
dependent fashion.
The relationship between character codes and glyph identifiers may
be one-to-one, one-to-many, many-to-one, or many-to-many.
This document uses the following set of definitions, drawn from the
Unicode
and Internationalization Glossary:
- Character
- (1) an element of a computer character set; (2) an element of an
alphabet; (3) an element of the Han script.
- Character Set
- A collection of elements used to organize, control, or represent
information. Such information can be classified as either formal, functional,
or a combination of both form and function. Certain types of information are
normally excluded from such representation; for example, directly perceived
information such as pictures, sounds, texture, etc.; in contrast, the
information which is represented by character sets can normally be said to be
symbolic in nature.
- Code Element
- A unit of character encoding referring to both the numeric code value
and the character which the code value represents.
- Coded Character Set
- A character set in which each character is assigned a numeric code
value. Frequently abbreviated as character set when the context is sufficient
to determine what is intended.
- Font
- A collection of glyphs used for the visual depiction of character data.
A font is often associated with a set of parameters, e.g., size, posture,
weight, serifness, etc., which, when set to particular values, generate a
collection of imagable glyphs.
- Glyph
- An abstract form which represents one or more glyph images, and which is
used to visually depict encoded character data. In displaying Unicode character
data, one or more glyphs may be selected to depict a particular character.
These glyphs are selected by a rendering engine during composition and layout
processing.
- Script
- A collection of symbols used to represent textual information in a writing
system.
We will also use the following definitions:
- Local character set
- The character set used by a given application for the purposes of data
creation, editing, processing, etc. Several local character sets can be in use
on a given machine or system. The local character set is not necessarily the
same as the character set that will be used for data distribution and/or
interchange.
- Blind interchange
- Interchange of data where the parties have not necessarily made any prior
agreement on data format and transfer protocol.
- Trusted interchange
- Interchange of data where the parties make a prior agreement on data format
and transfer protocol.
- UCS
- The Universal Multiple-Octet Coded Character Set standard known as ISO/IEC
10646-1 and its extensions to come.
- Unicode
- Depending on the context, "Unicode" will refer to (1) the Unicode
Consortium or (2) the 16-bit character set defined by "The Unicode Standard,
Version 1.1". This character set is identical with the character repertoire and
coding of the international standard ISO/IEC 10646-1:1993(E); Coded
Representation Form=UCS-2; Subset=300; Implementation Level=3.
UCS is a means by which a single character set will suffice to encode all the
world's languages. However, at present very few programs and operating systems
support the multi-byte code used by UCS. Only 8-bit character sets are
generally supported.
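The difference between an 8-bit character set and the multi-octet forms of UCS can be made concrete with a small Python sketch. It is a sketch under stated assumptions: Python's "utf-16-be" and "utf-32-be" codecs are used here as stand-ins for the two-octet and four-octet forms of UCS (they coincide with them for the characters shown), and the helper name code_values is ours.

```python
def code_values(text, encoding, width):
    """Encode `text`, then read the result back `width` octets at a time (big-endian)."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


word = "\u00e9t\u00e9"  # the French word "été"

latin1 = code_values(word, "iso-8859-1", 1)   # one octet per character
two_octet = code_values(word, "utf-16-be", 2)  # two octets per character
four_octet = code_values(word, "utf-32-be", 4)  # four octets per character

if __name__ == "__main__":
    for label, values in (("8-bit", latin1), ("2-octet", two_octet), ("4-octet", four_octet)):
        print(label, [hex(v) for v in values])
```

The code values are the same in all three cases; only the number of octets per character differs, which is what makes ISO 8859-1 data straightforward to migrate.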
Our recommendation has the merit of being reasonably compatible with UCS, thus
facilitating future migration to that standard:
- for each application, choose a base character set among the recommended
character sets.
- for occasional characters used within the application which are not part
of the base character set, use appropriate SGML entities
- in cases where there is considerable mixture of character sets (e.g.
heavily multilingual applications), observe the following conventions:
- within SGML documents, use the lang attribute to mark the shifts
- for all other cases, use the ISO 2022 mechanisms to mark character set
shifts.
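The first recommendation, choosing a base character set, can be checked mechanically against sample data. The sketch below is one possible approach and not a GLOSIX tool: it assumes that the candidate sets are parts 1 to 10 of ISO 8859 as implemented by Python's codecs, and the function name candidate_base_sets is hypothetical.

```python
# Candidate base character sets: ISO 8859 parts 1 to 10 (an assumption of this sketch).
ISO_8859_PARTS = ["iso-8859-%d" % n for n in range(1, 11)]


def candidate_base_sets(text, candidates=ISO_8859_PARTS):
    """Return the candidate character sets that can encode every character of `text`."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this set
        usable.append(name)
    return usable


if __name__ == "__main__":
    for sample in ("fran\u00e7ais", "\u0141\u00f3d\u017a", "\u043c\u0438\u0440", "c\u0153ur"):
        print(sample, candidate_base_sets(sample))
```

An empty result signals that no single candidate covers the data, in which case the occasional missing characters have to be handled by entities or by character set shifts.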
These recommendations do not provide for Asian languages, including Chinese,
Japanese, and Korean. Independent standards have been developed for these
languages.
[Help needed]
Use the ISO 8859-X series for all the following scripts: Arabic, Cyrillic,
Greek, Hebrew, Latin.
From "ISO 8859-1 National Character Set FAQ" (Gschwind, 1995):
ISO 8859-X character sets use the characters 0xa0 through 0xff to
represent national
characters, while the characters in the 0x20-0x7f range are those used in the
US-ASCII (ISO 646) character set. Thus, ASCII text is a proper subset of all
ISO 8859-X character sets.
The characters 0x80 through 0x9f are earmarked as extended control
characters, and are not used for encoding characters. These characters are not
currently used to specify anything. A practical reason for this is
interoperability with 7 bit devices (or when the 8th bit gets stripped by
faulty software). Devices would then interpret the character as some control
character and put the device in an undefined state. (When the 8th bit gets
stripped from the characters at 0xa0 to 0xff, a wrong character is represented,
but this cannot change the state of a terminal or other device.)
This character set is also used by AmigaDOS, MS-Windows, VMS (DEC
MCS is practically equivalent to ISO 8859-1) and (practically all) UNIX
implementations.
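The layout described in this quotation, and the effect of a stripped 8th bit, can be sketched in a few lines of Python. The function names are ours and the three range labels simply restate the quotation; nothing here is taken from the FAQ itself.

```python
def classify_octet(octet):
    """Name the region of an ISO 8859-X code table that `octet` falls in."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("not an octet: %r" % (octet,))
    if octet < 0x80:
        return "US-ASCII range"
    if octet < 0xA0:
        return "extended control"
    return "national character"


def strip_eighth_bit(octet):
    """Simulate a 7-bit device or faulty software that drops the high bit."""
    return octet & 0x7F


if __name__ == "__main__":
    e_acute = "\u00e9".encode("iso-8859-1")[0]  # 0xE9
    print(classify_octet(e_acute), "->", chr(strip_eighth_bit(e_acute)))
    print(classify_octet(0x9B), "->", hex(strip_eighth_bit(0x9B)))
```

A stripped national character becomes a wrong but harmless printable character, whereas a stripped octet from the 0x80-0x9f range lands among the control characters below 0x20, which is the situation the reserved range is meant to avoid.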
The following is a rough list of the languages accommodated in the ISO 8859
series. See also the graphic representation of the code tables.
- ISO-8859-1 - Latin 1
- Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch,
English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish,
Italian, Norwegian, Portuguese, Spanish and Swedish.
- ISO-8859-2 - Latin 2
- Latin-written Slavic and Central European languages: Czech, German,
Hungarian, Polish, Romanian, Croatian, Slovak, Slovene.
- ISO-8859-3 - Latin 3
- Esperanto, Galician, Maltese, and Turkish.
- ISO-8859-4 - Latin 4
- Scandinavia/Baltic (mostly covered by 8859-1 also): Estonian, Latvian, and
Lithuanian. It is an incomplete predecessor of Latin 6.
- ISO-8859-5 - Cyrillic
- Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian.
- ISO-8859-6 - Arabic
- Non-accented Arabic.
- ISO-8859-7 - Modern Greek
- Greek.
- ISO-8859-8 - Hebrew
- Non-accented Hebrew.
- ISO-8859-9 - Latin 5
- Same as 8859-1 except for Turkish instead of Icelandic
- ISO-8859-10 - Latin 6
- Latin6, for Lappish/Nordic/Eskimo languages: Adds the last Inuit
(Greenlandic) and Sami (Lappish) letters that were missing in Latin 4 to cover
the entire Nordic area.
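Because all parts of the series share the same 8-bit code space, an octet above 0x9f has no meaning until the part is known. The following Python sketch, whose function name and choice of parts are ours, decodes one octet under several parts using Python's own codec tables.

```python
import unicodedata


def interpretations(octet, parts=(1, 2, 5, 7)):
    """Decode a single octet under several parts of ISO 8859."""
    seen = {}
    for part in parts:
        name = "iso-8859-%d" % part
        # errors="replace" yields U+FFFD for positions a part leaves undefined
        seen[name] = bytes([octet]).decode(name, errors="replace")
    return seen


if __name__ == "__main__":
    for name, char in interpretations(0xE9).items():
        print(name, char, unicodedata.name(char))
```

Octets below 0x80 decode identically under every part, which is why only the upper half of the table needs a character set declaration.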
An exact list of the languages accommodated by each character set would be
useful [Help needed]. A list of characters used by a large number of languages
is provided in "Characters and character sets for various languages"
(Alvestrand, 1995).
Shortcomings
The ISO 8859 series lacks the ligatures Dutch ij, French oe and German
low-high quotation marks, as well as several other characters. The rationale
for not representing the various ligatures is that they are not characters but
rather typographical artifacts representing two characters (e.g., "oe"
represented by [oe] in French). However, there is a strong counter-argument
that because such ligatures are lexically differentiated from the corresponding two letter
combinations (cf. "moelle" vs. "c[oe]ur"), they cannot be recovered automatically by a display engine and
therefore must be treated as characters in their own right. Note that for
apparently similar reasons, the ligature "æ" is (somewhat inconsistently)
included in ISO 8859-1.
There are also Bulgarian and Ukrainian characters missing from ISO 8859-5.
[THIS SECTION IS UNDER DEVELOPMENT]
No matter how large or comprehensive the character set, there may exist the
need to represent additional characters. In such cases we recommend the use of
ISO
entities, if they exist. In the rare instance where no standard entity is
pre-defined, a new one can be defined by the encoder to serve the purpose.
Note: SGML entities should NOT be used to replace characters of the base
character set (this applies to local character sets only, not blind
interchange).
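A minimal sketch of this recommendation in Python follows. It assumes that the entity names in Python's html.entities table (which for the characters concerned follow the ISO entity sets) are an acceptable stand-in for the ISO entities. The function name encode_with_entities is hypothetical, and a literal "&" in the input is not escaped.

```python
from html.entities import codepoint2name


def encode_with_entities(text, base="iso-8859-1"):
    """Encode `text` in the base set, using entity references only where needed."""
    pieces = []
    for char in text:
        try:
            char.encode(base)
            pieces.append(char)  # part of the base set: never replaced by an entity
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(char))
            if name is None:
                # No predefined entity: the encoder would have to define one.
                raise ValueError("no predefined entity for U+%04X" % ord(char))
            pieces.append("&%s;" % name)
    return "".join(pieces).encode(base)


if __name__ == "__main__":
    print(encode_with_entities("c\u0153ur"))
    print(encode_with_entities("\u00e9t\u00e9"))
```

Characters of the base character set pass through unchanged, in line with the note above; only characters outside it are written as entity references.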
When different character sets are mixed in a single document, three alternative
methods can be used (possibly in conjunction):
- Explicitly
- A wsd attribute can be used on any tag to indicate that the tag's
content is encoded in the specified character set. The value of the attribute
is the character set name (ISO-8859-1, etc.). WSD stands for "writing system
declaration", borrowed from the
TEI
terminology.
- A lang attribute can be used on any tag to indicate that the tag's
content is in the specified language. This method assumes a mapping between
languages and character sets. The value of the attribute is composed of one of
the following (the two letter-code version is recommended by the
"HyperText
Markup Language Specification Version 3.0")
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk"
or "eng.uk" for English as spoken in the United Kingdom).
- Implicitly
- All instances of a given element can be associated with a particular
character set.
- All instances of a given element can be associated with a particular
language, which is in turn associated with a particular character
set
These implicit methods are useful when there is a systematic
mapping between tags and character sets (e.g., a list of words in one character
set, with their translations in another).
The MULTEXT/EAGLES Corpus Encoding Standard provides global lang and wsd
attributes, as well as appropriate mechanisms to document, in the header, the
correspondences between languages or tags and particular character sets.
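The mapping from a lang value to a character set can be sketched as a simple table lookup. The table below is a hypothetical excerpt that we derived from the language list given earlier; it is not the mapping defined by the Corpus Encoding Standard, and the function names are ours. The "." separator follows the examples above.

```python
# Hypothetical excerpt of a language-to-character-set table.
LANGUAGE_TO_CHARSET = {
    "en": "iso-8859-1", "eng": "iso-8859-1",
    "fr": "iso-8859-1",
    "pl": "iso-8859-2",
    "ru": "iso-8859-5",
    "el": "iso-8859-7",
}


def parse_lang(value):
    """Split a lang value such as "en.uk" into (language, country or None)."""
    language, _, country = value.partition(".")
    return language.lower(), (country.lower() or None)


def charset_for(value):
    """Return the character set associated with a lang value, or None if unknown."""
    language, _country = parse_lang(value)
    return LANGUAGE_TO_CHARSET.get(language)


if __name__ == "__main__":
    for value in ("en.uk", "eng.uk", "ru", "xx"):
        print(value, parse_lang(value), charset_for(value))
```

Returning None for an unknown language, rather than a default, keeps a missing table entry visible to the encoder.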
Note that the language tagging mechanism will still be valid with UCS. "Unicode
characters do not specify the language of the text they represent; that is,
they are completely language neutral. If the language of a character or
character string must be known to accomplish a particular type of process (e.g.
language sensitive collation), then a higher-level protocol must be used to
specify the language." [from Unicode's
"Basic
Principles"].
In instances other than SGML documents, we recommend using the ISO 2022
character switching mechanisms, namely the <SI> ("shift-in") and <SO>
("shift-out") control characters.
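The shift mechanism can be sketched as a small state machine in Python. This is a deliberately reduced model: ISO 2022 designates the alternate set with escape sequences, which are omitted here, and we simply assume that the alternate set is the upper half of ISO 8859-7. The function name decode_shifted and both defaults are hypothetical.

```python
SO = 0x0E  # shift-out: following octets come from the alternate set
SI = 0x0F  # shift-in: back to the default set


def decode_shifted(data, default_set="ascii", alternate_set="iso-8859-7"):
    """Decode a 7-bit octet stream in which SO/SI switch between two sets."""
    out = []
    shifted = False
    for octet in data:
        if octet == SO:
            shifted = True
        elif octet == SI:
            shifted = False
        elif shifted and 0x21 <= octet <= 0x7E:
            # In the shifted state a 7-bit octet stands for the position
            # with the 8th bit set in the alternate set.
            out.append(bytes([octet | 0x80]).decode(alternate_set))
        else:
            out.append(bytes([octet]).decode(default_set))
    return "".join(out)


if __name__ == "__main__":
    print(decode_shifted(b"abc \x0e\x61\x62\x63\x0f!"))
```

The same octets 0x61 to 0x63 are read as Latin letters before the shift and as Greek letters inside it, so a decoder that loses track of the shift state silently produces wrong text.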
Data interchange is becoming increasingly reliable, due to major international
efforts towards standardization such as the Internet effort. For example,
TCP/IP and many network applications (e.g., ftp, WWW, etc.) are "8-bit clean".
In addition, recent standards have been proposed to guarantee delivery by
automatically packing and unpacking data as required (e.g., MIME; see RFC-1521
and RFC-1522). Even when such standards are not yet implemented, files can be
safely transferred by using universally available encoding programs such as
`uuencode'.
Therefore, we recommend that all data be distributed using the recommendations
above for character sets. In the case of blind interchange, data should be
encoded using `uuencode'.
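To show why this encoding survives channels that are not 8-bit clean, the following Python sketch builds a uuencoded text with the standard binascii module and decodes it again. It is an illustration and not a replacement for the `uuencode' program: the helper names, the default file name and the default mode are our assumptions, and the decoder trusts its input.

```python
import binascii


def uuencode(data, name="data.bin", mode=0o644):
    """Return `data` as uuencoded text made only of 7-bit printable characters."""
    lines = ["begin %o %s" % (mode, name)]
    for start in range(0, len(data), 45):  # at most 45 octets per line
        chunk = binascii.b2a_uu(data[start:start + 45], backtick=True)
        lines.append(chunk.decode("ascii").rstrip("\n"))
    lines.append("`")  # zero-length line marks the end of the data
    lines.append("end")
    return "\n".join(lines) + "\n"


def uudecode(text):
    """Recover the octets from text produced by uuencode() above."""
    body = text.splitlines()[1:-1]  # drop the "begin" and "end" lines
    return b"".join(binascii.a2b_uu(line) for line in body)


if __name__ == "__main__":
    payload = "d\u00e9j\u00e0 vu".encode("iso-8859-1")
    packed = uuencode(payload, name="sample.txt")
    print(packed)
    print(uudecode(packed) == payload)
```

The character set of the payload is not recorded in the encoded text, so the recommendations above on declaring the character set still apply.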
In the instance where the receiver is not GLOSIX compliant, there remains the
potential problem of different character sets at each end of the interchange
transaction, even if transmission is 8-bit clean. In this case, files should be
encoded in the recommended 8-bit formats, with the burden on the receiver to
recode the files. Again, there are freely available programs such as GNU's
`recode' to accomplish the re-encoding of the data where necessary.
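The re-encoding step itself is simple once both character sets are known. The Python sketch below is not the GNU program; it only shows the principle using Python's built-in codecs, and the function name recode is reused here purely for readability.

```python
def recode(data, source, target, errors="strict"):
    """Re-encode octets from the `source` character set into the `target` one."""
    return data.decode(source).encode(target, errors)


if __name__ == "__main__":
    received = b"caf\xe9"  # "café" in ISO 8859-1
    print(recode(received, "iso-8859-1", "utf-8"))
    print(recode(received, "iso-8859-1", "iso-8859-2"))
    try:
        recode(b"\xf0", "iso-8859-1", "iso-8859-2")  # eth has no Latin 2 position
    except UnicodeEncodeError as error:
        print("cannot recode:", error.reason)
```

With the default strict error handling, a character that has no position in the target set raises an error instead of being silently replaced, which is the safer behavior for interchange.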
Transliteration should not be used to replace appropriate character
sets, either for local processing or interchange.
Transliteration schemes should be used only in the following instances:
- the display of characters on poor equipment;
- the input of characters from a standard keyboard.
We recommend the use of the character mnemonics from the ISO committee draft
(CD) of the POSIX.2 standard, also listed in
RFC-1345.
There exist a number of other standard transliteration schemes; however, they
have serious shortcomings, such as
- the use of non ASCII characters (e.g., superscript ligatures in ISO 9 for
Cyrillic);
- non-reversibility.
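The two properties at stake, pure ASCII output and reversibility, can be illustrated with a very small mnemonic table in Python. The five mnemonics below are a tiny excerpt given from memory of RFC-1345 and should be checked against the RFC before use. The "&" introducer, the doubling of a literal "&" and the function names are assumptions of this sketch.

```python
INTRO = "&"  # introducer placed before each two-character mnemonic (assumed)

# Tiny illustrative excerpt of a mnemonic table; verify against RFC 1345.
MNEMONICS = {
    "\u00e9": "e'",  # e with acute
    "\u00e8": "e!",  # e with grave
    "\u00e0": "a!",  # a with grave
    "\u00e7": "c,",  # c with cedilla
    "\u00fc": "u:",  # u with diaeresis
}
CHARACTERS = {mnemonic: char for char, mnemonic in MNEMONICS.items()}


def to_mnemonics(text):
    """Rewrite `text` using ASCII characters only."""
    out = []
    for char in text:
        if char == INTRO:
            out.append(INTRO + INTRO)  # literal introducer is doubled (assumed)
        elif char in MNEMONICS:
            out.append(INTRO + MNEMONICS[char])
        elif ord(char) < 0x80:
            out.append(char)
        else:
            raise ValueError("no mnemonic in this excerpt for %r" % char)
    return "".join(out)


def from_mnemonics(text):
    """Invert to_mnemonics()."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != INTRO:
            out.append(text[i])
            i += 1
        elif text[i + 1:i + 2] == INTRO:
            out.append(INTRO)
            i += 2
        else:
            out.append(CHARACTERS[text[i + 1:i + 3]])
            i += 3
    return "".join(out)


if __name__ == "__main__":
    original = "gar\u00e7on & caf\u00e9"
    ascii_form = to_mnemonics(original)
    print(ascii_form)
    print(from_mnemonics(ascii_form) == original)
```

Because every mnemonic is introduced explicitly, the ASCII form can be converted back without loss, which is the property the other schemes mentioned above lack.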
In addition to specifying standard character sets, it is necessary to provide
precise specifications for string operations in order to ensure portability and
compatibility.
The variable behavior of string operations is a common source of
non-portability. It is particularly problematic because in most cases, there is
no signal of an error, but rather, the results from the same function are
different on different machines. For example, it is not uncommon to see
different results from a given sort operation when it is run on different
systems, even when those systems use the same character set.
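The following Python sketch reproduces the effect on a single machine: the same three words, in the same character set, come out in a different order depending on which comparison rule the sort uses. Both rules are our own examples of behaviors a system might implement; neither is prescribed by GLOSIX.

```python
import unicodedata


def fold(word):
    """Comparison key that ignores case and diacritics."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()


words = ["Zebra", "apple", "\u00c9clair"]

by_code_value = sorted(words)            # compares raw code values
by_folded_key = sorted(words, key=fold)  # case- and accent-insensitive

if __name__ == "__main__":
    print(by_code_value)
    print(by_folded_key)
```

Neither call reports an error, which is exactly why such differences tend to go unnoticed until results from two systems are compared.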
String operations include:
- Equality and equivalence: there is a need to distinguish
- string equality: "oe" != "&oelig;"
- string equivalence: "oe" ~ "&oelig;"; and also "oe" ~ "[oe]".
- Lexicographic order: This is a difficult area because even with the
same character set, the lexicographic order could be different for different
languages (e.g. "ch" in Spanish is treated as a single character, falling in
the alphabetic sequence between "c" and "d"). Even for the same language,
lexicographic order may be different for different applications (e.g., the
order could be case sensitive or not, accent sensitive or not, space sensitive
or not, etc.).
- Character categorization: classical functions such as
`isupper', etc. have to be extended to work for all characters in all
languages; in addition, new functions need to be added to the standard set,
such as a function to test if a character includes a diacritic etc.
- Conversions: The argument here is the same as for character
categorization. This can also be language and application dependent, for
example, applying `toupper' to "é" can yield either "E" or
"É". In addition, there is a need for new functions such as
`removediacritic' to change, for example, "é" to
"e".
Recently developed standards, such as ISO C (ISO/IEC 9899:1990, also
called "ANSI C") and POSIX (ISO/IEC 9945-2:1993), have improved the portability
of string operations (see
"GLOSIX
Part 2-1. C"). However, there is no publicly available library that
properly handles, in a parameterizable way, the functions cited above for the
character sets and languages covered by the GLOSIX recommendations.
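To make the four kinds of operation listed above more tangible, here is a minimal Python sketch of each. It is in no way the library whose absence is noted above: the equivalence table is a hypothetical three-entry example, the function names are ours (with remove_diacritic echoing the `removediacritic' suggested earlier), and the Spanish rule covers only the "ch" case mentioned in the text.

```python
import unicodedata

# Hypothetical equivalence table: variants that count as the same as "oe".
EQUIVALENTS = {"\u0153": "oe", "&oelig;": "oe", "[oe]": "oe"}


def canonical(text):
    """Reduce every known variant to one plain spelling."""
    for variant, plain in EQUIVALENTS.items():
        text = text.replace(variant, plain)
    return text


def equivalent(a, b):
    """String equivalence, as opposed to string equality (a == b)."""
    return canonical(a) == canonical(b)


def remove_diacritic(text):
    """Conversion: drop combining marks, e.g. e with acute becomes e."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def has_diacritic(char):
    """Categorization: does the character carry a diacritic?"""
    return remove_diacritic(char) != char


def to_upper(text, keep_accents=True):
    """Conversion whose result depends on an application-level choice."""
    upper = text.upper()
    return upper if keep_accents else remove_diacritic(upper)


def spanish_key(word):
    """Lexicographic order: 'ch' sorts as one letter between 'c' and 'd'."""
    key = []
    lower = word.lower()
    i = 0
    while i < len(lower):
        if lower.startswith("ch", i):
            key.append(("c", 1))  # after every plain 'c', before 'd'
            i += 2
        else:
            key.append((lower[i], 0))
            i += 1
    return key


if __name__ == "__main__":
    print("oe" == "&oelig;", equivalent("oe", "&oelig;"))
    print(to_upper("\u00e9"), to_upper("\u00e9", keep_accents=False))
    print(sorted(["dama", "chico", "cosa", "casa"], key=spanish_key))
```

Each function takes its policy as data or as a parameter, which is the kind of parameterization by language and application that the text calls for.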
The development of GLOSIX specifications for string operations will require:
- the establishment of a precise typology of string operations needed by
linguistic software;
- a description of the behavior of each operation for each
language/character set.
[THIS SECTION IS UNDER DEVELOPMENT]
- ANSI X3.159-1989
- American National Standard for Information Systems -- Programming
Language -- C, American National Standards Institute.
- ISO 639:1988
- Code for the representation of names of languages
- ISO 639-2:1995
- Code for the representation of names of languages--Alpha-3 code
- ISO 646.IRV:1991
- Information Processing -- ISO 7-bit coded character set for information
interchange [=ANSI X3.4-1986]
- ISO 2022:1986
- Information processing - ISO 7-bit and 8-bit coded character sets - Code
extension techniques
- ISO 3166:1993
- Codes for the representation of names of countries
- ISO-8859
- Information Processing -- 8-bit Single-Byte Coded Graphic Character Sets --
Part 1: Latin Alphabet No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2,
ISO 8859-2, 1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 1988. Part 4: Latin
alphabet No. 4, ISO 8859-4, 1988. Part 5: Latin/Cyrillic alphabet, ISO 8859-5,
1988. Part 6: Latin/Arabic alphabet, ISO 8859-6, 1987. Part 7: Latin/Greek
alphabet, ISO 8859-7, 1987. Part 8: Latin/Hebrew alphabet, ISO 8859-8, 1988.
Part 9: Latin alphabet No. 5, ISO 8859-9, 1990.
- ISO 8879:1986
- Information Processing--Text and Office Systems--Standard Generalized
Markup Language (SGML).
- ISO 9899:1990
- The C programming language, International Organization for
Standardization.
-
- [This is the international version of, and identical to, ANSI X3.159-1989]
- ISO/IEC 9945-2:1993
- POSIX Shell and Utilities
-
- [Character mnemonics also available in
RFC-1345].
- ISO/IEC 10646-1
- Information technology - Character sets and information coding
-Universal multiple-octet coded character set - Part 1 - Architecture and basic
multilingual plane
- UNICODE 1.1
- "The Unicode Standard, Version 1.1": Version 1.0, Volume 1 (ISBN
0-201-56788-1), Version 1.0, Volume 2 (ISBN 0- 201-60845-6), and "Unicode
Technical Report #4, The Unicode Standard, Version 1.1" (available from The
Unicode Consortium, and soon to be published by Addison- Wesley).
-
- [This character set is identical with the character repertoire and coding
of the international standard ISO/IEC 10646-1:1993(E); Coded Representation
Form=UCS-2; Subset=300; Implementation Level=3.]
- ISO 2375 registration
- "International Register of Coded Character Sets to be Used With Escape
Sequences", European Computer Manufacturers Association (ECMA), Rue du Rhone
114, CH-1204 Geneve, Switzerland, December 1990.
-
RFC-1345
- Simonsen, K. (1992). Character Mnemonics & Character Sets, RFC 1345,
Rationel Almen Planlaegning, June 1992.
-
- <URL:ftp://ds.internic.net/rfc/rfc1345.txt>
-
RFC-1521
- Borenstein N., and N. Freed, "MIME (Multipurpose Internet Mail Extensions)
Part One: Mechanisms for Specifying and Describing the Format of Internet
Message Bodies", RFC 1521, Bellcore, Innosoft, September 1993.
-
- <URL:ftp://ds.internic.net/rfc/rfc1521.txt>
-
RFC-1522
- Moore, K., "Representation of Non-Ascii Text in Internet Message Headers"
RFC 1522, University of Tennessee, September 1993.
-
- <URL:ftp://ds.internic.net/rfc/rfc1522.txt>
-
RFC-1642
- Goldsmith, D., Davis, M. (1994) UTF-7 Encoding Form, RFC 1642, Taligent,
Inc., July 1994.
-
- <URL:ftp://ds.internic.net/rfc/rfc1642.txt>
-
- <URL:http://www.stonehand.com/unicode/standard/utf7.html>
- ISO/IEC 6429:1992
- Information processing -- Control functions for 7-bit and 8-bit coded
character sets
- ISO 9:1986
- Transliteration of Slavic Cyrillic characters into Latin characters
- ISO 233:1984
- Transliteration of Arabic characters into Latin characters
- ISO 233-2:1993
- Transliteration of Arabic characters into Latin characters--Part 2:
simplified transliteration
- ISO 259:1984
- Transliteration of Hebrew characters into Latin characters
- ISO R 843:1968
- Transliteration of Greek characters into Latin characters
- ISO 3602:1989
- Romanization of Japanese (kana script)
- ISO 7098:1991
- Romanization of Chinese
This list does not include standards (see Standards above).
-
Alvestrand, H. T. (1995).
Characters
and character sets for various languages
-
<URL:http://domen.uninett.no/~hta/ietf/lang-chars.txt>
"There is a need to have a source of information about the
characters that are used in various languages. No such information is currently
readily available on the net. This document attempts to fill that
void."
-
Connolly, D. (1995)
Character
Set Considered Harmful, Internet-draft
[draft-ietf-html-charset-harmful-00.txt], Internet Engineering Task Force
(IETF), HTML working group (HTML-WG).
-
<URL:http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html>
-
Czyborra, R. (1994).
The
ISO 8859 series.
-
<URL:http://www.cs.tu-berlin.de/user/czyborra/charsets/>
-
Gschwind, M. K. (1994).
ISO
8859-1 National Character Set FAQ
-
<URL:ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/FAQ-ISO-8859-1>
-
Gschwind, M. K. (1994).
More
about Internationalization
-
<URL:http://www.vlsivie.tuwien.ac.at/mike/i18n.html>
-
Gschwind, M. K. (1994).
Programming
for Internationalization
-
<URL:ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/i18n-programming>
-
Ide, N., Véronis, J. (Eds.) (1995).
The Text Encoding Initiative:
background and context. Kluwer Academic Publishers, Dordrecht,
342p. [reprinted from Triple special issue of Computers and the
Humanities, 29, no 1/2/3, with an original bibliography]
-
Kuhn, M. (1994).
"Frequently Asked Questions about International Standards with Answers".
-
FAQ posted monthly to the USENET groups comp.protocols.iso,
comp.std.misc and comp.std.internat, and archived on many FAQ servers, such as:
<URL:ftp://ftp.inria.fr/faq/comp.std.internat/Standards_FAQ>
-
Severance, Ch. (1995).
Overview
of the IEEE P1003.0 Guide to the POSIX Open Software Environment
-
<URL:http://wxweb.msu.edu/~crs/posix/p10030/intro.htm>
-
Sperberg-McQueen, C.M., Burnard, L. (Eds.) (1994).
Guidelines
for Electronic Text Encoding and Interchange, Text Encoding Initiative,
Chicago and Oxford.
-
<URL:http://etext.virginia.edu/TEI.html>
-
Unicode (1994).
The
Unicode Standard
-
<URL:http://www.stonehand.com/unicode/standard.html>
-
Unicode (1994).
Frequently
asked questions [about Unicode]
-
<URL:http://www.stonehand.com/unicode/faq.html>
-
Unicode (1994).
Basic
Principles
-
<URL:http://www.stonehand.com/unicode/standard/principles.html>
-
Unicode (1994).
Character/Glyph
Operational Model.
-
<http://www.stonehand.com/unicode/standard/cgmodel.html>
-
Unicode (1994).
Language
Coding Using ISO/IEC 6429
-
<http://www.stonehand.com/unicode/standard/tc304.html>
Copyright (c) Centre National de la Recherche Scientifique, 1995-1996.