Developer's Daily

Unix by Example



	Main Unix Man Pages

UNICODE

NAME
DESCRIPTION
COMBINING CHARACTERS
IMPLEMENTATION LEVELS
UNICODE UNDER LINUX
PRIVATE AREA
LITERATURE
BUGS
AUTHOR
SEE ALSO

NAME

Unicode − the unified 16-bit super character set

DESCRIPTION

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS contains all characters of all other character set standards. It also guarantees round-trip compatibility, i.e., conversion tables can be built such that no information is lost when a string is converted from any other encoding to UCS and back.

UCS contains the characters required to represent almost all known languages. This includes apart from the many languages which use extensions of the Latin script also the following scripts and languages: Greek, Cyrillic, Hebrew, Arabic, Armenian, Gregorian, Japanese, Chinese, Hiragana, Katakana, Korean, Hangul, Devangari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayam, Thai, Lao, Bopomofo, and a number of others. Work is going on to include further scripts like Tibetian, Khmer, Runic, Ethiopian, Hieroglyphics, various Indo-European languages, and many others. For most of these latter scripts, it was not yet clear how they can be encoded best when the standard was published in 1993. In addition to the characters required by these scripts, also a large number of graphical, typographical, mathematical and scientific symbols like those provided by TeX, PostScript, MS-DOS, Macintosh, Videotext, OCR, and many word processing systems have been included, as well as special codes that guarantee round-trip compatibility to all other existing character set standards.

The UCS standard (ISO 10646) describes a 31-bit character set architecture, however, today only the first 65534 code positions (0x0000 to 0xfffd), which are called the Basic Multilingual Plane (BMP), have been assigned characters, and it is expected that only very exotic characters (e.g. Hieroglyphics) for special scientific purposes will ever get a place outside this 16-bit BMP.

The UCS characters 0x0000 to 0x007f are identical to those of the classic US-ASCII character set and the characters in the range 0x0000 to 0x00ff are identical to those in the ISO 8859-1 Latin-1 character set.

COMBINING CHARACTERS

Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character just adds an accent to the previous character. The most important accented characters have codes of their own in UCS, however, the combining character mechanism allows to add accents and other diacritical marks to any character. The combining characters always follow the character which they modify. For example, the German character Umlaut-A ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code 0x00c4, or alternatively as the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": 0x0041 0x0308.

IMPLEMENTATION LEVELS

As not all systems are expected to support advanced mechanisms like combining characters, ISO 10646 specifies the following three implementation levels of UCS:

	Level 1		Combining characters and Hangul Jamo characters (a special, more complicated encoding of the Korean script, where Hangul syllables are coded as two or three subcharacters) are not supported.
	Level 2		Like level 1, however in some scripts, some combining characters are now allowed (e.g. for Hebrew, Arabic, Devangari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugo, Kannada, Malayalam, Thai and Lao).
	Level 3		All UCS characters are supported.

The Unicode 1.1 standard published by the Unicode Consortium contains exactly the UCS Basic Multilingual Plane at implementation level 3, as described in ISO 10646. Unicode 1.1 also adds some semantical definitions for some characters to the definitions of ISO 10646.

UNICODE UNDER LINUX

Under Linux, only the BMP at implementation level 1 should be used at the moment, in order to keep the implementation complexity of combining characters low. The higher implementation levels are more suitable for special word processing formats, but not as a generic system character set. The C type wchar_t is on Linux an unsigned 16-bit integer type and its values are interpreted as UCS level 1 BMP codes.

The locale setting specifies, whether the system character encoding is for example UTF-8 or ISO 8859-1. Library functions like wctomb, mbtowc, or wprintf can be used to transform the internal wchar_t characters and strings into the system character encoding and back.

PRIVATE AREA

In the BMP, the range 0xe000 to 0xf8ff will never be assigned any characters by the standard and is reserved for private usage. For the Linux community, this private area has been subdivided further into the range 0xe000 to 0xefff which can be used individually by any end-user and the Linux zone in the range 0xf000 to 0xf8ff where extensions are coordinated among all Linux users. The registry of the characters assigned to the Linux zone is currently maintained by H. Peter Anvin <Peter.Anvin@linux.org>, Yggdrasil Computing, Inc. It contains some DEC VT100 graphics characters missing in Unicode, gives direct access to the characters in the console font buffer and contains the characters used by a few advanced scripts like Klingon.

LITERATURE

Information technology − Universal Multiple-Octet Coded Character Set (UCS) − Part 1: Architecture and Basic Multilingual Plane. International Standard ISO 10646-1, International Organization for Standardization, Geneva, 1993.

This is the official specification of UCS. Pretty official, pretty thick, and pretty expensive. For ordering information, check www.iso.ch.

The Unicode Standard − Worldwide Character Encoding Version 1.0. The Unicode Consortium, Addison-Wesley, Reading, MA, 1991.

There is already Unicode 1.1.4 available. The changes to the 1.0 book are available from ftp.unicode.org. Unicode 2.0 will be published again as a book in 1996.

S. Harbison, G. Steele. C − A Reference Manual. Fourth edition, Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.

A good reference book about the C programming language. The fourth edition now covers also the 1994 Amendment 1 to the ISO C standard (ISO/IEC 9899:1990) which adds a large number of new C library functions for handling wide character sets.

BUGS

At the time when this man page was written, the Linux libc support for UCS was far from complete.

AUTHOR

Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>