Blame: unicode(7) - Waikato Linux Users Group

Annotated edit history of unicode(7) version 2, including all changes.

Rev	Author	#	Line
1	perry	1	`UNICODE`
		2	`!!!UNICODE`
		3	`NAME`
		4	`DESCRIPTION`
		5	`COMBINING CHARACTERS`
		6	`IMPLEMENTATION LEVELS`
		7	`UNICODE UNDER LINUX`
		8	`PRIVATE AREA`
		9	`LITERATURE`
		10	`BUGS`
		11	`AUTHOR`
		12	`SEE ALSO`
		13	`----`
		14	`!!NAME`
		15
		16
		17	`Unicode - the Universal Character Set`
		18	`!!DESCRIPTION`
		19
		20
		21	`The international standard __ISO 10646__ defines the`
		22	`__Universal Character Set (UCS)__. UCS contains all`
		23	`characters of all other character set standards. It also`
		24	`guarantees __round-trip compatibility__, i.e., conversion`
		25	`tables can be built such that no information is lost when a`
		26	`string is converted from any other encoding to UCS and`
		27	`back.`
		28
		29
		30	`UCS contains the characters required to represent`
		31	`practically all known languages. This includes not only the`
		32	`Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and`
		33	`Georgian scripts, but also Chinese, Japanese and Korean`
		34	`Han ideographs as well as scripts such as Hiragana,`
		35	`Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,`
		36	`Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer,`
		37	`Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics,`
		38	`Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi,`
		39	`and others. For scripts not yet covered, research on how to`
		40	`best encode them for computer usage is still going on and`
		41	`they will be added eventually. This might eventually include`
		42	`not only Hieroglyphs and various historic Indo-European`
		43	`languages, but even some selected artistic scripts such as`
		44	`Tengwar, Cirth, and Klingon. UCS also covers a large number`
		45	`of graphical, typographical, mathematical and scientific`
		46	`symbols, including those provided by TeX, Postscript, APL,`
		47	`MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many`
		48	`word processing and publishing systems, and more are being`
		49	`added.`
		50
		51
		52	`The UCS standard (ISO 10646) describes a ''31-bit character`
		53	`set architecture'' consisting of 128 24-bit ''groups'',`
		54	`each divided into 256 16-bit ''planes'' made up of 256`
		55	`8-bit ''rows'' with 256 ''column'' positions, one for`
		56	`each character. Part 1 of the standard (__ISO 10646-1__)`
		57	`defines the first 65534 code positions (0x0000 to 0xfffd),`
		58	`which form the ''Basic Multilingual Plane (BMP)'', that`
		59	`is plane 0 in group 0. Part 2 of the standard (__ISO`
		60	`10646-2__) adds characters to group 0 outside the BMP in`
		61	`several ''supplementary planes'' in the range 0x10000 to`
		62	`0x10ffff. There are no plans to add characters beyond`
		63	`0x10ffff to the standard; therefore, of the entire code`
		64	`space, only a small fraction of group 0 will ever be`
		65	`actually used in the foreseeable future. The BMP contains`
		66	`all characters found in the other commonly used character`
		67	`sets. The supplemental planes added by ISO 10646-2 cover`
		68	`only more exotic characters for special scientific,`
		69	`dictionary printing, publishing industry, higher-level`
		70	`protocol and enthusiast needs.`
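
The group/plane/row/column layout described above amounts to slicing a 31-bit code value into four bit fields. The following is a minimal sketch in Python; the function names `ucs_position` and `in_bmp` are hypothetical helpers chosen for illustration and are not part of any standard API.

```python
def ucs_position(code):
    """Split a 31-bit UCS code value into (group, plane, row, column)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS code values are 31 bits wide")
    group = (code >> 24) & 0x7F   # 128 groups
    plane = (code >> 16) & 0xFF   # 256 planes per group
    row = (code >> 8) & 0xFF      # 256 rows per plane
    column = code & 0xFF          # 256 column positions per row
    return group, plane, row, column


def in_bmp(code):
    """True if the code value lies in plane 0 of group 0 (the BMP)."""
    group, plane, _, _ = ucs_position(code)
    return group == 0 and plane == 0


print(ucs_position(0x00C4))              # (0, 0, 0, 196)
print(ucs_position(0x10FFFF))            # (0, 16, 255, 255)
print(in_bmp(0x20AC), in_bmp(0x10000))   # True False
```

The last supplementary-plane position, 0x10ffff, sits in plane 16 of group 0, which shows how little of the 31-bit space is actually in use.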
		71
		72
		73	`The representation of each UCS character as a 2-byte word is`
		74	`referred to as the __UCS-2__ form (only for BMP`
		75	`characters), whereas __UCS-4__ is the representation of`
		76	`each character by a 4-byte word. In addition, there exist`
		77	`two encoding forms: __UTF-8__ for backwards compatibility`
		78	`with ASCII processing software and __UTF-16__ for the`
		79	`backwards compatible handling of non-BMP characters up to`
		80	`0x10ffff by UCS-2 software.`
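
To make the difference between these encoding forms concrete, the sketch below encodes one BMP character and one non-BMP code value. It assumes that Python's `utf-16-be` and `utf-32-be` codecs stand in for big-endian UCS-2 (for BMP characters) and UCS-4; `utf16_code_units` is a hypothetical helper that spells out the surrogate-pair arithmetic UTF-16 uses above 0xffff.

```python
def utf16_code_units(code):
    """Return the UTF-16 code units for one UCS code value up to 0x10FFFF."""
    if 0xD800 <= code <= 0xDFFF or not 0 <= code <= 0x10FFFF:
        raise ValueError("not encodable in UTF-16")
    if code < 0x10000:
        return [code]                     # BMP: identical to UCS-2
    offset = code - 0x10000               # 20 bits remain
    return [0xD800 + (offset >> 10),      # high surrogate: top 10 bits
            0xDC00 + (offset & 0x3FF)]    # low surrogate: bottom 10 bits


ch = "\u00c4"
print(ch.encode("utf-8").hex())       # c384
print(ch.encode("utf-16-be").hex())   # 00c4
print(ch.encode("utf-32-be").hex())   # 000000c4
print([hex(u) for u in utf16_code_units(0x10400)])   # ['0xd801', '0xdc00']
```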
		81
		82
		83	`The UCS characters 0x0000 to 0x007f are identical to those`
		84	`of the classic __US-ASCII__ character set and the`
		85	`characters in the range 0x0000 to 0x00ff are identical to`
		86	`those in __ISO 8859-1 Latin-1__.`
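
A quick way to check this identity is to decode every possible byte value as Latin-1 and compare the resulting code values with the byte values. This is only an illustrative sketch that relies on Python's built-in `latin-1` and `ascii` codecs.

```python
# Decode all 256 byte values as ISO 8859-1 and the first 128 as US-ASCII.
latin1 = bytes(range(256)).decode("latin-1")
ascii_part = bytes(range(128)).decode("ascii")

# Each byte value maps to the UCS code value with the same number.
print(all(ord(ch) == i for i, ch in enumerate(latin1)))   # True
print(latin1[:128] == ascii_part)                         # True
print("\u00e9".encode("latin-1"))                         # b'\xe9'
```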
		87	`!!COMBINING CHARACTERS`
		88
		89
		90	`Some code points in __UCS__ have been assigned to`
		91	`''combining characters''. These are similar to the`
		92	`non-spacing accent keys on a typewriter. A combining`
		93	`character just adds an accent to the previous character. The`
		94	`most important accented characters have codes of their own`
		95	`in UCS; however, the combining character mechanism allows us`
		96	`to add accents and other diacritical marks to any character.`
		97	`The combining characters always follow the character which`
		98	`they modify. For example, the German character Umlaut-A`
		99	`(`
		100	`''`
		101
		102
		103	`Combining characters are essential, for instance, for encoding`
		104	`the Thai script, for mathematical typesetting, and for users of`
		105	`the International Phonetic Alphabet.`
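
Taking the Umlaut-A mentioned above as an example, the sketch below compares the precomposed form with the base-plus-combining form. The code values U+00C4 and U+0308 come from the Unicode character database rather than from the text above, and Python's `unicodedata` module is used here only as a convenient way to inspect them.

```python
import unicodedata

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
combined = "A\u0308"     # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# The two strings render alike but are different code sequences.
print(precomposed == combined)                                 # False

# Normalization converts between the two representations.
print(unicodedata.normalize("NFC", combined) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combined)   # True

# The combining character follows the base character it modifies.
print(unicodedata.name(combined[1]))                           # COMBINING DIAERESIS
print(unicodedata.combining(combined[1]))                      # 230
```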
		106	`!!IMPLEMENTATION LEVELS`
		107
		108
		109	`As not all systems are expected to support advanced`
		110	`mechanisms like combining characters, ISO 10646-1 specifies`
		111	`the following three ''implementation levels'' of`
		112	`UCS:`
		113
		114
		115	`Level 1`
		116
		117
		118	`Combining characters and __Hangul Jamo__ (a variant`
		119	`encoding of the Korean script, where a Hangul syllable glyph`
		120	`is coded as a triplet or pair of vowel/consonant codes) are`
		121	`not supported.`
		122
		123
		124	`Level 2`
		125
		126
		127	`In addition to level 1, combining characters are now allowed`
		128	`for some languages where they are essential (e.g., Thai,`
		129	`Lao, Hebrew, Arabic, Devanagari, Malayalam,`
		130	`etc.).`
		131
		132
		133	`Level 3`
		134
		135
		136	`All __UCS__ characters are supported.`
		137
		138
		139	`The __Unicode 3.0 Standard__ published by the __Unicode`
		140	`Consortium__ contains exactly the __UCS Basic`
		141	`Multilingual Plane__ at implementation level 3, as`
		142	`described in ISO 10646-1:2000. __Unicode 3.1__ added the`
		143	`supplemental planes of ISO 10646-2. The Unicode standard and`
		144	`technical reports published by the Unicode Consortium`
		145	`provide much additional information on the semantics and`
		146	`recommended usages of various characters. They provide`
		147	`guidelines and algorithms for editing, sorting, comparing,`
		148	`normalizing, converting and displaying Unicode`
		149	`strings.`
		150	`!!UNICODE UNDER LINUX`
		151
		152
		153	`Under GNU/Linux, the C type __wchar_t__ is a signed`
		154	`32-bit integer type. Its values are always interpreted by`
		155	`the C library as __UCS__ code values (in all locales), a`
		156	`convention that is signaled by the GNU C library to`
		157	`applications by defining the constant`
		158	`____STDC_ISO_10646____ as specified in the ISO C 99`
		159	`standard.`
		160
		161
		162	`UCS/Unicode can be used just like ASCII in input/output`
		163	`streams, terminal communication, plaintext files, filenames,`
		164	`and environment variables in the ASCII compatible`
		165	`__UTF-8__ multi-byte encoding. To signal the use of UTF-8`
		166	`as the character encoding to all applications, a suitable`
		167	`__locale__ has to be selected via environment variables`
		168	`(e.g., __`
		169
		170
		171	`The __nl_langinfo(CODESET)__ function returns the name of`
		172	`the selected encoding. Library functions such as`
		173	`wctomb(3) and mbsrtowcs(3) can be used to`
		174	`transform the internal __wchar_t__ characters and strings`
		175	`into the system character encoding and back, and`
		176	`wcwidth(3) tells how many positions (0-2) the cursor`
		177	`is advanced by the output of a character.`
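
As a rough illustration of these two queries from a scripting language, the sketch below uses Python's `locale.nl_langinfo` wrapper (available on Unix-like systems) and an approximate width function. `approx_width` and `current_codeset` are hypothetical helpers; the width rule is an assumption based on Unicode character properties and is not the C library's wcwidth(3) implementation.

```python
import locale
import unicodedata


def approx_width(ch):
    """Rough column width of one character: 0, 1 or 2."""
    if unicodedata.combining(ch):
        return 0    # combining mark: cursor does not advance
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2    # wide or fullwidth CJK character
    return 1


def current_codeset():
    """Name of the locale's character encoding, or None if unavailable."""
    try:
        locale.setlocale(locale.LC_ALL, "")   # adopt the environment's locale
    except locale.Error:
        pass                                  # keep the default "C" locale
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    return None


print(current_codeset())
print([approx_width(c) for c in "A\u0308\u4e2d"])   # [1, 0, 2]
```

The printed encoding name depends on the environment; in a UTF-8 locale it is typically `UTF-8`.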
		178
		179
		180	`Under Linux, in general only the BMP at implementation level`
		181	`1 should be used at the moment. Up to two combining`
		182	`characters per base character for certain scripts (in`
		183	`particular Thai) are also supported by some UTF-8 terminal`
		184	`emulators and ISO 10646 fonts (level 2), but in general`
		185	`precomposed characters should be preferred where available`
		186	`(Unicode calls this __Normalization Form`
		187	`C__).`
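
The recommendation above can be approximated in a small check: normalize to Normalization Form C, then flag whatever still falls outside the BMP at level 1. This is a sketch under stated assumptions: `level1_check` is a hypothetical helper, and it uses the Unicode "Mark" general categories and the Hangul Jamo block (0x1100 to 0x11ff) as a proxy for the exact character lists in ISO 10646-1.

```python
import unicodedata


def level1_check(text):
    """Return (NFC form of text, list of characters beyond BMP level 1)."""
    nfc = unicodedata.normalize("NFC", text)
    offending = [
        c for c in nfc
        if ord(c) > 0xFFFF                           # outside the BMP
        or unicodedata.category(c).startswith("M")   # combining mark (proxy)
        or 0x1100 <= ord(c) <= 0x11FF                # Hangul Jamo block
    ]
    return nfc, offending


print(level1_check("A\u0308"))   # ('Ä', [])
print(level1_check("q\u0308"))   # the diaeresis stays: no precomposed form
```

Sequences with a precomposed equivalent pass after normalization, while a sequence such as q followed by a combining diaeresis is reported because no precomposed character exists for it.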
		188	`!!PRIVATE AREA`
		189
		190
		191	`In the __BMP__, the range 0xe000 to 0xf8ff will never be`
		192	`assigned to any characters by the standard and is reserved`
		193	`for private usage. For the Linux community, this private`
		194	`area has been subdivided further into the range 0xe000 to`
		195	`0xefff which can be used individually by any end-user and`
		196	`the Linux zone in the range 0xf000 to 0xf8ff where`
		197	`extensions are coordinated among all Linux users. The`
		198	`registry of the characters assigned to the Linux zone is`
		199	`currently maintained by H. Peter Anvin`
		200	`__`
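
The subdivision of the private area described above can be expressed as a simple range test. In this sketch, `private_zone` and its return labels are hypothetical names chosen for illustration; only the numeric ranges come from the text.

```python
import unicodedata


def private_zone(code):
    """Classify a BMP code value within the private-use area, if any."""
    if 0xE000 <= code <= 0xEFFF:
        return "end-user"     # free for individual use
    if 0xF000 <= code <= 0xF8FF:
        return "linux-zone"   # coordinated among Linux users
    return None               # not in the BMP private area


print(private_zone(0xE123))   # end-user
print(private_zone(0xF800))   # linux-zone
print(private_zone(0x0041))   # None

# The Unicode character database marks the whole range as private use ("Co").
print(unicodedata.category(chr(0xE000)), unicodedata.category(chr(0xF8FF)))   # Co Co
```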
		201	`!!LITERATURE`
		202
		203
		204	`*`
		205
		206
		207	`Information technology -- Universal Multiple-Octet Coded`
		208	`Character Set (UCS) -- Part 1: Architecture and Basic`
		209	`Multilingual Plane. International Standard ISO/IEC 10646-1,`
		210	`International Organization for Standardization, Geneva,`
		211	`2000.`
		212
		213
		214	`This is the official specification of __UCS__. Available`
		215	`as a PDF file on CD-ROM from`
		216	`http://www.iso.ch/.`
		217
		218
		219	`*`
		220
		221
		222	`The Unicode Standard, Version 3.0. The Unicode Consortium,`
		223	`Addison-Wesley, Reading, MA, 2000, ISBN`
		224	`0-201-61633-5.`
		225
		226
		227	`*`
		228
		229
		230	`S. Harbison, G. Steele. C: A Reference Manual. Fourth`
		231	`edition, Prentice Hall, Englewood Cliffs, 1995, ISBN`
		232	`0-13-326224-3.`
		233
		234
		235	`A good reference book about the C programming language. The`
		236	`fourth edition covers the 1994 Amendment 1 to the ISO C 90`
		237	`standard, which adds a large number of new C library`
		238	`functions for handling wide and multi-byte character`
		239	`encodings, but it does not yet cover ISO C 99, which`
		240	`improved wide and multi-byte character support even`
		241	`further.`
		242
		243
		244	`*`
		245
		246
		247	`Unicode Technical Reports.`
		248	`http://www.unicode.org/unicode/reports/`
		249
		250
		251	`*`
		252
		253
		254	`Markus Kuhn: UTF-8 and Unicode FAQ for Unix/Linux.`
		255	`http://www.cl.cam.ac.uk/~mgk25/unicode.html`
		256
		257
		258	`Provides subscription information for the __linux-utf8__`
		259	`mailing list, which is the best place to look for advice on`
		260	`using Unicode under Linux.`
		261
		262
		263	`*`
		264
		265
		266	`Bruno Haible: Unicode HOWTO.`
		267	`ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html`
		268	`!!BUGS`
		269
		270
		271	`When this man page was last revised, the GNU C Library`
		272	`support for __UTF-8__ locales was mature and XFree86`
		273	`support was in an advanced state, but work on making`
		274	`applications (most notably editors) suitable for use in`
		275	`__UTF-8__ locales was still fully in progress. Current`
		276	`general __UCS__ support under Linux usually provides for`
		277	`CJK double-width characters and sometimes even simple`
		278	`overstriking combining characters, but usually does not`
		279	`include support for scripts with right-to-left writing`
		280	`direction or ligature substitution requirements such as`
		281	`Hebrew, Arabic, or the Indic scripts. These scripts are`
		282	`currently only supported in certain GUI applications (HTML`
		283	`viewers, word processors) with sophisticated text rendering`
		284	`engines.`
		285	`!!AUTHOR`
		286
		287
		288	`Markus Kuhn`
		289	`!!SEE ALSO`
		290
		291
2	perry	292	`utf-8(7), charsets(7),`
1	perry	293	`setlocale(3)`
		294	`----`

This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.

Last edited on Tuesday, June 4, 2002 12:31:00 am by "perry"
