Annotated edit history of unicode(7) version 2, including all changes.
Rev Author # Line
1 perry 1 UNICODE
2 !!!UNICODE
3 NAME
4 DESCRIPTION
5 COMBINING CHARACTERS
6 IMPLEMENTATION LEVELS
7 UNICODE UNDER LINUX
8 PRIVATE AREA
9 LITERATURE
10 BUGS
11 AUTHOR
12 SEE ALSO
13 ----
14 !!NAME
15
16
17 Unicode - the Universal Character Set
18 !!DESCRIPTION
19
20
21 The international standard __ISO 10646__ defines the
22 __Universal Character Set (UCS)__. UCS contains all
23 characters of all other character set standards. It also
24 guarantees __round-trip compatibility__, i.e., conversion
25 tables can be built such that no information is lost when a
26 string is converted from any other encoding to UCS and
27 back.
28
29
30 UCS contains the characters required to represent
31 practically all known languages. This includes not only the
32 Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and
33 Georgian scripts, but also Chinese, Japanese and Korean
34 Han ideographs as well as scripts such as Hiragana,
35 Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
36 Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer,
37 Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics,
38 Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi,
39 and others. For scripts not yet covered, research on how to
40 best encode them for computer usage is still going on and
41 they will be added eventually. This might include
42 not only Hieroglyphs and various historic Indo-European
43 languages, but even some selected artistic scripts such as
44 Tengwar, Cirth, and Klingon. UCS also covers a large number
45 of graphical, typographical, mathematical and scientific
46 symbols, including those provided by TeX, Postscript, APL,
47 MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many
48 word processing and publishing systems, and more are being
49 added.
50
51
52 The UCS standard (ISO 10646) describes a ''31-bit character
53 set architecture'' consisting of 128 24-bit ''groups'',
54 each divided into 256 16-bit ''planes'' made up of 256
55 8-bit ''rows'' with 256 ''column'' positions, one for
56 each character. Part 1 of the standard (__ISO 10646-1__)
57 defines the first 65534 code positions (0x0000 to 0xfffd),
58 which form the ''Basic Multilingual Plane (BMP)'', that
59 is plane 0 in group 0. Part 2 of the standard (__ISO
60 10646-2__) adds characters to group 0 outside the BMP in
61 several ''supplementary planes'' in the range 0x10000 to
62 0x10ffff. There are no plans to add characters beyond
63 0x10ffff to the standard; therefore, of the entire code
64 space, only a small fraction of group 0 will actually be
65 used in the foreseeable future. The BMP contains all
66 characters found in other commonly used character
67 sets. The supplemental planes added by ISO 10646-2 cover
68 only more exotic characters for special scientific,
69 dictionary printing, publishing industry, higher-level
70 protocol and enthusiast needs.
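
The group/plane/row/column layout described above amounts to slicing a code position into bit fields. The following is only a minimal sketch of that arithmetic; the function names ucs_position and in_bmp are hypothetical and are not part of any standard library.

```python
def ucs_position(code):
    """Split a 31-bit UCS code position into (group, plane, row, column)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS code space")
    group = (code >> 24) & 0x7F   # 128 groups
    plane = (code >> 16) & 0xFF   # 256 planes per group
    row = (code >> 8) & 0xFF      # 256 rows per plane
    column = code & 0xFF          # 256 column positions per row
    return group, plane, row, column


def in_bmp(code):
    """True if the code position lies in plane 0 of group 0 (the BMP)."""
    group, plane, _, _ = ucs_position(code)
    return group == 0 and plane == 0


print(ucs_position(0x00C4))    # (0, 0, 0, 196)
print(ucs_position(0x1D11E))   # (0, 1, 209, 30)
print(in_bmp(0x1D11E))         # False
```

Every position up to 0x10ffff has group 0 and a plane number between 0 and 16, which is why only a small fraction of the code space is in use.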
71
72
73 The representation of each UCS character as a 2-byte word is
74 referred to as the __UCS-2__ form (only for BMP
75 characters), whereas __UCS-4__ is the representation of
76 each character by a 4-byte word. In addition, there exist
77 two encoding forms: __UTF-8__ for backward compatibility
78 with ASCII processing software, and __UTF-16__ for the
79 backward-compatible handling of non-BMP characters up to
80 0x10ffff by UCS-2 software.
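
To make the difference between these forms concrete, the sketch below encodes a few sample characters with Python's built-in codecs and computes a UTF-16 surrogate pair by hand. It assumes that Python's "utf-32-be" codec is an acceptable stand-in for UCS-4 within the range up to 0x10ffff; the helper name utf16_surrogates is hypothetical.

```python
def utf16_surrogates(code):
    """Return the (high, low) UTF-16 surrogate pair for a non-BMP code."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("only 0x10000..0x10ffff use surrogate pairs")
    offset = code - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return high, low


# An ASCII, a Latin-1, a BMP, and a non-BMP character.
for ch in ("A", "\u00c4", "\u20ac", "\U0001d11e"):
    print(
        f"U+{ord(ch):04X}",
        "UTF-8:", ch.encode("utf-8").hex(),
        "UTF-16:", ch.encode("utf-16-be").hex(),
        "UCS-4:", ch.encode("utf-32-be").hex(),
    )

print([hex(unit) for unit in utf16_surrogates(0x1D11E)])  # ['0xd834', '0xdd1e']
```

ASCII characters keep their single-byte value in UTF-8, while the non-BMP character needs four bytes in UTF-8 and two 16-bit units in UTF-16.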
81
82
83 The UCS characters 0x0000 to 0x007f are identical to those
84 of the classic __US-ASCII__ character set and the
85 characters in the range 0x0000 to 0x00ff are identical to
86 those in __ISO 8859-1 Latin-1__.
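
This identity can be checked mechanically. The short sketch below relies only on Python's built-in "ascii" and "latin-1" codecs and decodes every possible byte value.

```python
all_bytes = bytes(range(256))

# Decoding bytes 0x00..0xff as ISO 8859-1 yields code positions 0x0000..0x00ff.
latin1_codes = [ord(ch) for ch in all_bytes.decode("latin-1")]

# Decoding bytes 0x00..0x7f as US-ASCII yields code positions 0x0000..0x007f.
ascii_codes = [ord(ch) for ch in all_bytes[:128].decode("ascii")]

print(latin1_codes == list(range(256)))  # True
print(ascii_codes == list(range(128)))   # True
```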
87 !!COMBINING CHARACTERS
88
89
90 Some code points in __UCS__ have been assigned to
91 ''combining characters''. These are similar to the
92 non-spacing accent keys on a typewriter. A combining
93 character just adds an accent to the previous character. The
94 most important accented characters have codes of their own
95 in UCS, however, the combining character mechanism allows us
96 to add accents and other diacritical marks to any character.
97 The combining characters always follow the character which
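98 they modify. For example, the German character Umlaut-A
99 (LATIN CAPITAL LETTER A WITH DIAERESIS) can be represented either by the
100 precomposed UCS code 0x00c4 or as a LATIN CAPITAL LETTER A followed by a COMBINING DIAERESIS: 0x0041 0x0308.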
101
102
103 Combining characters are essential, for instance, for encoding
104 the Thai script, for mathematical typesetting, and for users of
105 the International Phonetic Alphabet.
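
As an illustration, Umlaut-A can be written either as the single precomposed code 0x00c4 or as the letter A (0x0041) followed by COMBINING DIAERESIS (0x0308). The sketch below uses Python's standard unicodedata module to show that the two spellings differ as code sequences but are equivalent under normalization.

```python
import unicodedata

precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
combined = "A\u0308"     # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

print(precomposed == combined)                                # False
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combined)  # True

# A non-zero canonical combining class marks a combining character.
print(unicodedata.combining("\u0308"))                        # 230
print(unicodedata.combining("A"))                             # 0
```

Software that compares or searches strings generally has to normalize both sides first, since a plain comparison of the two forms fails.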
106 !!IMPLEMENTATION LEVELS
107
108
109 As not all systems are expected to support advanced
110 mechanisms like combining characters, ISO 10646-1 specifies
111 the following three ''implementation levels'' of
112 UCS:
113
114
115 Level 1
116
117
118 Combining characters and __Hangul Jamo__ (a variant
119 encoding of the Korean script, where a Hangul syllable glyph
120 is coded as a triplet or pair of vowel/consonant codes) are
121 not supported.
122
123
124 Level 2
125
126
127 In addition to level 1, combining characters are now allowed
128 for some languages where they are essential (e.g., Thai,
129 Lao, Hebrew, Arabic, Devanagari, Malayalam,
130 etc.).
131
132
133 Level 3
134
135
136 All __UCS__ characters are supported.
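
A rough way to test whether a string stays within level 1 is sketched below. It is an approximation only: it assumes that "combining character" can be modeled by the Unicode general categories Mn, Mc and Me, and that Hangul Jamo means the block 0x1100 to 0x11ff, whereas ISO 10646-1 defines both sets by explicit lists. The name fits_level_1 is hypothetical.

```python
import unicodedata

# Assumption: the Hangul Jamo block, 0x1100..0x11ff.
HANGUL_JAMO = range(0x1100, 0x1200)


def fits_level_1(text):
    """Approximate check: no combining marks and no Hangul Jamo."""
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            return False   # combining mark
        if ord(ch) in HANGUL_JAMO:
            return False   # conjoining Jamo
    return True


print(fits_level_1("\u00c4"))        # True: precomposed Umlaut-A
print(fits_level_1("A\u0308"))       # False: uses COMBINING DIAERESIS
print(fits_level_1("\uac00"))        # True: precomposed Hangul syllable
print(fits_level_1("\u1100\u1161"))  # False: the same syllable as Jamo
```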
137
138
139 The __Unicode 3.0 Standard__ published by the __Unicode
140 Consortium__ contains exactly the __UCS Basic
141 Multilingual Plane__ at implementation level 3, as
142 described in ISO 10646-1:2000. __Unicode 3.1__ added the
143 supplemental planes of ISO 10646-2. The Unicode standard and
144 technical reports published by the Unicode Consortium
145 provide much additional information on the semantics and
146 recommended usages of various characters. They provide
147 guidelines and algorithms for editing, sorting, comparing,
148 normalizing, converting and displaying Unicode
149 strings.
150 !!UNICODE UNDER LINUX
151
152
153 Under GNU/Linux, the C type __wchar_t__ is a signed
154 32-bit integer type. Its values are always interpreted by
155 the C library as __UCS__ code values (in all locales), a
156 convention that is signaled by the GNU C library to
157 applications by defining the constant
158 ____STDC_ISO_10646____ as specified in the ISO C 99
159 standard.
160
161
162 UCS/Unicode can be used just like ASCII in input/output
163 streams, terminal communication, plaintext files, filenames,
164 and environment variables in the ASCII compatible
165 __UTF-8__ multi-byte encoding. To signal the use of UTF-8
166 as the character encoding to all applications, a suitable
167 __locale__ has to be selected via environment variables
168 (e.g., __LANG__ or __LC_CTYPE__).
169
170
171 The __nl_langinfo(CODESET)__ function returns the name of
172 the selected encoding. Library functions such as
173 wctomb(3) and mbsrtowcs(3) can be used to
174 transform the internal __wchar_t__ characters and strings
175 into the system character encoding and back, and
176 wcwidth(3) tells how many positions (0-2) the cursor
177 is advanced by the output of a character.
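
The same queries can be tried from a scripting language. The Python sketch below calls nl_langinfo(CODESET) through the standard locale module and estimates column widths from Unicode character properties. The width function is an approximation under stated assumptions (zero width for the categories Mn, Me and Cf, two columns for East Asian wide and fullwidth characters), not the C library's wcwidth(3); the names current_codeset and approx_width are hypothetical.

```python
import locale
import unicodedata


def current_codeset():
    """Return the encoding name of the locale selected by the environment."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale: stay in the default "C" locale
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    # Fallback for platforms without nl_langinfo().
    return locale.getpreferredencoding(False)


def approx_width(ch):
    """Rough column-width estimate for one character (not wcwidth(3))."""
    if ch == "\0":
        return 0
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1          # control characters have no defined width
    if category in ("Mn", "Me", "Cf"):
        return 0           # non-spacing marks and format characters
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2           # CJK wide and fullwidth characters
    return 1


print(current_codeset())
print([approx_width(ch) for ch in ("A", "\u0308", "\u4e2d")])  # [1, 0, 2]
```

In a UTF-8 locale the first line printed is typically UTF-8; the exact string depends on the system and the environment variables in effect.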
178
179
180 Under Linux, in general only the BMP at implementation level
181 1 should be used at the moment. Up to two combining
182 characters per base character for certain scripts (in
183 particular Thai) are also supported by some UTF-8 terminal
184 emulators and ISO 10646 fonts (level 2), but in general
185 precomposed characters should be preferred where available
186 (Unicode calls this __Normalization Form
187 C__).
188 !!PRIVATE AREA
189
190
191 In the __BMP__, the range 0xe000 to 0xf8ff will never be
192 assigned to any characters by the standard and is reserved
193 for private usage. For the Linux community, this private
194 area has been subdivided further into the range 0xe000 to
195 0xefff which can be used individually by any end-user and
196 the Linux zone in the range 0xf000 to 0xf8ff where
197 extensions are coordinated among all Linux users. The
198 registry of the characters assigned to the Linux zone is
199 currently maintained by H. Peter Anvin.
200
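
The subdivision above can be expressed as a simple range check. This is a minimal sketch; the function name private_zone and the returned labels are hypothetical.

```python
def private_zone(code):
    """Classify a BMP private-use code position per the Linux convention."""
    if 0xE000 <= code <= 0xEFFF:
        return "end-user zone"   # free for individual use
    if 0xF000 <= code <= 0xF8FF:
        return "Linux zone"      # coordinated among Linux users
    return None                  # not in the BMP private area


print(private_zone(0xE123))   # end-user zone
print(private_zone(0xF800))   # Linux zone
print(private_zone(0x0041))   # None
```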
201 !!LITERATURE
202
203
204 *
205
206
207 Information technology -- Universal Multiple-Octet Coded
208 Character Set (UCS) -- Part 1: Architecture and Basic
209 Multilingual Plane. International Standard ISO/IEC 10646-1,
210 International Organization for Standardization, Geneva,
211 2000.
212
213
214 This is the official specification of __UCS__. Available
215 as a PDF file on CD-ROM from
216 http://www.iso.ch/.
217
218
219 *
220
221
222 The Unicode Standard, Version 3.0. The Unicode Consortium,
223 Addison-Wesley, Reading, MA, 2000, ISBN
224 0-201-61633-5.
225
226
227 *
228
229
230 S. Harbison, G. Steele. C: A Reference Manual. Fourth
231 edition, Prentice Hall, Englewood Cliffs, 1995, ISBN
232 0-13-326224-3.
233
234
235 A good reference book about the C programming language. The
236 fourth edition covers the 1994 Amendment 1 to the ISO C 90
237 standard, which adds a large number of new C library
238 functions for handling wide and multi-byte character
239 encodings, but it does not yet cover ISO C 99, which
240 improved wide and multi-byte character support even
241 further.
242
243
244 *
245
246
247 Unicode Technical Reports.
248 http://www.unicode.org/unicode/reports/
249
250
251 *
252
253
254 Markus Kuhn: UTF-8 and Unicode FAQ for Unix/Linux.
255 http://www.cl.cam.ac.uk/~mgk25/unicode.html
256
257
258 Provides subscription information for the __linux-utf8__
259 mailing list, which is the best place to look for advice on
260 using Unicode under Linux.
261
262
263 *
264
265
266 Bruno Haible: Unicode HOWTO.
267 ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
268 !!BUGS
269
270
271 When this man page was last revised, the GNU C Library
272 support for __UTF-8__ locales was mature and XFree86
273 support was in an advanced state, but work on making
274 applications (most notably editors) suitable for use in
275 __UTF-8__ locales was still fully in progress. Current
276 general __UCS__ support under Linux usually provides for
277 CJK double-width characters and sometimes even simple
278 overstriking combining characters, but usually does not
279 include support for scripts with right-to-left writing
280 direction or ligature substitution requirements such as
281 Hebrew, Arabic, or the Indic scripts. These scripts are
282 currently only supported in certain GUI applications (HTML
283 viewers, word processors) with sophisticated text rendering
284 engines.
285 !!AUTHOR
286
287
288 Markus Kuhn
289 !!SEE ALSO
290
291
2 perry 292 utf-8(7), charsets(7),
1 perry 293 setlocale(3)
294 ----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.