UNICODE
!!!UNICODE
NAME
DESCRIPTION
COMBINING CHARACTERS
IMPLEMENTATION LEVELS
UNICODE UNDER LINUX
PRIVATE AREA
LITERATURE
BUGS
AUTHOR
SEE ALSO
----
!!NAME

Unicode - the Universal Character Set
!!DESCRIPTION

The international standard __ISO 10646__ defines the
__Universal Character Set (UCS)__. UCS contains all
characters of all other character set standards. It also
guarantees __round-trip compatibility__, i.e., conversion
tables can be built such that no information is lost when a
string is converted from any other encoding to UCS and
back.

UCS contains the characters required to represent
practically all known languages. This includes not only the
Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and
Georgian scripts, but also Chinese, Japanese and Korean
Han ideographs as well as scripts such as Hiragana,
Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer,
Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics,
Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi,
and others. For scripts not yet covered, research on how
best to encode them for computer usage is still going on and
they will be added eventually. This might eventually include
not only Hieroglyphs and various historic Indo-European
languages, but even some selected artistic scripts such as
Tengwar, Cirth, and Klingon. UCS also covers a large number
of graphical, typographical, mathematical and scientific
symbols, including those provided by TeX, PostScript, APL,
MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many
word processing and publishing systems, and more are being
added.

The UCS standard (ISO 10646) describes a ''31-bit character
set architecture'' consisting of 128 24-bit ''groups'',
each divided into 256 16-bit ''planes'' made up of 256
8-bit ''rows'' with 256 ''column'' positions, one for
each character. Part 1 of the standard (__ISO 10646-1__)
defines the first 65534 code positions (0x0000 to 0xfffd),
which form the ''Basic Multilingual Plane (BMP)'', that
is plane 0 in group 0. Part 2 of the standard (__ISO
10646-2__) adds characters to group 0 outside the BMP in
several ''supplementary planes'' in the range 0x10000 to
0x10ffff. There are no plans to add characters beyond
0x10ffff to the standard. Therefore, of the entire code
space, only a small fraction of group 0 will actually be
used in the foreseeable future. The BMP contains
all characters found in the other commonly used character
sets. The supplementary planes added by ISO 10646-2 cover
only more exotic characters for special scientific,
dictionary printing, publishing industry, higher-level
protocol and enthusiast needs.
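
The group/plane/row/column layout described above maps directly onto bit fields of a code value. The following C fragment is a minimal sketch of that decomposition. The names ''ucs_position'', ''ucs_locate'' and ''ucs_in_bmp'' are hypothetical helpers invented for this illustration; they are not part of any standard library.

```c
#include <stdint.h>

/* Position of a UCS code value in the 31-bit architecture. */
struct ucs_position {
    unsigned group;   /* bits 24..30: one of 128 groups        */
    unsigned plane;   /* bits 16..23: one of 256 planes        */
    unsigned row;     /* bits  8..15: one of 256 rows          */
    unsigned column;  /* bits  0..7:  one of 256 column slots  */
};

/* Split a code value into its group, plane, row and column. */
static struct ucs_position ucs_locate(uint32_t code)
{
    struct ucs_position p;

    p.group  = (code >> 24) & 0x7f;
    p.plane  = (code >> 16) & 0xff;
    p.row    = (code >> 8) & 0xff;
    p.column = code & 0xff;
    return p;
}

/* Nonzero if the code value lies in plane 0 of group 0 (the BMP). */
static int ucs_in_bmp(uint32_t code)
{
    return (code >> 16) == 0;
}
```

For example, ucs_locate(0x20ac) yields group 0, plane 0, row 0x20 and column 0xac, so that code value lies in the BMP, whereas 0x10000 is the first code value outside it.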

The representation of each UCS character as a 2-byte word is
referred to as the __UCS-2__ form (only for BMP
characters), whereas __UCS-4__ is the representation of
each character by a 4-byte word. In addition, there exist
two encoding forms: __UTF-8__ for backwards compatibility
with ASCII processing software, and __UTF-16__ for the
backwards compatible handling of non-BMP characters up to
0x10ffff by UCS-2 software.
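
To make the relationship between UCS-2 and UTF-16 concrete, here is a minimal sketch of a UTF-16 encoder. It relies on the surrogate-pair scheme from the UTF-16 definition (high surrogates 0xd800 to 0xdbff, low surrogates 0xdc00 to 0xdfff), which the text above does not spell out. The function name ''utf16_encode'' is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS code value as UTF-16.
 * Writes one or two 16-bit units to out and returns how many were
 * written. Returns 0 if the value cannot be encoded (it is itself
 * a surrogate code value, or it lies beyond 0x10ffff).
 */
static size_t utf16_encode(uint32_t code, uint16_t out[2])
{
    if (code >= 0xd800 && code <= 0xdfff)
        return 0;                   /* reserved for surrogates */
    if (code < 0x10000) {
        out[0] = (uint16_t)code;    /* BMP: same as UCS-2 */
        return 1;
    }
    if (code > 0x10ffff)
        return 0;                   /* outside the UTF-16 range */

    code -= 0x10000;                /* 20 bits remain */
    out[0] = (uint16_t)(0xd800 | (code >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xdc00 | (code & 0x3ff));  /* low surrogate  */
    return 2;
}
```

A BMP character therefore occupies a single unit that UCS-2 software already understands, while the code value 0x1d11e becomes the pair 0xd834 0xdd1e.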

The UCS characters 0x0000 to 0x007f are identical to those
of the classic __US-ASCII__ character set and the
characters in the range 0x0000 to 0x00ff are identical to
those in __ISO 8859-1 Latin-1__.
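
Because Latin-1 byte values coincide with UCS code values, converting Latin-1 text needs no lookup table. The sketch below turns Latin-1 bytes into UTF-8 using the standard one-byte and two-byte UTF-8 patterns. The function name ''latin1_to_utf8'' and its buffer convention are assumptions made for this illustration.

```c
#include <stddef.h>

/*
 * Convert n ISO 8859-1 bytes from src into UTF-8 in dst.
 * dst must have room for 2 * n + 1 bytes. Returns the number of
 * bytes written, not counting the terminating NUL.
 */
static size_t latin1_to_utf8(const unsigned char *src, size_t n,
                             unsigned char *dst)
{
    size_t j = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned code = src[i];  /* byte value equals UCS code value */

        if (code < 0x80) {
            dst[j++] = (unsigned char)code;           /* ASCII range */
        } else {
            dst[j++] = (unsigned char)(0xc0 | (code >> 6));
            dst[j++] = (unsigned char)(0x80 | (code & 0x3f));
        }
    }
    dst[j] = '\0';
    return j;
}
```

The Latin-1 bytes 0x41 0xe4 thus become the three UTF-8 bytes 0x41 0xc3 0xa4.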
!!COMBINING CHARACTERS

Some code points in __UCS__ have been assigned to
''combining characters''. These are similar to the
non-spacing accent keys on a typewriter. A combining
character just adds an accent to the previous character. The
most important accented characters have codes of their own
in UCS; however, the combining character mechanism allows us
to add accents and other diacritical marks to any character.
The combining characters always follow the character which
they modify. For example, the German character Umlaut-A
(
''

Combining characters are essential, for instance, for encoding
the Thai script, for mathematical typesetting, and for users of
the International Phonetic Alphabet.
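
As an illustration of the two ways an accented letter can be represented, the following sketch replaces a base letter followed by a combining diaeresis with the corresponding precomposed character. The code values used (0x0308 for the combining diaeresis, 0x00c4 for the precomposed capital A with diaeresis, and so on) come from the Unicode character tables rather than from the text above. The table is deliberately tiny and the name ''compose_diaeresis'' is hypothetical. A real implementation would use the full composition data published by the Unicode Consortium.

```c
#include <stddef.h>
#include <stdint.h>

#define COMBINING_DIAERESIS 0x0308

/* Tiny illustrative table, NOT the complete Unicode composition data. */
static const struct {
    uint32_t base;
    uint32_t precomposed;
} diaeresis_pairs[] = {
    { 0x0041, 0x00c4 },  /* A */
    { 0x004f, 0x00d6 },  /* O */
    { 0x0055, 0x00dc },  /* U */
    { 0x0061, 0x00e4 },  /* a */
    { 0x006f, 0x00f6 },  /* o */
    { 0x0075, 0x00fc },  /* u */
};

/*
 * Replace "base letter + combining diaeresis" pairs in s by their
 * precomposed form, in place. Returns the new length. Pairs whose
 * base letter is not in the table are left unchanged.
 */
static size_t compose_diaeresis(uint32_t *s, size_t len)
{
    const size_t npairs = sizeof diaeresis_pairs / sizeof diaeresis_pairs[0];
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t c = s[i];

        /* The combining character follows the character it modifies. */
        if (i + 1 < len && s[i + 1] == COMBINING_DIAERESIS) {
            for (size_t k = 0; k < npairs; k++) {
                if (diaeresis_pairs[k].base == c) {
                    c = diaeresis_pairs[k].precomposed;
                    i++;  /* consume the combining character */
                    break;
                }
            }
        }
        s[out++] = c;
    }
    return out;
}
```

Given the sequence 0x0041 0x0308 0x0078 0x0308, the function produces 0x00c4 0x0078 0x0308: the first pair has a precomposed form in the table, the second does not and stays a base character plus combining character.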
!!IMPLEMENTATION LEVELS

As not all systems are expected to support advanced
mechanisms like combining characters, ISO 10646-1 specifies
the following three ''implementation levels'' of
UCS:

Level 1

Combining characters and __Hangul Jamo__ (a variant
encoding of the Korean script, where a Hangul syllable glyph
is coded as a triplet or pair of vowel/consonant codes) are
not supported.

Level 2

In addition to level 1, combining characters are now allowed
for some languages where they are essential (e.g., Thai,
Lao, Hebrew, Arabic, Devanagari, and Malayalam).

Level 3

All __UCS__ characters are supported.

The __Unicode 3.0 Standard__ published by the __Unicode
Consortium__ contains exactly the __UCS Basic
Multilingual Plane__ at implementation level 3, as
described in ISO 10646-1:2000. __Unicode 3.1__ added the
supplementary planes of ISO 10646-2. The Unicode standard and
technical reports published by the Unicode Consortium
provide much additional information on the semantics and
recommended usages of various characters. They provide
guidelines and algorithms for editing, sorting, comparing,
normalizing, converting and displaying Unicode
strings.
!!UNICODE UNDER LINUX

Under GNU/Linux, the C type __wchar_t__ is a signed
32-bit integer type. Its values are always interpreted by
the C library as __UCS__ code values (in all locales), a
convention that is signaled by the GNU C library to
applications by defining the constant
____STDC_ISO_10646____ as specified in the ISO C 99
standard.
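
A program can test for this convention at compile time. The fragment below is a small sketch of such a check. The helper names ''wchar_holds_ucs'' and ''wchar_to_ucs'' are invented for this example; only the macro itself comes from the C standard.

```c
#include <wchar.h>

/* Returns 1 if the C library promises that wchar_t values are UCS
 * code values in every locale, 0 if it makes no such promise. */
static int wchar_holds_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Returns the UCS code value of wc, or -1 when the library does not
 * guarantee that wchar_t values can be read as UCS code values. */
static long wchar_to_ucs(wchar_t wc)
{
    return wchar_holds_ucs() ? (long)wc : -1L;
}
```

Code that has to run on other systems as well can use the -1 result to fall back to an explicit conversion, for instance via iconv(3).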

UCS/Unicode can be used just like ASCII in input/output
streams, terminal communication, plaintext files, filenames,
and environment variables in the ASCII compatible
__UTF-8__ multi-byte encoding. To signal the use of UTF-8
as the character encoding to all applications, a suitable
__locale__ has to be selected via environment variables
(e.g., __

The __nl_langinfo(CODESET)__ function returns the name of
the selected encoding. Library functions such as
wctomb(3) and mbsrtowcs(3) can be used to
transform the internal __wchar_t__ characters and strings
into the system character encoding and back, and
wcwidth(3) reports how many positions (0-2) the cursor
advances when a character is output.
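
The following sketch shows how these library functions might fit together. It assumes a POSIX system; the feature test macro is there because the GNU C library only declares wcwidth() when it is requested. The helper names ''use_environment_locale'', ''current_codeset'' and ''display_width'' are hypothetical.

```c
#define _XOPEN_SOURCE 700  /* ask for the wcwidth() declaration */
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Adopt the locale selected by the user's environment variables.
 * Returns nonzero on success. Call once at program start. */
static int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Name of the character encoding of the current locale. */
static const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}

/*
 * Number of cursor positions that the multi-byte string s occupies
 * in the current locale. Returns -1 if s is not valid in the locale's
 * encoding or contains a character for which wcwidth() reports -1.
 */
static int display_width(const char *s)
{
    mbstate_t state;
    wchar_t buf[64];
    int total = 0;

    memset(&state, 0, sizeof state);
    while (s != NULL) {
        /* Converts up to 64 wide characters per call and sets s to
         * NULL once the terminating NUL has been reached. */
        size_t n = mbsrtowcs(buf, &s, 64, &state);

        if (n == (size_t)-1)
            return -1;
        for (size_t i = 0; i < n; i++) {
            int w = wcwidth(buf[i]);

            if (w < 0)
                return -1;
            total += w;
        }
    }
    return total;
}
```

After a successful use_environment_locale() in a UTF-8 locale, current_codeset() is expected to name that encoding, and display_width() counts double-width CJK characters as two positions and combining characters as zero.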

Under Linux, in general only the BMP at implementation level
1 should be used at the moment. Up to two combining
characters per base character for certain scripts (in
particular Thai) are also supported by some UTF-8 terminal
emulators and ISO 10646 fonts (level 2), but in general
precomposed characters should be preferred where available
(Unicode calls this __Normalization Form
C__).
!!PRIVATE AREA

In the __BMP__, the range 0xe000 to 0xf8ff will never be
assigned to any characters by the standard and is reserved
for private usage. For the Linux community, this private
area has been subdivided further into the range 0xe000 to
0xefff, which can be used individually by any end-user, and
the Linux zone in the range 0xf000 to 0xf8ff, where
extensions are coordinated among all Linux users. The
registry of the characters assigned to the Linux zone is
currently maintained by H. Peter Anvin
__
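
The subdivision of the private area can be expressed as a simple range check. This is only a sketch: the enum and the function name ''classify_private'' are made up for this example, and the ranges are the ones stated above.

```c
#include <stdint.h>

enum ucs_private_zone {
    UCS_NOT_PRIVATE,    /* outside 0xe000..0xf8ff             */
    UCS_END_USER_ZONE,  /* 0xe000..0xefff, free for end-users */
    UCS_LINUX_ZONE      /* 0xf000..0xf8ff, coordinated zone   */
};

/* Classify a code value with respect to the BMP private area. */
static enum ucs_private_zone classify_private(uint32_t code)
{
    if (code >= 0xe000 && code <= 0xefff)
        return UCS_END_USER_ZONE;
    if (code >= 0xf000 && code <= 0xf8ff)
        return UCS_LINUX_ZONE;
    return UCS_NOT_PRIVATE;
}
```

A font tool could, for instance, use such a check to warn before it assigns a glyph to a code value in the coordinated Linux zone.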
!!LITERATURE

* Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. International Standard ISO/IEC 10646-1, International Organization for Standardization, Geneva, 2000.

This is the official specification of __UCS__. Available
as a PDF file on CD-ROM from
http://www.iso.ch/.

* The Unicode Standard, Version 3.0. The Unicode Consortium, Addison-Wesley, Reading, MA, 2000, ISBN 0-201-61633-5.

* S. Harbison, G. Steele. C: A Reference Manual. Fourth edition, Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.

A good reference book about the C programming language. The
fourth edition covers the 1994 Amendment 1 to the ISO C 90
standard, which adds a large number of new C library
functions for handling wide and multi-byte character
encodings, but it does not yet cover ISO C 99, which
improved wide and multi-byte character support even
further.

* Unicode Technical Reports. http://www.unicode.org/unicode/reports/

* Markus Kuhn: UTF-8 and Unicode FAQ for Unix/Linux. http://www.cl.cam.ac.uk/~mgk25/unicode.html

Provides subscription information for the __linux-utf8__
mailing list, which is the best place to look for advice on
using Unicode under Linux.

* Bruno Haible: Unicode HOWTO. ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
!!BUGS

When this man page was last revised, the GNU C Library
support for __UTF-8__ locales was mature and XFree86
support was in an advanced state, but work on making
applications (most notably editors) suitable for use in
__UTF-8__ locales was still in progress. Current
general __UCS__ support under Linux usually provides for
CJK double-width characters and sometimes even simple
overstriking combining characters, but usually does not
include support for scripts with right-to-left writing
direction or ligature substitution requirements such as
Hebrew, Arabic, or the Indic scripts. These scripts are
currently only supported in certain GUI applications (HTML
viewers, word processors) with sophisticated text rendering
engines.
!!AUTHOR

Markus Kuhn
!!SEE ALSO

utf-8(7), charsets(7),
setlocale(3)
----