version 4, including all changes.
CHARSETS
!!!CHARSETS
NAME
DESCRIPTION
ASCII
ISO 8859
KOI8-R
JIS X 0208
KS X 1001
GB 2312
Big5
TIS 620
UNICODE
ISO 2022 AND ISO 4873
SEE ALSO
----
!!NAME

charsets - programmer's view of character sets and internationalization
!!DESCRIPTION

Linux is an international operating system. Many of its
utilities and device drivers (including the console driver)
support multilingual character sets, including Latin-alphabet
letters with diacritical marks, accents, and ligatures, as well as
entire non-Latin alphabets such as Greek, Cyrillic,
Arabic, and Hebrew.

This manual page presents a programmer's-eye view of
different character-set standards and how they fit together
on Linux. Standards discussed include ASCII, ISO 8859,
KOI8-R, Unicode, ISO 2022 and ISO 4873. The primary emphasis
is on character sets actually used as locale character sets,
not the myriad others that can be found in data from other
systems.

A complete list of charsets used in an officially supported
locale in glibc 2.2.3 is: ISO-8859-{1,2,3,5,6,7,8,9,13,15},
CP1251, UTF-8, EUC-{KR,JP,TW}, KOI8-{R,U}, GB2312, GB18030,
GBK, BIG5, BIG5-HKSCS and TIS-620 (in no particular order).
(Romanian may be switching to ISO-8859-16.)
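
As a rough illustration, the charset names listed above can be checked against the codecs that a given Python installation happens to ship. This is only a sketch: the list GLIBC_223_CHARSETS and the helper split_by_codec_support are names invented here, and a missing Python codec says nothing about glibc's own locale support.

```python
import codecs

# Charset names as listed for glibc 2.2.3 in the text above.
GLIBC_223_CHARSETS = (
    [f"ISO-8859-{n}" for n in (1, 2, 3, 5, 6, 7, 8, 9, 13, 15)]
    + ["CP1251", "UTF-8", "EUC-KR", "EUC-JP", "EUC-TW",
       "KOI8-R", "KOI8-U", "GB2312", "GB18030", "GBK",
       "BIG5", "BIG5-HKSCS", "TIS-620"]
)


def split_by_codec_support(names):
    """Partition charset names by whether this Python has a codec for them."""
    available, missing = [], []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
        else:
            available.append(name)
    return available, missing


available, missing = split_by_codec_support(GLIBC_223_CHARSETS)
print("codec found:   ", ", ".join(available))
print("no codec found:", ", ".join(missing) or "(none)")
```

The result depends on the Python build, so no particular output is claimed here.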
!!ASCII

ASCII (American Standard Code for Information Interchange)
is the original 7-bit character set, originally designed for
American English. It is currently described by the ECMA-6
standard.

Various ASCII variants exist that replace the dollar sign with other
currency symbols and replace punctuation with non-English
alphabetic characters to cover German, French, Spanish and
other languages in 7 bits. All are deprecated; GNU libc doesn't
support locales whose character sets aren't true supersets
of ASCII. (These sets are also known as ISO-646, a close
relative of ASCII that permitted replacing these
characters.)

As Linux was written for hardware designed in the US, it
natively supports ASCII.
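
The "true superset of ASCII" property can be illustrated with a small sketch: text that uses only ASCII characters should produce identical bytes in every such character set. The sample string and the list SUPERSETS are arbitrary choices made for this example, and the codec names are those of Python's standard library rather than glibc locale names.

```python
SAMPLE = "Hello, world! $ # @ [ ] { }"

# Pure ASCII text fits in 7 bits: no byte has the high bit set.
ascii_bytes = SAMPLE.encode("ascii")
fits_in_7_bits = all(b < 0x80 for b in ascii_bytes)

# In a charset that is a true superset of ASCII, the same text
# encodes to exactly the same bytes.
SUPERSETS = ["iso8859-1", "iso8859-15", "koi8-r", "euc_jp", "big5", "utf-8"]
same_bytes = {name: SAMPLE.encode(name) == ascii_bytes for name in SUPERSETS}

print("7-bit clean:", fits_in_7_bits)
print(same_bytes)
```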
!!ISO 8859

ISO 8859 is a series of 15 8-bit character sets, all of which
have US ASCII in their low (7-bit) half, invisible control
characters in positions 128 to 159, and 96 fixed-width
graphics in positions 160-255.
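
The three regions described above can be inspected with a short sketch. It relies on the ISO 8859 codecs in Python's standard library, and the helper names decode_range and layout are made up for this example.

```python
import unicodedata


def decode_range(start, stop, charset):
    """Decode each byte value in [start, stop) on its own; None if unassigned."""
    decoded = []
    for value in range(start, stop):
        try:
            decoded.append(bytes([value]).decode(charset))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


def layout(charset):
    """Summarize the three regions of an 8-bit character set."""
    low = decode_range(0, 128, charset)
    controls = decode_range(128, 160, charset)
    graphics = decode_range(160, 256, charset)
    return {
        "low_half_is_ascii": low == [chr(i) for i in range(128)],
        "128_159_are_controls": all(
            c is not None and unicodedata.category(c) == "Cc" for c in controls
        ),
        "graphic_positions": len(graphics),
    }


for name in ("iso8859-1", "iso8859-15"):
    print(name, layout(name))
```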

Of these, the most important is ISO 8859-1 (Latin-1). It is
natively supported in the Linux console driver, fairly well
supported in X11R6, and is the base character set of
HTML.

Console support for the other 8859 character sets is
available under Linux through user-mode utilities (such as
consolechars(8)) that modify keyboard bindings and
the EGA graphics table and employ the

Here are brief descriptions of each set:

8859-1 (Latin-1)

Latin-1 covers most Western European languages such as
Albanian, Catalan, Danish, Dutch, English, Faroese, Finnish,
French, German, Galician, Irish, Icelandic, Italian,
Norwegian, Portuguese, Spanish, and Swedish. The lack of the
ligatures Dutch ij, French oe and old-style ,,German``
quotation marks is considered tolerable.

8859-2 (Latin-2)

Latin-2 supports most Latin-written Slavic and Central
European languages: Croatian, Czech, German, Hungarian,
Polish, Romanian, Slovak, and Slovene.

8859-3 (Latin-3)

Latin-3 is popular with authors of Esperanto, Galician, and
Maltese. (Turkish is now written with 8859-9
instead.)

8859-4 (Latin-4)

Latin-4 introduced letters for Estonian, Latvian, and
Lithuanian. It is essentially obsolete; see 8859-13
(Latin-7).

8859-5

Cyrillic letters supporting Bulgarian, Byelorussian,
Macedonian, Russian, Serbian and Ukrainian. Ukrainians read
the letter `ghe' with downstroke as `heh' and would need a
ghe with upstroke to write a correct ghe. See the discussion
of KOI8-R below.

8859-6

Supports Arabic. The 8859-6 glyph table is a fixed font of
separate letter forms, but a proper display engine should
combine these using the proper initial, medial, and final
forms.

8859-7

Supports Modern Greek.

8859-8

Supports modern Hebrew without niqud (punctuation signs).
Niqud and full-fledged Biblical Hebrew are outside the scope
of this character set; under Linux, UTF-8 is the preferred
encoding for these.

8859-9 (Latin-5)

This is a variant of Latin-1 that replaces Icelandic letters
with Turkish ones.

8859-10 (Latin-6)

Latin-6 adds the last Inuit (Greenlandic) and Sami (Lappish)
letters that were missing in Latin-4 to cover the entire
Nordic area. RFC 1345 listed a preliminary and different
`latin6'. Skolt Sami still needs a few more accents than
these.

8859-11

This only exists as a rejected draft standard. The draft
standard was identical to TIS-620, which is used under Linux
for Thai.

8859-12

This set does not exist. While Vietnamese has been suggested
for this space, it does not fit within the 96
(non-combining) characters ISO 8859 offers. UTF-8 is the
preferred character set for Vietnamese use under
Linux.

8859-13 (Latin-7)

Supports the Baltic Rim languages; in particular, it
includes Latvian characters not found in
Latin-4.

8859-14 (Latin-8)

This is the Celtic character set, covering Gaelic and
Welsh.

8859-15 (Latin-9)

This adds the Euro sign and French and Finnish letters that
were missing in Latin-1.

8859-16 (Latin-10)

This set covers many of the languages covered by 8859-2, and
supports Romanian more completely than that set
does.
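
To make two of the differences above concrete, the following sketch decodes the same byte value under different members of the series, using the ISO 8859 codecs from Python's standard library. The byte values 0xFE and 0xA4 were picked for this illustration, and decode_byte and can_encode are helper names invented here.

```python
def decode_byte(value, charsets):
    """Decode one byte value under each named charset."""
    return {name: bytes([value]).decode(name) for name in charsets}


def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# 0xFE: an Icelandic letter in Latin-1, a Turkish one in Latin-5.
print(decode_byte(0xFE, ("iso8859-1", "iso8859-9")))

# 0xA4: the generic currency sign in Latin-1, the Euro sign in Latin-9.
print(decode_byte(0xA4, ("iso8859-1", "iso8859-15")))

# The Euro sign and the French oe ligature exist in Latin-9 only.
for ch in ("\u20ac", "\u0153"):
    print(repr(ch), {cs: can_encode(ch, cs) for cs in ("iso8859-1", "iso8859-15")})
```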
!!KOI8-R

KOI8-R is a non-ISO character set popular in Russia. The
lower half is US ASCII; the upper is a Cyrillic character
set somewhat better designed than ISO 8859-5. KOI8-U is a
common character set, based on KOI8-R, that has better
support for Ukrainian. Neither of these sets is ISO-2022
compatible, unlike the ISO-8859 series.
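
One way to see the difference in Ukrainian support is to test whether the letter ghe with upturn can be encoded at all. This is a sketch using the codecs in Python's standard library; can_encode is a helper written for this example.

```python
GHE_WITH_UPTURN = "\u0491"  # CYRILLIC SMALL LETTER GHE WITH UPTURN


def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


support = {
    name: can_encode(GHE_WITH_UPTURN, name)
    for name in ("iso8859-5", "koi8-r", "koi8-u")
}
print(support)

# The lower half is US ASCII, so ASCII text is unchanged,
# while Russian letters all land in the upper half.
ascii_unchanged = "Linux".encode("koi8-r") == b"Linux"
upper_half_only = all(b >= 0x80 for b in "\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r"))
print(ascii_unchanged, upper_half_only)
```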

Console support for KOI8-R is available under Linux through
user-mode utilities that modify keyboard bindings and the
EGA graphics table, and employ the
!!JIS X 0208

JIS X 0208 is a Japanese national standard character set.
Though there are other Japanese national standard
character sets (such as JIS X 0201, JIS X 0212, and JIS X
0213), this is the most important one. Characters are mapped
into a 94x94 two-byte matrix, in which each byte is in the
range 0x21-0x7e. Note that JIS X 0208 is a character set,
not an encoding. This means that JIS X 0208 itself is not
used for expressing text data. JIS X 0208 is used as a
component to construct encodings such as EUC-JP, Shift_JIS,
and ISO-2022-JP. EUC-JP is the most important encoding for
Linux and includes US ASCII and JIS X 0208. In EUC-JP, JIS X
0208 characters are expressed in two bytes, each of which is
the JIS X 0208 code plus 0x80.
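
The "code plus 0x80" rule can be checked with a small sketch. As a shortcut, it obtains the raw JIS X 0208 byte pairs by stripping the escape sequences from Python's iso2022_jp output; this is an assumption of the example, not how an EUC-JP encoder is normally written. The helper names jisx0208_codes and to_euc_jp are invented here.

```python
ESC_TO_JISX0208 = b"\x1b$B"  # ISO-2022-JP: switch to JIS X 0208
ESC_TO_ASCII = b"\x1b(B"     # ISO-2022-JP: switch back to ASCII


def jisx0208_codes(text):
    """Return the raw JIS X 0208 byte pairs for text made only of such characters."""
    encoded = text.encode("iso2022_jp")
    if not (encoded.startswith(ESC_TO_JISX0208) and encoded.endswith(ESC_TO_ASCII)):
        raise ValueError("text is not made purely of JIS X 0208 characters")
    return encoded[len(ESC_TO_JISX0208):-len(ESC_TO_ASCII)]


def to_euc_jp(raw):
    """Build EUC-JP bytes by adding 0x80 to every JIS X 0208 byte."""
    return bytes(b + 0x80 for b in raw)


text = "\u65e5\u672c\u8a9e"  # the word "Japanese" written in kanji
raw = jisx0208_codes(text)
print([hex(b) for b in raw])
print(to_euc_jp(raw) == text.encode("euc_jp"))
```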
!!KS X 1001

KS X 1001 is a Korean national standard character set. Just
like JIS X 0208, its characters are mapped into a 94x94 two-byte
matrix. KS X 1001 is used like JIS X 0208, as a component to
construct encodings such as EUC-KR, Johab, and ISO-2022-KR.
EUC-KR is the most important encoding for Linux and includes
US ASCII and KS X 1001. KS C 5601 is an older name for KS X
1001.
!!GB 2312

GB 2312 is a mainland Chinese national standard character
set used to express simplified Chinese. Just like JIS X
0208, its characters are mapped into a 94x94 two-byte matrix
used to construct EUC-CN. EUC-CN is the most important
encoding for Linux and includes US ASCII and GB 2312. Note
that EUC-CN is often called GB, GB 2312, or
CN-GB.
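
The 94x94 matrix can be made visible by walking through EUC-CN bytes and recovering the row and cell of each two-byte character. This sketch assumes that Python's codec named gb2312 implements EUC-CN, and that EUC-CN, like EUC-JP, stores each GB 2312 byte with 0x80 added. The function euc_cn_cells is a name chosen for this example.

```python
def euc_cn_cells(text):
    """Return (char, row, cell) for each two-byte character in the EUC-CN form of text."""
    data = text.encode("gb2312")  # assumption: this codec is EUC-CN
    cells, i = [], 0
    while i < len(data):
        if data[i] < 0x80:  # single-byte US ASCII
            i += 1
            continue
        hi, lo = data[i], data[i + 1]
        char = bytes([hi, lo]).decode("gb2312")
        # Subtract the EUC offset (0x80) and the 0x20 base of the matrix.
        cells.append((char, hi - 0xA0, lo - 0xA0))
        i += 2
    return cells


sample = "a\u4e2d\u6587b"  # ASCII letters around two Chinese characters
for char, row, cell in euc_cn_cells(sample):
    print(f"U+{ord(char):04X} row {row} cell {cell}")
```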
!!Big5

Big5 is a popular character set in Taiwan for expressing
traditional Chinese. (Big5 is both a character set and an
encoding.) It is a superset of US ASCII. Non-ASCII
characters are expressed in two bytes. Bytes 0xa1-0xfe are
used as leading bytes for two-byte characters. Big5 and its
extensions are widely used in Taiwan and Hong Kong. It is not
ISO 2022-compliant.
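
Given the leading-byte range stated above, a Big5 byte string can be cut into characters with a simple scan. The following is a minimal sketch for plain Big5 only; extensions may use additional leading bytes, which this example does not handle. The function big5_split is a name invented here.

```python
def big5_split(data):
    """Split Big5 bytes into one bytes object per character."""
    chars, i = [], 0
    while i < len(data):
        # A leading byte in 0xa1-0xfe announces a two-byte character.
        width = 2 if 0xA1 <= data[i] <= 0xFE else 1
        chars.append(data[i:i + width])
        i += width
    return chars


encoded = "Big5 \u4e2d\u6587".encode("big5")
pieces = big5_split(encoded)
print(pieces)
print([p.decode("big5") for p in pieces])
```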
!!TIS 620

TIS 620 is a Thai national standard character set and a
superset of US ASCII. As in the ISO 8859 series, the Thai characters
are mapped into 0xa1-0xfe. TIS 620 is the only commonly used
character set under Linux besides UTF-8 to have combining
characters.
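
A two-character Thai syllable shows what a combining character looks like in practice: a base consonant followed by a vowel sign that is drawn above it. The sketch below uses the tis_620 codec from Python's standard library; the sample syllable is an arbitrary choice.

```python
import unicodedata

# THAI CHARACTER KO KAI followed by THAI CHARACTER SARA I,
# a vowel sign that is rendered above the consonant.
syllable = "\u0e01\u0e34"
encoded = syllable.encode("tis_620")

for ch, byte in zip(syllable, encoded):
    # Category "Mn" (nonspacing mark) identifies a combining character.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> 0x{byte:02x} "
          f"category {unicodedata.category(ch)}")
```

Each character still occupies exactly one byte in TIS 620; the combining happens only at display time.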
!!UNICODE

Unicode (ISO 10646) is a standard which aims to
unambiguously represent every character in every human
language. Unicode's code space (code points 0 through 0x10FFFF)
needs about 20.1 bits per character. Since most computers don't
include 20.1-bit integers, Unicode is usually encoded as 32-bit integers
internally and as either a series of 16-bit integers (UTF-16)
(needing two 16-bit integers only when encoding certain rare
characters) or a series of 8-bit bytes (UTF-8). Information
on Unicode is available at

Linux represents Unicode using the 8-bit Unicode
Transformation Format (UTF-8). UTF-8 is a variable-length
encoding of Unicode. It uses 1 byte to code 7 bits, 2 bytes
for 11 bits, 3 bytes for 16 bits, and 4 bytes for the
remainder.
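
Both figures above, the roughly 20.1 bits and the byte counts per range, can be reproduced with a few lines. This is a sketch using Python's built-in UTF-8 encoder; CODE_SPACE and utf8_len are names chosen for the example, and 0x110000 is the number of Unicode code points (0 through 0x10FFFF).

```python
import math

CODE_SPACE = 0x110000  # code points 0 .. 0x10FFFF
bits_needed = math.log2(CODE_SPACE)
print(f"bits needed: {bits_needed:.3f}")


def utf8_len(code_point):
    """Number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# The largest code point of each length class, and the first of the next.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:06X}: {utf8_len(cp)} byte(s)")
```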

Let 0,1,x stand for a zero, one, or arbitrary bit. A byte
0xxxxxxx stands for the Unicode 00000000 0xxxxxxx which
codes the same symbol as the ASCII 0xxxxxxx. Thus, ASCII
goes unchanged into UTF-8, and people using only ASCII do
not notice any change: not in code, and not in file
size.

A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx
10yyyyyy is assembled into 00000000 00000000 00000xxx
xxyyyyyy. A byte 1110xxxx is the start of a 3-byte code, and
1110xxxx 10yyyyyy 10zzzzzz is assembled into 00000000
00000000 xxxxyyyy yyzzzzzz. Lastly, 11110xxx starts a
4-byte code, and 11110xxx 10yyyyyy 10zzzzzz 10aaaaaa
is assembled into 00000000 000xxxyy yyyyzzzz zzaaaaaa.
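
The assembly rules can be written out directly. The following is a minimal sketch of a one-character decoder that follows the standard UTF-8 bit patterns (lead bytes 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx, and continuation bytes 10xxxxxx); it deliberately omits the checks a real decoder needs, such as rejecting overlong forms and surrogates. The function name decode_utf8_char is invented here.

```python
def decode_utf8_char(data):
    """Assemble one code point from the start of data; return (code_point, length)."""
    first = data[0]
    if first < 0x80:                  # 0xxxxxxx: plain ASCII
        return first, 1
    if first & 0xE0 == 0xC0:          # 110xxxxx: 2-byte code
        length, value = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:        # 1110xxxx: 3-byte code
        length, value = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:        # 11110xxx: 4-byte code
        length, value = 4, first & 0x07
    else:
        raise ValueError("not the first byte of a UTF-8 code")
    if len(data) < length:
        raise ValueError("truncated UTF-8 code")
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:       # every tail byte must be 10xxxxxx
            raise ValueError("expected a 10xxxxxx continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value, length


for ch in ("A", "\u00e9", "\u20ac", "\U0001f600"):
    encoded = ch.encode("utf-8")
    code_point, length = decode_utf8_char(encoded)
    print(f"{encoded.hex(' ')} -> U+{code_point:04X} ({length} byte(s))")
```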

For most people who use ISO-8859 character sets, this means
that the characters outside of ASCII are now coded with two
bytes. This tends to expand ordinary text files by only one
or two percent. For Russian or Greek users, this expands
ordinary text files by close to 100%, since text in those languages
is mostly outside of ASCII. For Japanese users this means
that the 16-bit codes now in common use will take three
bytes. While there are algorithmic conversions from some
character sets (esp. ISO-8859-1) to Unicode, general
conversion requires carrying around conversion tables, which
can be quite large for 16-bit codes.
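
The size effects described above are easy to measure on short samples. In this sketch the sample words and the helper sizes are chosen purely for illustration; real files mix in ASCII spaces, digits and punctuation, so their growth is smaller than for these bare words.

```python
def sizes(text, legacy_charset):
    """Return (bytes in the legacy charset, bytes in UTF-8) for text."""
    return len(text.encode(legacy_charset)), len(text.encode("utf-8"))


samples = [
    ("caf\u00e9", "iso8859-1"),                              # one non-ASCII letter
    ("\u043f\u0440\u0438\u0432\u0435\u0442", "koi8-r"),      # Russian, all Cyrillic
    ("\u65e5\u672c\u8a9e", "euc_jp"),                        # Japanese, 16-bit codes
]
for text, charset in samples:
    legacy, utf8 = sizes(text, charset)
    print(f"{charset:10} {legacy} -> {utf8} bytes")

# ISO-8859-1 converts algorithmically: each byte value is the code point.
latin1_is_identity = all(
    ord(bytes([b]).decode("iso8859-1")) == b for b in range(256)
)
print("Latin-1 byte == code point:", latin1_is_identity)
```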

Note that UTF-8 is self-synchronizing: 10xxxxxx is a tail;
any other byte is the head of a code. Note that the only way
ASCII bytes occur in a UTF-8 stream is as themselves. In
particular, there are no embedded NULs or '/'s that form
part of some larger code.
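
Self-synchronization means a reader dropped at an arbitrary byte offset can find the next character boundary by skipping tail bytes. A minimal sketch, with next_char_start as a name invented for the example:

```python
def next_char_start(data, offset):
    """Return the index of the first head byte at or after offset."""
    while offset < len(data) and data[offset] & 0xC0 == 0x80:
        offset += 1  # 10xxxxxx is a tail byte: keep skipping
    return offset


data = "a\u20ac/b".encode("utf-8")  # b'a', three bytes for the Euro sign, b'/', b'b'
print(data)

# Offset 2 lands in the middle of the Euro sign; the next head is '/'.
print(next_char_start(data, 2))

# Every byte inside a multibyte code has its high bit set, so the only
# '/' in the stream is the real one.
print(data.count(b"/"))
```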

Since ASCII, and, in particular, NUL and '/', are unchanged,
the kernel does not notice that UTF-8 is being used. It does
not care at all what the bytes it is handling stand
for.

Rendering of Unicode data streams is typically handled
through `subfont' tables which map a subset of Unicode to
glyphs. Internally the kernel uses Unicode to describe the
subfont loaded in video RAM. This means that in UTF-8 mode
one can use a character set with 512 different symbols. This
is not enough for Japanese, Chinese and Korean, but it is
enough for most other purposes.

At the current time, the console driver does not handle
combining characters. So Thai, Sioux and any other script
needing combining characters can't be handled on the
console.
!!ISO 2022 AND ISO 4873

The ISO 2022 and 4873 standards describe a font-control
model based on VT100 practice. This model is (partially)
supported by the Linux kernel and by xterm(1). It is
popular in Japan and Korea.

There are 4 graphic character sets, called G0, G1, G2 and
G3. One of them is the current character set for codes
with high bit zero (initially G0), and one of them is the
current character set for codes with high bit one (initially
G1). Each graphic character set has 94 or 96 characters, and
is essentially a 7-bit character set. It uses codes either
040-0177 (041-0176) or 0240-0377 (0241-0376). G0 always has
size 94 and uses codes 041-0176.

Switching between character sets is done using the shift
functions ^N (SO or LS1), ^O (SI or LS0), ESC n (LS2), ESC o
(LS3), ESC N (SS2), ESC O (SS3), ESC ~ (LS1R), ESC } (LS2R),
ESC | (LS3R). The function LS''n'' makes character set
G''n'' the current one for codes with high bit zero. The
function LS''n''R makes character set G''n'' the
current one for codes with high bit one. The function
SS''n'' makes character set G''n'' (''n''=2 or 3)
the current one for the next character only (regardless of
the value of its high order bit).

A 94-character set is designated as the G''n'' character set
by an escape sequence ESC ( xx (for G0), ESC ) xx (for G1),
ESC * xx (for G2), ESC + xx (for G3), where xx is a symbol
or a pair of symbols found in the ISO 2375 International
Register of Coded Character Sets. For example, ESC ( @
selects the ISO 646 character set as G0, ESC ( A selects the
UK standard character set (with pound instead of number
sign), ESC ( B selects ASCII (with dollar instead of
currency sign), ESC ( M selects a character set for African
languages, ESC ( ! A selects the Cuban character set, and
so on.

A 96-character set is designated as the G''n'' character set
by an escape sequence ESC - xx (for G1), ESC . xx (for G2)
or ESC / xx (for G3). For example, ESC - G selects the
Hebrew alphabet as G1.

A multibyte character set is designated as the G''n''
character set by an escape sequence ESC $ xx or ESC $ ( xx
(for G0), ESC $ ) xx (for G1), ESC $ * xx (for G2), ESC $ +
xx (for G3). For example, ESC $ ( C selects the Korean
character set for G0. The Japanese character set selected by
ESC $ B has a more recent version selected by ESC

ISO 4873 stipulates a narrower use of character sets, where
G0 is fixed (always ASCII), so that G1, G2 and G3 can only
be invoked for codes with the high order bit set. In
particular, ^N and ^O are not used anymore, ESC ( xx can be
used only with xx=B, and ESC ) xx, ESC * xx, ESC + xx are
equivalent to ESC - xx, ESC . xx, ESC / xx,
respectively.
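
Designation sequences such as ESC $ B and ESC ( B can be observed in real data by encoding mixed text as ISO-2022-JP. The sketch below uses Python's iso2022_jp codec and a small scanner, escape_sequences, written for this example; the scanner assumes the usual shape of such sequences (ESC, optional intermediate bytes in 0x20-0x2f, then one final byte) and is not a complete ISO 2022 parser.

```python
def escape_sequences(data):
    """Return every escape sequence found in data, in order."""
    found, i = [], 0
    while i < len(data):
        if data[i] == 0x1B:  # ESC
            j = i + 1
            # Skip intermediate bytes such as '$' or '('.
            while j < len(data) and 0x20 <= data[j] <= 0x2F:
                j += 1
            found.append(data[i:j + 1])  # include the final byte
            i = j + 1
        else:
            i += 1
    return found


encoded = "abc \u65e5\u672c\u8a9e xyz".encode("iso2022_jp")
print(encoded)
print(escape_sequences(encoded))
```

The stream stays 7-bit throughout: the Japanese characters are carried as pairs of bytes in the G0 range after ESC $ B, and ESC ( B returns G0 to ASCII.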
!!SEE ALSO

console(4), console_ioctl(4),
console_codes(4), ascii(7),
iso_8859_1(7), unicode(7),
utf-8(7)
----