UTF-8
!!!UTF-8
NAME
DESCRIPTION
PROPERTIES
ENCODING
EXAMPLES
APPLICATION NOTES
SECURITY
STANDARDS
AUTHOR
SEE ALSO
----
!!NAME

UTF-8 - an ASCII compatible multi-byte Unicode encoding
!!DESCRIPTION

The __Unicode 3.0__ character set occupies a 16-bit code
space. The most obvious Unicode encoding (known as
__UCS-2__) consists of a sequence of 16-bit words. Such
strings can contain, as parts of many 16-bit characters,
bytes such as '\0' or '/' that have a special meaning in
filenames and other C library function parameters. In
addition, the majority of UNIX tools expect ASCII files and
cannot read 16-bit words as characters without major
modifications. For these reasons, __UCS-2__ is not a
suitable external encoding of __Unicode__ in filenames,
text files, environment variables, etc. The __ISO 10646
Universal Character Set (UCS)__, a superset of Unicode,
occupies an even larger 31-bit code space, and the obvious
__UCS-4__ encoding for it (a sequence of 32-bit words) has
the same problems.

The __UTF-8__ encoding of __Unicode__ and __UCS__
does not have these problems and is the common way in which
__Unicode__ is used on Unix-style operating
systems.
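To make the problem with 16-bit words concrete, the following is a minimal sketch in C. The helper names ucs2be_encode and has_special_byte are hypothetical and chosen only for this illustration, and big-endian byte order is an assumption. It serializes 16-bit code values as bytes and reports whether a NUL or '/' byte appears among them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: serialize count 16-bit code values as big-endian
 * UCS-2. out must have room for 2 * count bytes. Returns bytes written. */
size_t ucs2be_encode(const uint16_t *chars, size_t count, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        out[n++] = (unsigned char)(chars[i] >> 8);    /* high byte */
        out[n++] = (unsigned char)(chars[i] & 0xff);  /* low byte */
    }
    return n;
}

/* Returns 1 if the byte sequence contains a byte that byte-oriented
 * C library and filesystem interfaces treat specially ('\0' or '/'). */
int has_special_byte(const unsigned char *bytes, size_t len)
{
    return memchr(bytes, '\0', len) != NULL
        || memchr(bytes, '/', len) != NULL;
}
```

For instance, the code value 0x0041 ('A') becomes the bytes 0x00 0x41, and 0x262F becomes 0x26 0x2F, whose second byte is an ASCII '/'.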
!!PROPERTIES

The __UTF-8__ encoding has the following nice
properties:

* __UCS__ characters 0x00000000 to 0x0000007f (the classic __US-ASCII__ characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both __ASCII__ and __UTF-8__.
* All __UCS__ characters greater than 0x7f are encoded as a multi-byte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with, for example, '\0' or '/'.
* The lexicographic sorting order of __UCS-4__ strings is preserved.
* All possible 2^31 UCS codes can be encoded using __UTF-8__.
* The bytes 0xfe and 0xff are never used in the __UTF-8__ encoding.
* The first byte of a multi-byte sequence which represents a single non-ASCII __UCS__ character is always in the range 0xc2 to 0xfd (0xc0 and 0xc1 could only begin non-shortest forms) and indicates how long this multi-byte sequence is. All further bytes in a multi-byte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
* __UTF-8__ encoded __UCS__ characters may be up to six bytes long; however, the __Unicode__ standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in __UTF-8__.
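The self-describing first byte and the easy resynchronization can be sketched in C as follows. The function names utf8_sequence_length and utf8_resync are hypothetical, and the length check is purely structural: it follows the six-byte scheme and does not by itself reject first bytes that could only begin non-shortest forms.

```c
#include <stddef.h>

/* Number of bytes in the sequence introduced by first byte b, following
 * the six-byte scheme; 0 if b cannot start a sequence (continuation
 * bytes 0x80-0xbf, and the unused bytes 0xfe and 0xff). */
int utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (b < 0xc0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xe0) return 2;   /* 110xxxxx */
    if (b < 0xf0) return 3;   /* 1110xxxx */
    if (b < 0xf8) return 4;   /* 11110xxx */
    if (b < 0xfc) return 5;   /* 111110xx */
    if (b < 0xfe) return 6;   /* 1111110x */
    return 0;                 /* 0xfe, 0xff: never used */
}

/* Resynchronize: starting at index pos, skip continuation bytes and
 * return the index of the next byte that can start a character
 * (or len if there is none). */
size_t utf8_resync(const unsigned char *s, size_t len, size_t pos)
{
    while (pos < len && (s[pos] & 0xc0) == 0x80)
        pos++;
    return pos;
}
```

A reader that lands in the middle of a character, for example after a lost byte, only has to skip at most five continuation bytes to find the next character boundary; no earlier state is needed.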
!!ENCODING

The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:

0x00000000 - 0x0000007F:

0''xxxxxxx''

0x00000080 - 0x000007FF:

110''xxxxx'' 10''xxxxxx''

0x00000800 - 0x0000FFFF:

1110''xxxx'' 10''xxxxxx'' 10''xxxxxx''

0x00010000 - 0x001FFFFF:

11110''xxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''

0x00200000 - 0x03FFFFFF:

111110''xx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''

0x04000000 - 0x7FFFFFFF:

1111110''x'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''

The ''xxx'' bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multi-byte sequence which can represent
the code number of the character can be used.

The __UCS__ code values 0xd800-0xdfff (UTF-16 surrogates)
as well as 0xfffe and 0xffff (UCS non-characters) should not
appear in conforming __UTF-8__ streams.
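The table above can be turned into an encoder along the following lines. This is a minimal sketch rather than a reference implementation: the name utf8_encode is hypothetical, the caller is assumed to supply a buffer of at least six bytes, and surrogates and non-characters are not filtered out.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS code number (0 .. 0x7fffffff) into out, which must
 * hold at least 6 bytes. Returns the number of bytes written, or 0 if
 * the code is out of range. Picking the first matching range always
 * yields the shortest possible sequence. */
size_t utf8_encode(uint32_t code, unsigned char *out)
{
    size_t len;
    unsigned char lead;

    if (code < 0x80) { out[0] = (unsigned char)code; return 1; }
    else if (code < 0x800)       { len = 2; lead = 0xc0; }
    else if (code < 0x10000)     { len = 3; lead = 0xe0; }
    else if (code < 0x200000)    { len = 4; lead = 0xf0; }
    else if (code < 0x4000000)   { len = 5; lead = 0xf8; }
    else if (code <= 0x7fffffff) { len = 6; lead = 0xfc; }
    else return 0;

    /* Fill the continuation bytes from the end, 6 bits each. */
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3f));
        code >>= 6;
    }
    out[0] = (unsigned char)(lead | code);  /* remaining high bits */
    return len;
}
```

Filling the sequence from its last byte backwards keeps the loop independent of the sequence length: each step stores the low six bits and shifts them away, and whatever remains fits into the free bits of the first byte.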
!!EXAMPLES

The __Unicode__ character 0xa9 = 1010 1001 (the copyright
sign) is encoded in UTF-8 as

11000010 10101001 = 0xc2 0xa9

and character 0x2260 = 0010 0010 0110 0000 (the "NOT EQUAL
TO" symbol) is encoded as:

11100010 10001001 10100000 = 0xe2 0x89 0xa0
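The two examples can also be read in the opposite direction. The following C sketch decodes a single sequence back into its code number; the name utf8_decode is hypothetical, and the function checks only the byte patterns. It does not reject non-shortest forms, surrogates, or non-characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the single UTF-8 sequence at s (len bytes available) into
 * *code. Returns the number of bytes consumed, or 0 on a structural
 * error (stray continuation byte, unused byte, truncated sequence). */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *code)
{
    size_t n;
    uint32_t c;

    if (len == 0) return 0;
    if (s[0] < 0x80)      { *code = s[0]; return 1; }
    else if (s[0] < 0xc0) return 0;              /* stray continuation byte */
    else if (s[0] < 0xe0) { n = 2; c = s[0] & 0x1f; }
    else if (s[0] < 0xf0) { n = 3; c = s[0] & 0x0f; }
    else if (s[0] < 0xf8) { n = 4; c = s[0] & 0x07; }
    else if (s[0] < 0xfc) { n = 5; c = s[0] & 0x03; }
    else if (s[0] < 0xfe) { n = 6; c = s[0] & 0x01; }
    else return 0;                               /* 0xfe, 0xff */

    if (len < n) return 0;                       /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;     /* not 10xxxxxx */
        c = (c << 6) | (s[i] & 0x3f);            /* append 6 payload bits */
    }
    *code = c;
    return n;
}
```

Applied to the bytes 0xe2 0x89 0xa0, the payload bits 0010, 001001 and 100000 are concatenated to 0010 0010 0110 0000, which is 0x2260 again.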
!!APPLICATION NOTES

Users have to select a __UTF-8__ locale, for example
with

export LANG=en_GB.UTF-8

in order to activate the __UTF-8__ support in
applications.

Application software that has to be aware of the character
encoding in use should always set the locale, for example
with

setlocale(LC_CTYPE, "")

and programmers can then test the expression

strcmp(nl_langinfo(CODESET), "UTF-8") == 0

to determine whether a __UTF-8__ locale has been selected
and whether therefore all plaintext standard input and
output, terminal communication, plaintext file content,
filenames and environment variables are encoded in
__UTF-8__.
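Put together, the locale test might look like the following minimal C sketch. The helper name locale_is_utf8 is hypothetical; the sketch assumes a POSIX system that provides nl_langinfo(3) and that reports the codeset name exactly as "UTF-8".

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Adopt the locale selected in the environment (LANG, LC_ALL,
 * LC_CTYPE) and report whether its character encoding is UTF-8.
 * Returns 1 for a UTF-8 locale and 0 otherwise, including the case
 * where setlocale fails. */
int locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would typically call this once at startup and then choose between its byte-oriented and its multi-byte code paths.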

Programmers accustomed to single-byte encodings such as
__US-ASCII__ or __ISO 8859__ have to be aware that two
assumptions made so far are no longer valid in __UTF-8__
locales. Firstly, a single byte no longer necessarily
corresponds to a single character. Secondly, since
modern terminal emulators in __UTF-8__ mode also support
Chinese, Japanese, and Korean __double-width characters__
as well as non-spacing __combining characters__,
outputting a single character does not necessarily advance
the cursor by one position as it did in __ASCII__.
Library functions such as mbsrtowcs(3) and
wcswidth(3) should be used today to count characters
and cursor positions.
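Counting characters rather than bytes with mbsrtowcs(3) could be sketched as follows. The helper name count_characters is hypothetical, and the result depends on the LC_CTYPE locale that the program has selected beforehand; computing cursor positions with wcswidth(3) would additionally require converting into a wide-character buffer and is not shown here.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters (not bytes) in the multi-byte string s
 * according to the current LC_CTYPE locale. Returns -1 if s contains
 * a byte sequence that is invalid in that locale. */
long count_characters(const char *s)
{
    mbstate_t state;
    const char *p = s;
    size_t n;

    memset(&state, 0, sizeof state);      /* initial conversion state */
    n = mbsrtowcs(NULL, &p, 0, &state);   /* NULL destination: count only */
    if (n == (size_t)-1)
        return -1;
    return (long)n;
}
```

In a UTF-8 locale the result for a string with non-ASCII characters is smaller than what strlen(3) reports, which is exactly the first of the two broken assumptions described above.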

The official ESC sequence to switch from an __ISO 2022__
encoding scheme (as used for instance by VT100 terminals) to
__UTF-8__ is ESC % G ("\x1b%G"). The corresponding return
sequence from __UTF-8__ to ISO 2022 is ESC % @ ("\x1b%@").

It can be hoped that in the foreseeable future, __UTF-8__
will replace __ASCII__ and __ISO 8859__ at all levels
as the common character encoding on POSIX systems, leading
to a significantly richer environment for handling plain
text.
!!SECURITY

The __Unicode__ and __UCS__ standards require that
producers of __UTF-8__ shall use the shortest form
possible, e.g., producing a two-byte sequence with first
byte 0xc0 is non-conforming. __Unicode 3.1__ has added
the requirement that conforming programs must not accept
non-shortest forms in their input. This is for security
reasons: if user input is checked for possible security
violations, a program might check only for the __ASCII__
version of "/../" or ";" or NUL and overlook that there are
many non-__ASCII__ ways to represent these things in a
non-shortest __UTF-8__ encoding.
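A check for non-shortest forms needs to look only at the first two bytes of a sequence. The following C sketch illustrates one way to do this; the name utf8_starts_overlong is hypothetical, the thresholds are derived from the ranges in the ENCODING section, and the second byte is assumed to have been verified already as a continuation byte (0x80 to 0xbf).

```c
/* Returns 1 if a multi-byte sequence beginning with the bytes lead and
 * second is longer than necessary for the code number it carries,
 * and 0 otherwise. */
int utf8_starts_overlong(unsigned char lead, unsigned char second)
{
    if (lead == 0xc0 || lead == 0xc1)
        return 1;   /* 2 bytes used for a code below 0x80 */
    if (lead == 0xe0 && second < 0xa0)
        return 1;   /* 3 bytes used for a code below 0x800 */
    if (lead == 0xf0 && second < 0x90)
        return 1;   /* 4 bytes used for a code below 0x10000 */
    if (lead == 0xf8 && second < 0x88)
        return 1;   /* 5 bytes used for a code below 0x200000 */
    if (lead == 0xfc && second < 0x84)
        return 1;   /* 6 bytes used for a code below 0x4000000 */
    return 0;
}
```

For example, the bytes 0xc0 0xaf carry the payload bits of '/' (0x2f) in a two-byte form. A filter that searches only for the byte 0x2f would miss it, whereas a decoder that applies this check rejects the sequence before it can be interpreted.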
!!STANDARDS

ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan
9.
!!AUTHOR

Markus Kuhn
!!SEE ALSO

nl_langinfo(3), setlocale(3),
charsets(7), unicode(7)
----