Annotated edit history of utf-8(7) version 4, including all changes.
Rev Author # Line
1 perry 1 UTF-8
2 !!!UTF-8
3 NAME
4 DESCRIPTION
5 PROPERTIES
6 ENCODING
7 EXAMPLES
8 APPLICATION NOTES
9 SECURITY
10 STANDARDS
11 AUTHOR
12 SEE ALSO
13 ----
14 !!NAME
15
16
17 UTF-8 - an ASCII compatible multi-byte Unicode encoding
18 !!DESCRIPTION
19
20
21 The __Unicode 3.0__ character set occupies a 16-bit code
22 space. The most obvious Unicode encoding (known as
23 __UCS-2__) consists of a sequence of 16-bit words. Such
24 strings can contain, as parts of many 16-bit characters, bytes
25 such as '\0' or '/', which have a special meaning in filenames
26 and other C library function parameters. In addition, the
27 majority of UNIX tools expect ASCII files and can't read
28 16-bit words as characters without major modifications. For
29 these reasons, __UCS-2__ is not a suitable external
30 encoding of __Unicode__ in filenames, text files,
31 environment variables, etc. The __ISO 10646 Universal
32 Character Set (UCS)__, a superset of Unicode, occupies
33 an even larger 31-bit code space, and the obvious __UCS-4__
34 encoding for it (a sequence of 32-bit words) has the same
35 problems.
36
37
38 The __UTF-8__ encoding of __Unicode__ and __UCS__
39 does not have these problems and is the common way in which
40 __Unicode__ is used on Unix-style operating
41 systems.
42 !!PROPERTIES
43
44
45 The __UTF-8__ encoding has the following nice
46 properties:
47
48
49 *
50
51
52 __UCS__ characters 0x00000000 to 0x0000007f (the classic
53 __US-ASCII__ characters) are encoded simply as bytes 0x00
54 to 0x7f (ASCII compatibility). This means that files and
55 strings which contain only 7-bit ASCII characters have the
56 same encoding under both __ASCII__ and
57 __UTF-8__.
58
59
60 *
61
62
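63 All __UCS__ characters greater than 0x7f are encoded as a multi-byte sequence
64 consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with, e.g., '\0' or '/'.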
65
66
67 *
68
69
70 The lexicographic sorting order of __UCS-4__ strings is
71 preserved.
72
73
74 *
75
76
77 All possible 2^31 UCS codes can be encoded using
78 __UTF-8__.
79
80
81 *
82
83
84 The bytes 0xfe and 0xff are never used in the __UTF-8__
85 encoding.
86
87
88 *
89
90
91 The first byte of a multi-byte sequence which represents a
92 single non-ASCII __UCS__ character is always in the range
93 0xc0 to 0xfd and indicates how long this multi-byte sequence
94 is. All further bytes in a multi-byte sequence are in the
95 range 0x80 to 0xbf. This allows easy resynchronization and
96 makes the encoding stateless and robust against missing
97 bytes.
98
99
100 *
101
102
103 __UTF-8__ encoded __UCS__ characters may be up to six
104 bytes long; however, the __Unicode__ standard specifies no
105 characters above 0x10ffff, so Unicode characters can only be
106 up to four bytes long in __UTF-8__.
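
The resynchronization property listed above can be illustrated with a minimal C sketch. The function name utf8_resync is hypothetical and not part of any library; it only relies on the fact that continuation bytes lie in the range 0x80 to 0xbf, so a reader that lands in the middle of a sequence can skip forward to the next character start.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a library function): return the index of the
 * first byte at or after pos that is NOT a continuation byte, i.e. whose
 * top two bits are not "10" (range 0x80..0xbf). Returns len if the rest
 * of the buffer consists only of continuation bytes.
 */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xc0) == 0x80)
        pos++;
    return pos;
}
```

Because no state has to be carried between characters, a decoder that loses a byte can call such a helper and continue with the next complete character.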
107 !!ENCODING
108
109
110 The following byte sequences are used to represent a
111 character. The sequence to be used depends on the UCS code
112 number of the character:
113
114
115 0x00000000 - 0x0000007F:
116
117
118 0''xxxxxxx''
119
120
121 0x00000080 - 0x000007FF:
122
123
124 110''xxxxx'' 10''xxxxxx''
125
126
127 0x00000800 - 0x0000FFFF:
128
129
130 1110''xxxx'' 10''xxxxxx'' 10''xxxxxx''
131
132
133 0x00010000 - 0x001FFFFF:
134
135
136 11110''xxx'' 10''xxxxxx'' 10''xxxxxx''
137 10''xxxxxx''
138
139
140 0x00200000 - 0x03FFFFFF:
141
142
143 111110''xx'' 10''xxxxxx'' 10''xxxxxx''
144 10''xxxxxx'' 10''xxxxxx''
145
146
147 0x04000000 - 0x7FFFFFFF:
148
149
150 1111110''x'' 10''xxxxxx'' 10''xxxxxx''
151 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''
152
153
154 The ''xxx'' bit positions are filled with the bits of the
155 character code number in binary representation. Only the
156 shortest possible multi-byte sequence which can represent
157 the code number of the character can be used.
158
159
160 The __UCS__ code values 0xd800-0xdfff (UTF-16 surrogates)
161 as well as 0xfffe and 0xffff (UCS non-characters) should not
162 appear in conforming __UTF-8__ streams.
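
As an illustration of the table above, the following C sketch encodes one UCS code number into its shortest byte sequence. The name utf8_encode and its return convention are assumptions made for this example; it follows the six-row table as given here and refuses the code values that should not appear in conforming streams.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical encoder sketch: writes the UTF-8 form of code number c
 * into out and returns the number of bytes written (1..6).
 * Returns 0 for values above 0x7fffffff, for the UTF-16 surrogates
 * 0xd800..0xdfff, and for the non-characters 0xfffe and 0xffff.
 */
size_t utf8_encode(uint32_t c, unsigned char out[6])
{
    size_t n, i;
    unsigned char lead;

    if (c > 0x7fffffff)
        return 0;
    if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff)
        return 0;

    if (c < 0x80) {                 /* 0xxxxxxx: plain ASCII */
        out[0] = (unsigned char)c;
        return 1;
    }

    /* Pick the shortest sequence length and its first-byte prefix. */
    if (c < 0x800)          { n = 2; lead = 0xc0; }
    else if (c < 0x10000)   { n = 3; lead = 0xe0; }
    else if (c < 0x200000)  { n = 4; lead = 0xf0; }
    else if (c < 0x4000000) { n = 5; lead = 0xf8; }
    else                    { n = 6; lead = 0xfc; }

    /* Fill continuation bytes (10xxxxxx) from the last one backwards. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3f));
        c >>= 6;
    }
    out[0] = (unsigned char)(lead | c);   /* remaining high bits */
    return n;
}
```

Choosing the length from the code number's range is what guarantees that only the shortest possible sequence is produced.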
163 !!EXAMPLES
164
165
166 The __Unicode__ character 0xa9 = 1010 1001 (the copyright
167 sign) is encoded in UTF-8 as
168
169
170 11000010 10101001 = 0xc2 0xa9
171
172
173 and character 0x2260 = 0010 0010 0110 0000 (the "not equal to" symbol) is encoded as
174
175
176 11100010 10001001 10100000 = 0xe2 0x89 0xa0
177 !!APPLICATION NOTES
178
179
180 Users have to select a __UTF-8__ locale, for example
181 with
182
183
184 export LANG=en_GB.UTF-8
185
186
187 in order to activate the __UTF-8__ support in
188 applications.
189
190
191 Application software that has to be aware of the character
192 encoding in use should always set the locale with, for
193 example,
194
195
196 setlocale(LC_CTYPE, "")
197
198
199 and programmers can then test the expression
200
201
202 strcmp(nl_langinfo(CODESET), "UTF-8") == 0
203
204
205 to determine whether a __UTF-8__ locale has been selected
206 and therefore whether all plaintext standard input and
207 output, terminal communication, plaintext file content,
208 filenames and environment variables are encoded in
209 __UTF-8__.
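
The locale test described above might be wrapped as in the following sketch. The function name locale_is_utf8 is hypothetical; setlocale(3), nl_langinfo(3) and CODESET are the standard interfaces. Passing an empty string to setlocale(3) selects the locale from the user's environment.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the current LC_CTYPE locale uses
 * the UTF-8 encoding, 0 otherwise. The caller is assumed to have run
 * setlocale(LC_CTYPE, "") beforehand so that the user's environment
 * (LANG, LC_CTYPE, LC_ALL) is taken into account.
 */
int locale_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would typically call setlocale(LC_CTYPE, "") once at startup and then consult locale_is_utf8() before deciding how to interpret its input and output.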
210
211
212 Programmers accustomed to single-byte encodings such as
213 __US-ASCII__ or __ISO 8859__ have to be aware that two
214 assumptions made so far are no longer valid in __UTF-8__
215 locales. Firstly, a single byte does not necessarily
216 correspond any more to a single character. Secondly, since
217 modern terminal emulators in __UTF-8__ mode also support
218 Chinese, Japanese, and Korean __double-width characters__
219 as well as non-spacing __combining characters__,
220 outputting a single character does not necessarily advance
221 the cursor by one position as it did in __ASCII__.
222 Library functions such as mbsrtowcs(3) and
223 wcswidth(3) should be used today to count characters
224 and cursor positions.
225
226
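227 The official ESC sequence to switch from an __ISO 2022__
228 encoding scheme (as used for instance by VT100 terminals) to
229 __UTF-8__ is ESC % G ("\x1b%G"). The corresponding return
230 sequence from __UTF-8__ to ISO 2022
231 is ESC % @ ("\x1b%@"). Other ISO 2022 sequences (such as those for
232 switching the G0 and G1 sets) are not applicable in __UTF-8__ mode.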
233
234
235 It can be hoped that in the foreseeable future, __UTF-8__
236 will replace __ASCII__ and __ISO 8859__ at all levels
237 as the common character encoding on POSIX systems, leading
238 to a significantly richer environment for handling plain
239 text.
240 !!SECURITY
241
242
243 The __Unicode__ and __UCS__ standards require that
244 producers of __UTF-8__ shall use the shortest form
245 possible, e.g., producing a two-byte sequence with first
246 byte 0xc0 is non-conforming. __Unicode 3.1__ has added
247 the requirement that conforming programs must not accept
248 non-shortest forms in their input. This is for security
249 reasons: if user input is checked for possible security
250 violations, a program might check only for the __ASCII__
251 version of "/../" or ";" or NUL and overlook that there are many
252 non-__ASCII__ ways to
253 represent these things in a non-shortest __UTF-8__
254 encoding.
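
To make the shortest-form requirement concrete, here is a sketch of a decoder that rejects non-shortest forms. The name utf8_decode, its return convention, and the min_code table are assumptions for this example and are not taken from any standard library. For instance, the two bytes 0xc0 0xaf would decode to 0x2f ('/') if the check were omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder sketch: decodes one character from s[0..len-1]
 * into *out and returns the number of bytes consumed (1..6).
 * Returns 0 for truncated or malformed input and for non-shortest forms.
 */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest code number that legitimately needs n bytes. */
    static const uint32_t min_code[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t n, i;
    uint32_t c;
    unsigned char b;

    if (len == 0)
        return 0;
    b = s[0];

    if (b < 0x80) {                         /* single ASCII byte */
        *out = b;
        return 1;
    }
    else if ((b & 0xe0) == 0xc0) { n = 2; c = b & 0x1f; }
    else if ((b & 0xf0) == 0xe0) { n = 3; c = b & 0x0f; }
    else if ((b & 0xf8) == 0xf0) { n = 4; c = b & 0x07; }
    else if ((b & 0xfc) == 0xf8) { n = 5; c = b & 0x03; }
    else if ((b & 0xfe) == 0xfc) { n = 6; c = b & 0x01; }
    else
        return 0;       /* stray continuation byte, or 0xfe / 0xff */

    if (len < n)
        return 0;       /* sequence is cut off */

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3f);
    }

    if (c < min_code[n])
        return 0;       /* non-shortest form: reject */

    *out = c;
    return n;
}
```

Checking the decoded value against the minimum for its length is what prevents a filter that looks only for the ASCII byte from being bypassed by a longer encoding of the same character.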
255 !!STANDARDS
256
257
258 ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan
259 9.
260 !!AUTHOR
261
262
263 Markus Kuhn
264 !!SEE ALSO
265
266
4 perry 267 nl_langinfo(3), setlocale(3),
1 perry 268 charsets(7), unicode(7)
269 ----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.