Blame: utf-8(7) - Waikato Linux Users Group

Annotated edit history of utf-8(7) version 4, including all changes.

Rev	Author	#	Line
1	perry	1	`UTF-8`
		2	`!!!UTF-8`
		3	`NAME`
		4	`DESCRIPTION`
		5	`PROPERTIES`
		6	`ENCODING`
		7	`EXAMPLES`
		8	`APPLICATION NOTES`
		9	`SECURITY`
		10	`STANDARDS`
		11	`AUTHOR`
		12	`SEE ALSO`
		13	`----`
		14	`!!NAME`
		15
		16
		17	`UTF-8 - an ASCII compatible multi-byte Unicode encoding`
		18	`!!DESCRIPTION`
		19
		20
		21	`The __Unicode 3.0__ character set occupies a 16-bit code`
		22	`space. The most obvious Unicode encoding (known as`
		23	`__UCS-2__) consists of a sequence of 16-bit words. Such`
		24	`strings can contain, as parts of many 16-bit characters, bytes`
		25	`like '\0' or '/' which have a special meaning in filenames`
		26	`and other C library function parameters. In addition, the`
		27	`majority of UNIX tools expect ASCII files and can't read`
		28	`16-bit words as characters without major modifications. For`
		29	`these reasons, __UCS-2__ is not a suitable external`
		30	`encoding of __Unicode__ in filenames, text files,`
		31	`environment variables, etc. The __ISO 10646 Universal`
		32	`Character Set (UCS)__, a superset of Unicode, occupies an`
		33	`even larger 31-bit code space, and the obvious __UCS-4__`
		34	`encoding for it (a sequence of 32-bit words) has the same`
		35	`problems.`
		36
		37
		38	`The __UTF-8__ encoding of __Unicode__ and __UCS__`
		39	`does not have these problems and is the common way in which`
		40	`__Unicode__ is used on Unix-style operating`
		41	`systems.`
		42	`!!PROPERTIES`
		43
		44
		45	`The __UTF-8__ encoding has the following nice`
		46	`properties:`
		47
		48
		49	`*`
		50
		51
		52	`__UCS__ characters 0x00000000 to 0x0000007f (the classic`
		53	`__US-ASCII__ characters) are encoded simply as bytes 0x00`
		54	`to 0x7f (ASCII compatibility). This means that files and`
		55	`strings which contain only 7-bit ASCII characters have the`
		56	`same encoding under both __ASCII__ and`
		57	`__UTF-8__.`
		58
		59
		60	`*`
		61
		62
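		63	`All __UCS__ characters >0x7f are encoded as a multi-byte sequence`
		64	`consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with e.g. '\0' or '/'.`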
		65
		66
		67	`*`
		68
		69
		70	`The lexicographic sorting order of __UCS-4__ strings is`
		71	`preserved.`
		72
		73
		74	`*`
		75
		76
		77	`All possible 2^31 UCS codes can be encoded using`
		78	`__UTF-8__.`
		79
		80
		81	`*`
		82
		83
		84	`The bytes 0xfe and 0xff are never used in the __UTF-8__`
		85	`encoding.`
		86
		87
		88	`*`
		89
		90
		91	`The first byte of a multi-byte sequence which represents a`
		92	`single non-ASCII __UCS__ character is always in the range`
		93	`0xc0 to 0xfd and indicates how long this multi-byte sequence`
		94	`is. All further bytes in a multi-byte sequence are in the`
		95	`range 0x80 to 0xbf. This allows easy resynchronization and`
		96	`makes the encoding stateless and robust against missing`
		97	`bytes.`
		98
		99
		100	`*`
		101
		102
		103	`__UTF-8__ encoded __UCS__ characters may be up to six`
		104	`bytes long; however, the __Unicode__ standard specifies no`
		105	`characters above 0x10ffff, so Unicode characters can only be`
		106	`up to four bytes long in __UTF-8__.`
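
The resynchronization property listed above can be sketched in C. This is a minimal illustration that assumes only the byte ranges given in the list; the names `utf8_is_continuation` and `utf8_resync` are hypothetical and not part of any library.

```c
#include <stddef.h>

/* Hypothetical helper: continuation bytes have the form 10xxxxxx,
 * i.e. they lie in the range 0x80 to 0xbf. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: starting at offset pos in buf[0..len), skip any
 * continuation bytes and return the offset of the next byte that can
 * start a character (an ASCII byte or a lead byte), or len if there is
 * none. No decoder state from earlier bytes is needed. */
static size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && utf8_is_continuation(buf[pos]))
        pos++;
    return pos;
}
```

A reader that lands in the middle of a stream, or that has lost some bytes, could call `utf8_resync` once and continue decoding from the returned offset.
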
		107	`!!ENCODING`
		108
		109
		110	`The following byte sequences are used to represent a`
		111	`character. The sequence to be used depends on the UCS code`
		112	`number of the character:`
		113
		114
		115	`0x00000000 - 0x0000007F:`
		116
		117
		118	`0''xxxxxxx''`
		119
		120
		121	`0x00000080 - 0x000007FF:`
		122
		123
		124	`110''xxxxx'' 10''xxxxxx''`
		125
		126
		127	`0x00000800 - 0x0000FFFF:`
		128
		129
		130	`1110''xxxx'' 10''xxxxxx'' 10''xxxxxx''`
		131
		132
		133	`0x00010000 - 0x001FFFFF:`
		134
		135
		136	`11110''xxx'' 10''xxxxxx'' 10''xxxxxx''`
		137	`10''xxxxxx''`
		138
		139
		140	`0x00200000 - 0x03FFFFFF:`
		141
		142
		143	`111110''xx'' 10''xxxxxx'' 10''xxxxxx''`
		144	`10''xxxxxx'' 10''xxxxxx''`
		145
		146
		147	`0x04000000 - 0x7FFFFFFF:`
		148
		149
		150	`1111110''x'' 10''xxxxxx'' 10''xxxxxx''`
		151	`10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''`
		152
		153
		154	`The ''xxx'' bit positions are filled with the bits of the`
		155	`character code number in binary representation. Only the`
		156	`shortest possible multi-byte sequence which can represent`
		157	`the code number of the character can be used.`
		158
		159
		160	`The __UCS__ code values 0xd800-0xdfff (UTF-16 surrogates)`
		161	`as well as 0xfffe and 0xffff (UCS non-characters) should not`
		162	`appear in conforming __UTF-8__ streams.`
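
The byte layouts above can be turned into a small encoder. The following is a sketch under stated assumptions, not a reference implementation: the name `utf8_encode` is hypothetical, the caller is assumed to provide a buffer of at least six bytes, and the function chooses to refuse the code values that the text says should not appear in conforming streams.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode UCS code c into out (room for 6 bytes).
 * Returns the number of bytes written, or 0 if c is outside the 31-bit
 * code space or is a surrogate, 0xfffe or 0xffff. */
static size_t utf8_encode(uint32_t c, unsigned char *out)
{
    size_t n, i;

    if (c > 0x7FFFFFFF)
        return 0;
    if ((c >= 0xD800 && c <= 0xDFFF) || c == 0xFFFE || c == 0xFFFF)
        return 0;

    /* Pick the shortest sequence length that can hold c. */
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    } else if (c < 0x800) {
        n = 2;
    } else if (c < 0x10000) {
        n = 3;
    } else if (c < 0x200000) {
        n = 4;
    } else if (c < 0x4000000) {
        n = 5;
    } else {
        n = 6;
    }

    /* Fill continuation bytes from the end: 10xxxxxx, six bits each. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    /* Lead byte: n one-bits, a zero bit, then the remaining high bits. */
    out[0] = (unsigned char)((0xFF << (8 - n)) | c);
    return n;
}
```

Because the length is chosen from the code value alone, the output is always the shortest possible sequence.
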
		163	`!!EXAMPLES`
		164
		165
		166	`The __Unicode__ character 0xa9 = 1010 1001 (the copyright`
		167	`sign) is encoded in UTF-8 as`
		168
		169
		170	`11000010 10101001 = 0xc2 0xa9`
		171
		172
		173	`and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as:`
		174
		175
		176	`11100010 10001001 10100000 = 0xe2 0x89 0xa0`
		177	`!!APPLICATION NOTES`
		178
		179
		180	`Users have to select a __UTF-8__ locale, for example`
		181	`with`
		182
		183
		184	`export LANG=en_GB.UTF-8`
		185
		186
		187	`in order to activate the __UTF-8__ support in`
		188	`applications.`
		189
		190
		191	`Application software that needs to know the character`
		192	`encoding in use should always set the locale with, for`
		193	`example,`
		194
		195
		196	`setlocale(LC_CTYPE, "")`
		197
		198
		199	`and programmers can then test the expression`
		200
		201
		202	`strcmp(nl_langinfo(CODESET), "UTF-8") == 0`
		203
		204
		205	`to determine whether a __UTF-8__ locale has been selected`
		206	`and whether therefore all plaintext standard input and`
		207	`output, terminal communication, plaintext file content,`
		208	`filenames and environment variables are encoded in`
		209	`__UTF-8__.`
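
The locale test described above might be wrapped as follows. This is a sketch that assumes the codeset name reported for a UTF-8 locale is exactly "UTF-8"; the function name `locale_is_utf8` is hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the LC_CTYPE locale currently in
 * effect reports the codeset "UTF-8", otherwise 0. The caller is
 * expected to have called setlocale(LC_CTYPE, "") at startup so that
 * the user's environment (LANG, LC_ALL, LC_CTYPE) is honoured. */
static int locale_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Without the earlier call to setlocale(3), a program runs in the default "C" locale, so the check would normally report 0 whatever the user has selected.
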
		210
		211
		212	`Programmers accustomed to single-byte encodings such as`
		213	`__US-ASCII__ or __ISO 8859__ have to be aware that two`
		214	`assumptions made so far are no longer valid in __UTF-8__`
		215	`locales. Firstly, a single byte does not necessarily`
		216	`correspond any more to a single character. Secondly, since`
		217	`modern terminal emulators in __UTF-8__ mode also support`
		218	`Chinese, Japanese, and Korean __double-width characters__`
		219	`as well as non-spacing __combining characters__,`
		220	`outputting a single character does not necessarily advance`
		221	`the cursor by one position as it did in __ASCII__.`
		222	`Library functions such as mbsrtowcs(3) and`
		223	`wcswidth(3) should be used today to count characters`
		224	`and cursor positions.`
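
As an illustration of counting cursor positions rather than bytes, the sketch below combines mbsrtowcs(3) and wcswidth(3). The name `display_width` is hypothetical, and the `_XOPEN_SOURCE` definition is an assumption about what is needed to make wcswidth(3) visible on common C libraries.

```c
#define _XOPEN_SOURCE 700   /* assumed necessary to declare wcswidth() */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of terminal columns needed to display the
 * multi-byte string s in the current locale, or -1 on error. */
static int display_width(const char *s)
{
    const char *p = s;
    mbstate_t st;
    wchar_t *w;
    size_t n;
    int width;

    /* First pass: count characters (not bytes) without storing them. */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return -1;                  /* invalid multi-byte sequence */

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return -1;

    /* Second pass: convert into the wide-character buffer. */
    p = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &p, n + 1, &st);

    /* wcswidth() returns -1 if the string has a non-printable character. */
    width = wcswidth(w, n);
    free(w);
    return width;
}
```

In a UTF-8 locale the result can differ from both the byte count and the character count, since double-width characters occupy two columns and combining characters occupy none.
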
		225
		226
		227	`The official ESC sequence to switch from an __ISO 2022__`
		228	`encoding scheme (as used for instance by VT100 terminals) to`
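		229	`__UTF-8__ is ESC % G ("\x1b%G"). The corresponding return sequence from`
		230	`__UTF-8__ to ISO 2022`
		231	`is ESC % @ ("\x1b%@"). Other ISO 2022 sequences (such as for switching the`
		232	`G0 and G1 sets) are not applicable in __UTF-8__ mode.`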
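
A program that wants to emit the ESC % G and ESC % @ sequences could do so roughly as follows. The macro and function names are hypothetical, and whether a given terminal honours the sequences is not something this sketch can check.

```c
#include <stdio.h>

/* Hypothetical names for the two switching sequences. */
#define ESC_TO_UTF8    "\x1b%G"   /* ESC % G: ISO 2022 to UTF-8 */
#define ESC_TO_ISO2022 "\x1b%@"   /* ESC % @: UTF-8 back to ISO 2022 */

/* Hypothetical helper: write the switching sequence to the stream out.
 * enable != 0 selects UTF-8, enable == 0 returns to ISO 2022.
 * Returns 0 on success and -1 if writing or flushing fails. */
static int terminal_set_utf8(FILE *out, int enable)
{
    if (fputs(enable ? ESC_TO_UTF8 : ESC_TO_ISO2022, out) == EOF)
        return -1;
    return fflush(out) == EOF ? -1 : 0;
}
```
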
		233
		234
		235	`It can be hoped that in the foreseeable future, __UTF-8__`
		236	`will replace __ASCII__ and __ISO 8859__ at all levels`
		237	`as the common character encoding on POSIX systems, leading`
		238	`to a significantly richer environment for handling plain`
		239	`text.`
		240	`!!SECURITY`
		241
		242
		243	`The __Unicode__ and __UCS__ standards require that`
		244	`producers of __UTF-8__ shall use the shortest form`
		245	`possible, e.g., producing a two-byte sequence with first`
		246	`byte 0xc0 is non-conforming. __Unicode 3.1__ has added`
		247	`the requirement that conforming programs must not accept`
		248	`non-shortest forms in their input. This is for security`
		249	`reasons: if user input is checked for possible security`
		250	`violations, a program might check only for the __ASCII__`
		251	`version of "/../" or ";" or NUL and overlook that there are many`
		252	`non-__ASCII__ ways to`
		253	`represent these things in a non-shortest __UTF-8__`
		254	`encoding.`
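
To make the shortest-form requirement concrete, here is a sketch of a decoder that refuses non-shortest sequences. The name `utf8_decode_strict` is hypothetical; the sketch checks only structure and shortest form, and does not filter surrogates or other code values that should not appear in a stream.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one character from s[0..len).
 * Stores the code in *out and returns the number of bytes consumed, or
 * 0 if the sequence is truncated, malformed, or not in shortest form. */
static size_t utf8_decode_strict(const unsigned char *s, size_t len,
                                 uint32_t *out)
{
    /* Smallest code that legitimately needs n bytes, indexed by n. */
    static const uint32_t min_code[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    } else if (s[0] < 0xC0) {
        return 0;                       /* stray continuation byte */
    } else if (s[0] < 0xE0) {
        n = 2; c = s[0] & 0x1F;
    } else if (s[0] < 0xF0) {
        n = 3; c = s[0] & 0x0F;
    } else if (s[0] < 0xF8) {
        n = 4; c = s[0] & 0x07;
    } else if (s[0] < 0xFC) {
        n = 5; c = s[0] & 0x03;
    } else if (s[0] < 0xFE) {
        n = 6; c = s[0] & 0x01;
    } else {
        return 0;                       /* 0xfe and 0xff are never used */
    }

    if (len < n)
        return 0;                       /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_code[n])
        return 0;                       /* non-shortest form */
    *out = c;
    return n;
}
```

With this check, the two-byte sequence 0xc0 0xaf, a non-shortest encoding of '/', is rejected instead of being decoded to 0x2f and slipping past a test that looked only for the ASCII byte.
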
		255	`!!STANDARDS`
		256
		257
		258	`ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan`
		259	`9.`
		260	`!!AUTHOR`
		261
		262
		263	`Markus Kuhn`
		264	`!!SEE ALSO`
		265
		266
4	perry	267	`nl_langinfo(3), setlocale(3),`
1	perry	268	`charsets(7), unicode(7)`
		269	`----`

This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.

Last edited on Tuesday, June 4, 2002 12:31:00 am by "perry"
