Annotated edit history of UTF-8 version 2, including all changes.
Rev Author # Line
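2 AristotlePagaltzis 1 [UTF-8] is a [Unicode] Transformation Format which encodes every [Unicode] code point as a sequence of bytes. A character requires anywhere from 1 to 4 bytes; the original design allowed sequences of up to 6 bytes, but the current standard (RFC 3629) limits them to 4, which covers the full [Unicode] range up to U+10FFFF. It is the most popular member of the [UTF] family, with good reason.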
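
The byte layout can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `utf8_encode` is made up for this example, and it follows the current standard's limit of four bytes per code point. Each result is checked against Python's built-in encoder.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 (hypothetical helper, for illustration)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# One sample character for each sequence length.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
```

The lead byte announces the length of the sequence, and every following byte starts with the bits `10`, which is what lets a decoder resynchronize in the middle of a stream.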
1 AristotlePagaltzis 2
2 AristotlePagaltzis 3 Because of the distribution of code points in [Unicode], roughly 90-95% of typical Western-language text requires only one byte per character, and almost all remaining non-punctuation characters need no more than two bytes. It is also directly backwards compatible with [ASCII], which contains only 128 characters: any byte in [UTF-8] text with its high bit clear is a complete character, identical in meaning to the corresponding [ASCII] character. It can therefore provide an easy transition to [Unicode] for applications and systems that are used to dealing with 7-bit [ASCII] (for example, [Unix] TwoLetterCommands for stream processing were traditionally very byte-oriented). Unfortunately, it penalizes East Asian scripts, whose characters mostly take three bytes each instead of the two they need in [UTF]-16, so [UTF]-16 is often preferred for text in those scripts.
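
The size trade-off is easy to observe with Python's built-in codecs. The sample strings and the helper name `encoded_sizes` below are chosen only for illustration; real text will have different proportions.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only: ASCII, accented Latin, and CJK text.
samples = {
    "English": "Hello, world",
    "French": "d\u00e9j\u00e0 vu",
    "Japanese": "\u65e5\u672c\u8a9e",
}

for name, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "Hello, world".encode("ascii") == "Hello, world".encode("utf-8")
```

For the Latin-script samples UTF-8 comes out smaller, while the three CJK characters take nine bytes in UTF-8 against six in UTF-16.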
1 AristotlePagaltzis 4
2 AristotlePagaltzis 5 However, because it is byte-oriented, this encoding has important advantages that neither [UTF]-16 nor [UTF]-32 nor any other word-based encoding can offer:
6 * [Endianness] in any environment is completely irrelevant to the meaning of a blob of [UTF-8]-encoded text.
7 * No character other than NUL itself ever requires an all-zero-bits byte to be represented, so strncpy(3) and friends work just fine with [UTF-8] text (although truncating at a fixed byte count can still split a multibyte character).
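
Both points can be checked with a short Python sketch. The sample string is arbitrary, and the comparison with [UTF]-16 is there only to show what a word-based encoding does with the same text.

```python
text = "na\u00efve caf\u00e9"  # arbitrary sample with two non-ASCII characters

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-16 yields a different byte sequence for each byte order...
assert utf16_le != utf16_be
# ...and contains a zero byte for every character below U+0100.
assert b"\x00" in utf16_le and b"\x00" in utf16_be

# UTF-8 has only one possible byte sequence, and it contains no zero byte
# unless the text itself contains U+0000.
assert b"\x00" not in utf8
assert "a\x00b".encode("utf-8") == b"a\x00b"

print(utf8.hex(), utf16_le.hex(), utf16_be.hex(), sep="\n")
```

This is why C string functions that stop at the first zero byte can process [UTF-8] data unchanged, whereas the same functions would cut [UTF]-16 text short at the first character below U+0100.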
8
9 Also, because the rules for encoding multibyte characters are strict, data in other encodings rarely happens to form valid sequences, so it is usually possible to determine with good confidence whether a blob of text is [UTF-8] or not.
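
A simple way to apply this idea is to attempt a strict decode. The sketch below assumes that "decodes without error" is an acceptable test; the function name `looks_like_utf8` is made up for this example.

```python
def looks_like_utf8(data):
    """Return True if data decodes as strict UTF-8 (hypothetical helper name)."""
    try:
        data.decode("utf-8")  # error handling is strict by default
    except UnicodeDecodeError:
        return False
    return True


valid = "caf\u00e9".encode("utf-8")      # b"caf\xc3\xa9"
invalid = "caf\u00e9".encode("latin-1")  # b"caf\xe9": lead byte with no continuation

print(looks_like_utf8(valid))    # True
print(looks_like_utf8(invalid))  # False
```

Note that pure [ASCII] input also passes, and a successful decode is strong evidence rather than proof; the longer the text and the more multibyte sequences it contains, the more reliable the answer.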
10
11 See also:
12 * UnicodeNotes for hints on using [UTF-8] in [Unix]/[Linux]
13 * utf8(7) for the gory technical details
14 * [UTF]