Annotated edit history of
UTF-8 version 2 showing authors affecting page license.
Rev |
Author |
# |
Line |
2 |
AristotlePagaltzis |
1 |
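[UTF-8] is a [Unicode] Transformation Format which encodes every [Unicode] code point as a sequence of bytes. It requires 1 to 4 of them for any given code point; the original design allowed sequences of up to 6 bytes, but RFC 3629 restricts it to 4, which is enough to cover the whole [Unicode] range up to U+10FFFF. It is the most popular member of the [UTF] family, with good reason. |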
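As a rough illustration of how a code point maps to 1 to 4 bytes, here is a minimal sketch in Python. The function name `encode_utf8` is made up for this example; in real code you would simply call `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper, not a library API)."""
    # Surrogates and values above U+10FFFF are not valid Unicode scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx plus three continuation bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

The lead byte announces the length of the sequence, and every following byte starts with the bits `10`, which is what makes the encoding self-synchronizing.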
1 |
AristotlePagaltzis |
2 |
|
2 |
AristotlePagaltzis |
3 |
Because of the distribution of code points in [Unicode], roughly 90-95% of the characters in typical Western-language text need only one byte each, and nearly all of the remaining letters need no more than two (some punctuation, such as typographic quotes and dashes, needs three). It is also directly backwards compatible with [ASCII], which contains only 128 characters: any byte with its high bit clear in [UTF-8] text means exactly the corresponding [ASCII] character, and such bytes never occur inside a multibyte sequence. It can therefore provide an easy transition to [Unicode] for applications and systems that are used to dealing with 7-bit [ASCII] (for example, [Unix] TwoLetterCommands for stream processing were traditionally very byte-oriented). Unfortunately, it penalizes East Asian scripts with three bytes per character where [UTF]-16 needs two, so [UTF]-16 is often preferred for text in those scripts. |
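The size trade-off can be checked with a few lines of Python. This is only a sketch: the sample strings are arbitrary choices for illustration, and the helper names `utf8_len` and `utf16_len` are made up here.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))

def utf16_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-16 (little-endian, no BOM)."""
    return len(text.encode("utf-16-le"))

# Arbitrary sample strings, written with escapes to keep the source pure ASCII.
samples = {
    "English": "hello",
    "French": "\u00e9t\u00e9",             # "ete" with two accented letters
    "Japanese": "\u65e5\u672c\u8a9e",      # three CJK characters
}

# Maps each sample name to (UTF-8 size, UTF-16 size) in bytes.
sizes = {name: (utf8_len(s), utf16_len(s)) for name, s in samples.items()}

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "hello".encode("utf-8") == "hello".encode("ascii")
```

For these samples the English and French strings are smaller in UTF-8, while the Japanese string takes 9 bytes in UTF-8 against 6 in UTF-16.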
1 |
AristotlePagaltzis |
4 |
|
2 |
AristotlePagaltzis |
5 |
However, because it is byte-oriented, this encoding has important advantages that neither [UTF]-16 nor [UTF]-32 nor any other word-based encoding can offer: |
|
|
6 |
* [Endianness] in any environment is completely irrelevant to the meaning of a blob of [UTF-8]-encoded text. |
|
|
7 |
* No character other than NUL itself is ever encoded with an all-zero byte, so strncpy(3) and friends work on [UTF-8] text without modification. They still count bytes, not characters, so a length limit can cut a multibyte character in half.
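Both points in the list above can be observed directly. The following Python sketch uses an arbitrary sample string; the variable and function names are chosen for this example only.

```python
# Arbitrary sample mixing ASCII, an accented letter and two CJK characters.
text = "na\u00efve \u65e5\u672c"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

def has_nul(data: bytes) -> bool:
    """True if the byte string contains an all-zero byte."""
    return 0 in data

# UTF-16 has two byte orders that produce different byte strings;
# UTF-8 has exactly one serialization.
byte_order_matters_for_utf16 = utf16_le != utf16_be

# ASCII characters in UTF-16 carry a zero high byte, which would end a C string early.
utf8_is_nul_free = not has_nul(utf8)
utf16_contains_nul = has_nul(utf16_le)

# Caveat: cutting at a byte count can split a multibyte character.
truncated = utf8[:3]  # "na" plus the first byte of the two-byte i-diaeresis
try:
    truncated.decode("utf-8")
    truncation_is_safe = True
except UnicodeDecodeError:
    truncation_is_safe = False
```

The last part shows why byte-oriented C string functions are safe with respect to terminators but still need care around length limits.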
|
|
8 |
|
|
|
9 |
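Also, because the rules for multibyte sequences are strict, it is usually possible to tell heuristically whether a blob of text is [UTF-8] or not: text in another 8-bit encoding that contains bytes with the high bit set is very unlikely to form only valid [UTF-8] sequences by chance.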
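A simple version of such a check is to attempt a strict decode. The sketch below is an assumption about how one might do it in Python, with the hypothetical helper name `looks_like_utf8`; it is a heuristic, not a proof, since pure [ASCII] data and some short byte strings in other encodings also pass.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes form a valid UTF-8 sequence from start to end."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

# The same word encoded two ways, written with an escape for the accented letter.
word = "caf\u00e9"
as_utf8 = word.encode("utf-8")      # ends in the two bytes C3 A9
as_latin1 = word.encode("latin-1")  # ends in the lone byte E9, an incomplete sequence
```

Here `looks_like_utf8(as_utf8)` returns `True` and `looks_like_utf8(as_latin1)` returns `False`, because a lead byte such as `E9` must be followed by continuation bytes.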
|
|
10 |
|
|
|
11 |
See also: |
|
|
12 |
* UnicodeNotes for hints on using [UTF-8] in [Unix]/[Linux] |
|
|
13 |
* utf8(7) for the gory technical details |
|
|
14 |
* [UTF] |