UTF-8
[UTF-8] is a [Unicode] Transformation Format which encodes all [Unicode] code points as sequences of bytes. It requires anywhere from 1 to 4 bytes for any given character (the original design allowed sequences of up to 6 bytes, but RFC 3629 restricted it to 4 because [Unicode] code points stop at U+10FFFF). It is the most popular member of the [UTF] family, with good reason. Because of the distribution of code points in [Unicode], 90-95% of the characters in typical Western-language text require only one byte, and almost 100% of non-punctuation characters need no more than two.

It is also directly backwards compatible with [ASCII], which contains only 128 characters: any [UTF-8] byte with its high bit clear is identical in meaning to the corresponding [ASCII] character. It can therefore provide an easy transition to [Unicode] for applications and systems that are used to dealing with 7-bit [ASCII] (for example, [Unix] TwoLetterCommands for stream processing were traditionally very byte-oriented).

Unfortunately, it penalizes East Asian scripts, which need three bytes per character, so users of those scripts often prefer [UTF]-16. However, because it is byte-oriented, this encoding has important advantages that neither [UTF]-16 nor [UTF]-32 nor any other word-based encoding can offer:

* [Endianness] is completely irrelevant to the meaning of a blob of [UTF-8]-encoded text in any environment.
* No character other than NUL itself ever requires an all-zero-bits byte to be represented, so strncpy(3) and friends work just fine with [UTF-8] text.

Also, because the rules for multibyte sequences are strict, the odds are pretty good of being able to determine statistically whether a blob of text is [UTF-8] or not: text in most other encodings that contains bytes above 127 is unlikely to form valid [UTF-8] sequences by accident.

See also:

* UnicodeNotes for hints on using [UTF-8] in [Unix]/[Linux]
* utf8(7) for the gory technical details
* [UTF]
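The byte-length rules and the "is this blob [UTF-8]?" check described above can be illustrated with a short Python sketch. This is only an illustration, not a reference implementation: the function names encode_code_point and looks_like_utf8 are made up for this page, and the encoder assumes the 4-byte limit of RFC 3629 rather than the older 6-byte scheme.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper, 1 to 4 bytes)."""
    # Reject values outside Unicode and the UTF-16 surrogate range.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: identical to ASCII
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def looks_like_utf8(blob: bytes) -> bool:
    """Return True if blob is a valid UTF-8 byte sequence (hypothetical helper)."""
    try:
        blob.decode("utf-8")  # Python's strict decoder rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:04X} -> {encode_code_point(cp).hex(' ')}")
    # Latin-1 bytes for a word with accented letters are not valid UTF-8.
    print(looks_like_utf8(b"\xe9t\xe9"))
```

Every byte of a multibyte sequence has its high bit set, so neither NUL nor any other [ASCII] byte can appear inside one. Note that pure 7-bit [ASCII] text always passes looks_like_utf8, so the check only says something useful when the blob contains bytes above 127.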