Differences between current version and previous revision of UTF-8.
Newer page: | version 2 | Last edited on Thursday, June 23, 2005 6:12:09 am | by AristotlePagaltzis | |
Older page: | version 1 | Last edited on Thursday, June 23, 2005 5:21:56 am | by AristotlePagaltzis |
@@ -1,8 +1,14 @@
-UTF8
is one serialisation of
[Unicode], designed
to provide a good transition to Unicode
for applications/systems that are more used to dealing with 7-bit
[ASCII
] (for example
, Unix [CLI] TwoLetterCommands for stream processing were traditionally very byte-oriented)
.
+[UTF-8]
is a
[Unicode] Transformation Format which encodes all [Unicode] code points in bytes. It requires anywhere from 1
to 6 of them
for any given character. It is the most popular member of the
[UTF
] family
, with good reason
.
-It is mostly used where most people are expecting to be dealing mostly with 7bit ascii text
with occasional unicode
characters. The first 127 codes in UTF8 map exactly onto the 7
-bit [ASCII] range, with higher codes being reachable with escapes creating multibyte charactors
. This means
that ascii text is automatically also UTF8 text. Non-ascii text is represented as variable length 8
-bit byte sequences
(between 2 and 6 bytes long)
-- most Western accented Unicode characters are 2 bytes long in utf8
, and the most common Asian characters are 3 bytes long (
with others 4
bytes or higher). For this reason
, Asians generally prefer to use
16-bit Unicode (if they aren't using a country-specific text encoding)
.
+Because of the distribution of code points in [Unicode], 90-95% of any typical Western language text requires only one byte per character, and no more than two bytes for almost 100% of non-punctuation characters.
It is also directly backwards compatible
with [ASCII], which only contains 128
characters: any [UTF
-8] character with its high
bit reset is identical in meaning to the corresponding
[ASCII] character
. It can therefore provide an easy transition to [Unicode] for applications/systems
that are more used to dealing with 7
-bit [ASCII]
(for example, [Unix] TwoLetterCommands for stream processing were traditionally very byte
-oriented). Unfortunately
, it penalizes Eastern scripts
with three
bytes per character
, so
Asians generally prefer [UTF]-
16.
-----
-!! See Also
-* The utf8
(7
) ManPage has
the gory technical details
.
-* UnicodeNotes gives some
hints on using UTF8
in Unix/Linux
+However, because it is byte
-oriented, this encoding has important advantages that neither [UTF]
-16 nor [UTF]
-32 nor any other word
-based encoding can offer:
+* [Endianness] in any environment is completely irrelevant to the meaning of a blob of [UTF-8]-encoded text.
+* No character other than NUL itself ever requires an all-zero-bits byte to be represented, so strncpy
(3
) and friends work just fine with [UTF-8] text.
+
+Also, because of
the rules for multibyte character encoding, odds are pretty good for being able to statistically determine whether a blob of text is [UTF-8] or not
.
+
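The detection idea in the paragraph above can be sketched in Python. This is an illustrative heuristic, not part of either page revision: the function name `classify` and its three result labels are assumptions made for the example. It relies on Python's built-in strict UTF-8 decoder, which follows the current definition of the encoding and therefore rejects the 5- and 6-byte forms.

```python
def classify(data: bytes) -> str:
    """Guess whether a blob is ASCII, UTF-8, or something else.

    Hypothetical helper for illustration. Every multibyte sequence must
    start with a lead byte (0b110xxxxx, 0b1110xxxx or 0b11110xxx) and
    continue with the right number of 0b10xxxxxx bytes, so text in a
    legacy 8-bit encoding rarely passes the check by accident.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf8"
    if all(byte < 0x80 for byte in data):
        # Pure 7-bit data is valid UTF-8, but proves nothing either way.
        return "ascii"
    return "utf8"


if __name__ == "__main__":
    text = "na\u00efve caf\u00e9"  # contains two accented letters
    print(classify(text.encode("utf-8")))
    print(classify(text.encode("latin-1")))
    print(classify(b"plain text"))
```

A longer blob with more non-ASCII characters makes the "utf8" verdict more trustworthy, since each additional well-formed multibyte sequence is another chance for a non-UTF-8 encoding to fail the check.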
+See also:
+* UnicodeNotes for
hints on using [UTF-8]
in [
Unix]
/[
Linux]
+* utf8(7) for the gory technical details
+* [UTF]
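
The size and compatibility claims made in the newer revision can be checked with a short Python sketch. The helper names `encoded_sizes` and `has_zero_byte` and the sample characters are assumptions chosen for illustration; only the standard `str.encode` method is used.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte counts for one character in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
    }


def has_zero_byte(text: str, encoding: str) -> bool:
    """True if the encoded form contains an all-zero-bits byte."""
    return 0 in text.encode(encoding)


# Illustrative samples: an ASCII letter, a Western accented letter,
# and a common CJK ideograph.
SAMPLES = {
    "A": "A",
    "e with acute": "\u00e9",
    "CJK U+65E5": "\u65e5",
}

if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(name, encoded_sizes(ch))

    # ASCII text is byte-for-byte identical when encoded as UTF-8.
    print("hello".encode("ascii") == "hello".encode("utf-8"))

    # UTF-16 output depends on byte order and contains zero bytes;
    # UTF-8 has a single byte order and no zero bytes for non-NUL text.
    print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
    print(has_zero_byte("A\u65e5", "utf-8"), has_zero_byte("A", "utf-16-le"))
```

The zero-byte check is what makes C string functions such as strncpy(3) safe on UTF-8 data, while the same functions would truncate UTF-16 text at the first ASCII character.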