Differences between current version and previous revision of UTF-8.

Newer page: version 2 Last edited on Thursday, June 23, 2005 6:12:09 am by AristotlePagaltzis
Older page: version 1 Last edited on Thursday, June 23, 2005 5:21:56 am by AristotlePagaltzis
@@ -1,8 +1,14 @@
-UTF8 is one serialisation of [Unicode], designed to provide a good transition to Unicode for applications/systems that are more used to dealing with 7-bit [ASCII] (for example, Unix [CLI] TwoLetterCommands for stream processing were traditionally very byte-oriented)
+[UTF-8] is a [Unicode] Transformation Format which encodes all [Unicode] code points in bytes. It requires anywhere from 1 to 6 of them for any given character. It is the most popular member of the [UTF] family, with good reason.
  
-It is mostly used where most people are expecting to be dealing mostly with 7bit ascii text with occasional unicode characters. The first 128 codes in UTF8 map exactly onto the 7-bit [ASCII] range, with higher codes being reachable with escapes creating multibyte characters. This means that ascii text is automatically also UTF8 text. Non-ascii text is represented as variable length 8-bit byte sequences (between 2 and 6 bytes long) -- most Western accented Unicode characters are 2 bytes long in utf8, and the most common Asian characters are 3 bytes long (with others 4 bytes or higher). For this reason, Asians generally prefer to use 16-bit Unicode (if they aren't using a country-specific text encoding)
+Because of the distribution of code points in [Unicode], 90-95% of any typical Western language text requires only one byte per character, and no more than two bytes for almost 100% of non-punctuation characters. It is also directly backwards compatible with [ASCII], which only contains 128 characters: any [UTF-8] character with its high bit reset is identical in meaning to the corresponding [ASCII] character. It can therefore provide an easy transition to [Unicode] for applications/systems that are more used to dealing with 7-bit [ASCII] (for example, [Unix] TwoLetterCommands for stream processing were traditionally very byte-oriented). Unfortunately, it penalizes Eastern scripts with three bytes per character, so Asians generally prefer [UTF]-16.
  
-----  
-!! See Also  
-* The utf8(7) ManPage has the gory technical details.
-* UnicodeNotes gives some hints on using UTF8 in Unix/Linux 
+However, because it is byte-oriented, this encoding has important advantages that neither [UTF]-16 nor [UTF]-32 nor any other word-based encoding can offer:
+* [Endianness] in any environment is completely irrelevant to the meaning of a blob of [UTF-8]-encoded text.  
+* No character other than NUL itself ever requires an all-zero-bits byte to be represented, so strncpy(3) and friends work just fine with [UTF-8] text.
+  
+Also, because of the rules for multibyte character encoding, odds are pretty good for being able to statistically determine whether a blob of text is [UTF-8] or not.
+  
+See also:  
+* UnicodeNotes for hints on using [UTF-8] in [Unix]/[Linux]
+* utf8(7) for the gory technical details  
+* [UTF]
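
The properties claimed in the new revision (variable byte length, [ASCII] compatibility, no all-zero-bits bytes outside NUL, and the ability to guess whether a blob is [UTF-8]) can be checked with a short Python sketch. This is only an illustration, not part of either page revision: the helper names `utf8_length`, `has_zero_byte` and `looks_like_utf8` are hypothetical, and the detection shown is a strict validity check rather than a true statistical test. Note also that Python's codec follows the current definition of UTF-8 (RFC 3629), which caps sequences at 4 bytes; the "1 to 6" figure in the text describes the original, wider scheme.

```python
# Illustrative sketch only; all helper names here are hypothetical.

def utf8_length(ch: str) -> int:
    """Number of bytes Python's UTF-8 encoder emits for one character."""
    return len(ch.encode("utf-8"))


def has_zero_byte(text: str) -> bool:
    """True if the UTF-8 encoding of text contains an all-zero-bits byte."""
    return b"\x00" in text.encode("utf-8")


def looks_like_utf8(blob: bytes) -> bool:
    """Strict validity check: True if blob decodes as UTF-8 without errors.

    Assumption: validity is used as a stand-in for the statistical guess
    mentioned in the text. Pure ASCII input also passes, by design.
    """
    try:
        blob.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "A",           # ASCII letter
        "\u00e9",      # Latin small letter e with acute
        "\u65e5",      # CJK ideograph
        "\U0001F600",  # emoji outside the Basic Multilingual Plane
    ]
    for ch in samples:
        # ascii() keeps the output printable on any terminal encoding.
        print(ascii(ch), utf8_length(ch), ch.encode("utf-8").hex())

    legacy = "na\u00efve".encode("latin-1")  # lone 0xEF byte, not valid UTF-8
    print(looks_like_utf8(legacy))
    print(looks_like_utf8("na\u00efve".encode("utf-8")))
```

A lone high byte from a legacy 8-bit encoding almost never happens to form a well-formed multibyte sequence, which is why a plain validity check already works reasonably well as a heuristic on text that contains any non-[ASCII] characters.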