Differences between version 2 and predecessor to the previous major change of UTF8.
Newer page: | version 2 | Last edited on Sunday, March 20, 2005 11:32:00 am | by JohnMcPherson |
Older page: | version 1 | Last edited on Sunday, March 20, 2005 12:49:43 am | by PerryLorier |
@@ -1 +1,8 @@
-UTF8 is one serialisation of unicode, designed to be used where most people are expecting to be dealing mostly with 7bit ascii text with the occasional unicode charactor. The first 127 codes in UTF8 map exactly onto the 7bit ASCII range, with higher codes being reachable with escapes creating multibyte charactors.
+UTF8 is one serialisation of [Unicode], designed to provide a good transition to Unicode for applications/systems that are more used to dealing with 7-bit [ASCII] (for example, Unix [CLI] TwoLetterCommands for stream processing were traditionally very byte-oriented).
+
+It is mostly used where people expect to be dealing mainly with 7-bit ASCII text with occasional Unicode characters. The first 128 codes in UTF8 (0 to 127) map exactly onto the 7-bit [ASCII] range, with higher codes being reachable through lead bytes that introduce multibyte characters. This means that ASCII text is automatically also UTF8 text. Non-ASCII text is represented as variable-length 8-bit byte sequences (between 2 and 6 bytes long in the original design; RFC 3629 restricts UTF8 to at most 4 bytes) -- most Western accented Unicode characters are 2 bytes long in UTF8, and the most common Asian characters are 3 bytes long (with others 4 bytes or more). For this reason, East Asian users often prefer 16-bit Unicode (if they aren't using a country-specific text encoding).
+
+----
+!! See Also
+* The utf8(7) ManPage has the gory technical details.
+* UnicodeNotes gives some hints on using UTF8 in Unix/Linux
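
The byte lengths described in version 2 can be checked with a short Python sketch. This is not part of either page revision; the helper name `utf8_length` and the sample characters are illustrative assumptions, and only the standard `str.encode` method is used.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# Hypothetical sample characters, written as escapes so the source stays ASCII:
# an ASCII letter, a Western accented letter, a common CJK ideograph,
# and a character outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]

# ASCII text produces identical bytes under the ASCII and UTF-8 codecs.
ascii_text = "plain ascii"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
print("ASCII text is valid UTF-8 unchanged:", same_bytes)
```

The four samples should come out as 1, 2, 3 and 4 bytes respectively, which matches the claim that ASCII text is automatically also UTF8 text while accented and Asian characters cost more bytes.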