UTF-8 - Waikato Linux Users Group

UTF-8 is one serialisation (encoding) of Unicode, designed to provide a smooth transition to Unicode for applications and systems that are used to dealing with 7-bit ASCII (for example, Unix CLI TwoLetterCommands for stream processing were traditionally very byte-oriented).

It is best suited to text that is expected to be mostly 7-bit ASCII with occasional non-ASCII Unicode characters. The first 128 code points (0 to 127) in UTF-8 map exactly onto the 7-bit ASCII range, and higher code points are encoded as multibyte sequences in which every byte has its high bit set. This means that ASCII text is automatically also UTF-8 text. Non-ASCII characters are represented as variable-length sequences of 8-bit bytes: 2 to 4 bytes under the current definition (RFC 3629), although the original design allowed sequences of up to 6 bytes. Most Western accented characters are 2 bytes long in UTF-8, and the most common Asian characters are 3 bytes long (with others taking 4 bytes). For this reason, text that is mostly East Asian is often stored in 16-bit Unicode (UTF-16) instead, where those common characters take only 2 bytes (if a country-specific text encoding isn't being used).
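The byte lengths described above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `utf8_length` and the sample characters are chosen here for illustration and are not part of the original page.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative samples (an assumption of this sketch), written as escapes
# so the source file itself stays pure ASCII.
samples = {
    "ASCII letter A": "A",
    "Western accented e-acute": "\u00e9",
    "common CJK ideograph": "\u4e2d",
    "character outside the 16-bit range": "\U0001F600",
}

lengths = {name: utf8_length(ch) for name, ch in samples.items()}

for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {name}: {len(encoded)} byte(s), hex {encoded.hex()}")

# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_is_utf8 = "hello".encode("ascii") == "hello".encode("utf-8")

# The same CJK ideograph needs 3 bytes in UTF-8 but only 2 in UTF-16
# ("utf-16-le" is used so that no byte-order mark is added to the count).
cjk_utf16_length = len("\u4e2d".encode("utf-16-le"))
```

Comparing `lengths["common CJK ideograph"]` with `cjk_utf16_length` shows the size difference that makes a 16-bit encoding attractive for text that is mostly East Asian.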

See Also