UTF-8 is a Unicode Transformation Format which encodes all Unicode code points as sequences of bytes. It requires anywhere from 1 to 4 of them for any given character (the original design allowed up to 6, before Unicode was capped at U+10FFFF). It is the most popular member of the UTF family, with good reason.

Because of the distribution of code points in Unicode, 90-95% of typical Western-language text requires only one byte per character, and almost all of the rest requires no more than two. It is also directly backwards compatible with ASCII, which only contains 128 characters: any byte with its high bit clear is identical in meaning to the corresponding ASCII character. It can therefore provide an easy transition to Unicode for applications and systems accustomed to dealing with 7-bit ASCII (for example, Unix TwoLetterCommands for stream processing were traditionally very byte-oriented). Unfortunately, it penalizes East Asian scripts with three bytes per character, which is one reason UTF-16 is often preferred for such text.

However, because it is byte-oriented, this encoding has important advantages that neither UTF-16 nor UTF-32 nor any other word-based encoding can offer:

  • Endianness in any environment is completely irrelevant to the meaning of a blob of UTF-8-encoded text.
  • No character other than NUL itself is ever encoded with a zero byte, so strncpy(3) and friends work just fine on UTF-8 text.

Also, because of the strict structural rules for multibyte character encoding, the odds are very good that one can statistically determine whether a blob of text is UTF-8 or not: random non-UTF-8 data is unlikely to satisfy the lead-byte/continuation-byte pattern by accident.

See also: