A character encoding standard that attempts to cover all the world's writing systems (such as Latin, Chinese and Cyrillic) in a single character set, while making as few linguistic assumptions as possible. Some of the assumptions it does embed include:

  • The classification of every character into exactly one of 29 classes (upper case letter, lower case letter, digit, etc.). This has the interesting side effect of requiring some letters to be duplicated across classes: M appears twice, once as an upper case letter (U+004D) and once as a "letter number" (U+216F, the Roman numeral for one thousand).
  • That text is built from characters rather than glyphs, drawn from a finite set of code points, or from a countably infinite set of characters if you allow composition with combining marks.
  • That written text has a single logical reading order, with display direction handled separately.
  • And probably others.
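
The first two assumptions can be checked directly. This is a minimal sketch, assuming the standard under discussion is Unicode and that Python's standard `unicodedata` module is available; the helper `describe` and the constant names are made up for illustration, and the reported categories depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch):
    """Return (code point, official name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


# Two different characters that both look like "M".
LATIN_M = "M"        # ordinary capital letter
ROMAN_M = "\u216F"   # Roman numeral one thousand

print(describe(LATIN_M))  # category "Lu" (upper case letter)
print(describe(ROMAN_M))  # category "Nl" (letter number)

# The Roman numeral carries a numeric value; the plain letter does not.
print(unicodedata.numeric(ROMAN_M))        # 1000.0
print(unicodedata.numeric(LATIN_M, None))  # None

# Composition: the same visible character as one code point or as two.
composed = "\u00E9"      # e with acute accent, precomposed
decomposed = "e\u0301"   # plain e followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Because combining marks can be stacked on a base character without a fixed limit, the set of composable sequences is unbounded even though the set of code points is finite.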

Contrast ASCII.
