utf8(7)
UTF-8
!!!UTF-8
NAME
DESCRIPTION
PROPERTIES
ENCODING
EXAMPLES
APPLICATION NOTES
SECURITY
STANDARDS
AUTHOR
SEE ALSO
----
!!NAME

UTF-8 - an ASCII compatible multi-byte Unicode encoding

!!DESCRIPTION

The __Unicode 3.0__ character set occupies a 16-bit code space. The most obvious Unicode encoding (known as __UCS-2__) consists of a sequence of 16-bit words. Such strings can contain, as parts of many 16-bit characters, bytes such as '\0' or '/' that have a special meaning in filenames and other C library function parameters. In addition, the majority of UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, __UCS-2__ is not a suitable external encoding of __Unicode__ in filenames, text files, environment variables, etc. The __ISO 10646 Universal Character Set (UCS)__, a superset of Unicode, occupies an even larger 31-bit code space, and the obvious __UCS-4__ encoding for it (a sequence of 32-bit words) has the same problems.

The __UTF-8__ encoding of __Unicode__ and __UCS__ does not have these problems and is the common way in which __Unicode__ is used on Unix-style operating systems.

!!PROPERTIES

The __UTF-8__ encoding has the following nice properties:

* __UCS__ characters 0x00000000 to 0x0000007f (the classic __US-ASCII__ characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both __ASCII__ and __UTF-8__.
* All __UCS__ characters greater than 0x7f are encoded as a multi-byte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with, for example, '\0' or '/'.
* The lexicographic sorting order of __UCS-4__ strings is preserved.
* All possible 2^31 UCS codes can be encoded using __UTF-8__.
* The bytes 0xfe and 0xff are never used in the __UTF-8__ encoding.
* The first byte of a multi-byte sequence which represents a single non-ASCII __UCS__ character is always in the range 0xc0 to 0xfd and indicates how long this multi-byte sequence is. All further bytes in a multi-byte sequence are in the range 0x80 to 0xbf.
This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
* __UTF-8__ encoded __UCS__ characters may be up to six bytes long; however, the __Unicode__ standard specifies no characters above 0x10ffff, so Unicode characters can be at most four bytes long in __UTF-8__.

!!ENCODING

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:

 0x00000000 - 0x0000007F: 0''xxxxxxx''
 0x00000080 - 0x000007FF: 110''xxxxx'' 10''xxxxxx''
 0x00000800 - 0x0000FFFF: 1110''xxxx'' 10''xxxxxx'' 10''xxxxxx''
 0x00010000 - 0x001FFFFF: 11110''xxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''
 0x00200000 - 0x03FFFFFF: 111110''xx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''
 0x04000000 - 0x7FFFFFFF: 1111110''x'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''

The ''xxx'' bit positions are filled with the bits of the character code number in binary representation. Only the shortest possible multi-byte sequence which can represent the code number of the character can be used.

The __UCS__ code values 0xd800-0xdfff (UTF-16 surrogates) as well as 0xfffe and 0xffff (UCS non-characters) should not appear in conforming __UTF-8__ streams.

!!EXAMPLES

The __Unicode__ character 0xa9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as

 11000010 10101001 = 0xc2 0xa9

and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as

 11100010 10001001 10100000 = 0xe2 0x89 0xa0

!!APPLICATION NOTES

Users have to select a __UTF-8__ locale, for example with

 export LANG=en_GB.UTF-8

in order to activate the __UTF-8__ support in applications.
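
Whether a __UTF-8__ locale is actually in effect can be checked at run time with setlocale(3) and nl_langinfo(3), both listed under SEE ALSO. The following is a minimal sketch only: the helper names codeset_is_utf8 and locale_is_utf8 are hypothetical and not part of any library, and the sketch assumes the system reports the codeset under the exact name "UTF-8".

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: does this codeset name denote UTF-8?
 * Assumption: the name is spelled exactly "UTF-8". */
int codeset_is_utf8(const char *codeset)
{
    return strcmp(codeset, "UTF-8") == 0;
}

/* Hypothetical helper: is the current LC_CTYPE locale a UTF-8 locale?
 * The caller is expected to have called setlocale(LC_CTYPE, "")
 * beforehand so that the user's environment is honoured. */
int locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program would typically call setlocale(LC_CTYPE, "") once at startup and then branch on locale_is_utf8() when deciding how to interpret its input and output.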

Application software that has to be aware of the character encoding in use should always set the locale, for example with

 setlocale(LC_CTYPE, "")

and programmers can then test the expression

 strcmp(nl_langinfo(CODESET), "UTF-8") == 0

to determine whether a __UTF-8__ locale has been selected and whether, therefore, all plaintext standard input and output, terminal communication, plaintext file content, filenames and environment variables are encoded in __UTF-8__.

Programmers accustomed to single-byte encodings such as __US-ASCII__ or __ISO 8859__ have to be aware that two assumptions made so far are no longer valid in __UTF-8__ locales. Firstly, a single byte no longer necessarily corresponds to a single character. Secondly, since modern terminal emulators in __UTF-8__ mode also support Chinese, Japanese, and Korean __double-width characters__ as well as non-spacing __combining characters__, outputting a single character does not necessarily advance the cursor by one position as it did in __ASCII__. Library functions such as mbsrtowcs(3) and wcswidth(3) should be used today to count characters and cursor positions.

The official ESC sequence to switch from an __ISO 2022__ encoding scheme (as used for instance by VT100 terminals) to __UTF-8__ is ESC % G ("\x1b%G"). The corresponding return sequence from __UTF-8__ to ISO 2022 is ESC % @ ("\x1b%@"). Other ISO 2022 sequences (such as for switching the G0 and G1 sets) are not applicable in __UTF-8__ mode.

It can be hoped that in the foreseeable future, __UTF-8__ will replace __ASCII__ and __ISO 8859__ at all levels as the common character encoding on POSIX systems, leading to a significantly richer environment for handling plain text.

!!SECURITY

The __Unicode__ and __UCS__ standards require that producers of __UTF-8__ shall use the shortest form possible; for example, producing a two-byte sequence with first byte 0xc0 is non-conforming. __Unicode 3.1__ has added the requirement that conforming programs must not accept non-shortest forms in their input.
This is for security reasons: if user input is checked for possible security violations, a program might check only for the __ASCII__ version of "/../" or ";" or NUL and overlook that there are many non-__ASCII__ ways to represent these things in a non-shortest __UTF-8__ encoding.

!!STANDARDS

ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan 9.

!!AUTHOR

Markus Kuhn

!!SEE ALSO

nl_langinfo(3), setlocale(3), charsets(7), unicode(7)
----
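
As a closing illustration of the ENCODING and SECURITY sections, the byte layouts and the shortest-form rule can be sketched in C. This is a sketch under stated assumptions, not a reference implementation: the function names utf8_encode and utf8_decode are hypothetical, the code handles only the Unicode range up to 0x10ffff (so at most four bytes, not the six-byte UCS forms), it rejects the surrogates 0xd800-0xdfff, and it leaves filtering of 0xfffe and 0xffff to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point into buf (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a UTF-16 surrogate
 * or lies above 0x10ffff (an assumption of this sketch). */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xc0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;                     /* surrogates are not encoded */
        buf[0] = (unsigned char)(0xe0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {                 /* 11110xxx and three 10xxxxxx */
        buf[0] = (unsigned char)(0xf0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: decode one character from s[0..len).
 * Returns the number of bytes consumed, or 0 if the input is truncated,
 * malformed, a non-shortest form, a surrogate, or above 0x10ffff. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest code point that needs a sequence of n bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t cp;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        n = 2; cp = s[0] & 0x1f;
    } else if ((s[0] & 0xf0) == 0xe0) {
        n = 3; cp = s[0] & 0x0f;
    } else if ((s[0] & 0xf8) == 0xf0) {
        n = 4; cp = s[0] & 0x07;
    } else {
        return 0;   /* stray continuation byte, 5/6-byte lead, 0xfe, 0xff */
    }
    if (len < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;                     /* not a 10xxxxxx byte */
        cp = (cp << 6) | (s[i] & 0x3f);
    }
    if (cp < min_cp[n])
        return 0;                         /* non-shortest form */
    if ((cp >= 0xd800 && cp <= 0xdfff) || cp > 0x10ffff)
        return 0;
    *out = cp;
    return n;
}
```

With these helpers, the two characters from the EXAMPLES section encode to 0xc2 0xa9 and 0xe2 0x89 0xa0, while the two-byte sequence 0xc0 0xaf, a non-shortest form of '/', is rejected by the decoder rather than being mapped to 0x2f.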
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.