View Source: utf-8(7) - Waikato Linux Users Group

!!!UTF-8
NAME
DESCRIPTION
PROPERTIES
ENCODING
EXAMPLES
APPLICATION NOTES
SECURITY
STANDARDS
AUTHOR
SEE ALSO
----
!!NAME


UTF-8 - an ASCII compatible multi-byte Unicode encoding
!!DESCRIPTION


The __Unicode 3.0__ character set occupies a 16-bit code
space. The most obvious Unicode encoding (known as
__UCS-2__) consists of a sequence of 16-bit words. Such
strings can contain, as parts of many 16-bit characters, bytes
like '\0' or '/' which have a special meaning in filenames
and other C library function parameters. In addition, the
majority of UNIX tools expect ASCII files and can't read
16-bit words as characters without major modifications. For
these reasons, __UCS-2__ is not a suitable external
encoding of __Unicode__ in filenames, text files,
environment variables, etc. The __ISO 10646 Universal
Character Set (UCS)__, a superset of Unicode, occupies
an even larger 31-bit code space, and the obvious __UCS-4__
encoding for it (a sequence of 32-bit words) has the same
problems.


The __UTF-8__ encoding of __Unicode__ and __UCS__
does not have these problems and is the common way in which
__Unicode__ is used on Unix-style operating
systems.
!!PROPERTIES


The __UTF-8__ encoding has the following nice
properties:


*


__UCS__ characters 0x00000000 to 0x0000007f (the classic
__US-ASCII__ characters) are encoded simply as bytes 0x00
to 0x7f (ASCII compatibility). This means that files and
strings which contain only 7-bit ASCII characters have the
same encoding under both __ASCII__ and
__UTF-8__.


*


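All __UCS__ characters greater than 0x7f are encoded as a
multi-byte sequence consisting only of bytes in the range
0x80 to 0xfd, so no ASCII byte can appear as part of another
character and there are no problems with, for example, '\0'
or '/'.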


*


The lexicographic sorting order of __UCS-4__ strings is
preserved.


*


All possible 2^31 UCS codes can be encoded using
__UTF-8__.


*


The bytes 0xfe and 0xff are never used in the __UTF-8__
encoding.


*


The first byte of a multi-byte sequence which represents a
single non-ASCII __UCS__ character is always in the range
0xc0 to 0xfd and indicates how long this multi-byte sequence
is. All further bytes in a multi-byte sequence are in the
range 0x80 to 0xbf. This allows easy resynchronization and
makes the encoding stateless and robust against missing
bytes.


*


__UTF-8__ encoded __UCS__ characters may be up to six
bytes long. However, the __Unicode__ standard specifies no
characters above 0x10ffff, so Unicode characters can only be
up to four bytes long in __UTF-8__.
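
The lead-byte and continuation-byte ranges listed above can be sketched in C as follows. This is a minimal illustration, not part of the original page; the function names utf8_seq_len and utf8_resync are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: length of the sequence announced by first byte b,
 * or 0 if b cannot start a sequence (continuation byte, 0xfe or 0xff). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* plain ASCII byte */
    if (b < 0xc0) return 0;   /* 0x80..0xbf: continuation byte */
    if (b < 0xe0) return 2;
    if (b < 0xf0) return 3;
    if (b < 0xf8) return 4;
    if (b < 0xfc) return 5;
    if (b < 0xfe) return 6;
    return 0;                 /* 0xfe and 0xff are never used */
}

/* Hypothetical helper: starting at pos, skip continuation bytes until the
 * next byte that can begin a character (or the end of the buffer). */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && (buf[pos] & 0xc0) == 0x80)
        pos++;
    return pos;
}
```

Because every continuation byte is recognizable on its own, a reader that lands in the middle of a character only has to skip at most five bytes to find the next character boundary.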
!!ENCODING


The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:


0x00000000 - 0x0000007F:


0''xxxxxxx''


0x00000080 - 0x000007FF:


110''xxxxx'' 10''xxxxxx''


0x00000800 - 0x0000FFFF:


1110''xxxx'' 10''xxxxxx'' 10''xxxxxx''


0x00010000 - 0x001FFFFF:


11110''xxx'' 10''xxxxxx'' 10''xxxxxx''
10''xxxxxx''


0x00200000 - 0x03FFFFFF:


111110''xx'' 10''xxxxxx'' 10''xxxxxx''
10''xxxxxx'' 10''xxxxxx''


0x04000000 - 0x7FFFFFFF:


1111110''x'' 10''xxxxxx'' 10''xxxxxx''
10''xxxxxx'' 10''xxxxxx'' 10''xxxxxx''


The ''xxx'' bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multi-byte sequence which can represent
the code number of the character can be used.


The __UCS__ code values 0xd800-0xdfff (UTF-16 surrogates)
as well as 0xfffe and 0xffff (UCS non-characters) should not
appear in conforming __UTF-8__ streams.
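
The table above translates fairly directly into code. The following is a sketch of an encoder under stated assumptions: the name utf8_encode is hypothetical, the caller supplies a buffer of at least six bytes, and the function chooses to refuse the surrogate range and the two non-characters mentioned above rather than encode them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the UTF-8 form of code into out (at least
 * 6 bytes). Returns the number of bytes written, or 0 if the code is
 * outside the 31-bit UCS range or is one of the excluded values. */
size_t utf8_encode(uint32_t code, unsigned char *out)
{
    size_t len, i;
    unsigned char lead;

    if (code > 0x7fffffff)
        return 0;
    if ((code >= 0xd800 && code <= 0xdfff) || code == 0xfffe || code == 0xffff)
        return 0;

    if (code < 0x80) {
        out[0] = (unsigned char)code;
        return 1;
    } else if (code < 0x800) {
        len = 2; lead = 0xc0;
    } else if (code < 0x10000) {
        len = 3; lead = 0xe0;
    } else if (code < 0x200000) {
        len = 4; lead = 0xf0;
    } else if (code < 0x4000000) {
        len = 5; lead = 0xf8;
    } else {
        len = 6; lead = 0xfc;
    }

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3f));
        code >>= 6;
    }
    /* The remaining high bits go into the first byte. */
    out[0] = (unsigned char)(lead | code);
    return len;
}
```

Picking the length from the smallest range that contains the code is what guarantees the shortest possible sequence.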
!!EXAMPLES


The __Unicode__ character 0xa9 = 1010 1001 (the copyright
sign) is encoded in UTF-8 as


11000010 10101001 = 0xc2 0xa9


and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as:


11100010 10001001 10100000 = 0xe2 0x89 0xa0
!!APPLICATION NOTES


Users have to select a __UTF-8__ locale, for example
with


export LANG=en_GB.UTF-8


in order to activate the __UTF-8__ support in
applications.


Application software that has to be aware of the character
encoding in use should always set the locale with, for
example,


setlocale(LC_CTYPE, "")


and programmers can then test the expression


strcmp(nl_langinfo(CODESET), "UTF-8") == 0


to determine whether a __UTF-8__ locale has been selected
and whether therefore all plaintext standard input and
output, terminal communication, plaintext file content,
filenames and environment variables are encoded in
__UTF-8__.
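
The locale check described above could be wrapped in a small function, sketched here. The name locale_is_utf8 is hypothetical; setlocale(3) and nl_langinfo(3) are the standard calls listed under SEE ALSO.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: adopt the LC_CTYPE locale from the environment
 * and report whether its character encoding is UTF-8 (1) or not (0). */
int locale_is_utf8(void)
{
    /* If the environment names an unavailable locale, setlocale()
     * returns NULL and the current locale stays unchanged. */
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

The result depends on the environment of the running process, so with LANG=en_GB.UTF-8 exported and that locale installed the function would be expected to return 1.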


Programmers accustomed to single-byte encodings such as
__US-ASCII__ or __ISO 8859__ have to be aware that two
assumptions made so far are no longer valid in __UTF-8__
locales. Firstly, a single byte does not necessarily
correspond any more to a single character. Secondly, since
modern terminal emulators in __UTF-8__ mode also support
Chinese, Japanese, and Korean __double-width characters__
as well as non-spacing __combining characters__,
outputting a single character does not necessarily advance
the cursor by one position as it did in __ASCII__.
Library functions such as mbsrtowcs(3) and
wcswidth(3) should be used today to count characters
and cursor positions.
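
As a rough sketch of how these two library functions can be combined, the following converts a multi-byte string to wide characters and then asks for its display width. The name text_metrics is hypothetical, and the feature-test macro is an assumption needed on some systems (for example glibc) to make wcswidth(3) visible. Results depend on the locale currently set for LC_CTYPE.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: store the number of characters of s in *nchars and
 * the number of terminal columns in *ncols (-1 if s contains a
 * non-printable character). Returns 0 on success, -1 if s is not valid
 * in the current locale or memory runs out. */
int text_metrics(const char *s, size_t *nchars, int *ncols)
{
    mbstate_t st;
    const char *p = s;
    wchar_t *w;
    size_t n;

    /* First pass: count the wide characters without storing them. */
    memset(&st, 0, sizeof st);
    n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return -1;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return -1;

    /* Second pass: convert for real, including the terminator. */
    p = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &p, n + 1, &st);

    *nchars = n;
    *ncols = wcswidth(w, n);
    free(w);
    return 0;
}
```

In a __UTF-8__ locale, the byte count from strlen(3), the character count, and the column count of the same string can all differ.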


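The official ESC sequence to switch from an __ISO 2022__
encoding scheme (as used for instance by VT100 terminals) to
__UTF-8__ is ESC % G ("\x1b%G"). The corresponding return
sequence from __UTF-8__ to ISO 2022 is ESC % @ ("\x1b%@").
Other ISO 2022 sequences (such as for switching the G0 and
G1 sets) are not applicable in __UTF-8__ mode.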


It can be hoped that in the foreseeable future, __UTF-8__
will replace __ASCII__ and __ISO 8859__ at all levels
as the common character encoding on POSIX systems, leading
to a significantly richer environment for handling plain
text.
!!SECURITY


The __Unicode__ and __UCS__ standards require that
producers of __UTF-8__ shall use the shortest form
possible, e.g., producing a two-byte sequence with first
byte 0xc0 is non-conforming. __Unicode 3.1__ has added
the requirement that conforming programs must not accept
non-shortest forms in their input. This is for security
reasons: if user input is checked for possible security
violations, a program might check only for the __ASCII__
version of "/../" or ";" or NUL and overlook that there are
many non-__ASCII__ ways to represent these things in a
non-shortest __UTF-8__ encoding.
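
A decoder that enforces the shortest-form rule might look like the sketch below. The name utf8_decode_strict is hypothetical, and the code checks only structure and sequence length; it does not filter surrogates or non-characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one character from the n bytes at s.
 * On success stores the code in *out and returns the bytes consumed.
 * Returns 0 for truncated, malformed or non-shortest input. */
size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code that legitimately needs a sequence of each length. */
    static const uint32_t min_code[7] =
        {0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000};
    size_t len, i;
    uint32_t code;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    }
    else if (s[0] < 0xc0) return 0;   /* stray continuation byte */
    else if (s[0] < 0xe0) { len = 2; code = s[0] & 0x1f; }
    else if (s[0] < 0xf0) { len = 3; code = s[0] & 0x0f; }
    else if (s[0] < 0xf8) { len = 4; code = s[0] & 0x07; }
    else if (s[0] < 0xfc) { len = 5; code = s[0] & 0x03; }
    else if (s[0] < 0xfe) { len = 6; code = s[0] & 0x01; }
    else return 0;                    /* 0xfe and 0xff are never valid */

    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            return 0;
        code = (code << 6) | (s[i] & 0x3f);
    }
    if (code < min_code[len])
        return 0;                     /* non-shortest (overlong) form */
    *out = code;
    return len;
}
```

With this check, the two-byte sequence 0xc0 0xaf, an overlong spelling of '/', is rejected instead of being silently decoded to 0x2f.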
!!STANDARDS


ISO/IEC 10646-1:2000, Unicode 3.1, RFC 2279, Plan
9.
!!AUTHOR


Markus Kuhn
!!SEE ALSO


nl_langinfo(3), setlocale(3),
charsets(7), unicode(7)
----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.
Last edited on Tuesday, June 4, 2002 12:31:00 am by "perry"