perlunicode
PERLUNICODE(E)   Perl Programmers Reference Guide  PERLUNICODE(E)



NAME
       perlunicode - Unicode support in Perl (EXPERIMENTAL, sub-
       ject to change)

DESCRIPTION
       Important Caveat

           WARNING:  As of the 5.6.1 release, the implementation of Unicode
           support in Perl is incomplete, and continues to be highly experimental.

       The following areas need further work.  They are being
       rapidly addressed in the 5.7.x development branch.

       Input and Output Disciplines
           There is currently no easy way to mark data read from
           a file or other external source as being utf8.  This
           will be one of the major areas of focus in the near
           future.

       Regular Expressions
           The existing regular expression compiler does not pro-
           duce polymorphic opcodes.  This means that the deter-
           mination on whether to match Unicode characters is
           made when the pattern is compiled, based on whether
           the pattern contains Unicode characters, and not when
           the matching happens at run time.  This needs to be
           changed to adaptively match Unicode if the string to
           be matched is Unicode.

       "use utf8" still needed to enable a few features
           The "utf8" pragma implements the tables used for Uni-
           code support.  These tables are automatically loaded
           on demand, so the "utf8" pragma need not normally be
           used.

           However, as a compatibility measure, this pragma must
           be explicitly used to enable recognition of UTF-8
           encoded literals and identifiers in the source text.

       Byte and Character semantics

       Beginning with version 5.6, Perl uses logically wide char-
       acters to represent strings internally.  This internal
       representation of strings uses the UTF-8 encoding.

       In future, Perl-level operations can be expected to work
       with characters rather than bytes, in general.

       However, as strictly an interim compatibility measure,
       Perl v5.6 aims to provide a safe migration path from byte
       semantics to character semantics for programs.  For opera-
       tions where Perl can unambiguously decide that the input
       data is characters, Perl now switches to character seman-
       tics.  For operations where this determination cannot be
       made without additional information from the user, Perl
       decides in favor of compatibility, and chooses to use byte
       semantics.

       This behavior preserves compatibility with earlier ver-
       sions of Perl, which allowed byte semantics in Perl opera-
       tions, but only as long as none of the program's inputs
       are marked as being as source of Unicode character data.
       Such data may come from filehandles, from calls to exter-
       nal programs, from information provided by the system
       (such as %ENV), or from literals and constants in the
       source text.

       If the "-C" command line switch is used, (or the
       ${^WIDE_SYSTEM_CALLS} global flag is set to 1), all system
       calls will use the corresponding wide character APIs.
       This is currently only implemented on Windows.

       Regardless of the above, the "bytes" pragma can always be
       used to force byte semantics in a particular lexical
       scope.  See bytes.

       The "utf8" pragma is primarily a compatibility device that
       enables recognition of UTF-8 in literals encountered by
       the parser.  It may also be used for enabling some of the
       more experimental Unicode support features.  Note that
       this pragma is only required until a future version of
       Perl in which character semantics will become the default.
       This pragma may then become a no-op.  See utf8.

       Unless mentioned otherwise, Perl operators will use char-
       acter semantics when they are dealing with Unicode data,
       and byte semantics otherwise.  Thus, character semantics
       for these operations apply transparently; if the input
       data came from a Unicode source (for example, by adding a
       character encoding discipline to the filehandle whence it
       came, or a literal UTF-8 string constant in the program),
       character semantics apply; otherwise, byte semantics are
       in effect.  To force byte semantics on Unicode data, the
       "bytes" pragma should be used.

       Under character semantics, many operations that formerly
       operated on bytes change to operating on characters.  For
       ASCII data this makes no difference, because UTF-8 stores
       ASCII in single bytes, but for any character greater than
       "chr(r)", the character may be stored in a sequence of
       two or more bytes, all of which have the high bit set.
       But by and large, the user need not worry about this,
       because Perl hides it from the user.  A character in Perl
       is logically just a number ranging from 0 to 2**32 or so.
       Larger characters encode to longer sequences of bytes
       internally, but again, this is just an internal detail
       which is hidden at the Perl level.

       Effects of character semantics

       Character semantics have the following effects:

       o   Strings and patterns may contain characters that have
           an ordinal value larger than 255.

           Presuming you use a Unicode editor to edit your pro-
           gram, such characters will typically occur directly
           within the literal strings as UTF-8 characters, but
           you can also specify a particular character with an
           extension of the "\x" notation.  UTF-8 characters are
           specified by putting the hexadecimal code within
           curlies after the "\x".  For instance, a Unicode smi-
           ley face is "\x{263A}".

       o   Identifiers within the Perl script may contain Unicode
           alphanumeric characters, including ideographs.  (You
           are currently on your own when it comes to using the
           canonical forms of characters--Perl doesn't (yet)
           attempt to canonicalize variable names for you.)

       o   Regular expressions match characters instead of bytes.
           For instance, "." matches a character instead of a
           byte.  (However, the "\C" pattern is provided to force
           a match a single byte (""char"" in C, hence "\C").)

       o   Character classes in regular expressions match charac-
           ters instead of bytes, and match against the character
           properties specified in the Unicode properties
           database.  So "\w" can be used to match an ideograph,
           for instance.

       o   Named Unicode properties and block ranges make be used
           as character classes via the new "\p{}" (matches prop-
           erty) and "\P{}" (doesn't match property) constructs.
           For instance, "\p{Lu}" matches any character with the
           Unicode uppercase property, while "\p{M}" matches any
           mark character.  Single letter properties may omit the
           brackets, so that can be written "\pM" also.  Many
           predefined character classes are available, such as
           "\p{IsMirrored}" and  "\p{InTibetan}".

       o   The special pattern "\X" match matches any extended
           Unicode sequence (a "combining character sequence" in
           Standardese), where the first character is a base
           character and subsequent characters are mark charac-
           ters that apply to the base character.  It is equiva-
           lent to "(?:\PM\pM*)".

       o   The "tr///" operator translates characters instead of
           bytes.  Note that the "tr///CU" functionality has been
           removed, as the interface was a mistake.  For similar
           functionality see pack('U0', ...) and pack('C0', ...).

       o   Case translation operators use the Unicode case trans-
           lation tables when provided character input.  Note
           that "uc()" translates to uppercase, while "ucfirst"
           translates to titlecase (for languages that make the
           distinction).  Naturally the corresponding backslash
           sequences have the same semantics.

       o   Most operators that deal with positions or lengths in
           the string will automatically switch to using charac-
           ter positions, including "chop()", "substr()",
           "pos()", "index()", "rindex()", "sprintf()",
           "write()", and "length()".  Operators that specifi-
           cally don't switch include "vec()", "pack()", and
           "unpack()".  Operators that really don't care include
           "chomp()", as well as any other operator that treats a
           string as a bucket of bits, such as "sort()", and the
           operators dealing with filenames.

       o   The "pack()"/"unpack()" letters ""c"" and ""C"" do not
           change, since they're often used for byte-oriented
           formats.  (Again, think ""char"" in the C language.)
           However, there is a new ""U"" specifier that will con-
           vert between UTF-8 characters and integers.  (It works
           outside of the utf8 pragma too.)

       o   The "chr()" and "ord()" functions work on characters.
           This is like "pack("U")" and "unpack("U")", not like
           "pack("C")" and "unpack("C")".  In fact, the latter
           are how you now emulate byte-oriented "chr()" and
           "ord()" under utf8.

       o   The bit string operators "& | ^ ~" can operate on
           character data.  However, for backward compatibility
           reasons (bit string operations when the characters all
           are less than 256 in ordinal value) one cannot mix "~"
           (the bit complement) and characters both less than 256
           and equal or greater than 256.  Most importantly, the
           DeMorgan's laws ("~($x|$y) eq ~$x&~$y", "~($x&$y) eq
           ~$x|~$y") won't hold.  Another way to look at this is
           that the complement cannot return both the 8-bit
           (byte) wide bit complement, and the full character
           wide bit complement.

       o   And finally, "scalar reverse()" reverses by character
           rather than by byte.

       Character encodings for input and output

       [XXX: This feature is not yet implemented.]

CAVEATS
       As of yet, there is no method for automatically coercing
       input and output to some encoding other than UTF-8.  This
       is planned in the near future, however.

       Whether an arbitrary piece of data will be treated as
       "characters" or "bytes" by internal operations cannot be
       divined at the current time.

       Use of locales with utf8 may lead to odd results.  Cur-
       rently there is some attempt to apply 8-bit locale info to
       characters in the range 0..255, but this is demonstrably
       incorrect for locales that use characters above that range
       (when mapped into Unicode).  It will also tend to run
       slower.  Avoidance of locales is strongly encouraged.

SEE ALSO
       bytes, utf8, "${^WIDE_SYSTEM_CALLS}" in perlvar



perl v5.6.1                 2001-03-19             PERLUNICODE(E)