Version 2; author: perry.
PERLUNICODE
!!!PERLUNICODE
NAME
DESCRIPTION
CAVEATS
SEE ALSO
----
!!NAME

perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)
!!DESCRIPTION

__Important Caveat__

WARNING: As of the 5.6.1 release, the implementation of Unicode
support in Perl is incomplete, and continues to be highly experimental.
The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.

Input and Output Disciplines

There is currently no easy way to mark data read from a file
or other external source as being utf8. This will be one of
the major areas of focus in the near future.

Regular Expressions

The existing regular expression compiler does not produce
polymorphic opcodes. This means that the determination of
whether to match Unicode characters is made when the pattern
is compiled, based on whether the pattern contains Unicode
characters, and not when the matching happens at run time.
This needs to be changed to adaptively match Unicode if the
string to be matched is Unicode.

use utf8 still needed to enable a few features

The utf8 pragma implements the tables used for
Unicode support. These tables are automatically loaded on
demand, so the utf8 pragma need not normally be
used.

However, as a compatibility measure, this pragma must be
explicitly used to enable recognition of
UTF-8 encoded literals and identifiers in the
source text.

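The effect of the utf8 pragma on literals can be sketched as follows. This is a minimal illustration, assuming the script file itself is saved as UTF-8 and a reasonably recent perl; given the caveats above, behaviour on 5.6.1 itself may differ. The variable names are illustrative only.

```perl
# Assumption: this file is saved as UTF-8.
my ($with, $without);
{
    use utf8;                   # literals in this scope are read as UTF-8 characters
    $with = length("☺");        # one character, U+263A
}
{
    no utf8;                    # the same literal is seen as raw bytes
    $without = length("☺");     # the three bytes of its UTF-8 encoding
}
printf "with utf8: %d, without: %d\n", $with, $without;
```

The pragma is lexically scoped, so only the first block sees the literal as a single character.
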
__Byte and Character semantics__

Beginning with version 5.6, Perl uses logically wide
characters to represent strings internally. This internal
representation of strings uses the UTF-8
encoding.

In future, Perl-level operations can be expected to work
with characters rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl
v5.6 aims to provide a safe migration path from byte
semantics to character semantics for programs. For
operations where Perl can unambiguously decide that the
input data is characters, Perl now switches to character
semantics. For operations where this determination cannot be
made without additional information from the user, Perl
decides in favor of compatibility, and chooses to use byte
semantics.

This behavior preserves compatibility with earlier versions
of Perl, which allowed byte semantics in Perl operations,
but only as long as none of the program's inputs are marked
as being a source of Unicode character data. Such data may
come from filehandles, from calls to external programs, from
information provided by the system (such as %ENV),
or from literals and constants in the source
text.

If the -C command line switch is used (or the
${^WIDE_SYSTEM_CALLS} global flag is set to 1), all
system calls will use the corresponding wide character APIs.
This is currently only implemented on Windows.

Regardless of the above, the bytes pragma can
always be used to force byte semantics in a particular
lexical scope. See bytes.

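A short sketch of forcing byte semantics in one lexical scope, assuming a reasonably recent perl. The variable names are illustrative.

```perl
my $s = "\x{263A}";                          # one character above 255

my $chars = length($s);                      # character semantics: counts characters
my $bytes = do { use bytes; length($s) };    # byte semantics, inside this block only

printf "%d character, %d bytes\n", $chars, $bytes;
```

Outside the do block, length() goes back to counting characters, because the pragma's effect ends with its enclosing scope.
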
The utf8 pragma is primarily a compatibility device
that enables recognition of UTF-8 in literals
encountered by the parser. It may also be used for enabling
some of the more experimental Unicode support features. Note
that this pragma is only required until a future version of
Perl in which character semantics will become the default.
This pragma may then become a no-op. See utf8.

Unless mentioned otherwise, Perl operators will use
character semantics when they are dealing with Unicode data,
and byte semantics otherwise. Thus, character semantics for
these operations apply transparently; if the input data came
from a Unicode source (for example, by adding a character
encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the
program), character semantics apply; otherwise, byte
semantics are in effect. To force byte semantics on Unicode
data, the bytes pragma should be used.

Under character semantics, many operations that formerly
operated on bytes change to operating on characters. For
ASCII data this makes no difference, because
UTF-8 stores ASCII in single
bytes, but for any character greater than chr(127),
the character may be stored in a sequence of two or more
bytes, all of which have the high bit set. But by and large,
the user need not worry about this, because Perl hides it
from the user. A character in Perl is logically just a
number ranging from 0 to 2**32 or so. Larger characters
encode to longer sequences of bytes internally, but again,
this is just an internal detail which is hidden at the Perl
level.

__Effects of character semantics__

Character semantics have the following effects:

Strings and patterns may contain characters that have an
ordinal value larger than 255.

Presuming you use a Unicode editor to edit your program,
such characters will typically occur directly within the
literal strings as UTF-8 characters, but you
can also specify a particular character with an extension of
the \x notation. UTF-8 characters are
specified by putting the hexadecimal code within curlies
after the \x. For instance, a Unicode smiley face is
\x{263A}.

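As a small illustration of the \x{...} notation, assuming a reasonably recent perl (variable names are illustrative):

```perl
my $smiley   = "\x{263A}";        # the smiley face from the text
my $greeting = "hi \x{263A}";     # the escape can be mixed with ordinary text

# Print only numbers, to avoid "Wide character" warnings on output.
printf "U+%04X, length %d\n", ord($smiley), length($smiley);
printf "greeting length %d\n", length($greeting);
```

The escape needs no Unicode-aware editor, since the source text itself stays plain ASCII.
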
Identifiers within the Perl script may contain Unicode
alphanumeric characters, including ideographs. (You are
currently on your own when it comes to using the canonical
forms of characters--Perl doesn't (yet) attempt to
canonicalize variable names for you.)

Regular expressions match characters instead of bytes. For
instance, ``.'' matches a character instead of a byte.
(However, the \C pattern is provided to force a
match of a single byte ("char" in C, hence
\C).)

Character classes in regular expressions match characters
instead of bytes, and match against the character properties
specified in the Unicode properties database. So \w
can be used to match an ideograph, for
instance.

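The two points above can be checked with a single ideograph. This is a sketch assuming a reasonably recent perl; the choice of U+4E2D and the variable names are illustrative.

```perl
my $ideograph = "\x{4E2D}";       # a CJK ideograph, several bytes in UTF-8

# "." consumes the whole character, so the anchored pattern matches.
my $dot_ok  = ($ideograph =~ /^.$/)  ? 1 : 0;

# \w consults the Unicode properties, so the ideograph counts as a word character.
my $word_ok = ($ideograph =~ /^\w$/) ? 1 : 0;

printf "dot: %d, word: %d\n", $dot_ok, $word_ok;
```
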
Named Unicode properties and block ranges may be used as
character classes via the new \p{} (matches
property) and \P{} (doesn't match property)
constructs. For instance, \p{Lu} matches any
character with the Unicode uppercase property, while
\p{M} matches any mark character. Single letter
properties may omit the brackets, so that can be written
\pM also. Many predefined character classes are
available, such as \p{!IsMirrored} and
\p{!InTibetan}.

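A brief sketch of the property constructs, assuming a reasonably recent perl. The sample characters and variable names are illustrative.

```perl
my $upper = "\x{0410}";           # CYRILLIC CAPITAL LETTER A
my $mark  = "\x{0301}";           # COMBINING ACUTE ACCENT

my $is_upper   = ($upper =~ /^\p{Lu}$/) ? 1 : 0;   # has the uppercase property
my $is_mark    = ($mark  =~ /^\pM$/)    ? 1 : 0;   # single-letter form, no brackets
my $not_a_mark = ($upper =~ /^\PM$/)    ? 1 : 0;   # \P negates the property
my $lower_is_upper = ("a" =~ /\p{Lu}/)  ? 1 : 0;   # lowercase "a" does not match

printf "%d %d %d %d\n", $is_upper, $is_mark, $not_a_mark, $lower_is_upper;
```
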
The special pattern \X matches any extended
Unicode sequence (a ``combining character sequence'' in
Standardese), where the first character is a base character
and subsequent characters are mark characters that apply to
the base character. It is equivalent to
(?:\PM\pM*).

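A sketch of \X next to the expansion given above, assuming a reasonably recent perl. Newer perls define \X more elaborately than (?:\PM\pM*), but for this simple input the two agree. Variable names are illustrative.

```perl
my $text = "e\x{0301}x";          # "e" plus a combining acute accent, then "x"

my @clusters = $text =~ /(\X)/g;             # base character with its marks
my @expanded = $text =~ /((?:\PM\pM*))/g;    # the expansion from the text

printf "%d sequences, first one is %d characters long\n",
    scalar(@clusters), length($clusters[0]);
```

The accented "e" comes back as one sequence of two characters, and the plain "x" as a second sequence.
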
The tr/// operator translates characters instead of
bytes. Note that the tr///CU functionality has been
removed, as the interface was a mistake. For similar
functionality see pack('U0', ...) and pack('C0',
...).

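For illustration, tr/// applied to characters above 255, assuming a reasonably recent perl (variable names are illustrative):

```perl
my $s = "\x{263A} and \x{263A}";

# Translate each U+263A into U+263B on a copy of the string.
(my $t = $s) =~ tr/\x{263A}/\x{263B}/;

# With an empty replacement list, tr/// just counts matching characters.
my $count = ($s =~ tr/\x{263A}//);

printf "count %d, first is now U+%04X, length %d\n",
    $count, ord(substr($t, 0, 1)), length($t);
```

The length is unchanged by the translation, since each character is replaced as a whole rather than byte by byte.
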
Case translation operators use the Unicode case translation
tables when provided character input. Note that
uc() translates to uppercase, while
ucfirst() translates to titlecase (for languages that
make the distinction). Naturally the corresponding backslash
sequences have the same semantics.

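The uppercase and titlecase distinction shows up with a digraph such as U+01C6. This is a sketch assuming a reasonably recent perl; the sample characters and variable names are illustrative.

```perl
my $alpha = "\x{03B1}";           # GREEK SMALL LETTER ALPHA
my $dz    = "\x{01C6}";           # LATIN SMALL LETTER DZ WITH CARON

my $upper_alpha = uc($alpha);     # U+0391, the capital alpha
my $upper_dz    = uc($dz);        # U+01C4, fully uppercase
my $title_dz    = ucfirst($dz);   # U+01C5, titlecase
my $escaped     = "\u$dz";        # backslash sequence, same as ucfirst()

printf "U+%04X U+%04X U+%04X U+%04X\n",
    ord($upper_alpha), ord($upper_dz), ord($title_dz), ord($escaped);
```
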
Most operators that deal with positions or lengths in the
string will automatically switch to using character
positions, including chop(), substr(),
pos(), index(), rindex(),
sprintf(), write(), and length().
Operators that specifically don't switch include
vec(), pack(), and unpack().
Operators that really don't care include chomp(),
as well as any other operator that treats a string as a
bucket of bits, such as sort(), and the operators
dealing with filenames.

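A few of the position and length operators in action, as a sketch assuming a reasonably recent perl (variable names are illustrative):

```perl
my $s = "a\x{263A}b";             # three characters, five bytes in UTF-8

my $len   = length($s);           # counts characters
my $pos_b = index($s, "b");       # character offset, not byte offset
my $mid   = substr($s, 1, 1);     # extracts the whole smiley

my $t       = "ab\x{263A}";
my $removed = chop($t);           # removes one character, not one byte

printf "length %d, index %d, middle U+%04X, chopped U+%04X\n",
    $len, $pos_b, ord($mid), ord($removed);
```
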
The pack()/unpack() letters
"c" and "C" do not
change, since they're often used for byte-oriented formats.
(Again, think "char" in the C language.)
However, there is a new "U" specifier that will convert between
UTF-8 characters and
integers. (It works outside of the utf8 pragma
too.)

The chr() and ord() functions work on
characters. This is like pack("U") and
unpack("U"), not like
pack("C") and
unpack("C"). In fact, the latter are how
you now emulate byte-oriented chr() and
ord() under utf8.

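The correspondence between chr()/ord() and the "U" pack letter can be sketched like this, assuming a reasonably recent perl (variable names are illustrative):

```perl
my $cp = 0x263A;

my $from_chr  = chr($cp);                     # character from a code point
my $from_pack = pack("U", $cp);               # the same character via pack
my @codes     = unpack("U*", "\x{263A}a");    # back to integers: 0x263A, 0x61

my $byte = pack("C", 0x41);                   # "C" still packs a single byte

printf "same: %d, codes: %s\n",
    ($from_chr eq $from_pack) ? 1 : 0,
    join(" ", map { sprintf "U+%04X", $_ } @codes);
```
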
The bit string operators & | ^ ~ can operate on
character data. However, for backward compatibility reasons
(bit string operations when the characters all are less than
256 in ordinal value) one cannot mix ~ (the bit
complement) and characters both less than 256 and equal or
greater than 256. Most importantly, the !DeMorgan's laws
(~($x|$y) eq ~$x&~$y, ~($x&$y) eq ~$x|~$y)
won't hold. Another way to look at this is that
the complement cannot return __both__ the 8-bit (byte)
wide bit complement, and the full character wide bit
complement.

And finally, scalar reverse() reverses by character
rather than by byte.

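A minimal sketch of scalar reverse() on character data, assuming a reasonably recent perl (variable names are illustrative):

```perl
my $s = "ab\x{263A}";
my $r = scalar reverse($s);       # scalar context: reverses the string itself

printf "first is U+%04X, rest is %s, length %d\n",
    ord(substr($r, 0, 1)), substr($r, 1), length($r);
```

The smiley moves to the front intact. A byte-wise reversal would instead have scrambled its UTF-8 encoding.
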
__Character encodings for input and output__

[[ XXX: This feature is not yet
implemented.]
!!CAVEATS

As of yet, there is no method for automatically coercing
input and output to some encoding other than
UTF-8. This is planned in the near future,
however.

Whether an arbitrary piece of data will be treated as
``characters'' or ``bytes'' by internal operations cannot be
divined at the current time.

Use of locales with utf8 may lead to odd results. Currently
there is some attempt to apply 8-bit locale info to
characters in the range 0..255, but this is demonstrably
incorrect for locales that use characters above that range
(when mapped into Unicode). It will also tend to run slower.
Avoidance of locales is strongly encouraged.
!!SEE ALSO

bytes, utf8, ``${^WIDE_SYSTEM_CALLS}'' in
perlvar
----