Rev | Author | # | Line |
---|---|---|---|
1 | perry | 1 | PERLUNICODE |
2 | !!!PERLUNICODE | ||
3 | NAME | ||
4 | DESCRIPTION | ||
5 | CAVEATS | ||
6 | SEE ALSO | ||
7 | ---- | ||
8 | !!NAME | ||
9 | |||
10 | |||
11 | perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change) | ||
12 | !!DESCRIPTION | ||
13 | |||
14 | |||
15 | __Important Caveat__ | ||
16 | |||
17 | |||
18 | WARNING: As of the 5.6.1 release, the implementation of Unicode | ||
19 | support in Perl is incomplete, and continues to be highly experimental. | ||
20 | The following areas need further work. They are being rapidly addressed in the 5.7.x development branch. | ||
21 | |||
22 | |||
23 | Input and Output Disciplines | ||
24 | |||
25 | |||
26 | There is currently no easy way to mark data read from a file | ||
27 | or other external source as being utf8. This will be one of | ||
28 | the major areas of focus in the near future. | ||
29 | |||
30 | |||
31 | Regular Expressions | ||
32 | |||
33 | |||
34 | The existing regular expression compiler does not produce | ||
35 | polymorphic opcodes. This means that the determination of | ||
36 | whether to match Unicode characters is made when the pattern | ||
37 | is compiled, based on whether the pattern contains Unicode | ||
38 | characters, and not when the matching happens at run time. | ||
39 | This needs to be changed to adaptively match Unicode if the | ||
40 | string to be matched is Unicode. | ||
41 | |||
42 | |||
43 | use utf8 still needed to enable a few | ||
44 | features | ||
45 | |||
46 | |||
47 | The utf8 pragma implements the tables used for | ||
48 | Unicode support. These tables are automatically loaded on | ||
49 | demand, so the utf8 pragma need not normally be | ||
50 | used. | ||
51 | |||
52 | |||
53 | However, as a compatibility measure, this pragma must be | ||
54 | explicitly used to enable recognition of | ||
55 | UTF-8 encoded literals and identifiers in the | ||
56 | source text. | ||
57 | |||
58 | |||
59 | __Byte and Character semantics__ | ||
60 | |||
61 | |||
62 | Beginning with version 5.6, Perl uses logically wide | ||
63 | characters to represent strings internally. This internal | ||
64 | representation of strings uses the UTF-8 | ||
65 | encoding. | ||
66 | |||
67 | |||
68 | In future, Perl-level operations can be expected to work | ||
69 | with characters rather than bytes, in general. | ||
70 | |||
71 | |||
72 | However, as strictly an interim compatibility measure, Perl | ||
73 | v5.6 aims to provide a safe migration path from byte | ||
74 | semantics to character semantics for programs. For | ||
75 | operations where Perl can unambiguously decide that the | ||
76 | input data is characters, Perl now switches to character | ||
77 | semantics. For operations where this determination cannot be | ||
78 | made without additional information from the user, Perl | ||
79 | decides in favor of compatibility, and chooses to use byte | ||
80 | semantics. | ||
81 | |||
82 | |||
83 | This behavior preserves compatibility with earlier versions | ||
84 | of Perl, which allowed byte semantics in Perl operations, | ||
85 | but only as long as none of the program's inputs are marked | ||
86 | as being a source of Unicode character data. Such data may | ||
87 | come from filehandles, from calls to external programs, from | ||
88 | information provided by the system (such as %ENV), | ||
89 | or from literals and constants in the source | ||
90 | text. | ||
91 | |||
92 | |||
93 | If the -C command line switch is used (or the | ||
94 | ${^WIDE_SYSTEM_CALLS} global flag is set to 1), all | ||
95 | system calls will use the corresponding wide character APIs. | ||
96 | This is currently only implemented on Windows. | ||
97 | |||
98 | |||
99 | Regardless of the above, the bytes pragma can | ||
100 | always be used to force byte semantics in a particular | ||
101 | lexical scope. See bytes. | ||
102 | |||
103 | |||
104 | The utf8 pragma is primarily a compatibility device | ||
105 | that enables recognition of UTF-8 in literals | ||
106 | encountered by the parser. It may also be used for enabling | ||
107 | some of the more experimental Unicode support features. Note | ||
108 | that this pragma is only required until a future version of | ||
109 | Perl in which character semantics will become the default. | ||
110 | This pragma may then become a no-op. See utf8. | ||
111 | |||
112 | |||
113 | Unless mentioned otherwise, Perl operators will use | ||
114 | character semantics when they are dealing with Unicode data, | ||
115 | and byte semantics otherwise. Thus, character semantics for | ||
116 | these operations apply transparently; if the input data came | ||
117 | from a Unicode source (for example, by adding a character | ||
118 | encoding discipline to the filehandle whence it came, or a | ||
119 | literal UTF-8 string constant in the | ||
120 | program), character semantics apply; otherwise, byte | ||
121 | semantics are in effect. To force byte semantics on Unicode | ||
122 | data, the bytes pragma should be used. | ||
123 | |||
124 | |||
125 | Under character semantics, many operations that formerly | ||
126 | operated on bytes change to operating on characters. For | ||
127 | ASCII data this makes no difference, because | ||
128 | UTF-8 stores ASCII in single | ||
129 | bytes, but for any character greater than chr(127), | ||
130 | the character may be stored in a sequence of two or more | ||
131 | bytes, all of which have the high bit set. But by and large, | ||
132 | the user need not worry about this, because Perl hides it | ||
133 | from the user. A character in Perl is logically just a | ||
134 | number ranging from 0 to 2**32 or so. Larger characters | ||
135 | encode to longer sequences of bytes internally, but again, | ||
136 | this is just an internal detail which is hidden at the Perl | ||
137 | level. | ||
138 | |||
139 | |||
140 | __Effects of character semantics__ | ||
141 | |||
142 | |||
143 | Character semantics have the following effects: | ||
144 | |||
145 | |||
146 | Strings and patterns may contain characters that have an | ||
147 | ordinal value larger than 255. | ||
148 | |||
149 | |||
150 | Presuming you use a Unicode editor to edit your program, | ||
151 | such characters will typically occur directly within the | ||
152 | literal strings as UTF-8 characters, but you | ||
153 | can also specify a particular character with an extension of | ||
154 | the \x notation. UTF-8 characters are | ||
155 | specified by putting the hexadecimal code within curlies | ||
156 | after the \x. For instance, a Unicode smiley face is | ||
157 | \x{263A}. | ||
158 | |||
159 | |||
160 | Identifiers within the Perl script may contain Unicode | ||
161 | alphanumeric characters, including ideographs. (You are | ||
162 | currently on your own when it comes to using the canonical | ||
163 | forms of characters--Perl doesn't (yet) attempt to | ||
164 | canonicalize variable names for you.) | ||
165 | |||
166 | |||
167 | Regular expressions match characters instead of bytes. For | ||
168 | instance, ``.'' matches a character instead of a byte. | ||
169 | (However, the \C pattern is provided to force a | ||
170 | match of a single byte (``char'' in C, hence | ||
171 | \C).) | ||
172 | |||
173 | |||
174 | Character classes in regular expressions match characters | ||
175 | instead of bytes, and match against the character properties | ||
176 | specified in the Unicode properties database. So \w | ||
177 | can be used to match an ideograph, for | ||
178 | instance. | ||
179 | |||
180 | |||
181 | Named Unicode properties and block ranges may be used as | ||
182 | character classes via the new \p{} (matches | ||
183 | property) and \P{} (doesn't match property) | ||
184 | constructs. For instance, \p{Lu} matches any | ||
185 | character with the Unicode uppercase property, while | ||
186 | \p{M} matches any mark character. Single letter | ||
187 | properties may omit the brackets, so that can be written | ||
188 | \pM also. Many predefined character classes are | ||
2 | perry | 189 | available, such as \p{!IsMirrored} and |
190 | \p{!InTibetan}. | ||
1 | perry | 191 | |
192 | |||
193 | The special pattern \X matches any extended | ||
194 | Unicode sequence (a ``combining character sequence'' in | ||
195 | Standardese), where the first character is a base character | ||
196 | and subsequent characters are mark characters that apply to | ||
197 | the base character. It is equivalent to | ||
198 | (?:\PM\pM*). | ||
199 | |||
200 | |||
201 | The tr/// operator translates characters instead of | ||
202 | bytes. Note that the tr///CU functionality has been | ||
203 | removed, as the interface was a mistake. For similar | ||
204 | functionality see pack('U0', ...) and pack('C0', | ||
205 | ...). | ||
206 | |||
207 | |||
208 | Case translation operators use the Unicode case translation | ||
209 | tables when provided character input. Note that | ||
210 | uc() translates to uppercase, while | ||
211 | ucfirst translates to titlecase (for languages that | ||
212 | make the distinction). Naturally the corresponding backslash | ||
213 | sequences have the same semantics. | ||
214 | |||
215 | |||
216 | Most operators that deal with positions or lengths in the | ||
217 | string will automatically switch to using character | ||
218 | positions, including chop(), substr(), | ||
219 | pos(), index(), rindex(), | ||
220 | sprintf(), write(), and length(). | ||
221 | Operators that specifically don't switch include | ||
222 | vec(), pack(), and unpack(). | ||
223 | Operators that really don't care include chomp(), | ||
224 | as well as any other operator that treats a string as a | ||
225 | bucket of bits, such as sort(), and the operators | ||
226 | dealing with filenames. | ||
227 | |||
228 | |||
229 | The pack()/unpack() letters | ||
230 | ``c'' and ``C'' do not | ||
231 | change, since they're often used for byte-oriented formats. | ||
232 | (Again, think ``char'' in the C language.) | ||
233 | However, there is a new ``U'' specifier that will convert between | ||
234 | UTF-8 characters and | ||
235 | integers. (It works outside of the utf8 pragma | ||
236 | too.) | ||
237 | |||
238 | |||
239 | The chr() and ord() functions work on | ||
240 | characters. This is like pack("U") and | ||
241 | unpack("U"), not like | ||
242 | pack("C") and | ||
243 | unpack("C"). In fact, the latter are how | ||
244 | you now emulate byte-oriented chr() and | ||
245 | ord() under utf8. | ||
246 | |||
247 | |||
248 | The bit string operators can operate on | ||
249 | character data. However, for backward compatibility reasons | ||
250 | (bit string operations when the characters all are less than | ||
251 | 256 in ordinal value) one cannot mix ~ (the bit | ||
252 | complement) and characters both less than 256 and equal or | ||
2 | perry | 253 | greater than 256. Most importantly, the !DeMorgan's laws |
1 | perry | 254 | (~($x \| $y) eq ~$x & ~$y, ~($x & $y) eq ~$x \| ~$y |
255 | ) won't hold. Another way to look at this is that | ||
256 | the complement cannot return __both__ the 8-bit (byte) | ||
257 | wide bit complement, and the full character wide bit | ||
258 | complement. | ||
259 | |||
260 | |||
261 | And finally, scalar reverse() reverses by character | ||
262 | rather than by byte. | ||
263 | |||
264 | |||
265 | __Character encodings for input and output__ | ||
266 | |||
267 | |||
268 | [[ XXX: This feature is not yet | ||
269 | implemented.] | ||
270 | !!CAVEATS | ||
271 | |||
272 | |||
273 | As of yet, there is no method for automatically coercing | ||
274 | input and output to some encoding other than | ||
275 | UTF-8. This is planned in the near future, | ||
276 | however. | ||
277 | |||
278 | |||
279 | Whether an arbitrary piece of data will be treated as | ||
280 | ``characters'' or ``bytes'' by internal operations cannot be | ||
281 | divined at the current time. | ||
282 | |||
283 | |||
284 | Use of locales with utf8 may lead to odd results. Currently | ||
285 | there is some attempt to apply 8-bit locale info to | ||
286 | characters in the range 0..255, but this is demonstrably | ||
287 | incorrect for locales that use characters above that range | ||
288 | (when mapped into Unicode). It will also tend to run slower. | ||
289 | Avoidance of locales is strongly encouraged. | ||
290 | !!SEE ALSO | ||
291 | |||
292 | |||
293 | bytes, utf8, ``${^WIDE_SYSTEM_CALLS}'' in | ||
294 | perlvar | ||
295 | ---- |
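
The effects of character semantics listed in the DESCRIPTION can be checked with a short script. What follows is only a sketch, and it assumes a reasonably modern Perl (5.8 or later) rather than the experimental 5.6.1 implementation described on this page, so behavior on 5.6.1 itself may differ. The variable names are illustrative and do not come from the original text.

```perl
use strict;
use warnings;

# A string containing a character above 255: U+263A, written with \x{...}.
my $s = "a\x{263A}b";

# Character semantics: length() counts characters, not bytes.
my $chars = length($s);

# The bytes pragma forces byte semantics within its lexical scope.
my $bytes = do { use bytes; length($s) };

# ord() and chr() work on characters, like unpack("U") and pack("U").
my $code   = ord(substr($s, 1, 1));
my $packed = pack("U", 0x263A);

# Scalar reverse() reverses by character rather than by byte.
my $rev = scalar reverse($s);

# uc() maps to uppercase, ucfirst() to titlecase. U+01C6 is a letter
# for which the two differ (U+01C4 upper, U+01C5 title).
my $up    = uc("\x{01C6}");
my $title = ucfirst("\x{01C6}");

# \p{Lu} matches a character with the uppercase property
# (U+0391, GREEK CAPITAL LETTER ALPHA).
my $is_upper = ("\x{0391}" =~ /^\p{Lu}$/) ? 1 : 0;

# \X matches a base character plus its marks as one unit
# ("e" followed by U+0301 COMBINING ACUTE ACCENT, then "x").
my @clusters = ("e\x{0301}x" =~ /(\X)/g);

printf "%d %d %04X %d %d\n",
    $chars, $bytes, $code, $is_upper, scalar @clusters;
```

On such a Perl this should print `3 5 263A 1 2`: three characters stored in five bytes, the smiley's code point, a successful \p{Lu} match, and two extended sequences.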