Blame: perlunicode(1)
Annotated edit history of perlunicode(1) version 2, including all changes.
Rev Author # Line
1 perry 1 PERLUNICODE
2 !!!PERLUNICODE
3 NAME
4 DESCRIPTION
5 CAVEATS
6 SEE ALSO
7 ----
8 !!NAME
9
10
11 perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)
12 !!DESCRIPTION
13
14
15 __Important Caveat__
16
17
18 WARNING: As of the 5.6.1 release, the implementation of Unicode
19 support in Perl is incomplete, and continues to be highly experimental.
20 The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.
21
22
23 Input and Output Disciplines
24
25
26 There is currently no easy way to mark data read from a file
27 or other external source as being utf8. This will be one of
28 the major areas of focus in the near future.
29
30
31 Regular Expressions
32
33
34 The existing regular expression compiler does not produce
35 polymorphic opcodes. This means that the determination on
36 whether to match Unicode characters is made when the pattern
37 is compiled, based on whether the pattern contains Unicode
38 characters, and not when the matching happens at run time.
39 This needs to be changed to adaptively match Unicode if the
40 string to be matched is Unicode.
41
42
43 use utf8 still needed to enable a few
44 features
45
46
47 The utf8 pragma implements the tables used for
48 Unicode support. These tables are automatically loaded on
49 demand, so the utf8 pragma need not normally be
50 used.
51
52
53 However, as a compatibility measure, this pragma must be
54 explicitly used to enable recognition of
55 UTF-8 encoded literals and identifiers in the
56 source text.
57
58
59 __Byte and Character semantics__
60
61
62 Beginning with version 5.6, Perl uses logically wide
63 characters to represent strings internally. This internal
64 representation of strings uses the UTF-8
65 encoding.
66
67
68 In future, Perl-level operations can be expected to work
69 with characters rather than bytes, in general.
70
71
72 However, as strictly an interim compatibility measure, Perl
73 v5.6 aims to provide a safe migration path from byte
74 semantics to character semantics for programs. For
75 operations where Perl can unambiguously decide that the
76 input data is characters, Perl now switches to character
77 semantics. For operations where this determination cannot be
78 made without additional information from the user, Perl
79 decides in favor of compatibility, and chooses to use byte
80 semantics.
81
82
83 This behavior preserves compatibility with earlier versions
84 of Perl, which allowed byte semantics in Perl operations,
85 but only as long as none of the program's inputs are marked
86 as being a source of Unicode character data. Such data may
87 come from filehandles, from calls to external programs, from
88 information provided by the system (such as %ENV),
89 or from literals and constants in the source
90 text.
91
92
93 If the -C command line switch is used (or the
94 ${^WIDE_SYSTEM_CALLS} global flag is set to 1), all
95 system calls will use the corresponding wide character APIs.
96 This is currently only implemented on Windows.
97
98
99 Regardless of the above, the bytes pragma can
100 always be used to force byte semantics in a particular
101 lexical scope. See bytes.
102
103
104 The utf8 pragma is primarily a compatibility device
105 that enables recognition of UTF-8 in literals
106 encountered by the parser. It may also be used for enabling
107 some of the more experimental Unicode support features. Note
108 that this pragma is only required until a future version of
109 Perl in which character semantics will become the default.
110 This pragma may then become a no-op. See utf8.
111
112
113 Unless mentioned otherwise, Perl operators will use
114 character semantics when they are dealing with Unicode data,
115 and byte semantics otherwise. Thus, character semantics for
116 these operations apply transparently; if the input data came
117 from a Unicode source (for example, by adding a character
118 encoding discipline to the filehandle whence it came, or a
119 literal UTF-8 string constant in the
120 program), character semantics apply; otherwise, byte
121 semantics are in effect. To force byte semantics on Unicode
122 data, the bytes pragma should be used.
123
124
125 Under character semantics, many operations that formerly
126 operated on bytes change to operating on characters. For
127 ASCII data this makes no difference, because
128 UTF-8 stores ASCII in single
129 bytes, but for any character greater than chr(127),
130 the character may be stored in a sequence of two or more
131 bytes, all of which have the high bit set. But by and large,
132 the user need not worry about this, because Perl hides it
133 from the user. A character in Perl is logically just a
134 number ranging from 0 to 2**32 or so. Larger characters
135 encode to longer sequences of bytes internally, but again,
136 this is just an internal detail which is hidden at the Perl
137 level.
138
139
140 __Effects of character semantics__
141
142
143 Character semantics have the following effects:
144
145
146 Strings and patterns may contain characters that have an
147 ordinal value larger than 255.
148
149
150 Presuming you use a Unicode editor to edit your program,
151 such characters will typically occur directly within the
152 literal strings as UTF-8 characters, but you
153 can also specify a particular character with an extension of
154 the \x notation. UTF-8 characters are
155 specified by putting the hexadecimal code within curlies
156 after the \x. For instance, a Unicode smiley face is
157 \x{263A}.
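
A minimal sketch of the \x{...} notation described above. It assumes a reasonably modern perl (5.8 or later) rather than 5.6.1, and the variable names are purely illustrative.

```perl
use strict;
use warnings;

# "\x{263A}" is WHITE SMILING FACE: one character whose ordinal is 0x263A.
my $smiley = "\x{263A}";
my $len    = length $smiley;   # counted in characters, not bytes
my $ord    = ord $smiley;      # ordinal of the first character

printf "length=%d ord=U+%04X\n", $len, $ord;   # length=1 ord=U+263A
```

Only numbers are printed here, so no output discipline is needed on the filehandle.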
158
159
160 Identifiers within the Perl script may contain Unicode
161 alphanumeric characters, including ideographs. (You are
162 currently on your own when it comes to using the canonical
163 forms of characters--Perl doesn't (yet) attempt to
164 canonicalize variable names for you.)
165
166
167 Regular expressions match characters instead of bytes. For
168 instance, ``.'' matches a character instead of a byte.
169 (However, the \C pattern is provided to force a
170 match of a single byte (``char'' in
171 C, hence \C).)
172
173
174 Character classes in regular expressions match characters
175 instead of bytes, and match against the character properties
176 specified in the Unicode properties database. So \w
177 can be used to match an ideograph, for
178 instance.
179
180
181 Named Unicode properties and block ranges may be used as
182 character classes via the new \p{} (matches
183 property) and \P{} (doesn't match property)
184 constructs. For instance, \p{Lu} matches any
185 character with the Unicode uppercase property, while
186 \p{M} matches any mark character. Single letter
187 properties may omit the brackets, so that can be written
188 \pM also. Many predefined character classes are
2 perry 189 available, such as \p{!IsMirrored} and
190 \p{!InTibetan}.
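
The character-class behaviour described above can be sketched as follows. This assumes a modern perl (5.8 or later). The sample code points are chosen only for illustration.

```perl
use strict;
use warnings;

my $ideograph = "\x{4E2D}";   # a CJK ideograph
my $upper     = "\x{0410}";   # CYRILLIC CAPITAL LETTER A, general category Lu
my $mark      = "\x{0301}";   # COMBINING ACUTE ACCENT, a mark character

# Each flag is 1 when the pattern matches the whole one-character string.
my $ideograph_is_word = $ideograph =~ /^\w$/      ? 1 : 0;   # \w uses Unicode properties
my $upper_is_lu       = $upper     =~ /^\p{Lu}$/  ? 1 : 0;   # property match
my $mark_is_m         = $mark      =~ /^\pM$/     ? 1 : 0;   # single letter, no brackets
my $upper_not_m       = $upper     =~ /^\PM$/     ? 1 : 0;   # negated property

print "$ideograph_is_word $upper_is_lu $mark_is_m $upper_not_m\n";   # 1 1 1 1
```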
1 perry 191
192
193 The special pattern \X matches any extended
194 Unicode sequence (a ``combining character sequence'' in
195 Standardese), where the first character is a base character
196 and subsequent characters are mark characters that apply to
197 the base character. It is equivalent to
198 (?:\PM\pM*).
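
A small sketch of \X on a base character followed by a combining mark. It assumes a modern perl, where \X is defined in terms of extended grapheme clusters. For this simple input that gives the same result as the (?:\PM\pM*) form quoted above.

```perl
use strict;
use warnings;

# "e" followed by COMBINING ACUTE ACCENT: two characters, one visible unit.
my $e_acute = "e\x{0301}";

my $chars        = length $e_acute;                 # character count
my @clusters     = $e_acute =~ /(\X)/g;             # sequences matched by \X
my @old_style    = $e_acute =~ /(\PM\pM*)/g;        # the equivalent spelled out
my $cluster_count = scalar @clusters;
my $old_count     = scalar @old_style;

print "$chars $cluster_count $old_count\n";   # 2 1 1
```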
199
200
201 The tr/// operator translates characters instead of
202 bytes. Note that the tr///CU functionality has been
203 removed, as the interface was a mistake. For similar
204 functionality see pack('U0', ...) and pack('C0',
205 ...).
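
As a rough illustration of tr/// translating characters rather than bytes, the sketch below replaces and counts a character above 255. It assumes a modern perl, and the strings are invented for the example.

```perl
use strict;
use warnings;

my $text = "a\x{263A}b\x{263A}";

# Translate WHITE SMILING FACE (U+263A) to BLACK SMILING FACE (U+263B)
# on a copy, leaving $text untouched.
(my $swapped = $text) =~ tr/\x{263A}/\x{263B}/;

# With an empty replacement list and no /d, tr/// only counts matches.
my $count = ($text =~ tr/\x{263A}//);

printf "count=%d length=%d\n", $count, length $swapped;   # count=2 length=4
```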
206
207
208 Case translation operators use the Unicode case translation
209 tables when provided character input. Note that
210 uc() translates to uppercase, while
211 ucfirst translates to titlecase (for languages that
212 make the distinction). Naturally the corresponding backslash
213 sequences have the same semantics.
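
The uppercase/titlecase distinction can be seen with a digraph letter that has three case forms. This is a sketch assuming a modern perl. U+01C6, U+01C5 and U+01C4 are the small, titlecase and capital forms of the Latin letter DZ with caron.

```perl
use strict;
use warnings;

my $small = "\x{01C6}";          # LATIN SMALL LETTER DZ WITH CARON

my $upper      = uc $small;      # uppercase form
my $title      = ucfirst $small; # titlecase form
my $via_escape = "\u$small";     # \u has the same semantics as ucfirst

printf "U+%04X U+%04X U+%04X\n", ord $upper, ord $title, ord $via_escape;
# U+01C4 U+01C5 U+01C5
```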
214
215
216 Most operators that deal with positions or lengths in the
217 string will automatically switch to using character
218 positions, including chop(), substr(),
219 pos(), index(), rindex(),
220 sprintf(), write(), and length().
221 Operators that specifically don't switch include
222 vec(), pack(), and unpack().
223 Operators that really don't care include chomp(),
224 as well as any other operator that treats a string as a
225 bucket of bits, such as sort(), and the operators
226 dealing with filenames.
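
The position and length operators listed above can be sketched like this. The example assumes a modern perl and uses bytes::length only to expose the internal byte count for comparison.

```perl
use strict;
use warnings;
use bytes ();   # load the module without switching on byte semantics

my $s = "\x{263A}abc";          # one wide character followed by three ASCII ones

my $chars = length $s;          # character count
my $bytes = bytes::length($s);  # internal UTF-8 byte count (3 + 3)
my $pos   = index $s, "b";      # character position, not byte offset
my $tail  = substr $s, 1;       # skips exactly one character

print "$chars $bytes $pos $tail\n";   # 4 6 2 abc
```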
227
228
229 The pack()/unpack() letters
230 ``c'' and ``C'' do not
231 change, since they're often used for byte-oriented formats.
232 (Again, think ``char'' in the C language.)
233 However, there is a new ``U'' specifier that will convert between
234 UTF-8 characters and
235 integers. (It works outside of the utf8 pragma
236 too.)
237
238
239 The chr() and ord() functions work on
240 characters. This is like pack("U") and
241 unpack("U"), not like
242 pack("C") and
243 unpack("C"). In fact, the latter are how
244 you now emulate byte-oriented chr() and
245 ord() under utf8.
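
A brief sketch of the correspondence between chr()/ord() and the ``U'' letter of pack()/unpack(). It assumes a modern perl (5.10 or later), and the values are illustrative only.

```perl
use strict;
use warnings;

my $packed  = pack "U", 0x263A;           # integer to character
my $via_chr = chr 0x263A;                 # the same conversion via chr()
my $same    = $packed eq $via_chr ? 1 : 0;

my @codes = unpack "U*", "a\x{263A}";     # characters to integers
my $code  = ord "\x{263A}";               # ord() on a single character

printf "%d %d %d %d\n", $same, $codes[0], $codes[1], $code;   # 1 97 9786 9786
```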
246
247
248 The bit string operators & | ^ ~ can operate on
249 character data. However, for backward compatibility reasons
250 (bit string operations when the characters all are less than
251 256 in ordinal value) one cannot mix ~ (the bit
252 complement) and characters both less than 256 and equal or
2 perry 253 greater than 256. Most importantly, the !DeMorgan's laws
1 perry 254 (~($x|$y) eq ~$x&~$y, ~($x&$y) eq
255 ~$x|~$y) won't hold. Another way to look at this is that
256 the complement cannot return __both__ the 8-bit (byte)
257 wide bit complement, and the full character wide bit
258 complement.
259
260
261 And finally, scalar reverse() reverses by character
262 rather than by byte.
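
Scalar reverse() on a string containing a wide character, as a minimal sketch assuming a modern perl:

```perl
use strict;
use warnings;

my $s        = "a\x{263A}b";
my $reversed = scalar reverse $s;   # reverses characters, not bytes

# The wide character survives intact in the middle of the result.
printf "U+%04X %d\n", ord(substr $reversed, 1, 1), length $reversed;   # U+263A 3
```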
263
264
265 __Character encodings for input and output__
266
267
268 [[ XXX: This feature is not yet
269 implemented.]
270 !!CAVEATS
271
272
273 As of yet, there is no method for automatically coercing
274 input and output to some encoding other than
275 UTF-8. This is planned in the near future,
276 however.
277
278
279 Whether an arbitrary piece of data will be treated as
280 ``characters'' or ``bytes'' by internal operations cannot be
281 divined at the current time.
282
283
284 Use of locales with utf8 may lead to odd results. Currently
285 there is some attempt to apply 8-bit locale info to
286 characters in the range 0..255, but this is demonstrably
287 incorrect for locales that use characters above that range
288 (when mapped into Unicode). It will also tend to run slower.
289 Avoidance of locales is strongly encouraged.
290 !!SEE ALSO
291
292
293 bytes, utf8, ``${^WIDE_SYSTEM_CALLS}'' in
294 perlvar
295 ----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.