Blame: perlunicode(1) - Waikato Linux Users Group

Annotated edit history of perlunicode(1) version 2, including all changes.

Rev	Author	#	Line
1	perry	1	`PERLUNICODE`
		2	`!!!PERLUNICODE`
		3	`NAME`
		4	`DESCRIPTION`
		5	`CAVEATS`
		6	`SEE ALSO`
		7	`----`
		8	`!!NAME`
		9
		10
		11	`perlunicode - Unicode support in Perl ( EXPERIMENTAL , subject to change)`
		12	`!!DESCRIPTION`
		13
		14
		15	`__Important Caveat__`
		16
		17
		18	`WARNING: As of the 5.6.1 release, the implementation of Unicode`
		19	`support in Perl is incomplete, and continues to be highly experimental.`
		20	`The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.`
		21
		22
		23	`Input and Output Disciplines`
		24
		25
		26	`There is currently no easy way to mark data read from a file`
		27	`or other external source as being utf8. This will be one of`
		28	`the major areas of focus in the near future.`
		29
		30
		31	`Regular Expressions`
		32
		33
		34	`The existing regular expression compiler does not produce`
		35	`polymorphic opcodes. This means that the determination on`
		36	`whether to match Unicode characters is made when the pattern`
		37	`is compiled, based on whether the pattern contains Unicode`
		38	`characters, and not when the matching happens at run time.`
		39	`This needs to be changed to adaptively match Unicode if the`
		40	`string to be matched is Unicode.`
		41
		42
		43	`use utf8 still needed to enable a few`
		44	`features`
		45
		46
		47	`The utf8 pragma implements the tables used for`
		48	`Unicode support. These tables are automatically loaded on`
		49	`demand, so the utf8 pragma need not normally be`
		50	`used.`
		51
		52
		53	`However, as a compatibility measure, this pragma must be`
		54	`explicitly used to enable recognition of`
		55	`UTF-8 encoded literals and identifiers in the`
		56	`source text.`
		57
		58
		59	`__Byte and Character semantics__`
		60
		61
		62	`Beginning with version 5.6, Perl uses logically wide`
		63	`characters to represent strings internally. This internal`
		64	`representation of strings uses the UTF-8`
		65	`encoding.`
		66
		67
		68	`In future, Perl-level operations can be expected to work`
		69	`with characters rather than bytes, in general.`
		70
		71
		72	`However, as strictly an interim compatibility measure, Perl`
		73	`v5.6 aims to provide a safe migration path from byte`
		74	`semantics to character semantics for programs. For`
		75	`operations where Perl can unambiguously decide that the`
		76	`input data is characters, Perl now switches to character`
		77	`semantics. For operations where this determination cannot be`
		78	`made without additional information from the user, Perl`
		79	`decides in favor of compatibility, and chooses to use byte`
		80	`semantics.`
		81
		82
		83	`This behavior preserves compatibility with earlier versions`
		84	`of Perl, which allowed byte semantics in Perl operations,`
		85	`but only as long as none of the program's inputs are marked`
		86	`as being a source of Unicode character data. Such data may`
		87	`come from filehandles, from calls to external programs, from`
		88	`information provided by the system (such as %ENV),`
		89	`or from literals and constants in the source`
		90	`text.`
		91
		92
		93	`If the -C command line switch is used (or the`
		94	`${^WIDE_SYSTEM_CALLS} global flag is set to 1), all`
		95	`system calls will use the corresponding wide character APIs.`
		96	`This is currently only implemented on Windows.`
		97
		98
		99	`Regardless of the above, the bytes pragma can`
		100	`always be used to force byte semantics in a particular`
		101	`lexical scope. See bytes.`
		102
		103
		104	`The utf8 pragma is primarily a compatibility device`
		105	`that enables recognition of UTF-8 in literals`
		106	`encountered by the parser. It may also be used for enabling`
		107	`some of the more experimental Unicode support features. Note`
		108	`that this pragma is only required until a future version of`
		109	`Perl in which character semantics will become the default.`
		110	`This pragma may then become a no-op. See utf8.`
		111
		112
		113	`Unless mentioned otherwise, Perl operators will use`
		114	`character semantics when they are dealing with Unicode data,`
		115	`and byte semantics otherwise. Thus, character semantics for`
		116	`these operations apply transparently; if the input data came`
		117	`from a Unicode source (for example, by adding a character`
		118	`encoding discipline to the filehandle whence it came, or a`
		119	`literal UTF-8 string constant in the`
		120	`program), character semantics apply; otherwise, byte`
		121	`semantics are in effect. To force byte semantics on Unicode`
		122	`data, the bytes pragma should be used.`
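
As a minimal sketch of the lexical scoping described above, the same string can be measured under both semantics. This was written against a modern perl, so details of the 5.6 implementation may differ, and the variable names are purely illustrative.

```perl
use strict;
use warnings;

# One character above 255, so Perl treats the string as character data.
my $smiley = "\x{263A}";

# Character semantics: one logical character.
my $char_count = length($smiley);

# Byte semantics, confined to this block by the lexically scoped pragma.
my $byte_count = do { use bytes; length($smiley) };

printf "characters: %d, bytes: %d\n", $char_count, $byte_count;
```

Outside the do block, length() goes back to counting characters.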
		123
		124
		125	`Under character semantics, many operations that formerly`
		126	`operated on bytes change to operating on characters. For`
		127	`ASCII data this makes no difference, because`
		128	`UTF-8 stores ASCII in single`
		129	`bytes, but for any character greater than chr(127),`
		130	`the character may be stored in a sequence of two or more`
		131	`bytes, all of which have the high bit set. But by and large,`
		132	`the user need not worry about this, because Perl hides it`
		133	`from the user. A character in Perl is logically just a`
		134	`number ranging from 0 to 2**32 or so. Larger characters`
		135	`encode to longer sequences of bytes internally, but again,`
		136	`this is just an internal detail which is hidden at the Perl`
		137	`level.`
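
The relationship between a character's ordinal value and the length of its encoding can be sketched as follows. The helper name utf8_byte_length is hypothetical, and utf8::encode is an assumption about the perl in use: it exists in perl 5.8 and later, not in 5.6.

```perl
use strict;
use warnings;

# Number of bytes in the UTF-8 encoding of one code point.
sub utf8_byte_length {
    my ($code_point) = @_;
    my $octets = chr($code_point);
    utf8::encode($octets);    # replace the characters with their UTF-8 bytes
    return length($octets);
}

printf "U+%04X -> %d byte(s)\n", $_, utf8_byte_length($_)
    for 0x41, 0xE9, 0x263A;
```

ASCII stays at one byte, while the larger values need two and three bytes respectively.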
		138
		139
		140	`__Effects of character semantics__`
		141
		142
		143	`Character semantics have the following effects:`
		144
		145
		146	`Strings and patterns may contain characters that have an`
		147	`ordinal value larger than 255.`
		148
		149
		150	`Presuming you use a Unicode editor to edit your program,`
		151	`such characters will typically occur directly within the`
		152	`literal strings as UTF-8 characters, but you`
		153	`can also specify a particular character with an extension of`
		154	`the \x notation. UTF-8 characters are`
		155	`specified by putting the hexadecimal code within curlies`
		156	`after the \x. For instance, a Unicode smiley face is`
		157	`\x{263A}.`
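
A short sketch of the braced hexadecimal notation, assuming a perl that applies character semantics to the resulting string (variable names are illustrative):

```perl
use strict;
use warnings;

# Six ASCII characters followed by U+263A.
my $text = "smile \x{263A}";

my $length     = length($text);             # counts characters, not bytes
my $last_value = ord(substr($text, -1));    # ordinal of the final character

printf "length %d, last character U+%04X\n", $length, $last_value;
```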
		158
		159
		160	`Identifiers within the Perl script may contain Unicode`
		161	`alphanumeric characters, including ideographs. (You are`
		162	`currently on your own when it comes to using the canonical`
		163	`forms of characters--Perl doesn't (yet) attempt to`
		164	`canonicalize variable names for you.)`
		165
		166
		167	`Regular expressions match characters instead of bytes. For`
		168	instance, ``.'' matches a character instead of a byte.
		169	`(However, the \C pattern is provided to force a`
		170	`match of a single byte (char in C, hence`
		171	`\C).)`
		172
		173
		174	`Character classes in regular expressions match characters`
		175	`instead of bytes, and match against the character properties`
		176	`specified in the Unicode properties database. So \w`
		177	`can be used to match an ideograph, for`
		178	`instance.`
		179
		180
		181	`Named Unicode properties and block ranges may be used as`
		182	`character classes via the new \p{} (matches`
		183	`property) and \P{} (doesn't match property)`
		184	`constructs. For instance, \p{Lu} matches any`
		185	`character with the Unicode uppercase property, while`
		186	`\p{M} matches any mark character. Single letter`
		187	`properties may omit the brackets, so that can be written`
		188	`\pM also. Many predefined character classes are`
2	perry	189	`available, such as \p{!IsMirrored} and`
		190	`\p{!InTibetan}.`
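
The property constructs above might be exercised like this. It is a sketch only; the sample characters were picked for illustration and do not come from the original text.

```perl
use strict;
use warnings;

my $capital_delta   = "\x{394}";   # GREEK CAPITAL LETTER DELTA
my $small_delta     = "\x{3B4}";   # GREEK SMALL LETTER DELTA
my $combining_acute = "\x{301}";   # COMBINING ACUTE ACCENT, a mark

# \p{Lu} matches an uppercase letter; \P{Lu} matches anything that is not one.
my $upper_matches = $capital_delta =~ /^\p{Lu}$/ ? 1 : 0;
my $lower_is_not  = $small_delta   =~ /^\P{Lu}$/ ? 1 : 0;

# The braced and unbraced spellings of a single-letter property are equivalent.
my $mark_long  = $combining_acute =~ /^\p{M}$/ ? 1 : 0;
my $mark_short = $combining_acute =~ /^\pM$/   ? 1 : 0;

print "$upper_matches $lower_is_not $mark_long $mark_short\n";
```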
1	perry	191
		192
		193	`The special pattern \X matches any extended`
		194	Unicode sequence (a ``combining character sequence'' in
		195	`Standardese), where the first character is a base character`
		196	`and subsequent characters are mark characters that apply to`
		197	`the base character. It is equivalent to`
		198	`(?:\PM\pM*).`
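
To illustrate, here is a sketch that counts combining character sequences in a short string. It assumes a modern perl, where \X matches a somewhat broader "extended grapheme cluster"; for a base character followed by a mark the result is the same.

```perl
use strict;
use warnings;

# "e", COMBINING ACUTE ACCENT, "a": three characters, two sequences.
my $sequence = "e\x{301}a";

my $characters = length($sequence);

# The empty-list assignment forces list context, then yields the match count.
my $clusters = () = $sequence =~ /\X/g;

printf "%d characters, %d sequences\n", $characters, $clusters;
```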
		199
		200
		201	`The tr/// operator translates characters instead of`
		202	`bytes. Note that the tr///CU functionality has been`
		203	`removed, as the interface was a mistake. For similar`
		204	`functionality see pack('U0', ...) and pack('C0',`
		205	`...).`
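
A small sketch of tr/// working on characters above 255. The Greek ranges are an illustrative choice, not something taken from the original text.

```perl
use strict;
use warnings;

# Lowercase alpha, beta, gamma.
my $greek = "\x{3B1}\x{3B2}\x{3B3}";

# Map the lowercase Greek range onto the uppercase range, on a copy.
(my $upper = $greek) =~ tr/\x{3B1}-\x{3C9}/\x{391}-\x{3A9}/;

# With an empty replacement list, tr/// only counts matching characters.
my $count = ($greek =~ tr/\x{3B1}-\x{3C9}//);

printf "translated to U+%04X U+%04X U+%04X, counted %d\n",
    (map { ord } split //, $upper), $count;
```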
		206
		207
		208	`Case translation operators use the Unicode case translation`
		209	`tables when provided character input. Note that`
		210	`uc() translates to uppercase, while`
		211	`ucfirst translates to titlecase (for languages that`
		212	`make the distinction). Naturally the corresponding backslash`
		213	`sequences have the same semantics.`
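
The uppercase/titlecase distinction can be seen with a character that has all three case forms. This is a sketch; U+01C6 (LATIN SMALL LETTER DZ WITH CARON) is chosen here only as a convenient example.

```perl
use strict;
use warnings;

my $small_dz = "\x{1C6}";

my $upper = uc($small_dz);         # uppercase form
my $title = ucfirst($small_dz);    # titlecase form

# The backslash sequences follow the same rules as the functions.
my $via_escape_upper = "\U$small_dz";
my $via_escape_title = "\u$small_dz";

printf "uc: U+%04X, ucfirst: U+%04X\n", ord($upper), ord($title);
```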
		214
		215
		216	`Most operators that deal with positions or lengths in the`
		217	`string will automatically switch to using character`
		218	`positions, including chop(), substr(),`
		219	`pos(), index(), rindex(),`
		220	`sprintf(), write(), and length().`
		221	`Operators that specifically don't switch include`
		222	`vec(), pack(), and unpack().`
		223	`Operators that really don't care include chomp(),`
		224	`as well as any other operator that treats a string as a`
		225	`bucket of bits, such as sort(), and the operators`
		226	`dealing with filenames.`
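
The position-based operators named above can be checked against a string with one wide character in the middle. This is a minimal sketch with illustrative variable names.

```perl
use strict;
use warnings;

my $text = "a\x{263A}b";

my $length   = length($text);          # characters, not bytes
my $position = index($text, "b");      # character offset of "b"
my $middle   = substr($text, 1, 1);    # the whole wide character

# chop() removes one character, however many bytes encode it.
my $copy    = "ab\x{263A}";
my $chopped = chop($copy);

printf "length %d, index %d, middle U+%04X, chopped U+%04X\n",
    $length, $position, ord($middle), ord($chopped);
```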
		227
		228
		229	`The pack()/unpack() letters`
		230	``c'' and ``C'' do not
		231	`change, since they're often used for byte-oriented formats.`
		232	(Again, think ``char'' in the C language.)
		233	However, there is a new ``U'' specifier that will convert between
		234	`UTF-8 characters and`
		235	`integers. (It works outside of the utf8 pragma`
		236	`too.)`
		237
		238
		239	`The chr() and ord() functions work on`
		240	`characters. This is like pack("U") and`
		241	`unpack("U"), not like`
		242	`pack("C") and`
		243	`unpack("C"). In fact, the latter are how`
		244	`you now emulate byte-oriented chr() and`
		245	`ord() under utf8.`
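
The correspondence between chr()/ord() and the "U" template letter can be sketched as follows, again assuming a modern perl. The byte-oriented "C" letter is shown only with ASCII data, where both views agree.

```perl
use strict;
use warnings;

# "U" converts between integers and characters, like chr() and ord().
my $packed = pack("U", 0x263A);
my ($code) = unpack("U", "\x{263A}");
my @codes  = unpack("U*", "a\x{263A}");

# "C" stays byte-oriented, like char in C.
my $byte         = pack("C", 0x41);
my ($byte_value) = unpack("C", "A");

printf "U: U+%04X, U*: %s, C: %d\n", $code, join(",", @codes), $byte_value;
```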
		246
		247
		248	`The bit string operators can operate on`
		249	`character data. However, for backward compatibility reasons`
		250	`(bit string operations when the characters all are less than`
		251	`256 in ordinal value) one cannot mix ~ (the bit`
		252	`complement) and characters both less than 256 and equal or`
2	perry	253	`greater than 256. Most importantly, the !DeMorgan's laws`
1	perry	254	`(~($x|$y) eq ~$x&~$y, ~($x&$y) eq`
		255	`~$x|~$y) won't hold. Another way to look at this is that`
		256	`the complement cannot return __both__ the 8-bit (byte)`
		257	`wide bit complement, and the full character wide bit`
		258	`complement.`
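
For contrast, here is a sketch of the backward-compatible case, where every character is below 256 and De Morgan's laws do hold for the string bit operators. Mixed data is deliberately left out, since recent perls reject the complement of characters above 255 outright.

```perl
use strict;
use warnings;

# Two byte strings of equal length; all ordinals are below 256.
my ($x, $y) = ("AB", "ab");

# Both laws hold when the complement is the 8-bit (byte) complement.
my $or_law  = ~($x | $y) eq (~$x & ~$y) ? 1 : 0;
my $and_law = ~($x & $y) eq (~$x | ~$y) ? 1 : 0;

print "or law: $or_law, and law: $and_law\n";
```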
		259
		260
		261	`And finally, scalar reverse() reverses by character`
		262	`rather than by byte.`
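
A one-line check of scalar reverse() on a string ending in a wide character, as a sketch with illustrative names:

```perl
use strict;
use warnings;

my $text = "ab\x{263A}";

# The wide character moves to the front intact; its bytes are not reversed.
my $reversed = scalar reverse($text);

printf "first character after reversal: U+%04X\n", ord($reversed);
```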
		263
		264
		265	`__Character encodings for input and output__`
		266
		267
		268	`[[ XXX: This feature is not yet`
		269	`implemented.]`
		270	`!!CAVEATS`
		271
		272
		273	`As of yet, there is no method for automatically coercing`
		274	`input and output to some encoding other than`
		275	`UTF-8. This is planned in the near future,`
		276	`however.`
		277
		278
		279	`Whether an arbitrary piece of data will be treated as`
		280	``characters'' or ``bytes'' by internal operations cannot be
		281	`divined at the current time.`
		282
		283
		284	`Use of locales with utf8 may lead to odd results. Currently`
		285	`there is some attempt to apply 8-bit locale info to`
		286	`characters in the range 0..255, but this is demonstrably`
		287	`incorrect for locales that use characters above that range`
		288	`(when mapped into Unicode). It will also tend to run slower.`
		289	`Avoidance of locales is strongly encouraged.`
		290	`!!SEE ALSO`
		291
		292
		293	bytes, utf8, ``${^WIDE_SYSTEM_CALLS}'' in
		294	`perlvar`
		295	`----`

This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.

Last edited on Monday, June 3, 2002 6:50:53 pm by "perry"
