Blame: perllocale(1)
Annotated edit history of perllocale(1) version 1, including all changes.
Rev Author # Line
1 perry 1 PERLLOCALE
2 !!!PERLLOCALE
3 NAME
4 DESCRIPTION
5 PREPARING TO USE LOCALES
6 USING LOCALES
7 LOCALE CATEGORIES
8 SECURITY
9 ENVIRONMENT
10 NOTES
11 BUGS
12 SEE ALSO
13 HISTORY
14 ----
15 !!NAME
16
17
18 perllocale - Perl locale handling (internationalization and localization)
19 !!DESCRIPTION
20
21
22 Perl supports language-specific notions of data such as ``is
23 this a letter'', ``what is the uppercase equivalent of this
24 letter'', and ``which of these letters comes first''. These
25 are important issues, especially for languages other than
26 English--but also for English: it would be naive to imagine
27 that A-Za-z defines all the ``letters'' needed to
28 write in English. Perl is also aware that some character
29 other than '.' may be preferred as a decimal point, and that
30 output date representations may be language-specific. The
31 process of making an application take account of its users'
32 preferences in such matters is called
33 __internationalization__ (often abbreviated as
34 __i18n__); telling such an application about a particular
35 set of preferences is known as __localization__
36 (__l10n__).
37
38
39 Perl can understand language-specific data via the
40 standardized (ISO C, XPG4,
41 POSIX 1.c) method called ``the locale
42 system''. The locale system is controlled per application
43 using one pragma, one function call, and several environment
44 variables.
45
46
47 __NOTE__: This feature is new in Perl
48 5.004, and does not apply unless an application specifically
49 requests it--see ``Backward compatibility''. The one
50 exception is that ''write()'' now __always__ uses the
51 current locale--see
52 ``NOTES''.
53 !!PREPARING TO USE LOCALES
54
55
56 If Perl applications are to understand and present your data
57 correctly according to a locale of your choice, __all__ of
58 the following must be true:
59
60
61 __Your operating system must support the locale system__.
62 If it does, you should find that the ''setlocale()''
63 function is a documented part of its C library.
64
65
66 __Definitions for locales that you use must be
67 installed__. You, or your system administrator, must make
68 sure that this is the case. The available locales, the
69 location in which they are kept, and the manner in which
70 they are installed all vary from system to system. Some
71 systems provide only a few, hard-wired locales and do not
72 allow more to be added. Others allow you to add ``canned''
73 locales provided by the system supplier. Still others allow
74 you or the system administrator to define and add arbitrary
75 locales. (You may have to ask your supplier to provide
76 canned locales that are not delivered with your operating
77 system.) Read your system documentation for further
78 illumination.
79
80
81 __Perl must believe that the locale system is
82 supported__. If it does, perl -V:d_setlocale will
83 say that the value for d_setlocale is
84 define.
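
The same check can be made from inside a program. The following is a minimal sketch, assuming the core Config module is available; has_setlocale() is a hypothetical helper name chosen for illustration.

```perl
use strict;
use warnings;
use Config;    # core module exposing Perl's build-time configuration

# has_setlocale() is a hypothetical helper name used only for this sketch.
# It reports whether Perl was built believing setlocale() is available.
sub has_setlocale {
    my $value = $Config{d_setlocale};
    return (defined $value && $value eq 'define') ? 1 : 0;
}

print "setlocale() support compiled in: ",
    (has_setlocale() ? "yes" : "no"), "\n";
```

A program could use such a check to skip locale-dependent features rather than fail on a Perl built without locale support.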
85
86
87 If you want a Perl application to process and present your
88 data according to a particular locale, the application code
89 should include the use locale pragma (see ``The use
90 locale pragma'') where appropriate, and __at least one__
91 of the following must be true:
92
93
94 __The locale-determining environment variables (see
95 ``ENVIRONMENT'') must be correctly set up__
96 at the time the application is started, either by yourself
97 or by whoever set up your system account.
98
99
100 __The application must set its own locale__ using the
101 method described in ``The setlocale function''.
102 !!USING LOCALES
103
104
105 __The use locale pragma__
106
107
108 By default, Perl ignores the current locale. The use
109 locale pragma tells Perl to use the current locale for
110 some operations:
111
112
113 __The comparison operators__ (lt, le,
114 cmp, ge, and gt) and the
115 POSIX string collation functions
116 ''strcoll()'' and ''strxfrm()'' use
117 LC_COLLATE. ''sort()'' is also affected if used
118 without an explicit comparison function, because it uses
119 cmp by default.
120
121
122 __Note:__ eq and ne are unaffected by
123 locale: they always perform a byte-by-byte comparison of
124 their scalar operands. What's more, if cmp finds
125 that its operands are equal according to the collation
126 sequence specified by the current locale, it goes on to
127 perform a byte-by-byte comparison, and only returns ''0''
128 (equal) if the operands are bit-for-bit identical. If you
129 really want to know whether two strings--which eq
130 and cmp may consider different--are equal as far as
131 collation in the locale is concerned, see the discussion in
132 ``Category LC_COLLATE:
133 Collation''.
134
135
136 __Regular expressions and case-modification functions__
137 (''uc()'', ''lc()'', ''ucfirst()'', and
138 ''lcfirst()'') use LC_CTYPE.
139
140
141 __The formatting functions__ (''printf()'',
142 ''sprintf()'' and ''write()'') use
143 LC_NUMERIC.
144
145
146 __The POSIX date formatting function__
147 (''strftime()'') uses LC_TIME.
148
149
150 LC_COLLATE, LC_CTYPE, and so on, are
151 discussed further in ``LOCALE
152 CATEGORIES''.
153
154
155 The default behavior is restored with the no locale
156 pragma, or upon reaching the end of the block enclosing use
157 locale.
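
The block scoping described above can be sketched as follows. Both subroutine names are hypothetical, and the locale-aware ordering you see depends entirely on the LC_COLLATE locale in force when the program runs.

```perl
use strict;
use warnings;

# Both subs are hypothetical helpers for illustration.
sub sort_bytewise {
    # No "use locale" in this block, so sort compares byte by byte.
    my @sorted = sort @_;
    return @sorted;
}

sub sort_by_locale {
    use locale;             # in effect only until the closing brace
    my @sorted = sort @_;   # ordering taken from the LC_COLLATE locale
    return @sorted;
}

my @words = qw(banana Apple cherry);
print "bytewise: @{[ sort_bytewise(@words) ]}\n";
print "locale:   @{[ sort_by_locale(@words) ]}\n";
```

Keeping use locale inside the smallest block that needs it limits how much of the program's output depends on the user's environment.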
158
159
160 The string result of any operation that uses locale
161 information is tainted, as it is possible for a locale to be
162 untrustworthy. See
163 ``SECURITY''.
164
165
166 __The setlocale function__
167
168
169 You can switch locales as often as you wish at run time with
170 the ''POSIX::setlocale()'' function:
171
172
173 # This functionality not usable prior to Perl 5.004
174 require 5.004;
175 # Import locale-handling tool set from POSIX module.
176 # This example uses: setlocale -- the function call
177 # LC_CTYPE -- explained below
178 use POSIX qw(locale_h);
179 # query and save the old locale
180 $old_locale = setlocale(LC_CTYPE);
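181 setlocale(LC_CTYPE, "fr_CA.ISO8859-1"); # French, Canada, codeset ISO 8859-1
182 setlocale(LC_CTYPE, ""); # reset to the default from LC_ALL/LC_CTYPE/LANG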
183 # restore the old locale
184 setlocale(LC_CTYPE, $old_locale);
185 The first argument of ''setlocale()'' gives the __category__, the second the __locale__. The category names the aspect of data processing to which you want to apply locale-specific rules. Category names are discussed in ``LOCALE CATEGORIES'' and ``ENVIRONMENT''. The locale is the name of a collection of customization information corresponding to a particular combination of language, country or territory, and codeset. Read on for hints on the naming of locales: not all systems name locales as in the example.
186
187
188 If no second argument is provided and the category is
189 something other than LC_ALL, the function
190 returns a string naming the current locale for the category.
191 You can use this value as the second argument in a
192 subsequent call to ''setlocale()''.
193
194
195 If no second argument is provided and the category is
196 LC_ALL, the result is
197 implementation-dependent. It may be a string of concatenated
198 locale names (separator also implementation-dependent) or a
199 single locale name. Please consult your setlocale(3)
200 for details.
201
202
203 If a second argument is given and it corresponds to a valid
204 locale, the locale for the category is set to that value,
205 and the function returns the now-current locale value. You
206 can then use this in yet another call to ''setlocale()''.
207 (In some implementations, the return value may sometimes
208 differ from the value you gave as the second argument--think
209 of it as an alias for the value you gave.)
210
211
212 As the example shows, if the second argument is an empty
213 string, the category's locale is returned to the default
214 specified by the corresponding environment variables.
215 Generally, this results in a return to the default that was
216 in force when Perl started up: changes to the environment
217 made by the application after startup may or may not be
218 noticed, depending on your system's C library.
219
220
221 If the second argument does not correspond to a valid
222 locale, the locale for the category is not changed, and the
223 function returns ''undef''.
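
Because an unknown locale name yields ''undef'' and leaves the category unchanged, a program can probe several spellings in turn. This is a minimal sketch; first_usable_locale() is a hypothetical helper, and the candidate names are only examples that may not exist on your system.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);   # exports setlocale() and the LC_* constants

# first_usable_locale() is a hypothetical helper, not a POSIX function.
# It tries each candidate in turn and returns the name setlocale()
# reports for the first one accepted, or undef if none is accepted.
sub first_usable_locale {
    my ($category, @candidates) = @_;
    for my $name (@candidates) {
        my $result = setlocale($category, $name);
        return $result if defined $result;
    }
    return;
}

# "C" is listed last as a fallback that should always be accepted.
my $chosen = first_usable_locale(LC_CTYPE, 'fr_CA.ISO8859-1', 'fr_CA', 'C');
print defined $chosen ? "LC_CTYPE is now $chosen\n"
                      : "no candidate accepted\n";
```

Note that the returned name may differ from the name you passed in, as described above.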
224
225
226 For further information about the categories, consult
227 setlocale(3).
228
229
230 __Finding locales__
231
232
233 For locales available in your system, consult also
234 setlocale(3) to see whether it leads to the list of
235 available locales (search for the ''SEE
236 ALSO'' section). If that fails, try the following
237 command lines:
238
239
240 locale -a
241 nlsinfo
242 ls /usr/lib/nls/loc
243 ls /usr/lib/locale
244 ls /usr/lib/nls
245 ls /usr/share/locale
246 and see whether they list something resembling these
247
248
249 en_US.ISO8859-1 de_DE.ISO8859-1 ru_RU.ISO8859-5
250 en_US.iso88591 de_DE.iso88591 ru_RU.iso88595
251 en_US de_DE ru_RU
252 en de ru
253 english german russian
254 english.iso88591 german.iso88591 russian.iso88595
255 english.roman8 russian.koi8r
256 Sadly, even though the calling interface for ''setlocale()'' has been standardized, names of locales and the directories where the configuration resides have not been. The basic form of the name is ''language_territory''__.__''codeset'', but the latter parts after ''language'' are not always present. The ''language'' and ''territory'' are usually from the standards __ISO 639__ and __ISO 3166__, the two-letter abbreviations for the languages and the countries of the world, respectively. The ''codeset'' part often mentions some __ISO 8859__ character set, the Latin codesets. For example, ISO 8859-1 is the so-called ``Western European codeset'' that can be used to encode most Western European languages adequately. Again, there are several ways to write even the name of that one standard. Lamentably.
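
The basic name form can be taken apart with a regular expression. The sketch below assumes only the ''language_territory''.''codeset'' shape described above; parse_locale_name() is a hypothetical helper, and real systems may use names it does not anticipate.

```perl
use strict;
use warnings;

# parse_locale_name() is a hypothetical helper for this sketch.
# It splits "language_territory.codeset", where the territory and
# codeset parts are optional, and returns a hash reference.
sub parse_locale_name {
    my ($name) = @_;
    return unless $name =~ /\A([^_.]+)(?:_([^.]+))?(?:\.(.+))?\z/;
    return +{ language => $1, territory => $2, codeset => $3 };
}

for my $name (qw(en_US.ISO8859-1 de_DE ru english.roman8)) {
    my $parts = parse_locale_name($name);
    printf "%-16s language=%s territory=%s codeset=%s\n", $name,
        map { defined $_ ? $_ : '(none)' }
            @{$parts}{qw(language territory codeset)};
}
```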
257
258
259 Two special locales are worth particular mention: ``C'' and
260 ``POSIX''. Currently these are effectively
261 the same locale: the difference is mainly that the first one
262 is defined by the C standard, the second by the
263 POSIX standard. They define the __default
264 locale__ in which every program starts in the absence of
265 locale information in its environment. (The ''default''
266 default locale, if you will.) Its language is (American)
267 English and its character codeset ASCII.
268
269
270
271 __NOTE__: Not all systems have the
272 ``POSIX'' locale (not all systems are
273 POSIX-conformant), so use ``C'' when you need explicitly to
274 specify this default locale.
275
276
277 __LOCALE PROBLEMS__
278
279
280 You may encounter the following warning message at Perl
281 startup:
282
283
284 perl: warning: Setting locale failed.
285 perl: warning: Please check that your locale settings:
286 LC_ALL = "En_US"
287 This means that your locale settings had LC_ALL set to ``En_US'' and LANG exists but has no value. Perl tried to believe you but could not. Instead, Perl gave up and fell back to the ``C'' locale, the default locale that is supposed to work no matter what. This usually means your locale settings were wrong, they mention locales your system has never heard of, or the locale installation in your system has problems (for example, some system files are broken or missing). There are quick and temporary fixes to these problems, as well as more thorough and lasting fixes.
288
289
290 __Temporarily fixing locale problems__
291
292
293 The two quickest fixes are either to render Perl silent
294 about any locale inconsistencies or to run Perl under the
295 default locale ``C''.
296
297
298 Perl's moaning about locale problems can be silenced by
299 setting the environment variable PERL_BADLANG
300 to a zero value, for example ``0''. This method really just
301 sweeps the problem under the carpet: you tell Perl to shut
302 up even when Perl sees that something is wrong. Do not be
303 surprised if later something locale-dependent
304 misbehaves.
305
306
307 Perl can be run under the ``C'' locale by setting the
308 environment variable LC_ALL to ``C''. This
309 method is perhaps a bit more civilized than the
310 PERL_BADLANG approach, but setting
311 LC_ALL (or other locale variables) may affect
312 other programs as well, not just Perl. In particular,
313 external programs run from within Perl will see these
314 changes. If you make the new settings permanent (read on),
315 all programs you run see the changes. See
316 ``ENVIRONMENT'' for the full list of relevant
317 environment variables and ``USING LOCALES''
318 for their effects in Perl. Effects in other programs are
319 easily deducible. For example, the variable
320 LC_COLLATE may well affect your __sort__
321 program (or whatever the program that arranges `records'
322 alphabetically in your system is called).
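
If only one external command needs the ``C'' locale, the change can be confined to that command instead of the whole session. This is a minimal sketch; run_in_c_locale() is a hypothetical helper, and the child command shown is just a second perl that prints what it sees.

```perl
use strict;
use warnings;

# run_in_c_locale() is a hypothetical helper. It sets LC_ALL only for the
# duration of the call, so the child process sees "C" while the rest of
# this program keeps its original environment.
sub run_in_c_locale {
    my (@command) = @_;
    local $ENV{LC_ALL} = 'C';   # undone automatically when the sub returns
    return system(@command);
}

# $^X is the path of the running perl; any other command could be used.
run_in_c_locale($^X, '-e', 'print "child sees LC_ALL=$ENV{LC_ALL}\n"');
```

Using local on the hash element means the previous value, or the absence of one, is restored when the subroutine returns.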
323
324
325 You can test out changing these variables temporarily, and
326 if the new settings seem to help, put those settings into
327 your shell startup files. Consult your local documentation
328 for the exact details. In Bourne-like shells (__sh__,
329 __ksh__, __bash__, __zsh__):
330
331
332 LC_ALL=en_US.ISO8859-1
333 export LC_ALL
334 This assumes that we saw the locale ``en_US.ISO8859-1'' using the commands discussed above. We decided to try that instead of the above faulty locale ``En_US''. In csh-like shells (__csh__, __tcsh__):
335
336
337 setenv LC_ALL en_US.ISO8859-1
338 If you do not know what shell you have, consult your local helpdesk or the equivalent.
339
340
341 __Permanently fixing locale problems__
342
343
344 The slower but superior fix is to correct the
345 misconfiguration of your own environment variables
346 yourself. Misconfigured or missing system-wide
347 locales usually require the help of your friendly system
348 administrator.
349
350
351 First, see earlier in this document about ``Finding
352 locales''. That tells how to find which locales are really
353 supported--and more importantly, installed--on your system.
354 In our example error message, environment variables
355 affecting the locale are listed in the order of decreasing
356 importance (and unset variables do not matter). Therefore,
357 having LC_ALL set to ``En_US'' must have been
358 the bad choice, as shown by the error message. First try
359 fixing locale settings listed first.
360
361
362 Second, if using the listed commands you see something
363 __exactly__ (prefix matches do not count and case usually
364 counts) like ``En_US'' without the quotes, then you should
365 be okay because you are using a locale name that should be
366 installed and available in your system. In this case, see
367 ``Permanently fixing your system's locale
368 configuration''.
369
370
371 __Permanently fixing your system's locale
372 configuration__
373
374
375 This is when you see something like:
376
377
378 perl: warning: Please check that your locale settings:
379 LC_ALL = "En_US"
380 but then cannot see that ``En_US'' listed by the above-mentioned commands. You may see things like ``en_US.ISO8859-1'', but that isn't the same. In this case, try running under a locale that you can list and which somehow matches what you tried. The rules for matching locale names are a bit vague because standardization is weak in this area. See again the ``Finding locales'' about general rules.
381
382
383 __Fixing system locale configuration__
384
385
386 Contact a system administrator (preferably your own) and
387 report the exact error message you get, and ask them to read
388 this same documentation you are now reading. They should be
389 able to check whether there is something wrong with the
390 locale configuration of the system. The ``Finding locales''
391 section is unfortunately a bit vague about the exact
392 commands and places because these things are not that
393 standardized.
394
395
396 __The localeconv function__
397
398
399 The ''POSIX::localeconv()'' function allows you to get
400 particulars of the locale-dependent numeric formatting
401 information specified by the current LC_NUMERIC and
402 LC_MONETARY locales. (If you just want the name of
403 the current locale for a particular category, use
404 ''POSIX::setlocale()'' with a single parameter--see ``The
405 setlocale function''.)
406
407
408 use POSIX qw(locale_h);
409 # Get a reference to a hash of locale-dependent info
410 $locale_values = localeconv();
411 # Output sorted list of the values
412 for (sort keys %$locale_values) {
413 printf "%-20s = %s\n", $_, $locale_values->{$_} }
414 ''localeconv()'' takes no arguments, and returns __a reference to__ a hash. The keys of this hash are variable names for formatting, such as decimal_point and thousands_sep. The values are the corresponding, er, values. See ``localeconv'' in POSIX for a longer example listing the categories an implementation might be expected to provide; some provide more and others fewer. You don't need an explicit use locale, because ''localeconv()'' always observes the current locale.
415
416
417 Here's a simple-minded example program that rewrites its
418 command-line parameters as integers correctly formatted in
419 the current locale:
420
421
422 # See comments in previous example
423 require 5.004;
424 use POSIX qw(locale_h);
425 # Get some of locale's numeric formatting parameters
426 my ($thousands_sep, $grouping) =
427 @{localeconv()}{'thousands_sep', 'grouping'};
428 # Apply defaults if values are missing
429 $thousands_sep = ',' unless $thousands_sep;
430 # grouping and mon_grouping are packed lists
431 # of small integers (characters) telling the
432 # grouping (thousand_seps and mon_thousand_seps
433 # being the group dividers) of numbers and
434 # monetary quantities. The integers' meanings:
435 # 255 means no more grouping, 0 means repeat
436 # the previous grouping, 1-254 means use that
437 # as the current grouping. Grouping goes from
438 # right to left (low to high digits). In the
439 # below we cheat slightly by never using anything
440 # else than the first grouping (whatever that is).
441 if ($grouping) {
442 @grouping = unpack("C*", $grouping); } else { @grouping = (3); }
443 # Format command line params for current locale
444 for (@ARGV) {
445 $_ = int; # Chop non-integer part
446 1 while
447 s/(\d)(\d{$grouping[[0]}($|\Q$thousands_sep\E))/$1$thousands_sep$2/;
448 print "$_\n"; }
449 !!LOCALE CATEGORIES
450
451
452 The following subsections describe basic locale categories.
453 Beyond these, some combination categories allow manipulation
454 of more than one basic category at a time. See
455 ``ENVIRONMENT'' for a discussion of
456 these.
457
458
459 __Category LC_COLLATE:
460 Collation__
461
462
463 In the scope of use locale, Perl looks to the
464 LC_COLLATE environment variable to determine the
465 application's notions on collation (ordering) of characters.
466 For example, 'b' follows 'a' in Latin alphabets, but where
467 do 'a' and 'aa' belong? And while 'color' follows
468 'chocolate' in English, what about in Spanish?
469
470
471 The following collations all make sense and you may meet any
472 of them if you ``use locale''.
473
474
475 A B C D E a b c d e
476 A a B b C c D d E e
477 a A b B c C d D e E
478 a b c d e A B C D E
479 Here is a code snippet to tell what ``word'' characters are in the current locale, in that locale's order:
480
481
482 use locale;
483 print +(sort grep /\w/, map { chr } 0..255), "\n";
484 Compare this with the characters that you see and their order if you state explicitly that the locale should be ignored:
485
486
487 no locale;
488 print +(sort grep /\w/, map { chr } 0..255), "\n";
489 This machine-native collation (which is what you get unless use locale has appeared earlier in the same block) must be used for sorting raw binary data, whereas the locale-dependent collation of the first example is useful for natural text.
490
491
492 As noted in ``USING LOCALES'', cmp
493 compares according to the current collation locale when
494 use locale is in effect, but falls back to a
495 byte-by-byte comparison for strings that the locale says are
496 equal. You can use ''POSIX::strcoll()'' if you don't want
497 this fall-back:
498
499
500 use POSIX qw(strcoll);
501 $equal_in_locale =
502 !strcoll("space and case ignored", "SpaceAndCaseIgnored");
503 $equal_in_locale will be true if the collation locale specifies a dictionary-like ordering that ignores space characters completely and which folds case.
504
505
506 If you have a single string that you want to check for
507 ``equality in locale'' against several others, you might
508 think you could gain a little efficiency by using
509 ''POSIX::strxfrm()'' in conjunction with
510 eq:
511
512
513 use POSIX qw(strxfrm);
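514 $xfrm_string = strxfrm("Mixed-case string"); print "locale collation ignores case\n" if $xfrm_string eq strxfrm("mixed-case string");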
515 ''strxfrm()'' takes a string and maps it into a transformed string for use in byte-by-byte comparisons against other transformed strings during collation. ``Under the hood'', locale-affected Perl comparison operators call ''strxfrm()'' for both operands, then do a byte-by-byte comparison of the transformed strings. By calling ''strxfrm()'' explicitly and using a non locale-affected comparison, the example attempts to save a couple of transformations. But in fact, it doesn't save anything: Perl magic (see ``Magic Variables'' in perlguts) creates the transformed version of a string the first time it's needed in a comparison, then keeps this version around in case it's needed again. An example rewritten the easy way with cmp runs just about as fast. It also copes with null characters embedded in strings; if you call ''strxfrm()'' directly, it treats the first null it finds as a terminator. Also, don't expect the transformed strings it produces to be portable across systems--or even from one revision of your operating system to the next. In short, don't call ''strxfrm()'' directly: let Perl do it for you.
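
For the one-against-many check discussed above, a simpler route is to call ''POSIX::strcoll()'' once per candidate. The following is only a sketch: equal_in_locale_to() is a hypothetical helper, and which strings it reports depends on the current LC_COLLATE locale.

```perl
use strict;
use warnings;
use POSIX qw(strcoll);

# equal_in_locale_to() is a hypothetical helper for this sketch.
# It returns the candidates that collate equal to $reference in the
# current LC_COLLATE locale, without the byte-by-byte fallback of cmp.
sub equal_in_locale_to {
    my ($reference, @candidates) = @_;
    return grep { strcoll($reference, $_) == 0 } @candidates;
}

my @same = equal_in_locale_to('Mixed-case string',
                              'mixed-case string', 'Mixed-case string');
print "collates equal: $_\n" for @same;
```

In the ``C'' locale only the identical string is reported; a locale that folds case may report both.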
516
517
518 Note: use locale isn't shown in some of these
519 examples because it isn't needed: ''strcoll()'' and
520 ''strxfrm()'' exist only to generate locale-dependent
521 results, and so always obey the current LC_COLLATE
522 locale.
523
524
525 __Category LC_CTYPE: Character
526 Types__
527
528
529 In the scope of use locale, Perl obeys the
530 LC_CTYPE locale setting. This controls the
531 application's notion of which characters are alphabetic.
532 This affects Perl's \w regular expression
533 metanotation, which stands for alphanumeric characters--that
534 is, alphabetic and numeric characters, plus the
535 underscore. (Consult perlre
536 for more information about regular expressions.) Thanks to
537 LC_CTYPE, depending on your locale setting,
538 characters like 'æ', 'ð', 'ß', and 'ø' may be understood
539 as \w characters.
540
541
542 The LC_CTYPE locale also provides the map used in
543 transliterating characters between lower and uppercase. This
544 affects the case-mapping functions--''lc()'', ''lcfirst()'',
545 ''uc()'', and ''ucfirst()''; case-mapping
546 interpolation with \l, \L, \u, or
547 \U in double-quoted strings and s///
548 substitutions; and case-independent regular expression
549 pattern matching using the i modifier.
550
551
552 Finally, LC_CTYPE affects the POSIX
553 character-class test functions--''isalpha()'',
554 ''islower()'', and so on. For example, if you move from
555 the ``C'' locale to a 7-bit Scandinavian one, you may
556 find--possibly to your surprise--that ``|'' moves from the
557 ''ispunct()'' class to ''isalpha()''.
558
559
560 __Note:__ A broken or malicious LC_CTYPE locale
561 definition may result in clearly ineligible characters being
562 considered to be alphanumeric by your application. For
563 strict matching of (mundane) letters and digits--for
564 example, in command strings--locale-aware applications
565 should use \w inside a no locale block. See
566 ``SECURITY''.
567
568
569 __Category LC_NUMERIC: Numeric
570 Formatting__
571
572
573 In the scope of use locale, Perl obeys the
574 LC_NUMERIC locale information, which controls an
575 application's idea of how numbers should be formatted for
576 human readability by the ''printf()'', ''sprintf()'',
577 and ''write()'' functions. String-to-numeric conversion
578 by the ''POSIX::strtod()'' function is also affected. In
579 most implementations the only effect is to change the
580 character used for the decimal point--perhaps from '.' to
581 ','. These functions aren't aware of such niceties as
582 thousands separation and so on. (See ``The localeconv
583 function'' if you care about these things.)
584
585
586 Output produced by ''print()'' is also affected by the
587 current locale when use locale is in effect; under
588 no locale it corresponds to what
589 you'd get from ''printf()'' in the ``C'' locale. The same
590 is true for Perl's internal conversions between numeric and
591 string formats:
592
593
594 use POSIX qw(strtod);
595 use locale;
596 $n = 5/2; # Assign numeric 2.5 to $n
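597 $a = " $n"; # Locale-dependent conversion to string
598 print "half five is $n\n"; # Locale-dependent output
599 printf "half five is %g\n", $n; # Locale-dependent output
600 print "DECIMAL POINT IS COMMA\n" if $n == (strtod("2,5"))[[0]; # Locale-dependent conversion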
601
602
603 __Category LC_MONETARY: Formatting of
604 monetary amounts__
605
606
607 The C standard defines the LC_MONETARY category,
608 but no function that is affected by its contents. (Those
609 with experience of standards committees will recognize that
610 the working group decided to punt on the issue.)
611 Consequently, Perl takes no notice of it. If you really want
612 to use LC_MONETARY, you can query its contents--see
613 ``The localeconv function''--and use the information that it
614 returns in your application's own formatting of currency
615 amounts. However, you may well find that the information,
616 voluminous and complex though it may be, still does not
617 quite meet your requirements: currency formatting is a hard
618 nut to crack.
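
A do-it-yourself formatter built on ''localeconv()'' might start out like the sketch below. format_money() is a hypothetical helper; it uses only the currency_symbol, mon_decimal_point, and frac_digits fields and deliberately ignores grouping and the sign- and symbol-position fields, which is an assumption made to keep the example short.

```perl
use strict;
use warnings;
use POSIX qw(locale_h);   # exports localeconv(), setlocale(), LC_* constants

# format_money() is a hypothetical helper for this sketch. It is a
# starting point only: grouping and sign/symbol placement are ignored.
sub format_money {
    my ($amount) = @_;
    my $lc = localeconv();

    # localeconv() may omit fields the locale leaves empty: apply defaults.
    my $symbol = defined $lc->{currency_symbol}   ? $lc->{currency_symbol}   : '';
    my $point  = defined $lc->{mon_decimal_point} ? $lc->{mon_decimal_point} : '.';
    my $digits = $lc->{frac_digits};
    $digits = 2 unless defined $digits && $digits >= 0 && $digits <= 9;

    # Format without "use locale" so the decimal point is '.', then swap it.
    my $text = sprintf('%.*f', $digits, abs($amount));
    $text =~ s/\./$point/;
    return ($amount < 0 ? '-' : '') . $symbol . $text;
}

print format_money(1234.5), "\n";
```

Even this small sketch shows why the text calls currency formatting a hard nut to crack: each field it skips is a rule some locale depends on.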
619
620
621 __Category LC_TIME__
622
623
624 Output produced by ''POSIX::strftime()'', which builds a
625 formatted human-readable date/time string, is affected by
626 the current LC_TIME locale. Thus, in a French
627 locale, the output produced by the %B format
628 element (full month name) for the first month of the year
629 would be ``janvier''. Here's how to get a list of long month
630 names in the current locale:
631
632
633 use POSIX qw(strftime);
634 for (0..11) {
635 $long_month_name[[$_] =
636 strftime("%B", 0, 0, 0, 1, $_, 96); }
637 Note: use locale isn't needed in this example: as a function that exists only to generate locale-dependent results, ''strftime()'' always obeys the current LC_TIME locale.
638
639
640 __Other categories__
641
642
643 The remaining locale category, LC_MESSAGES
644 (possibly supplemented by others in particular
645 implementations) is not currently used by Perl--except
646 possibly to affect the behavior of library functions called
647 by extensions outside the standard Perl distribution and by
648 the operating system and its utilities. Note especially that
649 the string value of $! and the error messages given
650 by external utilities may be changed by
651 LC_MESSAGES. If you want to have portable error
652 codes, use %!. See Errno.
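
A test against %! keeps working whatever language the message text of $! appears in. This is a minimal sketch assuming the core Errno module; file_is_missing() is a hypothetical helper and the path is made up.

```perl
use strict;
use warnings;
use Errno;   # provides the %! hash (Perl also loads it on demand)

# file_is_missing() is a hypothetical helper for this sketch.
sub file_is_missing {
    my ($path) = @_;
    if (open(my $fh, '<', $path)) {
        close $fh;
        return 0;
    }
    # $!{ENOENT} tests the numeric error code, not the translated message.
    return $!{ENOENT} ? 1 : 0;
}

print "missing\n" if file_is_missing('/no/such/file/for-this-example');
```

Comparing the string value of $! against an English message would break as soon as LC_MESSAGES selected another language.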
653 !!SECURITY
654
655
656 Although the main discussion of Perl security issues can be
657 found in perlsec, a discussion of Perl's locale handling
658 would be incomplete if it did not draw your attention to
659 locale-dependent security issues. Locales--particularly on
660 systems that allow unprivileged users to build their own
661 locales--are untrustworthy. A malicious (or just plain
662 broken) locale can make a locale-aware application give
663 unexpected results. Here are a few
664 possibilities:
665
666
667 Regular expression checks for safe file names or mail
668 addresses using \w may be spoofed by an
669 LC_CTYPE locale that claims that characters such as
670 ``>'' and ``|'' are alphanumeric.
671
672 String interpolation with case-mapping, as in, say,
673 $dest = "C:\U$name.$ext", may produce
674 dangerous results if a bogus LC_CTYPE
675 case-mapping table is in effect.
676
677
678 A sneaky LC_COLLATE locale could result in the
679 names of students with ``D'' grades appearing ahead of those
680 with ``A''s.
681
682
683 An application that takes the trouble to use information in
684 LC_MONETARY may format debits as if they were
685 credits and vice versa if that locale has been subverted. Or
686 it might make payments in US dollars instead
687 of Hong Kong dollars.
688
689
690 The date and day names in dates formatted by
691 ''strftime()'' could be manipulated to advantage by a
692 malicious user able to subvert the LC_TIME locale.
693 (``Look--it says I wasn't in the building on
694 Sunday.'')
695
696
697 Such dangers are not peculiar to the locale system: any
698 aspect of an application's environment which may be modified
699 maliciously presents similar challenges. Similarly, they are
700 not specific to Perl: any programming language that allows
701 you to write programs that take account of their environment
702 exposes you to these issues.
703
704
705 Perl cannot protect you from all possibilities shown in the
706 examples--there is no substitute for your own
707 vigilance--but, when use locale is in effect, Perl
708 uses the tainting mechanism (see perlsec) to mark string
709 results that become locale-dependent, and which may be
710 untrustworthy in consequence. Here is a summary of the
711 tainting behavior of operators and functions that may be
712 affected by the locale:
713
714
715 __Comparison operators__ (lt, le,
716 ge, gt and cmp):
717
718
719 Scalar true/false (or less/equal/greater) result is never
720 tainted.
721
722
723 __Case-mapping interpolation__ (with \l,
724 \L, \u or \U):
725
726
727 Result string containing interpolated material is tainted if
728 use locale is in effect.
729
730
731 __Matching operator__ (m//):
732
733
734 Scalar true/false result never tainted.
735
736
737 Subpatterns, either delivered as a list-context result or as
738 $1 etc. are tainted if use locale is in
739 effect, and the subpattern regular expression contains
740 \w (to match an alphanumeric character), \W
741 (non-alphanumeric character), \s (white-space
742 character), or \S (non white-space character). The
743 matched-pattern variable, $&, $` (pre-match), $' (post-match), and $+ (last match) are also tainted if
744 use locale is in effect and the regular expression
745 contains \w, \W, \s, or
746 \S.
747
748
749 __Substitution operator__ (s///):
750
751
752 Has the same behavior as the match operator. Also, the left
753 operand of =~ becomes tainted when use
754 locale is in effect if modified as a result of a
755 substitution based on a regular expression match involving
756 \w, \W, \s, or \S; or of
757 case-mapping with \l, \L, \u or
758 \U.
759
760
761 __Output formatting functions__ (''printf()'' and
762 ''write()''):
763
764
765 Results are never tainted because otherwise even output from
766 print, for example print(1/7), should be tainted if
767 use locale is in effect.
768
769
770 __Case-mapping functions__ (''lc()'',
771 ''lcfirst()'', ''uc()'',
772 ''ucfirst()''):
773
774
775 Results are tainted if use locale is in
776 effect.
777
778
779 __POSIX locale-dependent functions__
780 (''localeconv()'', ''strcoll()'', ''strftime()'',
781 ''strxfrm()''):
782
783
784 Results are never tainted.
785
786
787 __POSIX character class tests__
788 (''isalnum()'', ''isalpha()'', ''isdigit()'',
789 ''isgraph()'', ''islower()'', ''isprint()'',
790 ''ispunct()'', ''isspace()'', ''isupper()'',
791 ''isxdigit()''):
792
793
794 True/false results are never tainted.
795
796
797 Three examples illustrate locale-dependent tainting. The
798 first program, which ignores its locale, won't run: a value
799 taken directly from the command line may not be used to name
800 an output file when taint checks are enabled.
801
802
803 #!/usr/local/bin/perl -T
804 # Run with taint checking
805 # Command line sanity check omitted...
806 $tainted_output_file = shift;
807 open(F, ">$tainted_output_file") or warn "Open of $tainted_output_file failed: $!\n";
808 The program can be made to run by ``laundering'' the tainted value through a regular expression: the second example--which still ignores locale information--runs, creating the file named on its command line if it can.
809
810
811 #!/usr/local/bin/perl -T
812 $tainted_output_file = shift;
813 $tainted_output_file =~ m%[[\w/]+%;
814 $untainted_output_file = $&;
815 open(F, ">$untainted_output_file") or warn "Open of $untainted_output_file failed: $!\n";
816 Compare this with a similar but locale-aware program:
817
818
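819 #!/usr/local/bin/perl -T
820 $tainted_output_file = shift;
821 use locale;
822 $tainted_output_file =~ m%[[\w/]+%;
823 $localized_output_file = $&;
824 open(F, ">$localized_output_file") or warn "Open of $localized_output_file failed: $!\n";
825 This third program fails to run because $& is tainted: it is the result of a match involving \w while use locale is in effect.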
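
One way around the failure in the third program is to switch the pragma off for the laundering match only. The following is a minimal sketch, not part of the original manual: the helper name launder_path and the sample path are hypothetical. It relies on use locale being lexically scoped, so that no locale inside the subroutine leaves the rest of the file locale-aware. It also captures with parentheses and $1 rather than $&, and anchors the pattern so that the whole argument must match.

```perl
use strict;
use warnings;
use locale;    # the rest of this file is locale-aware

# Hypothetical helper: return the candidate path if it consists only of
# word characters and slashes, or undef otherwise.
sub launder_path {
    my ($candidate) = @_;
    no locale;    # lexically scoped: this match ignores the locale
    return $candidate =~ m%^([\w/]+)\z% ? $1 : undef;
}

my $requested = "reports/summary";    # stands in for a command-line argument
my $clean = launder_path($requested);
print defined $clean ? "accepted: $clean\n" : "rejected\n";
```

Because the match runs outside the scope of use locale, the captured value should not pick up the locale-dependent taint described above. Whether a pattern this permissive is a sufficient sanity check for a file name is a separate question that each application has to answer.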
826 !!ENVIRONMENT
827
828
829 PERL_BADLANG
830
831
832 A string that can suppress Perl's warning about failed
833 locale settings at startup. Failure can occur if the locale
834 support in the operating system is lacking (broken) in some
835 way--or if you mistyped the name of a locale when you set up
836 your environment. If this environment variable is absent, or
837 has a value that does not evaluate to integer zero (that is, a value
838 other than ``0'' or the empty string), Perl will complain about locale setting failures.
839
840
841 __NOTE__ : PERL_BADLANG
842 only gives you a way to hide the warning message. The
843 message tells about some problem in your system's locale
844 support, and you should investigate what the problem
845 is.
846
847
848 The following environment variables are not specific to
849 Perl: They are part of the standardized ( ISO
850 C, XPG4 , POSIX 1.c)
851 ''setlocale()'' method for controlling an application's
852 opinion on data.
853
854
855 LC_ALL
856
857
858 LC_ALL is the ``override-all'' locale environment
859 variable. If set, it overrides all the rest of the locale
860 environment variables.
861
862
863 LANGUAGE
864
865
866 __NOTE__ : LANGUAGE is a
867 GNU extension; it affects you only if you are
868 using the GNU libc. This is the case if you
869 are using e.g. Linux. If you are using ``commercial'' UNIXes
870 you are most probably ''not'' using GNU
871 libc and you can ignore LANGUAGE.
872
873
874 However, if you are using LANGUAGE, it
875 affects the language of informational, warning, and error
876 messages output by commands (in other words, it's like
877 LC_MESSAGES) but it has higher priority than
878 LC_ALL . Moreover, it's not a single value
879 but instead a ``path'' (``:''-separated list) of
880 ''languages'' (not locales). See the GNU
881 gettext library documentation for more
882 information.
883
884
885 LC_CTYPE
886
887
888 In the absence of LC_ALL, LC_CTYPE chooses
889 the character type locale. In the absence of both
890 LC_ALL and LC_CTYPE, LANG chooses
891 the character type locale.
892
893
894 LC_COLLATE
895
896
897 In the absence of LC_ALL, LC_COLLATE
898 chooses the collation (sorting) locale. In the absence of
899 both LC_ALL and LC_COLLATE, LANG
900 chooses the collation locale.
901
902
903 LC_MONETARY
904
905
906 In the absence of LC_ALL, LC_MONETARY
907 chooses the monetary formatting locale. In the absence of
908 both LC_ALL and LC_MONETARY, LANG
909 chooses the monetary formatting locale.
910
911
912 LC_NUMERIC
913
914
915 In the absence of LC_ALL, LC_NUMERIC
916 chooses the numeric format locale. In the absence of both
917 LC_ALL and LC_NUMERIC, LANG
918 chooses the numeric format.
919
920
921 LC_TIME
922
923
924 In the absence of LC_ALL, LC_TIME chooses
925 the date and time formatting locale. In the absence of both
926 LC_ALL and LC_TIME, LANG chooses
927 the date and time formatting locale.
928
929
930 LANG LANG is the ``catch-all''
931 locale environment variable. If it is set, it is used as the
932 last resort after the overall LC_ALL and the
933 category-specific LC_....
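
The precedence described above (LC_ALL first, then the category-specific variable, then LANG) can be sketched as a small lookup. This is an illustration only: the helper name effective_locale and the sample environment are hypothetical, and the real decision is made by the C library's ''setlocale()'', not by Perl code like this. LANGUAGE is left out because it concerns message languages rather than locale categories. The sketch assumes the POSIX rule that a variable set to the empty string counts as unset.

```perl
use strict;
use warnings;

# Hypothetical helper: name the locale the environment selects for one
# category, trying LC_ALL, then the category variable, then LANG.
# A variable set to the empty string is treated as unset.
sub effective_locale {
    my ($category, $env) = @_;
    for my $name ("LC_ALL", $category, "LANG") {
        my $value = $env->{$name};
        return $value if defined $value && length $value;
    }
    return;    # nothing set: the implementation's default applies
}

my %example_env = (LANG => "en_US", LC_COLLATE => "fr_FR");    # illustrative
for my $category (qw(LC_CTYPE LC_COLLATE LC_NUMERIC)) {
    my $locale = effective_locale($category, \%example_env);
    printf "%-10s => %s\n", $category, defined $locale ? $locale : "(default)";
}
```

With the illustrative environment above, LC_COLLATE resolves to fr_FR and the other two categories fall back to LANG. Passing \%ENV instead of the sample hash reports what the current process would get.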
934 !!NOTES
935
936
937 __Backward compatibility__
938
939
940 Versions of Perl prior to 5.004 __mostly__ ignored locale
941 information, generally behaving as if something similar to
942 the ``C'' locale were always in force, even
943 if the program environment suggested otherwise (see ``The
944 setlocale function''). By default, Perl still behaves this
945 way for backward compatibility. If you want a Perl
946 application to pay attention to locale information, you
947 __must__ use the use locale pragma (see ``The
948 use locale pragma'') to instruct it to do so.
949
950
951 Versions of Perl from 5.002 to 5.003 did use the
952 LC_CTYPE information if available; that is,
953 \w did understand what were the letters according to
954 the locale environment variables. The problem was that the
955 user had no control over the feature: if the C library
956 supported locales, Perl used them.
957
958
959 __I18N::Collate obsolete__
960
961
962 In versions of Perl prior to 5.004, per-locale collation was
963 possible using the I18N::Collate library module.
964 This module is now mildly obsolete and should be avoided in
965 new applications. The LC_COLLATE functionality is
966 now integrated into the Perl core language: One can use
967 locale-specific scalar data completely normally with use
968 locale, so there is no longer any need to juggle with
969 the scalar references of
970 I18N::Collate.
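
As an illustration of the integrated approach, the sketch below sorts ordinary scalars with cmp under use locale, with no I18N::Collate objects involved. The helper name locale_sort and the word list are hypothetical. The resulting order depends on the collation locale your environment selects, so no particular output is shown.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Adopt the collation locale named by LC_ALL, LC_COLLATE, or LANG.
setlocale(LC_COLLATE, "");

# Hypothetical helper: sort strings by the current locale's collation rules.
# Call it in list context.
sub locale_sort {
    my @items = @_;
    use locale;    # cmp obeys LC_COLLATE within this subroutine
    return sort { $a cmp $b } @items;
}

my @words = ("zebra", "apple", "Banana", "cherry");    # illustrative data
print join(" ", locale_sort(@words)), "\n";
```

Keeping use locale inside the subroutine confines the locale-dependent comparison to the one place that needs it.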
971
972
973 __Sort speed and memory use impacts__
974
975
976 Comparing and sorting by locale is usually slower than the
977 default sorting; slow-downs of two to four times have been
978 observed. It will also consume more memory: once a Perl
979 scalar variable has participated in any string comparison or
980 sorting operation obeying the locale collation rules, it
981 will take 3-15 times more memory than before. (The exact
982 multiplier depends on the string's contents, the operating
983 system and the locale.) These downsides are dictated more by
984 the operating system's implementation of the locale system
985 than by Perl.
986
987
988 ''write()'' __and
989 LC_NUMERIC__
990
991
992 Formats are the only part of Perl that unconditionally use
993 information from a program's locale; if a program's
994 environment specifies an LC_NUMERIC locale,
995 it is always used to specify the decimal point character in
996 formatted output. Formatted output cannot be controlled by
997 use locale because the pragma is tied to the block
998 structure of the program, and, for historical reasons,
999 formats exist outside that block structure.
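
To see which decimal point character a given LC_NUMERIC locale would supply, you can ask ''localeconv()''. The following is a rough sketch: the helper name decimal_point_for is hypothetical, and locale names other than ``C'' vary from system to system. It switches the category temporarily and restores the previous setting afterwards.

```perl
use strict;
use warnings;
use POSIX qw(setlocale localeconv LC_NUMERIC);

# Hypothetical helper: return the decimal point character of the named
# LC_NUMERIC locale, or undef if the system does not provide that locale.
sub decimal_point_for {
    my ($locale_name) = @_;
    my $previous = setlocale(LC_NUMERIC);    # one argument: query only
    return unless defined setlocale(LC_NUMERIC, $locale_name);
    my $point = localeconv()->{decimal_point};
    setlocale(LC_NUMERIC, $previous) if defined $previous;    # restore
    return $point;
}

print "C locale decimal point: ", decimal_point_for("C"), "\n";
```

Passing the empty string as the locale name asks ''setlocale()'' to take the locale from the environment, which shows the character that an LC_NUMERIC setting in the environment would select.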
1000
1001
1002 __Freely available locale definitions__
1003
1004
1005 There is a large collection of locale definitions at
1006 ftp://dkuug.dk/i18n/WG15-collection. You should be
1007 aware that it is unsupported, and is not claimed to be fit
1008 for any purpose. If your system allows installation of
1009 arbitrary locales, you may find the definitions useful as
1010 they are, or as a basis for the development of your own
1011 locales.
1012
1013
1014 __I18n and l10n__
1015
1016
1017 ``Internationalization'' is often abbreviated as __i18n__
1018 because its first and last letters are separated by eighteen
1019 others. (You may guess why the internalin ... internaliti
1020 ... i18n tends to get abbreviated.) In the same way,
1021 ``localization'' is often abbreviated to
1022 __l10n__.
1023
1024
1025 __An imperfect standard__
1026
1027
1028 Internationalization, as defined in the C and
1029 POSIX standards, can be criticized as
1030 incomplete, ungainly, and having too large a granularity.
1031 (Locales apply to a whole process, when it would arguably be
1032 more useful to have them apply to a single thread, window
1033 group, or whatever.) They also have a tendency, like
1034 standards groups, to divide the world into nations, when we
1035 all know that the world can equally well be divided into
1036 bankers, bikers, gamers, and so on. But, for now, it's the
1037 only standard we've got. This may be construed as a
1038 bug.
1039 !!BUGS
1040
1041
1042 __Broken systems__
1043
1044
1045 In certain systems, the operating system's locale support is
1046 broken and cannot be fixed or used by Perl. Such
1047 deficiencies can and will result in mysterious hangs and/or
1048 Perl core dumps when use locale is in effect.
1049 When confronted with such a system, please report it in
1050 excruciating detail to perlbug@perl.org.
1052 !!SEE ALSO
1053
1054
1055 ``isalnum'' in POSIX , ``isalpha'' in
1056 POSIX , ``isdigit'' in POSIX ,
1057 ``isgraph'' in POSIX , ``islower'' in
1058 POSIX , ``isprint'' in POSIX ,
1059 ``ispunct'' in POSIX , ``isspace'' in
1060 POSIX , ``isupper'' in POSIX ,
1061 ``isxdigit'' in POSIX , ``localeconv'' in
1062 POSIX , ``setlocale'' in POSIX
1063 , ``strcoll'' in POSIX , ``strftime'' in
1064 POSIX , ``strtod'' in POSIX ,
1065 ``strxfrm'' in POSIX .
1066 !!HISTORY
1067
1068
1069 Jarkko Hietaniemi's original ''perli18n.pod'' heavily
1070 hacked by Dominic Dunlop, assisted by the perl5-porters.
1071 Prose worked over a bit by Tom Christiansen.
1072
1073
1074 Last update: Thu Jun 11 08:44:13 MDT
1075 1998
1076 ----