Annotated edit history of perlre(1) version 2, including all changes.
Rev Author # Line
1 perry 1 PERLRE
2 !!!PERLRE
3 NAME
4 DESCRIPTION
5 BUGS
6 SEE ALSO
7 ----
8 !!NAME
9
10
11 perlre - Perl regular expressions
12 !!DESCRIPTION
13
14
15 This page describes the syntax of regular expressions in
16 Perl. For a description of how to ''use'' regular
17 expressions in matching operations, plus various examples of
18 the same, see discussions of m//, s///,
19 qr// and ?? in ``Regexp Quote-Like
20 Operators'' in perlop.
21
22
23 Matching operations can have various modifiers. Modifiers
24 that relate to the interpretation of the regular expression
25 inside are listed below. Modifiers that alter the way a
26 regular expression is used by Perl are detailed in ``Regexp
27 Quote-Like Operators'' in perlop and ``Gory details of
28 parsing quoted constructs'' in perlop.
29
30
31 i
32
33
34 Do case-insensitive pattern matching.
35
36
37 If use locale is in effect, the case map is taken
38 from the current locale. See perllocale.
39
40
41 m
42
43
44 Treat string as multiple lines. That is, change ``^'' and
45 ``$'' from matching the start or end of the string to
46 matching the start or end of any line anywhere within the
47 string.
48
49
50 s
51
52
53 Treat string as single line. That is, change ``.'' to match
54 any character whatsoever, even a newline, which normally it
55 would not match.
56
57
58 The /s and /m modifiers both override the
59 $* setting. That is, no matter what $*
60 contains, /s without /m will force ``^''
61 to match only at the beginning of the string and ``$'' to
62 match only at the end (or just before a newline at the end)
63 of the string. Together, as /ms, they let the ``.'' match
64 any character whatsoever, while still allowing ``^'' and
65 ``$'' to match, respectively, just after and just before
66 newlines within the string.
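A minimal sketch of how /m and /s change ``^'' and ``.'', assuming a made-up two-line string in $text (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /(first.*second)/;
my ($with_s) = $text =~ /(first.*second)/s;

print scalar(@plain), " word(s) without m, ", scalar(@multi), " with m\n";
print "without s: ", (defined $no_s   ? "matched" : "no match"), "\n";
print "with s:    ", (defined $with_s ? "matched" : "no match"), "\n";
```

Only the modifiers differ between each pair of matches, so any difference in the results comes from /m or /s alone.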
67
68
69 x
70
71
72 Extend your pattern's legibility by permitting whitespace
73 and comments.
74
75
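76 These are usually written as ``the /x modifier'', even
77 though the delimiter in question might not really be a slash. Any of
78 these modifiers may also be embedded within the regular expression itself using the (?...) construct. See below.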
79
80
81 The /x modifier itself needs a little more
82 explanation. It tells the regular expression parser to
83 ignore whitespace that is neither backslashed nor within a
84 character class. You can use this to break up your regular
85 expression into (slightly) more readable parts. The
86 # character is also treated as a metacharacter
87 introducing a comment, just as in ordinary Perl code. This
88 also means that if you want real whitespace or #
89 characters in the pattern (outside a character class, where
90 they are unaffected by /x), that you'll either have
91 to escape them or encode them using octal or hex escapes.
92 Taken together, these features go a long way towards making
93 Perl's regular expressions more readable. Note that you have
94 to be careful not to include the pattern delimiter in the
95 comment--perl has no way of knowing you did not intend to
96 close the pattern early. See the C-comment deletion code in
97 perlop.
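As an illustration of the /x rules just described, here is a small sketch. The pattern and the sample string are invented for this example. Note that the comments avoid the pattern delimiter and that the literal # is backslashed:

```perl
use strict;
use warnings;

my $re = qr/
    (\d+)       # a run of digits, captured in $1
    \s* \# \s*  # a literal hash sign, backslashed because x is on
    (\w+)       # a word, captured in $2
/x;

my ($num, $tag) = "42 # answer" =~ $re;
print "$num $tag\n";
```

Without /x the same pattern would have to be written as the much denser /(\d+)\s*#\s*(\w+)/.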
98
99
100 __Regular Expressions__
101
102
103 The patterns used in Perl pattern matching derive from those
104 supplied in the Version 8 regex routines. (The routines are
105 derived (distantly) from Henry Spencer's freely
106 redistributable reimplementation of the V8 routines.) See
107 ``Version 8 Regular Expressions'' for details.
108
109
110 In particular the following metacharacters have their
111 standard ''egrep''-ish meanings:
112
113
114 \ Quote the next metacharacter
115 ^ Match the beginning of the line
116 . Match any character (except newline)
117 $ Match the end of the line (or before newline at the end)
118 | Alternation
119 () Grouping
120 [[] Character class
121 By default, the ``^'' character is guaranteed to match only the beginning of the string, the ``$'' character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by ``^'' or ``$''. You may, however, wish to treat a string as a multi-line buffer, such that the ``^'' will match after any newline within the string, and ``$'' will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this practice is now deprecated.)
122
123
124 To simplify multi-line substitutions, the ``.'' character
125 never matches a newline unless you use the /s
126 modifier, which in effect tells Perl to pretend the string
127 is a single line--even if it isn't. The /s modifier
128 also overrides the setting of $*, in case you have
129 some (badly behaved) older code that sets it in another
130 module.
131
132
133 The following standard quantifiers are
134 recognized:
135
136
137 * Match 0 or more times
138 + Match 1 or more times
139 ? Match 1 or 0 times
140 {n} Match exactly n times
141 {n,} Match at least n times
142 {n,m} Match at least n but not more than m times
143 (If a curly bracket occurs in any other context, it is treated as a regular character.) The ``*'' modifier is equivalent to {0,}, the ``+'' modifier to {1,}, and the ``?'' modifier to {0,1}. n and m are limited to integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms. The actual limit can be seen in the error message generated by code such as this:
144
145
146 $_ **= $_ , / {$_} / for 2 .. 42;
147 By default, a quantified subpattern is ``greedy'', that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a ``?''. Note that the meanings don't change, just the ``greediness'':
148
149
150 *? Match 0 or more times
151 +? Match 1 or more times
152 ?? Match 0 or 1 time
153 {n}? Match exactly n times
154 {n,}? Match at least n times
155 {n,m}? Match at least n but not more than m times
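The difference between the greedy and minimal forms can be sketched as follows, assuming a short sample string of three ``a'' characters:

```perl
use strict;
use warnings;

my $s = "aaa";

my ($greedy)       = $s =~ /(a+)/;       # takes as many as possible
my ($lazy)         = $s =~ /(a+?)/;      # takes as few as possible
my ($bounded)      = $s =~ /(a{1,2})/;   # greedy, but capped at two
my ($bounded_lazy) = $s =~ /(a{1,2}?)/;  # minimal within the same bounds

print "greedy=$greedy lazy=$lazy bounded=$bounded bounded_lazy=$bounded_lazy\n";
```

Both forms accept the same set of strings; only the length of the substring each one prefers differs.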
156 Because patterns are processed as double quoted strings, the following also work:
157
158
159 \t tab (HT, TAB)
160 \n newline (LF, NL)
161 \r return (CR)
162 \f form feed (FF)
163 \a alarm (bell) (BEL)
164 \e escape (think troff) (ESC)
165 \033 octal char (think of a PDP-11)
166 \x1B hex char
167 \x{263a} wide hex char (Unicode SMILEY)
168 \c[[ control char
169 \N{name} named char
170 \l lowercase next char (think vi)
171 \u uppercase next char (think vi)
172 \L lowercase till \E (think vi)
173 \U uppercase till \E (think vi)
174 \E end case modification (think vi)
175 \Q quote (disable) pattern metacharacters till \E
176 If use locale is in effect, the case map used by \l, \L, \u and \U is taken from the current locale. See perllocale. For documentation of \N{name}, see charnames.
177
178
179 You cannot include a literal $ or @ within
180 a \Q sequence. An unescaped $ or @
181 interpolates the corresponding variable, while escaping will
182 cause the literal string \$ to be matched. You'll
183 need to write something like
184 m/\Quser\E\@\Qhost/.
185
186
187 In addition, Perl defines the following:
188
189
190 \w Match a ``word'' character (alphanumeric plus ``_'')
191 A \w matches a single alphanumeric character or _, not a whole word. Use \w+ to match a string of Perl-identifier characters (which isn't the same as matching an English word). If use locale is in effect, the list of alphabetic characters generated by \w is taken from the current locale. See perllocale. You may use \w, \W, \s, \S, \d, and \D within character classes, but if you try to use them as endpoints of a range, they do not form a range; the ``-'' is understood literally. See utf8 for details about \pP, \PP, and \X.
192
193
194 The POSIX character class syntax
195
196
197 [[:class:]
198 is also available. The available classes and their backslash equivalents (if available) are as follows:
199
200
201 alpha
202 alnum
203 ascii
204 blank [[1]
205 cntrl
206 digit \d
207 graph
208 lower
209 print
210 punct
211 space \s [[2]
212 upper
213 word \w [[3]
214 xdigit
215 [[1] A GNU extension equivalent to [[ \t], ``all horizontal whitespace''.
216 For example use [[:upper:] to match all the uppercase characters. Note that the [[] are part of the [[::] construct, not part of the whole character class. For example:
217
218
219 [[01[[:alpha:]%]
220 matches zero, one, any alphabetic character, and the percentage sign.
221
222
223 If the utf8 pragma is used, the following
224 equivalences to Unicode \p{} constructs and equivalent
225 backslash character classes (if available), will
226 hold:
227
228
2 perry 229 alpha !IsAlpha
230 alnum !IsAlnum
1 perry 231 ascii IsASCII
2 perry 232 blank !IsSpace
233 cntrl !IsCntrl
234 digit !IsDigit \d
235 graph !IsGraph
236 lower !IsLower
237 print !IsPrint
238 punct !IsPunct
239 space !IsSpace
240 !IsSpacePerl \s
241 upper !IsUpper
242 word !IsWord
1 perry 243 xdigit IsXDigit
2 perry 244 For example [[:lower:] and \p{!IsLower} are equivalent.
1 perry 245
246
247 If the utf8 pragma is not used but the
248 locale pragma is, the classes correlate with the
249 usual isalpha(3) interface (except for `word' and
250 `blank').
251
252
253 The less obviously named classes are:
254
255
256 cntrl
257
258
259 Any control character. Usually characters that don't produce
260 output as such but instead control the terminal somehow: for
261 example newline and backspace are control characters. All
262 characters with ''ord()'' less than 32 are most often
263 classified as control characters (assuming
264 ASCII, the ISO Latin
265 character sets, and Unicode).
266
267
268 graph
269
270
271 Any alphanumeric or punctuation (special)
272 character.
273
274
275 print
276
277
278 Any alphanumeric or punctuation (special) character or
279 space.
280
281
282 punct
283
284
285 Any punctuation (special) character.
286
287
288 xdigit
289
290
291 Any hexadecimal digit. Though this may feel silly
292 ([[0-9A-Fa-f] would work just fine) it is included for
293 completeness.
294
295
296 You can negate the [[::] character classes by prefixing the
297 class name with a '^'. This is a Perl extension. For
298 example:
299
300
301 POSIX trad. Perl utf8 Perl
2 perry 302 [[:^digit:] \D \P{!IsDigit}
303 [[:^space:] \S \P{!IsSpace}
304 [[:^word:] \W \P{!IsWord}
1 perry 305 The POSIX character classes [[.cc.] and [[=cc=] are recognized but __not__ supported and trying to use them will cause an error.
306
307
308 Perl defines the following zero-width
309 assertions:
310
311
312 \b Match a word boundary
313 \B Match a non-(word boundary)
314 \A Match only at beginning of string
315 \Z Match only at end of string, or before newline at the end
316 \z Match only at end of string
317 \G Match only at pos() (e.g. at the end-of-match position
318 of prior m//g)
319 A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. (Within character classes \b represents backspace rather than a word boundary, just as it normally does in any double-quoted string.) The \A and \Z are just like ``^'' and ``$'', except that they won't match multiple times when the /m modifier is used, while ``^'' and ``$'' will match at every internal line boundary. To match the actual end of the string and not ignore an optional trailing newline, use \z.
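A short sketch of the end-of-string and word-boundary assertions, using invented sample strings. Each variable holds 1 if the match succeeded and 0 otherwise:

```perl
use strict;
use warnings;

my $line = "end\n";

my $dollar  = ($line =~ /end$/)  ? 1 : 0;   # the trailing newline is ignored
my $big_z   = ($line =~ /end\Z/) ? 1 : 0;   # likewise for \Z
my $small_z = ($line =~ /end\z/) ? 1 : 0;   # \z insists on the true end

# \b sits between a \w character and a \W character (or the string edge).
my $whole_word = ("foo bar" =~ /\bbar\b/) ? 1 : 0;
my $inside     = ("foobar"  =~ /\bbar\b/) ? 1 : 0;

print "dollar=$dollar Z=$big_z z=$small_z word=$whole_word inside=$inside\n";
```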
320
321
322 The \G assertion can be used to chain global matches
323 (using m//g), as described in ``Regexp Quote-Like
324 Operators'' in perlop. It is also useful when writing
325 lex-like scanners, when you have several patterns
326 that you want to match against consecutive substrings of your
327 string; see the previous reference. The actual location
328 where \G will match can also be influenced by using
329 pos() as an lvalue. See ``pos'' in
330 perlfunc.
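A lex-like scanner of the kind mentioned above might be sketched as follows. The token names (NUM, ID, OP) and the input string are invented for illustration. The /c modifier, described in perlop, keeps pos() in place when a match fails so that the next pattern can be tried at the same position:

```perl
use strict;
use warnings;

my $src = "x = 42";
my @tokens;

pos($src) = 0;                       # pos() used as an lvalue
while (pos($src) < length $src) {
    if    ($src =~ /\G\s+/gc)            { next }                      # skip blanks
    elsif ($src =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1"  }
    elsif ($src =~ /\G(.)/gcs)           { push @tokens, "OP:$1"  }   # anything else
}

print join(" ", @tokens), "\n";
```

Each pattern is anchored with \G, so a token can only start exactly where the previous one ended.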
331
332
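333 The bracketing construct ( ... ) creates capture
334 buffers. To refer to the digit'th buffer use \<digit> within the
335 match; outside the match use ``$'' instead of ``\''. (The \<digit> notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.) Referring back to another
336 part of the match is called a
337 ''backreference''.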
338
339
340 There is no limit to the number of captured substrings that
341 you may use. However Perl also uses \10, \11, etc. as aliases
342 for \010, \011, etc. (Recall that 0 means octal, so \011 is the
343 character at number 9 in your coded character set; which
344 would be the 10th character, a horizontal tab under
345 ASCII.) Perl resolves this ambiguity by
346 interpreting \10 as a backreference only if at least 10 left
347 parentheses have opened before it. Likewise \11 is a
348 backreference only if at least 11 left parentheses have
349 opened before it. And so on. \1 through \9 are always
350 interpreted as backreferences.
351
352
353 Examples:
354
355
356 s/^([[^ ]*) *([[^ ]*)/$2 $1/; # swap first two words
357 if (/(.)\1/) { # find first doubled char
358 print "'$1' is the first doubled character\n"; }
359 if (/Time: (..):(..):(..)/) { # parse out values
360 $hours = $1;
361 $minutes = $2;
362 $seconds = $3;
363 }
364 Several special variables also refer back to portions of the previous match. $+ returns whatever the last bracket match matched. $& returns the entire matched string. (At one point $0 did also, but now it returns the name of the program.) $` returns everything before the matched string. And $' returns everything after the matched string.
365
366
367 The numbered variables ($1, $2, $3, etc.)
368 and the related punctuation set ($+,
369 $&, $`, and $') are all
370 dynamically scoped until the end of the enclosing block or
371 until the next successful match, whichever comes first. (See
372 ``Compound Statements'' in perlsyn.)
373
374
375 __WARNING__ : Once Perl sees that you need
376 one of $&, $`, or $' anywhere
377 in the program, it has to provide them for every pattern
378 match. This may substantially slow your program. Perl uses
379 the same mechanism to produce $1, $2, etc,
380 so you also pay a price for each pattern that contains
381 capturing parentheses. (To avoid this cost while retaining
382 the grouping behaviour, use the extended regular expression
383 (?: ... ) instead.) But if you never use
384 $&, $` or $', then patterns
385 ''without'' capturing parentheses will not be penalized.
386 So avoid $&, $', and $` if
387 you can, but if you can't (and some algorithms really
388 appreciate them), once you've used them once, use them at
389 will, because you've already paid the price. As of 5.005,
390 $& is not so costly as the other
391 two.
392
393
394 Backslashed metacharacters in Perl are alphanumeric, such as
395 \b, \w, \n. Unlike some other
396 regular expression languages, there are no backslashed
397 symbols that aren't alphanumeric. So anything that looks
398 like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-``word'' characters:
399
400
401 $pattern =~ s/(\W)/\\$1/g;
402 (If use locale is set, then this depends on the current locale.) Today it is more common to use the ''quotemeta()'' function or the \Q metaquoting escape sequence to disable all metacharacters' special meanings like this:
403
404
405 /$unquoted\Q$quoted\E$unquoted/
406 Beware that if you put literal backslashes (those not inside interpolated variables) between \Q and \E, double-quotish backslash interpolation may lead to confusing results. If you ''need'' to use literal backslashes within \Q...\E, consult ``Gory details of parsing quoted constructs'' in perlop.
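To make the effect of quoting concrete, here is a small sketch. The string in $literal and the test strings are assumptions chosen so that the unquoted pattern matches something the quoted one does not:

```perl
use strict;
use warnings;

my $literal = "a.b*c";                 # contains two metacharacters
my $quoted  = quotemeta $literal;      # backslashes every non-word character

# Unquoted, "." and "*" keep their special meanings.
my $loose  = ("aXbbc" =~ /^$literal$/)      ? 1 : 0;

# With \Q...\E only the literal text matches.
my $strict = ("aXbbc" =~ /^\Q$literal\E$/)  ? 1 : 0;
my $exact  = ("a.b*c" =~ /^\Q$literal\E$/)  ? 1 : 0;

print "$quoted loose=$loose strict=$strict exact=$exact\n";
```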
407
408
409 __Extended Patterns__
410
411
412 Perl also defines a consistent extension syntax for features
413 not found in standard tools like __awk__ and __lex__.
414 The syntax is a pair of parentheses with a question mark as
415 the first thing within the parentheses. The character after
416 the question mark indicates the extension.
417
418
419 The stability of these extensions varies widely. Some have
420 been part of the core language for many years. Others are
421 experimental and may change without warning or be completely
422 removed. Check the documentation on an individual feature to
423 verify its current status.
424
425
426 A question mark was chosen for this and for the
427 minimal-matching construct because 1) question marks are
428 rare in older regular expressions, and 2) whenever you see
429 one, you should stop and ``question'' exactly what is going
430 on. That's psychology...
431
432
433 (?#text)
434
435
436 A comment. The text is ignored. If the /x modifier
437 enables whitespace formatting, a simple # will
438 suffice. Note that Perl closes the comment as soon as it
439 sees a ), so there is no way to put a literal
440 ) in the comment.
441
442
443 (?imsx-imsx)
444
445
446 One or more embedded pattern-match modifiers. This is
447 particularly useful for dynamic patterns, such as those read
448 in from a configuration file, read in as an argument, or
449 specified in a table somewhere. Consider the case where
450 some patterns want to be case sensitive and some do not. The
451 case insensitive ones need to include merely (?i)
452 at the front of the pattern. For example:
453
454
455 $pattern = "foobar"; if ( /$pattern/i ) { }
456 # more flexible:
457 $pattern = "(?i)foobar"; if ( /$pattern/ ) { }
458 Letters after a - turn those modifiers off. These modifiers are localized inside an enclosing group (if any). For example,
459
460
461 ( (?i) blah ) \s+ \1
462 will match a repeated (''including the case''!) word blah in any case, assuming x modifier, and no i modifier outside this group.
463
464
465 (?:pattern)
466
467
468 (?imsx-imsx:pattern)
469
470
471 This is for clustering, not capturing; it groups
472 subexpressions like ``()'', but doesn't make backreferences
473 as ``()'' does. So
474
475
476 @fields = split(/\b(?:abc)\b/)
477 is like
478
479
480 @fields = split(/\b(abc)\b/)
481 but doesn't spit out extra fields. It's also cheaper not to capture characters if you don't need to.
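The ``extra fields'' can be seen in a small sketch. The input string and the surrounding \s* are assumptions made to keep the output tidy:

```perl
use strict;
use warnings;

my $s = "1 abc 2 abc 3";

# Clustering only: the separator is discarded.
my @clustered = split /\s*(?:abc)\s*/, $s;

# Capturing: split also returns what the parentheses matched.
my @captured  = split /\s*(abc)\s*/, $s;

print scalar(@clustered), " fields: @clustered\n";
print scalar(@captured),  " fields: @captured\n";
```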
482
483
484 Any letters between ? and : act as flag
485 modifiers as with (?imsx-imsx). For
486 example,
487
488
489 /(?s-i:more.*than).*million/i
490 is equivalent to the more verbose
491
492
493 /(?:(?s-i)more.*than).*million/i
494
495
496 (?=pattern)
497
498
499 A zero-width positive look-ahead assertion. For example,
500 /\w+(?=\t)/ matches a word followed by a tab, without
501 including the tab in $&.
502
503
504 (?!pattern)
505
506
507 A zero-width negative look-ahead assertion. For example
508 /foo(?!bar)/ matches any occurrence of ``foo'' that
509 isn't followed by ``bar''. Note however that look-ahead and
510 look-behind are NOT the same thing. You
511 cannot use this for look-behind.
512
513
514 If you are looking for a ``bar'' that isn't preceded by a
515 ``foo'', /(?!foo)bar/ will not do what you want.
516 That's because the (?!foo) is just saying that the
517 next thing cannot be ``foo''--and it's not, it's a ``bar'',
518 so ``foobar'' will match. You would have to do something
519 like /(?!foo)...bar/ for that. We say ``like''
520 because there's the case of your ``bar'' not having three
521 characters before it. You could cover that this way:
522 /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still
523 easier just to say:
524
525
526 if (/bar/ && $` !~ /foo$/)
527 For look-behind see below.
528
529
530 (?<=pattern)
531
532
533 A zero-width positive look-behind assertion. For example,
534 /(?<=\t)\w+/ matches a word that follows a tab,
535 without including the tab in $&. Works only for
536 fixed-width look-behind.
537
538
539 (?<!pattern)
540
541
542 A zero-width negative look-behind assertion. For example
543 /(?<!bar)foo/ matches any occurrence of ``foo''
544 that does not follow ``bar''. Works only for fixed-width
545 look-behind.
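The difference between look-ahead and look-behind for the ``bar not preceded by foo'' problem can be sketched like this, using a few invented test strings:

```perl
use strict;
use warnings;

# Negative look-behind: keep only strings with a "bar" not preceded by "foo".
my @hits = grep { /(?<!foo)bar/ } ("foobar", "bazbar", "bar");

# Negative look-ahead in front of "bar" does not do the same job:
# at the position of "bar" the next thing is never "foo".
my $naive = ("foobar" =~ /(?!foo)bar/) ? 1 : 0;

# Positive look-behind: the tab is required but is not part of the match.
my ($word) = "key\tvalue" =~ /(?<=\t)(\w+)/;

print "hits=@hits naive=$naive word=$word\n";
```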
546
547
548 (?{ code })
549
550
551 __WARNING__ : This extended regular
552 expression feature is considered highly experimental, and
553 may be changed or deleted without notice.
554
555
556 This zero-width assertion evaluates any embedded Perl code.
557 It always succeeds, and its code is not
558 interpolated. Currently, the rules to determine where the
559 code ends are somewhat convoluted.
560
561
562 The code is properly scoped in the following sense:
563 If the assertion is backtracked (compare ``Backtracking''),
564 all changes introduced after localization are
565 undone, so that
566
567
568 $_ = 'a' x 8;
569 m
570 will set $res = 4. Note that after the match, $cnt returns to the globally introduced value, because the scopes that restrict local operators are unwound.
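A minimal sketch of this scoping behaviour, assuming package variables named $cnt and $res as in the text. They are declared with our so that local can be applied to them under use strict:

```perl
use strict;
use warnings;

our ($cnt, $res);

$_ = 'a' x 8;
m<
    (?{ $cnt = 0 })                   # initialise the counter
    (
        a
        (?{ local $cnt = $cnt + 1 })  # increment, undone on backtracking
    )*
    aaaa
    (?{ $res = $cnt })                # copy to a non-localized variable
>x;

print "res = $res\n";
```

The starred group first consumes all eight characters and then has to give four of them back for the trailing aaaa to match, which is why the localized increments for those four are undone.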
571
572
573 This assertion may be used as a
574 (?(condition)yes-pattern|no-pattern) switch. If
575 ''not'' used in this way, the result of evaluation of
576 code is put into the special variable $^R.
577 This happens immediately, so $^R can be used from
578 other (?{ code }) assertions inside the same
579 regular expression.
580
581
582 The assignment to $^R above is properly localized,
583 so the old value of $^R is restored if the
584 assertion is backtracked; compare
585 ``Backtracking''.
586
587
588 For reasons of security, this construct is forbidden if the
589 regular expression involves run-time interpolation of
590 variables, unless the perilous use re 'eval' pragma
591 has been used (see re), or the variables contain results of
592 qr// operator (see ``qr/STRING/imosx'' in
593 perlop).
594
595
596 This restriction is because of the wide-spread and
597 remarkably convenient custom of using run-time determined
598 strings as patterns. For example:
599
600
601 $re = <>; chomp $re; $string =~ /$re/;
602 Before Perl knew how to execute interpolated code within a pattern, this operation was completely safe from a security point of view, although it could raise an exception from an illegal pattern. If you turn on the use re 'eval', though, it is no longer secure, so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe module. See perlsec for details about both these mechanisms.
603
604
605 (??{ code })
606
607
608 __WARNING__ : This extended regular
609 expression feature is considered highly experimental, and
610 may be changed or deleted without notice. A simplified
611 version of the syntax may be introduced for commonly used
612 idioms.
613
614
615 This is a ``postponed'' regular subexpression. The
616 code is evaluated at run time, at the moment this
617 subexpression may match. The result of evaluation is
618 considered as a regular expression and matched as if it were
619 inserted instead of this construct.
620
621
622 The code is not interpolated. As before, the rules
623 to determine where the code ends are currently
624 somewhat convoluted.
625
626
627 The following pattern matches a parenthesized
628 group:
629
630
631 $re = qr{ \(
632 (?:
633 (?> [[^()]+ ) | (??{ $re }) )* \) }x;
634
635
636 (?>pattern)
637
638
639 __WARNING__ : This extended regular
640 expression feature is considered highly experimental, and
641 may be changed or deleted without notice.
642
643
644 An ``independent'' subexpression, one which matches the
645 substring that a ''standalone'' pattern would
646 match if anchored at the given position, and it matches
647 ''nothing other than this substring''. This construct is
648 useful for optimizations of what would otherwise be
649 ``eternal'' matches, because it will not backtrack (see
650 ``Backtracking''). It may also be useful in places where the
651 ``grab all you can, and do not give anything back'' semantic
652 is desirable.
653
654
655 For example: ^(?>a*)ab will never match, since
656 (?>a*) (anchored at the beginning of string, as
657 above) will match ''all'' characters a at the
658 beginning of string, leaving no a for ab
659 to match. In contrast, a*ab will match the same as
660 a+b, since the match of the subgroup a* is
661 influenced by the following group ab (see
662 ``Backtracking''). In particular, a* inside
663 a*ab will match fewer characters than a standalone
664 a*, since this makes the tail match.
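The contrast between the two patterns can be checked with a short sketch. The test string is an assumption, and each variable holds 1 for a successful match and 0 otherwise:

```perl
use strict;
use warnings;

my $s = "aaab";

# a* backtracks: it gives one "a" back so that "ab" can match.
my $backtracking = ($s =~ /^a*ab/)     ? 1 : 0;

# (?>a*) keeps every "a" it grabbed, so "ab" has nothing left to match.
my $independent  = ($s =~ /^(?>a*)ab/) ? 1 : 0;

print "backtracking=$backtracking independent=$independent\n";
```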
665
666
667 An effect similar to (?>pattern) may be achieved
668 by writing (?=(pattern))\1. This matches the same
669 substring as a standalone a+, and the following
670 \1 eats the matched string; it therefore makes a
671 zero-length assertion into an analogue of
672 (?>...). (The difference between these two
673 constructs is that the second one uses a capturing group,
674 thus shifting ordinals of backreferences in the rest of a
675 regular expression.)
676
677
678 Consider this pattern:
679
680
681 m{ \(
682 (
683 [[^()]+ # x+
684 | \( [[^()]* \)
685 )+
686 \)
687 }x
688 That will efficiently match a nonempty group with matching parentheses two levels deep or less. However, if there is no such group, it will take virtually forever on a long string. That's because there are so many different ways to split a long string into several substrings. This is what (.+)+ is doing, and (.+)+ is similar to a subpattern of the above pattern. Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds, but that each extra letter doubles this time. This exponential performance will make it appear that your program has hung. However, a tiny change to this pattern
689
690
691 m{ \(
692 (
693 (?> [[^()]+ ) | \( [[^()]* \) )+ \) }x
694 which uses (?>...) matches exactly when the one above does (verifying this yourself would be a productive exercise), but finishes in a fourth the time when used on a similar string with 1000000 ``a''s. Be aware, however, that this pattern currently triggers a warning message under the use warnings pragma or __-w__ switch saying that it matches the null string many times.
695
696
697 On simple groups, such as the pattern (?> [[^()]+ ),
698 a comparable effect may be achieved by negative
699 look-ahead, as in [[^()]+ (?! [[^()] ). This was only
700 4 times slower on a string with 1000000
701 ``a''s.
702
703
704 The ``grab all you can, and do not give anything back''
705 semantic is desirable in many situations where at first
706 sight a simple ()* looks like the correct solution.
707 Suppose we parse text with comments being delimited by
708 # followed by some optional (horizontal)
709 whitespace. Contrary to its appearance, #[[ \t]*
710 ''is not'' the correct subexpression to match the comment
711 delimiter, because it may ``give up'' some whitespace if the
712 remainder of the pattern can be made to match that way. The
713 correct answer is either one of these:
714
715
716 (?>#[[ \t]*)
    #[[ \t]*(?![[ \t])
717 For example, to grab non-empty comments into $1, one should use either one of these:
718
719
720 / (?> \# [[ \t]* ) ( .+ ) /x;
    / \# [[ \t]* ( [[^ \t] .* ) /x;
721 Which one you pick depends on which of these expressions better reflects the above specification of comments.
722
723
724 (?(condition)yes-pattern|no-pattern)
725
726
727 (?(condition)yes-pattern)
728
729
730 __WARNING__ : This extended regular
731 expression feature is considered highly experimental, and
732 may be changed or deleted without notice.
733
734
735 Conditional expression. (condition) should be
736 either an integer in parentheses (which is valid if the
737 corresponding pair of parentheses matched), or
738 look-ahead/look-behind/evaluate zero-width
739 assertion.
740
741
742 For example:
743
744
745 m{ ( \( )?
746 [[^()]+
747 (?(1) \) )
748 }x
749 matches a chunk of non-parentheses, possibly included in parentheses themselves.
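A sketch of the conditional in use. The pattern is anchored with ^ and $ here (an addition for this illustration) so that the whole string has to fit, and the test strings are invented:

```perl
use strict;
use warnings;

my $re = qr{
    ^ ( \( )?       # optional opening parenthesis, remembered as group 1
    [^()]+          # a chunk of non-parentheses
    (?(1) \) ) $    # a closing parenthesis only if group 1 matched
}x;

my %result = map { $_ => (/$re/ ? 1 : 0) } ("(abc)", "abc", "(abc", "abc)");

print "$_ => $result{$_}\n" for sort keys %result;
```

Balanced and bare chunks match, while a chunk with only one of the two parentheses does not.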
750
751
752 __Backtracking__
753
754
755 NOTE: This section presents an abstract
756 approximation of regular expression behavior. For a more
757 rigorous (and complicated) view of the rules involved in
758 selecting a match among possible alternatives, see
759 ``Combining pieces together''.
760
761
762 A fundamental feature of regular expression matching
763 involves the notion called ''backtracking'', which is
764 currently used (when needed) by all regular expression
765 quantifiers, namely *, *?, +,
766 +?, {n,m}, and {n,m}?.
767 Backtracking is often optimized internally, but the general
768 principle outlined here is valid.
769
770
771 For a regular expression to match, the ''entire'' regular
772 expression must match, not just part of it. So if the
773 beginning of a pattern containing a quantifier succeeds in a
774 way that causes later parts in the pattern to fail, the
775 matching engine backs up and recalculates the beginning
776 part--that's why it's called backtracking.
777
778
779 Here is an example of backtracking: Let's say you want to
780 find the word following ``foo'' in the string ``Food is on
781 the foo table.'':
782
783
784 $_ = "Food is on the foo table."; if ( /\b(foo)\s+(\w+)/i ) { print "$2 follows $1.\n"; }
785 When the match runs, the first part of the regular expression (\b(foo)) finds a possible match right at the beginning of the string, and loads up $1 with ``Foo''. However, as soon as the matching engine sees that there's no whitespace following the ``Foo'' that it had saved in $1, it realizes its mistake and starts over again one character after where it had the tentative match. This time it goes all the way until the next occurrence of ``foo''. The complete regular expression matches this time, and you get the expected output of ``table follows foo.''
786
787
788 Sometimes minimal matching can help a lot. Imagine you'd
789 like to match everything between ``foo'' and ``bar''.
790 Initially, you write something like this:
791
792
793 $_ = "The food is under the bar in the barn."; if ( /foo(.*)bar/ ) { print "got <$1>\n"; }
794 Which perhaps unexpectedly yields:
795
796
797 got <d is under the bar in the >
798 That's because .* was greedy, so you get everything between the ''first'' ``foo'' and the ''last'' ``bar''. Here it's more effective to use minimal matching to make sure you get the text between a ``foo'' and the first ``bar'' thereafter.
799
800
801 if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
802 Here's another example: let's say you'd like to match a number at the end of a string, and you also want to keep the preceding part of the match. So you write this:
803
804
805 $_ = "I have 2 numbers: 53147"; if ( /(.*)(\d*)/ ) { print "Beginning is <$1>, number is <$2>.\n"; } # Wrong!
806 That won't work at all, because .* was greedy and gobbled up the whole string. As \d* can match on an empty string the complete regular expression matched successfully.
807
808
809 Beginning is <I have 2 numbers: 53147>, number is <>.
810 Here are some variants, most of which don't work:
811
812
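813 $_ = "I have 2 numbers: 53147"; @pats = qw{ (.*)(\d*) (.*)(\d+) (.*?)(\d*) (.*?)(\d+) (.*)(\d+)$ (.*?)(\d+)$ (.*)\b(\d+)$ (.*\D)(\d+)$ };
814 for $pat (@pats) {
815 printf "%-12s ", $pat; if ( /$pat/ ) { print "<$1> <$2>\n" } else { print "FAIL\n" } }
816 That will print out:
817
818
819 (.*)(\d*) <I have 2 numbers: 53147> <>
    (.*)(\d+) <I have 2 numbers: 5314> <7>
    (.*?)(\d*) <> <>
    (.*?)(\d+) <I have > <2>
    (.*)(\d+)$ <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$ <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>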
820 As you see, this can be a bit tricky. It's important to realize that a regular expression is merely a set of assertions that gives a definition of success. There may be 0, 1, or several different ways that the definition might succeed against a particular string. And if there are multiple ways it might succeed, you need to understand backtracking to know which variety of success you will achieve.
821
822
823 When using look-ahead assertions and negations, this can all
824 get even trickier. Imagine you'd like to find a sequence of
825 non-digits not followed by ``123''. You might try to write
826 that as
827
828
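829 $_ = "ABC123"; if ( /^\D*(?!123)/ ) { print "Yup, no 123 in $_\n"; } # Wrong!
830 But that isn't going to match; at least, not the way you're hoping. It claims that there is no 123 in the string. Here's a clearer picture of why that pattern matches, contrary to popular expectations:
831
832
833 $x = 'ABC123' ;
834 $y = 'ABC445' ;
835 print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ; print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;
836 print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ; print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;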
837 This prints
838
839
840 2: got ABC
841 3: got AB
842 4: got ABC
843 You might have expected test 3 to fail because it seems to be a more general-purpose version of test 1. The important difference between them is that test 3 contains a quantifier (\D*) and so can use backtracking, whereas test 1 will not. What's happening is that you've asked ``Is it true that at the start of $x, following 0 or more non-digits, you have something that's not 123?'' If the pattern matcher had let \D* expand to ``ABC'', this would have caused the whole pattern to fail.
844
845
846 The search engine will initially match \D* with
847 ``ABC''. Then it will try to match
848 (?!123) with ``123'', which fails. But because a
849 quantifier (\D*) has been used in the regular
850 expression, the search engine can backtrack and retry the
851 match differently in the hope of matching the complete
852 regular expression.
853
854
855 The pattern really, ''really'' wants to succeed, so it
856 uses the standard pattern back-off-and-retry and lets
857 \D* expand to just ``AB'' this
858 time. Now there's indeed something following
859 ``AB'' that is not ``123''. It's ``C123'',
860 which suffices.
861
862
863 We can deal with this by using both an assertion and a
864 negation. We'll say that the first part in $1 must
865 be followed both by a digit and by something that's not
866 ``123''. Remember that the look-aheads are zero-width
867 expressions--they only look, but don't consume any of the
868 string in their match. So rewriting this way produces what
869 you'd expect; that is, case 5 will fail, but case 6
870 succeeds:
871
872
print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

6: got ABC
875 In other words, the two zero-width assertions next to each other work as though they're ANDed together, just as you'd use any built-in assertions: /^$/ matches only if you're at the beginning of the line AND the end of the line simultaneously. The deeper underlying truth is that juxtaposition in regular expressions always means AND, except when you write an explicit OR using the vertical bar. /ab/ means match ``a'' AND (then) match ``b'', although the attempted matches are made at different positions because ``a'' is not a zero-width assertion, but a one-width assertion.
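
As a small sketch of this ANDing (the helper name has_digit_and_lower and the sample strings are made up for illustration and do not come from perlre), two look-aheads placed at the same position must both succeed before the match is accepted:

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the string contains a digit AND a
# lowercase ASCII letter. Both look-aheads are tested at position 0 and
# neither consumes any of the string.
sub has_digit_and_lower {
    my ($s) = @_;
    return $s =~ /^(?=.*\d)(?=.*[a-z])/ ? 1 : 0;
}

my @results = map { has_digit_and_lower($_) } ('abc1', 'abc', '123');
print "@results\n";    # prints "1 0 0"
```

Swapping the order of the two look-aheads would not change the result, since both are evaluated at the same position.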
876
877
878 __WARNING__ : particularly complicated
879 regular expressions can take exponential time to solve
880 because of the immense number of possible ways they can use
881 backtracking to try to match. For example, without internal
882 optimizations done by the regular expression engine, this
883 will take a painfully long time to run:
884
885
886 'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[[c]/
887 And if you used *'s in the internal groups instead of limiting them to 0 through 5 matches, then it would take forever--or until you ran out of stack space. Moreover, these internal optimizations are not always applicable. For example, if you put {0,5} instead of * on the external group, no current optimization is applicable, and the match takes a long time to finish.
888
889
890 A powerful tool for optimizing such beasts is what is known
891 as an ``independent group'', which does not backtrack (see
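892 (?>pattern)). Note also that zero-length look-ahead/look-behind assertions will not backtrack to make the tail match, since they are in ``logical'' context: only whether they match is considered relevant. For an example where side-effects of look-ahead
893 ''might'' have influenced the following match, see
894 (?>pattern).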
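
The effect of an independent group, written (?>pattern), can be seen in a minimal sketch like the following. The sample string and variable names are illustrative assumptions; the point is only that the independent group refuses to give back what it has matched:

```perl
use strict;
use warnings;

my $text = 'aaab';

# Ordinary group: a* first grabs "aaa", then gives one "a" back so that
# the trailing "ab" can match.
my $plain = $text =~ /^(?:a*)ab/ ? 1 : 0;

# Independent group: once (?>a*) has matched "aaa" it never gives
# anything back, so the trailing "ab" cannot match.
my $atomic = $text =~ /^(?>a*)ab/ ? 1 : 0;

print "plain=$plain atomic=$atomic\n";    # prints "plain=1 atomic=0"
```

Because the independent group cuts off the retries, it also removes the combinatorial search that makes patterns like the one in the warning above so slow.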
895
896
897 __Version 8 Regular Expressions__
898
899
900 In case you're not familiar with the ``regular'' Version 8
901 regex routines, here are the pattern-matching rules not
902 described above.
903
904
905 Any single character matches itself, unless it is a
906 ''metacharacter'' with a special meaning described here
907 or above. You can cause characters that normally function as
908 metacharacters to be interpreted literally by prefixing them
909 with a ``\'' (e.g., ``\.'' matches a ``.'', not any character;
910 ``\\'' matches a ``\''). A series of characters matches that
911 series of characters in the target string, so the pattern
912 blurfl would match ``blurfl'' in the target
913 string.
914
915
916 You can specify a character class, by enclosing a list of
917 characters in [[], which will match any one
918 character from the list. If the first character after the
919 ``[['' is ``^'', the class matches any character not in the
920 list. Within a list, the ``-'' character specifies a range,
921 so that a-z represents all characters between ``a''
922 and ``z'', inclusive. If you want either ``-'' or ``]''
923 itself to be a member of a class, put it at the start of the
924 list (possibly after a ``^''), or escape it with a
925 backslash. ``-'' is also taken literally when it is at the
926 end of the list, just before the closing ``]''. (The
927 following all specify the same class of three characters:
928 [[-az], [[az-], and [[a\-z]. All are
929 different from [[a-z], which specifies a class
930 containing twenty-six characters, even on
931 EBCDIC based coded character sets.) Also, if
932 you try to use the character classes \w, \W,
933 \s, \S, \d, or \D as
934 endpoints of a range, that's not a range; the ``-'' is
935 understood literally.
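
The treatment of ``-'' inside a class can be checked with a short sketch. The probe characters and the hash layout are assumptions chosen for this example:

```perl
use strict;
use warnings;

my @probes = ('-', 'a', 'z', 'm');

# Each of the first three classes contains exactly "-", "a" and "z";
# the last one is a true range from "a" to "z".
my %class = (
    '[-az]'  => qr/^[-az]$/,
    '[az-]'  => qr/^[az-]$/,
    '[a\-z]' => qr/^[a\-z]$/,
    '[a-z]'  => qr/^[a-z]$/,
);

# For every class, record which probe characters it accepts.
my %matched;
for my $name (sort keys %class) {
    $matched{$name} = join '', grep { $_ =~ $class{$name} } @probes;
    print "$name accepts: $matched{$name}\n";
}
```

The three literal-hyphen classes accept ``-'', ``a'' and ``z'' but not ``m'', while the range accepts ``a'', ``z'' and ``m'' but not ``-''.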
936
937
938 Note also that the whole range idea is rather unportable
939 between character sets--and even within a character set, ranges
940 may give results you probably didn't expect. A sound
941 principle is to use only ranges that begin and end at
942 either letters of the same case ([[a-e], [[A-E]), or digits
943 ([[0-9]). Anything else is unsafe. If in doubt, spell out the
944 character sets in full.
945
946
947 Characters may be specified using a metacharacter syntax
948 much like that used in C: ``\n'' matches a newline, ``\t'' a
949 tab, ``\r'' a carriage return, ``\f'' a form feed, etc. More
950 generally, \''nnn'', where ''nnn'' is a string of
951 octal digits, matches the character whose coded character
952 set value is ''nnn''. Similarly, \x''nn'', where
953 ''nn'' are hexadecimal digits, matches the character
954 whose numeric value is ''nn''. The expression \c''x''
955 matches the character control-''x''. Finally, the ``.''
956 metacharacter matches any character except ``\n'' (unless you
957 use /s).
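
A quick way to convince yourself of these escapes is a sketch such as the one below. It assumes an ASCII-based platform, where ``A'' has octal value 101 and hexadecimal value 41; the labels and the list layout are illustrative only:

```perl
use strict;
use warnings;

# Each entry is [ description, 1 if the claim holds, 0 otherwise ].
my @checks = (
    [ 'octal \101 matches "A"',     ( 'A'    =~ /^\101$/ ? 1 : 0 ) ],
    [ 'hex \x41 matches "A"',       ( 'A'    =~ /^\x41$/ ? 1 : 0 ) ],
    [ '\cA matches control-A',      ( chr(1) =~ /^\cA$/  ? 1 : 0 ) ],
    [ '\t matches a tab',           ( "\t"   =~ /^\t$/   ? 1 : 0 ) ],
    [ '. does not match a newline', ( "\n"   =~ /^.$/    ? 0 : 1 ) ],
    [ '. matches a newline with /s', ( "\n"  =~ /^.$/s   ? 1 : 0 ) ],
);

printf "%-28s %s\n", $_->[0], ( $_->[1] ? 'ok' : 'NOT ok' ) for @checks;
```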
958
959
960 You can specify a series of alternatives for a pattern using
961 ``|'' to separate them, so that fee|fie|foe will match
962 any of ``fee'', ``fie'', or ``foe'' in the target string (as
963 would f(e|i|o)e). The first alternative includes
964 everything from the last pattern delimiter (``('', ``[['', or
965 the beginning of the pattern) up to the first ``|'', and the
966 last alternative contains everything from the last ``|'' to
967 the next pattern delimiter. That's why it's common practice
968 to include alternatives in parentheses: to minimize
969 confusion about where they start and end.
970
971
972 Alternatives are tried from left to right, so the first
973 alternative found for which the entire expression matches
974 is the one that is chosen. This means that alternatives are
975 not necessarily greedy. For example: when matching
976 foo|foot against ``barefoot'', only the ``foo'' part
977 will match, as that is the first alternative tried, and it
978 successfully matches the target string. (This might not seem
979 important, but it is important when you are capturing
980 matched text using parentheses.)
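
The following sketch shows why this matters once capturing parentheses are involved. The variable names are illustrative; the target string is the ``barefoot'' example from the text:

```perl
use strict;
use warnings;

my $target = 'barefoot';

# "foo" is listed first and lets the whole pattern succeed, so the longer
# alternative "foot" is never tried.
my ($first) = $target =~ /(foo|foot)/;

# Listing the longer alternative first changes what is captured.
my ($second) = $target =~ /(foot|foo)/;

print "first=$first second=$second\n";    # prints "first=foo second=foot"
```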
981
982
983 Also remember that ``|'' is interpreted as a literal within
984 square brackets, so if you write [[fee|fie|foe] you're
985 really only matching [[feio|].
986
987
988 Within a pattern, you may designate subpatterns for later
989 reference by enclosing them in parentheses, and you may
990 refer back to the ''n''th subpattern later in the pattern
991 using the metacharacter \''n''. Subpatterns are numbered
992 based on the left to right order of their opening
993 parenthesis. A backreference matches whatever actually
994 matched the subpattern in the string being examined, not the
995 rules for that subpattern. Therefore, (0|0x)\d*\s\1\d*
996 will match ``0x1234 0x4321'', but not ``0x1234 01234'',
997 because subpattern 1 matched ``0x'', even though the rule
998 0|0x could potentially match the leading 0 in the
999 second number.
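
The backreference example above can be run directly. In this sketch the variable names are assumptions; the pattern and both test strings are the ones from the text:

```perl
use strict;
use warnings;

# \1 must repeat the exact text captured by group 1, not merely satisfy
# the same sub-pattern again.
my $same_prefix = qr/(0|0x)\d*\s\1\d*/;

my $ok  = '0x1234 0x4321' =~ $same_prefix ? 1 : 0;
my $bad = '0x1234 01234'  =~ $same_prefix ? 1 : 0;

print "ok=$ok bad=$bad\n";    # prints "ok=1 bad=0"
```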
1000
1001
1002 __Warning on \1 vs $1__
1003
1004
1005 Some people get too used to writing things
1006 like:
1007
1008
$pattern =~ s/(\W)/\\\1/g;
2 perry 1010 This is grandfathered for the RHS of a substitute to avoid shocking the __sed__ addicts, but it's a dirty habit to get into. That's because in !PerlThink, the right-hand side of an s/// is a double-quoted string. \1 in the usual double-quoted string means a control-A. The customary Unix meaning of \1 is kludged in for s///. However, if you get into the habit of doing that, you get yourself into trouble if you then add an /e modifier.
1 perry 1011
1012
s/(\d+)/ \1 + 1 /eg;    # causes warning under -w
1014 Or if you try to do
1015
1016
s/(\d+)/\1000/;
1018 You can't disambiguate that by saying \{1}000, whereas you can fix it with ${1}000. The operation of interpolation should not be confused with the operation of matching a backreference. Certainly they mean two different things on the ''left'' side of the s///.
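
A minimal sketch of the ${1}000 form follows. The input string and variable names are made up for illustration:

```perl
use strict;
use warnings;

my $amount = 'cost: 42';

# ${1} delimits the capture-variable name, so the literal "000" that
# follows is not read as part of it.
(my $scaled = $amount) =~ s/(\d+)/${1}000/;

print "$scaled\n";    # prints "cost: 42000"
```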
1019
1020
1021 __Repeated patterns matching zero-length
1022 substring__
1023
1024
1025 __WARNING__ : Difficult material (and
1026 prose) ahead. This section needs a rewrite.
1027
1028
1029 Regular expressions provide a terse and powerful programming
1030 language. As with most other power tools, power comes
1031 together with the ability to wreak havoc.
1032
1033
1034 A common abuse of this power stems from the ability to make
1035 infinite loops using regular expressions, with something as
1036 innocuous as:
1037
1038
1039 'foo' =~ m{ ( o? )* }x;
1040 The o? can match at the beginning of 'foo', and since the position in the string is not moved by the match, o? would match again and again because of the * modifier. Another common way to create a similar cycle is with the looping modifier //g:
1041
1042
1043 @matches = ( 'foo' =~ m{ o? }xg );
1044 or
1045
1046
print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
1048 or the loop implied by ''split()''.
1049
1050
1051 However, long experience has shown that many programming
1052 tasks may be significantly simplified by using repeated
1053 subexpressions that may match zero-length substrings. Here's
1054 a simple example:
1055
1056
1057 @chars = split //, $string; # // is not magic in split
1058 ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
1059 Thus Perl allows such constructs, by ''forcefully breaking the infinite loop''. The rules for this are different for lower-level loops given by the greedy modifiers *+{}, and for higher-level ones like the /g modifier or ''split()'' operator.
1060
1061
1062 The lower-level loops are ''interrupted'' (that is, the
1063 loop is broken) when Perl detects that a repeated expression
1064 matched a zero-length substring. Thus
1065
1066
1067 m{ (?: NON_ZERO_LENGTH ZERO_LENGTH )* }x;
1068 is made equivalent to
1069
1070
1071 m{ (?: NON_ZERO_LENGTH )*
1072 (?: ZERO_LENGTH )?
1073 }x;
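
As a quick check that the lower-level loop really is broken rather than spinning forever, the innocuous-looking pattern from earlier in this section can simply be run. The variable names here are illustrative:

```perl
use strict;
use warnings;

# At position 0 the next character is "f", so o? can only match the
# empty string. Perl notices the zero-length iteration and stops looping.
my $matched = 'foo' =~ m{ ( o? )* }x ? 1 : 0;
my $length  = length $&;

print "matched=$matched length=$length\n";    # prints "matched=1 length=0"
```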
1074 The higher-level loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the match that follows a zero-length match is prohibited from having a length of zero. This prohibition interacts with backtracking (see ``Backtracking''), and so the ''second best'' match is chosen if the ''best'' match is of zero length.
1075
1076
1077 For example:
1078
1079
$_ = 'bar';
s/\w??/<$&>/g;
1082 results in <><b><><a><><r><>. At each position of the string the best match given by non-greedy ?? is the zero-length match, and the ''second best'' match is what is matched by \w. Thus zero-length matches alternate with one-character-long matches.
1083
1084
1085 Similarly, for repeated m/()/g the second-best
1086 match is the match at the position one notch further in the
1087 string.
1088
1089
1090 The additional state of being ''matched with
1091 zero-length'' is associated with the matched string, and
1092 is reset by each assignment to ''pos()''. Zero-length
1093 matches at the end of the previous match are ignored during
1094 split.
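
The behaviour of the higher-level loops can be observed with a short sketch. The variable names are assumptions; the patterns are the ones discussed in this section:

```perl
use strict;
use warnings;

# split with an empty pattern: one element per character.
my @chars = split //, 'bar';

# m/()/g: the empty pattern matches once at every position, including the
# end of the string, and then the loop stops.
my $string = 'bar';
my @positions;
push @positions, pos($string) while $string =~ /()/g;

# Zero-length and one-character matches alternate under /g.
(my $marked = 'bar') =~ s/\w??/<$&>/g;

print "@chars\n";        # prints "b a r"
print "@positions\n";    # prints "0 1 2 3"
print "$marked\n";       # prints "<><b><><a><><r><>"
```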
1095
1096
1097 __Combining pieces together__
1098
1099
1100 Each of the elementary pieces of regular expressions which
1101 were described before (such as ab or \Z)
1102 could match at most one substring at the given position of
1103 the input string. However, in a typical regular expression
1104 these elementary pieces are combined into more complicated
1105 patterns using combining operators ST, S|T,
1106 S* etc. (in these examples S and T
1107 are regular subexpressions).
1108
1109
1110 Such combinations can include alternatives, leading to a
1111 problem of choice: if we match a regular expression
1112 a|ab against ``abc'', will it match
1113 substring ``a'' or ``ab''?
1114 One way to describe which substring is actually matched is
1115 the concept of backtracking (see ``Backtracking''). However,
1116 this description is too low-level and makes you think in
1117 terms of a particular implementation.
1118
1119
1120 Another description starts with notions of
1121 ``better''/``worse''. All the substrings which may be
1122 matched by the given regular expression can be sorted from
1123 the ``best'' match to the ``worst'' match, and it is the
1124 ``best'' match which is chosen. This replaces the
1125 question of ``what is chosen?'' with the question of ``which
1126 matches are better, and which are worse?''.
1127
1128
1129 Again, for elementary pieces there is no such question,
1130 since at most one match at a given position is possible.
1131 This section describes the notion of better/worse for
1132 combining operators. In the description below S and
1133 T are regular subexpressions.
1134
1135
1136 ST
1137
1138
1139 Consider two possible matches, AB and
1140 A'B', where A and A' are substrings
1141 which can be matched by S, and B and
1142 B' are substrings which can be matched by
1143 T.
1144
1145
1146 If A is a better match for S than
1147 A', AB is a better match than
1148 A'B'.
1149
1150
1151 If A and A' coincide: AB is a
1152 better match than AB' if B is a better match
1153 for T than B'.
1154
1155
1156 S|T
1157
1158
1159 When S can match, it is a better match than when
1160 only T can match.
1161
1162
1163 The ordering of two matches for S is the same as for
1164 S alone. Similarly for two matches for
1165 T.
1166
1167
1168 S{REPEAT_COUNT}
1169
1170
1171 Matches as SSS...S (repeated as many times as
1172 necessary).
1173
1174
1175 S{min,max}
1176
1177
1178 Matches as
1179 S{max}|S{max-1}|...|S{min+1}|S{min}.
1180
1181
1182 S{min,max}?
1183
1184
1185 Matches as
1186 S{min}|S{min+1}|...|S{max-1}|S{max}.
1187
1188
1189 S?, S*, S+
1190
1191
1192 Same as S{0,1}, S{0,BIG_NUMBER},
1193 S{1,BIG_NUMBER} respectively.
1194
1195
1196 S??, S*?, S+?
1197
1198
1199 Same as S{0,1}?, S{0,BIG_NUMBER}?,
1200 S{1,BIG_NUMBER}? respectively.
1201
1202
1203 (?>S)
1204
1205
1206 Matches the best match for S and only
1207 that.
1208
1209
1210 (?=S), (?<=S)
1211
1212
1213 Only the best match for S is considered. (This is
1214 important only if S has capturing parentheses, and
1215 backreferences are used somewhere else in the whole regular
1216 expression.)
1217
1218
1219 (?!S), (?<!S)
1220
1221
1222 For this grouping operator there is no need to describe the
1223 ordering, since only whether or not S can match is
1224 important.
1225
1226
1227 (??{ EXPR })
1228
1229
1230 The ordering is the same as for the regular expression which
1231 is the result of EXPR .
1232
1233
1234 (?(condition)yes-pattern|no-pattern)
1235
1236
1237 Recall that which of yes-pattern or
1238 no-pattern actually matches is already determined.
1239 The ordering of the matches is the same as for the chosen
1240 subexpression.
1241
1242
1243 The above recipes describe the ordering of matches ''at a
1244 given position''. One more rule is needed to understand
1245 how a match is determined for the whole regular expression:
1246 a match at an earlier position is always better than a match
1247 at a later position.
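
A few of these ordering rules can be illustrated with a sketch. The sample strings and variable names are assumptions chosen to make each rule visible:

```perl
use strict;
use warnings;

# S|T: the first alternative that lets the match succeed wins.
my ($alt_a) = 'abc' =~ /(a|ab)/;       # captures "a"
my ($alt_b) = 'abc' =~ /(ab|a)/;       # captures "ab"

# S{min,max} prefers more repetitions; S{min,max}? prefers fewer.
my ($greedy) = 'aaa' =~ /(a{1,3})/;    # captures "aaa"
my ($lazy)   = 'aaa' =~ /(a{1,3}?)/;   # captures "a"

# A match at an earlier position beats a "better" match further along:
# at position 1 only the second alternative can match, and it is taken.
my ($early) = 'xab' =~ /(b|ab)/;       # captures "ab"

print join(' ', $alt_a, $alt_b, $greedy, $lazy, $early), "\n";
```

The last case shows that the ordering among alternatives applies only at a given position; the position rule is applied first.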
1248
1249
1250 __Creating custom RE
1251 engines__
1252
1253
1254 Overloaded constants (see overload) provide a simple way to
1255 extend the functionality of the RE
1256 engine.
1257
1258
1259 Suppose that we want to enable a new RE
1260 escape-sequence \Y| which matches at a boundary between
1261 whitespace characters and non-whitespace characters. Note
1262 that (?=\S)(?<!\S)|(?!\S)(?<=\S) matches exactly
1263 at these positions, so we want to have each \Y| in
1264 the place of the more complicated version. We can create a
1265 module customre to do this:
1266
1267
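package customre;
use overload;
sub import {
    shift;
    die "No argument to customre::import allowed" if @_;
    overload::constant 'qr' => \&convert;    # rewrite literal regex parts
}
sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'" }
my %rules = ( '\\' => '\\\\', 'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
sub convert {
    my $re = shift;
    $re =~ s{ \\ ( \\ | Y . ) }{ $rules{$1} or invalid($re,$1) }sgex;
    return $re;
}
1275 Now use customre enables the new escape in constant regular expressions, i.e., those without any runtime variable interpolations. As documented in overload, this conversion will work only over literal parts of regular expressions. For \Y|$re\Y| the variable part of this regular expression needs to be converted explicitly (but only if the special meaning of \Y| should be enabled inside $re):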
1276
1277
use customre;
chomp($re = <>);
$re = customre::convert $re;
/\Y|$re\Y|/;
1280 !!BUGS
1281
1282
1283 This document varies from difficult to understand to
1284 completely and utterly opaque. The wandering prose riddled
1285 with jargon is hard to fathom in several
1286 places.
1287
1288
1289 This document needs a rewrite that separates the tutorial
1290 content from the reference content.
1291 !!SEE ALSO
1292
1293
1294 ``Regexp Quote-Like Operators'' in perlop.
1295
1296
1297 ``Gory details of parsing quoted constructs'' in
1298 perlop.
1299
1300
1301 perlfaq6.
1302
1303
1304 ``pos'' in perlfunc.
1305
1306
1307 perllocale.
1308
1309
1310 perlebcdic.
1311
1312
1313 ''Mastering Regular Expressions'' by Jeffrey Friedl,
1314 published by O'Reilly and Associates.
1315 ----