Annotated edit history of libpcre(7) version 1, including all changes.
Rev Author # Line
1 perry 1 PCRE
2 !!!PCRE
3 NAME
4 REGULAR EXPRESSION DETAILS
5 BACKSLASH
6 CIRCUMFLEX AND DOLLAR
7 FULL STOP (PERIOD, DOT)
8 SQUARE BRACKETS
9 POSIX CHARACTER CLASSES
10 VERTICAL BAR
11 INTERNAL OPTION SETTING
12 SUBPATTERNS
13 REPETITION
14 BACK REFERENCES
15 ASSERTIONS
16 ONCE-ONLY SUBPATTERNS
17 CONDITIONAL SUBPATTERNS
18 COMMENTS
19 RECURSIVE PATTERNS
20 PERFORMANCE
21 UTF-8 SUPPORT
22 DIFFERENCES FROM PERL
23 AUTHOR
24 ----
25 !!NAME
26
27
28 pcre - Perl-compatible regular expressions: expression syntax.
29 !!REGULAR EXPRESSION DETAILS
30
31
32 The syntax and semantics of the regular expressions
33 supported by PCRE are described below. Regular expressions
34 are also described in the Perl documentation and in a number
35 of other books, some of which have copious examples. Jeffrey
36 Friedl's
37
38
39 The description here is intended as reference documentation.
40 The basic operation of PCRE is on strings of bytes. However,
41 there are the beginnings of some support for UTF-8 character
42 strings. To use this support you must configure PCRE to
43 include it, and then call __pcre_compile()__ with the
44 PCRE_UTF8 option. How this affects the pattern matching is
45 described in the final section of this
46 document.
47
48
49 A regular expression is a pattern that is matched against a
50 subject string from left to right. Most characters stand for
51 themselves in a pattern, and match the corresponding
52 characters in the subject. As a trivial example, the
53 pattern
54
55
56 The quick brown fox
57
58
59 matches a portion of a subject string that is identical to
60 itself. The power of regular expressions comes from the
61 ability to include alternatives and repetitions in the
62 pattern. These are encoded in the pattern by the use of
63 ''meta-characters'', which do not stand for themselves
64 but instead are interpreted in some special
65 way.
66
67
68 There are two different sets of meta-characters: those that
69 are recognized anywhere in the pattern except within square
70 brackets, and those that are recognized in square brackets.
71 Outside square brackets, the meta-characters are as
72 follows:
73
74
75 \ general escape character with several uses; ^ assert start
76 of subject (or line, in multiline mode); $ assert end of
77 subject (or line, in multiline mode); . match any character
78 except newline (by default); [[ start character class
79 definition; | start of alternative branch; ( start subpattern;
80 ) end subpattern; ? extends the meaning of (, also 0 or 1
81 quantifier, also quantifier minimizer; * 0 or more quantifier;
82 + 1 or more quantifier; { start min/max
83 quantifier
84
85
86 Part of a pattern that is in square brackets is called a
87
88
89 \ general escape character; ^ negate the class, but only if
90 the first character; - indicates character range; ] terminates
91 the character class
92
93
94 The following sections describe the use of each of the
95 meta-characters.
96 !!BACKSLASH
97
98
99 The backslash character has several uses. Firstly, if it is
100 followed by a non-alphameric character, it takes away any
101 special meaning that character may have. This use of
102 backslash as an escape character applies both inside and
103 outside character classes.
104
105
106 For example, if you want to match a
107
108
109 If a pattern is compiled with the PCRE_EXTENDED option,
110 whitespace in the pattern (other than in a character class)
111 and characters between a
112
113
114 A second use of backslash provides a way of encoding
115 non-printing characters in patterns in a visible manner.
116 There is no restriction on the appearance of non-printing
117 characters, apart from the binary zero that terminates a
118 pattern, but when a pattern is being prepared by text
119 editing, it is usually easier to use one of the following
120 escape sequences than the binary character it
121 represents:
122
123
124 \a alarm, that is, the BEL character (hex 07) \cx
125
126
127 The precise effect of
128
129
130 After
131
132
133 After
134
135
136 The handling of a backslash followed by a digit other than 0
137 is complicated. Outside a character class, PCRE reads it and
138 any following digits as a decimal number. If the number is
139 less than 10, or if there have been at least that many
140 previous capturing left parentheses in the expression, the
141 entire sequence is taken as a ''back reference''. A
142 description of how this works is given later, following the
143 discussion of parenthesized subpatterns.
144
145
146 Inside a character class, or if the decimal number is
147 greater than 9 and there have not been that many capturing
148 subpatterns, PCRE re-reads up to three octal digits
149 following the backslash, and generates a single byte from
150 the least significant 8 bits of the value. Any subsequent
151 digits stand for themselves. For example:
152
153
154 \040 is another way of writing a space; \40 is the same,
155 provided there are fewer than 40 previous capturing
156 subpatterns; \7 is always a back reference; \11 might be a back
157 reference, or another way of writing a tab; \011 is always a
158 tab; \0113 is a tab followed by the character
159
160
161 Note that octal values of 100 or greater must not be
162 introduced by a leading zero, because no more than three
163 octal digits are ever read.
164
165
166 All the sequences that define a single byte value can be
167 used both inside and outside character classes. In addition,
168 inside a character class, the sequence
169
170
171 The third use of backslash is for specifying generic
172 character types:
173
174
175 \d any decimal digit; \D any character that is not a decimal
176 digit; \s any whitespace character; \S any character that is not
177 a whitespace character; \w any
178
179
180 Each pair of escape sequences partitions the complete set of
181 characters into two disjoint sets. Any given character
182 matches one, and only one, of each pair.
183
184
185 A
186
187
188 These character type sequences can appear both inside and
189 outside character classes. They each match one character of
190 the appropriate type. If the current matching point is at
191 the end of the subject string, all of them fail, since there
192 is no character to match.
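
As a rough illustration of how each pair partitions the character set, the sketch below uses the Python re module as a stand-in for PCRE. This is an assumption: Python is a separate engine, although it treats these three pairs the same way when restricted to ASCII. The helper names classify and matches_at_end are hypothetical.

```python
import re

# Each pair (\d, \D), (\s, \S), (\w, \W) splits the characters into two disjoint sets.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def classify(ch):
    """For each pair, list which member of the pair matches the single character ch."""
    result = []
    for lower, upper in PAIRS:
        hits = [p for p in (lower, upper) if re.fullmatch(p, ch, re.ASCII)]
        result.append(hits)
    return result

def matches_at_end(pattern, subject):
    """Try the pattern at the end of the subject, where no character is left to match."""
    return re.compile(pattern, re.ASCII).match(subject, len(subject)) is not None
```

Calling classify on any single ASCII character should yield exactly one hit per pair, and matches_at_end should be false for all six sequences.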
193
194
195 The fourth use of backslash is for certain simple
196 assertions. An assertion specifies a condition that has to
197 be met at a particular point in a match, without consuming
198 any characters from the subject string. The use of
199 subpatterns for more complicated assertions is described
200 below. The backslashed assertions are
201
202
203 \b word boundary; \B not a word boundary; \A start of subject
204 (independent of multiline mode); \Z end of subject or newline
205 at end (independent of multiline mode); \z end of subject
206 (independent of multiline mode)
207
208
209 These assertions may not appear in character classes (but
210 note that
211
212
213 A word boundary is a position in the subject string where
214 the current character and the previous character do not both
215 match \w or \W (i.e. one matches \w and the other matches \W),
216 or the start or end of the string if the first or last
217 character matches \w, respectively.
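
The word boundary rule can be checked with a small sketch. It assumes the Python re module, whose \b assertion follows the same definition for ASCII word characters; it is not PCRE itself. The function name word_boundaries is hypothetical.

```python
import re

def word_boundaries(subject):
    """Return every position in subject where the word-boundary assertion holds."""
    # \b consumes nothing, so each match is empty and only its position matters.
    return [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
```

For the subject "ab cd" this reports the start and end of each of the two words; a subject with no word characters has no boundaries at all.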
218
219
220 The \A, \Z, and \z assertions differ from the traditional
221 circumflex and dollar (described below) in that they only
222 ever match at the very start and end of the subject string,
223 whatever options are set. They are not affected by the
224 PCRE_NOTBOL or PCRE_NOTEOL options. If the
225 ''startoffset'' argument of __pcre_exec()__ is
226 non-zero, \A can never match. The difference between \Z and \z
227 is that \Z matches before a newline that is the last
228 character of the string as well as at the end of the string,
229 whereas \z matches only at the end.
230 !!CIRCUMFLEX AND DOLLAR
231
232
233 Outside a character class, in the default matching mode, the
234 circumflex character is an assertion which is true only if
235 the current matching point is at the start of the subject
236 string. If the ''startoffset'' argument of
237 __pcre_exec()__ is non-zero, circumflex can never match.
238 Inside a character class, circumflex has an entirely
239 different meaning (see below).
240
241
242 Circumflex need not be the first character of the pattern if
243 a number of alternatives are involved, but it should be the
244 first thing in each alternative in which it appears if the
245 pattern is ever to match that branch. If all possible
246 alternatives start with a circumflex, that is, if the
247 pattern is constrained to match only at the start of the
248 subject, it is said to be an
249
250
251 A dollar character is an assertion which is true only if the
252 current matching point is at the end of the subject string,
253 or immediately before a newline character that is the last
254 character in the string (by default). Dollar need not be the
255 last character of the pattern if a number of alternatives
256 are involved, but it should be the last item in any branch
257 in which it appears. Dollar has no special meaning in a
258 character class.
259
260
261 The meaning of dollar can be changed so that it matches only
262 at the very end of the string, by setting the
263 PCRE_DOLLAR_ENDONLY option at compile or matching time. This
264 does not affect the \Z assertion.
265
266
267 The meanings of the circumflex and dollar characters are
268 changed if the PCRE_MULTILINE option is set. When this is
269 the case, they match immediately after and immediately
270 before an internal
271 ''startoffset''
272 argument of __pcre_exec()__ is non-zero. The
273 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
274 set.
275
276
277 Note that the sequences \A, \Z, and \z can be used to match the
278 start and end of the subject in both modes, and if all
279 branches of a pattern start with \A it is always anchored,
280 whether PCRE_MULTILINE is set or not.
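
The difference between the default and multiline meanings of circumflex and dollar can be sketched as follows. The sketch assumes the Python re module, with re.MULTILINE playing the role of PCRE_MULTILINE; the subject string and the helper name found are made up for illustration. Only \A is used from the backslashed assertions, because Python gives \Z a different meaning from the one described here.

```python
import re

SUBJECT = "one\ntwo\n"  # two lines, the last one ending with a newline

def found(pattern, flags=0):
    """Return True if pattern matches anywhere in SUBJECT."""
    return re.search(pattern, SUBJECT, flags) is not None

# Default mode: ^ only at the very start, $ at the end or before a final newline.
default_mode = {p: found(p) for p in (r"^one", r"^two", r"one$", r"two$")}

# Multiline mode: ^ and $ also match on either side of the internal newline,
# while \A still matches only at the very start of the subject.
multiline_mode = {p: found(p, re.MULTILINE) for p in (r"^two", r"one$", r"\Atwo")}
```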
281 !!FULL STOP (PERIOD, DOT)
282
283
284 Outside a character class, a dot in the pattern matches any
285 one character in the subject, including a non-printing
286 character, but not (by default) newline. If the PCRE_DOTALL
287 option is set, dots match newlines as well. The handling of
288 dot is entirely independent of the handling of circumflex
289 and dollar, the only relationship being that they both
290 involve newline characters. Dot has no special meaning in a
291 character class.
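
A minimal sketch of the dot behaviour, assuming the Python re module, where re.DOTALL corresponds to PCRE_DOTALL. The names dot_matches and LITERAL_DOT are hypothetical.

```python
import re

def dot_matches(subject, dotall=False):
    """Check whether a, any one character, c matches at the start of subject."""
    flags = re.DOTALL if dotall else 0
    return re.match(r"a.c", subject, flags) is not None

# Inside a character class the dot has no special meaning: it is a literal dot.
LITERAL_DOT = re.compile(r"[.]")
```

A non-printing character between a and c is accepted in both modes, whereas a newline is accepted only when dotall is true.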
292 !!SQUARE BRACKETS
293
294
295 An opening square bracket introduces a character class,
296 terminated by a closing square bracket. A closing square
297 bracket on its own is not special. If a closing square
298 bracket is required as a member of the class, it should be
299 the first data character in the class (after an initial
300 circumflex, if present) or escaped with a
301 backslash.
302
303
304 A character class matches a single character in the subject;
305 the character must be in the set of characters defined by
306 the class, unless the first character in the class is a
307 circumflex, in which case the subject character must not be
308 in the set defined by the class. If a circumflex is actually
309 required as a member of the class, ensure it is not the
310 first character, or escape it with a backslash.
311
312
313 For example, the character class [[aeiou] matches any lower
314 case vowel, while [[^aeiou] matches any character that is not
315 a lower case vowel. Note that a circumflex is just a
316 convenient notation for specifying the characters which are
317 in the class by enumerating those that are not. It is not an
318 assertion: it still consumes a character from the subject
319 string, and fails if the current pointer is at the end of
320 the string.
321
322
323 When caseless matching is set, any letters in a class
324 represent both their upper case and lower case versions, so
325 for example, a caseless [[aeiou] matches
326
327
328 The newline character is never treated in any special way in
329 character classes, whatever the setting of the PCRE_DOTALL
330 or PCRE_MULTILINE options is. A class such as [[^a] will
331 always match a newline.
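
The points about negated classes can be sketched like this, assuming the Python re module behaves as PCRE does for simple classes (it is a different engine, so this is only an illustration). The names are hypothetical.

```python
import re

VOWEL = re.compile(r"[aeiou]")       # any lower case vowel
NOT_VOWEL = re.compile(r"[^aeiou]")  # any one character that is not a lower case vowel

def newline_in_negated_class(flags=0):
    """A negated class such as [^a] accepts a newline whatever the flags are."""
    return re.fullmatch(r"[^a]", "\n", flags) is not None
```

Because a negated class still consumes a character, NOT_VOWEL cannot match an empty subject, which is what distinguishes it from an assertion.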
332
333
334 The minus (hyphen) character can be used to specify a range
335 of characters in a character class. For example, [[d-m]
336 matches any letter between d and m, inclusive. If a minus
337 character is required in a class, it must be escaped with a
338 backslash or appear in a position where it cannot be
339 interpreted as indicating a range, typically as the first or
340 last character in the class.
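
The range rule and the two ways of including a literal hyphen can be sketched as below, assuming the Python re module, which treats these cases the same way. The helper name members and the candidate characters are chosen for illustration only.

```python
import re

RANGE = re.compile(r"[d-m]")            # any letter from d to m inclusive
HYPHEN_LAST = re.compile(r"[dm-]")      # hyphen in last position: d, m or a literal -
HYPHEN_ESCAPED = re.compile(r"[d\-m]")  # escaped hyphen: the same three characters

def members(pattern, candidates="cdgmn-"):
    """Return the candidate characters that the one-character class accepts."""
    return "".join(ch for ch in candidates if pattern.fullmatch(ch))
```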
341
342
343 It is not possible to have the literal character
344
345
346 Ranges operate in ASCII collating sequence. They can also be
347 used for characters specified numerically, for example
348 [[\000-\037]. If a range that includes letters is used when
349 caseless matching is set, it matches the letters in either
350 case. For example, [[W-c] is equivalent to [[][[\^_`wxyzabc],
351 matched caselessly, and if character tables for the
352
353
354 The character types \d, \D, \s, \S, \w, and \W may also appear in
355 a character class, and add the characters that they match to
356 the class. For example, [[\dABCDEF] matches any hexadecimal
357 digit. A circumflex can conveniently be used with the upper
358 case character types to specify a more restricted set of
359 characters than the matching lower case type. For example,
360 the class [[^\W_] matches any letter or digit, but not
361 underscore.
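
Both classes from this paragraph can be tried out in a short sketch. It assumes the Python re module with the re.ASCII flag, so that the character types cover only ASCII as in a byte-oriented PCRE; the constant names are hypothetical.

```python
import re

# A digit, or one of the upper case letters A to F.
HEX_DIGIT = re.compile(r"[\dABCDEF]", re.ASCII)

# Everything that is neither a non-word character nor an underscore,
# which leaves exactly the letters and digits.
LETTER_OR_DIGIT = re.compile(r"[^\W_]", re.ASCII)
```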
362
363
364 All non-alphameric characters other than \, -, ^ (at the
365 start) and the terminating ] are non-special in character
366 classes, but it does no harm if they are
367 escaped.
368 !!POSIX CHARACTER CLASSES
369
370
371 Perl 5.6 (not yet released at the time of writing) is going
372 to support the POSIX notation for character classes, which
373 uses names enclosed by [[: and :] within the enclosing square
374 brackets. PCRE supports this notation. For
375 example,
376
377
378 [[01[[:alpha:]%]
379
380
381 matches
382
383
384 alnum letters and digits; alpha letters; ascii character codes
385 0 - 127; cntrl control characters; digit decimal digits (same
386 as \d); graph printing characters, excluding space; lower lower
387 case letters; print printing characters, including space;
388 punct printing characters, excluding letters and digits;
389 space white space (same as \s); upper upper case letters; word
390
391
392 The names
393
394
395 [[12[[:^digit:]]
396
397
398 matches
399 !!VERTICAL BAR
400
401
402 Vertical bar characters are used to separate alternative
403 patterns. For example, the pattern
404
405
406 gilbert|sullivan
407
408
409 matches either
410 !!INTERNAL OPTION SETTING
411
412
413 The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
414 and PCRE_EXTENDED can be changed from within the pattern by
415 a sequence of Perl option letters enclosed between
416
417
418 i for PCRE_CASELESS; m for PCRE_MULTILINE; s for PCRE_DOTALL; x
419 for PCRE_EXTENDED
420
421
422 For example, (?im) sets caseless, multiline matching. It is
423 also possible to unset these options by preceding the letter
424 with a hyphen, and a combined setting and unsetting such as
425 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
426 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
427 If a letter appears both before and after the hyphen, the
428 option is unset.
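
An inline option setting can be compared with the equivalent compile-time options in a short sketch. It assumes the Python re module, where re.IGNORECASE and re.MULTILINE stand in for PCRE_CASELESS and PCRE_MULTILINE. Python is stricter than PCRE about where such a setting may appear, so the sketch places it at the very start of the pattern. The subject string is made up for illustration.

```python
import re

SUBJECT = "xyz\nABC\nabc"

# Option letters i and m set inside the pattern itself.
inline = re.compile(r"(?im)^abc$")

# The same two options passed as flags when compiling.
flagged = re.compile(r"^abc$", re.IGNORECASE | re.MULTILINE)

# With neither option, the pattern has to match the whole subject.
plain = re.compile(r"^abc$")
```

Both inline.findall(SUBJECT) and flagged.findall(SUBJECT) should return the same two lines, while plain finds nothing.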
429
430
431 The scope of these option changes depends on where in the
432 pattern the setting occurs. For settings that are outside
433 any subpattern (defined below), the effect is the same as if
434 the options were set or unset at the start of matching. The
435 following patterns all behave in exactly the same
436 way:
437
438
439 (?i)abc a(?i)bc ab(?i)c abc(?i)
440
441
442 which in turn is the same as compiling the pattern abc with
443 PCRE_CASELESS set. In other words, such
444
445
446 If an option change occurs inside a subpattern, the effect
447 is different. This is a change of behaviour in Perl 5.005.
448 An option change inside a subpattern affects only that part
449 of the subpattern that follows it, so
450
451
452 (a(?i)b)c
453
454
455 matches abc and aBc and no other strings (assuming
456 PCRE_CASELESS is not used). By this means, options can be
457 made to have different settings in different parts of the
458 pattern. Any changes made in one alternative do carry on
459 into subsequent branches within the same subpattern. For
460 example,
461
462
463 (a(?i)b|c)
464
465
466 matches
467
468
469 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
470 be changed in the same way as the Perl-compatible options by
471 using the characters U and X respectively. The (?X) flag
472 setting is special in that it must always occur earlier in
473 the pattern than any of the additional features it turns on,
474 even when it is at top level. It is best put at the
475 start.
476 !!SUBPATTERNS
477
478
479 Subpatterns are delimited by parentheses (round brackets),
480 which can be nested. Marking part of a pattern as a
481 subpattern does two things:
482
483
484 1. It localizes a set of alternatives. For example, the
485 pattern
486
487
488 cat(aract|erpillar|)
489
490
491 matches one of the words
492
493
494 2. It sets up the subpattern as a capturing subpattern (as
495 defined above). When the whole pattern matches, that portion
496 of the subject string that matched the subpattern is passed
497 back to the caller via the ''ovector'' argument of
498 __pcre_exec()__. Opening parentheses are counted from
499 left to right (starting from 1) to obtain the numbers of the
500 capturing subpatterns.
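
The left-to-right numbering of capturing subpatterns can be sketched as follows. The sketch assumes the Python re module, where the group method of a match object plays the part of the ''ovector'' argument; the subject string is chosen for illustration.

```python
import re

# Opening parentheses, counted from the left:
#   1: ((red|white) (king|queen))   2: (red|white)   3: (king|queen)
match = re.search(r"the ((red|white) (king|queen))", "the red king")

# Collect what each numbered subpattern captured.
captured = {number: match.group(number) for number in (1, 2, 3)}
```

Group 1 is the outer pair because its opening parenthesis comes first, even though it closes last.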
501
502
503 For example, if the string
504
505
506 the ((red|white) (king|queen))
507
508
509 the captured substrings are
510
511
512 The fact that plain parentheses fulfil two functions is not
513 always helpful. There are often times when a grouping
514 subpattern is required without a capturing requirement. If
515 an opening parenthesis is followed by
516
517
518 the ((?:red|white) (king|queen))
519
520
521 the captured substrings are
522
523
524 As a convenient shorthand, if any option settings are
525 required at the start of a non-capturing subpattern, the
526 option letters may appear between the
527
528
529 (?i:saturday|sunday) (?:(?i)saturday|sunday)
530
531
532 match exactly the same set of strings. Because alternative
533 branches are tried from left to right, and options are not
534 reset until the end of the subpattern is reached, an option
535 setting in one branch does affect subsequent branches, so
536 the above patterns match
537 !!REPETITION
538
539
540 Repetition is specified by quantifiers, which can follow any
541 of the following items:
542
543
544 a single character, possibly escaped; the . metacharacter; a
545 character class; a back reference (see next section); a
546 parenthesized subpattern (unless it is an assertion - see
547 below)
548
549
550 The general repetition quantifier specifies a minimum and
551 maximum number of permitted matches, by giving the two
552 numbers in curly brackets (braces), separated by a comma.
553 The numbers must be less than 65536, and the first must be
554 less than or equal to the second. For example:
555
556
557 z{2,4}
558
559
560 matches
561
562
563 [[aeiou]{3,}
564
565
566 matches at least 3 successive vowels, but may match many
567 more, while
568
569
570 \d{8}
571
572
573 matches exactly 8 digits. An opening curly bracket that
574 appears in a position where a quantifier is not allowed, or
575 one that does not match the syntax of a quantifier, is taken
576 as a literal character. For example, {,6} is not a
577 quantifier, but a literal string of four
578 characters.
579
580
581 The quantifier {0} is permitted, causing the expression to
582 behave as if the previous item and the quantifier were not
583 present.
584
585
586 For convenience (and historical compatibility) the three
587 most common quantifiers have single-character
588 abbreviations:
589
590
591 * is equivalent to {0,}; + is equivalent to {1,}; ? is
592 equivalent to {0,1}
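
The equivalence between the single-character abbreviations and the brace forms can be checked with a small sketch. It assumes the Python re module, which accepts the same quantifier forms used here; the sample subjects and the helper name same_results are made up.

```python
import re

# Each abbreviated pattern paired with its brace-quantifier equivalent.
EQUIVALENTS = [
    (r"ab*c", r"ab{0,}c"),
    (r"ab+c", r"ab{1,}c"),
    (r"ab?c", r"ab{0,1}c"),
]
SUBJECTS = ["ac", "abc", "abbc", "abbbc", "xyz"]

def same_results(short, braced):
    """True if both patterns accept and reject exactly the same sample subjects."""
    return all(
        (re.fullmatch(short, s) is None) == (re.fullmatch(braced, s) is None)
        for s in SUBJECTS
    )
```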
593
594
595 It is possible to construct infinite loops by following a
596 subpattern that can match no characters with a quantifier
597 that has no upper limit, for example:
598
599
600 (a?)*
601
602
603 Earlier versions of Perl and PCRE used to give an error at
604 compile time for such patterns. However, because there are
605 cases where this can be useful, such patterns are now
606 accepted, but if any repetition of the subpattern does in
607 fact match no characters, the loop is forcibly
608 broken.
609
610
611 By default, the quantifiers are
612
613
614 /\*.*\*/
615
616
617 to the string
618
619
620 /* first command */ not comment /* second comment
621 */
622
623
624 fails, because it matches the entire string due to the
625 greediness of the .* item.
626
627
628 However, if a quantifier is followed by a question mark, it
629 ceases to be greedy, and instead matches the minimum number
630 of times possible, so the pattern
631
632
633 /\*.*?\*/
634
635
636 does the right thing with the C comments. The meaning of the
637 various quantifiers is not otherwise changed, just the
638 preferred number of matches. Do not confuse this use of
639 question mark with its use as a quantifier in its own right.
640 Because it has two uses, it can sometimes appear doubled, as
641 in
642
643
644 \d??\d
645
646
647 which matches one digit by preference, but can match two if
648 that is the only way the rest of the pattern
649 matches.
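
The effect of the question mark on greediness can be sketched with the C comment example. The sketch assumes the Python re module, which follows the same greedy and minimizing rules for these quantifiers.

```python
import re

SUBJECT = "/* first command */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", SUBJECT).group()

# Minimizing: .*? stops at the first */ that lets the match succeed.
lazy = re.search(r"/\*.*?\*/", SUBJECT).group()

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```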
650
651
652 If the PCRE_UNGREEDY option is set (an option which is not
653 available in Perl), the quantifiers are not greedy by
654 default, but individual ones can be made greedy by following
655 them with a question mark. In other words, it inverts the
656 default behaviour.
657
658
659 When a parenthesized subpattern is quantified with a minimum
660 repeat count that is greater than 1 or with a limited
661 maximum, more store is required for the compiled pattern, in
662 proportion to the size of the minimum or
663 maximum.
664
665
666 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
667 option (equivalent to Perl's /s) is set, thus allowing the .
668 to match newlines, the pattern is implicitly anchored,
669 because whatever follows will be tried against every
670 character position in the subject string, so there is no
671 point in retrying the overall match at any position after
672 the first. PCRE treats such a pattern as though it were
673 preceded by \A. In cases where it is known that the subject
674 string contains no newlines, it is worth setting PCRE_DOTALL
675 when the pattern begins with .* in order to obtain this
676 optimization, or alternatively using ^ to indicate anchoring
677 explicitly.
678
679
680 When a capturing subpattern is repeated, the value captured
681 is the substring that matched the final iteration. For
682 example, after
683
684
685 (tweedle[[dume]{3}\s*)+
686
687
688 has matched
689
690
691 /(a|(b))+/
692
693
694 matches
695 !!BACK REFERENCES
696
697
698 Outside a character class, a backslash followed by a digit
699 greater than 0 (and possibly further digits) is a back
700 reference to a capturing subpattern earlier (i.e. to its
701 left) in the pattern, provided there have been that many
702 previous capturing left parentheses.
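
A back reference can be sketched as follows, assuming the Python re module, which uses the same backslash-digit notation. The point illustrated is that the reference repeats the text that was actually captured, not the subpattern.

```python
import re

# \1 must repeat whatever group 1 actually matched: sens or respons.
PATTERN = re.compile(r"(sens|respons)e and \1ibility")

def accepts(subject):
    """True if the whole subject matches the pattern."""
    return PATTERN.fullmatch(subject) is not None
```

Mixing the two alternatives fails, because the text after "and" has to repeat what the group captured.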
703
704
705 However, if the decimal number following the backslash is
706 less than 10, it is always taken as a back reference, and
707 causes an error only if there are not that many capturing
708 left parentheses in the entire pattern. In other words, the
709 parentheses that are referenced need not be to the left of
710 the reference for numbers less than 10. See the section
711 entitled
712
713
714 A back reference matches whatever actually matched the
715 capturing subpattern in the current subject string, rather
716 than anything matching the subpattern itself. So the
717 pattern
718
719
720 (sens|respons)e and \1ibility
721
722
723 matches
724
725
726 ((?i)rah)\s+\1
727
728
729 matches
730
731
732 There may be more than one back reference to the same
733 subpattern. If a subpattern has not actually been used in a
734 particular match, any back references to it always fail. For
735 example, the pattern
736
737
738 (a|(bc))\2
739
740
741 always fails if it starts to match
742
743
744 A back reference that occurs inside the parentheses to which
745 it refers fails when the subpattern is first used, so, for
746 example, (a\1) never matches. However, such references can be
747 useful inside repeated subpatterns. For example, the
748 pattern
749
750
751 (a|b\1)+
752
753
754 matches any number of
755 !!ASSERTIONS
756
757
758 An assertion is a test on the characters following or
759 preceding the current matching point that does not actually
760 consume any characters. The simple assertions coded as \b, \B,
761 \A, \Z, \z, ^ and $ are described above. More complicated
762 assertions are coded as subpatterns. There are two kinds:
763 those that look ahead of the current position in the subject
764 string, and those that look behind it.
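
A minimal sketch of the lookahead kind of assertion, assuming the Python re module, which writes positive and negative lookahead as (?= and (?! in the same way. The subjects are made up for illustration.

```python
import re

# Positive lookahead: a word that is followed by a semicolon,
# without the semicolon becoming part of the match.
WORD_BEFORE_SEMICOLON = re.compile(r"\w+(?=;)")

# Negative lookahead: foo, as long as bar does not come next.
FOO_NOT_BAR = re.compile(r"foo(?!bar)")

statement = WORD_BEFORE_SEMICOLON.search("x = value;")
```

The match ends just before the semicolon, which shows that the assertion tested the character without consuming it.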
765
766
767 An assertion subpattern is matched in the normal way, except
768 that it does not cause the current matching position to be
769 changed. Lookahead assertions start with (?= for positive
770 assertions and (?! for negative assertions. For
771 example,
772
773
774 \w+(?=;)
775
776
777 matches a word followed by a semicolon, but does not include
778 the semicolon in the match, and
779
780
781 foo(?!bar)
782
783
784 matches any occurrence of
785
786
787 (?!foo)bar
788
789
790 does not find an occurrence of
791
792
793 Lookbehind assertions start with (?
794
795
796 (?
797
798
799 does find an occurrence of
800
801
802 (?
803
804
805 is permitted, but
806
807
808 (?
809
810
811 causes an error at compile time. Branches that match
812 different length strings are permitted only at the top level
813 of a lookbehind assertion. This is an extension compared
814 with Perl 5.005, which requires all branches to match the
815 same length of string. An assertion such as
816
817
818 (?
819
820
821 is not permitted, because its single top-level branch can
822 match two different lengths, but it is acceptable if
823 rewritten to use two top-level branches:
824
825
826 (?
827
828
829 The implementation of lookbehind assertions is, for each
830 alternative, to temporarily move the current position back
831 by the fixed width and then try to match. If there are
832 insufficient characters before the current position, the
833 match is deemed to fail. Lookbehinds in conjunction with
834 once-only subpatterns can be particularly useful for
835 matching at the ends of strings; an example is given at the
836 end of the section on once-only subpatterns.
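
The step back by a fixed width can be sketched with two lookbehind assertions. The sketch assumes the Python re module, which is stricter than the rules given here: it requires the whole lookbehind to have one fixed width. The patterns and subjects are illustrative only.

```python
import re

# Negative lookbehind: bar, unless the three characters before it are foo.
NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")

# Positive lookbehind with a fixed width of three characters.
AFTER_3_DIGITS = re.compile(r"(?<=\d{3})px")
```

When fewer than three characters precede the current position, the inner match cannot succeed, so the positive form fails and the negative form passes.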
837
838
839 Several assertions (of any sort) may occur in succession.
840 For example,
841
842
843 (?
844
845
846 matches
847 not'' match
848 ''
849
850
851 (?
852
853
854 This time the first assertion looks at the preceding six
855 characters, checking that the first three are digits, and
856 then the second assertion checks that the preceding three
857 characters are not
858
859
860 Assertions can be nested in any combination. For
861 example,
862
863
864 (?
865
866
867 matches an occurrence of
868
869
870 (?
871
872
873 is another pattern which matches
874
875
876 Assertion subpatterns are not capturing subpatterns, and may
877 not be repeated, because it makes no sense to assert the
878 same thing several times. If any kind of assertion contains
879 capturing subpatterns within it, these are counted for the
880 purposes of numbering the capturing subpatterns in the whole
881 pattern. However, substring capturing is carried out only
882 for positive assertions, because it does not make sense for
883 negative assertions.
884
885
886 Assertions count towards the maximum of 200 parenthesized
887 subpatterns.
888 !!ONCE-ONLY SUBPATTERNS
889
890
891 With both maximizing and minimizing repetition, failure of
892 what follows normally causes the repeated item to be
893 re-evaluated to see if a different number of repeats allows
894 the rest of the pattern to match. Sometimes it is useful to
895 prevent this, either to change the nature of the match, or
896 to cause it to fail earlier than it otherwise might, when the
897 author of the pattern knows there is no point in carrying
898 on.
899
900
901 Consider, for example, the pattern \d+foo when applied to the
902 subject line
903
904
905 123456bar
906
907
908 After matching all 6 digits and then failing to match
909
910
911 (?
912
913
914 This kind of parenthesis
915
916
917 An alternative description is that a subpattern of this type
918 matches the string of characters that an identical
919 standalone pattern would match, if anchored at the current
920 point in the subject string.
921
922
923 Once-only subpatterns are not capturing subpatterns. Simple
924 cases such as the above example can be thought of as a
925 maximizing repeat that must swallow everything it can. So,
926 while both \d+ and \d+? are prepared to adjust the number of
927 digits they match in order to make the rest of the pattern
928 match, (?
929
930
931 This construction can of course contain arbitrarily
932 complicated subpatterns, and it can be nested.
933
934
935 Once-only subpatterns can be used in conjunction with
936 lookbehind assertions to specify efficient matching at the
937 end of the subject string. Consider a simple pattern such
938 as
939
940
941 abcd$
942
943
944 when applied to a long string which does not match. Because
945 matching proceeds from left to right, PCRE will look for
946 each
947
948
949 ^.*abcd$
950
951
952 the initial .* matches the entire string at first, but when
953 this fails (because there is no following
954
955
956 ^(?
957
958
959 there can be no backtracking for the .* item; it can match
960 only the entire string. The subsequent lookbehind assertion
961 does a single test on the last four characters. If it fails,
962 the match fails immediately. For long strings, this approach
963 makes a significant difference to the processing
964 time.
965
966
967 When a pattern contains an unlimited repeat inside a
968 subpattern that can itself be repeated an unlimited number
969 of times, the use of a once-only subpattern is the only way
970 to avoid some failing matches taking a very long time
971 indeed. The pattern
972
973
974 (D+|
975
976
977 matches an unlimited number of substrings that either
978 consist of non-digits, or digits enclosed in
979
980
981 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
982
983
984 it takes a long time before reporting failure. This is
985 because the string can be divided between the two repeats in
986 a large number of ways, and all have to be tried. (The
987 example used [[!?] rather than a single character at the end,
988 because both PCRE and Perl have an optimization that allows
989 for fast failure when a single character is used. They
990 remember the last single character that is required for a
991 match, and fail early if it is not present in the string.)
992 If the pattern is changed to
993
994
995 ((?
996
997
998 sequences of non-digits cannot be broken, and failure
999 happens quickly.
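
The quick failure can be sketched in Python, with two loud assumptions. First, the pattern below is an assumed stand-in built from the description (runs of non-digits, or digits in angle brackets, followed by ! or ?), not a copy of the pattern on this page. Second, the once-only group is emulated with a lookahead followed by a back reference, an idiom that works because Python does not backtrack into a lookahead once it has succeeded; this is a swapped-in technique, not the PCRE construct itself.

```python
import re

# (?=(\D+))\1 grabs the longest run of non-digits inside a lookahead and then
# consumes exactly that text, so the run cannot be given back piece by piece.
ONCE_ONLY_STYLE = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

def search_long_run(length=52):
    """Search a subject made only of the letter a, which can never match."""
    return ONCE_ONLY_STYLE.search("a" * length)
```

With the run of non-digits held together, each starting position is rejected after a single pass over the rest of the subject.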
1000 !!CONDITIONAL SUBPATTERNS
1001
1002
1003 It is possible to cause the matching process to obey a
1004 subpattern conditionally or to choose between two
1005 alternative subpatterns, depending on the result of an
1006 assertion, or whether a previous capturing subpattern
1007 matched or not. The two possible forms of conditional
1008 subpattern are
1009
1010
1011 (?(condition)yes-pattern)
1012 (?(condition)yes-pattern|no-pattern)
1013
1014
1015 If the condition is satisfied, the yes-pattern is used;
1016 otherwise the no-pattern (if present) is used. If there are
1017 more than two alternatives in the subpattern, a compile-time
1018 error occurs.
1019
1020
1021 There are two kinds of condition. If the text between the
1022 parentheses consists of a sequence of digits, the condition
1023 is satisfied if the capturing subpattern of that number has
1024 previously matched. Consider the following pattern, which
1025 contains non-significant white space to make it more
1026 readable (assume the PCRE_EXTENDED option) and to divide it
1027 into three parts for ease of discussion:
1028
1029
1030 ( \( )? [[^()]+ (?(1) \) )
1031
1032
1033 The first part matches an optional opening parenthesis, and
1034 if that character is present, sets it as the first captured
1035 substring. The second part matches one or more characters
1036 that are not parentheses. The third part is a conditional
1037 subpattern that tests whether the first set of parentheses
1038 matched or not. If they did, that is, if the subject started
1039 with an opening parenthesis, the condition is true, and so
1040 the yes-pattern is executed and a closing parenthesis is
1041 required. Otherwise, since no-pattern is not present, the
1042 subpattern matches nothing. In other words, this pattern
1043 matches a sequence of non-parentheses, optionally enclosed
1044 in parentheses.
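A minimal sketch of a numbered condition, using Python's re module as a stand-in for PCRE (an assumption for illustration; Python happens to accept the same (?(1)...) syntax). The pattern is written without the insignificant white space, and the helper name is hypothetical.

```python
import re

# An optional "(", one or more non-parentheses, and a ")" that is
# required only if group 1 (the opening parenthesis) took part in the match.
OPTIONAL_PARENS = re.compile(r"(\()?[^()]+(?(1)\))")


def is_optionally_parenthesized(subject):
    """Return True if the whole subject is text, optionally wrapped in ( )."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ["abc", "(abc)", "(abc", "abc)"]:
        print(s, is_optionally_parenthesized(s))
```

Anchoring with fullmatch makes the unbalanced cases fail; an unanchored search would still find the inner run of non-parentheses.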
1045
1046
1047 If the condition is not a sequence of digits, it must be an
1048 assertion. This may be a positive or negative lookahead or
1049 lookbehind assertion. Consider this pattern, again
1050 containing non-significant white space, and with the two
1051 alternatives on the second line:
1052
1053
1054 (?(?=[[^a-z]*[[a-z])
1055 \d{2}-[[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1056
1057
1058 The condition is a positive lookahead assertion that matches
1059 an optional sequence of non-letters followed by a letter. In
1060 other words, it tests for the presence of at least one
1061 letter in the subject. If a letter is found, the subject is
1062 matched against the first alternative; otherwise it is
1063 matched against the second. This pattern matches strings in
1064 one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1065 letters and dd are digits.
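Python's re module does not accept an assertion as a condition, so the sketch below is an emulation rather than the PCRE construct itself: a conditional of the form (?(?=A)X|Y) is rewritten as (?=A)X|(?!A)Y. The names DATE_LIKE and is_date_like are hypothetical.

```python
import re

# Emulates (?(?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
# by guarding each alternative with the assertion or its negation.
DATE_LIKE = re.compile(
    r"(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}"   # a letter is present: dd-aaa-dd
    r"|"
    r"(?![^a-z]*[a-z])\d{2}-\d{2}-\d{2}"      # no letter present: dd-dd-dd
)


def is_date_like(subject):
    """Return True if the whole subject has the form dd-aaa-dd or dd-dd-dd."""
    return DATE_LIKE.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ["12-abc-34", "12-34-56", "12-ab-34"]:
        print(s, is_date_like(s))
```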
1066 !!COMMENTS
1067
1068
1069 The sequence (?# marks the start of a comment which
1070 continues up to the next closing parenthesis. Nested
1071 parentheses are not permitted. The characters that make up a
1072 comment play no part in the pattern matching at
1073 all.
1074
1075
1076 If the PCRE_EXTENDED option is set, an unescaped # character
1077 outside a character class introduces a comment that
1078 continues up to the next newline character in the
1079 pattern.
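Both comment styles can be tried with Python's re module, used here only as a stand-in for PCRE (an assumption for illustration); its re.VERBOSE flag plays the role of PCRE_EXTENDED in this sketch.

```python
import re

# (?#...) comments take no part in the match.
inline = re.compile(r"\d{2}(?#day)-\d{2}(?#month)")

# In extended mode, an unescaped "#" outside a character class starts a
# comment that runs to the end of the line.
extended = re.compile(r"""
    \d{2}    # day
    -
    [#\d]{2} # inside a class, "#" is a literal character
""", re.VERBOSE)

if __name__ == "__main__":
    print(inline.fullmatch("31-12") is not None)
    print(extended.fullmatch("31-#1") is not None)
```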
1080 !!RECURSIVE PATTERNS
1081
1082
1083 Consider the problem of matching a string in parentheses,
1084 allowing for unlimited nested parentheses. Without the use
1085 of recursion, the best that can be done is to use a pattern
1086 that matches up to some fixed depth of nesting. It is not
1087 possible to handle an arbitrary nesting depth. Perl 5.6 has
1088 provided an experimental facility that allows regular
1089 expressions to recurse (amongst other things). It does this
1090 by interpolating Perl code in the expression at run time,
1091 and the code can refer to the expression itself. A Perl
1092 pattern to solve the parentheses problem can be created like
1093 this:
1094
1095
1096 $re = qr{\( (?: (?>[[^()]+) | (?p{$re}) )* \)}x;
1097
1098
1099 The (?p{...}) item interpolates Perl code at run time, and
1100 in this case refers recursively to the pattern in which it
1101 appears. Obviously, PCRE cannot support the interpolation of
1102 Perl code. Instead, the special item (?R) is provided for
1103 the specific case of recursion. This PCRE pattern solves the
1104 parentheses problem (assume the PCRE_EXTENDED option is set
1105 so that white space is ignored):
1106
1107
1108 \( ( (?>[[^()]+) | (?R) )* \)
1109
1110
1111 First it matches an opening parenthesis. Then it matches any
1112 number of substrings which can either be a sequence of
1113 non-parentheses, or a recursive match of the pattern itself
1114 (i.e. a correctly parenthesized substring). Finally there is
1115 a closing parenthesis.
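The three steps just described can be mirrored by a small hand-written recursive function. This is only a sketch of the logic, not of how PCRE implements (?R); Python's re module has no recursion item, and the function name match_parens is hypothetical.

```python
def match_parens(subject, start=0):
    """Return the index just past a balanced group starting at subject[start],
    or -1 if there is no correctly parenthesized substring there."""
    # First, an opening parenthesis.
    if start >= len(subject) or subject[start] != "(":
        return -1
    i = start + 1
    # Then any number of non-parentheses or nested groups.
    while i < len(subject) and subject[i] != ")":
        if subject[i] == "(":
            i = match_parens(subject, i)   # plays the role of (?R)
            if i == -1:
                return -1
        else:
            i += 1                         # one non-parenthesis character
    # Finally, a closing parenthesis.
    if i < len(subject) and subject[i] == ")":
        return i + 1
    return -1


if __name__ == "__main__":
    print(match_parens("(ab(cd)ef)"))
    print(match_parens("(aaa()"))
```

Because each character is consumed exactly once, this version fails in linear time on an unbalanced subject, which is the behaviour the once-only subpattern is meant to secure in the regular expression.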
1116
1117
1118 This particular example pattern contains nested unlimited
1119 repeats, and so the use of a once-only subpattern for
1120 matching strings of non-parentheses is important when
1121 applying the pattern to strings that do not match. For
1122 example, when it is applied to
1123
1124
1125 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1126
1127
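1128 it yields "no match" quickly. However, if a once-only subpattern is not used, the match runs for a very long time indeed, because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported.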
1129
1130
1131 The values set for any capturing subpatterns are those from
1132 the outermost level of the recursion at which the subpattern
1133 value is set. If the pattern above is matched
1134 against
1135
1136
1137 (ab(cd)ef)
1138
1139
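1140 the value for the capturing parentheses is "ef", which is the last value taken on at the top level. If additional parentheses are added, giving
1141
1142
1143 \( ( ( (?>[[^()]+) | (?R) )* ) \)


1144 the string they capture is "ab(cd)ef", the contents of the top level parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion, which it does by using __pcre_malloc__, freeing it via __pcre_free__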
1145 afterwards. If no memory can be obtained, it saves data for
1146 the first 15 capturing parentheses only, as there is no way
1147 to give an out-of-memory error from within a
1148 recursion.
1149 !!PERFORMANCE
1150
1151
1152 Certain items that may appear in patterns are more efficient
1153 than others. It is more efficient to use a character class
1154 like [[aeiou] than a set of alternatives such as (a|e|i|o|u).
1155 In general, the simplest construction that provides the
1156 required behaviour is usually the most efficient. Jeffrey
1157 Friedl's book contains a lot of discussion about optimizing
1158 regular expressions for efficient performance.
1159
1160
1161 When a pattern begins with .* and the PCRE_DOTALL option is
1162 set, the pattern is implicitly anchored by PCRE, since it
1163 can match only at the start of a subject string. However, if
1164 PCRE_DOTALL is not set, PCRE cannot make this optimization,
1165 because the . metacharacter does not then match a newline,
1166 and if the subject string contains newlines, the pattern may
1167 match from the character immediately following one of them
1168 instead of from the very start. For example, the
1169 pattern
1170
1171
1172 (.*) second
1173
1174
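1175 matches the subject "first\nand second" (where \n stands for a newline character) with the first captured substring being "and". In order to do this, PCRE has to retry the match starting after every newline in the subject.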
1176
1177
1178 If you are using such a pattern with subject strings that do
1179 not contain newlines, the best performance is obtained by
1180 setting PCRE_DOTALL, or starting the pattern with ^.* to
1181 indicate explicit anchoring. That saves PCRE from having to
1182 scan along the subject looking for a newline to restart
1183 at.
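The effect of the dot not matching a newline can be observed with Python's re module, used here as a stand-in for PCRE (an assumption for illustration; re.DOTALL is Python's counterpart of PCRE_DOTALL). The subject string is chosen only to contain a single newline.

```python
import re

subject = "first\nand second"   # illustrative subject containing one newline

# Without DOTALL the match has to restart after the newline.
without_dotall = re.search(r"(.*) second", subject)

# With DOTALL the dot crosses the newline, so the match starts at offset 0.
with_dotall = re.search(r"(.*) second", subject, re.DOTALL)

if __name__ == "__main__":
    print(without_dotall.start(), repr(without_dotall.group(1)))
    print(with_dotall.start(), repr(with_dotall.group(1)))
```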
1184
1185
1186 Beware of patterns that contain nested indefinite repeats.
1187 These can take a long time to run when applied to a string
1188 that does not match. Consider the pattern
1189 fragment
1190
1191
1192 (a+)*
1193
1194
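1195 This can match "aaaa" in many different ways, and the number increases very rapidly as the string gets longer. When the remainder of the pattern is such that the entire match is going to fail, PCRE has in principle to try every possible variation, and this can take an extremely long time.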
1196
1197
1198 An optimization catches some of the simpler cases such
1199 as
1200
1201
1202 (a+)*b
1203
1204
1205 where a literal character follows. Before embarking on the
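1206 standard matching procedure, PCRE checks that there is a "b" later in the subject string, and if there is not, it fails the match immediately. However, when there is no following literal this optimization cannot be used. You can see the difference by comparing the behaviour of
1207
1208
1209 (a+)*\d
1210
1211
1212 with the pattern above. The former gives a failure almost
1213 instantly when applied to a whole line of "a" characters, whereas the latter takes an appreciable time with strings longer than about 20 characters.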
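To get a feel for why nested indefinite repeats are costly, the following sketch counts the ways a run of identical characters can be cut into consecutive non-empty pieces, which is what the inner + and the outer * of (a+)* do between them when they consume the whole run. It is a counting model only, not a measurement of what PCRE actually tries, and count_splits is a hypothetical name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Number of ways to cut a run of n characters into consecutive non-empty pieces."""
    if n == 0:
        return 1  # one way: no pieces at all
    # Choose the length k of the first piece, then split the remaining n - k.
    return sum(count_splits(n - k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, count_splits(n))
```

The count doubles with every extra character, which is why a failing match over a long run becomes impractical.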
1214 !!UTF-8 SUPPORT
1215
1216
1217 Starting at release 3.3, PCRE has some support for character
1218 strings encoded in the UTF-8 format. This is incomplete, and
1219 is regarded as experimental. In order to use it, you must
1220 configure PCRE to include UTF-8 support in the code, and, in
1221 addition, you must call __pcre_compile()__ with the
1222 PCRE_UTF8 option flag. When you do this, both the pattern
1223 and any subject strings that are matched against it are
1224 treated as UTF-8 strings instead of just strings of bytes,
1225 but only in the cases that are mentioned below.
1226
1227
1228 If you compile PCRE with UTF-8 support, but do not use it at
1229 run time, the library will be a bit bigger, but the
1230 additional run time overhead is limited to testing the
1231 PCRE_UTF8 flag in several places, so should not be very
1232 large.
1233
1234
1235 PCRE assumes that the strings it is given contain valid
1236 UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
1237 you pass invalid UTF-8 strings to PCRE, the results are
1238 undefined.
1239
1240
1241 Running with PCRE_UTF8 set causes these changes in the way
1242 PCRE works:
1243
1244
1245 1. In a pattern, the escape sequence \x{...}, where the
1246 contents of the braces is a string of hexadecimal digits, is
1247 interpreted as a UTF-8 character whose code number is the
1248 given hexadecimal number, for example: \x{1234}. This inserts
1249 from one to six literal bytes into the pattern, using the
1250 UTF-8 encoding. If a non-hexadecimal digit appears between
1251 the braces, the item is not recognized.
1252
1253
1254 2. The original hexadecimal escape sequence, \xhh, generates
1255 a two-byte UTF-8 character if its value is greater than
1256 127.
1257
1258
1259 3. Repeat quantifiers are NOT correctly handled if they
1260 follow a multibyte character. For example, \x{100}* and \xc3+
1261 do not work. If you want to repeat such characters, you must
1262 enclose them in non-capturing parentheses, for example
1263 (?:\x{100}), at present.
1264
1265
1266 4. The dot metacharacter matches one UTF-8 character instead
1267 of a single byte.
1268
1269
1270 5. Unlike literal UTF-8 characters, the dot metacharacter
1271 followed by a repeat quantifier does operate correctly on
1272 UTF-8 characters instead of single bytes.
1273
1274
1275 6. Although the \x{...} escape is permitted in a character
1276 class, characters whose values are greater than 255 cannot
1277 be included in a class.
1278
1279
1280 7. A class is matched against a UTF-8 character instead of
1281 just a single byte, but it can match only characters whose
1282 values are less than 256. Characters with greater values
1283 always fail to match a class.
1284
1285
1286 8. Repeated classes work correctly on multiple
1287 characters.
1288
1289
1290 9. Classes containing just a single character whose value is
1291 greater than 127 (but less than 256), for example, [[\x80] or
1292 [[^\x{93}], do not work because these are optimized into
1293 single byte matches. In the first case, of course, the class
1294 brackets are just redundant.
1295
1296
1297 10. Lookbehind assertions move backwards in the subject by a
1298 fixed number of characters instead of a fixed number of
1299 bytes. Simple cases have been tested to work correctly, but
1300 there may be hidden gotchas herein.
1301
1302
1303 11. The character types such as \d and \w do not work correctly
1304 with UTF-8 characters. They continue to test a single
1305 byte.
1306
1307
1308 12. Anything not explicitly mentioned here continues to work
1309 in bytes rather than in characters.
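Item 1 of the list above says that a hexadecimal code number becomes from one to six literal bytes. The sketch below shows one way such an encoding can be computed, following the original six-byte UTF-8 scheme; it is an illustration under that assumption, not PCRE's own code, and utf8_encode_legacy is a hypothetical name.

```python
def utf8_encode_legacy(code_point):
    """Encode a code point below 2**31 as 1 to 6 bytes (original UTF-8 scheme)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        return bytes([code_point])          # plain 7-bit character: one byte
    # (exclusive upper limit, total length, marker bits of the lead byte)
    forms = [
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ]
    for limit, length, lead in forms:
        if code_point < limit:
            out = bytearray(length)
            # Fill continuation bytes from the right, six bits at a time.
            for k in range(length - 1, 0, -1):
                out[k] = 0x80 | (code_point & 0x3F)
                code_point >>= 6
            out[0] = lead | code_point      # remaining high bits go in the lead byte
            return bytes(out)
    raise ValueError("code point too large for six bytes")


if __name__ == "__main__":
    print(utf8_encode_legacy(0x1234).hex())   # the code number used in item 1
    print(utf8_encode_legacy(0xC3).hex())     # a value above 127 needs two bytes
```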
1310 !!DIFFERENCES FROM PERL
1311
1312
1313 The differences described here are with respect to Perl
1314 5.005.
1315
1316
1317 1. By default, a whitespace character is any character that
1318 the C library function __isspace()__ recognizes, though
1319 it is possible to compile PCRE with alternative character
1320 type tables. Normally __isspace()__ matches space,
1321 formfeed, newline, carriage return, horizontal tab, and
1322 vertical tab. Perl 5 no longer includes vertical tab in its
1323 set of whitespace characters. The \v escape that was in the
1324 Perl documentation for a long time was never in fact
1325 recognized. However, the character itself was treated as
1326 whitespace at least up to 5.002. In 5.004 and 5.005 it does
1327 not match \s.
1328
1329
1330 2. PCRE does not allow repeat quantifiers on lookahead
1331 assertions. Perl permits them, but they do not mean what you
1332 might think. For example, (?!a){3} does not assert that the
1333 next three characters are not "a". It just asserts that the next character is not "a" three times.
1334
1335
1336 3. Capturing subpatterns that occur inside negative
1337 lookahead assertions are counted, but their entries in the
1338 offsets vector are never set. Perl sets its numerical
1339 variables from any such patterns that are matched before the
1340 assertion fails to match something (thereby succeeding), but
1341 only if the negative lookahead assertion contains just one
1342 branch.
1343
1344
1345 4. Though binary zero characters are supported in the
1346 subject string, they are not allowed in a pattern string
1347 because it is passed as a normal C string, terminated by
1348 zero. The escape sequence "\0" can be used in the pattern to represent a binary zero.
1349
1350
1351 5. The following Perl escape sequences are not supported: \l,
1352 \u, \L, \U, \E, \Q. In fact these are implemented by Perl's
1353 general string-handling and are not part of its pattern
1354 matching engine.
1355
1356
1357 6. The Perl \G assertion is not supported as it is not
1358 relevant to single pattern matches.
1359
1360
1361 7. Fairly obviously, PCRE does not support the (?{code}) and
1362 (?p{code}) constructions. However, there is some
1363 experimental support for recursive patterns using the
1364 non-Perl item (?R).
1365
1366
1367 8. There are at the time of writing some oddities in Perl
1368 5.005_02 concerned with the settings of captured strings
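1369 when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set.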
1370
1371
1372 In Perl 5.004 $2 is set in both cases, and that is also true
1373 of PCRE. If in the future Perl changes to a consistent state
1374 that is different, PCRE may change to follow.
1375
1376
1377 9. Another as yet unresolved discrepancy is that in Perl
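1378 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.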
1379
1380
1381 10. The following UTF-8 features of Perl 5.6 are not
1382 implemented:
1383
1384
1385 a. The escape sequence \C to match a single
1386 byte.
1387
1388
1389 b. The use of Unicode tables and properties and escapes \p,
1390 \P, and \X.
1391
1392
1393 11. PCRE provides some extensions to the Perl regular
1394 expression facilities:
1395
1396
1397 (a) Although lookbehind assertions must match fixed length
1398 strings, each alternative branch of a lookbehind assertion
1399 can match a different length of string. Perl 5.005 requires
1400 them all to have the same length.
1401
1402
1403 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1404 set, the $ meta-character matches only at the very end of
1405 the string.
1406
1407
1408 (c) If PCRE_EXTRA is set, a backslash followed by a letter
1409 with no special meaning is faulted.
1410
1411
1412 (d) If PCRE_UNGREEDY is set, the greediness of the
1413 repetition quantifiers is inverted, that is, by default they
1414 are not greedy, but if followed by a question mark they
1415 are.
1416
1417
1418 (e) PCRE_ANCHORED can be used to force a pattern to be tried
1419 only at the start of the subject.
1420
1421
1422 (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
1423 for __pcre_exec()__ have no Perl
1424 equivalents.
1425
1426
1427 (g) The (?R) construct allows for recursive pattern matching
1428 (Perl 5.6 can do this using the (?p{code}) construct, which
1429 PCRE cannot of course support).
1430 !!AUTHOR
1431
1432
1433 Philip Hazel
1434 University Computing Service,
1435 New Museums Site,
1436 Cambridge CB2 3QG, England.
1437 Phone: +44 1223 334714
1438
1439
1440 Last updated: 28 August 2000,
1441 the 250th anniversary of the death of J.S. Bach.
1442 Copyright (c) 1997-2000 University of
1443 Cambridge.
1444 ----