Annotated edit history of pcre(7) version 1, including all changes.
Rev Author # Line
1 perry 1 PCRE
2 !!!PCRE
3 NAME
4 DESCRIPTION
5 REGULAR EXPRESSION DETAILS
6 BACKSLASH
7 CIRCUMFLEX AND DOLLAR
8 FULL STOP (PERIOD, DOT)
9 SQUARE BRACKETS
10 VERTICAL BAR
11 INTERNAL OPTION SETTING
12 SUBPATTERNS
13 REPETITION
14 BACK REFERENCES
15 ASSERTIONS
16 ONCE-ONLY SUBPATTERNS
17 CONDITIONAL SUBPATTERNS
18 COMMENTS
19 PERFORMANCE
20 DIFFERENCES FROM PERL
21 LIMITATIONS
22 AUTHOR
23 ----
24 !!NAME
25
26
27 pcre - Perl-compatible regular expressions.
28 !!DESCRIPTION
29
30
31 The PCRE library is a set of functions that implement
32 regular expression pattern matching using the same syntax
33 and semantics as Perl 5, with just a few differences (see
34 below). The current implementation corresponds to Perl
35 5.005.
36
37
38 This man page describes the regular expressions understood
39 by programs that use PCRE.
40 !!REGULAR EXPRESSION DETAILS
41
42
43 The syntax and semantics of the regular expressions
44 supported by PCRE are described below. Regular expressions
45 are also described in the Perl documentation and in a number
46 of other books, some of which have copious examples. Jeffrey
47 Friedl's
48
49
50 A regular expression is a pattern that is matched against a
51 subject string from left to right. Most characters stand for
52 themselves in a pattern, and match the corresponding
53 characters in the subject. As a trivial example, the
54 pattern
55
56
57 The quick brown fox
58
59
60 matches a portion of a subject string that is identical to
61 itself. The power of regular expressions comes from the
62 ability to include alternatives and repetitions in the
63 pattern. These are encoded in the pattern by the use of
64 ''meta-characters'', which do not stand for themselves
65 but instead are interpreted in some special
66 way.
67
68
69 There are two different sets of meta-characters: those that
70 are recognized anywhere in the pattern except within square
71 brackets, and those that are recognized in square brackets.
72 Outside square brackets, the meta-characters are as
73 follows:
74
75
76 \ general escape character with several uses ^ assert start
77 of subject (or line, in multiline mode) $ assert end of
78 subject (or line, in multiline mode) . match any character
79 except newline (by default) [[ start character class
80 definition | start of alternative branch ( start subpattern
81 ) end subpattern ? extends the meaning of ( also 0 or 1
82 quantifier also quantifier minimizer * 0 or more quantifier
83 + 1 or more quantifier { start min/max
84 quantifier
85
86
87 Part of a pattern that is in square brackets is called a
88
89
90 \ general escape character ^ negate the class, but only if
91 the first character - indicates character range ] terminates
92 the character class
93
94
95 The following sections describe the use of each of the
96 meta-characters.
97 !!BACKSLASH
98
99
100 The backslash character has several uses. Firstly, if it is
101 followed by a non-alphameric character, it takes away any
102 special meaning that character may have. This use of
103 backslash as an escape character applies both inside and
104 outside character classes.
105
106
107 For example, if you want to match a
108
109
110 If a pattern is compiled with the PCRE_EXTENDED option,
111 whitespace in the pattern (other than in a character class)
112 and characters between a
113
114
115 A second use of backslash provides a way of encoding
116 non-printing characters in patterns in a visible manner.
117 There is no restriction on the appearance of non-printing
118 characters, apart from the binary zero that terminates a
119 pattern, but when a pattern is being prepared by text
120 editing, it is usually easier to use one of the following
121 escape sequences than the binary character it
122 represents:
123
124
125 \a alarm, that is, the BEL character (hex 07) \cx
126
127
128 The precise effect of
129
130
131 After
132
133
134 After
135
136
137 The handling of a backslash followed by a digit other than 0
138 is complicated. Outside a character class, PCRE reads it and
139 any following digits as a decimal number. If the number is
140 less than 10, or if there have been at least that many
141 previous capturing left parentheses in the expression, the
142 entire sequence is taken as a ''back reference''. A
143 description of how this works is given later, following the
144 discussion of parenthesized subpatterns.
145
146
147 Inside a character class, or if the decimal number is
148 greater than 9 and there have not been that many capturing
149 subpatterns, PCRE re-reads up to three octal digits
150 following the backslash, and generates a single byte from
151 the least significant 8 bits of the value. Any subsequent
152 digits stand for themselves. For example:
153
154
155 \040 is another way of writing a space \40 is the same,
156 provided there are fewer than 40 previous capturing
157 subpatterns \7 is always a back reference \11 might be a back
158 reference, or another way of writing a tab \011 is always a
159 tab \0113 is a tab followed by the character "3"
160
161
162 Note that octal values of 100 or greater must not be
163 introduced by a leading zero, because no more than three
164 octal digits are ever read.
165
166
167 All the sequences that define a single byte value can be
168 used both inside and outside character classes. In addition,
169 inside a character class, the sequence
170
171
172 The third use of backslash is for specifying generic
173 character types:
174
175
176 \d any decimal digit \D any character that is not a decimal
177 digit \s any whitespace character \S any character that is not
178 a whitespace character \w any "word" character \W any "non-word" character
179
180
181 Each pair of escape sequences partitions the complete set of
182 characters into two disjoint sets. Any given character
183 matches one, and only one, of each pair.
184
185
186 A
187
188
189 These character type sequences can appear both inside and
190 outside character classes. They each match one character of
191 the appropriate type. If the current matching point is at
192 the end of the subject string, all of them fail, since there
193 is no character to match.
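
The partitioning rule above can be checked with a short sketch. It uses Python's re module as a stand-in, which is an assumption of this example: re is not PCRE, although it follows the same Perl conventions for \d and \D. The helper name classify is hypothetical.

```python
import re

# re.ASCII restricts \d, \s and \w to ASCII, which is closer to the
# byte-oriented behaviour described above than Python's Unicode default.
DIGIT = re.compile(r"\d", re.ASCII)
NON_DIGIT = re.compile(r"\D", re.ASCII)


def classify(ch):
    """Report whether a one-character string matches \\d and whether it matches \\D."""
    is_digit = DIGIT.fullmatch(ch) is not None
    is_non_digit = NON_DIGIT.fullmatch(ch) is not None
    return is_digit, is_non_digit


# Every single character matches exactly one sequence of the pair.
pair_results = [classify(ch) for ch in "7x \n_"]

# At the end of the subject there is no character left, so both fail.
at_end = (DIGIT.match(""), NON_DIGIT.match(""))
```

The same check applies to the \s / \S and \w / \W pairs.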
194
195
196 The fourth use of backslash is for certain simple
197 assertions. An assertion specifies a condition that has to
198 be met at a particular point in a match, without consuming
199 any characters from the subject string. The use of
200 subpatterns for more complicated assertions is described
201 below. The backslashed assertions are
202
203
204 \b word boundary \B not a word boundary \A start of subject
205 (independent of multiline mode) \Z end of subject or newline
206 at end (independent of multiline mode) \z end of subject
207 (independent of multiline mode)
208
209
210 These assertions may not appear in character classes (but
211 note that
212
213
214 A word boundary is a position in the subject string where
215 the current character and the previous character do not both
216 match \w or \W (i.e. one matches \w and the other matches \W),
217 or the start or end of the string if the first or last
218 character matches \w, respectively.
219
220
221 The \A, \Z, and \z assertions differ from the traditional
222 circumflex and dollar (described below) in that they only
223 ever match at the very start and end of the subject string,
224 whatever options are set. They are not affected by the
225 PCRE_NOTBOL or PCRE_NOTEOL options. If the
226 ''startoffset'' argument of __pcre_exec()__ is
227 non-zero, \A can never match. The difference between \Z and \z
228 is that \Z matches before a newline that is the last
229 character of the string as well as at the end of the string,
230 whereas \z matches only at the end.
231 !!CIRCUMFLEX AND DOLLAR
232
233
234 Outside a character class, in the default matching mode, the
235 circumflex character is an assertion which is true only if
236 the current matching point is at the start of the subject
237 string. If the ''startoffset'' argument of
238 __pcre_exec()__ is non-zero, circumflex can never match.
239 Inside a character class, circumflex has an entirely
240 different meaning (see below).
241
242
243 Circumflex need not be the first character of the pattern if
244 a number of alternatives are involved, but it should be the
245 first thing in each alternative in which it appears if the
246 pattern is ever to match that branch. If all possible
247 alternatives start with a circumflex, that is, if the
248 pattern is constrained to match only at the start of the
249 subject, it is said to be an
250
251
252 A dollar character is an assertion which is true only if the
253 current matching point is at the end of the subject string,
254 or immediately before a newline character that is the last
255 character in the string (by default). Dollar need not be the
256 last character of the pattern if a number of alternatives
257 are involved, but it should be the last item in any branch
258 in which it appears. Dollar has no special meaning in a
259 character class.
260
261
262 The meaning of dollar can be changed so that it matches only
263 at the very end of the string, by setting the
264 PCRE_DOLLAR_ENDONLY option at compile or matching time. This
265 does not affect the \Z assertion.
266
267
268 The meanings of the circumflex and dollar characters are
269 changed if the PCRE_MULTILINE option is set. When this is
270 the case, they match immediately after and immediately
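271 before an internal newline character, respectively, as well as at
272 the start and end of the subject, and circumflex can match when the ''startoffset''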
273 argument of __pcre_exec()__ is non-zero. The
274 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
275 set.
276
277
278 Note that the sequences \A, \Z, and \z can be used to match the
279 start and end of the subject in both modes, and if all
280 branches of a pattern start with \A it is always anchored,
281 whether PCRE_MULTILINE is set or not.
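
The default and multiline behaviour of circumflex and dollar can be illustrated as follows. This is a sketch using Python's re module, not PCRE itself; re.MULTILINE is assumed here to play the role of PCRE_MULTILINE, and the subject string is invented for the example.

```python
import re

subject = "def\nabc\n"

# Default mode: ^ matches only at the very start of the subject.
default_caret = re.search(r"^abc", subject)
# Default mode: $ matches at the end, or just before a final newline.
default_dollar = re.search(r"abc$", subject)

# Multiline mode: ^ and $ also match after and before internal newlines.
multi_caret = re.search(r"^abc", subject, re.MULTILINE)
multi_dollar = re.search(r"def$", subject, re.MULTILINE)

# \A stays tied to the start of the subject even in multiline mode.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)
```

Here default_caret and anchored find nothing, while the other three searches succeed.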
282 !!FULL STOP (PERIOD, DOT)
283
284
285 Outside a character class, a dot in the pattern matches any
286 one character in the subject, including a non-printing
287 character, but not (by default) newline. If the PCRE_DOTALL
288 option is set, then dots match newlines as well. The
289 handling of dot is entirely independent of the handling of
290 circumflex and dollar, the only relationship being that they
291 both involve newline characters. Dot has no special meaning
292 in a character class.
293 !!SQUARE BRACKETS
294
295
296 An opening square bracket introduces a character class,
297 terminated by a closing square bracket. A closing square
298 bracket on its own is not special. If a closing square
299 bracket is required as a member of the class, it should be
300 the first data character in the class (after an initial
301 circumflex, if present) or escaped with a
302 backslash.
303
304
305 A character class matches a single character in the subject;
306 the character must be in the set of characters defined by
307 the class, unless the first character in the class is a
308 circumflex, in which case the subject character must not be
309 in the set defined by the class. If a circumflex is actually
310 required as a member of the class, ensure it is not the
311 first character, or escape it with a backslash.
312
313
314 For example, the character class [[aeiou] matches any lower
315 case vowel, while [[^aeiou] matches any character that is not
316 a lower case vowel. Note that a circumflex is just a
317 convenient notation for specifying the characters which are
318 in the class by enumerating those that are not. It is not an
319 assertion: it still consumes a character from the subject
320 string, and fails if the current pointer is at the end of
321 the string.
322
323
324 When caseless matching is set, any letters in a class
325 represent both their upper case and lower case versions, so
326 for example, a caseless [[aeiou] matches
327
328
329 The newline character is never treated in any special way in
330 character classes, whatever the setting of the PCRE_DOTALL
331 or PCRE_MULTILINE options is. A class such as [[^a] will
332 always match a newline.
333
334
335 The minus (hyphen) character can be used to specify a range
336 of characters in a character class. For example, [[d-m]
337 matches any letter between d and m, inclusive. If a minus
338 character is required in a class, it must be escaped with a
339 backslash or appear in a position where it cannot be
340 interpreted as indicating a range, typically as the first or
341 last character in the class.
342
343
344 It is not possible to have the literal character
345
346
347 Ranges operate in ASCII collating sequence. They can also be
348 used for characters specified numerically, for example
349 [[\000-\037]. If a range that includes letters is used when
350 caseless matching is set, it matches the letters in either
351 case. For example, [[W-c] is equivalent to [[][[\^_`wxyzabc],
352 matched caselessly, and if character tables for the
353
354
355 The character types \d, \D, \s, \S, \w, and \W may also appear in
356 a character class, and add the characters that they match to
357 the class. For example, [[\dABCDEF] matches any hexadecimal
358 digit. A circumflex can conveniently be used with the upper
359 case character types to specify a more restricted set of
360 characters than the matching lower case type. For example,
361 the class [[^\W_] matches any letter or digit, but not
362 underscore.
363
364
365 All non-alphameric characters other than \, -, ^ (at the
366 start) and the terminating ] are non-special in character
367 classes, but it does no harm if they are
368 escaped.
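
A few of the character class rules from this section, sketched with Python's re module as a stand-in for PCRE (an assumption of this example). The variable names are illustrative only, and re.IGNORECASE is assumed to correspond to caseless matching.

```python
import re

# A class matches one character; a leading ^ negates it.
vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

# Newline gets no special treatment inside a class.
not_a = re.compile(r"[^a]")

# Character types may appear inside a class.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
# [^\W_] is "word character, but not underscore".
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# Caseless matching makes each letter stand for both of its cases.
caseless_vowel = re.compile(r"[aeiou]", re.IGNORECASE)
```

Note that [\dABCDEF] as written accepts only upper case hexadecimal letters.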
369 !!VERTICAL BAR
370
371
372 Vertical bar characters are used to separate alternative
373 patterns. For example, the pattern
374
375
376 gilbert|sullivan
377
378
379 matches either
380 !!INTERNAL OPTION SETTING
381
382
383 The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
384 and PCRE_EXTENDED can be changed from within the pattern by
385 a sequence of Perl option letters enclosed between
386
387
388 i for PCRE_CASELESS m for PCRE_MULTILINE s for PCRE_DOTALL x
389 for PCRE_EXTENDED
390
391
392 For example, (?im) sets caseless, multiline matching. It is
393 also possible to unset these options by preceding the letter
394 with a hyphen, and a combined setting and unsetting such as
395 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
396 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
397 If a letter appears both before and after the hyphen, the
398 option is unset.
399
400
401 The scope of these option changes depends on where in the
402 pattern the setting occurs. For settings that are outside
403 any subpattern (defined below), the effect is the same as if
404 the options were set or unset at the start of matching. The
405 following patterns all behave in exactly the same
406 way:
407
408
409 (?i)abc a(?i)bc ab(?i)c abc(?i)
410
411
412 which in turn is the same as compiling the pattern abc with
413 PCRE_CASELESS set. In other words, such
414
415
416 If an option change occurs inside a subpattern, the effect
417 is different. This is a change of behaviour in Perl 5.005.
418 An option change inside a subpattern affects only that part
419 of the subpattern that follows it, so
420
421
422 (a(?i)b)c
423
424
425 matches abc and aBc and no other strings (assuming
426 PCRE_CASELESS is not used). By this means, options can be
427 made to have different settings in different parts of the
428 pattern. Any changes made in one alternative do carry on
429 into subsequent branches within the same subpattern. For
430 example,
431
432
433 (a(?i)b|c)
434
435
436 matches
437
438
439 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
440 be changed in the same way as the Perl-compatible options by
441 using the characters U and X respectively. The (?X) flag
442 setting is special in that it must always occur earlier in
443 the pattern than any of the additional features it turns on,
444 even when it is at top level. It is best put at the
445 start.
446 !!SUBPATTERNS
447
448
449 Subpatterns are delimited by parentheses (round brackets),
450 which can be nested. Marking part of a pattern as a
451 subpattern does two things:
452
453
454 1. It localizes a set of alternatives. For example, the
455 pattern
456
457
458 cat(aract|erpillar|)
459
460
461 matches one of the words
462
463
464 2. It sets up the subpattern as a capturing subpattern (as
465 defined above). When the whole pattern matches, that portion
466 of the subject string that matched the subpattern is passed
467 back to the caller via the ''ovector'' argument of
468 __pcre_exec()__. Opening parentheses are counted from
469 left to right (starting from 1) to obtain the numbers of the
470 capturing subpatterns.
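
Both effects of a subpattern can be seen in a small sketch. It uses Python's re module rather than PCRE, so the captured substrings come back through groups() instead of the ''ovector'' argument; the subject strings are chosen for illustration.

```python
import re

# 1. Parentheses localize a set of alternatives, including an empty one.
localized = re.compile(r"cat(aract|erpillar|)")
candidates = ("cat", "cataract", "caterpillar", "catapult")
words = [w for w in candidates if localized.fullmatch(w)]

# 2. Opening parentheses are counted left to right, starting from 1.
numbered = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = numbered.groups()
```

Group 1 is the outer pair of parentheses, so it holds the longest substring.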
471
472
473 For example, if the string
474
475
476 the ((red|white) (king|queen))
477
478
479 the captured substrings are
480
481
482 The fact that plain parentheses fulfil two functions is not
483 always helpful. There are often times when a grouping
484 subpattern is required without a capturing requirement. If
485 an opening parenthesis is followed by
486
487
488 the ((?:red|white) (king|queen))
489
490
491 the captured substrings are
492
493
494 As a convenient shorthand, if any option settings are
495 required at the start of a non-capturing subpattern, the
496 option letters may appear between the
497
498
499 (?i:saturday|sunday) (?:(?i)saturday|sunday)
500
501
502 match exactly the same set of strings. Because alternative
503 branches are tried from left to right, and options are not
504 reset until the end of the subpattern is reached, an option
505 setting in one branch does affect subsequent branches, so
506 the above patterns match
507 !!REPETITION
508
509
510 Repetition is specified by quantifiers, which can follow any
511 of the following items:
512
513
514 a single character, possibly escaped the . metacharacter a
515 character class a back reference (see next section) a
516 parenthesized subpattern (unless it is an assertion - see
517 below)
518
519
520 The general repetition quantifier specifies a minimum and
521 maximum number of permitted matches, by giving the two
522 numbers in curly brackets (braces), separated by a comma.
523 The numbers must be less than 65536, and the first must be
524 less than or equal to the second. For example:
525
526
527 z{2,4}
528
529
530 matches
531
532
533 [[aeiou]{3,}
534
535
536 matches at least 3 successive vowels, but may match many
537 more, while
538
539
540 \d{8}
541
542
543 matches exactly 8 digits. An opening curly bracket that
544 appears in a position where a quantifier is not allowed, or
545 one that does not match the syntax of a quantifier, is taken
546 as a literal character. For example, {,6} is not a
547 quantifier, but a literal string of four
548 characters.
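
The three quantifier forms above can be tried out as follows. This sketch uses Python's re module as a stand-in for PCRE; the treatment of malformed quantifiers such as {,6} differs between the two and is deliberately not shown.

```python
import re

# {2,4}: between two and four repetitions of the preceding item.
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
zs = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowels = re.fullmatch(r"[aeiou]{3,}", "aeiouaeiou")

# {8}: exactly eight.
eight_digits = re.fullmatch(r"\d{8}", "20240131")
seven_digits = re.fullmatch(r"\d{8}", "2024013")
```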
549
550
551 The quantifier {0} is permitted, causing the expression to
552 behave as if the previous item and the quantifier were not
553 present.
554
555
556 For convenience (and historical compatibility) the three
557 most common quantifiers have single-character
558 abbreviations:
559
560
561 * is equivalent to {0,} + is equivalent to {1,} ? is
562 equivalent to {0,1}
563
564
565 It is possible to construct infinite loops by following a
566 subpattern that can match no characters with a quantifier
567 that has no upper limit, for example:
568
569
570 (a?)*
571
572
573 Earlier versions of Perl and PCRE used to give an error at
574 compile time for such patterns. However, because there are
575 cases where this can be useful, such patterns are now
576 accepted, but if any repetition of the subpattern does in
577 fact match no characters, the loop is forcibly
578 broken.
579
580
581 By default, the quantifiers are
582
583
584 /\*.*\*/
585
586
587 to the string
588
589
590 /* first comment */ not comment /* second comment
591 */
592
593
594 fails, because it matches the entire string due to the
595 greediness of the .* item.
596
597
598 However, if a quantifier is followed by a question mark,
599 then it ceases to be greedy, and instead matches the minimum
600 number of times possible, so the pattern
601
602
603 /\*.*?\*/
604
605
606 does the right thing with the C comments. The meaning of the
607 various quantifiers is not otherwise changed, just the
608 preferred number of matches. Do not confuse this use of
609 question mark with its use as a quantifier in its own right.
610 Because it has two uses, it can sometimes appear doubled, as
611 in
612
613
614 \d??\d
615
616
617 which matches one digit by preference, but can match two if
618 that is the only way the rest of the pattern
619 matches.
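
The difference between greedy and minimizing quantifiers can be sketched with Python's re module, used here as a stand-in for PCRE. The subject string follows the C comment example above.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()
# Minimizing .*? stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two if the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```

In the last line, fullmatch forces the pattern to cover the whole subject, which is what makes the optional digit necessary.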
620
621
622 If the PCRE_UNGREEDY option is set (an option which is not
623 available in Perl) then the quantifiers are not greedy by
624 default, but individual ones can be made greedy by following
625 them with a question mark. In other words, it inverts the
626 default behaviour.
627
628
629 When a parenthesized subpattern is quantified with a minimum
630 repeat count that is greater than 1 or with a limited
631 maximum, more store is required for the compiled pattern, in
632 proportion to the size of the minimum or
633 maximum.
634
635
636 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
637 option (equivalent to Perl's /s) is set, thus allowing the .
638 to match newlines, then the pattern is implicitly anchored,
639 because whatever follows will be tried against every
640 character position in the subject string, so there is no
641 point in retrying the overall match at any position after
642 the first. PCRE treats such a pattern as though it were
643 preceded by \A. In cases where it is known that the subject
644 string contains no newlines, it is worth setting PCRE_DOTALL
645 when the pattern begins with .* in order to obtain this
646 optimization, or alternatively using ^ to indicate anchoring
647 explicitly.
648
649
650 When a capturing subpattern is repeated, the value captured
651 is the substring that matched the final iteration. For
652 example, after
653
654
655 (tweedle[[dume]{3}\s*)+
656
657
658 has matched
659
660
661 /(a|(b))+/
662
663
664 matches
665 !!BACK REFERENCES
666
667
668 Outside a character class, a backslash followed by a digit
669 greater than 0 (and possibly further digits) is a back
670 reference to a capturing subpattern earlier (i.e. to its
671 left) in the pattern, provided there have been that many
672 previous capturing left parentheses.
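
A back reference repeats the text that a capturing subpattern actually matched, which a short sketch can make concrete. It uses Python's re module as a stand-in for PCRE; the subject strings are chosen for illustration.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

same_first = pattern.fullmatch("sense and sensibility")
same_second = pattern.fullmatch("response and responsibility")
# \1 repeats the text that group 1 matched, not the group's pattern.
mixed = pattern.fullmatch("sense and responsibility")

# A reference to a group that took no part in the match fails.
unused_group = re.match(r"(a|(bc))\2", "aa")
used_group = re.match(r"(a|(bc))\2", "bcbc")
```

Only same_first, same_second and used_group succeed.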
673
674
675 However, if the decimal number following the backslash is
676 less than 10, it is always taken as a back reference, and
677 causes an error only if there are not that many capturing
678 left parentheses in the entire pattern. In other words, the
679 parentheses that are referenced need not be to the left of
680 the reference for numbers less than 10. See the section
681 entitled
682
683
684 A back reference matches whatever actually matched the
685 capturing subpattern in the current subject string, rather
686 than anything matching the subpattern itself. So the
687 pattern
688
689
690 (sens|respons)e and \1ibility
691
692
693 matches
694
695
696 ((?i)rah)\s+\1
697
698
699 matches
700
701
702 There may be more than one back reference to the same
703 subpattern. If a subpattern has not actually been used in a
704 particular match, then any back references to it always
705 fail. For example, the pattern
706
707
708 (a|(bc))\2
709
710
711 always fails if it starts to match
712
713
714 A back reference that occurs inside the parentheses to which
715 it refers fails when the subpattern is first used, so, for
716 example, (a\1) never matches. However, such references can be
717 useful inside repeated subpatterns. For example, the
718 pattern
719
720
721 (a|b\1)+
722
723
724 matches any number of
725 !!ASSERTIONS
726
727
728 An assertion is a test on the characters following or
729 preceding the current matching point that does not actually
730 consume any characters. The simple assertions coded as \b, \B,
731 \A, \Z, \z, ^ and $ are described above. More complicated
732 assertions are coded as subpatterns. There are two kinds:
733 those that look ahead of the current position in the subject
734 string, and those that look behind it.
735
736
737 An assertion subpattern is matched in the normal way, except
738 that it does not cause the current matching position to be
739 changed. Lookahead assertions start with (?= for positive
740 assertions and (?! for negative assertions. For
741 example,
742
743
744 \w+(?=;)
745
746
747 matches a word followed by a semicolon, but does not include
748 the semicolon in the match, and
749
750
751 foo(?!bar)
752
753
754 matches any occurrence of
755
756
757 (?!foo)bar
758
759
760 does not find an occurrence of
761
762
763 Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example,
764
765
766 (?
767
768
769 does find an occurrence of
770
771
772 (?
773
774
775 is permitted, but
776
777
778 (?
779
780
781 causes an error at compile time. Branches that match
782 different length strings are permitted only at the top level
783 of a lookbehind assertion. This is an extension compared
784 with Perl 5.005, which requires all branches to match the
785 same length of string. An assertion such as
786
787
788 (?
789
790
791 is not permitted, because its single top-level branch can
792 match two different lengths, but it is acceptable if
793 rewritten to use two top-level branches:
794
795
796 (?
797
798
799 The implementation of lookbehind assertions is, for each
800 alternative, to temporarily move the current position back
801 by the fixed width and then try to match. If there are
802 insufficient characters before the current position, the
803 match is deemed to fail. Lookbehinds in conjunction with
804 once-only subpatterns can be particularly useful for
805 matching at the ends of strings; an example is given at the
806 end of the section on once-only subpatterns.
807
808
809 Several assertions (of any sort) may occur in succession.
810 For example,
811
812
813 (?
814
815
816 matches
817 not'' match
818 ''
819
820
821 (?
822
823
824 This time the first assertion looks at the preceding six
825 characters, checking that the first three are digits, and
826 then the second assertion checks that the preceding three
827 characters are not
828
829
830 Assertions can be nested in any combination. For
831 example,
832
833
834 (?
835
836
837 matches an occurrence of
838
839
840 (?
841
842
843 is another pattern which matches
844
845
846 Assertion subpatterns are not capturing subpatterns, and may
847 not be repeated, because it makes no sense to assert the
848 same thing several times. If any kind of assertion contains
849 capturing subpatterns within it, these are counted for the
850 purposes of numbering the capturing subpatterns in the whole
851 pattern. However, substring capturing is carried out only
852 for positive assertions, because it does not make sense for
853 negative assertions.
854
855
856 Assertions count towards the maximum of 200 parenthesized
857 subpatterns.
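
The lookahead and lookbehind assertions described in this section can be sketched as follows. The example uses Python's re module as a stand-in for PCRE and the usual Perl spellings (?=, (?!, (?<= and (?<!; the subject strings are invented. Python is stricter than the text above about lookbehind alternatives of different lengths, so only fixed-width lookbehinds are shown.

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "alpha beta; gamma").group()

# Negative lookahead: "foo" not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Negative lookbehind: "bar" not preceded by "foo".
bar_not_after_foo = [
    m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")
]

# Assertions in succession are each applied at the same point in the subject.
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
```

None of the assertions consume characters, so each match contains only the text outside them.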
858 !!ONCE-ONLY SUBPATTERNS
859
860
861 With both maximizing and minimizing repetition, failure of
862 what follows normally causes the repeated item to be
863 re-evaluated to see if a different number of repeats allows
864 the rest of the pattern to match. Sometimes it is useful to
865 prevent this, either to change the nature of the match, or
866 to cause it to fail earlier than it otherwise might, when the
867 author of the pattern knows there is no point in carrying
868 on.
869
870
871 Consider, for example, the pattern \d+foo when applied to the
872 subject line
873
874
875 123456bar
876
877
878 After matching all 6 digits and then failing to match
879
880
881 (?
882
883
884 This kind of parenthesis
885
886
887 An alternative description is that a subpattern of this type
888 matches the string of characters that an identical
889 standalone pattern would match, if anchored at the current
890 point in the subject string.
891
892
893 Once-only subpatterns are not capturing subpatterns. Simple
894 cases such as the above example can be thought of as a
895 maximizing repeat that must swallow everything it can. So,
896 while both \d+ and \d+? are prepared to adjust the number of
897 digits they match in order to make the rest of the pattern
898 match, (?
899
900
901 This construction can of course contain arbitrarily
902 complicated subpatterns, and it can be nested.
903
904
905 Once-only subpatterns can be used in conjunction with
906 lookbehind assertions to specify efficient matching at the
907 end of the subject string. Consider a simple pattern such
908 as
909
910
911 abcd$
912
913
914 when applied to a long string which does not match it.
915 Because matching proceeds from left to right, PCRE will look
916 for each
917
918
919 ^.*abcd$
920
921
922 then the initial .* matches the entire string at first, but
923 when this fails, it backtracks to match all but the last
924 character, then all but the last two characters, and so on.
925 Once again the search for
926
927
928 ^(?
929
930
931 then there can be no backtracking for the .* item; it can
932 match only the entire string. The subsequent lookbehind
933 assertion does a single test on the last four characters. If
934 it fails, the match fails immediately. For long strings,
935 this approach makes a significant difference to the
936 processing time.
937 !!CONDITIONAL SUBPATTERNS
938
939
940 It is possible to cause the matching process to obey a
941 subpattern conditionally or to choose between two
942 alternative subpatterns, depending on the result of an
943 assertion, or whether a previous capturing subpattern
944 matched or not. The two possible forms of conditional
945 subpattern are
946
947
948 (?(condition)yes-pattern)
949 (?(condition)yes-pattern|no-pattern)
950
951
952 If the condition is satisfied, the yes-pattern is used;
953 otherwise the no-pattern (if present) is used. If there are
954 more than two alternatives in the subpattern, a compile-time
955 error occurs.
956
957
958 There are two kinds of condition. If the text between the
959 parentheses consists of a sequence of digits, then the
960 condition is satisfied if the capturing subpattern of that
961 number has previously matched. Consider the following
962 pattern, which contains non-significant white space to make
963 it more readable (assume the PCRE_EXTENDED option) and to
964 divide it into three parts for ease of
965 discussion:
966
967
968 ( \( )? [[^()]+ (?(1) \) )
969
970
971 The first part matches an optional opening parenthesis, and
972 if that character is present, sets it as the first captured
973 substring. The second part matches one or more characters
974 that are not parentheses. The third part is a conditional
975 subpattern that tests whether the first set of parentheses
976 matched or not. If they did, that is, if the subject started
977 with an opening parenthesis, the condition is true, and so
978 the yes-pattern is executed and a closing parenthesis is
979 required. Otherwise, since no-pattern is not present, the
980 subpattern matches nothing. In other words, this pattern
981 matches a sequence of non-parentheses, optionally enclosed
982 in parentheses.
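
The optional-parentheses pattern can be exercised with a short sketch. It uses Python's re module as a stand-in for PCRE, with re.VERBOSE assumed to play the role of PCRE_EXTENDED. Python supports only this first kind of condition, a group number, not an assertion.

```python
import re

# Three parts: an optional "(", a run of non-parentheses, and a ")" that is
# required only if group 1 took part in the match.
optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

enclosed = optional_parens.fullmatch("(abc)")
bare = optional_parens.fullmatch("abc")
unbalanced = optional_parens.fullmatch("(abc")
```

The unbalanced subject is rejected because the condition on group 1 demands a closing parenthesis.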
983
984
985 If the condition is not a sequence of digits, it must be an
986 assertion. This may be a positive or negative lookahead or
987 lookbehind assertion. Consider this pattern, again
988 containing non-significant white space, and with the two
989 alternatives on the second line:
990
991
992 (?(?=[[^a-z]*[[a-z]) \d{2}-[[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2}
993 )
994
995
996 The condition is a positive lookahead assertion that matches
997 an optional sequence of non-letters followed by a letter. In
998 other words, it tests for the presence of at least one
999 letter in the subject. If a letter is found, the subject is
1000 matched against the first alternative; otherwise it is
1001 matched against the second. This pattern matches strings in
1002 one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1003 letters and dd are digits.
1004 !!COMMENTS
1005
1006
1007 The sequence (?# marks the start of a comment which
1008 continues up to the next closing parenthesis. Nested
1009 parentheses are not permitted. The characters that make up a
1010 comment play no part in the pattern matching at
1011 all.
1012
1013
1014 If the PCRE_EXTENDED option is set, an unescaped # character
1015 outside a character class introduces a comment that
1016 continues up to the next newline character in the
1017 pattern.
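
Both comment forms can be sketched with Python's re module, which is assumed here to behave like PCRE for these two constructs (re.VERBOSE standing in for PCRE_EXTENDED). The variable names are hypothetical.

```python
import re

# Inline comment: everything from (?# to the next ")" is ignored.
inline = re.compile(r"ab(?#this text is not matched)c")

# Extended mode: an unescaped # outside a character class starts a
# comment that runs to the end of the line.
extended = re.compile(
    r"""
    \d+   # one or more digits
    [#]   # a literal "#", because it is inside a character class
    """,
    re.VERBOSE,
)
```

Here inline matches "abc" exactly, and extended matches digits followed by a literal "#" character.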
1018 !!PERFORMANCE
1019
1020
1021 Certain items that may appear in patterns are more efficient
1022 than others. It is more efficient to use a character class
1023 like [[aeiou] than a set of alternatives such as (a|e|i|o|u).
1024 In general, the simplest construction that provides the
1025 required behaviour is usually the most efficient. Jeffrey
1026 Friedl's book contains a lot of discussion about optimizing
1027 regular expressions for efficient performance.
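
As a small illustration, the two forms find the same matches, so the character class can replace the alternatives without changing behaviour. The sketch uses Python's re module as a stand-in for PCRE and does not measure speed; same_matches is a hypothetical helper.

```python
import re

VOWEL_CLASS = re.compile(r"[aeiou]")
VOWEL_ALTERNATIVES = re.compile(r"(a|e|i|o|u)")


def same_matches(subject):
    """Return True if both patterns match at exactly the same positions."""
    class_hits = [m.span() for m in VOWEL_CLASS.finditer(subject)]
    alt_hits = [m.span() for m in VOWEL_ALTERNATIVES.finditer(subject)]
    return class_hits == alt_hits
```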
1028
1029
1030 When a pattern begins with .* and the PCRE_DOTALL option is
1031 set, the pattern is implicitly anchored by PCRE, since it
1032 can match only at the start of a subject string. However, if
1033 PCRE_DOTALL is not set, PCRE cannot make this optimization,
1034 because the . metacharacter does not then match a newline,
1035 and if the subject string contains newlines, the pattern may
1036 match from the character immediately following one of them
1037 instead of from the very start. For example, the
1038 pattern
1039
1040
1041 (.*) second
1042
1043
1044 matches the subject
1045
1046
1047 If you are using such a pattern with subject strings that do
1048 not contain newlines, the best performance is obtained by
1049 setting PCRE_DOTALL, or starting the pattern with ^.* to
1050 indicate explicit anchoring. That saves PCRE from having to
1051 scan along the subject looking for a newline to restart
1052 at.
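
The effect of the dot-matches-newline setting on a leading .* can be seen in a short sketch. Python's re module is assumed to behave like PCRE here, with re.DOTALL standing in for PCRE_DOTALL, and the subject string is an illustrative choice.

```python
import re

SUBJECT = "first\nand second"   # illustrative subject containing a newline

# Without DOTALL the dot stops at the newline, so the match has to be
# retried further along and begins just after the newline.
unanchored = re.search(r"(.*) second", SUBJECT)

# With DOTALL the dot also matches the newline, so the match can only
# begin at the very start of the subject.
dotall = re.search(r"(.*) second", SUBJECT, re.DOTALL)
```

In the first case the captured substring is "and" and the match starts at offset 6; in the second it starts at offset 0 and the capture spans the newline.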
1053
1054
1055 Beware of patterns that contain nested indefinite repeats.
1056 These can take a long time to run when applied to a string
1057 that does not match. Consider the pattern
1058 fragment
1059
1060
1061 (a+)*
1062
1063
1064 This can match
1065
1066
1067 An optimization catches some of the simpler cases such
1068 as
1069
1070
1071 (a+)*b
1072
1073
1074 where a literal character follows. Before embarking on the
1075 standard matching procedure, PCRE checks that there is a
1076
1077
1078 (a+)*\d
1079
1080
1081 with the pattern above. The former gives a failure almost
1082 instantly when applied to a whole line of
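
One way to see why nested indefinite repeats are costly: the outer repeat in (a+)* can divide a run of n identical characters into consecutive non-empty pieces in 2^(n-1) ways, and a backtracking matcher may have to try many of them before it can report failure. The sketch below only counts those divisions. It is not PCRE's matching algorithm, and count_splits is a hypothetical helper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into consecutive,
    non-empty pieces (each piece being one iteration of a+)."""
    if n == 0:
        return 1  # nothing left to cut: exactly one way
    # Choose the length k of the first piece, then split the remainder.
    return sum(count_splits(n - k) for k in range(1, n + 1))
```

Each extra character doubles the count: count_splits(4) is 8 and count_splits(20) is 524288.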
1083 !!DIFFERENCES FROM PERL
1084
1085
1086 The differences described here are with respect to Perl
1087 5.005.
1088
1089
1090 1. By default, a whitespace character is any character that
1091 the C library function __isspace()__ recognizes, though
1092 it is possible to compile PCRE with alternative character
1093 type tables. Normally __isspace()__ matches space,
1094 formfeed, newline, carriage return, horizontal tab, and
1095 vertical tab. Perl 5 no longer includes vertical tab in its
1096 set of whitespace characters. The \v escape that was in the
1097 Perl documentation for a long time was never in fact
1098 recognized. However, the character itself was treated as
1099 whitespace at least up to 5.002. In 5.004 and 5.005 it does
1100 not match \s.
1101
1102
1103 2. PCRE does not allow repeat quantifiers on lookahead
1104 assertions. Perl permits them, but they do not mean what you
1105 might think. For example, (?!a){3} does not assert that the
1106 next three characters are not "a"; it asserts three times that the next character is not "a".
1107
1108
1109 3. Capturing subpatterns that occur inside negative
1110 lookahead assertions are counted, but their entries in the
1111 offsets vector are never set. Perl sets its numerical
1112 variables from any such patterns that are matched before the
1113 assertion fails to match something (thereby succeeding), but
1114 only if the negative lookahead assertion contains just one
1115 branch.
1116
1117
1118 4. Though binary zero characters are supported in the
1119 subject string, they are not allowed in a pattern string
1120 because it is passed as a normal C string, terminated by
1121 zero. The escape sequence \0 can be used in the pattern to represent a binary zero.
1122
1123
1124 5. The following Perl escape sequences are not supported: \l,
1125 \u, \L, \U, \E, \Q. In fact these are implemented by Perl's
1126 general string-handling and are not part of its pattern
1127 matching engine.
1128
1129
1130 6. The Perl \G assertion is not supported as it is not
1131 relevant to single pattern matches.
1132
1133
1134 7. Fairly obviously, PCRE does not support the (?{code})
1135 construction.
1136
1137
1138 8. There are at the time of writing some oddities in Perl
1139 5.005_02 concerned with the settings of captured strings
1140 when part of a pattern is repeated. For example, matching
1141
1142
1143 In Perl 5.004 $2 is set in both cases, and that is also true
1144 of PCRE. If in the future Perl changes to a consistent state
1145 that is different, PCRE may change to follow.
1146
1147
1148 9. Another as yet unresolved discrepancy is that in Perl
1149 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
1150
1151
1152 10. PCRE provides some extensions to the Perl regular
1153 expression facilities:
1154
1155
1156 (a) Although lookbehind assertions must match fixed length
1157 strings, each alternative branch of a lookbehind assertion
1158 can match a different length of string. Perl 5.005 requires
1159 them all to have the same length.
1160
1161
1162 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1163 set, the $ meta-character matches only at the very end of
1164 the string.
1165
1166
1167 (c) If PCRE_EXTRA is set, a backslash followed by a letter
1168 with no special meaning is faulted.
1169
1170
1171 (d) If PCRE_UNGREEDY is set, the greediness of the
1172 repetition quantifiers is inverted, that is, by default they
1173 are not greedy, but if followed by a question mark they
1174 are.
1175 !!LIMITATIONS
1176
1177
1178 There are some size limitations in PCRE but it is hoped that
1179 they will never in practice be relevant. The maximum length
1180 of a compiled pattern is 65539 (sic) bytes. All values in
1181 repeating quantifiers must be less than 65536. The maximum
1182 number of capturing subpatterns is 99. The maximum number of
1183 all parenthesized subpatterns, including capturing
1184 subpatterns, assertions, and other types of subpattern, is
1185 200.
1186
1187
1188 The maximum length of a subject string is the largest
1189 positive number that an integer variable can hold. However,
1190 PCRE uses recursion to handle subpatterns and indefinite
1191 repetition. This means that the available stack space may
1192 limit the size of a subject string that can be processed by
1193 certain patterns.
1194 !!AUTHOR
1195
1196
1197 Philip Hazel
1198 University Computing Service,
1199 New Museums Site,
1200 Cambridge CB2 3QG, England.
1201 Phone: +44 1223 334714
1202
1203
1204 Copyright (c) 1997-1999 University of
1205 Cambridge.
1206 ----