version 1, including all changes.
.
Rev |
Author |
# |
Line |
1 |
perry |
1 |
PCRE |
|
|
2 |
!!!PCRE |
|
|
3 |
NAME |
|
|
4 |
REGULAR EXPRESSION DETAILS |
|
|
5 |
BACKSLASH |
|
|
6 |
CIRCUMFLEX AND DOLLAR |
|
|
7 |
FULL STOP (PERIOD, DOT) |
|
|
8 |
SQUARE BRACKETS |
|
|
9 |
POSIX CHARACTER CLASSES |
|
|
10 |
VERTICAL BAR |
|
|
11 |
INTERNAL OPTION SETTING |
|
|
12 |
SUBPATTERNS |
|
|
13 |
REPETITION |
|
|
14 |
BACK REFERENCES |
|
|
15 |
ASSERTIONS |
|
|
16 |
ONCE-ONLY SUBPATTERNS |
|
|
17 |
CONDITIONAL SUBPATTERNS |
|
|
18 |
COMMENTS |
|
|
19 |
RECURSIVE PATTERNS |
|
|
20 |
PERFORMANCE |
|
|
21 |
UTF-8 SUPPORT |
|
|
22 |
DIFFERENCES FROM PERL |
|
|
23 |
AUTHOR |
|
|
24 |
---- |
|
|
25 |
!!NAME |
|
|
26 |
|
|
|
27 |
|
|
|
28 |
pcre - Perl-compatible regular expressions: expression syntax. |
|
|
29 |
!!REGULAR EXPRESSION DETAILS |
|
|
30 |
|
|
|
31 |
|
|
|
32 |
The syntax and semantics of the regular expressions |
|
|
33 |
supported by PCRE are described below. Regular expressions |
|
|
34 |
are also described in the Perl documentation and in a number |
|
|
35 |
of other books, some of which have copious examples. Jeffrey |
|
|
36 |
Friedl's |
|
|
37 |
|
|
|
38 |
|
|
|
39 |
The description here is intended as reference documentation. |
|
|
40 |
The basic operation of PCRE is on strings of bytes. However, |
|
|
41 |
there are the beginnings of some support for UTF-8 character |
|
|
42 |
strings. To use this support you must configure PCRE to |
|
|
43 |
include it, and then call __pcre_compile()__ with the |
|
|
44 |
PCRE_UTF8 option. How this affects the pattern matching is |
|
|
45 |
described in the final section of this |
|
|
46 |
document. |
|
|
47 |
|
|
|
48 |
|
|
|
49 |
A regular expression is a pattern that is matched against a |
|
|
50 |
subject string from left to right. Most characters stand for |
|
|
51 |
themselves in a pattern, and match the corresponding |
|
|
52 |
characters in the subject. As a trivial example, the |
|
|
53 |
pattern |
|
|
54 |
|
|
|
55 |
|
|
|
56 |
The quick brown fox |
|
|
57 |
|
|
|
58 |
|
|
|
59 |
matches a portion of a subject string that is identical to |
|
|
60 |
itself. The power of regular expressions comes from the |
|
|
61 |
ability to include alternatives and repetitions in the |
|
|
62 |
pattern. These are encoded in the pattern by the use of |
|
|
63 |
''meta-characters'', which do not stand for themselves |
|
|
64 |
but instead are interpreted in some special |
|
|
65 |
way. |
|
|
66 |
|
|
|
67 |
|
|
|
68 |
There are two different sets of meta-characters: those that |
|
|
69 |
are recognized anywhere in the pattern except within square |
|
|
70 |
brackets, and those that are recognized in square brackets. |
|
|
71 |
Outside square brackets, the meta-characters are as |
|
|
72 |
follows: |
|
|
73 |
|
|
|
74 |
|
|
|
75 |
\ general escape character with several uses; ^ assert start |
|
|
76 |
of subject (or line, in multiline mode); $ assert end of |
|
|
77 |
subject (or line, in multiline mode); . match any character |
|
|
78 |
except newline (by default); [[ start character class |
|
|
79 |
definition; | start of alternative branch; ( start subpattern; |
|
|
80 |
) end subpattern; ? extends the meaning of (, also 0 or 1 |
|
|
81 |
quantifier, also quantifier minimizer; * 0 or more quantifier; |
|
|
82 |
+ 1 or more quantifier; { start min/max |
|
|
83 |
quantifier |
|
|
84 |
|
|
|
85 |
|
|
|
86 |
Part of a pattern that is in square brackets is called a |
|
|
87 |
|
|
|
88 |
|
|
|
89 |
\ general escape character; ^ negate the class, but only if |
|
|
90 |
the first character; - indicates character range; ] terminates |
|
|
91 |
the character class |
|
|
92 |
|
|
|
93 |
|
|
|
94 |
The following sections describe the use of each of the |
|
|
95 |
meta-characters. |
|
|
96 |
!!BACKSLASH |
|
|
97 |
|
|
|
98 |
|
|
|
99 |
The backslash character has several uses. Firstly, if it is |
|
|
100 |
followed by a non-alphameric character, it takes away any |
|
|
101 |
special meaning that character may have. This use of |
|
|
102 |
backslash as an escape character applies both inside and |
|
|
103 |
outside character classes. |
|
|
104 |
|
|
|
105 |
|
|
|
106 |
For example, if you want to match a |
|
|
107 |
|
|
|
108 |
|
|
|
109 |
If a pattern is compiled with the PCRE_EXTENDED option, |
|
|
110 |
whitespace in the pattern (other than in a character class) |
|
|
111 |
and characters between a |
|
|
112 |
|
|
|
113 |
|
|
|
114 |
A second use of backslash provides a way of encoding |
|
|
115 |
non-printing characters in patterns in a visible manner. |
|
|
116 |
There is no restriction on the appearance of non-printing |
|
|
117 |
characters, apart from the binary zero that terminates a |
|
|
118 |
pattern, but when a pattern is being prepared by text |
|
|
119 |
editing, it is usually easier to use one of the following |
|
|
120 |
escape sequences than the binary character it |
|
|
121 |
represents: |
|
|
122 |
|
|
|
123 |
|
|
|
124 |
\a alarm, that is, the BEL character (hex 07) \cx |
|
|
125 |
|
|
|
126 |
|
|
|
127 |
The precise effect of |
|
|
128 |
|
|
|
129 |
|
|
|
130 |
After |
|
|
131 |
|
|
|
132 |
|
|
|
133 |
After |
|
|
134 |
|
|
|
135 |
|
|
|
136 |
The handling of a backslash followed by a digit other than 0 |
|
|
137 |
is complicated. Outside a character class, PCRE reads it and |
|
|
138 |
any following digits as a decimal number. If the number is |
|
|
139 |
less than 10, or if there have been at least that many |
|
|
140 |
previous capturing left parentheses in the expression, the |
|
|
141 |
entire sequence is taken as a ''back reference''. A |
|
|
142 |
description of how this works is given later, following the |
|
|
143 |
discussion of parenthesized subpatterns. |
|
|
144 |
|
|
|
145 |
|
|
|
146 |
Inside a character class, or if the decimal number is |
|
|
147 |
greater than 9 and there have not been that many capturing |
|
|
148 |
subpatterns, PCRE re-reads up to three octal digits |
|
|
149 |
following the backslash, and generates a single byte from |
|
|
150 |
the least significant 8 bits of the value. Any subsequent |
|
|
151 |
digits stand for themselves. For example: |
|
|
152 |
|
|
|
153 |
|
|
|
154 |
\040 is another way of writing a space; \40 is the same, |
|
|
155 |
provided there are fewer than 40 previous capturing |
|
|
156 |
subpatterns; \7 is always a back reference; \11 might be a back |
|
|
157 |
reference, or another way of writing a tab; \011 is always a |
|
|
158 |
tab; \0113 is a tab followed by the character |
|
|
159 |
|
|
|
160 |
|
|
|
161 |
Note that octal values of 100 or greater must not be |
|
|
162 |
introduced by a leading zero, because no more than three |
|
|
163 |
octal digits are ever read. |
|
|
164 |
|
|
|
165 |
|
|
|
166 |
All the sequences that define a single byte value can be |
|
|
167 |
used both inside and outside character classes. In addition, |
|
|
168 |
inside a character class, the sequence |
|
|
169 |
|
|
|
170 |
|
|
|
171 |
The third use of backslash is for specifying generic |
|
|
172 |
character types: |
|
|
173 |
|
|
|
174 |
|
|
|
175 |
\d any decimal digit; \D any character that is not a decimal |
|
|
176 |
digit; \s any whitespace character; \S any character that is not |
|
|
177 |
a whitespace character; \w any |
|
|
178 |
|
|
|
179 |
|
|
|
180 |
Each pair of escape sequences partitions the complete set of |
|
|
181 |
characters into two disjoint sets. Any given character |
|
|
182 |
matches one, and only one, of each pair. |
|
|
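The partition property can be checked mechanically. The sketch below uses Python's re module as a stand-in for PCRE. That is an assumption: it is a different engine, although the three pairs of character types are written the same way and behave alike for ASCII input. The helper name matches_exactly_one is hypothetical.

```python
import re

# Each tuple holds a character type and its complement.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]


def matches_exactly_one(ch: str) -> bool:
    """Return True if ch matches exactly one member of every pair."""
    for lower, upper in PAIRS:
        in_lower = re.fullmatch(lower, ch) is not None
        in_upper = re.fullmatch(upper, ch) is not None
        if in_lower == in_upper:  # both matched, or neither did
            return False
    return True


# Check every ASCII character, including newline and the control codes.
ascii_ok = all(matches_exactly_one(chr(code)) for code in range(128))
```

The check covers ASCII only, because the treatment of characters above 127 depends on the engine and on locale or Unicode settings.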
183 |
|
|
|
184 |
|
|
|
185 |
A |
|
|
186 |
|
|
|
187 |
|
|
|
188 |
These character type sequences can appear both inside and |
|
|
189 |
outside character classes. They each match one character of |
|
|
190 |
the appropriate type. If the current matching point is at |
|
|
191 |
the end of the subject string, all of them fail, since there |
|
|
192 |
is no character to match. |
|
|
193 |
|
|
|
194 |
|
|
|
195 |
The fourth use of backslash is for certain simple |
|
|
196 |
assertions. An assertion specifies a condition that has to |
|
|
197 |
be met at a particular point in a match, without consuming |
|
|
198 |
any characters from the subject string. The use of |
|
|
199 |
subpatterns for more complicated assertions is described |
|
|
200 |
below. The backslashed assertions are |
|
|
201 |
|
|
|
202 |
|
|
|
203 |
\b word boundary; \B not a word boundary; \A start of subject |
|
|
204 |
(independent of multiline mode); \Z end of subject or newline |
|
|
205 |
at end (independent of multiline mode); \z end of subject |
|
|
206 |
(independent of multiline mode) |
|
|
207 |
|
|
|
208 |
|
|
|
209 |
These assertions may not appear in character classes (but |
|
|
210 |
note that |
|
|
211 |
|
|
|
212 |
|
|
|
213 |
A word boundary is a position in the subject string where |
|
|
214 |
the current character and the previous character do not both |
|
|
215 |
match \w or \W (i.e. one matches \w and the other matches \W), |
|
|
216 |
or the start or end of the string if the first or last |
|
|
217 |
character matches \w, respectively. |
|
|
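A short illustration of the word-boundary assertions, written as a sketch with Python's re module. This assumes that \b and \B follow the definition above for ASCII text; the subject string is made up for the example.

```python
import re

subject = "cat concat cat."

# \bcat\b: "cat" with a word boundary on both sides.
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \Bcat: "cat" that does NOT begin at a word boundary (the tail of "concat").
inner_starts = [m.start() for m in re.finditer(r"\Bcat", subject)]

# An assertion consumes nothing: \b alone matches the empty string
# at the first boundary, which is the start of the subject here.
first_boundary = re.search(r"\b", subject)
```

The final "cat" still counts as a whole word because the full stop after it matches \W.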
218 |
|
|
|
219 |
|
|
|
220 |
The \A, \Z, and \z assertions differ from the traditional |
|
|
221 |
circumflex and dollar (described below) in that they only |
|
|
222 |
ever match at the very start and end of the subject string, |
|
|
223 |
whatever options are set. They are not affected by the |
|
|
224 |
PCRE_NOTBOL or PCRE_NOTEOL options. If the |
|
|
225 |
''startoffset'' argument of __pcre_exec()__ is |
|
|
226 |
non-zero, \A can never match. The difference between \Z and \z |
|
|
227 |
is that \Z matches before a newline that is the last |
|
|
228 |
character of the string as well as at the end of the string, |
|
|
229 |
whereas \z matches only at the end. |
|
|
230 |
!!CIRCUMFLEX AND DOLLAR |
|
|
231 |
|
|
|
232 |
|
|
|
233 |
Outside a character class, in the default matching mode, the |
|
|
234 |
circumflex character is an assertion which is true only if |
|
|
235 |
the current matching point is at the start of the subject |
|
|
236 |
string. If the ''startoffset'' argument of |
|
|
237 |
__pcre_exec()__ is non-zero, circumflex can never match. |
|
|
238 |
Inside a character class, circumflex has an entirely |
|
|
239 |
different meaning (see below). |
|
|
240 |
|
|
|
241 |
|
|
|
242 |
Circumflex need not be the first character of the pattern if |
|
|
243 |
a number of alternatives are involved, but it should be the |
|
|
244 |
first thing in each alternative in which it appears if the |
|
|
245 |
pattern is ever to match that branch. If all possible |
|
|
246 |
alternatives start with a circumflex, that is, if the |
|
|
247 |
pattern is constrained to match only at the start of the |
|
|
248 |
subject, it is said to be an |
|
|
249 |
|
|
|
250 |
|
|
|
251 |
A dollar character is an assertion which is true only if the |
|
|
252 |
current matching point is at the end of the subject string, |
|
|
253 |
or immediately before a newline character that is the last |
|
|
254 |
character in the string (by default). Dollar need not be the |
|
|
255 |
last character of the pattern if a number of alternatives |
|
|
256 |
are involved, but it should be the last item in any branch |
|
|
257 |
in which it appears. Dollar has no special meaning in a |
|
|
258 |
character class. |
|
|
259 |
|
|
|
260 |
|
|
|
261 |
The meaning of dollar can be changed so that it matches only |
|
|
262 |
at the very end of the string, by setting the |
|
|
263 |
PCRE_DOLLAR_ENDONLY option at compile or matching time. This |
|
|
264 |
does not affect the \Z assertion. |
|
|
265 |
|
|
|
266 |
|
|
|
267 |
The meanings of the circumflex and dollar characters are |
|
|
268 |
changed if the PCRE_MULTILINE option is set. When this is |
|
|
269 |
the case, they match immediately after and immediately |
|
|
270 |
before an internal |
|
|
271 |
startoffset'' |
|
|
272 |
argument of __pcre_exec()__ is non-zero. The |
|
|
273 |
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is |
|
|
274 |
set. |
|
|
275 |
|
|
|
276 |
|
|
|
277 |
Note that the sequences \A, \Z, and \z can be used to match the |
|
|
278 |
start and end of the subject in both modes, and if all |
|
|
279 |
branches of a pattern start with \A it is always anchored, |
|
|
280 |
whether PCRE_MULTILINE is set or not. |
|
|
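The effect of multiline mode on circumflex and dollar can be sketched as follows. Python's re module is used as a stand-in, on the assumption that its re.MULTILINE flag plays the role of PCRE_MULTILINE for this subject string. Only \A is shown from the backslashed assertions, because Python's \Z does not behave like the \Z described here.

```python
import re

subject = "first line\nsecond line\n"

# Default mode: ^ matches only at the very start of the subject.
default_starts = [m.start() for m in re.finditer(r"^\w+", subject)]

# Multiline mode: ^ also matches immediately after an internal newline.
multiline_starts = [
    m.start() for m in re.finditer(r"^\w+", subject, re.MULTILINE)
]

# Default mode: $ matches at the end, or just before a final newline.
dollar_default = [m.start() for m in re.finditer(r"line$", subject)]

# Multiline mode: $ also matches before each internal newline.
dollar_multiline = [
    m.start() for m in re.finditer(r"line$", subject, re.MULTILINE)
]

# \A stays anchored to the start of the subject in both modes.
anchored_starts = [
    m.start() for m in re.finditer(r"\A\w+", subject, re.MULTILINE)
]
```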
281 |
!!FULL STOP (PERIOD, DOT) |
|
|
282 |
|
|
|
283 |
|
|
|
284 |
Outside a character class, a dot in the pattern matches any |
|
|
285 |
one character in the subject, including a non-printing |
|
|
286 |
character, but not (by default) newline. If the PCRE_DOTALL |
|
|
287 |
option is set, dots match newlines as well. The handling of |
|
|
288 |
dot is entirely independent of the handling of circumflex |
|
|
289 |
and dollar, the only relationship being that they both |
|
|
290 |
involve newline characters. Dot has no special meaning in a |
|
|
291 |
character class. |
|
|
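A minimal sketch of the dot behaviour, using Python's re module on the assumption that its re.DOTALL flag corresponds to PCRE_DOTALL:

```python
import re

subject = "a\nc"

# By default the dot does not match a newline, so this is None.
default_match = re.fullmatch(r"a.c", subject)

# With DOTALL set, the dot matches the newline as well.
dotall_match = re.fullmatch(r"a.c", subject, re.DOTALL)

# Inside a character class the dot is an ordinary character.
class_literal = re.fullmatch(r"a[.]c", "a.c")
class_other = re.fullmatch(r"a[.]c", "abc")
```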
292 |
!!SQUARE BRACKETS |
|
|
293 |
|
|
|
294 |
|
|
|
295 |
An opening square bracket introduces a character class, |
|
|
296 |
terminated by a closing square bracket. A closing square |
|
|
297 |
bracket on its own is not special. If a closing square |
|
|
298 |
bracket is required as a member of the class, it should be |
|
|
299 |
the first data character in the class (after an initial |
|
|
300 |
circumflex, if present) or escaped with a |
|
|
301 |
backslash. |
|
|
302 |
|
|
|
303 |
|
|
|
304 |
A character class matches a single character in the subject; |
|
|
305 |
the character must be in the set of characters defined by |
|
|
306 |
the class, unless the first character in the class is a |
|
|
307 |
circumflex, in which case the subject character must not be |
|
|
308 |
in the set defined by the class. If a circumflex is actually |
|
|
309 |
required as a member of the class, ensure it is not the |
|
|
310 |
first character, or escape it with a backslash. |
|
|
311 |
|
|
|
312 |
|
|
|
313 |
For example, the character class [[aeiou] matches any lower |
|
|
314 |
case vowel, while [[^aeiou] matches any character that is not |
|
|
315 |
a lower case vowel. Note that a circumflex is just a |
|
|
316 |
convenient notation for specifying the characters which are |
|
|
317 |
in the class by enumerating those that are not. It is not an |
|
|
318 |
assertion: it still consumes a character from the subject |
|
|
319 |
string, and fails if the current pointer is at the end of |
|
|
320 |
the string. |
|
|
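The vowel classes above can be tried out with a short sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption, although simple classes like these behave the same way in both. The subject strings are invented for the example.

```python
import re

# A class matches one character from the listed set.
vowels = re.findall(r"[aeiou]", "quick brown")

# A leading circumflex negates the class.
non_vowels = re.findall(r"[^aeiou]", "quick")

# A negated class is not an assertion: it still needs a character to
# consume, so it fails when "k" is the last character of the subject.
at_end = re.search(r"k[^aeiou]", "quick")

# A circumflex that is not first in the class is an ordinary member.
literal_caret = re.findall(r"[a^]", "a^b")
```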
321 |
|
|
|
322 |
|
|
|
323 |
When caseless matching is set, any letters in a class |
|
|
324 |
represent both their upper case and lower case versions, so |
|
|
325 |
for example, a caseless [[aeiou] matches |
|
|
326 |
|
|
|
327 |
|
|
|
328 |
The newline character is never treated in any special way in |
|
|
329 |
character classes, whatever the setting of the PCRE_DOTALL |
|
|
330 |
or PCRE_MULTILINE options is. A class such as [[^a] will |
|
|
331 |
always match a newline. |
|
|
332 |
|
|
|
333 |
|
|
|
334 |
The minus (hyphen) character can be used to specify a range |
|
|
335 |
of characters in a character class. For example, [[d-m] |
|
|
336 |
matches any letter between d and m, inclusive. If a minus |
|
|
337 |
character is required in a class, it must be escaped with a |
|
|
338 |
backslash or appear in a position where it cannot be |
|
|
339 |
interpreted as indicating a range, typically as the first or |
|
|
340 |
last character in the class. |
|
|
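The three ways of handling a hyphen in a class can be compared directly. This is a sketch with Python's re module, assumed here to treat ranges and literal hyphens in the same way as PCRE:

```python
import re

# A range: any letter from d to m inclusive.
in_range = re.findall(r"[d-m]", "abcdefmnz")

subject = "a-m-z"

# A literal hyphen, escaped with a backslash.
escaped = re.findall(r"[a\-z]", subject)

# A literal hyphen as the first character of the class.
leading = re.findall(r"[-az]", subject)

# A literal hyphen as the last character of the class.
trailing = re.findall(r"[az-]", subject)
```

In all three literal forms the "m" is left out of the result, which shows that no range from a to z was formed.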
341 |
|
|
|
342 |
|
|
|
343 |
It is not possible to have the literal character |
|
|
344 |
|
|
|
345 |
|
|
|
346 |
Ranges operate in ASCII collating sequence. They can also be |
|
|
347 |
used for characters specified numerically, for example |
|
|
348 |
[[\000-\037]. If a range that includes letters is used when |
|
|
349 |
caseless matching is set, it matches the letters in either |
|
|
350 |
case. For example, [[W-c] is equivalent to [[][[^_`wxyzabc], |
|
|
351 |
matched caselessly, and if character tables for the |
|
|
352 |
|
|
|
353 |
|
|
|
354 |
The character types \d, \D, \s, \S, \w, and \W may also appear in |
|
|
355 |
a character class, and add the characters that they match to |
|
|
356 |
the class. For example, [[\dABCDEF] matches any hexadecimal |
|
|
357 |
digit. A circumflex can conveniently be used with the upper |
|
|
358 |
case character types to specify a more restricted set of |
|
|
359 |
characters than the matching lower case type. For example, |
|
|
360 |
the class [[^\W_] matches any letter or digit, but not |
|
|
361 |
underscore. |
|
|
362 |
|
|
|
363 |
|
|
|
364 |
All non-alphameric characters other than \, -, ^ (at the |
|
|
365 |
start) and the terminating ] are non-special in character |
|
|
366 |
classes, but it does no harm if they are |
|
|
367 |
escaped. |
|
|
368 |
!!POSIX CHARACTER CLASSES |
|
|
369 |
|
|
|
370 |
|
|
|
371 |
Perl 5.6 (not yet released at the time of writing) is going |
|
|
372 |
to support the POSIX notation for character classes, which |
|
|
373 |
uses names enclosed by [[: and :] within the enclosing square |
|
|
374 |
brackets. PCRE supports this notation. For |
|
|
375 |
example, |
|
|
376 |
|
|
|
377 |
|
|
|
378 |
[[01[[:alpha:]%] |
|
|
379 |
|
|
|
380 |
|
|
|
381 |
matches |
|
|
382 |
|
|
|
383 |
|
|
|
384 |
alnum letters and digits alpha letters ascii character codes |
|
|
385 |
0 - 127 cntrl control characters digit decimal digits (same |
|
|
386 |
as \d) graph printing characters, excluding space lower lower |
|
|
387 |
case letters print printing characters, including space |
|
|
388 |
punct printing characters, excluding letters and digits |
|
|
389 |
space white space (same as \s) upper upper case letters word |
|
|
390 |
|
|
|
391 |
|
|
|
392 |
The names |
|
|
393 |
|
|
|
394 |
|
|
|
395 |
[[12[[:^digit:]] |
|
|
396 |
|
|
|
397 |
|
|
|
398 |
matches |
|
|
399 |
!!VERTICAL BAR |
|
|
400 |
|
|
|
401 |
|
|
|
402 |
Vertical bar characters are used to separate alternative |
|
|
403 |
patterns. For example, the pattern |
|
|
404 |
|
|
|
405 |
|
|
|
406 |
gilbert|sullivan |
|
|
407 |
|
|
|
408 |
|
|
|
409 |
matches either |
|
|
410 |
!!INTERNAL OPTION SETTING |
|
|
411 |
|
|
|
412 |
|
|
|
413 |
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, |
|
|
414 |
and PCRE_EXTENDED can be changed from within the pattern by |
|
|
415 |
a sequence of Perl option letters enclosed between |
|
|
416 |
|
|
|
417 |
|
|
|
418 |
i for PCRE_CASELESS; m for PCRE_MULTILINE; s for PCRE_DOTALL; x |
|
|
419 |
for PCRE_EXTENDED |
|
|
420 |
|
|
|
421 |
|
|
|
422 |
For example, (?im) sets caseless, multiline matching. It is |
|
|
423 |
also possible to unset these options by preceding the letter |
|
|
424 |
with a hyphen, and a combined setting and unsetting such as |
|
|
425 |
(?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while |
|
|
426 |
unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. |
|
|
427 |
If a letter appears both before and after the hyphen, the |
|
|
428 |
option is unset. |
|
|
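As a rough illustration, an option letter at the start of a pattern has the same effect as passing the option to the compile call. The sketch uses Python's re module, which is an assumption with two caveats: Python accepts such global settings only at the very start of the pattern, and it does not accept the hyphenated unsetting form at top level, so only the setting form is shown.

```python
import re

# (?i) at the start of the pattern turns on caseless matching...
inline = re.fullmatch(r"(?i)abc", "ABC")

# ...which is the same as compiling with the caseless option set.
flagged = re.fullmatch(r"abc", "ABC", re.IGNORECASE)

# Without either, the match fails.
plain = re.fullmatch(r"abc", "ABC")

# (?im) sets caseless and multiline matching together.
combined = re.compile(r"(?im)^abc$")
lines = combined.findall("x\nABC\ny")
```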
429 |
|
|
|
430 |
|
|
|
431 |
The scope of these option changes depends on where in the |
|
|
432 |
pattern the setting occurs. For settings that are outside |
|
|
433 |
any subpattern (defined below), the effect is the same as if |
|
|
434 |
the options were set or unset at the start of matching. The |
|
|
435 |
following patterns all behave in exactly the same |
|
|
436 |
way: |
|
|
437 |
|
|
|
438 |
|
|
|
439 |
(?i)abc a(?i)bc ab(?i)c abc(?i) |
|
|
440 |
|
|
|
441 |
|
|
|
442 |
which in turn is the same as compiling the pattern abc with |
|
|
443 |
PCRE_CASELESS set. In other words, such |
|
|
444 |
|
|
|
445 |
|
|
|
446 |
If an option change occurs inside a subpattern, the effect |
|
|
447 |
is different. This is a change of behaviour in Perl 5.005. |
|
|
448 |
An option change inside a subpattern affects only that part |
|
|
449 |
of the subpattern that follows it, so |
|
|
450 |
|
|
|
451 |
|
|
|
452 |
(a(?i)b)c |
|
|
453 |
|
|
|
454 |
|
|
|
455 |
matches abc and aBc and no other strings (assuming |
|
|
456 |
PCRE_CASELESS is not used). By this means, options can be |
|
|
457 |
made to have different settings in different parts of the |
|
|
458 |
pattern. Any changes made in one alternative do carry on |
|
|
459 |
into subsequent branches within the same subpattern. For |
|
|
460 |
example, |
|
|
461 |
|
|
|
462 |
|
|
|
463 |
(a(?i)b|c) |
|
|
464 |
|
|
|
465 |
|
|
|
466 |
matches |
|
|
467 |
|
|
|
468 |
|
|
|
469 |
The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can |
|
|
470 |
be changed in the same way as the Perl-compatible options by |
|
|
471 |
using the characters U and X respectively. The (?X) flag |
|
|
472 |
setting is special in that it must always occur earlier in |
|
|
473 |
the pattern than any of the additional features it turns on, |
|
|
474 |
even when it is at top level. It is best put at the |
|
|
475 |
start. |
|
|
476 |
!!SUBPATTERNS |
|
|
477 |
|
|
|
478 |
|
|
|
479 |
Subpatterns are delimited by parentheses (round brackets), |
|
|
480 |
which can be nested. Marking part of a pattern as a |
|
|
481 |
subpattern does two things: |
|
|
482 |
|
|
|
483 |
|
|
|
484 |
1. It localizes a set of alternatives. For example, the |
|
|
485 |
pattern |
|
|
486 |
|
|
|
487 |
|
|
|
488 |
cat(aract|erpillar|) |
|
|
489 |
|
|
|
490 |
|
|
|
491 |
matches one of the words |
|
|
492 |
|
|
|
493 |
|
|
|
494 |
2. It sets up the subpattern as a capturing subpattern (as |
|
|
495 |
defined above). When the whole pattern matches, that portion |
|
|
496 |
of the subject string that matched the subpattern is passed |
|
|
497 |
back to the caller via the ''ovector'' argument of |
|
|
498 |
__pcre_exec()__. Opening parentheses are counted from |
|
|
499 |
left to right (starting from 1) to obtain the numbers of the |
|
|
500 |
capturing subpatterns. |
|
|
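Both functions of parentheses can be seen in a small sketch. It uses Python's re module rather than the ovector argument of pcre_exec(), so the way the captured substrings are retrieved is an assumption. The numbering rule is the same: opening parentheses are counted from left to right, starting from 1.

```python
import re

# 1. Parentheses localize a set of alternatives; the empty third
#    alternative lets the bare word "cat" match as well.
words = ["cat", "cataract", "caterpillar", "category"]
accepted = [
    w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)
]

# 2. Parentheses capture. Group 1 is the outer pair, group 2 the
#    first inner pair, group 3 the second inner pair.
match = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = {number: match.group(number) for number in (1, 2, 3)}
```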
501 |
|
|
|
502 |
|
|
|
503 |
For example, if the string |
|
|
504 |
|
|
|
505 |
|
|
|
506 |
the ((red|white) (king|queen)) |
|
|
507 |
|
|
|
508 |
|
|
|
509 |
the captured substrings are |
|
|
510 |
|
|
|
511 |
|
|
|
512 |
The fact that plain parentheses fulfil two functions is not |
|
|
513 |
always helpful. There are often times when a grouping |
|
|
514 |
subpattern is required without a capturing requirement. If |
|
|
515 |
an opening parenthesis is followed by |
|
|
516 |
|
|
|
517 |
|
|
|
518 |
the ((?:red|white) (king|queen)) |
|
|
519 |
|
|
|
520 |
|
|
|
521 |
the captured substrings are |
|
|
522 |
|
|
|
523 |
|
|
|
524 |
As a convenient shorthand, if any option settings are |
|
|
525 |
required at the start of a non-capturing subpattern, the |
|
|
526 |
option letters may appear between the |
|
|
527 |
|
|
|
528 |
|
|
|
529 |
(?i:saturday|sunday) (?:(?i)saturday|sunday) |
|
|
530 |
|
|
|
531 |
|
|
|
532 |
match exactly the same set of strings. Because alternative |
|
|
533 |
branches are tried from left to right, and options are not |
|
|
534 |
reset until the end of the subpattern is reached, an option |
|
|
535 |
setting in one branch does affect subsequent branches, so |
|
|
536 |
the above patterns match |
|
|
537 |
!!REPETITION |
|
|
538 |
|
|
|
539 |
|
|
|
540 |
Repetition is specified by quantifiers, which can follow any |
|
|
541 |
of the following items: |
|
|
542 |
|
|
|
543 |
|
|
|
544 |
a single character, possibly escaped; the . metacharacter; a |
|
|
545 |
character class; a back reference (see next section); a |
|
|
546 |
parenthesized subpattern (unless it is an assertion - see |
|
|
547 |
below) |
|
|
548 |
|
|
|
549 |
|
|
|
550 |
The general repetition quantifier specifies a minimum and |
|
|
551 |
maximum number of permitted matches, by giving the two |
|
|
552 |
numbers in curly brackets (braces), separated by a comma. |
|
|
553 |
The numbers must be less than 65536, and the first must be |
|
|
554 |
less than or equal to the second. For example: |
|
|
555 |
|
|
|
556 |
|
|
|
557 |
z{2,4} |
|
|
558 |
|
|
|
559 |
|
|
|
560 |
matches |
|
|
561 |
|
|
|
562 |
|
|
|
563 |
[[aeiou]{3,} |
|
|
564 |
|
|
|
565 |
|
|
|
566 |
matches at least 3 successive vowels, but may match many |
|
|
567 |
more, while |
|
|
568 |
|
|
|
569 |
|
|
|
570 |
\d{8} |
|
|
571 |
|
|
|
572 |
|
|
|
573 |
matches exactly 8 digits. An opening curly bracket that |
|
|
574 |
appears in a position where a quantifier is not allowed, or |
|
|
575 |
one that does not match the syntax of a quantifier, is taken |
|
|
576 |
as a literal character. For example, {,6} is not a |
|
|
577 |
quantifier, but a literal string of four |
|
|
578 |
characters. |
|
|
579 |
|
|
|
580 |
|
|
|
581 |
The quantifier {0} is permitted, causing the expression to |
|
|
582 |
behave as if the previous item and the quantifier were not |
|
|
583 |
present. |
|
|
584 |
|
|
|
585 |
|
|
|
586 |
For convenience (and historical compatibility) the three |
|
|
587 |
most common quantifiers have single-character |
|
|
588 |
abbreviations: |
|
|
589 |
|
|
|
590 |
|
|
|
591 |
* is equivalent to {0,}; + is equivalent to {1,}; ? is |
|
|
592 |
equivalent to {0,1} |
|
|
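The quantifier forms above can be exercised with a short sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption; the helper fully_matches is hypothetical. The {,6} case is deliberately left out, because Python treats it as a quantifier, unlike the behaviour described here.

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# z{2,4} accepts two, three or four repetitions of "z".
z_counts = [n for n in range(7) if fully_matches(r"z{2,4}", "z" * n)]

# [aeiou]{3,} needs at least three vowels and has no upper limit.
many_vowels = fully_matches(r"[aeiou]{3,}", "aeiouaeiou")
two_vowels = fully_matches(r"[aeiou]{3,}", "ae")

# \d{8} needs exactly eight digits.
eight_digits = fully_matches(r"\d{8}", "20240131")
seven_digits = fully_matches(r"\d{8}", "2024013")

# The single-character abbreviations agree with their long forms.
EQUIVALENTS = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
shorthand_agrees = all(
    fully_matches(short, "a" * n) == fully_matches(long_form, "a" * n)
    for short, long_form in EQUIVALENTS
    for n in range(4)
)
```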
593 |
|
|
|
594 |
|
|
|
595 |
It is possible to construct infinite loops by following a |
|
|
596 |
subpattern that can match no characters with a quantifier |
|
|
597 |
that has no upper limit, for example: |
|
|
598 |
|
|
|
599 |
|
|
|
600 |
(a?)* |
|
|
601 |
|
|
|
602 |
|
|
|
603 |
Earlier versions of Perl and PCRE used to give an error at |
|
|
604 |
compile time for such patterns. However, because there are |
|
|
605 |
cases where this can be useful, such patterns are now |
|
|
606 |
accepted, but if any repetition of the subpattern does in |
|
|
607 |
fact match no characters, the loop is forcibly |
|
|
608 |
broken. |
|
|
609 |
|
|
|
610 |
|
|
|
611 |
By default, the quantifiers are |
|
|
612 |
|
|
|
613 |
|
|
|
614 |
/\*.*\*/ |
|
|
615 |
|
|
|
616 |
|
|
|
617 |
to the string |
|
|
618 |
|
|
|
619 |
|
|
|
620 |
/* first command */ not comment /* second comment |
|
|
621 |
*/ |
|
|
622 |
|
|
|
623 |
|
|
|
624 |
fails, because it matches the entire string due to the |
|
|
625 |
greediness of the .* item. |
|
|
626 |
|
|
|
627 |
|
|
|
628 |
However, if a quantifier is followed by a question mark, it |
|
|
629 |
ceases to be greedy, and instead matches the minimum number |
|
|
630 |
of times possible, so the pattern |
|
|
631 |
|
|
|
632 |
|
|
|
633 |
/\*.*?\*/ |
|
|
634 |
|
|
|
635 |
|
|
|
636 |
does the right thing with the C comments. The meaning of the |
|
|
637 |
various quantifiers is not otherwise changed, just the |
|
|
638 |
preferred number of matches. Do not confuse this use of |
|
|
639 |
question mark with its use as a quantifier in its own right. |
|
|
640 |
Because it has two uses, it can sometimes appear doubled, as |
|
|
641 |
in |
|
|
642 |
|
|
|
643 |
|
|
|
644 |
\d??\d |
|
|
645 |
|
|
|
646 |
|
|
|
647 |
which matches one digit by preference, but can match two if |
|
|
648 |
that is the only way the rest of the pattern |
|
|
649 |
matches. |
|
|
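The difference between the greedy and minimizing forms can be reproduced on the C comment example. This is a sketch with Python's re module, assumed here to share PCRE's greedy and non-greedy semantics:

```python
import re

subject = "/* first command */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the whole string is matched.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first "*/" that lets the match succeed.
lazy = re.search(r"/\*.*?\*/", subject).group()

# Applied repeatedly, the minimizing form finds each comment in turn.
all_comments = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match
# (here: reaching the end of the subject) requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```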
650 |
|
|
|
651 |
|
|
|
652 |
If the PCRE_UNGREEDY option is set (an option which is not |
|
|
653 |
available in Perl), the quantifiers are not greedy by |
|
|
654 |
default, but individual ones can be made greedy by following |
|
|
655 |
them with a question mark. In other words, it inverts the |
|
|
656 |
default behaviour. |
|
|
657 |
|
|
|
658 |
|
|
|
659 |
When a parenthesized subpattern is quantified with a minimum |
|
|
660 |
repeat count that is greater than 1 or with a limited |
|
|
661 |
maximum, more store is required for the compiled pattern, in |
|
|
662 |
proportion to the size of the minimum or |
|
|
663 |
maximum. |
|
|
664 |
|
|
|
665 |
|
|
|
666 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL |
|
|
667 |
option (equivalent to Perl's /s) is set, thus allowing the . |
|
|
668 |
to match newlines, the pattern is implicitly anchored, |
|
|
669 |
because whatever follows will be tried against every |
|
|
670 |
character position in the subject string, so there is no |
|
|
671 |
point in retrying the overall match at any position after |
|
|
672 |
the first. PCRE treats such a pattern as though it were |
|
|
673 |
preceded by \A. In cases where it is known that the subject |
|
|
674 |
string contains no newlines, it is worth setting PCRE_DOTALL |
|
|
675 |
when the pattern begins with .* in order to obtain this |
|
|
676 |
optimization, or alternatively using ^ to indicate anchoring |
|
|
677 |
explicitly. |
|
|
678 |
|
|
|
679 |
|
|
|
680 |
When a capturing subpattern is repeated, the value captured |
|
|
681 |
is the substring that matched the final iteration. For |
|
|
682 |
example, after |
|
|
683 |
|
|
|
684 |
|
|
|
685 |
(tweedle[[dume]{3}\s*)+ |
|
|
686 |
|
|
|
687 |
|
|
|
688 |
has matched |
|
|
689 |
|
|
|
690 |
|
|
|
691 |
/(a|(b))+/ |
|
|
692 |
|
|
|
693 |
|
|
|
694 |
matches |
|
|
695 |
!!BACK REFERENCES |
|
|
696 |
|
|
|
697 |
|
|
|
698 |
Outside a character class, a backslash followed by a digit |
|
|
699 |
greater than 0 (and possibly further digits) is a back |
|
|
700 |
reference to a capturing subpattern earlier (i.e. to its |
|
|
701 |
left) in the pattern, provided there have been that many |
|
|
702 |
previous capturing left parentheses. |
|
|
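A back reference repeats the text that the group actually matched, not the group's pattern. The sketch below shows this with Python's re module, which is assumed to treat \1 as PCRE does in this simple case:

```python
import re

# \1 must repeat whatever group 1 captured in this particular match.
pattern = re.compile(r"(sens|respons)e and \1ibility")

same_stem_a = pattern.fullmatch("sense and sensibility")
same_stem_b = pattern.fullmatch("response and responsibility")

# Group 1 captured "sens", so "respons" is not accepted for \1,
# even though it is one of the group's alternatives.
mixed_stems = pattern.fullmatch("sense and responsibility")
```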
703 |
|
|
|
704 |
|
|
|
705 |
However, if the decimal number following the backslash is |
|
|
706 |
less than 10, it is always taken as a back reference, and |
|
|
707 |
causes an error only if there are not that many capturing |
|
|
708 |
left parentheses in the entire pattern. In other words, the |
|
|
709 |
parentheses that are referenced need not be to the left of |
|
|
710 |
the reference for numbers less than 10. See the section |
|
|
711 |
entitled |
|
|
712 |
|
|
|
713 |
|
|
|
714 |
A back reference matches whatever actually matched the |
|
|
715 |
capturing subpattern in the current subject string, rather |
|
|
716 |
than anything matching the subpattern itself. So the |
|
|
717 |
pattern |
|
|
718 |
|
|
|
719 |
|
|
|
720 |
(sens|respons)e and \1ibility |
|
|
721 |
|
|
|
722 |
|
|
|
723 |
matches |
|
|
724 |
|
|
|
725 |
|
|
|
726 |
((?i)rah)\s+\1 |
|
|
727 |
|
|
|
728 |
|
|
|
729 |
matches |
|
|
730 |
|
|
|
731 |
|
|
|
732 |
There may be more than one back reference to the same |
|
|
733 |
subpattern. If a subpattern has not actually been used in a |
|
|
734 |
particular match, any back references to it always fail. For |
|
|
735 |
example, the pattern |
|
|
736 |
|
|
|
737 |
|
|
|
738 |
(a|(bc))\2 |
|
|
739 |
|
|
|
740 |
|
|
|
741 |
always fails if it starts to match |
|
|
742 |
|
|
|
743 |
|
|
|
744 |
A back reference that occurs inside the parentheses to which |
|
|
745 |
it refers fails when the subpattern is first used, so, for |
|
|
746 |
example, (a\1) never matches. However, such references can be |
|
|
747 |
useful inside repeated subpatterns. For example, the |
|
|
748 |
pattern |
|
|
749 |
|
|
|
750 |
|
|
|
751 |
(a|b\1)+ |
|
|
752 |
|
|
|
753 |
|
|
|
754 |
matches any number of |
|
|
755 |
!!ASSERTIONS |
|
|
756 |
|
|
|
757 |
|
|
|
758 |
An assertion is a test on the characters following or |
|
|
759 |
preceding the current matching point that does not actually |
|
|
760 |
consume any characters. The simple assertions coded as \b, \B, |
|
|
761 |
\A, \Z, \z, ^ and $ are described above. More complicated |
|
|
762 |
assertions are coded as subpatterns. There are two kinds: |
|
|
763 |
those that look ahead of the current position in the subject |
|
|
764 |
string, and those that look behind it. |
|
|
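Lookahead assertions are written (?= for the positive form and (?! for the negative form. The following sketch shows both with Python's re module, on the assumption that it handles them as PCRE does; the subject strings are invented for the example.

```python
import re

# Positive lookahead: a word that is followed by a semicolon.
# The semicolon is tested but is not part of the match.
word = re.search(r"\w+(?=;)", "alpha; beta")
matched_text = word.group()
match_end = word.end()

# Negative lookahead: "foo" that is not followed by "bar".
foo_starts = [
    m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")
]
```

The end offset of the first match is 5, the position of the semicolon, which shows that the assertion left the current matching position unchanged.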
765 |
|
|
|
766 |
|
|
|
767 |
An assertion subpattern is matched in the normal way, except |
|
|
768 |
that it does not cause the current matching position to be |
|
|
769 |
changed. Lookahead assertions start with (?= for positive |
|
|
770 |
assertions and (?! for negative assertions. For |
|
|
771 |
example, |
|
|
772 |
|
|
|
773 |
|
|
|
774 |
\w+(?=;) |
|
|
775 |
|
|
|
776 |
|
|
|
777 |
matches a word followed by a semicolon, but does not include |
|
|
778 |
the semicolon in the match, and |
|
|
779 |
|
|
|
780 |
|
|
|
781 |
foo(?!bar) |
|
|
782 |
|
|
|
783 |
|
|
|
784 |
matches any occurrence of |
|
|
785 |
|
|
|
786 |
|
|
|
787 |
(?!foo)bar |
|
|
788 |
|
|
|
789 |
|
|
|
790 |
does not find an occurrence of |
|
|
791 |
|
|
|
792 |
|
|
|
793 |
Lookbehind assertions start with (? |
|
|
794 |
|
|
|
795 |
|
|
|
796 |
(? |
|
|
797 |
|
|
|
798 |
|
|
|
799 |
does find an occurrence of |
|
|
800 |
|
|
|
801 |
|
|
|
802 |
(?<=bullock|donkey) |
|
|
803 |
|
|
|
804 |
|
|
|
805 |
is permitted, but |
|
|
806 |
|
|
|
807 |
|
|
|
808 |
(?<!dogs?|cats?) |
|
|
809 |
|
|
|
810 |
|
|
|
811 |
causes an error at compile time. Branches that match |
|
|
812 |
different length strings are permitted only at the top level |
|
|
813 |
of a lookbehind assertion. This is an extension compared |
|
|
814 |
with Perl 5.005, which requires all branches to match the |
|
|
815 |
same length of string. An assertion such as |
|
|
816 |
|
|
|
817 |
|
|
|
818 |
(?<=ab(c|de)) |
|
|
819 |
|
|
|
820 |
|
|
|
821 |
is not permitted, because its single top-level branch can |
|
|
822 |
match two different lengths, but it is acceptable if |
|
|
823 |
rewritten to use two top-level branches: |
|
|
824 |
|
|
|
825 |
|
|
|
826 |
(?<=abc|abde) |
|
|
827 |
|
|
|
828 |
|
|
|
829 |
The implementation of lookbehind assertions is, for each |
|
|
830 |
alternative, to temporarily move the current position back |
|
|
831 |
by the fixed width and then try to match. If there are |
|
|
832 |
insufficient characters before the current position, the |
|
|
833 |
match is deemed to fail. Lookbehinds in conjunction with |
|
|
834 |
once-only subpatterns can be particularly useful for |
|
|
835 |
matching at the ends of strings; an example is given at the |
|
|
836 |
end of the section on once-only subpatterns. |
|
|
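The lookahead and lookbehind forms described above can be tried out in a few lines. This is a sketch only, assuming Python's re module as a stand-in for PCRE; the helper name where and the subject strings are invented for the illustration. Python is stricter than the rule given here, in that every alternative of a lookbehind must have the same width.

```python
import re

def where(pattern, subject):
    """Return the start offset of the first match, or None if there is none."""
    found = re.search(pattern, subject)
    return found.start() if found else None

# Lookahead: the semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "total;").group(0)

results = {
    # "foo" at offset 0 is followed by "bar"; the one at offset 7 is not.
    "foo(?!bar)": where(r"foo(?!bar)", "foobar foobaz"),
    # The assertion is tested where "bar" begins, so it cannot see "foo".
    "(?!foo)bar": where(r"(?!foo)bar", "foobar"),
    # Lookbehind inspects the three characters before "bar".
    "(?<!foo)bar in foobar": where(r"(?<!foo)bar", "foobar"),
    "(?<!foo)bar in bazbar": where(r"(?<!foo)bar", "bazbar"),
}

print(word)
for label, start in results.items():
    print(label, "->", start)
```

Only the lookbehind form rejects the "bar" in "foobar"; the negative lookahead placed in front of "bar" does not.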
837 |
|
|
|
838 |
|
|
|
839 |
Several assertions (of any sort) may occur in succession. |
|
|
840 |
For example, |
|
|
841 |
|
|
|
842 |
|
|
|
843 |
(?<=\d{3})(?<!999)foo |
|
|
844 |
|
|
|
845 |
|
|
|
846 |
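matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, and then there is a check that the same three characters are not "999". This pattern does |
|
|
847 |
''not'' match "foo" preceded by six characters, the first three of which are digits and the last three of which are not "999". For example, it does not match "123abcfoo". |
|
|
848 |
A pattern to do that is |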
|
|
849 |
|
|
|
850 |
|
|
|
851 |
(?<=\d{3}...)(?<!999)foo |
|
|
852 |
|
|
|
853 |
|
|
|
854 |
This time the first assertion looks at the preceding six |
|
|
855 |
characters, checking that the first three are digits, and |
|
|
856 |
then the second assertion checks that the preceding three |
|
|
857 |
characters are not "999". |
|
|
858 |
|
|
|
859 |
|
|
|
860 |
Assertions can be nested in any combination. For |
|
|
861 |
example, |
|
|
862 |
|
|
|
863 |
|
|
|
864 |
(?<=(?<!foo)bar)baz |
|
|
865 |
|
|
|
866 |
|
|
|
867 |
matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while |
|
|
868 |
|
|
|
869 |
|
|
|
870 |
(?<=\d{3}(?!999)...)foo |
|
|
871 |
|
|
|
872 |
|
|
|
873 |
is another pattern which matches "foo" preceded by three digits and any three characters that are not "999". |
|
|
874 |
|
|
|
875 |
|
|
|
876 |
Assertion subpatterns are not capturing subpatterns, and may |
|
|
877 |
not be repeated, because it makes no sense to assert the |
|
|
878 |
same thing several times. If any kind of assertion contains |
|
|
879 |
capturing subpatterns within it, these are counted for the |
|
|
880 |
purposes of numbering the capturing subpatterns in the whole |
|
|
881 |
pattern. However, substring capturing is carried out only |
|
|
882 |
for positive assertions, because it does not make sense for |
|
|
883 |
negative assertions. |
|
|
884 |
|
|
|
885 |
|
|
|
886 |
Assertions count towards the maximum of 200 parenthesized |
|
|
887 |
subpatterns. |
|
|
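The numbering and capturing rules for subpatterns inside assertions can be illustrated as follows. The sketch assumes Python's re module, whose behaviour on these two made-up patterns happens to agree with the description above; it is not a PCRE call.

```python
import re

# Group 1 sits inside a positive lookahead: it is numbered and captured,
# even though the assertion itself consumes nothing.
pos = re.match(r"(?=(\d+))\d", "2024")

# Group 1 sits inside a negative lookahead: it still counts for numbering
# (the later group is number 2), but it never holds a value.
neg = re.match(r"(?!(a))(\w)", "b")

print(pos.group(0), pos.group(1))
print(neg.group(1), neg.group(2), neg.re.groups)
```

The positive assertion leaves "2024" in group 1 while the match itself is only "2"; the group inside the negative assertion stays unset.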
888 |
!!ONCE-ONLY SUBPATTERNS |
|
|
889 |
|
|
|
890 |
|
|
|
891 |
With both maximizing and minimizing repetition, failure of |
|
|
892 |
what follows normally causes the repeated item to be |
|
|
893 |
re-evaluated to see if a different number of repeats allows |
|
|
894 |
the rest of the pattern to match. Sometimes it is useful to |
|
|
895 |
prevent this, either to change the nature of the match, or |
|
|
896 |
to cause it to fail earlier than it otherwise might, when the |
|
|
897 |
author of the pattern knows there is no point in carrying |
|
|
898 |
on. |
|
|
899 |
|
|
|
900 |
|
|
|
901 |
Consider, for example, the pattern \d+foo when applied to the |
|
|
902 |
subject line |
|
|
903 |
|
|
|
904 |
|
|
|
905 |
123456bar |
|
|
906 |
|
|
|
907 |
|
|
|
908 |
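After matching all 6 digits and then failing to match "foo", the normal action of the matcher is to try again with only 5 digits matching the \d+ item, and then with 4, and so on, before ultimately failing. Once-only subpatterns provide the means for specifying that once a portion of the pattern has matched, it is not to be re-evaluated in this way, so the matcher would give up immediately on failing to match "foo" the first time. The notation is another kind of special parenthesis, starting with (?> as in this example: |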
|
|
909 |
|
|
|
910 |
|
|
|
911 |
(?>\d+)foo |
|
|
912 |
|
|
|
913 |
|
|
|
914 |
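This kind of parenthesis "locks up" the part of the pattern it contains once it has matched, and a failure further into the pattern is prevented from backtracking into it. Backtracking past it to previous items, however, works as normal. |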
|
|
915 |
|
|
|
916 |
|
|
|
917 |
An alternative description is that a subpattern of this type |
|
|
918 |
matches the string of characters that an identical |
|
|
919 |
standalone pattern would match, if anchored at the current |
|
|
920 |
point in the subject string. |
|
|
921 |
|
|
|
922 |
|
|
|
923 |
Once-only subpatterns are not capturing subpatterns. Simple |
|
|
924 |
cases such as the above example can be thought of as a |
|
|
925 |
maximizing repeat that must swallow everything it can. So, |
|
|
926 |
while both \d+ and \d+? are prepared to adjust the number of |
|
|
927 |
digits they match in order to make the rest of the pattern |
|
|
928 |
match, (?>\d+) can only match an entire sequence of digits. |
|
|
929 |
|
|
|
930 |
|
|
|
931 |
This construction can of course contain arbitrarily |
|
|
932 |
complicated subpatterns, and it can be nested. |
|
|
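The effect of a once-only subpattern on the nature of a match can be shown with a small experiment. This sketch assumes Python's re module rather than PCRE. Because Python only gained the (?>...) syntax in version 3.11, the once-only group is emulated here with a lookahead followed by a back reference, which is a different construct with the same "no giving back" effect; the group name run and the subject are made up.

```python
import re

SUBJECT = "123456"

# Ordinary greedy repeat: \d+ first takes all six digits, then gives one
# back so that the literal "6" can match.
greedy = re.fullmatch(r"\d+6", SUBJECT)

# Emulated once-only group: the lookahead captures the longest run of
# digits, the back reference consumes exactly that run, and nothing can be
# given back afterwards, so the literal "6" has nothing left to match.
once_only = re.fullmatch(r"(?=(?P<run>\d+))(?P=run)6", SUBJECT)

print(bool(greedy))
print(bool(once_only))
```

The greedy form matches and the emulated once-only form does not, which is the change in the nature of the match that the text refers to.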
933 |
|
|
|
934 |
|
|
|
935 |
Once-only subpatterns can be used in conjunction with |
|
|
936 |
lookbehind assertions to specify efficient matching at the |
|
|
937 |
end of the subject string. Consider a simple pattern such |
|
|
938 |
as |
|
|
939 |
|
|
|
940 |
|
|
|
941 |
abcd$ |
|
|
942 |
|
|
|
943 |
|
|
|
944 |
when applied to a long string which does not match. Because |
|
|
945 |
matching proceeds from left to right, PCRE will look for |
|
|
946 |
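each "a" in the subject and then see if what follows matches the rest of the pattern. If the pattern is specified as |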
|
|
947 |
|
|
|
948 |
|
|
|
949 |
^.*abcd$ |
|
|
950 |
|
|
|
951 |
|
|
|
952 |
the initial .* matches the entire string at first, but when |
|
|
953 |
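this fails (because there is no following "a"), it backtracks to match all but the last character, then all but the last two characters, and so on. Once again the search for "a" covers the entire string, from right to left, so we are no better off. However, if the pattern is written as |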
|
|
954 |
|
|
|
955 |
|
|
|
956 |
^(?>.*)(?<=abcd) |
|
|
957 |
|
|
|
958 |
|
|
|
959 |
there can be no backtracking for the .* item; it can match |
|
|
960 |
only the entire string. The subsequent lookbehind assertion |
|
|
961 |
does a single test on the last four characters. If it fails, |
|
|
962 |
the match fails immediately. For long strings, this approach |
|
|
963 |
makes a significant difference to the processing |
|
|
964 |
time. |
|
|
965 |
|
|
|
966 |
|
|
|
967 |
When a pattern contains an unlimited repeat inside a |
|
|
968 |
subpattern that can itself be repeated an unlimited number |
|
|
969 |
of times, the use of a once-only subpattern is the only way |
|
|
970 |
to avoid some failing matches taking a very long time |
|
|
971 |
indeed. The pattern |
|
|
972 |
|
|
|
973 |
|
|
|
974 |
(\D+|<\d+>)*[[!?] |
|
|
975 |
|
|
|
976 |
|
|
|
977 |
matches an unlimited number of substrings that either |
|
|
978 |
consist of non-digits, or digits enclosed in <>, followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to |
|
|
979 |
|
|
|
980 |
|
|
|
981 |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
|
|
982 |
|
|
|
983 |
|
|
|
984 |
it takes a long time before reporting failure. This is |
|
|
985 |
because the string can be divided between the two repeats in |
|
|
986 |
a large number of ways, and all have to be tried. (The |
|
|
987 |
example used [[!?] rather than a single character at the end, |
|
|
988 |
because both PCRE and Perl have an optimization that allows |
|
|
989 |
for fast failure when a single character is used. They |
|
|
990 |
remember the last single character that is required for a |
|
|
991 |
match, and fail early if it is not present in the string.) |
|
|
992 |
If the pattern is changed to |
|
|
993 |
|
|
|
994 |
|
|
|
995 |
((?>\D+)|<\d+>)*[[!?] |
|
|
996 |
|
|
|
997 |
|
|
|
998 |
sequences of non-digits cannot be broken, and failure |
|
|
999 |
happens quickly. |
|
|
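The two forms of the nested-repeat pattern can be compared on a subject that cannot match. This is a sketch under stated assumptions: it uses Python's re module, not PCRE, and emulates the once-only group with a lookahead plus back reference (the group name run is invented). The subject is kept deliberately short, because the work done by the plain form grows very quickly with each extra character.

```python
import re

# Plain form: the run of non-digits can be split between the inner and
# outer repeats in many ways, and every split is tried before giving up.
plain = re.compile(r"(\D+|<\d+>)*[!?]")

# Emulated once-only form: each run of non-digits is taken whole.
once_only = re.compile(r"((?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]")

failing = "a" * 16      # no "!" or "?" anywhere, so neither form can match
matching = "<12>!"      # digits in angle brackets followed by "!"

plain_fail = plain.match(failing)
once_only_fail = once_only.match(failing)
plain_ok = plain.fullmatch(matching)
once_only_ok = once_only.fullmatch(matching)

print(plain_fail, once_only_fail)
print(bool(plain_ok), bool(once_only_ok))
```

Both forms agree on these two subjects; the difference lies in how much backtracking the plain form needs before it can report the failure.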
1000 |
!!CONDITIONAL SUBPATTERNS |
|
|
1001 |
|
|
|
1002 |
|
|
|
1003 |
It is possible to cause the matching process to obey a |
|
|
1004 |
subpattern conditionally or to choose between two |
|
|
1005 |
alternative subpatterns, depending on the result of an |
|
|
1006 |
assertion, or whether a previous capturing subpattern |
|
|
1007 |
matched or not. The two possible forms of conditional |
|
|
1008 |
subpattern are |
|
|
1009 |
|
|
|
1010 |
|
|
|
1011 |
(?(condition)yes-pattern) |
|
|
1012 |
(?(condition)yes-pattern|no-pattern) |
|
|
1013 |
|
|
|
1014 |
|
|
|
1015 |
If the condition is satisfied, the yes-pattern is used; |
|
|
1016 |
otherwise the no-pattern (if present) is used. If there are |
|
|
1017 |
more than two alternatives in the subpattern, a compile-time |
|
|
1018 |
error occurs. |
|
|
1019 |
|
|
|
1020 |
|
|
|
1021 |
There are two kinds of condition. If the text between the |
|
|
1022 |
parentheses consists of a sequence of digits, the condition |
|
|
1023 |
is satisfied if the capturing subpattern of that number has |
|
|
1024 |
previously matched. Consider the following pattern, which |
|
|
1025 |
contains non-significant white space to make it more |
|
|
1026 |
readable (assume the PCRE_EXTENDED option) and to divide it |
|
|
1027 |
into three parts for ease of discussion: |
|
|
1028 |
|
|
|
1029 |
|
|
|
1030 |
( \( )? [[^()]+ (?(1) \) ) |
|
|
1031 |
|
|
|
1032 |
|
|
|
1033 |
The first part matches an optional opening parenthesis, and |
|
|
1034 |
if that character is present, sets it as the first captured |
|
|
1035 |
substring. The second part matches one or more characters |
|
|
1036 |
that are not parentheses. The third part is a conditional |
|
|
1037 |
subpattern that tests whether the first set of parentheses |
|
|
1038 |
matched or not. If they did, that is, if the subject started |
|
|
1039 |
with an opening parenthesis, the condition is true, and so |
|
|
1040 |
the yes-pattern is executed and a closing parenthesis is |
|
|
1041 |
required. Otherwise, since no-pattern is not present, the |
|
|
1042 |
subpattern matches nothing. In other words, this pattern |
|
|
1043 |
matches a sequence of non-parentheses, optionally enclosed |
|
|
1044 |
in parentheses. |
|
|
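The three-part pattern above can be exercised directly. The sketch assumes Python's re module as a stand-in, where re.VERBOSE plays the role that PCRE_EXTENDED plays in the text; the four subject strings are made up.

```python
import re

# Part 1: optional "(" captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: a ")" that is required only if group 1 took part in the match.
optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

outcomes = {
    subject: optional_parens.fullmatch(subject) is not None
    for subject in ["(abc)", "abc", "(abc", "abc)"]
}

for subject, matched in outcomes.items():
    print(subject, matched)
```

The whole subject is accepted only when the parentheses are either both present or both absent.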
1045 |
|
|
|
1046 |
|
|
|
1047 |
If the condition is not a sequence of digits, it must be an |
|
|
1048 |
assertion. This may be a positive or negative lookahead or |
|
|
1049 |
lookbehind assertion. Consider this pattern, again |
|
|
1050 |
containing non-significant white space, and with the two |
|
|
1051 |
alternatives on the second line: |
|
|
1052 |
|
|
|
1053 |
|
|
|
1054 |
(?(?=[[^a-z]*[[a-z]) \d{2}-[[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} |
|
|
1055 |
) |
|
|
1056 |
|
|
|
1057 |
|
|
|
1058 |
The condition is a positive lookahead assertion that matches |
|
|
1059 |
an optional sequence of non-letters followed by a letter. In |
|
|
1060 |
other words, it tests for the presence of at least one |
|
|
1061 |
letter in the subject. If a letter is found, the subject is |
|
|
1062 |
matched against the first alternative; otherwise it is |
|
|
1063 |
matched against the second. This pattern matches strings in |
|
|
1064 |
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
|
|
1065 |
letters and dd are digits. |
|
|
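The choice between the two date-like forms can be sketched as below. Python's re module, assumed here as a stand-in, accepts only group numbers or names as conditions, so the assertion condition is rewritten as two alternatives guarded by a positive and a negative lookahead. That rewrite is a substitute technique, not the conditional syntax itself, and the subject strings are invented.

```python
import re

date_like = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE,
)

outcomes = {
    subject: date_like.fullmatch(subject) is not None
    for subject in ["12-abc-34", "12-34-56", "12-ab-34", "12-34-5a"]
}

for subject, matched in outcomes.items():
    print(subject, matched)
```

A subject containing a letter is only ever tried against the first form, so "12-34-5a" is rejected rather than falling through to the all-digit form.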
1066 |
!!COMMENTS |
|
|
1067 |
|
|
|
1068 |
|
|
|
1069 |
The sequence (?# marks the start of a comment which |
|
|
1070 |
continues up to the next closing parenthesis. Nested |
|
|
1071 |
parentheses are not permitted. The characters that make up a |
|
|
1072 |
comment play no part in the pattern matching at |
|
|
1073 |
all. |
|
|
1074 |
|
|
|
1075 |
|
|
|
1076 |
If the PCRE_EXTENDED option is set, an unescaped # character |
|
|
1077 |
outside a character class introduces a comment that |
|
|
1078 |
continues up to the next newline character in the |
|
|
1079 |
pattern. |
|
|
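Both comment styles can be seen side by side in the following sketch. It assumes Python's re module, where re.VERBOSE corresponds to PCRE_EXTENDED; the patterns and the subject are made up. The last item of each pattern shows that # is an ordinary character when the extended option is off, or when it appears inside a character class.

```python
import re

# (?#...) comments work without any option; the trailing # is a literal.
inline = re.compile(r"\d{3}(?#exchange)-\d{4}(?#line number)#")

# With the extended option, an unescaped # outside a class starts a comment.
extended = re.compile(
    r"""
    \d{3}   # exchange
    -
    \d{4}   # line number
    [#]     # inside a class, "#" is an ordinary character
    """,
    re.VERBOSE,
)

subject = "555-1234#"
inline_ok = inline.fullmatch(subject) is not None
extended_ok = extended.fullmatch(subject) is not None
print(inline_ok, extended_ok)
```

The comments play no part in matching, so the two patterns accept exactly the same subject.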
1080 |
!!RECURSIVE PATTERNS |
|
|
1081 |
|
|
|
1082 |
|
|
|
1083 |
Consider the problem of matching a string in parentheses, |
|
|
1084 |
allowing for unlimited nested parentheses. Without the use |
|
|
1085 |
of recursion, the best that can be done is to use a pattern |
|
|
1086 |
that matches up to some fixed depth of nesting. It is not |
|
|
1087 |
possible to handle an arbitrary nesting depth. Perl 5.6 has |
|
|
1088 |
provided an experimental facility that allows regular |
|
|
1089 |
expressions to recurse (amongst other things). It does this |
|
|
1090 |
by interpolating Perl code in the expression at run time, |
|
|
1091 |
and the code can refer to the expression itself. A Perl |
|
|
1092 |
pattern to solve the parentheses problem can be created like |
|
|
1093 |
this: |
|
|
1094 |
|
|
|
1095 |
|
|
|
1096 |
$re = qr{ \( (?: (?>[[^()]+) | (?p{$re}) )* \) }x; |
|
|
1097 |
|
|
|
1098 |
|
|
|
1099 |
The (?p{...}) item interpolates Perl code at run time, and |
|
|
1100 |
in this case refers recursively to the pattern in which it |
|
|
1101 |
appears. Obviously, PCRE cannot support the interpolation of |
|
|
1102 |
Perl code. Instead, the special item (?R) is provided for |
|
|
1103 |
the specific case of recursion. This PCRE pattern solves the |
|
|
1104 |
parentheses problem (assume the PCRE_EXTENDED option is set |
|
|
1105 |
so that white space is ignored): |
|
|
1106 |
|
|
|
1107 |
|
|
|
1108 |
\( ( (?>[[^()]+) | (?R) )* \) |
|
|
1109 |
|
|
|
1110 |
|
|
|
1111 |
First it matches an opening parenthesis. Then it matches any |
|
|
1112 |
number of substrings which can either be a sequence of |
|
|
1113 |
non-parentheses, or a recursive match of the pattern itself |
|
|
1114 |
(i.e. a correctly parenthesized substring). Finally there is |
|
|
1115 |
a closing parenthesis. |
|
|
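What the recursive pattern does can be mirrored by a short hand-written function. This is not a regular expression and not a PCRE call: Python's re module has no (?R) item, so the sketch below is a plain recursive function (the name match_parens is invented) that follows the same three steps.

```python
from typing import Optional

def match_parens(subject: str, pos: int = 0) -> Optional[int]:
    """Return the offset just past a balanced group starting at pos, or None."""
    # Step 1: an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    # Step 2: any number of items, each either a run of non-parentheses
    # or a recursive match of a whole parenthesized group.
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            end = match_parens(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # A run of non-parentheses is taken whole, never split.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    # Step 3: a closing parenthesis.
    if pos < len(subject) and subject[pos] == ")":
        return pos + 1
    return None

print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```

Because each run of non-parentheses is consumed in one piece, an unbalanced subject is rejected after a single pass over it.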
1116 |
|
|
|
1117 |
|
|
|
1118 |
This particular example pattern contains nested unlimited |
|
|
1119 |
repeats, and so the use of a once-only subpattern for |
|
|
1120 |
matching strings of non-parentheses is important when |
|
|
1121 |
applying the pattern to strings that do not match. For |
|
|
1122 |
example, when it is applied to |
|
|
1123 |
|
|
|
1124 |
|
|
|
1125 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
|
|
1126 |
|
|
|
1127 |
|
|
|
1128 |
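it yields "no match" quickly. However, if a once-only subpattern is not used, the match runs for a very long time indeed because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported. |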
|
|
1129 |
|
|
|
1130 |
|
|
|
1131 |
The values set for any capturing subpatterns are those from |
|
|
1132 |
the outermost level of the recursion at which the subpattern |
|
|
1133 |
value is set. If the pattern above is matched |
|
|
1134 |
against |
|
|
1135 |
|
|
|
1136 |
|
|
|
1137 |
(ab(cd)ef) |
|
|
1138 |
|
|
|
1139 |
|
|
|
1140 |
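the value for the capturing parentheses is "ef", which is the last value taken on at the top level. If additional parentheses are added, giving |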
|
|
1141 |
|
|
|
1142 |
|
|
|
1143 |
\( ( ( (?>[[^()]+) | (?R) )* ) \) |
|
|
1144 |
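the string they capture is "ab(cd)ef", the contents of the top level parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion, which it does by using __pcre_malloc__, freeing it via __pcre_free__ |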
|
|
1145 |
afterwards. If no memory can be obtained, it saves data for |
|
|
1146 |
the first 15 capturing parentheses only, as there is no way |
|
|
1147 |
to give an out-of-memory error from within a |
|
|
1148 |
recursion. |
|
|
1149 |
!!PERFORMANCE |
|
|
1150 |
|
|
|
1151 |
|
|
|
1152 |
Certain items that may appear in patterns are more efficient |
|
|
1153 |
than others. It is more efficient to use a character class |
|
|
1154 |
like [[aeiou] than a set of alternatives such as (a|e|i|o|u). |
|
|
1155 |
In general, the simplest construction that provides the |
|
|
1156 |
required behaviour is usually the most efficient. Jeffrey |
|
|
1157 |
Friedl's book contains a lot of discussion about optimizing |
|
|
1158 |
regular expressions for efficient performance. |
|
|
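The class-versus-alternatives comparison can be measured rather than taken on trust. The sketch below assumes Python's re module, so the timings say nothing definite about PCRE itself; the subject string and repetition counts are arbitrary, and the figures printed will vary from machine to machine.

```python
import re
import timeit

subject = "rhythm and blues " * 200

as_class = re.compile(r"[aeiou]")
as_alternatives = re.compile(r"(a|e|i|o|u)")

# Both forms must find exactly the same vowels in the same order.
same = as_class.findall(subject) == as_alternatives.findall(subject)

class_time = timeit.timeit(lambda: as_class.findall(subject), number=50)
alt_time = timeit.timeit(lambda: as_alternatives.findall(subject), number=50)

print(f"same matches: {same}")
print(f"class: {class_time:.4f}s  alternatives: {alt_time:.4f}s")
```

The check on same confirms that the two constructions are interchangeable in what they match, so any difference is purely one of cost.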
1159 |
|
|
|
1160 |
|
|
|
1161 |
When a pattern begins with .* and the PCRE_DOTALL option is |
|
|
1162 |
set, the pattern is implicitly anchored by PCRE, since it |
|
|
1163 |
can match only at the start of a subject string. However, if |
|
|
1164 |
PCRE_DOTALL is not set, PCRE cannot make this optimization, |
|
|
1165 |
because the . metacharacter does not then match a newline, |
|
|
1166 |
and if the subject string contains newlines, the pattern may |
|
|
1167 |
match from the character immediately following one of them |
|
|
1168 |
instead of from the very start. For example, the |
|
|
1169 |
pattern |
|
|
1170 |
|
|
|
1171 |
|
|
|
1172 |
(.*) second |
|
|
1173 |
|
|
|
1174 |
|
|
|
1175 |
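matches the subject "first\nand second" (where \n stands for a newline character) with the first captured substring being "and". In order to do this, PCRE has to retry the match starting after every newline in the subject. |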
|
|
1176 |
|
|
|
1177 |
|
|
|
1178 |
If you are using such a pattern with subject strings that do |
|
|
1179 |
not contain newlines, the best performance is obtained by |
|
|
1180 |
setting PCRE_DOTALL, or starting the pattern with ^.* to |
|
|
1181 |
indicate explicit anchoring. That saves PCRE from having to |
|
|
1182 |
scan along the subject looking for a newline to restart |
|
|
1183 |
at. |
|
|
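The difference that the dot-matches-newline setting makes to a leading .* can be observed with a two-line subject. The sketch assumes Python's re module, where re.DOTALL corresponds to PCRE_DOTALL; the subject string is chosen for the illustration.

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so the match can only
# begin after it, on the second line.
without_dotall = re.search(r"(.*) second", subject)

# With DOTALL the dot crosses the newline and the match begins at offset 0.
with_dotall = re.search(r"(.*) second", subject, re.DOTALL)

print(without_dotall.start(), repr(without_dotall.group(1)))
print(with_dotall.start(), repr(with_dotall.group(1)))
```

Only in the second case does the match start at the very beginning of the subject, which is why implicit anchoring is safe there and not in the first case.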
1184 |
|
|
|
1185 |
|
|
|
1186 |
Beware of patterns that contain nested indefinite repeats. |
|
|
1187 |
These can take a long time to run when applied to a string |
|
|
1188 |
that does not match. Consider the pattern |
|
|
1189 |
fragment |
|
|
1190 |
|
|
|
1191 |
|
|
|
1192 |
(a+)* |
|
|
1193 |
|
|
|
1194 |
|
|
|
1195 |
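This can match a string such as "aaaa" in a large number of different ways, and that number increases very rapidly as the string gets longer. When the remainder of the pattern is such that the entire match is going to fail, PCRE has in principle to try every possible variation, and this can take an extremely long time. |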
|
|
1196 |
|
|
|
1197 |
|
|
|
1198 |
An optimization catches some of the more simple cases such |
|
|
1199 |
as |
|
|
1200 |
|
|
|
1201 |
|
|
|
1202 |
(a+)*b |
|
|
1203 |
|
|
|
1204 |
|
|
|
1205 |
where a literal character follows. Before embarking on the |
|
|
1206 |
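standard matching procedure, PCRE checks that there is a "b" later in the subject string, and if there is not, it fails the match immediately. However, when there is no following literal this optimization cannot be used. You can see the difference by comparing the behaviour of |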
|
|
1207 |
|
|
|
1208 |
|
|
|
1209 |
(a+)*\d |
|
|
1210 |
|
|
|
1211 |
|
|
|
1212 |
with the pattern above. The pattern ending in the literal b gives a failure almost |
|
|
1213 |
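instantly when applied to a whole line of "a" characters, whereas the one ending in \d takes an appreciable time with strings longer than about 20 characters. |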
|
|
1214 |
!!UTF-8 SUPPORT |
|
|
1215 |
|
|
|
1216 |
|
|
|
1217 |
Starting at release 3.3, PCRE has some support for character |
|
|
1218 |
strings encoded in the UTF-8 format. This is incomplete, and |
|
|
1219 |
is regarded as experimental. In order to use it, you must |
|
|
1220 |
configure PCRE to include UTF-8 support in the code, and, in |
|
|
1221 |
addition, you must call __pcre_compile()__ with the |
|
|
1222 |
PCRE_UTF8 option flag. When you do this, both the pattern |
|
|
1223 |
and any subject strings that are matched against it are |
|
|
1224 |
treated as UTF-8 strings instead of just strings of bytes, |
|
|
1225 |
but only in the cases that are mentioned below. |
|
|
1226 |
|
|
|
1227 |
|
|
|
1228 |
If you compile PCRE with UTF-8 support, but do not use it at |
|
|
1229 |
run time, the library will be a bit bigger, but the |
|
|
1230 |
additional run time overhead is limited to testing the |
|
|
1231 |
PCRE_UTF8 flag in several places, so should not be very |
|
|
1232 |
large. |
|
|
1233 |
|
|
|
1234 |
|
|
|
1235 |
PCRE assumes that the strings it is given contain valid |
|
|
1236 |
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
|
|
1237 |
you pass invalid UTF-8 strings to PCRE, the results are |
|
|
1238 |
undefined. |
|
|
1239 |
|
|
|
1240 |
|
|
|
1241 |
Running with PCRE_UTF8 set causes these changes in the way |
|
|
1242 |
PCRE works: |
|
|
1243 |
|
|
|
1244 |
|
|
|
1245 |
1. In a pattern, the escape sequence \x{...}, where the |
|
|
1246 |
contents of the braces is a string of hexadecimal digits, is |
|
|
1247 |
interpreted as a UTF-8 character whose code number is the |
|
|
1248 |
given hexadecimal number, for example: \x{1234}. This inserts |
|
|
1249 |
from one to six literal bytes into the pattern, using the |
|
|
1250 |
UTF-8 encoding. If a non-hexadecimal digit appears between |
|
|
1251 |
the braces, the item is not recognized. |
|
|
1252 |
|
|
|
1253 |
|
|
|
1254 |
2. The original hexadecimal escape sequence, \xhh, generates |
|
|
1255 |
a two-byte UTF-8 character if its value is greater than |
|
|
1256 |
127. |
|
|
1257 |
|
|
|
1258 |
|
|
|
1259 |
3. Repeat quantifiers are NOT correctly handled if they |
|
|
1260 |
follow a multibyte character. For example, \x{100}* and \xc3+ |
|
|
1261 |
do not work. If you want to repeat such characters, you must |
|
|
1262 |
enclose them in non-capturing parentheses, for example |
|
|
1263 |
(?:\x{100}), at present. |
|
|
1264 |
|
|
|
1265 |
|
|
|
1266 |
4. The dot metacharacter matches one UTF-8 character instead |
|
|
1267 |
of a single byte. |
|
|
1268 |
|
|
|
1269 |
|
|
|
1270 |
5. Unlike literal UTF-8 characters, the dot metacharacter |
|
|
1271 |
followed by a repeat quantifier does operate correctly on |
|
|
1272 |
UTF-8 characters instead of single bytes. |
|
|
1273 |
|
|
|
1274 |
|
|
|
1275 |
4. Although the \x{...} escape is permitted in a character |
|
|
1276 |
class, characters whose values are greater than 255 cannot |
|
|
1277 |
be included in a class. |
|
|
1278 |
|
|
|
1279 |
|
|
|
1280 |
5. A class is matched against a UTF-8 character instead of |
|
|
1281 |
just a single byte, but it can match only characters whose |
|
|
1282 |
values are less than 256. Characters with greater values |
|
|
1283 |
always fail to match a class. |
|
|
1284 |
|
|
|
1285 |
|
|
|
1286 |
6. Repeated classes work correctly on multiple |
|
|
1287 |
characters. |
|
|
1288 |
|
|
|
1289 |
|
|
|
1290 |
7. Classes containing just a single character whose value is |
|
|
1291 |
greater than 127 (but less than 256), for example, [[\x80] or |
|
|
1292 |
[[^\x{93}], do not work because these are optimized into |
|
|
1293 |
single byte matches. In the first case, of course, the class |
|
|
1294 |
brackets are just redundant. |
|
|
1295 |
|
|
|
1296 |
|
|
|
1297 |
8. Lookbehind assertions move backwards in the subject by a |
|
|
1298 |
fixed number of characters instead of a fixed number of |
|
|
1299 |
bytes. Simple cases have been tested to work correctly, but |
|
|
1300 |
there may be hidden gotchas herein. |
|
|
1301 |
|
|
|
1302 |
|
|
|
1303 |
9. The character types such as \d and \w do not work correctly |
|
|
1304 |
with UTF-8 characters. They continue to test a single |
|
|
1305 |
byte. |
|
|
1306 |
|
|
|
1307 |
|
|
|
1308 |
10. Anything not explicitly mentioned here continues to work |
|
|
1309 |
in bytes rather than in characters. |
|
|
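The distinction between working in characters and working in bytes, which underlies the whole list above, can be made concrete. The sketch assumes Python, whose re module matches characters on str subjects and single bytes on bytes subjects; it does not exercise PCRE's own UTF-8 mode, and the sample text is made up.

```python
import re

text = "na\u00efve"              # five characters; U+00EF is one of them
raw = text.encode("utf-8")       # six bytes: U+00EF becomes 0xC3 0xAF

# The dot applied to characters versus the dot applied to bytes.
chars = re.findall(r".", text)
octets = re.findall(rb".", raw)

# A code point such as U+1234 becomes several literal bytes in UTF-8,
# and a value above 127 such as 0xC3 becomes a two-byte sequence.
three_bytes = "\u1234".encode("utf-8")
two_bytes = "\u00c3".encode("utf-8")

print(len(chars), len(octets))
print(three_bytes, two_bytes)
```

A pattern item that still works in bytes sees the two halves of a multibyte character separately, which is the source of the quantifier and character-type limitations listed above.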
1310 |
!!DIFFERENCES FROM PERL |
|
|
1311 |
|
|
|
1312 |
|
|
|
1313 |
The differences described here are with respect to Perl |
|
|
1314 |
5.005. |
|
|
1315 |
|
|
|
1316 |
|
|
|
1317 |
1. By default, a whitespace character is any character that |
|
|
1318 |
the C library function __isspace()__ recognizes, though |
|
|
1319 |
it is possible to compile PCRE with alternative character |
|
|
1320 |
type tables. Normally __isspace()__ matches space, |
|
|
1321 |
formfeed, newline, carriage return, horizontal tab, and |
|
|
1322 |
vertical tab. Perl 5 no longer includes vertical tab in its |
|
|
1323 |
set of whitespace characters. The \v escape that was in the |
|
|
1324 |
Perl documentation for a long time was never in fact |
|
|
1325 |
recognized. However, the character itself was treated as |
|
|
1326 |
whitespace at least up to 5.002. In 5.004 and 5.005 it does |
|
|
1327 |
not match \s. |
|
|
1328 |
|
|
|
1329 |
|
|
|
1330 |
2. PCRE does not allow repeat quantifiers on lookahead |
|
|
1331 |
assertions. Perl permits them, but they do not mean what you |
|
|
1332 |
might think. For example, (?!a){3} does not assert that the |
|
|
1333 |
next three characters are not "a". It just asserts that the next character is not "a" three times. |
|
|
1334 |
|
|
|
1335 |
|
|
|
1336 |
3. Capturing subpatterns that occur inside negative |
|
|
1337 |
lookahead assertions are counted, but their entries in the |
|
|
1338 |
offsets vector are never set. Perl sets its numerical |
|
|
1339 |
variables from any such patterns that are matched before the |
|
|
1340 |
assertion fails to match something (thereby succeeding), but |
|
|
1341 |
only if the negative lookahead assertion contains just one |
|
|
1342 |
branch. |
|
|
1343 |
|
|
|
1344 |
|
|
|
1345 |
4. Though binary zero characters are supported in the |
|
|
1346 |
subject string, they are not allowed in a pattern string |
|
|
1347 |
because it is passed as a normal C string, terminated by |
|
|
1348 |
zero. The escape sequence "\0" can be used in the pattern to represent a binary zero. |
|
|
1349 |
|
|
|
1350 |
|
|
|
1351 |
5. The following Perl escape sequences are not supported: \l, |
|
|
1352 |
\u, \L, \U, \E, \Q. In fact these are implemented by Perl's |
|
|
1353 |
general string-handling and are not part of its pattern |
|
|
1354 |
matching engine. |
|
|
1355 |
|
|
|
1356 |
|
|
|
1357 |
6. The Perl \G assertion is not supported as it is not |
|
|
1358 |
relevant to single pattern matches. |
|
|
1359 |
|
|
|
1360 |
|
|
|
1361 |
7. Fairly obviously, PCRE does not support the (?{code}) and |
|
|
1362 |
(?p{code}) constructions. However, there is some |
|
|
1363 |
experimental support for recursive patterns using the |
|
|
1364 |
non-Perl item (?R). |
|
|
1365 |
|
|
|
1366 |
|
|
|
1367 |
8. There are at the time of writing some oddities in Perl |
|
|
1368 |
5.005_02 concerned with the settings of captured strings |
|
|
1369 |
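when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. |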
|
|
1370 |
|
|
|
1371 |
|
|
|
1372 |
In Perl 5.004 $2 is set in both cases, and that is also true |
|
|
1373 |
of PCRE. If in the future Perl changes to a consistent state |
|
|
1374 |
that is different, PCRE may change to follow. |
|
|
1375 |
|
|
|
1376 |
|
|
|
1377 |
9. Another as yet unresolved discrepancy is that in Perl |
|
|
1378 |
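5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset. |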
|
|
1379 |
|
|
|
1380 |
|
|
|
1381 |
10. The following UTF-8 features of Perl 5.6 are not |
|
|
1382 |
implemented: |
|
|
1383 |
|
|
|
1384 |
|
|
|
1385 |
a. The escape sequence \C to match a single |
|
|
1386 |
byte. |
|
|
1387 |
|
|
|
1388 |
|
|
|
1389 |
b. The use of Unicode tables and properties and escapes \p, |
|
|
1390 |
\P, and \X. |
|
|
1391 |
|
|
|
1392 |
|
|
|
1393 |
11. PCRE provides some extensions to the Perl regular |
|
|
1394 |
expression facilities: |
|
|
1395 |
|
|
|
1396 |
|
|
|
1397 |
(a) Although lookbehind assertions must match fixed length |
|
|
1398 |
strings, each alternative branch of a lookbehind assertion |
|
|
1399 |
can match a different length of string. Perl 5.005 requires |
|
|
1400 |
them all to have the same length. |
|
|
1401 |
|
|
|
1402 |
|
|
|
1403 |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not |
|
|
1404 |
set, the $ metacharacter matches only at the very end of |
|
|
1405 |
the string. |
|
|
1406 |
|
|
|
1407 |
|
|
|
1408 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
|
|
1409 |
with no special meaning is faulted. |
|
|
1410 |
|
|
|
1411 |
|
|
|
1412 |
(d) If PCRE_UNGREEDY is set, the greediness of the |
|
|
1413 |
repetition quantifiers is inverted, that is, by default they |
|
|
1414 |
are not greedy, but if followed by a question mark they |
|
|
1415 |
are. |
|
|
1416 |
|
|
|
1417 |
|
|
|
1418 |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
|
|
1419 |
only at the start of the subject. |
|
|
1420 |
|
|
|
1421 |
|
|
|
1422 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options |
|
|
1423 |
for __pcre_exec()__ have no Perl |
|
|
1424 |
equivalents. |
|
|
1425 |
|
|
|
1426 |
|
|
|
1427 |
(g) The (?R) construct allows for recursive pattern matching |
|
|
1428 |
(Perl 5.6 can do this using the (?p{code}) construct, which |
|
|
1429 |
PCRE cannot of course support.) |
|
|
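Some of the extensions listed under item 11 have rough counterparts that can be tried without PCRE. The sketch below assumes Python's re module and uses its nearest equivalents, which are analogies rather than the PCRE options themselves: \Z for the end-only behaviour of $ in (b), the ? suffix for the greediness switch in (d), and re.match versus re.search for anchoring in (e).

```python
import re

# (b) By default $ also matches just before a final newline; an end-only
# test succeeds only at the very end of the subject.
default_dollar = re.search(r"abc$", "abc\n")
end_only = re.search(r"abc\Z", "abc\n")

# (d) Greedy and non-greedy forms of the same repeat.
greedy = re.match(r"a+", "aaa").group(0)
lazy = re.match(r"a+?", "aaa").group(0)

# (e) An anchored attempt is tried only at the start of the subject.
anchored = re.match(r"b", "ab")
floating = re.search(r"b", "ab")

print(bool(default_dollar), bool(end_only))
print(greedy, lazy)
print(anchored, floating.start())
```

Each pair shows the default behaviour next to the behaviour that the corresponding option selects for a whole pattern.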
1430 |
!!AUTHOR |
|
|
1431 |
|
|
|
1432 |
|
|
|
1433 |
Philip Hazel |
|
|
1434 |
University Computing Service, |
|
|
1435 |
New Museums Site, |
|
|
1436 |
Cambridge CB2 3QG, England. |
|
|
1437 |
Phone: +44 1223 334714 |
|
|
1438 |
|
|
|
1439 |
|
|
|
1440 |
Last updated: 28 August 2000, |
|
|
1441 |
the 250th anniversary of the death of J.S. Bach. |
|
|
1442 |
Copyright (c) 1997-2000 University of |
|
|
1443 |
Cambridge. |
|
|
1444 |
---- |