Version 1, including all changes. Rev 1, author perry.

PCRE
!!PCRE
NAME
DESCRIPTION
REGULAR EXPRESSION DETAILS
BACKSLASH
CIRCUMFLEX AND DOLLAR
FULL STOP (PERIOD, DOT)
SQUARE BRACKETS
VERTICAL BAR
INTERNAL OPTION SETTING
SUBPATTERNS
REPETITION
BACK REFERENCES
ASSERTIONS
ONCE-ONLY SUBPATTERNS
CONDITIONAL SUBPATTERNS
COMMENTS
PERFORMANCE
DIFFERENCES FROM PERL
LIMITATIONS
AUTHOR
----

!!NAME

pcre - Perl-compatible regular expressions.

!!DESCRIPTION

The PCRE library is a set of functions that implement
regular expression pattern matching using the same syntax
and semantics as Perl 5, with just a few differences (see
below). The current implementation corresponds to Perl 5.005.

This man page describes the regular expressions understood
by programs that use PCRE.

!!REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions
supported by PCRE are described below. Regular expressions
are also described in the Perl documentation and in a number
of other books, some of which have copious examples. Jeffrey
Friedl's

A regular expression is a pattern that is matched against a
subject string from left to right. Most characters stand for
themselves in a pattern, and match the corresponding
characters in the subject. As a trivial example, the
pattern

  The quick brown fox

matches a portion of a subject string that is identical to
itself. The power of regular expressions comes from the
ability to include alternatives and repetitions in the
pattern. These are encoded in the pattern by the use of
''meta-characters'', which do not stand for themselves
but instead are interpreted in some special way.

There are two different sets of meta-characters: those that
are recognized anywhere in the pattern except within square
brackets, and those that are recognized in square brackets.
Outside square brackets, the meta-characters are as
follows:

  \ general escape character with several uses
  ^ assert start of subject (or line, in multiline mode)
  $ assert end of subject (or line, in multiline mode)
  . match any character except newline (by default)
  [[ start character class definition
  | start of alternative branch
  ( start subpattern
  ) end subpattern
  ? extends the meaning of (; also 0 or 1 quantifier; also quantifier minimizer
  * 0 or more quantifier
  + 1 or more quantifier
  { start min/max quantifier

Part of a pattern that is in square brackets is called a

  \ general escape character
  ^ negate the class, but only if the first character
  - indicates character range
  ] terminates the character class

The following sections describe the use of each of the
meta-characters.

!!BACKSLASH

The backslash character has several uses. Firstly, if it is
followed by a non-alphanumeric character, it takes away any
special meaning that character may have. This use of
backslash as an escape character applies both inside and
outside character classes.

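As a rough illustration of backslash escaping, the sketch below uses Python's `re` module as a stand-in for PCRE. That substitution is an assumption: the two are separate implementations that happen to share this escaping rule. The subject strings are made up for the example.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
# A backslash before a non-alphanumeric character makes it literal.
literal_star = re.search(r"\*", "2*3")        # matches the "*" itself
literal_in_class = re.search(r"[\]]", "a]b")  # escaped "]" inside a class

# Without escaping, "+" is a quantifier, so this pattern means "11=2" or longer.
needle = "1+1=2"
unescaped = re.search(needle, "so 1+1=2 holds")

# re.escape() adds the backslashes needed to match the text literally.
escaped = re.search(re.escape(needle), "so 1+1=2 holds")

print(literal_star.group(), literal_in_class.group(), unescaped, escaped.group())
```

The escape applies equally inside the character class, which is why `[\]]` is a class containing only the closing bracket.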
For example, if you want to match a

If a pattern is compiled with the PCRE_EXTENDED option,
whitespace in the pattern (other than in a character class)
and characters between a

A second use of backslash provides a way of encoding
non-printing characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing
characters, apart from the binary zero that terminates a
pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following
escape sequences than the binary character it represents:

  \a alarm, that is, the BEL character (hex 07)
  \cx

The precise effect of

After

After

The handling of a backslash followed by a digit other than 0
is complicated. Outside a character class, PCRE reads it and
any following digits as a decimal number. If the number is
less than 10, or if there have been at least that many
previous capturing left parentheses in the expression, the
entire sequence is taken as a ''back reference''. A
description of how this works is given later, following the
discussion of parenthesized subpatterns.

Inside a character class, or if the decimal number is
greater than 9 and there have not been that many capturing
subpatterns, PCRE re-reads up to three octal digits
following the backslash, and generates a single byte from
the least significant 8 bits of the value. Any subsequent
digits stand for themselves. For example:

  \040 is another way of writing a space
  \40 is the same, provided there are fewer than 40 previous capturing subpatterns
  \7 is always a back reference
  \11 might be a back reference, or another way of writing a tab
  \011 is always a tab
  \0113 is a tab followed by the character 3

Note that octal values of 100 or greater must not be
introduced by a leading zero, because no more than three
octal digits are ever read.

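The decision rule for a backslash followed by digits can be sketched as a small function. This is a paraphrase of the prose above in Python, not PCRE's actual code; `classify_escape` and its parameters are hypothetical names introduced for illustration.

```python
def classify_escape(digits, groups_so_far, in_class=False):
    """Classify the digits that follow a backslash, per the rules above.

    digits: all decimal digits following the backslash, e.g. "0113".
    groups_so_far: number of capturing left parentheses seen so far.
    Returns ("backref", number, "") or ("byte", value, literal_rest).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits; later digits are literals.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    # Keep only the least significant 8 bits of the octal value.
    value = (int(octal, 8) & 0xFF) if octal else 0
    return ("byte", value, digits[len(octal):])


print(classify_escape("0113", 0))  # ('byte', 9, '3')
```

Calling it with `("40", 0)` and `("40", 40)` shows how the same text flips between a space character and a back reference depending on how many groups precede it.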
All the sequences that define a single byte value can be
used both inside and outside character classes. In addition,
inside a character class, the sequence

The third use of backslash is for specifying generic
character types:

  \d any decimal digit
  \D any character that is not a decimal digit
  \s any whitespace character
  \S any character that is not a whitespace character
  \w any

Each pair of escape sequences partitions the complete set of
characters into two disjoint sets. Any given character
matches one, and only one, of each pair.

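The partition property can be checked mechanically. The sketch below assumes Python's `re` module, whose `\d`/`\D`, `\s`/`\S` and `\w`/`\W` pairs follow the same convention; the sample characters and the helper name `matches_exactly_one` are arbitrary choices for illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]
SAMPLE = "a Z_9\t\n-"


def matches_exactly_one(ch, lower, upper):
    """True if the single character ch matches one type of the pair but not both."""
    in_lower = re.fullmatch(lower, ch) is not None
    in_upper = re.fullmatch(upper, ch) is not None
    return in_lower != in_upper


partitioned = all(
    matches_exactly_one(ch, lower, upper)
    for ch in SAMPLE
    for lower, upper in PAIRS
)
print(partitioned)
```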
A

These character type sequences can appear both inside and
outside character classes. They each match one character of
the appropriate type. If the current matching point is at
the end of the subject string, all of them fail, since there
is no character to match.

The fourth use of backslash is for certain simple
assertions. An assertion specifies a condition that has to
be met at a particular point in a match, without consuming
any characters from the subject string. The use of
subpatterns for more complicated assertions is described
below. The backslashed assertions are

  \b word boundary
  \B not a word boundary
  \A start of subject (independent of multiline mode)
  \Z end of subject or newline at end (independent of multiline mode)
  \z end of subject (independent of multiline mode)

These assertions may not appear in character classes (but
note that

A word boundary is a position in the subject string where
the current character and the previous character do not both
match \w or \W (i.e. one matches \w and the other matches \W),
or the start or end of the string if the first or last
character matches \w, respectively.

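A short sketch of the word-boundary definition, again assuming Python's `re` module as a stand-in for PCRE. The subject string is invented for the example.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
subject = "cat concatenate cat."

# \b requires a word/non-word transition (or a string edge next to a word character).
whole_words = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \B requires that there is no such transition on either side.
inside_words = [m.start() for m in re.finditer(r"\Bcat\B", subject)]

print(whole_words, inside_words)
```

Neither assertion consumes a character, so each match is exactly the three letters of `cat`.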
The \A, \Z, and \z assertions differ from the traditional
circumflex and dollar (described below) in that they only
ever match at the very start and end of the subject string,
whatever options are set. They are not affected by the
PCRE_NOTBOL or PCRE_NOTEOL options. If the
''startoffset'' argument of __pcre_exec()__ is
non-zero, \A can never match. The difference between \Z and \z
is that \Z matches before a newline that is the last
character of the string as well as at the end of the string,
whereas \z matches only at the end.

!!CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
string. If the ''startoffset'' argument of
__pcre_exec()__ is non-zero, circumflex can never match.
Inside a character class, circumflex has an entirely
different meaning (see below).

Circumflex need not be the first character of the pattern if
a number of alternatives are involved, but it should be the
first thing in each alternative in which it appears if the
pattern is ever to match that branch. If all possible
alternatives start with a circumflex, that is, if the
pattern is constrained to match only at the start of the
subject, it is said to be an

A dollar character is an assertion which is true only if the
current matching point is at the end of the subject string,
or immediately before a newline character that is the last
character in the string (by default). Dollar need not be the
last character of the pattern if a number of alternatives
are involved, but it should be the last item in any branch
in which it appears. Dollar has no special meaning in a
character class.

The meaning of dollar can be changed so that it matches only
at the very end of the string, by setting the
PCRE_DOLLAR_ENDONLY option at compile or matching time. This
does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are
changed if the PCRE_MULTILINE option is set. When this is
the case, they match immediately after and immediately
before an internal
''startoffset'' argument of __pcre_exec()__ is non-zero. The
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
set.

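The difference between the default and multiline meanings of circumflex and dollar can be seen in a small sketch. It assumes Python's `re` module, where `re.MULTILINE` plays the role that PCRE_MULTILINE plays here; the subject string is made up.

```python
import re

# Assumption: Python's re.MULTILINE corresponds to PCRE_MULTILINE.
subject = "one\ntwo\n"

# Default mode: $ matches at the end or just before a final newline.
default_dollar = re.search(r"two$", subject)

# Default mode: ^ matches only at the very start of the subject.
default_caret = re.search(r"^two", subject)

# Multiline mode: ^ and $ also match around internal newlines.
multiline_lines = re.findall(r"^\w+$", subject, re.MULTILINE)

print(default_dollar.span(), default_caret, multiline_lines)
```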
Note that the sequences \A, \Z, and \z can be used to match the
start and end of the subject in both modes, and if all
branches of a pattern start with \A it is always anchored,
whether PCRE_MULTILINE is set or not.

!!FULL STOP (PERIOD, DOT)

Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing
character, but not (by default) newline. If the PCRE_DOTALL
option is set, then dots match newlines as well. The
handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they
both involve newline characters. Dot has no special meaning
in a character class.

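A minimal sketch of the dot rules, assuming Python's `re` module with `re.DOTALL` standing in for PCRE_DOTALL. The subject strings are illustrative.

```python
import re

# Assumption: Python's re.DOTALL corresponds to PCRE_DOTALL.
subject = "a\nb"

default_match = re.search(r"a.b", subject)            # dot refuses the newline
dotall_match = re.search(r"a.b", subject, re.DOTALL)  # dot accepts the newline

# Inside a character class the dot is an ordinary character.
class_hit = re.search(r"[.]", "x.y")
class_miss = re.search(r"[.]", "xy")

print(default_match, dotall_match.group() == subject, class_hit.group(), class_miss)
```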
!!SQUARE BRACKETS

An opening square bracket introduces a character class,
terminated by a closing square bracket. A closing square
bracket on its own is not special. If a closing square
bracket is required as a member of the class, it should be
the first data character in the class (after an initial
circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject;
the character must be in the set of characters defined by
the class, unless the first character in the class is a
circumflex, in which case the subject character must not be
in the set defined by the class. If a circumflex is actually
required as a member of the class, ensure it is not the
first character, or escape it with a backslash.

For example, the character class [[aeiou] matches any lower
case vowel, while [[^aeiou] matches any character that is not
a lower case vowel. Note that a circumflex is just a
convenient notation for specifying the characters which are
in the class by enumerating those that are not. It is not an
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.

When caseless matching is set, any letters in a class
represent both their upper case and lower case versions, so
for example, a caseless [[aeiou] matches

The newline character is never treated in any special way in
character classes, whatever the setting of the PCRE_DOTALL
or PCRE_MULTILINE options is. A class such as [[^a] will
always match a newline.

The minus (hyphen) character can be used to specify a range
of characters in a character class. For example, [[d-m]
matches any letter between d and m, inclusive. If a minus
character is required in a class, it must be escaped with a
backslash or appear in a position where it cannot be
interpreted as indicating a range, typically as the first or
last character in the class.

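The class rules described so far can be sketched as follows, assuming Python's `re` module as a stand-in for PCRE. Note that the doubled bracket in the text is wiki escaping; the patterns in the code use a single `[`. All subject strings are invented.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
lower_vowels = re.findall(r"[aeiou]", "Regular Expression")

# A negated class still consumes a character, and a newline qualifies.
non_vowel = re.search(r"[^aeiou]", "aei\nou")

# A range is inclusive at both ends.
in_range = re.findall(r"[d-m]", "abcdefmnz")

# A hyphen or closing bracket placed first in the class is literal.
with_hyphen = re.findall(r"[-az]", "a-mz")
with_bracket = re.findall(r"[]x]", "a]x")

# At the end of the subject there is nothing left to consume, so the class fails.
at_end = re.search(r"x[^a]", "x")

print(lower_vowels, in_range, with_hyphen, with_bracket, at_end)
```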
It is not possible to have the literal character

Ranges operate in ASCII collating sequence. They can also be
used for characters specified numerically, for example
[[\000-\037]. If a range that includes letters is used when
caseless matching is set, it matches the letters in either
case. For example, [[W-c] is equivalent to [[][[^_`wxyzabc],
matched caselessly, and if character tables for the

The character types \d, \D, \s, \S, \w, and \W may also appear in
a character class, and add the characters that they match to
the class. For example, [[\dABCDEF] matches any hexadecimal
digit. A circumflex can conveniently be used with the upper
case character types to specify a more restricted set of
characters than the matching lower case type. For example,
the class [[^\W_] matches any letter or digit, but not
underscore.

All non-alphanumeric characters other than \, -, ^ (at the
start) and the terminating ] are non-special in character
classes, but it does no harm if they are escaped.

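Character types inside classes can be tried out with a short sketch. It assumes Python's `re` module as a stand-in for PCRE, and the subject strings are arbitrary.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
# \d adds the digits to the class, giving upper-case hexadecimal digits.
hex_upper = re.fullmatch(r"[\dABCDEF]+", "7F3A")
hex_lower = re.fullmatch(r"[\dABCDEF]+", "7f3a")  # lower-case letters are not listed

# [^\W_] is "a word character, minus underscore": letters and digits only.
letters_digits = re.findall(r"[^\W_]", "a_b-1")

print(hex_upper.group(), hex_lower, letters_digits)
```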
!!VERTICAL BAR

Vertical bar characters are used to separate alternative
patterns. For example, the pattern

  gilbert|sullivan

matches either

!!INTERNAL OPTION SETTING

The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
and PCRE_EXTENDED can be changed from within the pattern by
a sequence of Perl option letters enclosed between

  i for PCRE_CASELESS
  m for PCRE_MULTILINE
  s for PCRE_DOTALL
  x for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is
also possible to unset these options by preceding the letter
with a hyphen, and a combined setting and unsetting such as
(?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
If a letter appears both before and after the hyphen, the
option is unset.

The scope of these option changes depends on where in the
pattern the setting occurs. For settings that are outside
any subpattern (defined below), the effect is the same as if
the options were set or unset at the start of matching. The
following patterns all behave in exactly the same way:

  (?i)abc
  a(?i)bc
  ab(?i)c
  abc(?i)

which in turn is the same as compiling the pattern abc with
PCRE_CASELESS set. In other words, such

If an option change occurs inside a subpattern, the effect
is different. This is a change of behaviour in Perl 5.005.
An option change inside a subpattern affects only that part
of the subpattern that follows it, so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming
PCRE_CASELESS is not used). By this means, options can be
made to have different settings in different parts of the
pattern. Any changes made in one alternative do carry on
into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches

The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
be changed in the same way as the Perl-compatible options by
using the characters U and X respectively. The (?X) flag
setting is special in that it must always occur earlier in
the pattern than any of the additional features it turns on,
even when it is at top level. It is best put at the start.

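Option scoping can be approximated in Python, with one caveat stated up front: recent versions of Python's `re` module reject a bare `(?i)` in the middle of a pattern, so the sketch below uses the scoped form `(?i:b)` to get the same effect that the text describes for `(a(?i)b)c`. This is an assumption-laden stand-in, not PCRE behaviour itself.

```python
import re

# Assumption: Python's re module stands in for PCRE, and (?i:b) approximates
# the in-subpattern option change (a(?i)b)c described in the text.
scoped = re.compile(r"(a(?i:b))c")
candidates = ["abc", "aBc", "Abc", "abC"]
scoped_results = {s: scoped.fullmatch(s) is not None for s in candidates}

# A setting at the very start applies to the whole pattern.
whole = re.compile(r"(?i)abc")
whole_results = {s: whole.fullmatch(s) is not None for s in candidates}

print(scoped_results)
print(whole_results)
```

Only the `b` is caseless in the scoped pattern, whereas the leading `(?i)` makes every letter caseless.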
!!SUBPATTERNS

Subpatterns are delimited by parentheses (round brackets),
which can be nested. Marking part of a pattern as a
subpattern does two things:

1. It localizes a set of alternatives. For example, the
pattern

  cat(aract|erpillar|)

matches one of the words

2. It sets up the subpattern as a capturing subpattern (as
defined above). When the whole pattern matches, that portion
of the subject string that matched the subpattern is passed
back to the caller via the ''ovector'' argument of
__pcre_exec()__. Opening parentheses are counted from
left to right (starting from 1) to obtain the numbers of the
capturing subpatterns.

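Both roles of parentheses can be sketched with Python's `re` module, used here as a stand-in for PCRE. The match object's `groups()` and `span()` play the part of the ''ovector'' results; the subject strings are chosen for illustration only.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
# Role 1: parentheses localize the alternatives, including the empty one.
cat_words = ["cat", "cataract", "caterpillar", "catalog"]
localized = [w for w in cat_words if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Role 2: groups are numbered by their opening parenthesis, left to right.
m = re.search(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()
offsets = [m.span(i) for i in (1, 2, 3)]

print(localized, captured, offsets)
```

Group 1 is the outer pair because its opening parenthesis comes first, even though it closes last.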
For example, if the string

  the ((red|white) (king|queen))

the captured substrings are

The fact that plain parentheses fulfil two functions is not
always helpful. There are often times when a grouping
subpattern is required without a capturing requirement. If
an opening parenthesis is followed by

  the ((?:red|white) (king|queen))

the captured substrings are

As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
option letters may appear between the

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative
branches are tried from left to right, and options are not
reset until the end of the subpattern is reached, an option
setting in one branch does affect subsequent branches, so
the above patterns match

!!REPETITION

Repetition is specified by quantifiers, which can follow any
of the following items:

  a single character, possibly escaped
  the . metacharacter
  a character class
  a back reference (see next section)
  a parenthesized subpattern (unless it is an assertion - see below)

The general repetition quantifier specifies a minimum and
maximum number of permitted matches, by giving the two
numbers in curly brackets (braces), separated by a comma.
The numbers must be less than 65536, and the first must be
less than or equal to the second. For example:

  z{2,4}

matches

  [[aeiou]{3,}

matches at least 3 successive vowels, but may match many
more, while

  \d{8}

matches exactly 8 digits. An opening curly bracket that
appears in a position where a quantifier is not allowed, or
one that does not match the syntax of a quantifier, is taken
as a literal character. For example, {,6} is not a
quantifier, but a literal string of four characters.

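The three example quantifiers can be exercised with a small sketch, assuming Python's `re` module as a stand-in for PCRE. One known difference is deliberately left out: Python treats `{,6}` as a quantifier, so the literal-brace case described above is not shown. Subject strings are invented.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
# {2,4}: at least two, at most four repetitions (greedy by default).
z_runs = re.findall(r"z{2,4}", "z zz zzzzz")

# {3,}: three or more, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight_ok = re.fullmatch(r"\d{8}", "20240131") is not None
seven_ok = re.fullmatch(r"\d{8}", "2024013") is not None

print(z_runs, vowel_run, eight_ok, seven_ok)
```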
The quantifier {0} is permitted, causing the expression to
behave as if the previous item and the quantifier were not
present.

For convenience (and historical compatibility) the three
most common quantifiers have single-character abbreviations:

  * is equivalent to {0,}
  + is equivalent to {1,}
  ? is equivalent to {0,1}

It is possible to construct infinite loops by following a
subpattern that can match no characters with a quantifier
that has no upper limit, for example:

  (a?)*

Earlier versions of Perl and PCRE used to give an error at
compile time for such patterns. However, because there are
cases where this can be useful, such patterns are now
accepted, but if any repetition of the subpattern does in
fact match no characters, the loop is forcibly broken.

By default, the quantifiers are

  /\*.*\*/

to the string

  /* first command */ not comment /* second comment */

fails, because it matches the entire string due to the
greediness of the .* item.

However, if a quantifier is followed by a question mark,
then it ceases to be greedy, and instead matches the minimum
number of times possible, so the pattern

  /\*.*?\*/

does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the
preferred number of matches. Do not confuse this use of
question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as
in

  \d??\d

which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.

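The C comment example can be run directly. The sketch below assumes Python's `re` module as a stand-in for PCRE and reuses the subject string from the text.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
subject = "/* first command */ not comment /* second comment */"

# Greedy .* runs to the last "*/", swallowing everything in between.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/" each time.
minimal = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(minimal)
print(prefers_one, forced_two)
```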
If the PCRE_UNGREEDY option is set (an option which is not
available in Perl) then the quantifiers are not greedy by
default, but individual ones can be made greedy by following
them with a question mark. In other words, it inverts the
default behaviour.

When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited
maximum, more memory is required for the compiled pattern, in
proportion to the size of the minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL
option (equivalent to Perl's /s) is set, thus allowing the .
to match newlines, then the pattern is implicitly anchored,
because whatever follows will be tried against every
character position in the subject string, so there is no
point in retrying the overall match at any position after
the first. PCRE treats such a pattern as though it were
preceded by \A. In cases where it is known that the subject
string contains no newlines, it is worth setting PCRE_DOTALL
when the pattern begins with .* in order to obtain this
optimization, or alternatively using ^ to indicate anchoring
explicitly.

When a capturing subpattern is repeated, the value captured
is the substring that matched the final iteration. For
example, after

  (tweedle[[dume]{3}\s*)+

has matched

  /(a|(b))+/

matches

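The rule that a repeated capturing subpattern keeps only its final iteration can be checked with a short sketch. It assumes Python's `re` module as a stand-in for PCRE, and the subject string is an assumption chosen to fit the tweedle pattern.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")

whole_match = m.group(0)      # everything the repeated group consumed
last_iteration = m.group(1)   # only the final repetition is kept

print(repr(whole_match), repr(last_iteration))
```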
!!BACK REFERENCES

Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
reference to a capturing subpattern earlier (i.e. to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.

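A back reference repeats the text that a group actually captured, not the group's pattern. The sketch below illustrates this with Python's `re` module as a stand-in for PCRE, using the pattern `(sens|respons)e and \1ibility`; the subject strings are assumptions.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",  # \1 must repeat "sens", so this fails
]
results = [pattern.fullmatch(s) is not None for s in subjects]

print(results)
```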
However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
causes an error only if there are not that many capturing
left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of
the reference for numbers less than 10. See the section
entitled

A back reference matches whatever actually matched the
capturing subpattern in the current subject string, rather
than anything matching the subpattern itself. So the
pattern

  (sens|respons)e and \1ibility

matches

  ((?i)rah)\s+\1

matches

There may be more than one back reference to the same
subpattern. If a subpattern has not actually been used in a
particular match, then any back references to it always
fail. For example, the pattern

  (a|(bc))\2

always fails if it starts to match

A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, (a\1) never matches. However, such references can be
useful inside repeated subpatterns. For example, the
pattern

  (a|b\1)+

matches any number of

!!ASSERTIONS

An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters. The simple assertions coded as \b, \B,
\A, \Z, \z, ^ and $ are described above. More complicated
assertions are coded as subpatterns. There are two kinds:
those that look ahead of the current position in the subject
string, and those that look behind it.

An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed. Lookahead assertions start with (?= for positive
assertions and (?! for negative assertions. For
example,

  \w+(?=;)

matches a word followed by a semicolon, but does not include
the semicolon in the match, and

  foo(?!bar)

matches any occurrence of

  (?!foo)bar

does not find an occurrence of

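The three lookahead patterns can be tried with a brief sketch, assuming Python's `re` module as a stand-in for PCRE. The subject strings are made up for the example.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
# Positive lookahead: the semicolon must follow but is not part of the match.
words_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookahead: "foo" only where "bar" does not follow.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# (?!foo)bar asserts that the next characters are not "foo", which is always
# true when they are "bar", so the "bar" after "foo" is still found.
bar_hit = re.search(r"(?!foo)bar", "foobar")

print(words_before_semicolon, foo_not_bar, bar_hit.start())
```

The last case shows why a lookahead cannot express "not preceded by": it always examines the text after the current position.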
763 |
Lookbehind assertions start with (? |
|
|
764 |
|
|
|
765 |
|
|
|
766 |
(? |
|
|
767 |
|
|
|
768 |
|
|
|
769 |
does find an occurrence of |
|
|
770 |
|
|
|
771 |
|
|
|
772 |
(? |
|
|
773 |
|
|
|
774 |
|
|
|
775 |
is permitted, but |
|
|
776 |
|
|
|
777 |
|
|
|
778 |
(? |
|
|
779 |
|
|
|
780 |
|
|
|
781 |
causes an error at compile time. Branches that match |
|
|
782 |
different length strings are permitted only at the top level |
|
|
783 |
of a lookbehind assertion. This is an extension compared |
|
|
784 |
with Perl 5.005, which requires all branches to match the |
|
|
785 |
same length of string. An assertion such as |
|
|
786 |
|
|
|
787 |
|
|
|
788 |
(? |
|
|
789 |
|
|
|
790 |
|
|
|
791 |
is not permitted, because its single top-level branch can |
|
|
792 |
match two different lengths, but it is acceptable if |
|
|
793 |
rewritten to use two top-level branches: |
|
|
794 |
|
|
|
795 |
|
|
|
796 |
(? |
|
|
797 |
|
|
|
798 |
|
|
|
799 |
The implementation of lookbehind assertions is, for each
alternative, to temporarily move the current position back
by the fixed width and then try to match. If there are
insufficient characters before the current position, the
match is deemed to fail. Lookbehinds in conjunction with
once-only subpatterns can be particularly useful for
matching at the ends of strings; an example is given at the
end of the section on once-only subpatterns.

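A small sketch of the lookbehind rules, again using Python's re module as a stand-in (an assumption). Python is stricter than the PCRE behaviour described here: it requires every branch of a lookbehind to have the same width, so the mixed-width case is rewritten as two separate assertions. The variable names and subject strings are illustrative only.

```python
import re

# "bar" is reported only when "foo" does not immediately precede it.
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

def compiles(pattern):
    """Return True if this engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# PCRE accepts top-level branches of different fixed widths; Python's re does not.
mixed_width_accepted = compiles(r"(?<=bullock|donkey)")

# Portable rewrite: one fixed-width lookbehind per branch.
split_lookbehind = re.compile(r"(?:(?<=bullock)|(?<=donkey))s")

print(not_after_foo, mixed_width_accepted)
```

The rewrite mirrors the implementation note above: each alternative steps back by its own fixed width and is tried independently.
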
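Several assertions (of any sort) may occur in succession.
For example,

(?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999".
Notice that each of the assertions is applied independently
at the same point in the subject string. First there is a
check that the previous three characters are all digits, and
then there is a check that the same three characters are not
"999". This pattern does ''not'' match "foo" preceded by six
characters, the first three of which are digits and the last
three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

(?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and
then the second assertion checks that the preceding three
characters are not "999".
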
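The difference between the two successive-assertion patterns can be seen by running both over the same subjects. A minimal sketch, assuming Python's re module behaves like PCRE for these fixed-width lookbehinds; the helper found and the test strings are made up for the illustration.

```python
import re

# Both assertions inspect the same three characters before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, found(adjacent, subject), found(six_back, subject))
```
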
Assertions can be nested in any combination. For
example,

(?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar"
which in turn is not preceded by "foo", while

(?<=\d{3}(?!999)...)foo

is another pattern which matches "foo" preceded by three
digits and any three characters that are not "999".

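Nested assertions can be exercised the same way. This sketch assumes Python's re module as a stand-in for PCRE; the variable names and subject strings are illustrative.

```python
import re

# A lookbehind that itself contains a negative lookbehind.
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")

# A lookbehind that contains a negative lookahead.
ahead_in_behind = re.compile(r"(?<=\d{3}(?!999)...)foo")

print(bool(nested_behind.search("xxxbarbaz")))      # "bar" not preceded by "foo"
print(bool(nested_behind.search("foobarbaz")))      # "bar" preceded by "foo"
print(bool(ahead_in_behind.search("123abcfoo")))    # three digits, then "abc"
print(bool(ahead_in_behind.search("123999foo")))    # three digits, then "999"
```
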
Assertion subpatterns are not capturing subpatterns, and may
not be repeated, because it makes no sense to assert the
same thing several times. If any kind of assertion contains
capturing subpatterns within it, these are counted for the
purposes of numbering the capturing subpatterns in the whole
pattern. However, substring capturing is carried out only
for positive assertions, because it does not make sense for
negative assertions.

Assertions count towards the maximum of 200 parenthesized
subpatterns.

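The capturing rules for assertions can be observed directly. A brief sketch, assuming Python's re module, which happens to follow the same rules for these two cases; the patterns are invented for the example.

```python
import re

# A capturing group inside a positive lookahead is set, even though the
# assertion itself consumes nothing.
positive = re.search(r"(?=(\d+))\d", "abc123")

# A capturing group inside a negative lookahead is still counted as group 1,
# but it is never set; the following group is therefore number 2.
negative = re.search(r"(?!(a))(b)", "b")

print(positive.group(0), positive.group(1))
print(negative.group(1), negative.group(2))
```
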
!!ONCE-ONLY SUBPATTERNS

With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be
re-evaluated to see if a different number of repeats allows
the rest of the pattern to match. Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.

Consider, for example, the pattern \d+foo when applied to the
subject line

123456bar

After matching all 6 digits and then failing to match "foo",
the normal action of the matcher is to try again with only 5
digits matching the \d+ item, and then with 4, and so on,
before ultimately failing. Once-only subpatterns provide the
means for specifying that once a portion of the pattern has
matched, it is not to be re-evaluated in this way, so the
matcher would give up immediately on failing to match "foo"
the first time. The notation is another kind of special
parenthesis, starting with (?> as in this example:

(?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as
normal.

An alternative description is that a subpattern of this type
matches the string of characters that an identical
standalone pattern would match, if anchored at the current
point in the subject string.

Once-only subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a
maximizing repeat that must swallow everything it can. So,
while both \d+ and \d+? are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, (?>\d+) can only match an entire sequence of digits.

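The "swallow everything and never give back" behaviour can be made visible with a pattern whose success depends on backtracking. This sketch uses Python's re module, which accepts (?>...) directly only from version 3.11, so the once-only group is emulated here with a capturing lookahead followed by a backreference. That emulation is a substitute technique, not PCRE's own mechanism, and unlike a real once-only subpattern it adds a capturing group.

```python
import re

# Ordinary repeat: \d+ gives back one digit so that the final \d can match.
backtracking = re.compile(r"\d+\d")

# Emulated once-only group, standing in for (?>\d+)\d. The lookahead captures
# the longest run of digits, \1 then consumes exactly that text, and the
# matcher cannot re-enter the lookahead to try a shorter run.
once_only = re.compile(r"(?=(\d+))\1\d")

print(backtracking.fullmatch("123456") is not None)
print(once_only.search("123456") is not None)
```

The second pattern can never succeed: after the whole digit run has been taken there is no digit left for the trailing \d, and nothing is handed back.
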
This construction can of course contain arbitrarily
complicated subpatterns, and it can be nested.

Once-only subpatterns can be used in conjunction with
lookbehind assertions to specify efficient matching at the
end of the subject string. Consider a simple pattern such
as

abcd$

when applied to a long string which does not match it.
Because matching proceeds from left to right, PCRE will look
for each "a" in the subject and then see if what follows
matches the rest of the pattern. If the pattern is specified
as

^.*abcd$

then the initial .* matches the entire string at first, but
when this fails, it backtracks to match all but the last
character, then all but the last two characters, and so on.
Once again the search for "a" covers the entire string, from
right to left, so we are no better off. However, if the
pattern is written as

^(?>.*)(?<=abcd)

then there can be no backtracking for the .* item; it can
match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If
it fails, the match fails immediately. For long strings,
this approach makes a significant difference to the
processing time.

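The end-of-string idiom can be sketched as follows. As before, Python's re module stands in for PCRE, and because (?>...) is only available there from version 3.11 the once-only group is emulated with a capturing lookahead plus a backreference (a substitute technique, labelled as such). The sketch checks only the result of the match, not the time saved.

```python
import re

# Stand-in for ^(?>.*)(?<=abcd): take the whole line in one step, then make a
# single lookbehind test on the last four characters.
ends_with_abcd = re.compile(r"^(?=(.*))\1(?<=abcd)")

print(ends_with_abcd.search("xxxxabcd") is not None)
print(ends_with_abcd.search("abcdxxxx") is not None)
```
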
!!CONDITIONAL SUBPATTERNS

It is possible to cause the matching process to obey a
subpattern conditionally or to choose between two
alternative subpatterns, depending on the result of an
assertion, or whether a previous capturing subpattern
matched or not. The two possible forms of conditional
subpattern are

(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used;
otherwise the no-pattern (if present) is used. If there are
more than two alternatives in the subpattern, a compile-time
error occurs.

There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, then the
condition is satisfied if the capturing subpattern of that
number has previously matched. Consider the following
pattern, which contains non-significant white space to make
it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of
discussion:

( \( )?    [[^()]+    (?(1) \) )

The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring. The second part matches one or more characters
that are not parentheses. The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so
the yes-pattern is executed and a closing parenthesis is
required. Otherwise, since no-pattern is not present, the
subpattern matches nothing. In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.

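The three-part pattern can be tried out as written. A minimal sketch using Python's re module, whose re.VERBOSE flag plays the role of PCRE_EXTENDED here (an assumption about equivalence that holds for this pattern); the variable name and subject strings are illustrative.

```python
import re

optional_parens = re.compile(r"""
    ( \( )?      # part 1: optional opening parenthesis, captured as group 1
    [^()]+       # part 2: one or more characters that are not parentheses
    (?(1) \) )   # part 3: closing parenthesis required only if group 1 matched
""", re.VERBOSE)

for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(subject, optional_parens.fullmatch(subject) is not None)
```

The whole subject is matched with fullmatch so that an unbalanced parenthesis shows up as a failure rather than as a shorter match.
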
If the condition is not a sequence of digits, it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again
containing non-significant white space, and with the two
alternatives on the second line:

(?(?=[[^a-z]*[[a-z])
\d{2}-[[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches
an optional sequence of non-letters followed by a letter. In
other words, it tests for the presence of at least one
letter in the subject. If a letter is found, the subject is
matched against the first alternative; otherwise it is
matched against the second. This pattern matches strings in
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.

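Python's re module, used for the sketches in this page, accepts only a group number or name as a condition, not an assertion. The same decision can be written as a plain alternation in which each branch starts with the assertion or its negation. The following is that rewrite, offered as an equivalent under this assumption rather than as PCRE syntax; the name date_like is invented.

```python
import re

date_like = re.compile(r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead: dd-aaa-dd
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter ahead:     dd-dd-dd
    )
""", re.VERBOSE)

for subject in ("12-abc-34", "12-34-56", "12-3b-56"):
    print(subject, date_like.fullmatch(subject) is not None)
```
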
!!COMMENTS

The sequence (?# marks the start of a comment which
continues up to the next closing parenthesis. Nested
parentheses are not permitted. The characters that make up a
comment play no part in the pattern matching at
all.

If the PCRE_EXTENDED option is set, an unescaped # character
outside a character class introduces a comment that
continues up to the next newline character in the
pattern.

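Both comment forms can be seen in a short sketch. It assumes Python's re module, where re.VERBOSE corresponds to PCRE_EXTENDED for the purposes of this example; the patterns are invented.

```python
import re

# Inline comment: everything from (?# to the next ) is ignored.
inline = re.compile(r"ab(?#this text is ignored)c")

# Extended mode: an unescaped # outside a character class starts a comment
# that runs to the end of the line; inside a class it is an ordinary character.
extended = re.compile(r"""
    \d+   # one or more digits
    [#]   # a literal '#'
    \w+   # one or more word characters
""", re.VERBOSE)

print(inline.fullmatch("abc") is not None)
print(extended.fullmatch("42#answer") is not None)
```
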
!!PERFORMANCE

Certain items that may appear in patterns are more efficient
than others. It is more efficient to use a character class
like [[aeiou] than a set of alternatives such as (a|e|i|o|u).
In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey
Friedl's book contains a lot of discussion about optimizing
regular expressions for efficient performance.

When a pattern begins with .* and the PCRE_DOTALL option is
set, the pattern is implicitly anchored by PCRE, since it
can match only at the start of a subject string. However, if
PCRE_DOTALL is not set, PCRE cannot make this optimization,
because the . metacharacter does not then match a newline,
and if the subject string contains newlines, the pattern may
match from the character immediately following one of them
instead of from the very start. For example, the
pattern

(.*) second

matches the subject "first\nand second" (where \n stands for
a newline character) with the first captured substring being
"and". In order to do this, PCRE has to retry the match
starting after every newline in the subject.

If you are using such a pattern with subject strings that do
not contain newlines, the best performance is obtained by
setting PCRE_DOTALL, or starting the pattern with ^.* to
indicate explicit anchoring. That saves PCRE from having to
scan along the subject looking for a newline to restart
at.

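The effect of the dot-matches-newline setting on where the match starts can be checked directly. A minimal sketch, assuming Python's re module, whose re.DOTALL flag corresponds to PCRE_DOTALL; it shows only the match results, not the internal anchoring optimization.

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so the match has to start
# on the second line.
without_dotall = re.search(r"(.*) second", subject)

# With DOTALL the dot also matches the newline, so the match starts at the
# very beginning of the subject.
with_dotall = re.search(r"(.*) second", subject, re.DOTALL)

print(without_dotall.start(), repr(without_dotall.group(1)))
print(with_dotall.start(), repr(with_dotall.group(1)))
```
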
Beware of patterns that contain nested indefinite repeats.
These can take a long time to run when applied to a string
that does not match. Consider the pattern
fragment

(a+)*

This can match "aaaa" in 33 different ways, and this number
increases very rapidly as the string gets longer. (The *
repeat can match 0, 1, 2, 3, or 4 times, and for each of
those cases other than 0, the + repeats can match different
numbers of times.) When the remainder of the pattern is such
that the entire match is going to fail, PCRE has in
principle to try every possible variation, and this can take
an extremely long time.

An optimization catches some of the simpler cases such
as

(a+)*b

where a literal character follows. Before embarking on the
standard matching procedure, PCRE checks that there is a "b"
later in the subject string, and if there is not, it fails
the match immediately. However, when there is no following
literal this optimization cannot be used. You can see the
difference by comparing the behaviour of

(a+)*\d

with the pattern above. The former gives a failure almost
instantly when applied to a whole line of "a" characters,
whereas the latter takes an appreciable time with strings
longer than about 20 characters.

!!DIFFERENCES FROM PERL

The differences described here are with respect to Perl
5.005.

1. By default, a whitespace character is any character that
the C library function __isspace()__ recognizes, though
it is possible to compile PCRE with alternative character
type tables. Normally __isspace()__ matches space,
formfeed, newline, carriage return, horizontal tab, and
vertical tab. Perl 5 no longer includes vertical tab in its
set of whitespace characters. The \v escape that was in the
Perl documentation for a long time was never in fact
recognized. However, the character itself was treated as
whitespace at least up to 5.002. In 5.004 and 5.005 it does
not match \s.

2. PCRE does not allow repeat quantifiers on lookahead
assertions. Perl permits them, but they do not mean what you
might think. For example, (?!a){3} does not assert that the
next three characters are not "a". It just asserts that the
next character is not "a" three times.

3. Capturing subpatterns that occur inside negative
lookahead assertions are counted, but their entries in the
offsets vector are never set. Perl sets its numerical
variables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but
only if the negative lookahead assertion contains just one
branch.

4. Though binary zero characters are supported in the
subject string, they are not allowed in a pattern string
because it is passed as a normal C string, terminated by
zero. The escape sequence "\0" can be used in the pattern to
represent a binary zero.

5. The following Perl escape sequences are not supported: \l,
\u, \L, \U, \E, \Q. In fact these are implemented by Perl's
general string-handling and are not part of its pattern
matching engine.

6. The Perl \G assertion is not supported as it is not
relevant to single pattern matches.

7. Fairly obviously, PCRE does not support the (?{code})
construction.

8. There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated. For example, matching
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
unset. However, if the pattern is changed to
/^(aa(b(b))?)+$/ then $2 (and $3) get set.

In Perl 5.004 $2 is set in both cases, and that is also true
of PCRE. If in the future Perl changes to a consistent state
that is different, PCRE may change to follow.

9. Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
"a", whereas in PCRE it does not. However, in both Perl and
PCRE /^(a)?a/ matched against "a" leaves $1 unset.

10. PCRE provides some extensions to the Perl regular
expression facilities:

(a) Although lookbehind assertions must match fixed length
strings, each alternative branch of a lookbehind assertion
can match a different length of string. Perl 5.005 requires
them all to have the same length.

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
set, the $ metacharacter matches only at the very end of
the string.

(c) If PCRE_EXTRA is set, a backslash followed by a letter
with no special meaning is faulted.

(d) If PCRE_UNGREEDY is set, the greediness of the
repetition quantifiers is inverted, that is, by default they
are not greedy, but if followed by a question mark they
are.

!!LIMITATIONS

There are some size limitations in PCRE but it is hoped that
they will never in practice be relevant. The maximum length
of a compiled pattern is 65539 (sic) bytes. All values in
repeating quantifiers must be less than 65536. The maximum
number of capturing subpatterns is 99. The maximum number of
all parenthesized subpatterns, including capturing
subpatterns, assertions, and other types of subpattern, is
200.

The maximum length of a subject string is the largest
positive number that an integer variable can hold. However,
PCRE uses recursion to handle subpatterns and indefinite
repetition. This means that the available stack space may
limit the size of a subject string that can be processed by
certain patterns.

!!AUTHOR

Philip Hazel
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

Copyright (c) 1997-1999 University of
Cambridge.
----