Blame: pcre(7) - Waikato Linux Users Group

Annotated edit history of pcre(7) version 1, including all changes.

Rev	Author	#	Line
1	perry	1	`PCRE`
		2	`!!!PCRE`
		3	`NAME`
		4	`DESCRIPTION`
		5	`REGULAR EXPRESSION DETAILS`
		6	`BACKSLASH`
		7	`CIRCUMFLEX AND DOLLAR`
		8	`FULL STOP (PERIOD, DOT)`
		9	`SQUARE BRACKETS`
		10	`VERTICAL BAR`
		11	`INTERNAL OPTION SETTING`
		12	`SUBPATTERNS`
		13	`REPETITION`
		14	`BACK REFERENCES`
		15	`ASSERTIONS`
		16	`ONCE-ONLY SUBPATTERNS`
		17	`CONDITIONAL SUBPATTERNS`
		18	`COMMENTS`
		19	`PERFORMANCE`
		20	`DIFFERENCES FROM PERL`
		21	`LIMITATIONS`
		22	`AUTHOR`
		23	`----`
		24	`!!NAME`
		25
		26
		27	`pcre - Perl-compatible regular expressions.`
		28	`!!DESCRIPTION`
		29
		30
		31	`The PCRE library is a set of functions that implement`
		32	`regular expression pattern matching using the same syntax`
		33	`and semantics as Perl 5, with just a few differences (see`
		34	`below). The current implementation corresponds to Perl`
		35	`5.005.`
		36
		37
		38	`This man page describes the regular expressions understood`
		39	`by programs that use PCRE.`
		40	`!!REGULAR EXPRESSION DETAILS`
		41
		42
		43	`The syntax and semantics of the regular expressions`
		44	`supported by PCRE are described below. Regular expressions`
		45	`are also described in the Perl documentation and in a number`
		46	`of other books, some of which have copious examples. Jeffrey`
		47	`Friedl's "Mastering Regular Expressions" (O'Reilly) covers them in great detail.`
		48
		49
		50	`A regular expression is a pattern that is matched against a`
		51	`subject string from left to right. Most characters stand for`
		52	`themselves in a pattern, and match the corresponding`
		53	`characters in the subject. As a trivial example, the`
		54	`pattern`
		55
		56
		57	`The quick brown fox`
		58
		59
		60	`matches a portion of a subject string that is identical to`
		61	`itself. The power of regular expressions comes from the`
		62	`ability to include alternatives and repetitions in the`
		63	`pattern. These are encoded in the pattern by the use of`
		64	`''meta-characters'', which do not stand for themselves`
		65	`but instead are interpreted in some special`
		66	`way.`
		67
		68
		69	`There are two different sets of meta-characters: those that`
		70	`are recognized anywhere in the pattern except within square`
		71	`brackets, and those that are recognized in square brackets.`
		72	`Outside square brackets, the meta-characters are as`
		73	`follows:`
		74
		75
		76	`\ general escape character with several uses; ^ assert start`
		77	`of subject (or line, in multiline mode); $ assert end of`
		78	`subject (or line, in multiline mode); . match any character`
		79	`except newline (by default); [[ start character class`
		80	`definition; | start of alternative branch; ( start subpattern;`
		81	`) end subpattern; ? extends the meaning of (, also 0 or 1`
		82	`quantifier, also quantifier minimizer; * 0 or more quantifier;`
		83	`+ 1 or more quantifier; { start min/max`
		84	`quantifier`
		85
		86
		87	`Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:`
		88
		89
		90	`\ general escape character; ^ negate the class, but only if`
		91	`the first character; - indicates character range; ] terminates`
		92	`the character class`
		93
		94
		95	`The following sections describe the use of each of the`
		96	`meta-characters.`
		97	`!!BACKSLASH`
		98
		99
		100	`The backslash character has several uses. Firstly, if it is`
		101	`followed by a non-alphameric character, it takes away any`
		102	`special meaning that character may have. This use of`
		103	`backslash as an escape character applies both inside and`
		104	`outside character classes.`
		105
		106
		107	`For example, if you want to match a`
		108
		109
		110	`If a pattern is compiled with the PCRE_EXTENDED option,`
		111	`whitespace in the pattern (other than in a character class)`
		112	`and characters between a`
		113
		114
		115	`A second use of backslash provides a way of encoding`
		116	`non-printing characters in patterns in a visible manner.`
		117	`There is no restriction on the appearance of non-printing`
		118	`characters, apart from the binary zero that terminates a`
		119	`pattern, but when a pattern is being prepared by text`
		120	`editing, it is usually easier to use one of the following`
		121	`escape sequences than the binary character it`
		122	`represents:`
		123
		124
		125	`\a alarm, that is, the BEL character (hex 07); \cx`
		126
		127
		128	`The precise effect of`
		129
		130
		131	`After`
		132
		133
		134	`After`
		135
		136
		137	`The handling of a backslash followed by a digit other than 0`
		138	`is complicated. Outside a character class, PCRE reads it and`
		139	`any following digits as a decimal number. If the number is`
		140	`less than 10, or if there have been at least that many`
		141	`previous capturing left parentheses in the expression, the`
		142	`entire sequence is taken as a ''back reference''. A`
		143	`description of how this works is given later, following the`
		144	`discussion of parenthesized subpatterns.`
		145
		146
		147	`Inside a character class, or if the decimal number is`
		148	`greater than 9 and there have not been that many capturing`
		149	`subpatterns, PCRE re-reads up to three octal digits`
		150	`following the backslash, and generates a single byte from`
		151	`the least significant 8 bits of the value. Any subsequent`
		152	`digits stand for themselves. For example:`
		153
		154
		155	`\040 is another way of writing a space; \40 is the same,`
		156	`provided there are fewer than 40 previous capturing`
		157	`subpatterns; \7 is always a back reference; \11 might be a back`
		158	`reference, or another way of writing a tab; \011 is always a`
		159	`tab; \0113 is a tab followed by the character "3"`
		160
		161
		162	`Note that octal values of 100 or greater must not be`
		163	`introduced by a leading zero, because no more than three`
		164	`octal digits are ever read.`
		165
		166
		167	`All the sequences that define a single byte value can be`
		168	`used both inside and outside character classes. In addition,`
		169	`inside a character class, the sequence`
		170
		171
		172	`The third use of backslash is for specifying generic`
		173	`character types:`
		174
		175
		176	`\d any decimal digit; \D any character that is not a decimal`
		177	`digit; \s any whitespace character; \S any character that is not`
		178	`a whitespace character; \w any "word" character; \W any "non-word" character`
		179
		180
		181	`Each pair of escape sequences partitions the complete set of`
		182	`characters into two disjoint sets. Any given character`
		183	`matches one, and only one, of each pair.`
		184
		185
		186	`A`
		187
		188
		189	`These character type sequences can appear both inside and`
		190	`outside character classes. They each match one character of`
		191	`the appropriate type. If the current matching point is at`
		192	`the end of the subject string, all of them fail, since there`
		193	`is no character to match.`
		194
		195
		196	`The fourth use of backslash is for certain simple`
		197	`assertions. An assertion specifies a condition that has to`
		198	`be met at a particular point in a match, without consuming`
		199	`any characters from the subject string. The use of`
		200	`subpatterns for more complicated assertions is described`
		201	`below. The backslashed assertions are`
		202
		203
		204	`\b word boundary; \B not a word boundary; \A start of subject`
		205	`(independent of multiline mode); \Z end of subject or newline`
		206	`at end (independent of multiline mode); \z end of subject`
		207	`(independent of multiline mode)`
		208
		209
		210	`These assertions may not appear in character classes (but`
		211	`note that \b has a different meaning, namely the backspace character, inside a character class).`
		212
		213
		214	`A word boundary is a position in the subject string where`
		215	`the current character and the previous character do not both`
		216	`match \w or \W (i.e. one matches \w and the other matches \W),`
		217	`or the start or end of the string if the first or last`
		218	`character matches \w, respectively.`
		219
		220
		221	`The \A, \Z, and \z assertions differ from the traditional`
		222	`circumflex and dollar (described below) in that they only`
		223	`ever match at the very start and end of the subject string,`
		224	`whatever options are set. They are not affected by the`
		225	`PCRE_NOTBOL or PCRE_NOTEOL options. If the`
		226	`''startoffset'' argument of __pcre_exec()__ is`
		227	`non-zero, \A can never match. The difference between \Z and \z`
		228	`is that \Z matches before a newline that is the last`
		229	`character of the string as well as at the end of the string,`
		230	`whereas \z matches only at the end.`
		231	`!!CIRCUMFLEX AND DOLLAR`
		232
		233
		234	`Outside a character class, in the default matching mode, the`
		235	`circumflex character is an assertion which is true only if`
		236	`the current matching point is at the start of the subject`
		237	`string. If the ''startoffset'' argument of`
		238	`__pcre_exec()__ is non-zero, circumflex can never match.`
		239	`Inside a character class, circumflex has an entirely`
		240	`different meaning (see below).`
		241
		242
		243	`Circumflex need not be the first character of the pattern if`
		244	`a number of alternatives are involved, but it should be the`
		245	`first thing in each alternative in which it appears if the`
		246	`pattern is ever to match that branch. If all possible`
		247	`alternatives start with a circumflex, that is, if the`
		248	`pattern is constrained to match only at the start of the`
		249	`subject, it is said to be an "anchored" pattern.`
		250
		251
		252	`A dollar character is an assertion which is true only if the`
		253	`current matching point is at the end of the subject string,`
		254	`or immediately before a newline character that is the last`
		255	`character in the string (by default). Dollar need not be the`
		256	`last character of the pattern if a number of alternatives`
		257	`are involved, but it should be the last item in any branch`
		258	`in which it appears. Dollar has no special meaning in a`
		259	`character class.`
		260
		261
		262	`The meaning of dollar can be changed so that it matches only`
		263	`at the very end of the string, by setting the`
		264	`PCRE_DOLLAR_ENDONLY option at compile or matching time. This`
		265	`does not affect the \Z assertion.`
		266
		267
		268	`The meanings of the circumflex and dollar characters are`
		269	`changed if the PCRE_MULTILINE option is set. When this is`
		270	`the case, they match immediately after and immediately`
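		271	`before an internal newline character, respectively, in addition to matching at the start and end of the subject string. Consequently, a match for circumflex is possible when the`
		272	`''startoffset''`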
		273	`argument of __pcre_exec()__ is non-zero. The`
		274	`PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is`
		275	`set.`
		276
		277
		278	`Note that the sequences \A, \Z, and \z can be used to match the`
		279	`start and end of the subject in both modes, and if all`
		280	`branches of a pattern start with \A it is always anchored,`
		281	`whether PCRE_MULTILINE is set or not.`
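
The effect of multiline mode on circumflex and dollar can be sketched with Python's re module, used here only as a stand-in: its ^, $ and re.MULTILINE are assumed to behave like PCRE's default and PCRE_MULTILINE modes for this simple case. The subject string is an illustrative choice and does not come from the man page.

```python
import re

subject = "first\nsecond\n"

# Default mode: ^ matches only at the very start, and $ only at the very end
# or just before a newline that ends the subject, so no whole "line" matches.
default_lines = re.findall(r"^\w+$", subject)

# Multiline mode: ^ and $ also match after and before internal newlines.
multiline_lines = re.findall(r"^\w+$", subject, flags=re.MULTILINE)

# Even in default mode, $ matches before a newline that is the last character.
before_final_newline = re.search(r"second$", subject) is not None

print(default_lines, multiline_lines, before_final_newline)
```

Note that Python spells the end-only assertion differently from PCRE, so the \Z and \z distinction described above is not shown here.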
		282	`!!FULL STOP (PERIOD, DOT)`
		283
		284
		285	`Outside a character class, a dot in the pattern matches any`
		286	`one character in the subject, including a non-printing`
		287	`character, but not (by default) newline. If the PCRE_DOTALL`
		288	`option is set, then dots match newlines as well. The`
		289	`handling of dot is entirely independent of the handling of`
		290	`circumflex and dollar, the only relationship being that they`
		291	`both involve newline characters. Dot has no special meaning`
		292	`in a character class.`
		293	`!!SQUARE BRACKETS`
		294
		295
		296	`An opening square bracket introduces a character class,`
		297	`terminated by a closing square bracket. A closing square`
		298	`bracket on its own is not special. If a closing square`
		299	`bracket is required as a member of the class, it should be`
		300	`the first data character in the class (after an initial`
		301	`circumflex, if present) or escaped with a`
		302	`backslash.`
		303
		304
		305	`A character class matches a single character in the subject;`
		306	`the character must be in the set of characters defined by`
		307	`the class, unless the first character in the class is a`
		308	`circumflex, in which case the subject character must not be`
		309	`in the set defined by the class. If a circumflex is actually`
		310	`required as a member of the class, ensure it is not the`
		311	`first character, or escape it with a backslash.`
		312
		313
		314	`For example, the character class [[aeiou] matches any lower`
		315	`case vowel, while [[^aeiou] matches any character that is not`
		316	`a lower case vowel. Note that a circumflex is just a`
		317	`convenient notation for specifying the characters which are`
		318	`in the class by enumerating those that are not. It is not an`
		319	`assertion: it still consumes a character from the subject`
		320	`string, and fails if the current pointer is at the end of`
		321	`the string.`
		322
		323
		324	`When caseless matching is set, any letters in a class`
		325	`represent both their upper case and lower case versions, so`
		326	`for example, a caseless [[aeiou] matches "A" as well as "a".`
		327
		328
		329	`The newline character is never treated in any special way in`
		330	`character classes, whatever the setting of the PCRE_DOTALL`
		331	`or PCRE_MULTILINE options is. A class such as [[^a] will`
		332	`always match a newline.`
		333
		334
		335	`The minus (hyphen) character can be used to specify a range`
		336	`of characters in a character class. For example, [[d-m]`
		337	`matches any letter between d and m, inclusive. If a minus`
		338	`character is required in a class, it must be escaped with a`
		339	`backslash or appear in a position where it cannot be`
		340	`interpreted as indicating a range, typically as the first or`
		341	`last character in the class.`
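
A minimal sketch of the character class rules described so far, written with Python's re module on the assumption that it treats these simple classes the same way as PCRE. The sample subjects are illustrative only.

```python
import re

vowel = re.compile(r"[aeiou]")
not_vowel = re.compile(r"[^aeiou]")
d_to_m = re.compile(r"[d-m]")
hyphen_or_digit = re.compile(r"[-0-9]")  # a leading hyphen is a literal

vowel_hits = vowel.findall("regex")
non_vowel_hits = not_vowel.findall("regex")

# The range d-m is inclusive at both ends.
in_range = [c for c in "cdmn" if d_to_m.fullmatch(c)]

literal_hyphen = hyphen_or_digit.findall("a-1")

# A negated class consumes a character, and that character may be a newline.
negated_matches_newline = not_vowel.fullmatch("\n") is not None

print(vowel_hits, non_vowel_hits, in_range, literal_hyphen, negated_matches_newline)
```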
		342
		343
		344	`It is not possible to have the literal character`
		345
		346
		347	`Ranges operate in ASCII collating sequence. They can also be`
		348	`used for characters specified numerically, for example`
		349	`[[000-037]. If a range that includes letters is used when`
		350	`caseless matching is set, it matches the letters in either`
		351	case. For example, [[W-c] is equivalent to [[][[\^_`wxyzabc],
		352	`matched caselessly, and if character tables for the`
		353
		354
		355	`The character types \d, \D, \s, \S, \w, and \W may also appear in`
		356	`a character class, and add the characters that they match to`
		357	`the class. For example, [[\dABCDEF] matches any hexadecimal`
		358	`digit. A circumflex can conveniently be used with the upper`
		359	`case character types to specify a more restricted set of`
		360	`characters than the matching lower case type. For example,`
		361	`the class [[^\W_] matches any letter or digit, but not`
		362	`underscore.`
		363
		364
		365	`All non-alphameric characters other than \, -, ^ (at the`
		366	`start) and the terminating ] are non-special in character`
		367	`classes, but it does no harm if they are`
		368	`escaped.`
		369	`!!VERTICAL BAR`
		370
		371
		372	`Vertical bar characters are used to separate alternative`
		373	`patterns. For example, the pattern`
		374
		375
		376	`gilbert|sullivan`
		377
		378
		379	`matches either "gilbert" or "sullivan".`
		380	`!!INTERNAL OPTION SETTING`
		381
		382
		383	`The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,`
		384	`and PCRE_EXTENDED can be changed from within the pattern by`
		385	`a sequence of Perl option letters enclosed between "(?" and ")". The option letters are`
		386
		387
		388	`i for PCRE_CASELESS; m for PCRE_MULTILINE; s for PCRE_DOTALL; x`
		389	`for PCRE_EXTENDED`
		390
		391
		392	`For example, (?im) sets caseless, multiline matching. It is`
		393	`also possible to unset these options by preceding the letter`
		394	`with a hyphen, and a combined setting and unsetting such as`
		395	`(?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while`
		396	`unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.`
		397	`If a letter appears both before and after the hyphen, the`
		398	`option is unset.`
		399
		400
		401	`The scope of these option changes depends on where in the`
		402	`pattern the setting occurs. For settings that are outside`
		403	`any subpattern (defined below), the effect is the same as if`
		404	`the options were set or unset at the start of matching. The`
		405	`following patterns all behave in exactly the same`
		406	`way:`
		407
		408
		409	`(?i)abc a(?i)bc ab(?i)c abc(?i)`
		410
		411
		412	`which in turn is the same as compiling the pattern abc with`
		413	`PCRE_CASELESS set. In other words, such`
		414
		415
		416	`If an option change occurs inside a subpattern, the effect`
		417	`is different. This is a change of behaviour in Perl 5.005.`
		418	`An option change inside a subpattern affects only that part`
		419	`of the subpattern that follows it, so`
		420
		421
		422	`(a(?i)b)c`
		423
		424
		425	`matches abc and aBc and no other strings (assuming`
		426	`PCRE_CASELESS is not used). By this means, options can be`
		427	`made to have different settings in different parts of the`
		428	`pattern. Any changes made in one alternative do carry on`
		429	`into subsequent branches within the same subpattern. For`
		430	`example,`
		431
		432
		433	`(a(?i)b|c)`
		434
		435
		436	`matches "ab", "aB", "c", and "C".`
		437
		438
		439	`The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can`
		440	`be changed in the same way as the Perl-compatible options by`
		441	`using the characters U and X respectively. The (?X) flag`
		442	`setting is special in that it must always occur earlier in`
		443	`the pattern than any of the additional features it turns on,`
		444	`even when it is at top level. It is best put at the`
		445	`start.`
		446	`!!SUBPATTERNS`
		447
		448
		449	`Subpatterns are delimited by parentheses (round brackets),`
		450	`which can be nested. Marking part of a pattern as a`
		451	`subpattern does two things:`
		452
		453
		454	`1. It localizes a set of alternatives. For example, the`
		455	`pattern`
		456
		457
		458	`cat(aract|erpillar|)`
		459
		460
		461	`matches one of the words "cat", "cataract", or "caterpillar".`
		462
		463
		464	`2. It sets up the subpattern as a capturing subpattern (as`
		465	`defined above). When the whole pattern matches, that portion`
		466	`of the subject string that matched the subpattern is passed`
		467	`back to the caller via the ''ovector'' argument of`
		468	`__pcre_exec()__. Opening parentheses are counted from`
		469	`left to right (starting from 1) to obtain the numbers of the`
		470	`capturing subpatterns.`
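
The left-to-right numbering of capturing subpatterns can be illustrated as follows. This is a sketch in Python's re module, where the groups() method stands in for the ''ovector'' results of pcre_exec(). The subject string "the red king" is chosen here for illustration.

```python
import re

pattern = re.compile(r"the ((red|white) (king|queen))")
match = pattern.search("the red king")

# Groups are numbered by the position of their opening parenthesis:
# 1 = ((red|white) (king|queen)), 2 = (red|white), 3 = (king|queen)
captured = match.groups()

print(captured)
```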
		471
		472
		473	`For example, if the string`
		474
		475
		476	`the ((red|white) (king|queen))`
		477
		478
		479	`the captured substrings are`
		480
		481
		482	`The fact that plain parentheses fulfil two functions is not`
		483	`always helpful. There are often times when a grouping`
		484	`subpattern is required without a capturing requirement. If`
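		485	`an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, in the pattern`
		486
		487
		488	`the ((?:red|white) (king|queen))`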
		489
		490
		491	`the captured substrings are`
		492
		493
		494	`As a convenient shorthand, if any option settings are`
		495	`required at the start of a non-capturing subpattern, the`
		496	`option letters may appear between the "?" and the ":". Thus the two patterns`
		497
		498
		499	`(?i:saturday|sunday) (?:(?i)saturday|sunday)`
		500
		501
		502	`match exactly the same set of strings. Because alternative`
		503	`branches are tried from left to right, and options are not`
		504	`reset until the end of the subpattern is reached, an option`
		505	`setting in one branch does affect subsequent branches, so`
		506	`the above patterns match "SUNDAY" as well as "Saturday".`
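
The shorthand form with option letters inside a non-capturing subpattern also exists in Python's re module, so it can be tried there as a rough check. Only the scoped form is shown, because recent Python versions reject an option setting such as (?i) that appears anywhere other than the start of the pattern. The pattern `(?i:sat)urday` is a hypothetical extra example.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")
words = ["Saturday", "SUNDAY", "sunday", "monday"]
results = {word: weekend.fullmatch(word) is not None for word in words}

# The option applies only inside the subpattern that sets it.
scoped = re.compile(r"(?i:sat)urday")
inside_scope = scoped.fullmatch("SATurday") is not None
outside_scope = scoped.fullmatch("SATURDAY") is not None

print(results, inside_scope, outside_scope)
```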
		507	`!!REPETITION`
		508
		509
		510	`Repetition is specified by quantifiers, which can follow any`
		511	`of the following items:`
		512
		513
		514	`a single character, possibly escaped; the . metacharacter; a`
		515	`character class; a back reference (see next section); a`
		516	`parenthesized subpattern (unless it is an assertion - see`
		517	`below)`
		518
		519
		520	`The general repetition quantifier specifies a minimum and`
		521	`maximum number of permitted matches, by giving the two`
		522	`numbers in curly brackets (braces), separated by a comma.`
		523	`The numbers must be less than 65536, and the first must be`
		524	`less than or equal to the second. For example:`
		525
		526
		527	`z{2,4}`
		528
		529
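		530	`matches "zz", "zzz", or "zzzz". If the second number is omitted but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus`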
		531
		532
		533	`[[aeiou]{3,}`
		534
		535
		536	`matches at least 3 successive vowels, but may match many`
		537	`more, while`
		538
		539
		540	`\d{8}`
		541
		542
		543	`matches exactly 8 digits. An opening curly bracket that`
		544	`appears in a position where a quantifier is not allowed, or`
		545	`one that does not match the syntax of a quantifier, is taken`
		546	`as a literal character. For example, {,6} is not a`
		547	`quantifier, but a literal string of four`
		548	`characters.`
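
The three quantifier forms above can be checked with a short sketch in Python's re module. The helper `full` and the sample subjects are hypothetical. One known difference is left out: Python treats {,6} as a quantifier meaning {0,6}, whereas the text above describes it as a literal string in PCRE.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four repetitions, inclusive.
z_counts = [n for n in range(7) if full(r"z{2,4}", "z" * n)]

# [aeiou]{3,}: at least three, with no upper limit.
three_or_more = (full(r"[aeiou]{3,}", "aeiou"), full(r"[aeiou]{3,}", "ae"))

# \d{8}: exactly eight digits.
exactly_eight = (full(r"\d{8}", "20240131"), full(r"\d{8}", "2024013"))

print(z_counts, three_or_more, exactly_eight)
```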
		549
		550
		551	`The quantifier {0} is permitted, causing the expression to`
		552	`behave as if the previous item and the quantifier were not`
		553	`present.`
		554
		555
		556	`For convenience (and historical compatibility) the three`
		557	`most common quantifiers have single-character`
		558	`abbreviations:`
		559
		560
		561	`* is equivalent to {0,}; + is equivalent to {1,}; ? is`
		562	`equivalent to {0,1}`
		563
		564
		565	`It is possible to construct infinite loops by following a`
		566	`subpattern that can match no characters with a quantifier`
		567	`that has no upper limit, for example:`
		568
		569
		570	`(a?)*`
		571
		572
		573	`Earlier versions of Perl and PCRE used to give an error at`
		574	`compile time for such patterns. However, because there are`
		575	`cases where this can be useful, such patterns are now`
		576	`accepted, but if any repetition of the subpattern does in`
		577	`fact match no characters, the loop is forcibly`
		578	`broken.`
		579
		580
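		581	`By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times) without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs, which appear between the sequences /* and */. An attempt to match C comments by applying the pattern`
		582
		583
		584	`/\*.*\*/`
		585
		586
		587	`to the string`
		588
		589
		590	`/* first command */ not comment /* second comment`
		591	`*/`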
		592
		593
		594	`fails, because it matches the entire string due to the`
		595	`greediness of the .* item.`
		596
		597
		598	`However, if a quantifier is followed by a question mark,`
		599	`then it ceases to be greedy, and instead matches the minimum`
		600	`number of times possible, so the pattern`
		601
		602
		603	`/\*.*?\*/`
		604
		605
		606	`does the right thing with the C comments. The meaning of the`
		607	`various quantifiers is not otherwise changed, just the`
		608	`preferred number of matches. Do not confuse this use of`
		609	`question mark with its use as a quantifier in its own right.`
		610	`Because it has two uses, it can sometimes appear doubled, as`
		611	`in`
		612
		613
		614	`\d??\d`
		615
		616
		617	`which matches one digit by preference, but can match two if`
		618	`that is the only way the rest of the pattern`
		619	`matches.`
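
The difference between greedy and minimizing quantifiers on the C comment example can be seen in the following sketch. It assumes that Python's re module handles .* and .*? in the same way as PCRE, which holds for this simple case.

```python
import re

source = "/* first command */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject, so one over-long match.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing: .*? stops at the first */ that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
prefers_one = re.match(r"\d??\d", "12").group()
takes_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, prefers_one, takes_two)
```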
		620
		621
		622	`If the PCRE_UNGREEDY option is set (an option which is not`
		623	`available in Perl) then the quantifiers are not greedy by`
		624	`default, but individual ones can be made greedy by following`
		625	`them with a question mark. In other words, it inverts the`
		626	`default behaviour.`
		627
		628
		629	`When a parenthesized subpattern is quantified with a minimum`
		630	`repeat count that is greater than 1 or with a limited`
		631	`maximum, more store is required for the compiled pattern, in`
		632	`proportion to the size of the minimum or`
		633	`maximum.`
		634
		635
		636	`If a pattern starts with .* or .{0,} and the PCRE_DOTALL`
		637	`option (equivalent to Perl's /s) is set, thus allowing the .`
		638	`to match newlines, then the pattern is implicitly anchored,`
		639	`because whatever follows will be tried against every`
		640	`character position in the subject string, so there is no`
		641	`point in retrying the overall match at any position after`
		642	`the first. PCRE treats such a pattern as though it were`
		643	`preceded by \A. In cases where it is known that the subject`
		644	`string contains no newlines, it is worth setting PCRE_DOTALL`
		645	`when the pattern begins with .* in order to obtain this`
		646	`optimization, or alternatively using ^ to indicate anchoring`
		647	`explicitly.`
		648
		649
		650	`When a capturing subpattern is repeated, the value captured`
		651	`is the substring that matched the final iteration. For`
		652	`example, after`
		653
		654
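		655	`(tweedle[[dume]{3}\s*)+`
		656
		657
		658	`has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee". However, if there are nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For example, after`
		659
		660
		661	`/(a|(b))+/`
		662
		663
		664	`matches "aba" the value of the second captured substring is "b".`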
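
A short sketch of how a repeated capturing subpattern reports its final iteration. It is written in Python's re module on the assumption that it keeps captures across iterations in the same way for these two patterns. The subject strings are illustrative.

```python
import re

# The outer group is repeated twice; only the last iteration is reported.
repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# The nested group (b) took part only in the second of three iterations,
# yet its value is still reported after the final iteration.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()

print(last_iteration, nested)
```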
		665	`!!BACK REFERENCES`
		666
		667
		668	`Outside a character class, a backslash followed by a digit`
		669	`greater than 0 (and possibly further digits) is a back`
		670	`reference to a capturing subpattern earlier (i.e. to its`
		671	`left) in the pattern, provided there have been that many`
		672	`previous capturing left parentheses.`
		673
		674
		675	`However, if the decimal number following the backslash is`
		676	`less than 10, it is always taken as a back reference, and`
		677	`causes an error only if there are not that many capturing`
		678	`left parentheses in the entire pattern. In other words, the`
		679	`parentheses that are referenced need not be to the left of`
		680	`the reference for numbers less than 10. See the section`
		681	`entitled "Backslash" above for further details of the handling of digits following a backslash.`
		682
		683
		684	`A back reference matches whatever actually matched the`
		685	`capturing subpattern in the current subject string, rather`
		686	`than anything matching the subpattern itself. So the`
		687	`pattern`
		688
		689
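		690	`(sens|respons)e and \1ibility`
		691
		692
		693	`matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If caseful matching is in force at the time of the back reference, the case of letters is relevant. For example,`
		694
		695
		696	`((?i)rah)\s+\1`
		697
		698
		699	`matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly.`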
		700
		701
		702	`There may be more than one back reference to the same`
		703	`subpattern. If a subpattern has not actually been used in a`
		704	`particular match, then any back references to it always`
		705	`fail. For example, the pattern`
		706
		707
		708	`(a|(bc))\2`
		709
		710
		711	`always fails if it starts to match "a" rather than "bc".`
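
Both back reference rules can be tried in a small sketch: a back reference repeats the text that was actually captured, and a reference to a subpattern that did not take part in the match fails. Python's re module is used on the assumption that it agrees with PCRE on these two patterns. The subjects are illustrative.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
same_word_1 = pair.fullmatch("sense and sensibility") is not None
same_word_2 = pair.fullmatch("response and responsibility") is not None
mixed_words = pair.fullmatch("sense and responsibility") is not None

# Group 2 is set only when the "bc" branch is taken, so \2 fails after "a".
unset = re.compile(r"(a|(bc))\2")
after_a = unset.search("aa") is not None
after_bc = unset.search("bcbc") is not None

print(same_word_1, same_word_2, mixed_words, after_a, after_bc)
```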
		712
		713
		714	`A back reference that occurs inside the parentheses to which`
		715	`it refers fails when the subpattern is first used, so, for`
		716	`example, (a\1) never matches. However, such references can be`
		717	`useful inside repeated subpatterns. For example, the`
		718	`pattern`
		719
		720
		721	`(a|b\1)+`
		722
		723
		724	`matches any number of "a"s and also "aba", "ababbaa" etc.`
		725	`!!ASSERTIONS`
		726
		727
		728	`An assertion is a test on the characters following or`
		729	`preceding the current matching point that does not actually`
		730	`consume any characters. The simple assertions coded as \b, \B,`
		731	`\A, \Z, \z, ^ and $ are described above. More complicated`
		732	`assertions are coded as subpatterns. There are two kinds:`
		733	`those that look ahead of the current position in the subject`
		734	`string, and those that look behind it.`
		735
		736
		737	`An assertion subpattern is matched in the normal way, except`
		738	`that it does not cause the current matching position to be`
		739	`changed. Lookahead assertions start with (?= for positive`
		740	`assertions and (?! for negative assertions. For`
		741	`example,`
		742
		743
		744	`\w+(?=;)`
		745
		746
		747	`matches a word followed by a semicolon, but does not include`
		748	`the semicolon in the match, and`
		749
		750
		751	`foo(?!bar)`
		752
		753
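		754	`matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern`
		755
		756
		757	`(?!foo)bar`
		758
		759
		760	`does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always true when the next three characters are "bar". A lookbehind assertion is needed to achieve this effect.`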
		761
		762
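		763	`Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example,`
		764
		765
		766	`(?<!foo)bar`
		767
		768
		769	`does find an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several alternatives, they do not all have to have the same fixed length. Thus`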
		770
		771
		772	`(?<=bullock|donkey)`
		773
		774
		775	`is permitted, but`
		776
		777
		778	`(?<!dogs?|cats?)`
		779
		780
		781	`causes an error at compile time. Branches that match`
		782	`different length strings are permitted only at the top level`
		783	`of a lookbehind assertion. This is an extension compared`
		784	`with Perl 5.005, which requires all branches to match the`
		785	`same length of string. An assertion such as`
		786
		787
		788	`(?<=ab(c|de))`
		789
		790
		791	`is not permitted, because its single top-level branch can`
		792	`match two different lengths, but it is acceptable if`
		793	`rewritten to use two top-level branches:`
		794
		795
		796	`(?<=abc|abde)`
		797
		798
		799	`The implementation of lookbehind assertions is, for each`
		800	`alternative, to temporarily move the current position back`
		801	`by the fixed width and then try to match. If there are`
		802	`insufficient characters before the current position, the`
		803	`match is deemed to fail. Lookbehinds in conjunction with`
		804	`once-only subpatterns can be particularly useful for`
		805	`matching at the ends of strings; an example is given at the`
		806	`end of the section on once-only subpatterns.`
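
The lookahead and lookbehind forms can be compared side by side in the following sketch. It uses Python's re module, which accepts the same (?= (?! and (?<! notation for these fixed-width cases. Lookbehinds whose alternatives differ in length are left out because Python rejects them. The subject strings are illustrative.

```python
import re

# Positive lookahead: the semicolon is required but is not part of the match.
words_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookahead: "foo" not followed by "bar".
foo_start = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo)bar still finds "bar" after "foo": the assertion looks ahead at "bar".
lookahead_finds_bar = re.search(r"(?!foo)bar", "foobar") is not None

# A negative lookbehind is what actually excludes a preceding "foo".
after_foo = re.search(r"(?<!foo)bar", "foobar") is not None
after_baz = re.search(r"(?<!foo)bar", "bazbar") is not None

print(words_before_semicolon, foo_start, lookahead_finds_bar, after_foo, after_baz)
```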
		807
		808
		809	`Several assertions (of any sort) may occur in succession.`
		810	`For example,`
		811
		812
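		813	`(?<=\d{3})(?<!999)foo`
		814
		815
		816	`matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, then there is a check that the same three characters are not "999". This pattern does`
		817	`''not'' match "foo" preceded by six characters, the first of which are digits and the last three of which are not "999". For example, it doesn't match "123abcfoo". A pattern to do that is`
		818
		819
		820
		821	`(?<=\d{3}...)(?<!999)foo`
		822
		823
		824	`This time the first assertion looks at the preceding six`
		825	`characters, checking that the first three are digits, and`
		826	`then the second assertion checks that the preceding three`
		827	`characters are not "999".`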
		828
		829
		830	`Assertions can be nested in any combination. For`
		831	`example,`
		832
		833
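		834	`(?<=(?<!foo)bar)baz`
		835
		836
		837	`matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while`
		838
		839
		840	`(?<=\d{3}(?!999)...)foo`
		841
		842
		843	`is another pattern which matches "foo" preceded by three digits and any three characters that are not "999".`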
		844
		845
		846	`Assertion subpatterns are not capturing subpatterns, and may`
		847	`not be repeated, because it makes no sense to assert the`
		848	`same thing several times. If any kind of assertion contains`
		849	`capturing subpatterns within it, these are counted for the`
		850	`purposes of numbering the capturing subpatterns in the whole`
		851	`pattern. However, substring capturing is carried out only`
		852	`for positive assertions, because it does not make sense for`
		853	`negative assertions.`
		854
		855
		856	`Assertions count towards the maximum of 200 parenthesized`
		857	`subpatterns.`
		858	`!!ONCE-ONLY SUBPATTERNS`
		859
		860
		861	`With both maximizing and minimizing repetition, failure of`
		862	`what follows normally causes the repeated item to be`
		863	`re-evaluated to see if a different number of repeats allows`
		864	`the rest of the pattern to match. Sometimes it is useful to`
		865	`prevent this, either to change the nature of the match, or`
		866	`to cause it to fail earlier than it otherwise might, when the`
		867	`author of the pattern knows there is no point in carrying`
		868	`on.`
		869
		870
		871	`Consider, for example, the pattern \d+foo when applied to the`
		872	`subject line`
		873
		874
		875	`123456bar`
		876
		877
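		878	`After matching all 6 digits and then failing to match "foo", the normal action of the matcher is to try again with only 5 digits matching the \d+ item, and then with 4, and so on, before ultimately failing. Once-only subpatterns provide the means for specifying that once a portion of the pattern has matched, it is not to be re-evaluated in this way, so the matcher would give up immediately on failing to match "foo" the first time. The notation is another kind of special parenthesis, starting with (?> as in this example:`
		879
		880
		881	`(?>\d+)bar`
		882
		883
		884	`This kind of parenthesis "locks up" the part of the pattern it contains once it has matched, and a failure further into the pattern is prevented from backtracking into it. Backtracking past it to previous items, however, works as normal.`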
		885
		886
		887	`An alternative description is that a subpattern of this type`
		888	`matches the string of characters that an identical`
		889	`standalone pattern would match, if anchored at the current`
		890	`point in the subject string.`
		891
		892
		893	`Once-only subpatterns are not capturing subpatterns. Simple`
		894	`cases such as the above example can be thought of as a`
		895	`maximizing repeat that must swallow everything it can. So,`
		896	`while both \d+ and \d+? are prepared to adjust the number of`
		897	`digits they match in order to make the rest of the pattern`
		898	`match, (?>\d+) can only match an entire sequence of digits.`
		899
		900
		901	`This construction can of course contain arbitrarily`
		902	`complicated subpatterns, and it can be nested.`
		903
		904
		905	`Once-only subpatterns can be used in conjunction with`
		906	`lookbehind assertions to specify efficient matching at the`
		907	`end of the subject string. Consider a simple pattern such`
		908	`as`
		909
		910
		911	`abcd$`
		912
		913
		914	`when applied to a long string which does not match it.`
		915	`Because matching proceeds from left to right, PCRE will look`
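		916	`for each "a" in the subject and then see if what follows matches the rest of the pattern. If the pattern is specified as`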
		917
		918
		919	`^.*abcd$`
		920
		921
		922	`then the initial .* matches the entire string at first, but`
		923	`when this fails, it backtracks to match all but the last`
		924	`character, then all but the last two characters, and so on.`
		925	`Once again the search for`
		926
		927
		928	`^(?>.*)(?<=abcd)`
		929
		930
		931	`then there can be no backtracking for the .* item; it can`
		932	`match only the entire string. The subsequent lookbehind`
		933	`assertion does a single test on the last four characters. If`
		934	`it fails, the match fails immediately. For long strings,`
		935	`this approach makes a significant difference to the`
		936	`processing time.`
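
The "swallow everything it can" behaviour described above can be tried out with a minimal sketch. It uses Python's re module as a stand-in, not PCRE itself; the (?>...) syntax is only accepted by Python 3.11 and later, so on older interpreters the sketch assumes the well-known lookahead-plus-back-reference emulation instead. The name ONCE_ONLY_DIGITS and the test patterns are illustrative choices, not taken from the man page.

```python
import re
import sys

# Python's re accepts the (?>...) once-only syntax from 3.11 onwards.
# On older interpreters, emulate it: a lookahead captures the greedy match
# and a back reference consumes it, so the match can never be shortened.
if sys.version_info >= (3, 11):
    ONCE_ONLY_DIGITS = r"(?>\d+)"
else:
    ONCE_ONLY_DIGITS = r"(?=(\d+))\1"

subject = "123456bar"

# Ordinary greedy repeat: \d+ gives back one digit so the trailing \d can match.
ordinary = re.match(r"\d+\d", subject)

# Once-only version: the digits are locked up, so nothing is left for \d.
once_only = re.match(ONCE_ONLY_DIGITS + r"\d", subject)

# With no conflicting tail, the once-only group simply takes every digit.
swallowed = re.match(ONCE_ONLY_DIGITS + "bar", subject)
```

Here the ordinary repeat still finds a match by giving back a digit, while the once-only form refuses to re-evaluate its repeat and fails outright.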
		937	`!!CONDITIONAL SUBPATTERNS`
		938
		939
		940	`It is possible to cause the matching process to obey a`
		941	`subpattern conditionally or to choose between two`
		942	`alternative subpatterns, depending on the result of an`
		943	`assertion, or whether a previous capturing subpattern`
		944	`matched or not. The two possible forms of conditional`
		945	`subpattern are`
		946
		947
		948	`(?(condition)yes-pattern)`
		949	`(?(condition)yes-pattern|no-pattern)`
		950
		951
		952	`If the condition is satisfied, the yes-pattern is used;`
		953	`otherwise the no-pattern (if present) is used. If there are`
		954	`more than two alternatives in the subpattern, a compile-time`
		955	`error occurs.`
		956
		957
		958	`There are two kinds of condition. If the text between the`
		959	`parentheses consists of a sequence of digits, then the`
		960	`condition is satisfied if the capturing subpattern of that`
		961	`number has previously matched. Consider the following`
		962	`pattern, which contains non-significant white space to make`
		963	`it more readable (assume the PCRE_EXTENDED option) and to`
		964	`divide it into three parts for ease of`
		965	`discussion:`
		966
		967
		968	`( \( )? [[^()]+ (?(1) \) )`
		969
		970
		971	`The first part matches an optional opening parenthesis, and`
		972	`if that character is present, sets it as the first captured`
		973	`substring. The second part matches one or more characters`
		974	`that are not parentheses. The third part is a conditional`
		975	`subpattern that tests whether the first set of parentheses`
		976	`matched or not. If they did, that is, if the subject started`
		977	`with an opening parenthesis, the condition is true, and so`
		978	`the yes-pattern is executed and a closing parenthesis is`
		979	`required. Otherwise, since no-pattern is not present, the`
		980	`subpattern matches nothing. In other words, this pattern`
		981	`matches a sequence of non-parentheses, optionally enclosed`
		982	`in parentheses.`
		983
		984
		985	`If the condition is not a sequence of digits, it must be an`
		986	`assertion. This may be a positive or negative lookahead or`
		987	`lookbehind assertion. Consider this pattern, again`
		988	`containing non-significant white space, and with the two`
		989	`alternatives on the second line:`
		990
		991
		992	`(?(?=[[^a-z]*[[a-z])`
		993	`\d{2}-[[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )`
		994
		995
		996	`The condition is a positive lookahead assertion that matches`
		997	`an optional sequence of non-letters followed by a letter. In`
		998	`other words, it tests for the presence of at least one`
		999	`letter in the subject. If a letter is found, the subject is`
		1000	`matched against the first alternative; otherwise it is`
		1001	`matched against the second. This pattern matches strings in`
		1002	`one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are`
		1003	`letters and dd are digits.`
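
The digit-condition form can be exercised with a short sketch. It assumes Python's re module as a stand-in for PCRE: re accepts the same (?(1)...) syntax for group-number conditions, but it does not accept an assertion as the condition, so only the parenthesis example is shown. The helper name enclosed_or_bare is hypothetical, and re.VERBOSE plays the role of PCRE_EXTENDED.

```python
import re

# The three-part pattern from the text, laid out one part per line.
pattern = re.compile(
    r"""
    ( \( )?      # part 1: optional opening parenthesis, captured as group 1
    [^()]+       # part 2: one or more characters that are not parentheses
    (?(1) \) )   # part 3: require ")" only if group 1 took part in the match
    """,
    re.VERBOSE,
)


def enclosed_or_bare(text):
    """Return True when the whole of text matches the pattern."""
    return pattern.fullmatch(text) is not None
```

Under these assumptions, "(abc)" and "abc" are accepted in full, while "(abc" and "abc)" are rejected because the closing parenthesis is demanded exactly when the opening one was matched.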
		1004	`!!COMMENTS`
		1005
		1006
		1007	`The sequence (?# marks the start of a comment which`
		1008	`continues up to the next closing parenthesis. Nested`
		1009	`parentheses are not permitted. The characters that make up a`
		1010	`comment play no part in the pattern matching at`
		1011	`all.`
		1012
		1013
		1014	`If the PCRE_EXTENDED option is set, an unescaped # character`
		1015	`outside a character class introduces a comment that`
		1016	`continues up to the next newline character in the`
		1017	`pattern.`
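
Both comment styles can be sketched as follows, again assuming Python's re module as a stand-in for PCRE, with re.VERBOSE standing in for PCRE_EXTENDED. The day and month patterns are made-up examples, not part of the man page.

```python
import re

# Inline comments: everything from (?# up to the next ")" is ignored.
inline = re.compile(r"\d{2}(?# two-digit day)-\d{2}(?# two-digit month)")

# Extended mode: an unescaped # outside a character class starts a comment
# that runs to the end of the line; inside a class, # is an ordinary literal.
extended = re.compile(
    r"""
    \d{2}   # two-digit day
    [#]     # a literal "#", protected by the character class
    \d{2}   # two-digit month
    """,
    re.VERBOSE,
)
```

The first pattern behaves exactly like \d{2}-\d{2}, and the second like \d{2}#\d{2}; the comment text takes no part in matching.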
		1018	`!!PERFORMANCE`
		1019
		1020
		1021	`Certain items that may appear in patterns are more efficient`
		1022	`than others. It is more efficient to use a character class`
		1023	`like [[aeiou] than a set of alternatives such as (a|e|i|o|u).`
		1024	`In general, the simplest construction that provides the`
		1025	`required behaviour is usually the most efficient. Jeffrey`
		1026	`Friedl's book contains a lot of discussion about optimizing`
		1027	`regular expressions for efficient performance.`
		1028
		1029
		1030	`When a pattern begins with .* and the PCRE_DOTALL option is`
		1031	`set, the pattern is implicitly anchored by PCRE, since it`
		1032	`can match only at the start of a subject string. However, if`
		1033	`PCRE_DOTALL is not set, PCRE cannot make this optimization,`
		1034	`because the . metacharacter does not then match a newline,`
		1035	`and if the subject string contains newlines, the pattern may`
		1036	`match from the character immediately following one of them`
		1037	`instead of from the very start. For example, the`
		1038	`pattern`
		1039
		1040
		1041	`(.*) second`
		1042
		1043
		1044	`matches the subject`
		1045
		1046
		1047	`If you are using such a pattern with subject strings that do`
		1048	`not contain newlines, the best performance is obtained by`
		1049	`setting PCRE_DOTALL, or starting the pattern with ^.* to`
		1050	`indicate explicit anchoring. That saves PCRE from having to`
		1051	`scan along the subject looking for a newline to restart`
		1052	`at.`
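
The effect of the dot-matches-newline setting on where a leading .* can match may be seen in a small sketch. It assumes Python's re module as a stand-in for PCRE, with re.DOTALL playing the role of PCRE_DOTALL; the subject string is a hypothetical two-line example chosen for illustration.

```python
import re

# Hypothetical subject: two lines, with " second" only on the second line.
subject = "first\nand second"

# Without DOTALL, "." stops at the newline, so the match has to begin
# at the character immediately following it.
plain = re.search(r"(.*) second", subject)

# With DOTALL, "." also matches the newline, so the match begins at offset 0.
dotall = re.search(r"(.*) second", subject, re.DOTALL)
```

In the first case the matcher has to scan forward to the character after the newline before it can succeed; in the second the pattern can only ever match from the start of the subject, which is what makes implicit anchoring possible.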
		1053
		1054
		1055	`Beware of patterns that contain nested indefinite repeats.`
		1056	`These can take a long time to run when applied to a string`
		1057	`that does not match. Consider the pattern`
		1058	`fragment`
		1059
		1060
		1061	`(a+)*`
		1062
		1063
		1064	`This can match`
		1065
		1066
		1067	`An optimization catches some of the simpler cases such`
		1068	`as`
		1069
		1070
		1071	`(a+)*b`
		1072
		1073
		1074	`where a literal character follows. Before embarking on the`
		1075	`standard matching procedure, PCRE checks that there is a`
		1076
		1077
		1078	`(a+)*\d`
		1079
		1080
		1081	`with the pattern above. The former gives a failure almost`
		1082	`instantly when applied to a whole line of`
		1083	`!!DIFFERENCES FROM PERL`
		1084
		1085
		1086	`The differences described here are with respect to Perl`
		1087	`5.005.`
		1088
		1089
		1090	`1. By default, a whitespace character is any character that`
		1091	`the C library function __isspace()__ recognizes, though`
		1092	`it is possible to compile PCRE with alternative character`
		1093	`type tables. Normally __isspace()__ matches space,`
		1094	`formfeed, newline, carriage return, horizontal tab, and`
		1095	`vertical tab. Perl 5 no longer includes vertical tab in its`
		1096	`set of whitespace characters. The \v escape that was in the`
		1097	`Perl documentation for a long time was never in fact`
		1098	`recognized. However, the character itself was treated as`
		1099	`whitespace at least up to 5.002. In 5.004 and 5.005 it does`
		1100	`not match \s.`
		1101
		1102
		1103	`2. PCRE does not allow repeat quantifiers on lookahead`
		1104	`assertions. Perl permits them, but they do not mean what you`
		1105	`might think. For example, (?!a){3} does not assert that the`
		1106	`next three characters are not`
		1107
		1108
		1109	`3. Capturing subpatterns that occur inside negative`
		1110	`lookahead assertions are counted, but their entries in the`
		1111	`offsets vector are never set. Perl sets its numerical`
		1112	`variables from any such patterns that are matched before the`
		1113	`assertion fails to match something (thereby succeeding), but`
		1114	`only if the negative lookahead assertion contains just one`
		1115	`branch.`
		1116
		1117
		1118	`4. Though binary zero characters are supported in the`
		1119	`subject string, they are not allowed in a pattern string`
		1120	`because it is passed as a normal C string, terminated by`
		1121	`zero. The escape sequence`
		1122
		1123
		1124	`5. The following Perl escape sequences are not supported: \l,`
		1125	`\u, \L, \U, \E, \Q. In fact these are implemented by Perl's`
		1126	`general string-handling and are not part of its pattern`
		1127	`matching engine.`
		1128
		1129
		1130	`6. The Perl \G assertion is not supported as it is not`
		1131	`relevant to single pattern matches.`
		1132
		1133
		1134	`7. Fairly obviously, PCRE does not support the (?{code})`
		1135	`construction.`
		1136
		1137
		1138	`8. There are at the time of writing some oddities in Perl`
		1139	`5.005_02 concerned with the settings of captured strings`
		1140	`when part of a pattern is repeated. For example, matching`
		1141
		1142
		1143	`In Perl 5.004 $2 is set in both cases, and that is also true`
		1144	`of PCRE. If in the future Perl changes to a consistent state`
		1145	`that is different, PCRE may change to follow.`
		1146
		1147
		1148	`9. Another as yet unresolved discrepancy is that in Perl`
		1149	`5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string`
		1150
		1151
		1152	`10. PCRE provides some extensions to the Perl regular`
		1153	`expression facilities:`
		1154
		1155
		1156	`(a) Although lookbehind assertions must match fixed length`
		1157	`strings, each alternative branch of a lookbehind assertion`
		1158	`can match a different length of string. Perl 5.005 requires`
		1159	`them all to have the same length.`
		1160
		1161
		1162	`(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not`
		1163	`set, the $ metacharacter matches only at the very end of`
		1164	`the string.`
		1165
		1166
		1167	`(c) If PCRE_EXTRA is set, a backslash followed by a letter`
		1168	`with no special meaning is faulted.`
		1169
		1170
		1171	`(d) If PCRE_UNGREEDY is set, the greediness of the`
		1172	`repetition quantifiers is inverted, that is, by default they`
		1173	`are not greedy, but if followed by a question mark they`
		1174	`are.`
		1175	`!!LIMITATIONS`
		1176
		1177
		1178	`There are some size limitations in PCRE but it is hoped that`
		1179	`they will never in practice be relevant. The maximum length`
		1180	`of a compiled pattern is 65539 (sic) bytes. All values in`
		1181	`repeating quantifiers must be less than 65536. The maximum`
		1182	`number of capturing subpatterns is 99. The maximum number of`
		1183	`all parenthesized subpatterns, including capturing`
		1184	`subpatterns, assertions, and other types of subpattern, is`
		1185	`200.`
		1186
		1187
		1188	`The maximum length of a subject string is the largest`
		1189	`positive number that an integer variable can hold. However,`
		1190	`PCRE uses recursion to handle subpatterns and indefinite`
		1191	`repetition. This means that the available stack space may`
		1192	`limit the size of a subject string that can be processed by`
		1193	`certain patterns.`
		1194	`!!AUTHOR`
		1195
		1196
		1197	`Philip Hazel`
		1198	`University Computing Service,`
		1199	`New Museums Site,`
		1200	`Cambridge CB2 3QG, England.`
		1201	`Phone: +44 1223 334714`
		1202
		1203
		1204	`Copyright (c) 1997-1999 University of`
		1205	`Cambridge.`
		1206	`----`

This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.

Last edited on Monday, June 3, 2002 6:56:25 pm by "perry"
