version 1, including all changes.
.
Rev |
Author |
# |
Line |
1 |
perry |
1 |
REGEX |
|
|
2 |
!!!REGEX |
|
|
3 |
NAME |
|
|
4 |
DESCRIPTION |
|
|
5 |
SEE ALSO |
|
|
6 |
BUGS |
|
|
7 |
AUTHOR |
|
|
8 |
---- |
|
|
9 |
!!NAME |
|
|
10 |
|
|
|
11 |
|
|
|
12 |
regex - POSIX 1003.2 regular expressions |
|
|
13 |
!!DESCRIPTION |
|
|
14 |
|
|
|
15 |
|
|
|
16 |
Regular expressions (``RE''s), as defined in POSIX 1003.2, |
|
|
17 |
come in two forms: modern REs (roughly those of |
|
|
18 |
''egrep''; 1003.2 calls these ``extended'' REs) and |
|
|
19 |
obsolete REs (roughly those of ed(1); 1003.2 |
|
|
20 |
``basic'' REs). Obsolete REs mostly exist for backward |
|
|
21 |
compatibility in some old programs; they will be discussed |
|
|
22 |
at the end. 1003.2 leaves some aspects of RE syntax and |
|
|
23 |
semantics open; `†' marks decisions on these aspects that may |
|
|
24 |
not be fully portable to other 1003.2 |
|
|
25 |
implementations. |
|
|
26 |
|
|
|
27 |
|
|
|
28 |
A (modern) RE is one or more non-empty ''branches'', |
|
|
29 |
separated by `|'. It matches anything that matches one of |
|
|
30 |
the branches. |
|
|
31 |
|
|
|
32 |
|
|
|
33 |
A branch is one or more ''pieces'', concatenated. It |
|
|
34 |
matches a match for the first, followed by a match for the |
|
|
35 |
second, etc. |
|
|
36 |
|
|
|
37 |
|
|
|
38 |
A piece is an ''atom'' possibly followed by a single `*', |
|
|
39 |
`+', `?', or ''bound''. An atom followed by `*' matches a |
|
|
40 |
sequence of 0 or more matches of the atom. An atom followed |
|
|
41 |
by `+' matches a sequence of 1 or more matches of the atom. |
|
|
42 |
An atom followed by `?' matches a sequence of 0 or 1 matches |
|
|
43 |
of the atom. |
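The three repetition operators can be tried out with a short sketch. It uses Python's re module purely for illustration; that module is not a POSIX 1003.2 implementation, and it is only assumed here to agree with modern REs on these simple patterns. The helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the whole of text matches pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# `*': a sequence of 0 or more matches of the atom `b'
print(matches_whole("ab*c", "ac"), matches_whole("ab*c", "abbbc"))
# `+': a sequence of 1 or more matches, so `ac' is rejected
print(matches_whole("ab+c", "ac"), matches_whole("ab+c", "abc"))
# `?': a sequence of 0 or 1 matches, so `abbc' is rejected
print(matches_whole("ab?c", "abc"), matches_whole("ab?c", "abbc"))
```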
|
|
44 |
|
|
|
45 |
|
|
|
46 |
A ''bound'' is `{' followed by an unsigned decimal |
|
|
47 |
integer, possibly followed by `,' possibly followed by |
|
|
48 |
another unsigned decimal integer, always followed by `}'. |
|
|
49 |
The integers must lie between 0 and RE_DUP_MAX (255) |
|
|
50 |
inclusive, and if there are two of them, the first may not |
|
|
51 |
exceed the second. An atom followed by a bound containing |
|
|
52 |
one integer ''i'' and no comma matches a sequence of |
|
|
53 |
exactly ''i'' matches of the atom. An atom followed by a |
|
|
54 |
bound containing one integer ''i'' and a comma matches a |
|
|
55 |
sequence of ''i'' or more matches of the atom. An atom |
|
|
56 |
followed by a bound containing two integers ''i'' and |
|
|
57 |
''j'' matches a sequence of ''i'' through ''j'' |
|
|
58 |
(inclusive) matches of the atom. |
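The rules for a bound can be restated as a small checker. This is a minimal sketch, not part of any regex library: the function name parse_bound is hypothetical, and the limit of 255 is the RE_DUP_MAX value quoted above.

```python
import re

RE_DUP_MAX = 255  # value quoted in the text above

def parse_bound(bound):
    """Parse `{i}', `{i,}' or `{i,j}' and return (minimum, maximum).

    maximum is None for the open-ended form `{i,}'.
    Raises ValueError if the bound breaks the rules described above.
    """
    m = re.fullmatch(r"\{([0-9]+)(,([0-9]*))?\}", bound)
    if m is None:
        raise ValueError("not a bound: %r" % bound)
    low = int(m.group(1))
    if m.group(2) is None:        # `{i}': exactly i matches
        high = low
    elif m.group(3) == "":        # `{i,}': i or more matches
        high = None
    else:                         # `{i,j}': i through j matches
        high = int(m.group(3))
    if low > RE_DUP_MAX or (high is not None and high > RE_DUP_MAX):
        raise ValueError("integer exceeds RE_DUP_MAX: %r" % bound)
    if high is not None and low > high:
        raise ValueError("first integer exceeds second: %r" % bound)
    return (low, high)

print(parse_bound("{3}"), parse_bound("{2,}"), parse_bound("{2,5}"))
```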
|
|
59 |
|
|
|
60 |
|
|
|
61 |
An atom is a regular expression enclosed in `()' (matching a |
|
|
62 |
match for the regular expression), an empty set of `()' |
|
|
63 |
(matching the null string), a ''bracket expression'' (see |
|
|
64 |
below), `.' (matching any single character), `^' (matching |
|
|
65 |
the null string at the beginning of a line), `$' (matching |
|
|
66 |
the null string at the end of a line), a `\' followed by one |
|
|
67 |
of the characters `^.[[$()|*+?{\' (matching that character |
|
|
68 |
taken as an ordinary character), a `\' followed by any other |
|
|
69 |
character (matching that character taken as an ordinary |
|
|
70 |
character, as if the `\' had not been present), or a single |
|
|
71 |
character with no other significance (matching that |
|
|
72 |
character). A `{' followed by a character other than a digit |
|
|
73 |
is an ordinary character, not the beginning of a bound. It |
|
|
74 |
is illegal to end an RE with `\'. |
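When arbitrary text has to be matched literally inside a modern RE, each special character needs a backslash in front of it. The sketch below assumes the special characters are `^.[$()|*+?{` and the backslash itself; the helper name quote_literal is hypothetical.

```python
# Characters that are special outside a bracket expression, plus the backslash.
SPECIAL = "^.[$()|*+?{\\"

def quote_literal(text):
    """Prefix every special character in text with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

print(quote_literal("1+1=2?"))   # 1\+1=2\?
```

Because every backslash in the input is doubled, the result never ends in a lone backslash.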
|
|
75 |
|
|
|
76 |
|
|
|
77 |
A ''bracket expression'' is a list of characters enclosed |
|
|
78 |
in `[[]'. It normally matches any single character from the |
|
|
79 |
list (but see below). If the list begins with `^', it |
|
|
80 |
matches any single character (but see below) ''not'' from |
|
|
81 |
the rest of the list. If two characters in the list are |
|
|
82 |
separated by `-', this is shorthand for the full |
|
|
83 |
''range'' of characters between those two (inclusive) in |
|
|
84 |
the collating sequence, e.g. `[[0-9]' in ASCII matches any |
|
|
85 |
decimal digit. It is illegal for two ranges to share an |
|
|
86 |
endpoint, e.g. `a-c-e'. Ranges are very |
|
|
87 |
collating-sequence-dependent, and portable programs should |
|
|
88 |
avoid relying on them. |
|
|
89 |
|
|
|
90 |
|
|
|
91 |
To include a literal `]' in the list, make it the first |
|
|
92 |
character (following a possible `^'). To include a literal |
|
|
93 |
`-', make it the first or last character, or the second |
|
|
94 |
endpoint of a range. To use a literal `-' as the first |
|
|
95 |
endpoint of a range, enclose it in `[[.' and `.]' to make it |
|
|
96 |
a collating element (see below). With the exception of these |
|
|
97 |
and some combinations using `[[' (see next paragraphs), all |
|
|
98 |
other special characters, including `\', lose their special |
|
|
99 |
significance within a bracket expression. |
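The placement rules for `]', `-' and `^' can be applied mechanically when a bracket expression is built from a set of characters. The following is a minimal sketch under that assumption; make_bracket is a hypothetical helper, and the final check uses Python's re module, which is not a POSIX implementation (among other differences, it treats a backslash inside brackets as special).

```python
import re

def make_bracket(chars, negate=False):
    """Build a bracket expression matching any one character of chars.

    `]' is placed first and `-' last, as described above. A `^' is kept
    away from the front unless the expression is meant to be negated.
    """
    members = set(chars)
    # Sorted order contains no `-', so no accidental ranges are formed.
    rest = "".join(sorted(members - set("]-^")))
    body = ("]" if "]" in members else "") + rest
    body += "^" if "^" in members else ""
    body += "-" if "-" in members else ""
    if not negate and body.startswith("^"):
        body = body[1:] + "^"      # `^-' becomes `-^'
    if body == "" or (body == "^" and not negate):
        raise ValueError("cannot be written as a plain list")
    return "[" + ("^" if negate else "") + body + "]"

print(make_bracket("a]-"))                       # []a-]
print(re.fullmatch(make_bracket("a]-"), "]") is not None)
```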
|
|
100 |
|
|
|
101 |
|
|
|
102 |
Within a bracket expression, a collating element (a |
|
|
103 |
character, a multi-character sequence that collates as if it |
|
|
104 |
were a single character, or a collating-sequence name for |
|
|
105 |
either) enclosed in `[[.' and `.]' stands for the sequence of |
|
|
106 |
characters of that collating element. The sequence is a |
|
|
107 |
single element of the bracket expression's list. A bracket |
|
|
108 |
expression containing a multi-character collating element |
|
|
109 |
can thus match more than one character, e.g. if the |
|
|
110 |
collating sequence includes a `ch' collating element, then |
|
|
111 |
the RE `[[[[.ch.]]*c' matches the first five characters of |
|
|
112 |
`chchcc'. |
|
|
113 |
|
|
|
114 |
|
|
|
115 |
Within a bracket expression, a collating element enclosed in |
|
|
116 |
`[[=' and `=]' is an equivalence class, standing for the |
|
|
117 |
sequences of characters of all collating elements equivalent |
|
|
118 |
to that one, including itself. (If there are no other |
|
|
119 |
equivalent collating elements, the treatment is as if the |
|
|
120 |
enclosing delimiters were `[[.' and `.]'.) For example, if o |
|
|
121 |
and ô are the members of an equivalence class, then |
|
|
122 |
`[[[[=o=]]', `[[[[=ô=]]', and `[[oô]' are all synonymous. An |
|
|
123 |
equivalence class may not be an endpoint of a |
|
|
124 |
range. |
|
|
125 |
|
|
|
126 |
|
|
|
127 |
Within a bracket expression, the name of a ''character |
|
|
128 |
class'' enclosed in `[[:' and `:]' stands for the list of |
|
|
129 |
all characters belonging to that class. Standard character |
|
|
130 |
class names are: |
|
|
131 |
|
|
|
132 |
|
|
|
133 |
alnum digit punct |
|
|
134 |
alpha graph space |
|
|
135 |
blank lower upper |
|
|
136 |
cntrl print xdigit |
|
|
137 |
|
|
|
138 |
|
|
|
139 |
These stand for the character classes defined in |
|
|
140 |
ctype(3). A locale may provide others. A character |
|
|
141 |
class may not be used as an endpoint of a |
|
|
142 |
range. |
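The twelve standard class names can be written out explicitly for one particular case. The table below is a sketch that assumes the C/POSIX locale and the ASCII character set, which the text above does not require; other locales may assign different members. The names CLASSES and class_members are hypothetical.

```python
import string

# Membership of each standard class, assuming ASCII and the C/POSIX locale.
CLASSES = {
    "alpha":  set(string.ascii_letters),
    "digit":  set(string.digits),
    "upper":  set(string.ascii_uppercase),
    "lower":  set(string.ascii_lowercase),
    "xdigit": set(string.hexdigits),
    "punct":  set(string.punctuation),
    "space":  set(" \t\n\v\f\r"),
    "blank":  set(" \t"),
    "cntrl":  {chr(c) for c in range(32)} | {chr(127)},
    "graph":  {chr(c) for c in range(33, 127)},
    "print":  {chr(c) for c in range(32, 127)},
}
CLASSES["alnum"] = CLASSES["alpha"] | CLASSES["digit"]

def class_members(name):
    """Return the set of characters in the named class."""
    if name not in CLASSES:
        raise ValueError("unknown character class: %r" % name)
    return CLASSES[name]

print(sorted(class_members("blank")))
print("_" in class_members("alnum"), "_" in class_members("punct"))
```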
|
|
143 |
|
|
|
144 |
|
|
|
145 |
There are two special cases of bracket expressions: the |
|
|
146 |
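bracket expressions `[[[[:<:]]' and `[[[[:>:]]' match the null string at |
|
|
147 |
the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters. A word character is an ''alnum'' character (as defined by |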
|
|
148 |
ctype(3)) or an underscore. This is an extension, |
|
|
149 |
compatible with but not specified by POSIX 1003.2, and |
|
|
150 |
should be used with caution in software intended to be |
|
|
151 |
portable to other systems. |
|
|
152 |
|
|
|
153 |
|
|
|
154 |
In the event that an RE could match more than one substring |
|
|
155 |
of a given string, the RE matches the one starting earliest |
|
|
156 |
in the string. If the RE could match more than one substring |
|
|
157 |
starting at that point, it matches the longest. |
|
|
158 |
Subexpressions also match the longest possible substrings, |
|
|
159 |
subject to the constraint that the whole match be as long as |
|
|
160 |
possible, with subexpressions starting earlier in the RE |
|
|
161 |
taking priority over ones starting later. Note that |
|
|
162 |
higher-level subexpressions thus take priority over their |
|
|
163 |
lower-level component subexpressions. |
|
|
164 |
|
|
|
165 |
|
|
|
166 |
Match lengths are measured in characters, not collating |
|
|
167 |
elements. A null string is considered longer than no match |
|
|
168 |
at all. For example, `bb*' matches the three middle |
|
|
169 |
characters of `abbbc', `(wee|week)(knights|nights)' matches |
|
|
170 |
all ten characters of `weeknights', when `(.*).*' is matched |
|
|
171 |
against `abc' the parenthesized subexpression matches all |
|
|
172 |
three characters, and when `(a*)*' is matched against `bc' |
|
|
173 |
both the whole RE and the parenthesized subexpression match |
|
|
174 |
the null string. |
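The rule for choosing the overall match (earliest start, then greatest length) can be spelled out by brute force. The sketch below is an illustration only: leftmost_longest is a hypothetical helper, it relies on Python's re module to decide whether a given substring matches, and it ignores anchors and the rules for subexpressions. Python's own search is not POSIX-style and can report a shorter match, as the last line shows.

```python
import re

def leftmost_longest(pattern, text):
    """Return (start, end) of the earliest, then longest, match, or None."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):            # earliest start first
        for end in range(len(text), start - 1, -1):   # longest candidate first
            if compiled.fullmatch(text[start:end]):
                return (start, end)
    return None

print(leftmost_longest("bb*", "abbbc"))                              # (1, 4)
print(leftmost_longest("(wee|week)(knights|nights)", "weeknights"))  # (0, 10)
print(leftmost_longest("a*", "bc"))                                  # (0, 0)
# Python prefers the first alternative that works; the rule above prefers length.
print(re.search("a|ab", "xabc").span(), leftmost_longest("a|ab", "xabc"))
```

The empty match reported for `a*' against `bc' reflects the statement that a null string is considered longer than no match at all.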
|
|
175 |
|
|
|
176 |
|
|
|
177 |
If case-independent matching is specified, the effect is |
|
|
178 |
much as if all case distinctions had vanished from the |
|
|
179 |
alphabet. When an alphabetic that exists in multiple cases |
|
|
180 |
appears as an ordinary character outside a bracket |
|
|
181 |
expression, it is effectively transformed into a bracket |
|
|
182 |
expression containing both cases, e.g. `x' becomes `[[xX]'. |
|
|
183 |
When it appears inside a bracket expression, all case |
|
|
184 |
counterparts of it are added to the bracket expression, so |
|
|
185 |
that (e.g.) `[[x]' becomes `[[xX]' and `[[^x]' becomes |
|
|
186 |
`[[^xX]'. |
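The transformation described for ordinary characters can be sketched for the simplest case, a literal string. The helper name caseless is hypothetical, only ASCII letters are treated as having two cases, and Python's re module stands in for a POSIX matcher when the result is tried out.

```python
import re
import string

def caseless(literal):
    """Rewrite a literal string so that it matches in either case."""
    out = []
    for ch in literal:
        if ch in string.ascii_letters:
            # An alphabetic becomes a bracket expression holding both cases.
            out.append("[" + ch.lower() + ch.upper() + "]")
        else:
            out.append(re.escape(ch))
    return "".join(out)

print(caseless("x"))   # [xX]
print(re.fullmatch(caseless("Regex"), "rEGEX") is not None)
```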
|
|
187 |
|
|
|
188 |
|
|
|
189 |
No particular limit is imposed on the length of REs. |
|
|
190 |
Programs intended to be portable should not employ REs |
|
|
191 |
longer than 256 bytes, as an implementation can refuse to |
|
|
192 |
accept such REs and remain POSIX-compliant. |
|
|
193 |
|
|
|
194 |
|
|
|
195 |
Obsolete (``basic'') regular expressions differ in several |
|
|
196 |
respects. `|', `+', and `?' are ordinary characters and |
|
|
197 |
there is no equivalent for their functionality. The |
|
|
198 |
delimiters for bounds are `\{' and `\}', with `{' and `}' by |
|
|
199 |
themselves ordinary characters. The parentheses for nested |
|
|
200 |
subexpressions are `\(' and `\)', with `(' and `)' by |
|
|
201 |
themselves ordinary characters. `^' is an ordinary character |
|
|
202 |
except at the beginning of the RE or the beginning of a |
|
|
203 |
parenthesized subexpression, `$' is an ordinary character |
|
|
204 |
except at the end of the RE or the end of a parenthesized |
|
|
205 |
subexpression, and `*' is an ordinary character if it |
|
|
206 |
appears at the beginning of the RE or the beginning of a |
|
|
207 |
parenthesized subexpression (after a possible leading `^'). |
|
|
208 |
Finally, there is one new type of atom, a ''back |
|
|
209 |
reference'': `\' followed by a non-zero decimal digit |
|
|
210 |
''d'' matches the same sequence of characters matched by |
|
|
211 |
the ''d''th parenthesized subexpression (numbering |
|
|
212 |
subexpressions by the positions of their opening |
|
|
213 |
parentheses, left to right), so that (e.g.) `\([[bc]\)\1' matches `bb' |
|
|
214 |
or `cc' but not `bc'. |
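A back reference that accepts a doubled `b' or `c' can be tried out as follows. This sketch uses Python's re module, which is not a POSIX implementation: it writes the group with plain parentheses, where an obsolete RE would use backslashed ones, but it is assumed to treat the back reference itself in the same way. The name doubled is hypothetical.

```python
import re

# One character from `bc', followed by whatever the first group matched.
doubled = re.compile(r"([bc])\1")

for candidate in ("bb", "cc", "bc"):
    print(candidate, doubled.fullmatch(candidate) is not None)
```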
|
|
215 |
!!SEE ALSO |
|
|
216 |
|
|
|
217 |
|
|
|
218 |
regex(3) |
|
|
219 |
|
|
|
220 |
|
|
|
221 |
POSIX 1003.2, section 2.8 (Regular Expression |
|
|
222 |
Notation). |
|
|
223 |
!!BUGS |
|
|
224 |
|
|
|
225 |
|
|
|
226 |
Having two kinds of REs is a botch. |
|
|
227 |
|
|
|
228 |
|
|
|
229 |
The current 1003.2 spec says that `)' is an ordinary |
|
|
230 |
character in the absence of an unmatched `('; this was an |
|
|
231 |
unintentional result of a wording error, and change is |
|
|
232 |
likely. Avoid relying on it. |
|
|
233 |
|
|
|
234 |
|
|
|
235 |
Back references are a dreadful botch, posing major problems |
|
|
236 |
for efficient implementations. They are also somewhat |
|
|
237 |
vaguely defined (does `a\(\(b\)*\2\)*d' match `abbbd'?). Avoid using |
|
|
238 |
them. |
|
|
239 |
|
|
|
240 |
|
|
|
241 |
1003.2's specification of case-independent matching is |
|
|
242 |
vague. The ``one case implies all cases'' definition given |
|
|
243 |
above is current consensus among implementors as to the |
|
|
244 |
right interpretation. |
|
|
245 |
|
|
|
246 |
|
|
|
247 |
The syntax for word boundaries is incredibly |
|
|
248 |
ugly. |
|
|
249 |
!!AUTHOR |
|
|
250 |
|
|
|
251 |
|
|
|
252 |
This page was taken from Henry Spencer's regex |
|
|
253 |
package. |
|
|
254 |
---- |