Version 1 (author: perry), including all changes.

REGEX
!!!REGEX
NAME
DESCRIPTION
SEE ALSO
BUGS
AUTHOR
----
!!NAME

regex - POSIX 1003.2 regular expressions
!!DESCRIPTION

Regular expressions (``RE''s), as defined in POSIX 1003.2,
come in two forms: modern REs (roughly those of
''egrep''; 1003.2 calls these ``extended'' REs) and
obsolete REs (roughly those of ed(1); 1003.2
``basic'' REs). Obsolete REs mostly exist for backward
compatibility in some old programs; they are discussed
at the end. 1003.2 leaves some aspects of RE syntax and
semantics open; the decisions made here on those aspects may
not be fully portable to other 1003.2
implementations.

A (modern) RE is one or more non-empty ''branches'',
separated by `|'. It matches anything that matches one of
the branches.

A branch is one or more ''pieces'', concatenated. It
matches a match for the first, followed by a match for the
second, etc.

A piece is an ''atom'' possibly followed by a single `*',
`+', `?', or ''bound''. An atom followed by `*' matches a
sequence of 0 or more matches of the atom. An atom followed
by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a sequence of 0 or 1 matches
of the atom.
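
The rules for branches and pieces can be tried out with a small sketch. It uses Python's re module purely for convenience; that is an assumption of this example, since re is a Perl-style engine and not a 1003.2 implementation, although it agrees with the rules above for patterns this simple. The helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """True if pattern matches the entire text (Python re semantics)."""
    return re.fullmatch(pattern, text) is not None

# Each entry: (pattern, text, result predicted by the rules above).
examples = [
    ("cat|dog", "dog", True),   # two branches; the second one matches
    ("ab*c", "ac", True),       # '*' allows zero matches of 'b'
    ("ab+c", "ac", False),      # '+' needs at least one 'b'
    ("ab?c", "abbc", False),    # '?' allows at most one 'b'
]

for pattern, text, expected in examples:
    # Prints True when the engine agrees with the prediction.
    print(pattern, text, matches_whole(pattern, text) == expected)
```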

A ''bound'' is `{' followed by an unsigned decimal
integer, possibly followed by `,' possibly followed by
another unsigned decimal integer, always followed by `}'.
The integers must lie between 0 and RE_DUP_MAX (255)
inclusive, and if there are two of them, the first may not
exceed the second. An atom followed by a bound containing
one integer ''i'' and no comma matches a sequence of
exactly ''i'' matches of the atom. An atom followed by a
bound containing one integer ''i'' and a comma matches a
sequence of ''i'' or more matches of the atom. An atom
followed by a bound containing two integers ''i'' and
''j'' matches a sequence of ''i'' through ''j''
(inclusive) matches of the atom.
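
The bound rules are mechanical enough to check in a few lines. The following is only a sketch of a validator written from the description above, not code from any regex package; the names parse_bound and is_valid_bound are hypothetical, and RE_DUP_MAX is taken as 255 as quoted in the text.

```python
import re

RE_DUP_MAX = 255  # limit quoted in the text above

# '{', an integer, optionally ',' and optionally a second integer, then '}'.
_BOUND = re.compile(r"\{([0-9]+)(,([0-9]*))?\}")

def parse_bound(bound: str):
    """Return (low, high) for a bound; high is None when open-ended.

    Raises ValueError if the bound breaks the rules described above.
    """
    m = _BOUND.fullmatch(bound)
    if m is None:
        raise ValueError(f"not a bound: {bound!r}")
    low = int(m.group(1))
    if m.group(2) is None:       # {i}: exactly i matches
        high = low
    elif m.group(3) == "":       # {i,}: i or more matches
        high = None
    else:                        # {i,j}: i through j matches
        high = int(m.group(3))
    for value in (low, high):
        if value is not None and value > RE_DUP_MAX:
            raise ValueError(f"{value} exceeds RE_DUP_MAX")
    if high is not None and low > high:
        raise ValueError("first integer exceeds the second")
    return low, high

def is_valid_bound(bound: str) -> bool:
    """Convenience wrapper: True if parse_bound accepts the bound."""
    try:
        parse_bound(bound)
    except ValueError:
        return False
    return True
```

For instance, parse_bound("{2,}") reports a lower limit of 2 and no upper limit, while is_valid_bound("{5,2}") is false because the first integer exceeds the second.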

An atom is a regular expression enclosed in `()' (matching a
match for the regular expression), an empty set of `()'
(matching the null string), a ''bracket expression'' (see
below), `.' (matching any single character), `^' (matching
the null string at the beginning of a line), `$' (matching
the null string at the end of a line), a `\' followed by one
of the characters `^.[[$()|*+?{\' (matching that character
taken as an ordinary character), a `\' followed by any other
character (matching that character taken as an ordinary
character, as if the `\' had not been present), or a single
character with no other significance (matching that
character). A `{' followed by a character other than a digit
is an ordinary character, not the beginning of a bound. It
is illegal to end an RE with `\'.
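
A common use of the backslash rule is turning arbitrary text into an RE that matches exactly that text. The sketch below escapes the special characters of a modern RE (caret, dot, open bracket, dollar, parentheses, bar, star, plus, question mark, open brace, backslash). The function name escape_literal is hypothetical, and the result is only as portable as the rule itself.

```python
# Characters that a backslash turns into ordinary characters in a modern RE.
SPECIAL = set("^.[$()|*+?{\\")

def escape_literal(text: str) -> str:
    """Backslash-escape each special character so the RE matches text itself."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

# Example: a literal dot and plus sign no longer act as operators.
print(escape_literal("1+1=2."))
```

Since every backslash in the input is itself escaped, the result never ends in an unpaired backslash, which would be illegal.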

A ''bracket expression'' is a list of characters enclosed
in `[[]'. It normally matches any single character from the
list (but see below). If the list begins with `^', it
matches any single character (but see below) ''not'' from
the rest of the list. If two characters in the list are
separated by `-', this is shorthand for the full
''range'' of characters between those two (inclusive) in
the collating sequence, e.g. `[[0-9]' in ASCII matches any
decimal digit. It is illegal for two ranges to share an
endpoint, e.g. `a-c-e'. Ranges are very
collating-sequence-dependent, and portable programs should
avoid relying on them.
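
To make the meaning of a range and of a leading caret concrete, the sketch below expands a range by hand and tests membership. It assumes the collating sequence is plain code-point (ASCII) order, which is exactly the assumption the text warns is not portable; both function names are hypothetical.

```python
def expand_range(first: str, last: str) -> str:
    """Characters from first through last, inclusive, in code-point order."""
    # Assumption of this sketch: an out-of-order range is rejected.
    if ord(first) > ord(last):
        raise ValueError("range endpoints out of order")
    return "".join(chr(c) for c in range(ord(first), ord(last) + 1))

def bracket_matches(chars: str, ch: str, negate: bool = False) -> bool:
    """Emulate a bracket expression given its fully expanded character list.

    negate=True plays the role of a leading '^' in the list.
    """
    return (ch in chars) != negate

digits = expand_range("0", "9")
print(digits, bracket_matches(digits, "7"), bracket_matches(digits, "x", negate=True))
```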

To include a literal `]' in the list, make it the first
character (following a possible `^'). To include a literal
`-', make it the first or last character, or the second
endpoint of a range. To use a literal `-' as the first
endpoint of a range, enclose it in `[[.' and `.]' to make it
a collating element (see below). With the exception of these
and some combinations using `[[' (see next paragraphs), all
other special characters, including `\', lose their special
significance within a bracket expression.
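
These placement rules are easy to get wrong by hand, so a program that builds bracket expressions may want to apply them mechanically. The following sketch does that for a set of ordinary characters; the function name make_bracket is hypothetical, and it deliberately never emits ranges or collating elements.

```python
def make_bracket(chars: str, negate: bool = False) -> str:
    """Build a bracket expression matching any one character in chars."""
    wanted = set(chars)
    if not wanted:
        raise ValueError("empty character list")
    body = ""
    if "]" in wanted:
        body += "]"            # a literal ']' must come first
    # Sorting keeps '[' after '.', ':' and '=', so the combinations
    # '[.', '[:' and '[=' can never appear by accident.
    body += "".join(sorted(wanted - set("]^-")))
    if "^" in wanted:
        body += "^"            # keep '^' away from the front of the list
    if "-" in wanted:
        body += "-"            # a literal '-' is safe as the last character
    if not negate and body.startswith("^"):
        # Limitation of this sketch: a list made only of '^' (and '-').
        raise ValueError("cannot place '^' safely in this list")
    return "[" + ("^" if negate else "") + body + "]"

print(make_bracket("a]-"))
```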

Within a bracket expression, a collating element (a
character, a multi-character sequence that collates as if it
were a single character, or a collating-sequence name for
either) enclosed in `[[.' and `.]' stands for the sequence of
characters of that collating element. The sequence is a
single element of the bracket expression's list. A bracket
expression containing a multi-character collating element
can thus match more than one character, e.g. if the
collating sequence includes a `ch' collating element, then
the RE `[[[[.ch.]]*c' matches the first five characters of
`chchcc'.

Within a bracket expression, a collating element enclosed in
`[[=' and `=]' is an equivalence class, standing for the
sequences of characters of all collating elements equivalent
to that one, including itself. (If there are no other
equivalent collating elements, the treatment is as if the
enclosing delimiters were `[[.' and `.]'.) For example, if o
and ô are the members of an equivalence class, then
`[[[[=o=]]', `[[[[=ô=]]', and `[[oô]' are all synonymous. An
equivalence class may not be an endpoint of a
range.

Within a bracket expression, the name of a ''character
class'' enclosed in `[[:' and `:]' stands for the list of
all characters belonging to that class. Standard character
class names are:

alnum    digit    punct
alpha    graph    space
blank    lower    upper
cntrl    print    xdigit

These stand for the character classes defined in
ctype(3). A locale may provide others. A character
class may not be used as an endpoint of a
range.
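
For reference, the twelve standard classes can be spelled out explicitly. The table below is a sketch that assumes the C (POSIX) locale and 7-bit ASCII; in other locales the membership of each class may differ, and the names POSIX_CLASSES and in_class are hypothetical.

```python
import string

_ALNUM = string.ascii_letters + string.digits

# Membership of each class in the C locale (an assumption of this sketch).
POSIX_CLASSES = {
    "alnum": _ALNUM,
    "alpha": string.ascii_letters,
    "blank": " \t",
    "cntrl": "".join(map(chr, range(32))) + "\x7f",
    "digit": string.digits,
    "graph": _ALNUM + string.punctuation,
    "lower": string.ascii_lowercase,
    "print": _ALNUM + string.punctuation + " ",
    "punct": string.punctuation,
    "space": " \t\n\v\f\r",
    "upper": string.ascii_uppercase,
    "xdigit": string.hexdigits,
}

def in_class(name: str, ch: str) -> bool:
    """True if ch belongs to the named class (KeyError for unknown names)."""
    return ch in POSIX_CLASSES[name]

print(in_class("xdigit", "F"), in_class("digit", "a"))
```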

There are two special cases of bracket expressions: the
bracket expressions `[[[[:<:]]' and `[[[[:>:]]' match the null
string at the beginning and end of a word respectively. A
word is defined as a sequence of word characters which is
neither preceded nor followed by word characters. A word
character is an ''alnum'' character (as defined by
ctype(3)) or an underscore. This is an extension,
compatible with but not specified by POSIX 1003.2, and
should be used with caution in software intended to be
portable to other systems.
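
Assuming a word is a maximal run of word characters, where a word character is an alnum character or an underscore, the positions at which a word begins or ends can be computed directly. This sketch further assumes ASCII alnum characters only; the function names are hypothetical.

```python
import string

# Word characters: alnum (ASCII only in this sketch) or an underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def word_starts(text: str):
    """Indices where a word begins: a word character not preceded by one."""
    return [i for i, ch in enumerate(text)
            if ch in WORD_CHARS
            and (i == 0 or text[i - 1] not in WORD_CHARS)]

def word_ends(text: str):
    """Indices just past the last character of each word."""
    return [i + 1 for i, ch in enumerate(text)
            if ch in WORD_CHARS
            and (i + 1 == len(text) or text[i + 1] not in WORD_CHARS)]

print(word_starts("foo_bar baz!"), word_ends("foo_bar baz!"))
```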

In the event that an RE could match more than one substring
of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one substring
starting at that point, it matches the longest.
Subexpressions also match the longest possible substrings,
subject to the constraint that the whole match be as long as
possible, with subexpressions starting earlier in the RE
taking priority over ones starting later. Note that
higher-level subexpressions thus take priority over their
lower-level component subexpressions.

Match lengths are measured in characters, not collating
elements. A null string is considered longer than no match
at all. For example, `bb*' matches the three middle
characters of `abbbc', `(wee|week)(knights|nights)' matches
all ten characters of `weeknights', when `(.*).*' is matched
against `abc' the parenthesized subexpression matches all
three characters, and when `(a*)*' is matched against `bc'
both the whole RE and the parenthesized subexpression match
the null string.
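
The leftmost-longest rule can be illustrated without relying on any RE engine. The sketch below restricts itself to alternatives that are plain strings, which is a simplification made for this example; the function names are hypothetical. The first function picks the earliest, then longest, match; the second applies the subexpression priority rule to a pair of parenthesized alternations anchored at the start of the string.

```python
def leftmost_longest(alternatives, text):
    """Return (start, end) of the earliest, then longest, literal match."""
    for start in range(len(text) + 1):
        lengths = [len(a) for a in alternatives if text.startswith(a, start)]
        if lengths:
            # A zero-length match still counts: it beats no match at all.
            return start, start + max(lengths)
    return None

def match_two_groups(first, second, text):
    """Match one of first followed by one of second at the start of text.

    Prefers the longest overall match, then the longest first group,
    since the earlier subexpression takes priority.
    """
    best = None
    for a in first:
        if not text.startswith(a):
            continue
        for b in second:
            if text.startswith(b, len(a)):
                key = (len(a) + len(b), len(a))
                if best is None or key > best[0]:
                    best = (key, a, b)
    return None if best is None else (best[1], best[2])

print(match_two_groups(["wee", "week"], ["knights", "nights"], "weeknights"))
```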

If case-independent matching is specified, the effect is
much as if all case distinctions had vanished from the
alphabet. When an alphabetic that exists in multiple cases
appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket
expression containing both cases, e.g. `x' becomes `[[xX]'.
When it appears inside a bracket expression, all case
counterparts of it are added to the bracket expression, so
that (e.g.) `[[x]' becomes `[[xX]' and `[[^x]' becomes
`[[^xX]'.
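
The transformation of ordinary characters can be sketched for the simplest case, a string of ordinary characters with no metacharacters and no bracket expressions. The function name fold_case is hypothetical, and the output uses a single literal open bracket, as a program would, rather than this page's markup.

```python
def fold_case(literal: str) -> str:
    """Rewrite a metacharacter-free literal so each letter matches both cases."""
    out = []
    for ch in literal:
        lower, upper = ch.lower(), ch.upper()
        # Only letters with single-character counterparts are expanded.
        if lower != upper and len(lower) == 1 and len(upper) == 1:
            out.append("[" + lower + upper + "]")
        else:
            out.append(ch)
    return "".join(out)

print(fold_case("x"), fold_case("a1B"))
```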

No particular limit is imposed on the length of REs.
Programs intended to be portable should not employ REs
longer than 256 bytes, as an implementation can refuse to
accept such REs and remain POSIX-compliant.
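
A program that generates REs can guard against the portability limit with a one-line check. The sketch below counts bytes rather than characters; the choice of UTF-8 as the encoding and the name is_portable_length are assumptions of this example.

```python
PORTABLE_RE_LIMIT = 256  # bytes, per the portability advice above

def is_portable_length(pattern: str, encoding: str = "utf-8") -> bool:
    """True if the encoded RE is no longer than the portable limit."""
    return len(pattern.encode(encoding)) <= PORTABLE_RE_LIMIT

print(is_portable_length("a" * 256), is_portable_length("a" * 257))
```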

Obsolete (``basic'') regular expressions differ in several
respects. `|', `+', and `?' are ordinary characters and
there is no equivalent for their functionality. The
delimiters for bounds are `\{' and `\}', with `{' and `}' by
themselves ordinary characters. The parentheses for nested
subexpressions are `\(' and `\)', with `(' and `)' by
themselves ordinary characters. `^' is an ordinary character
except at the beginning of the RE or the beginning of a
parenthesized subexpression, `$' is an ordinary character
except at the end of the RE or the end of a parenthesized
subexpression, and `*' is an ordinary character if it
appears at the beginning of the RE or the beginning of a
parenthesized subexpression (after a possible leading `^').
Finally, there is one new type of atom, a ''back
reference'': `\' followed by a non-zero decimal digit
''d'' matches the same sequence of characters matched by
the ''d''th parenthesized subexpression (numbering
subexpressions by the positions of their opening
parentheses, left to right), so that (e.g.) `\([[bc]\)\1' matches `bb'
or `cc' but not `bc'.
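
The behaviour of a back reference can be observed with Python's re module, used here only as a stand-in: it is not an obsolete-RE implementation, and it writes the grouping parentheses without backslashes, but its back references behave the same way for this simple case. The name is_doubled is hypothetical.

```python
import re

# Obsolete-RE spelling:  \([bc]\)\1
# Python spelling (assumption of this sketch: plain parentheses group):
DOUBLED = re.compile(r"([bc])\1")

def is_doubled(text: str) -> bool:
    """True if text is 'b' or 'c' followed by the same character again."""
    return DOUBLED.fullmatch(text) is not None

for candidate in ("bb", "cc", "bc"):
    print(candidate, is_doubled(candidate))
```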
!!SEE ALSO

regex(3)

POSIX 1003.2, section 2.8 (Regular Expression
Notation).
!!BUGS

Having two kinds of REs is a botch.

The current 1003.2 spec says that `)' is an ordinary
character in the absence of an unmatched `('; this was an
unintentional result of a wording error, and change is
likely. Avoid relying on it.

Back references are a dreadful botch, posing major problems
for efficient implementations. They are also somewhat
vaguely defined (does `a\(\(b\)*\2\)*d' match `abbbd'?). Avoid using
them.

1003.2's specification of case-independent matching is
vague. The ``one case implies all cases'' definition given
above is current consensus among implementors as to the
right interpretation.

The syntax for word boundaries is incredibly
ugly.
!!AUTHOR

This page was taken from Henry Spencer's regex
package.
----