Annotated edit history of regex(7) version 1, including all changes.
Rev Author # Line
1 perry 1 REGEX
2 !!!REGEX
3 NAME
4 DESCRIPTION
5 SEE ALSO
6 BUGS
7 AUTHOR
8 ----
9 !!NAME
10
11
12 regex - POSIX 1003.2 regular expressions
13 !!DESCRIPTION
14
15
16 Regular expressions (``RE''s), as defined in POSIX 1003.2,
17 come in two forms: modern REs (roughly those of
18 ''egrep''; 1003.2 calls these ``extended'' REs) and
19 obsolete REs (roughly those of ed(1); 1003.2
20 ``basic'' REs). Obsolete REs mostly exist for backward
21 compatibility in some old programs; they will be discussed
22 at the end. 1003.2 leaves some aspects of RE syntax and
23 semantics open; `' marks decisions on these aspects that may
24 not be fully portable to other 1003.2
25 implementations.
26
27
28 A (modern) RE is one or more non-empty ''branches'',
29 separated by `|'. It matches anything that matches one of
30 the branches.
31
32
33 A branch is one or more ''pieces'', concatenated. It
34 matches a match for the first, followed by a match for the
35 second, etc.
36
37
38 A piece is an ''atom'' possibly followed by a single `*',
39 `+', `?', or ''bound''. An atom followed by `*' matches a
40 sequence of 0 or more matches of the atom. An atom followed
41 by `+' matches a sequence of 1 or more matches of the atom.
42 An atom followed by `?' matches a sequence of 0 or 1 matches
43 of the atom.
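
The branch, piece and repetition rules above can be tried out with a short sketch. It uses Python's `re` module, which is a backtracking engine and not a POSIX 1003.2 implementation; for these simple patterns the two agree. The helper name `matches` is made up for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern (Python re, not POSIX)."""
    return re.fullmatch(pattern, text) is not None

# Two branches separated by '|': a match for either branch is a match.
print(matches(r"cat|dog", "dog"))   # True
# A branch is pieces concatenated: 'a', then 'b*', then 'c'.
print(matches(r"ab*c", "ac"))       # True: '*' allows zero b's
print(matches(r"ab+c", "ac"))       # False: '+' needs at least one b
print(matches(r"ab?c", "abc"))      # True: '?' allows zero or one b
print(matches(r"ab?c", "abbc"))     # False: two b's is one too many
```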
44
45
46 A ''bound'' is `{' followed by an unsigned decimal
47 integer, possibly followed by `,' possibly followed by
48 another unsigned decimal integer, always followed by `}'.
49 The integers must lie between 0 and RE_DUP_MAX (255)
50 inclusive, and if there are two of them, the first may not
51 exceed the second. An atom followed by a bound containing
52 one integer ''i'' and no comma matches a sequence of
53 exactly ''i'' matches of the atom. An atom followed by a
54 bound containing one integer ''i'' and a comma matches a
55 sequence of ''i'' or more matches of the atom. An atom
56 followed by a bound containing two integers ''i'' and
57 ''j'' matches a sequence of ''i'' through ''j''
58 (inclusive) matches of the atom.
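
As a rough illustration of the bound rules, the sketch below checks a bound against the constraints just described and returns its limits. `parse_bound` is a hypothetical helper, not part of any regex library, and the limit of 255 is simply the RE_DUP_MAX value quoted above.

```python
import re

RE_DUP_MAX = 255  # value quoted in the text above

def parse_bound(bound: str):
    """Parse '{i}', '{i,}' or '{i,j}' into (low, high).

    high is None for the open-ended 'i or more' form.
    Raises ValueError if the bound breaks the rules described above.
    """
    m = re.fullmatch(r"\{([0-9]+)(,([0-9]*))?\}", bound)
    if m is None:
        raise ValueError(f"not a bound: {bound!r}")
    low = int(m.group(1))
    if m.group(2) is None:        # '{i}': exactly i
        high = low
    elif m.group(3) == "":        # '{i,}': i or more
        high = None
    else:                         # '{i,j}': i through j
        high = int(m.group(3))
    for n in (low, high):
        if n is not None and n > RE_DUP_MAX:
            raise ValueError(f"{n} exceeds RE_DUP_MAX ({RE_DUP_MAX})")
    if high is not None and low > high:
        raise ValueError("first integer may not exceed the second")
    return low, high

print(parse_bound("{3}"))     # (3, 3)
print(parse_bound("{3,}"))    # (3, None)
print(parse_bound("{2,4}"))   # (2, 4)
```

Passing `{4,2}` or `{256}` raises ValueError, mirroring the two restrictions stated in the paragraph.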
59
60
61 An atom is a regular expression enclosed in `()' (matching a
62 match for the regular expression), an empty set of `()'
63 (matching the null string), a ''bracket expression'' (see
64 below), `.' (matching any single character), `^' (matching
65 the null string at the beginning of a line), `$' (matching
66 the null string at the end of a line), a `\' followed by one
67 of the characters `^.[[$()|*+?{\' (matching that character
68 taken as an ordinary character), a `\' followed by any other
69 character (matching that character taken as an ordinary
70 character, as if the `\' had not been present), or a single
71 character with no other significance (matching that
72 character). A `{' followed by a character other than a digit
73 is an ordinary character, not the beginning of a bound. It
74 is illegal to end an RE with `\'.
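
One practical use of the escaping rule is quoting arbitrary text so that it matches only itself. The following is a minimal sketch: `quote_ere` is an illustrative name, the list of special characters is the one given in the paragraph above (including the backslash itself), and the result is checked with Python's `re`, which is not a POSIX engine but treats these escapes the same way.

```python
import re

# Characters that need a preceding backslash to be taken literally,
# per the atom description above.
SPECIALS = "^.[$()|*+?{\\"

def quote_ere(literal: str) -> str:
    """Backslash-escape every special character in literal."""
    return "".join("\\" + ch if ch in SPECIALS else ch for ch in literal)

print(quote_ere("a.b*(c)"))                              # a\.b\*\(c\)
print(re.fullmatch(quote_ere("1+1=2?"), "1+1=2?") is not None)  # True
```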
75
76
77 A ''bracket expression'' is a list of characters enclosed
78 in `[[]'. It normally matches any single character from the
79 list (but see below). If the list begins with `^', it
80 matches any single character (but see below) ''not'' from
81 the rest of the list. If two characters in the list are
82 separated by `-', this is shorthand for the full
83 ''range'' of characters between those two (inclusive) in
84 the collating sequence, e.g. `[[0-9]' in ASCII matches any
85 decimal digit. It is illegal for two ranges to share an
86 endpoint, e.g. `a-c-e'. Ranges are very
87 collating-sequence-dependent, and portable programs should
88 avoid relying on them.
89
90
91 To include a literal `]' in the list, make it the first
92 character (following a possible `^'). To include a literal
93 `-', make it the first or last character, or the second
94 endpoint of a range. To use a literal `-' as the first
95 endpoint of a range, enclose it in `[[.' and `.]' to make it
96 a collating element (see below). With the exception of these
97 and some combinations using `[[' (see next paragraphs), all
98 other special characters, including `\', lose their special
99 significance within a bracket expression.
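
The placement rules for `]' and `-' can be checked with a few one-character tests. This sketch again relies on Python's `re`, which happens to follow the same placement rules for these two characters. It does not follow the rule about backslash: in Python a backslash stays special inside brackets, so none of the examples use one. `in_bracket` is a made-up helper name.

```python
import re

def in_bracket(bracket: str, ch: str) -> bool:
    """Return True if the single character ch matches the bracket expression."""
    return re.fullmatch(bracket, ch) is not None

print(in_bracket("[0-9]", "7"))    # True: range of decimal digits
print(in_bracket("[^0-9]", "7"))   # False: leading '^' negates the list
print(in_bracket("[]a]", "]"))     # True: ']' placed first is literal
print(in_bracket("[^]a]", "]"))    # False: ']' right after '^' is literal too
print(in_bracket("[a-]", "-"))     # True: '-' placed last is literal
print(in_bracket("[.*]", "x"))     # False: '.' and '*' are ordinary here
```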
100
101
102 Within a bracket expression, a collating element (a
103 character, a multi-character sequence that collates as if it
104 were a single character, or a collating-sequence name for
105 either) enclosed in `[[.' and `.]' stands for the sequence of
106 characters of that collating element. The sequence is a
107 single element of the bracket expression's list. A bracket
108 expression containing a multi-character collating element
109 can thus match more than one character, e.g. if the
110 collating sequence includes a `ch' collating element, then
111 the RE `[[[[.ch.]]*c' matches the first five characters of
112 `chchcc'.
113
114
115 Within a bracket expression, a collating element enclosed in
116 `[[=' and `=]' is an equivalence class, standing for the
117 sequences of characters of all collating elements equivalent
118 to that one, including itself. (If there are no other
119 equivalent collating elements, the treatment is as if the
120 enclosing delimiters were `[[.' and `.]'.) For example, if o
121 and ô are the members of an equivalence class, then
122 `[[[[=o=]]', `[[[[=ô=]]', and `[[oô]' are all synonymous. An
123 equivalence class may not be an endpoint of a
124 range.
125
126
127 Within a bracket expression, the name of a ''character
128 class'' enclosed in `[[:' and `:]' stands for the list of
129 all characters belonging to that class. Standard character
130 class names are:
131
132
133 alnum digit punct
134 alpha graph space
135 blank lower upper
136 cntrl print xdigit
137
138
139 These stand for the character classes defined in
140 ctype(3). A locale may provide others. A character
141 class may not be used as an endpoint of a
142 range.
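
For orientation, the table below spells out what the twelve standard classes contain under one assumption: the C (POSIX) locale with ASCII characters. Other locales may assign characters differently, and Python's `re` does not itself accept the `[[:name:]' syntax, so this is a plain lookup table rather than a matcher. The names `CLASSES` and `in_class` are illustrative.

```python
import string

# Membership of each standard class, ASSUMING the C locale and ASCII.
CLASSES = {
    "alnum": set(string.ascii_letters + string.digits),
    "alpha": set(string.ascii_letters),
    "blank": set(" \t"),
    "cntrl": {chr(c) for c in range(32)} | {chr(127)},
    "digit": set(string.digits),
    "graph": {chr(c) for c in range(33, 127)},
    "lower": set(string.ascii_lowercase),
    "print": {chr(c) for c in range(32, 127)},
    "punct": set(string.punctuation),
    "space": set(" \t\n\v\f\r"),
    "upper": set(string.ascii_uppercase),
    "xdigit": set(string.hexdigits),
}

def in_class(name: str, ch: str) -> bool:
    """Return True if ch belongs to the named class (C locale assumed)."""
    return ch in CLASSES[name]

print(in_class("xdigit", "f"))   # True
print(in_class("xdigit", "g"))   # False
print(in_class("blank", "\n"))   # False: newline is 'space' but not 'blank'
```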
143
144
145 There are two special cases of bracket expressions: the
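146 bracket expressions `[[[[:<:]]' and `[[[[:>:]]' match the null string at the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters. A word character is an
147 ''alnum'' character (as defined by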
148 ctype(3)) or an underscore. This is an extension,
149 compatible with but not specified by POSIX 1003.2, and
150 should be used with caution in software intended to be
151 portable to other systems.
152
153
154 In the event that an RE could match more than one substring
155 of a given string, the RE matches the one starting earliest
156 in the string. If the RE could match more than one substring
157 starting at that point, it matches the longest.
158 Subexpressions also match the longest possible substrings,
159 subject to the constraint that the whole match be as long as
160 possible, with subexpressions starting earlier in the RE
161 taking priority over ones starting later. Note that
162 higher-level subexpressions thus take priority over their
163 lower-level component subexpressions.
164
165
166 Match lengths are measured in characters, not collating
167 elements. A null string is considered longer than no match
168 at all. For example, `bb*' matches the three middle
169 characters of `abbbc', `(wee|week)(knights|nights)' matches
170 all ten characters of `weeknights', when `(.*).*' is matched
171 against `abc' the parenthesized subexpression matches all
172 three characters, and when `(a*)*' is matched against `bc'
173 both the whole RE and the parenthesized subexpression match
174 the null string.
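
The "earliest start, then longest" rule for the overall match can be modelled by brute force. The sketch below is not how a real matcher works: it tries every start position from the left and, for each, every end position from the right, using Python's `re.fullmatch` on the slice as the yes/no oracle. It ignores `^' and `$' (slicing changes what counts as a line boundary) and says nothing about how subexpressions divide the match. `leftmost_longest` is a hypothetical name.

```python
import re

def leftmost_longest(pattern: str, text: str):
    """Return (start, end) of the leftmost-longest match, or None.

    Brute-force model of the rule described above; anchors are not handled.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):             # earliest start first
        for end in range(len(text), start - 1, -1):  # longest candidate first
            if compiled.fullmatch(text[start:end]):
                return start, end
    return None

print(leftmost_longest(r"bb*", "abbbc"))                              # (1, 4)
print(leftmost_longest(r"(wee|week)(knights|nights)", "weeknights"))  # (0, 10)
print(leftmost_longest(r"(a*)*", "bc"))                               # (0, 0)

# Where Python's own search differs from the POSIX rule:
print(re.search(r"wee|week", "weeknights").span())    # (0, 3): first branch wins
print(leftmost_longest(r"wee|week", "weeknights"))    # (0, 4): longest wins
```

The last two lines show why the distinction matters: a backtracking engine such as Python's takes the first branch that succeeds, while the rule in this page requires the longest match at that position.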
175
176
177 If case-independent matching is specified, the effect is
178 much as if all case distinctions had vanished from the
179 alphabet. When an alphabetic that exists in multiple cases
180 appears as an ordinary character outside a bracket
181 expression, it is effectively transformed into a bracket
182 expression containing both cases, e.g. `x' becomes `[[xX]'.
183 When it appears inside a bracket expression, all case
184 counterparts of it are added to the bracket expression, so
185 that (e.g.) `[[x]' becomes `[[xX]' and `[[^x]' becomes
186 `[[^xX]'.
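
The equivalences claimed above can be spot-checked by comparing a case-independent match against the expanded bracket expression matched case-sensitively. Python's `re.IGNORECASE` stands in for the POSIX case-independent flag here, which is an assumption: it behaves the same way for these ASCII examples. The helper names are illustrative.

```python
import re

def imatch(pattern: str, text: str) -> bool:
    """Whole-string match, ignoring case."""
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None

def match(pattern: str, text: str) -> bool:
    """Whole-string match, case-sensitive."""
    return re.fullmatch(pattern, text) is not None

# 'x' ignoring case behaves like '[xX]'.
print(imatch("x", "X"), match("[xX]", "X"))        # True True
# '[^x]' ignoring case behaves like '[^xX]': it rejects both cases of x.
print(imatch("[^x]", "X"), match("[^xX]", "X"))    # False False
print(imatch("[^x]", "y"), match("[^xX]", "y"))    # True True
```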
187
188
189 No particular limit is imposed on the length of REs.
190 Programs intended to be portable should not employ REs
191 longer than 256 bytes, as an implementation can refuse to
192 accept such REs and remain POSIX-compliant.
193
194
195 Obsolete (``basic'') regular expressions differ in several
196 respects. `|', `+', and `?' are ordinary characters and
197 there is no equivalent for their functionality. The
198 delimiters for bounds are `\{' and `\}', with `{' and `}' by
199 themselves ordinary characters. The parentheses for nested
200 subexpressions are `\(' and `\)', with `(' and `)' by
201 themselves ordinary characters. `^' is an ordinary character
202 except at the beginning of the RE or the beginning of a
203 parenthesized subexpression, `$' is an ordinary character
204 except at the end of the RE or the end of a parenthesized
205 subexpression, and `*' is an ordinary character if it
206 appears at the beginning of the RE or the beginning of a
207 parenthesized subexpression (after a possible leading `^').
208 Finally, there is one new type of atom, a ''back
209 reference'': `\' followed by a non-zero decimal digit
210 ''d'' matches the same sequence of characters matched by
211 the ''d''th parenthesized subexpression (numbering
212 subexpressions by the positions of their opening
213 parentheses, left to right), so that (e.g.) `\([[bc]\)\1' matches `bb'
214 or `cc' but not `bc'.
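
The back-reference behaviour can be reproduced in Python, with one caveat about spelling: Python groups with bare parentheses, so the pattern is written `([bc])\1` there rather than with the backslashed parentheses of obsolete REs. This is a sketch of the semantics only, not of a POSIX basic-RE matcher, and `doubled` is a made-up helper name.

```python
import re

# Group 1 matches one 'b' or 'c'; \1 must then repeat exactly what group 1 matched.
PATTERN = re.compile(r"([bc])\1")

def doubled(text: str) -> bool:
    """Return True if text is the same b-or-c character twice."""
    return PATTERN.fullmatch(text) is not None

print(doubled("bb"))   # True
print(doubled("cc"))   # True
print(doubled("bc"))   # False: \1 must repeat 'b', not merely match [bc] again
```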
215 !!SEE ALSO
216
217
218 regex(3)
219
220
221 POSIX 1003.2, section 2.8 (Regular Expression
222 Notation).
223 !!BUGS
224
225
226 Having two kinds of REs is a botch.
227
228
229 The current 1003.2 spec says that `)' is an ordinary
230 character in the absence of an unmatched `('; this was an
231 unintentional result of a wording error, and change is
232 likely. Avoid relying on it.
233
234
235 Back references are a dreadful botch, posing major problems
236 for efficient implementations. They are also somewhat
237 vaguely defined (does `a\(\(b\)*\2\)*d' match `abbbd'?). Avoid using
238 them.
239
240
241 1003.2's specification of case-independent matching is
242 vague. The ``one case implies all cases'' definition given
243 above is current consensus among implementors as to the
244 right interpretation.
245
246
247 The syntax for word boundaries is incredibly
248 ugly.
249 !!AUTHOR
250
251
252 This page was taken from Henry Spencer's regex
253 package.
254 ----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.