Blame: regex(7) - Waikato Linux Users Group

Annotated edit history of regex(7) version 1, including all changes.

Rev	Author	#	Line
1	perry	1	`REGEX`
		2	`!!!REGEX`
		3	`NAME`
		4	`DESCRIPTION`
		5	`SEE ALSO`
		6	`BUGS`
		7	`AUTHOR`
		8	`----`
		9	`!!NAME`
		10
		11
		12	`regex - POSIX 1003.2 regular expressions`
		13	`!!DESCRIPTION`
		14
		15
		16	Regular expressions (``RE''s), as defined in POSIX 1003.2,
		17	`come in two forms: modern REs (roughly those of`
		18	''egrep''; 1003.2 calls these ``extended'' REs) and
		19	`obsolete REs (roughly those of ed(1); 1003.2`
		20	``basic'' REs). Obsolete REs mostly exist for backward
		21	`compatibility in some old programs; they will be discussed`
		22	`at the end. 1003.2 leaves some aspects of RE syntax and`
		23	semantics open; `†' marks decisions on these aspects that may
		24	`not be fully portable to other 1003.2`
		25	`implementations.`
		26
		27
		28	`A (modern) RE is one or more non-empty ''branches'',`
		29	separated by `|'. It matches anything that matches one of
		30	`the branches.`
		31
		32
		33	`A branch is one or more ''pieces'', concatenated. It`
		34	`matches a match for the first, followed by a match for the`
		35	`second, etc.`
		36
		37
		38	A piece is an ''atom'' possibly followed by a single `*',
		39	`+', `?', or ''bound''. An atom followed by `*' matches a
		40	`sequence of 0 or more matches of the atom. An atom followed`
		41	by `+' matches a sequence of 1 or more matches of the atom.
		42	An atom followed by `?' matches a sequence of 0 or 1 matches
		43	`of the atom.`
		44
		45
		46	A ''bound'' is `{' followed by an unsigned decimal
		47	integer, possibly followed by `,' possibly followed by
		48	another unsigned decimal integer, always followed by `}'.
		49	`The integers must lie between 0 and RE_DUP_MAX (255)`
		50	`inclusive, and if there are two of them, the first may not`
		51	`exceed the second. An atom followed by a bound containing`
		52	`one integer ''i'' and no comma matches a sequence of`
		53	`exactly ''i'' matches of the atom. An atom followed by a`
		54	`bound containing one integer ''i'' and a comma matches a`
		55	`sequence of ''i'' or more matches of the atom. An atom`
		56	`followed by a bound containing two integers ''i'' and`
		57	`''j'' matches a sequence of ''i'' through ''j''`
		58	`(inclusive) matches of the atom.`
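
The rules for bounds above can be checked mechanically. The following is a minimal sketch in Python, not part of Henry Spencer's package; the helper name `parse_bound` is hypothetical, and the value 255 for RE_DUP_MAX is taken from the paragraph above.

```python
import re

RE_DUP_MAX = 255  # limit quoted in the text above

def parse_bound(text):
    """Parse a bound such as '{2}', '{2,}' or '{2,5}' (hypothetical helper).

    Returns (low, high); high is None when the bound is open-ended.
    Raises ValueError when the text breaks one of the rules given above.
    """
    m = re.fullmatch(r"\{([0-9]+)(,([0-9]*))?\}", text)
    if m is None:
        raise ValueError("not a bound: %r" % text)
    low = int(m.group(1))
    if m.group(2) is None:      # '{i}': exactly i matches
        high = low
    elif m.group(3) == "":      # '{i,}': i or more matches
        high = None
    else:                       # '{i,j}': i through j matches
        high = int(m.group(3))
    for n in (low, high):
        if n is not None and n > RE_DUP_MAX:
            raise ValueError("integer out of range: %d" % n)
    if high is not None and low > high:
        raise ValueError("first integer exceeds the second")
    return low, high
```

A call such as `parse_bound("{2,5}")` yields the pair of repetition limits, while `parse_bound("{5,2}")` and `parse_bound("{256}")` are rejected.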
		59
		60
		61	An atom is a regular expression enclosed in `()' (matching a
		62	match for the regular expression), an empty set of `()'
		63	`(matching the null string), a ''bracket expression'' (see`
		64	below), `.' (matching any single character), `^' (matching
		65	the null string at the beginning of a line), `$' (matching
		66	the null string at the end of a line), a `\' followed by one
		67	of the characters `^.[[$()|*+?{\' (matching that character
		68	taken as an ordinary character), a `\' followed by any other
		69	`character (matching that character taken as an ordinary`
		70	character, as if the `\' had not been present), or a single
		71	`character with no other significance (matching that`
		72	character). A `{' followed by a character other than a digit
		73	`is an ordinary character, not the beginning of a bound. It`
		74	is illegal to end an RE with `\'.
		75
		76
		77	`A ''bracket expression'' is a list of characters enclosed`
		78	in `[[]'. It normally matches any single character from the
		79	list (but see below). If the list begins with `^', it
		80	`matches any single character (but see below) ''not'' from`
		81	`the rest of the list. If two characters in the list are`
		82	separated by `-', this is shorthand for the full
		83	`''range'' of characters between those two (inclusive) in`
		84	the collating sequence, e.g. `[[0-9]' in ASCII matches any
		85	`decimal digit. It is illegal for two ranges to share an`
		86	endpoint, e.g. `a-c-e'. Ranges are very
		87	`collating-sequence-dependent, and portable programs should`
		88	`avoid relying on them.`
		89
		90
		91	To include a literal `]' in the list, make it the first
		92	character (following a possible `^'). To include a literal
		93	`-', make it the first or last character, or the second
		94	endpoint of a range. To use a literal `-' as the first
		95	endpoint of a range, enclose it in `[[.' and `.]' to make it
		96	`a collating element (see below). With the exception of these`
		97	and some combinations using `[[' (see next paragraphs), all
		98	other special characters, including `\', lose their special
		99	`significance within a bracket expression.`
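
The placement rules for a literal `]' and `-' can be tried out in Python. This is only an illustration: Python's `re` module is not a POSIX 1003.2 implementation (it has no collating elements, and a backslash keeps its special meaning inside brackets), but it follows the same placement rules for these two characters. The helper name `matches_one` is hypothetical.

```python
import re

def matches_one(bracket, ch):
    """Return True if the bracket expression matches the single character ch."""
    return re.fullmatch(bracket, ch) is not None

# A literal ']' goes first, after a possible '^'.
assert matches_one(r"[]a]", "]")
assert not matches_one(r"[^]a]", "]")

# A literal '-' goes first or last.
assert matches_one(r"[-a]", "-")
assert matches_one(r"[a-]", "-")

# Elsewhere '-' forms a range; in ASCII this matches any decimal digit.
assert matches_one(r"[0-9]", "7")
assert not matches_one(r"[0-9]", "x")
```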
		100
		101
		102	`Within a bracket expression, a collating element (a`
		103	`character, a multi-character sequence that collates as if it`
		104	`were a single character, or a collating-sequence name for`
		105	either) enclosed in `[[.' and `.]' stands for the sequence of
		106	`characters of that collating element. The sequence is a`
		107	`single element of the bracket expression's list. A bracket`
		108	`expression containing a multi-character collating element`
		109	`can thus match more than one character, e.g. if the`
		110	collating sequence includes a `ch' collating element, then
		111	the RE `[[[[.ch.]]*c' matches the first five characters of
		112	`chchcc'.
		113
		114
		115	`Within a bracket expression, a collating element enclosed in`
		116	`[[=' and `=]' is an equivalence class, standing for the
		117	`sequences of characters of all collating elements equivalent`
		118	`to that one, including itself. (If there are no other`
		119	`equivalent collating elements, the treatment is as if the`
		120	enclosing delimiters were `[[.' and `.]'.) For example, if o
		121	`and ô are the members of an equivalence class, then`
		122	`[[[[=o=]]', `[[[[=ô=]]', and `[[oô]' are all synonymous. An
		123	`equivalence class may not be an endpoint of a`
		124	`range.`
		125
		126
		127	`Within a bracket expression, the name of a ''character`
		128	class'' enclosed in `[[:' and `:]' stands for the list of
		129	`all characters belonging to that class. Standard character`
		130	`class names are:`
		131
		132
		133	`alnum digit punct`
		134	`alpha graph space`
		135	`blank lower upper`
		136	`cntrl print xdigit`
		137
		138
		139	`These stand for the character classes defined in`
		140	`ctype(3). A locale may provide others. A character`
		141	`class may not be used as an endpoint of a`
		142	`range.`
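
As a rough guide to what the twelve standard class names cover, the table below lists their members as Python sets. It assumes plain ASCII and the C locale, which is an assumption made for this sketch only; other locales may add members or whole classes. The names `POSIX_CLASSES` and `in_class` are hypothetical.

```python
import string

# ASCII membership of the standard class names (C locale assumed).
POSIX_CLASSES = {
    "alnum": set(string.ascii_letters + string.digits),
    "alpha": set(string.ascii_letters),
    "blank": set(" \t"),
    "cntrl": set(map(chr, range(0, 32))) | {chr(127)},
    "digit": set(string.digits),
    "graph": set(map(chr, range(33, 127))),   # printable, excluding space
    "lower": set(string.ascii_lowercase),
    "print": set(map(chr, range(32, 127))),   # printable, including space
    "punct": set(string.punctuation),
    "space": set(" \t\n\v\f\r"),
    "upper": set(string.ascii_uppercase),
    "xdigit": set(string.hexdigits),
}

def in_class(name, ch):
    """Return True if the character ch belongs to the named class."""
    return ch in POSIX_CLASSES[name]
```

Under these assumptions `in_class("xdigit", "F")` holds and `in_class("blank", "\n")` does not.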
		143
		144
		145	`There are two special cases of bracket expressions: the`
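		146	bracket expressions `[[[[:<:]]' and `[[[[:>:]]' match the null string at the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor followed by word characters.
		147	`A word character is an ''alnum'' character (as defined by`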
		148	`ctype(3)) or an underscore. This is an extension,`
		149	`compatible with but not specified by POSIX 1003.2, and`
		150	`should be used with caution in software intended to be`
		151	`portable to other systems.`
		152
		153
		154	`In the event that an RE could match more than one substring`
		155	`of a given string, the RE matches the one starting earliest`
		156	`in the string. If the RE could match more than one substring`
		157	`starting at that point, it matches the longest.`
		158	`Subexpressions also match the longest possible substrings,`
		159	`subject to the constraint that the whole match be as long as`
		160	`possible, with subexpressions starting earlier in the RE`
		161	`taking priority over ones starting later. Note that`
		162	`higher-level subexpressions thus take priority over their`
		163	`lower-level component subexpressions.`
		164
		165
		166	`Match lengths are measured in characters, not collating`
		167	`elements. A null string is considered longer than no match`
		168	at all. For example, `bb*' matches the three middle
		169	characters of `abbbc', `(wee|week)(knights|nights)' matches
		170	all ten characters of `weeknights', when `(.*).*' is matched
		171	against `abc' the parenthesized subexpression matches all
		172	three characters, and when `(a*)*' is matched against `bc'
		173	`both the whole RE and the parenthesized subexpression match`
		174	`the null string.`
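
The "earliest start, then longest" rule for the whole match can be sketched by brute force. Python's `re` module is used here only as a convenient matcher: it is not a POSIX implementation, and its own search stops at the first alternative that succeeds, so it does not always return the longest match. The function name `leftmost_longest` is hypothetical, the sketch ignores `^' and `$', and it says nothing about how subexpressions divide the match.

```python
import re

def leftmost_longest(pattern, text):
    """Return (start, end) of the earliest, then longest, match, or None."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        # Try the longest candidate first, so the first hit is the longest.
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                return start, end
    return None

# The three middle characters of 'abbbc'.
assert leftmost_longest("bb*", "abbbc") == (1, 4)

# Python's own search settles for 'a'; the rule above requires 'ab'.
assert re.search("a|ab", "xab").span() == (1, 2)
assert leftmost_longest("a|ab", "xab") == (1, 3)
```

The nested loops make this quadratic in the length of the text, which is acceptable for illustration but not for real use.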
		175
		176
		177	`If case-independent matching is specified, the effect is`
		178	`much as if all case distinctions had vanished from the`
		179	`alphabet. When an alphabetic that exists in multiple cases`
		180	`appears as an ordinary character outside a bracket`
		181	`expression, it is effectively transformed into a bracket`
		182	expression containing both cases, e.g. `x' becomes `[[xX]'.
		183	`When it appears inside a bracket expression, all case`
		184	`counterparts of it are added to the bracket expression, so`
		185	that (e.g.) `[[x]' becomes `[[xX]' and `[[^x]' becomes
		186	`[[^xX]'.
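
The rewriting described above for ordinary characters outside a bracket expression can be sketched as follows. The helper names `fold_ordinary` and `fold_literal` are hypothetical, only ASCII letters are handled, and the input is assumed to contain no characters that are special in an RE.

```python
import string

def fold_ordinary(ch):
    """Rewrite one ordinary character as a bracket expression holding both cases."""
    if ch in string.ascii_letters:
        return "[" + ch.lower() + ch.upper() + "]"
    return ch  # characters without case are left alone

def fold_literal(word):
    """Apply fold_ordinary to every character of a literal string."""
    return "".join(fold_ordinary(ch) for ch in word)

assert fold_literal("x") == "[xX]"
assert fold_literal("ab1") == "[aA][bB]1"
```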
		187
		188
		189	`No particular limit is imposed on the length of REs.`
		190	`Programs intended to be portable should not employ REs`
		191	`longer than 256 bytes, as an implementation can refuse to`
		192	`accept such REs and remain POSIX-compliant.`
		193
		194
		195	Obsolete (``basic'') regular expressions differ in several
		196	respects. `|', `+', and `?' are ordinary characters and
		197	`there is no equivalent for their functionality. The`
		198	delimiters for bounds are `\{' and `\}', with `{' and `}' by
		199	`themselves ordinary characters. The parentheses for nested`
		200	subexpressions are `\(' and `\)', with `(' and `)' by
		201	themselves ordinary characters. `^' is an ordinary character
		202	`except at the beginning of the RE or the beginning of a`
		203	parenthesized subexpression, `$' is an ordinary character
		204	`except at the end of the RE or the end of a parenthesized`
		205	subexpression, and `*' is an ordinary character if it
		206	`appears at the beginning of the RE or the beginning of a`
		207	parenthesized subexpression (after a possible leading `^').
		208	`Finally, there is one new type of atom, a ''back`
		209	reference'': `\' followed by a non-zero decimal digit
		210	`''d'' matches the same sequence of characters matched by`
		211	`the ''d''th parenthesized subexpression (numbering`
		212	`subexpressions by the positions of their opening`
		213	parentheses, left to right), so that (e.g.) `\([[bc]\)\1' matches `bb'
		214	or `cc' but not `bc'.
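
A back reference to the first parenthesized subexpression can be tried in Python, with the caveat that Python writes groups as plain parentheses rather than the backslashed parentheses of obsolete REs, and that its `re` module is not a POSIX implementation. The name `doubled` is hypothetical.

```python
import re

# One of 'b' or 'c', followed by the same character again.
doubled = re.compile(r"([bc])\1")

assert doubled.fullmatch("bb") is not None
assert doubled.fullmatch("cc") is not None
assert doubled.fullmatch("bc") is None   # the second character must repeat the first
```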
		215	`!!SEE ALSO`
		216
		217
		218	`regex(3)`
		219
		220
		221	`POSIX 1003.2, section 2.8 (Regular Expression`
		222	`Notation).`
		223	`!!BUGS`
		224
		225
		226	`Having two kinds of REs is a botch.`
		227
		228
		229	The current 1003.2 spec says that `)' is an ordinary
		230	character in the absence of an unmatched `('; this was an
		231	`unintentional result of a wording error, and change is`
		232	`likely. Avoid relying on it.`
		233
		234
		235	`Back references are a dreadful botch, posing major problems`
		236	`for efficient implementations. They are also somewhat`
		237	vaguely defined (does `a\(\(b\)*\2\)*d' match `abbbd'?). Avoid using
		238	`them.`
		239
		240
		241	`1003.2's specification of case-independent matching is`
		242	vague. The ``one case implies all cases'' definition given
		243	`above is current consensus among implementors as to the`
		244	`right interpretation.`
		245
		246
		247	`The syntax for word boundaries is incredibly`
		248	`ugly.`
		249	`!!AUTHOR`
		250
		251
		252	`This page was taken from Henry Spencer's regex`
		253	`package.`
		254	`----`

This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.

Last edited on Tuesday, June 4, 2002 12:30:59 am by "perry"
