Blame: perlrequick(1)
Annotated edit history of perlrequick(1) version 1, including all changes.
Rev Author # Line
1 perry 1 PERLREQUICK
2 !!!PERLREQUICK
3 NAME
4 DESCRIPTION
5 The Guide
6 BUGS
7 SEE ALSO
8 AUTHOR AND COPYRIGHT
9 ----
10 !!NAME
11
12
13 perlrequick - Perl regular expressions quick start
14 !!DESCRIPTION
15
16
17 This page covers the very basics of understanding, creating
18 and using regular expressions ('regexes') in
19 Perl.
20 !!The Guide
21
22
23 __Simple word matching__
24
25
26 The simplest regex is simply a word, or more generally, a
27 string of characters. A regex consisting of a word matches
28 any string that contains that word:
29
30
31
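As a minimal sketch (the sample string "Hello World" and the variable name $matched are assumptions chosen for illustration), a simple word match can be written as:

```perl
use strict;
use warnings;

# "Hello World" is an illustrative sample string.
my $matched = "Hello World" =~ /World/;   # true: the string contains 'World'
print "It matches\n" if $matched;
```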
32 In this statement, World is a regex and the // enclosing /World/ tells perl to search a string for a match. The operator =~ associates the string with the regex match and produces a true value if the regex matched, or false if the regex did not match. In our case, World matches the second word in "Hello World", so the expression is true. This idea has several variations.
33
34
35 Expressions like this are useful in
36 conditionals:
37
38
39 print "It matches\n" if "Hello World" =~ /World/;
40 The sense of the match can be reversed by using the !~ operator:
41
42
43 print "It doesn't match\n" if "Hello World" !~ /World/;
44 The literal string in the regex can be replaced by a variable:
45
46
47 $greeting = "World";
print "It matches\n" if "Hello World" =~ /$greeting/;
48 If you're matching against $_, the $_ =~ part can be omitted:
49
50
51 $_ = "Hello World";
print "It matches\n" if /World/;
52 Finally, the // default delimiters for a match can be changed to arbitrary delimiters by putting an 'm' out front:
53
54
55
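A brief sketch of alternative delimiters follows; the sample strings and variable names are illustrative assumptions, not part of the text above:

```perl
use strict;
use warnings;

my $bang  = "Hello World"   =~ m!World!;   # delimited by '!'
my $brace = "Hello World"   =~ m{World};   # paired delimiters '{' and '}'
my $quote = "/usr/bin/perl" =~ m"/perl";   # '/' is an ordinary character here

print "all three matched\n" if $bang && $brace && $quote;
```

Choosing a delimiter that does not occur in the pattern avoids having to backslash it.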
56 Regexes must match a part of the string ''exactly'' in order for the statement to be true:
57
58
59
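To make "exactly" concrete, here is a small sketch; the sample string and variable names are assumptions for illustration:

```perl
use strict;
use warnings;

my $case  = "Hello World" =~ /world/;    # false: matching is case-sensitive
my $space = "Hello World" =~ /o W/;      # true: ' ' is an ordinary character
my $tail  = "Hello World" =~ /World /;   # false: no ' ' follows 'World'

print "case: ",  ($case  ? "yes" : "no"), "\n";
print "space: ", ($space ? "yes" : "no"), "\n";
print "tail: ",  ($tail  ? "yes" : "no"), "\n";
```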
60 perl will always match at the earliest possible point in the string:
61
62
63
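One way to observe the earliest-match rule is to inspect the offset at which the match started. The sketch below uses Perl's built-in @- array, where $-[0] holds the start offset of the last successful match; the sample strings and variable names are assumptions:

```perl
use strict;
use warnings;

my ($o_at, $hat_at);
$o_at   = $-[0] if "Hello World"     =~ /o/;     # the 'o' in 'Hello', not in 'World'
$hat_at = $-[0] if "That hat is red" =~ /hat/;   # the 'hat' inside 'That'

print "$o_at $hat_at\n";   # prints "4 1"
```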
64 Not all characters can be used 'as is' in a match. Some characters, called __metacharacters__, are reserved for use in regex notation. The metacharacters are
65
66
67 {}[[]()^$.|*+?\
68 A metacharacter can be matched by putting a backslash before it:
69
70
71
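A short sketch of escaping metacharacters, with sample strings and variable names assumed for illustration:

```perl
use strict;
use warnings;

my $plus    = "2+2=4"         =~ /2+2/;                # false: '+' is a quantifier here
my $escaped = "2+2=4"         =~ /2\+2/;               # true: '\+' matches a literal '+'
my $bslash  = 'C:\WIN32'      =~ /C:\\WIN/;            # true: '\\' matches one backslash
my $slash   = "/usr/bin/perl" =~ /\/usr\/bin\/perl/;   # true: each '/' is escaped
```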
72 In the last regex, the forward slash '/' is also backslashed, because it is used to delimit the regex.
73
74
75 Non-printable ASCII characters are
76 represented by __escape sequences__. Common examples are
77 \t for a tab, \n for a newline, and
78 \r for a carriage return. Arbitrary bytes are
79 represented by octal escape sequences, e.g., \033,
80 or hexadecimal escape sequences, e.g.,
81 \x1B:
82
83
84
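As an illustrative sketch of escape sequences in a regex (the sample strings and variable names are assumptions):

```perl
use strict;
use warnings;

my $tab = "1000\t2000" =~ m(0\t2);          # true: '\t' matches the tab character
my $cat = "cat"        =~ /\143\x61\x74/;   # true: octal 143 = 'c', hex 61 = 'a', hex 74 = 't'

print "both matched\n" if $tab && $cat;
```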
85 Regexes are treated mostly as double quoted strings, so variable substitution works:
86
87
88 $foo = 'house';
89 'cathouse' =~ /cat$foo/; # matches
90 'housecat' =~ /${foo}cat/; # matches
91 With all of the regexes above, if the regex matched anywhere in the string, it was considered a match. To specify ''where'' it should match, we would use the __anchor__ metacharacters ^ and $. The anchor ^ means match at the beginning of the string and the anchor $ means match at the end of the string, or before a newline at the end of the string. Some examples:
92
93
94
95
96
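The effect of the anchors can be sketched as follows; the sample string "housekeeper" and the variable names are assumptions chosen for illustration:

```perl
use strict;
use warnings;

my $anywhere = "housekeeper"   =~ /keeper/;          # true: matches anywhere
my $start    = "housekeeper"   =~ /^keeper/;         # false: string does not start with 'keeper'
my $end      = "housekeeper"   =~ /keeper$/;         # true: string ends with 'keeper'
my $newline  = "housekeeper\n" =~ /keeper$/;         # true: '$' also matches before a final newline
my $whole    = "housekeeper"   =~ /^housekeeper$/;   # true: the whole string matches

print "anchors behave as described\n"
    if $anywhere && !$start && $end && $newline && $whole;
```

Using ^ and $ together forces the regex to match the entire string rather than a part of it.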
97 __Using character classes__
98
99
100 A __character class__ allows a set of possible
101 characters, rather than just a single character, to match at
102 a particular point in a regex. Character classes are denoted
103 by brackets [[...], with the set of characters to be
104 possibly matched inside. Here are some
105 examples:
106
107
108 /cat/; # matches 'cat'
109 /[[bcr]at/; # matches 'bat', 'cat', or 'rat'
"abc" =~ /[[cab]/; # matches 'a'
110 In the last statement, even though 'c' is the first character in the class, the earliest point at which the regex can match is 'a'.
111
112
113 /[[yY][[eE][[sS]/; # match 'yes' in a case-insensitive way
114 # 'yes', 'Yes', 'YES', etc.
115 /yes/i; # also match 'yes' in a case-insensitive way
116 The last example shows a match with an 'i' __modifier__, which makes the match case-insensitive.
117
118
119 Character classes also have ordinary and special characters,
120 but the sets of ordinary and special characters inside a
121 character class are different than those outside a character
122 class. The special characters for a character class are
123 -]\^$ and are matched using an escape:
124
125
126 /[[\]c]def/; # matches ']def' or 'cdef'
127 $x = 'bcr';
128 /[[$x]at/; # matches 'bat', 'cat', or 'rat'
129 /[[\$x]at/; # matches '$at' or 'xat'
130 /[[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'
131 The special character '-' acts as a range operator within character classes, so that the unwieldy [[0123456789] and [[abc...xyz] become the svelte [[0-9] and [[a-z]:
132
133
134 /item[[0-9]/; # matches 'item0' or ... or 'item9'
135 /[[0-9a-fA-F]/; # matches a hexadecimal digit
136 If '-' is the first or last character in a character class, it is treated as an ordinary character.
137
138
139 The special character ^ in the first position of a
140 character class denotes a __negated character class__,
141 which matches any character but those in the brackets. Both
142 [[...] and [[^...] must match a character,
143 or the match fails. Then
144
145
146 /[[^a]at/; # doesn't match 'aat' or 'at', but matches
147 # all other 'bat', 'cat', '0at', '%at', etc.
148 /[[^0-9]/; # matches a non-numeric character
149 /[[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary
150 Perl has several abbreviations for common character classes:
151
152
153 \d is a digit and represents [[0-9]
154
155
156 \s is a whitespace character and represents
157 [[\ \t\r\n\f]
158
159
160 \w is a word character (alphanumeric or _) and represents
161 [[0-9a-zA-Z_]
162
163
164 \D is a negated \d; it represents any character but a digit
165 [[^0-9]
166
167
168 \S is a negated \s; it represents any non-whitespace character
169 [[^\s]
170
171
172 \W is a negated \w; it represents any non-word character
173 [[^\w]
174
175
176 The period '.' matches any character but "\n"
177
178
179 The \d\s\w\D\S\W abbreviations can be used both inside
180 and outside of character classes. Here are some in
181 use:
182
183
184 /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
185 /[[\d\s]/; # matches any digit or whitespace character
186 /\w\W\w/; # matches a word char, followed by a
187 # non-word char, followed by a word char
188 /..rt/; # matches any two chars, followed by 'rt'
189 /end\./; # matches 'end.'
190 /end[[.]/; # same thing, matches 'end.'
191 The __word anchor__ \b matches a boundary between a word character and a non-word character \w\W or \W\w:
192
193
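194 $x = "Housecat catenates house and cat";
$x =~ /\bcat\b/; # matches only the final 'cat', at the end of the string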
195 In the last example, the end of the string is considered a word boundary.
196
197
198 __Matching this or that__
199
200
201 We can match different character strings with the
202 __alternation__ metacharacter '|'. To match
203 dog or cat, we form the regex
204 dog|cat. As before, perl will try to match the regex
205 at the earliest possible point in the string. At each
206 character position, perl will first try to match the
207 first alternative, dog. If dog doesn't
208 match, perl will then try the next alternative,
209 cat. If cat doesn't match either, then the
210 match fails and perl moves to the next position in the
211 string. Some examples:
212
213
214
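A small sketch of alternation order versus string position; the sample string and variable names are assumptions, and the built-in variable $& (the text of the last successful match) is used only to show what was matched:

```perl
use strict;
use warnings;

# Both regexes find 'cat', because it occurs earlier in the string than 'dog'.
my $first  = "cats and dogs" =~ /cat|dog|bird/ ? $& : undef;
my $second = "cats and dogs" =~ /dog|cat|bird/ ? $& : undef;

print "$first $second\n";   # prints "cat cat"
```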
215 Even though dog is the first alternative in the second regex, cat is able to match earlier in the string.
216
217
218
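When several alternatives can match at the same position, their order decides the result. A sketch under the same assumptions as before (illustrative string and variable names, with $& holding the matched text):

```perl
use strict;
use warnings;

my $short = "cats" =~ /c|ca|cat|cats/ ? $& : undef;   # first alternative 'c' succeeds
my $long  = "cats" =~ /cats|cat|ca|c/ ? $& : undef;   # first alternative 'cats' succeeds

print "$short $long\n";   # prints "c cats"
```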
219 At a given character position, the first alternative that allows the regex match to succeed will be the one that matches. Here, all the alternatives match at the first string position, so the first matches.
220
221
222 __Grouping things and hierarchical
223 matching__
224
225
226 The __grouping__ metacharacters () allow a part
227 of a regex to be treated as a single unit. Parts of a regex
228 are grouped by enclosing them in parentheses. The regex
229 house(cat|keeper) means match house
230 followed by either cat or keeper. Some
231 more examples are
232
233
234 /(a|b)b/; # matches 'ab' or 'bb'
235 /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere
236 /house(cat|)/; # matches either 'housecat' or 'house'
237 /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or
238 # 'house'. Note groups can be nested.
239
240
241
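Grouping with alternation, including an empty alternative, can be sketched as follows. The sample strings and variable names are assumptions; $1, which holds the text matched by the first group, is used only to show which alternative was taken:

```perl
use strict;
use warnings;

my $cat    = "housecat"    =~ /house(cat|keeper)/;   # true
my $keeper = "housekeeper" =~ /house(cat|keeper)/;   # true
my $dog    = "housedog"    =~ /house(cat|keeper)/;   # false: neither alternative follows 'house'

# The empty alternative is chosen: if the group took '20', nothing
# would be left in the string for \d\d to match.
my $null = "20" =~ /(19|20|)\d\d/ ? $1 : undef;

print "group matched '", $null, "'\n";   # prints "group matched ''"
```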
242 __Extracting matches__
243
244
245 The grouping metacharacters () also allow the
246 extraction of the parts of a string that matched. For each
247 grouping, the part that matched inside goes into the special
248 variables $1, $2, etc. They can be used
249 just as ordinary variables:
250
251
252 # extract hours, minutes, seconds
253 $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format
254 $hours = $1;
255 $minutes = $2;
256 $seconds = $3;
257 In list context, a match /regex/ with groupings will return the list of matched values ($1,$2,...). So we could rewrite it as
258
259
260 ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
261 If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc. For example, here is a complex regex and the matching variables indicated below it:
262
263
264 /(ab(cd|ef)((gi)|j))/;
265  1  2      34
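To see the numbering in practice, here is a small sketch that runs a nested regex in list context; the input string 'abcdgi' and the array name @groups are illustrative assumptions:

```perl
use strict;
use warnings;

# In list context the match returns ($1, $2, $3, $4), numbered by
# the position of each opening parenthesis.
my @groups = "abcdgi" =~ /(ab(cd|ef)((gi)|j))/;

print join(", ", @groups), "\n";   # prints "abcdgi, cd, gi, gi"
```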
266 Associated with the matching variables $1, $2, ... are the __backreferences__ \1, \2, ... Backreferences are matching variables that can be used ''inside'' a regex:
267
268
269 /(\w\w\w)\s\1/; # find sequences like 'the the' in string
270 $1, $2, ... should only be used outside of a regex, and \1, \2, ... only inside a regex.
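The division of labour between \1 and $1 can be sketched like this; the sample sentence and variable names are assumptions for illustration:

```perl
use strict;
use warnings;

my $text = "Paris in the the spring";

# \1 is used inside the regex to require a repeat of group 1;
# $1 is used outside the regex to read what group 1 captured.
my $doubled = $text =~ /\b(\w+)\s+\1\b/ ? $1 : undef;

print "doubled word: $doubled\n";   # prints "doubled word: the"
```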
271
272
273 __Matching repetitions__
274
275
276 The __quantifier__ metacharacters ?, *,
277 +, and {} allow us to determine the number
278 of repeats of a portion of a regex we consider to be a
279 match. Quantifiers are put immediately after the character,
280 character class, or grouping that we want to specify. They
281 have the following meanings:
282
283
284 a? = match 'a' 1 or 0 times
285
286
287 a* = match 'a' 0 or more times, i.e., any number of
288 times
289
290
291 a+ = match 'a' 1 or more times, i.e., at least
292 once
293
294
295 a{n,m} = match at least n times, but not
296 more than m times.
297
298
299 a{n,} = match at least n or more
300 times
301
302
303 a{n} = match exactly n times
304
305
306 Here are some examples:
307
308
309 /[[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
310 # any number of digits
311 /(\w+)\s+\1/; # match doubled words of arbitrary length
312 $year =~ /\d{2,4}/; # make sure year is at least 2 but not more
313 # than 4 digits
314 $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates
315 These quantifiers will try to match as much of the string as possible, while still allowing the regex to match. So we have
316
317
318 $x = 'the cat in the hat';
319 $x =~ /^(.*)(at)(.*)$/; # matches,
320 # $1 = 'the cat in the h'
321 # $2 = 'at'
322 # $3 = '' (0 matches)
323 The first quantifier .* grabs as much of the string as possible while still having the regex match. The second quantifier .* has no string left to it, so it matches 0 times.
324
325
326 __More matching__
327
328
329 There are a few more things you might want to know about
330 matching operators. In the code
331
332
333 $pattern = 'Seuss';
334 while (<>) { print if /$pattern/; }
335 perl has to re-evaluate $pattern each time through the loop. If $pattern won't be changing, use the //o modifier, to only perform variable substitutions once. If you don't want any substitutions at all, use the special delimiter m'':
336
337
338 $pattern = 'Seuss';
339 m'$pattern'; # matches '$pattern', not 'Seuss'
340 The global modifier //g allows the matching operator to match within a string as many times as possible. In scalar context, successive matches against a string will have //g jump from match to match, keeping track of position in the string as it goes along. You can get or set the position with the pos() function. For example,
341
342
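343 $x = "cat dog house"; # 3 words
while ($x =~ /(\w+)/g) { print "Word is $1, ends at position ", pos $x, "\n"; }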
344 prints
345
346
347 Word is cat, ends at position 3
348 Word is dog, ends at position 7
349 Word is house, ends at position 13
350 A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the //c, as in /regex/gc.
351
352
353 In list context, //g returns a list of matched
354 groupings, or if there are no groupings, a list of matches
355 to the whole regex. So
356
357
358 @words = ($x =~ /(\w+)/g); # matches,
359 # $words[[0] = 'cat'
360 # $words[[1] = 'dog'
361 # $words[[2] = 'house'
362
363
364 __Search and replace__
365
366
367 Search and replace is performed using
368 s/regex/replacement/modifiers. The
369 replacement is a Perl double quoted string that
370 replaces in the string whatever is matched with the
371 regex. The operator =~ is also used here
372 to associate a string with s///. If matching
373 against $_, the $_ =~ can be dropped. If
374 there is a match, s/// returns the number of
375 substitutions made, otherwise it returns false. Here are a
376 few examples:
377
378
379 $x = "Time to feed the cat!";
$x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!"
380 With the s/// operator, the matched variables $1, $2, etc. are immediately available for use in the replacement expression. With the global modifier, s///g will search and replace all occurrences of the regex in the string:
381
382
383 $x = "I batted 4 for 4";
$x =~ s/4/four/g; # $x contains "I batted four for four"
384 The evaluation modifier s///e wraps an eval{...} around the replacement string and the evaluated result is substituted for the matched substring. Some examples:
385
386
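387 # reverse all the words in a string
388 $x = "the cat in the hat";
$x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah"
389 # convert percentage to decimal
390 $x = "A 39% hit rate";
$x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"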
391 The last example shows that s/// can use other delimiters, such as s!!! and s{}{}, and even s{}//. If single quotes are used s''', then the regex and replacement are treated as single quoted strings.
392
393
394 __The split operator__
395
396
397 split /regex/, string splits string into a
398 list of substrings and returns that list. The regex
399 determines the character sequence that string is
400 split with respect to. For example, to split a string into
401 words, use
402
403
404 $x = "Calvin and Hobbes";
@word = split /\s+/, $x; # ('Calvin', 'and', 'Hobbes')
405 To extract a comma-delimited list of numbers, use
406
407
408 $x = "1.618,2.718, 3.142";
@const = split /,\s*/, $x; # ('1.618', '2.718', '3.142')
409 If the empty regex // is used, the string is split into individual characters. If the regex has groupings, then the list produced contains the matched substrings from the groupings as well:
410
411
412 $x = "/usr/bin";
@parts = split m!(/)!, $x; # ('', '/', 'usr', '/', 'bin')
413 Since the first character of $x matched the regex, split prepended an empty initial element to the list.
414 !!BUGS
415
416
417 None.
418 !!SEE ALSO
419
420
421 This is just a quick start guide. For a more in-depth
422 tutorial on regexes, see perlretut and for the reference
423 page, see perlre.
424 !!AUTHOR AND COPYRIGHT
425
426
427 Copyright (c) 2000 Mark Kvale All rights
428 reserved.
429
430
431 This document may be distributed under the same terms as
432 Perl itself.
433
434
435 __Acknowledgments__
436
437
438 The author would like to thank Mark-Jason Dominus, Tom
439 Christiansen, Ilya Zakharevich, Brad Hughes, and Mike Giroux
440 for all their helpful comments.
441 ----
This page is a man page (or other imported legacy content). We are unable to automatically determine the license status of this page.