Annotated edit history of mawk(1) version 1, including all changes.
Rev Author # Line
1 perry 1 MAWK
2 !!!MAWK
3 NAME
4 SYNOPSIS
5 DESCRIPTION
6 OPTIONS
7 THE AWK LANGUAGE
8 EXAMPLES
9 COMPATIBILITY ISSUES
10 SEE ALSO
11 BUGS
12 AUTHOR
13 ----
14 !!NAME
15
16
17 mawk - pattern scanning and text processing language
18 !!SYNOPSIS
19
20
21 __mawk__ [[-__W__ ''option''] [[-__F__
22 ''value''] [[-__v__ ''var=value''] [[--] 'program
23 text' [[file ...]
24 __mawk__ [[-__W__ ''option''] [[-__F__ ''value'']
25 [[-__v__ ''var=value''] [[-__f__ ''program-file'']
26 [[--] [[file ...]
27 !!DESCRIPTION
28
29
30 __mawk__ is an interpreter for the AWK Programming
31 Language. The AWK language is useful for manipulation of
32 data files, text retrieval and processing, and for
33 prototyping and experimenting with algorithms. __mawk__
34 is a ''new awk'' meaning it implements the AWK language
35 as defined in Aho, Kernighan and Weinberger, ''The AWK
36 Programming Language,'' Addison-Wesley Publishing, 1988.
37 (Hereafter referred to as the AWK book.) __mawk__
38 conforms to the Posix 1003.2 (draft 11.3) definition of the
39 AWK language which contains a few features not described in
40 the AWK book, and __mawk__ provides a small number of
41 extensions.
42
43
44 An AWK program is a sequence of ''pattern {action}''
45 pairs and function definitions. Short programs are entered
46 on the command line usually enclosed in ' ' to avoid shell
47 interpretation. Longer programs can be read in from a file
48 with the -f option. Data input is read from the list of
49 files on the command line or from standard input when the
50 list is empty. The input is broken into records as
51 determined by the record separator variable, __RS__.
52 Initially, __RS__ = "\n" and records are synonymous with lines.
53 Each record is compared against each ''pattern'' and if it matches, the program text for
54 ''{action}'' is executed.
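The pattern/action model described above can be sketched with a one-line program. This is a minimal illustration, assuming the awk on the PATH is mawk or another POSIX awk; the sample input and the shell variable out are made up for the example.

```shell
# Each input line is one record because RS is the default newline.
# The pattern $2 > 1 selects records; the action prints the first field.
out=$(printf 'alpha 1\nbeta 2\ngamma 3\n' | awk '$2 > 1 { print $1 }')
printf '%s\n' "$out"
```

Only the records whose second field exceeds 1 reach the action, so the first line is skipped.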
55 !!OPTIONS
56
57
58 -__F__ ''value''
59
60
61 sets the field separator, __FS__, to
62 ''value''.
63
64
65 -__f__ ''file'' Program text is read from ''file''
66 instead of from the command line. Multiple __-f__ options
67 are allowed.
68
69
70 -__v__ ''var=value''
71
72
73 assigns ''value'' to program variable
74 ''var''.
75
76
77 -- indicates the unambiguous end of options.
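As a hedged sketch of the Posix options listed above, the following command combines -F, -v and --. It assumes the awk on the PATH is mawk or another POSIX awk; the input line and the variable names n and out are hypothetical.

```shell
# -F: sets FS to ":" and -v n=2 assigns the program variable n
# before execution starts; "--" marks the end of options.
out=$(printf 'root:x:0\n' | awk -F: -v n=2 -- '{ print $n }')
printf '%s\n' "$out"
```

Because n is 2, the action prints the second colon-separated field.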
78
79
80 The above options will be available with any Posix
81 compatible implementation of AWK, and implementation
82 specific options are prefaced with __-W__. __mawk__
83 provides six:
84
85
86 -__W__ version
87
88
89 __mawk__ writes its version and copyright to stdout and
90 compiled limits to stderr and exits 0.
91
92
93 -__W__ dump writes an assembler like listing of the
94 internal representation of the program to stdout and exits 0
95 (on successful compilation).
96
97
98 -__W__ interactive
99
100
101 sets unbuffered writes to stdout and line buffered reads
102 from stdin. Records from stdin are lines regardless of the
103 value of __RS__.
104
105
106 -__W__ exec ''file''
107
108
109 Program text is read from ''file'' and this is the last
110 option. Useful on systems that support the __#!__
111 "magic number" convention for executable scripts.
112
113
114 -__W__ sprintf=''num''
115
116
117 adjusts the size of __mawk's__ internal sprintf buffer to
118 ''num'' bytes. More than rare use of this option
119 indicates __mawk__ should be recompiled.
120
121
122 -__W__ posix_space
123
124
125 forces __mawk__ not to consider '\n' to be
126 space.
127
128
129 The short forms __-W__[[vdiesp] are recognized and on some
130 systems __-W__e is mandatory to avoid command line length
131 limitations.
132 !!THE AWK LANGUAGE
133
134
135 __1. Program structure__
136
137
138 An AWK program is a sequence of ''pattern {action}''
139 pairs and user function definitions.
140
141
142 A pattern can be:
143
144
145 __BEGIN__
146 __END__
147 expression
148 expression , expression
149 One, but not both, of ''pattern {action}'' can be omitted. If ''{action}'' is omitted it is implicitly { print }. If ''pattern'' is omitted, then it is implicitly matched. __BEGIN__ and __END__ patterns require an action.
150
151
152 Statements are terminated by newlines, semi-colons or both.
153 Groups of statements such as actions or loop bodies are
154 blocked via { ... } as in C. The last statement in a block
155 doesn't need a terminator. Blank lines have no meaning; an
156 empty statement is terminated with a semi-colon. Long
157 statements can be continued with a backslash, \. A statement
158 can be broken without a backslash after a comma, left brace,
159 &&, ||, __do__, __else__, the right
160 parenthesis of an __if__, __while__ or __for__
161 statement, and the right parenthesis of a function
162 definition. A comment starts with # and extends to, but does
163 not include the end of line.
164
165
166 The following statements control program flow inside
167 blocks.
168
169
170 __if__ ( ''expr'' ) ''statement''
171
172
173 __if__ ( ''expr'' ) ''statement'' __else__
174 ''statement''
175
176
177 __while__ ( ''expr'' ) ''statement''
178
179
180 __do__ ''statement'' __while__ ( ''expr''
181 )
182
183
184 __for__ ( ''opt_expr'' ; ''opt_expr'' ;
185 ''opt_expr'' ) ''statement''
186
187
188 __for__ ( ''var'' __in__ ''array'' )
189 ''statement''
190
191
192 __continue__
193
194
195 __break__
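The flow-control statements above behave as in C. A small sketch, assuming a POSIX awk such as mawk; the variables i, s and out are illustrative only.

```shell
# Loop over 1..5, skipping 2 with continue and stopping at 4 with break.
out=$(awk 'BEGIN {
    for (i = 1; i <= 5; i++) {
        if (i == 2) continue    # skip this iteration
        if (i == 4) break       # leave the loop
        s = s i                 # concatenate the visited values
    }
    print s
}')
printf '%s\n' "$out"
```

Only 1 and 3 are visited, so the concatenated result is the string 13.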
196
197
198 __2. Data types, conversion and comparison__
199
200
201 There are two basic data types, numeric and string. Numeric
202 constants can be integer like -2, decimal like 1.08, or in
203 scientific notation like -1.1e4 or .28E-3. All numbers are
204 represented internally and all computations are done in
205 floating point arithmetic. So for example, the expression
206 0.2e2 == 20 is true and true is represented as
207 1.0.
208
209
210 String constants are enclosed in double quotes.
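211 Strings can be continued across a line by escaping (\)
212 the newline. The following escape sequences are
213 recognized.
214 \" "
    \\ \
    \/ /
    \a alert, ascii 7
    \b backspace, ascii 8
    \t tab, ascii 9
    \n newline, ascii 10
    \v vertical tab, ascii 11
    \f formfeed, ascii 12
    \r carriage return, ascii 13
    \ddd 1, 2 or 3 octal digits for ascii ddd
215 If you escape any other character \c, you get \c, i.e.,
216 __mawk__ ignores the escape.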
217 There are really three basic data types; the third is
218 ''number and string'' which has both a numeric value and
219 a string value at the same time. User defined variables come
220 into existence when first referenced and are initialized to
221 ''null'', a number and string value which has numeric
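222 value 0 and string value "". Non-trivial number and string typed data come from input
223 and are typically stored in fields. (See section 4).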
224 The type of an expression is determined by its context
225 and automatic type conversion occurs if needed. For example,
226 to evaluate the statements
227 y = x + 2 ; z = x "hello"
228 The value stored in variable y will be typed numeric. If
229 x is not numeric, the value read from x is converted to
230 numeric before it is added to 2 and stored in y. The value
231 stored in variable z will be typed string, and the value of
232 x will be converted to string if necessary and concatenated
233 with "hello". (Of course, the value and type stored in x are not changed by any conversions.) A string expression is converted to numeric using its longest numeric prefix as with
234 ''atof''(3). A numeric expression is
235 converted to string by replacing ''expr'' with
236 __sprintf(CONVFMT__, ''expr''), unless ''expr'' can
237 be represented on the host machine as an exact integer, in which case
238 it is converted to __sprintf__("%d",
239 ''expr''). __Sprintf()__ is an AWK built-in that
240 duplicates the functionality of sprintf(3), and
241 __CONVFMT__ is a built-in variable used for internal
242 conversion from number to string and initialized to "%.6g". Explicit type conversions can be forced:
243 ''expr'' "" is string and ''expr''+0 is
244 numeric.
245 To evaluate, ''expr''1 __rel-op__ ''expr''2, if
246 both operands are numeric or number and string then the
247 comparison is numeric; if both operands are string the
248 comparison is string; if one operand is string, the
249 non-string operand is converted and the comparison is
250 string. The result is numeric, 1 or 0.
251 In boolean contexts such as, __if__ ( ''expr'' )
252 ''statement'', a string expression evaluates true if and
253 only if it is not the empty string "";
254 numeric values if and only if not numerically zero.
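The conversion and comparison rules of this section can be checked with a short program. This is a sketch under the assumption that the awk on the PATH is mawk or another POSIX awk and that CONVFMT still has its initial value; all variable names are illustrative.

```shell
out=$(awk 'BEGIN {
    x = "3abc"
    y = x + 2              # longest numeric prefix of x is 3
    a = (10 < 9)           # both operands numeric: numeric comparison
    b = ("10" < "9")       # both operands string: string comparison
    s = (0.1 + 0.2) ""     # number forced to string through CONVFMT
    print y; print a; print b; print s
}')
printf '%s\n' "$out"
```

The numeric comparison is false while the string comparison is true, because "1" sorts before "9" character by character.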
255
256
257 __3. Regular expressions__
258
259
260 In the AWK language, records, fields and strings are often
261 tested for matching a ''regular expression''. Regular
262 expressions are enclosed in slashes, and
263
264
265 '' expr'' ~ /''r''/
266 is an AWK expression that evaluates to 1 if ''expr'' "matches" ''r'', which means a substring of ''expr'' is in the set of strings defined by ''r''. With no match the expression evaluates to 0; replacing ~ with the "not match" operator, !~ , reverses the meaning. As pattern-action pairs,
267
268
269 /''r''/ { ''action'' } and __ $0__ ~ /''r''/ { ''action'' }
270 are the same, and for each input record that matches ''r'', ''action'' is executed. In fact, /''r''/ is an AWK expression that is equivalent to (__$0__ ~ /''r''/) anywhere except when on the right side of a match operator or passed as an argument to a built-in function that expects a regular expression argument.
271
272
273 AWK uses extended regular expressions as with
274 egrep(1). The regular expression metacharacters,
275 i.e., those with special meaning in regular expressions
276 are
277
278
279 \ ^ $ . [[ ] | ( ) * + ?
280 Regular expressions are built up from characters as follows:
281
282
283 ''c''
284
285
286 matches any non-metacharacter ''c''.
287
288
289 \''c''
290
291
292 matches a character defined by the same escape sequences
293 used in string constants or the literal character ''c''
294 if \''c'' is not an escape sequence.
295
296
297 .
298
299
300 matches any character (including newline).
301
302
303 ^
304
305
306 matches the front of a string.
307
308
309 $
310
311
312 matches the back of a string.
313
314
315 [[c1c2c3...]
316
317
318 matches any character in the class c1c2c3... . An interval
319 of characters is denoted c1-c2 inside a class
320 [[...].
321
322
323 [[^c1c2c3...]
324
325
326 matches any character not in the class
327 c1c2c3...
328
329
330 Regular expressions are built up from other regular
331 expressions as follows:
332
333
334 ''r''1''r''2
335
336
337 matches ''r''1 followed immediately by ''r''2
338 (concatenation).
339
340
341 ''r''1 | ''r''2
342
343
344 matches ''r''1 or ''r''2 (alternation).
345
346
347 ''r''*
348
349
350 matches ''r'' repeated zero or more times.
351
352
353 ''r''+
354
355
356 matches ''r'' repeated one or more times.
357
358
359 ''r''?
360
361
362 matches ''r'' zero or once.
363
364
365 (''r'')
366
367
368 matches ''r'', providing grouping.
369
370
371 The increasing precedence of operators is alternation,
372 concatenation and unary (*, + or ?).
373
374
375 For example,
376
377
378 /^[[_a-zA-Z][[_a-zA-Z0-9]*$/ and
379 /^[[-+]?([[0-9]+\.?|\.[[0-9])[[0-9]*([[eE][[-+]?[[0-9]+)?$/
380 are matched by AWK identifiers and AWK numeric constants respectively. Note that . has to be escaped to be recognized as a decimal point, and that metacharacters are not special inside character classes.
381
382
383 Any expression can be used on the right hand side of the ~
384 or !~ operators or passed to a built-in that expects a
385 regular expression. If needed, it is converted to string,
386 and then interpreted as a regular expression. For
387 example,
388
389
390 BEGIN { identifier = "[[_a-zA-Z][[_a-zA-Z0-9]*" }
    $0 ~ "^" identifier
391 prints all lines that start with an AWK identifier.
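A runnable sketch of a regular expression built from a string at run time, assuming a POSIX awk such as mawk; the three input lines and the shell variable out are invented for the illustration.

```shell
# The string in identifier is converted to a regular expression when it
# appears on the right hand side of ~.  A pattern without an action
# prints the matching records.
out=$(printf 'foo1 = 2\n1foo\n_x\n' | awk '
BEGIN { identifier = "[_a-zA-Z][_a-zA-Z0-9]*" }
$0 ~ "^" identifier
')
printf '%s\n' "$out"
```

The line 1foo is rejected because an identifier cannot start with a digit.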
392
393
394 __mawk__ recognizes the empty regular expression, //,
395 which matches the empty string and hence is matched by any
396 string at the front, back and between every character. For
397 example,
398
399
400 echo abc | mawk '{ gsub(//, "X") ; print }'
    XaXbXcX
401
402
403 __4. Records and fields__
404
405
406 Records are read in one at a time, and stored in the
407 ''field'' variable __$0__. The record is split into
408 ''fields'' which are stored in __$1__, __$2__, ...,
409 __$NF__. The built-in variable __NF__ is set to the
410 number of fields, and __NR__ and __FNR__ are
411 incremented by 1. Fields above __$NF__ are set to
412 "".
413
414
415 Assignment to __$0__ causes the fields and __NF__ to
416 be recomputed. Assignment to __NF__ or to a field causes
417 __$0__ to be reconstructed by concatenating the
418 __$i's__ separated by __OFS__. Assignment to a field
419 with index greater than __NF__, increases __NF__ and
420 causes __$0__ to be reconstructed.
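The reconstruction of $0 after a field assignment can be observed directly. A minimal sketch, assuming a POSIX awk such as mawk; the input record and the choice of "-" for OFS are arbitrary.

```shell
# Assigning to $5 raises NF from 3 to 5, creates an empty $4 and
# rebuilds $0 by joining the fields with OFS.
out=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $5 = "e"; print NF; print }')
printf '%s\n' "$out"
```

The doubled separator in the rebuilt record marks the empty fourth field.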
421
422
423 Data input stored in fields is string, unless the entire
424 field has numeric form and then the type is number and
425 string. For example,
426
427
428 echo 24 24E |
429 mawk '{ print($1>100, $1>"100", $2>100) }'
    0 1 1
430 __$0__ and __$2__ are string and __$1__ is number and string. The first comparison is numeric, the second is string, the third is string (100 is converted to "100").
431
432
433 __5. Expressions and operators__
434
435
436 The expression syntax is similar to C. Primary expressions
437 are numeric constants, string constants, variables, fields,
438 arrays and function calls. The identifier for a variable,
439 array or function can be a sequence of letters, digits and
440 underscores, that does not start with a digit. Variables are
441 not declared; they exist when first referenced and are
442 initialized to ''null''.
443
444
445 New expressions are composed with the following operators in
446 order of increasing precedence.
447
448
449 ''assignment'' = += -= *= /= %= ^=
450 ''conditional'' ? :
451 ''logical or'' ||
452 ''logical and'' &&
453 ''array membership'' __in__
454 ''matching'' ~ !~
455 ''relational'' < > <= >= == !=
456 ''concatenation'' (no explicit operator)
457 ''add ops'' + -
458 ''mul ops'' * / %
459 ''unary'' + -
460 ''logical not'' !
461 ''exponentiation'' ^
462 ''inc and dec'' ++ -- (both post and pre)
463 ''field'' $
464 Assignment, conditional and exponentiation associate right to left; the other operators associate left to right. Any expression can be parenthesized.
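Two of the less obvious rows of the precedence table can be illustrated as follows. This is only a sketch, assuming a POSIX awk such as mawk; the operands are arbitrary.

```shell
out=$(awk 'BEGIN {
    print 2 ^ 3 ^ 2      # exponentiation associates right to left: 2 ^ 9
    print 1 " " 2 + 3    # addition binds tighter than concatenation
}')
printf '%s\n' "$out"
```

In the second statement 2 + 3 is evaluated first and the result is then concatenated after the space.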
465
466
467 __6. Arrays__
468
469
470 Awk provides one-dimensional arrays. Array elements are
471 expressed as ''array''[[''expr'']. ''Expr'' is
472 internally converted to string type, so, for example, A[[1]
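473 and A[["1"] are the same and the actual index is "1". Arrays indexed by strings are called associative arrays. Initially an array is empty; elements exist when first accessed.
474 An expression, ''expr''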
475 __in__ ''array'' evaluates to 1 if
476 ''array''[[''expr''] exists, else to 0.
477
478
479 There is a form of the __for__ statement that loops over
480 each index of an array.
481
482
483 __for__ ( ''var'' __in__ ''array'' ) ''statement''
484 sets ''var'' to each index of ''array'' and executes ''statement''. The order in which ''var'' traverses the indices of ''array'' is not defined.
485
486
487 The statement, __delete__ ''array''[[''expr''],
488 causes ''array''[[''expr''] not to exist. __mawk__
489 supports an extension, __delete__ ''array'', which
490 deletes all elements of ''array''.
491
492
493 Multidimensional arrays are synthesized with concatenation
494 using the built-in variable __SUBSEP__.
495 ''array''[[''expr''1,''expr''2] is equivalent to
496 ''array''[[''expr''1 __SUBSEP__ ''expr''2].
497 Testing for a multidimensional element uses a parenthesized
498 index, such as
499
500
501 if ( (i, j) in A ) print A[[i, j]
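The array operations of this section can be combined in one short program. A sketch assuming a POSIX awk such as mawk; the array names A and part and the stored value are hypothetical.

```shell
out=$(awk 'BEGIN {
    A[1, 2] = "x"                       # index is "1" SUBSEP "2"
    if ((1, 2) in A) print "present"
    if (!((2, 1) in A)) print "absent"  # the test does not create A[2, 1]
    for (k in A) {
        split(k, part, SUBSEP)          # recover the two subscripts
        print part[1], part[2]
    }
    delete A[1, 2]
    n = 0
    for (k in A) n++
    print n                             # the array is empty again
}')
printf '%s\n' "$out"
```

Testing membership with in never creates an element, which is why the final count is zero.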
502
503
504 __7. Builtin-variables__
505
506
507 The following variables are built-in and initialized before
508 program execution.
509
510
511 __ARGC__
512
513
514 number of command line arguments.
515
516
517 __ARGV__
518
519
520 array of command line arguments, 0..ARGC-1.
521
522
523 __CONVFMT__
524
525
526 format for internal conversion of numbers to string,
527 initially = "%.6g".
528
529
530 __ENVIRON__
531
532
533 array indexed by environment variables. An environment
534 string, ''var=value'' is stored as
535 __ENVIRON__[[''var''] = ''value''.
536
537
538 __FILENAME__
539
540
541 name of the current input file.
542
543
544 __FNR__
545
546
547 current record number in __FILENAME__.
548
549
550 __FS__
551
552
553 splits records into fields as a regular
554 expression.
555
556
557 __NF__
558
559
560 number of fields in the current record.
561
562
563 __NR__
564
565
566 current record number in the total input
567 stream.
568
569
570 __OFMT__
571
572
573 format for printing numbers; initially = "%.6g".
574
575
576 __OFS__
577
578
579 inserted between fields on output, initially = " ".
580
581
582 __ORS__
583
584
585 terminates each record on output, initially = "\n".
586
587
588 __RLENGTH__
589
590
591 length set by the last call to the built-in function,
592 __match()__.
593
594
595 __RS__
596
597
598 input record separator, initially = "\n".
599
600
601 __RSTART__
602
603
604 index set by the last call to __match()__.
605
606
607 __SUBSEP__
608
609
610 used to build multiple array subscripts, initially = "\034".
611
612
613 __8. Built-in functions__
614
615
616 String functions
617
618
619 gsub(''r,s,t'') gsub(''r,s'')
620
621
622 Global substitution, every match of regular expression
623 ''r'' in variable ''t'' is replaced by string
624 ''s''. The number of replacements is returned. If
625 ''t'' is omitted, __$0__ is used. An
626 & in the replacement string ''s'' is replaced by the matched
627 substring of ''t''.
628 \& and \\ put literal & and \, respectively, in the replacement string.
629
630
631 index(''s,t'')
632
633
634 If ''t'' is a substring of ''s'', then the position
635 where ''t'' starts is returned, else 0 is returned. The
636 first character of ''s'' is in position 1.
637
638
639 length(''s'')
640
641
642 Returns the length of string ''s''.
643
644
645 match(''s,r'')
646
647
648 Returns the index of the first longest match of regular
649 expression ''r'' in string ''s''. Returns 0 if no
650 match. As a side effect, __RSTART__ is set to the return
651 value. __RLENGTH__ is set to the length of the match or
652 -1 if no match. If the empty string is matched,
653 __RLENGTH__ is set to 0, and 1 is returned if the match
654 is at the front, and length(''s'')+1 is returned if the
655 match is at the back.
656
657
658 split(''s,A,r'') split(''s,A'')
659
660
661 String ''s'' is split into fields by regular expression
662 ''r'' and the fields are loaded into array ''A''. The
663 number of fields is returned. See section 11 below for more
664 detail. If ''r'' is omitted, __FS__ is
665 used.
666
667
668 sprintf(''format,expr-list'')
669
670
671 Returns a string constructed from ''expr-list'' according
672 to ''format''. See the description of printf()
673 below.
674
675
676 sub(''r,s,t'') sub(''r,s'')
677
678
679 Single substitution, same as gsub() except at most one
680 substitution.
681
682
683 substr(''s,i,n'') substr(''s,i'')
684
685
686 Returns the substring of string ''s'', starting at index
687 ''i'', of length ''n''. If ''n'' is omitted, the
688 suffix of ''s'', starting at ''i'' is
689 returned.
690
691
692 tolower(''s'')
693
694
695 Returns a copy of ''s'' with all upper case characters
696 converted to lower case.
697
698
699 toupper(''s'')
700
701
702 Returns a copy of ''s'' with all lower case characters
703 converted to upper case.
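The string functions above can be exercised together. This is a minimal sketch, assuming a POSIX awk such as mawk; the sample string and variable names are illustrative.

```shell
out=$(awk 'BEGIN {
    s = "foo123bar"
    if (match(s, /[0-9]+/))             # sets RSTART and RLENGTH
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    n = gsub(/o/, "[&]", s)             # & stands for the matched text
    print n, s
    print index(s, "bar"), length(s), toupper("abc")
}')
printf '%s\n' "$out"
```

Note that gsub() modifies s in place, so the later calls to index() and length() see the substituted string.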
704
705
706 Arithmetic functions
707
708
709 atan2(''y,x'') Arctan of ''y''/''x'' between -pi and pi.
710 cos(''x'') Cosine function, ''x'' in radians.
711 exp(''x'') Exponential function.
712 int(''x'') Returns ''x'' truncated towards zero.
713 log(''x'') Natural logarithm.
714 rand() Returns a random number between zero and one.
715 sin(''x'') Sine function, ''x'' in radians.
716 sqrt(''x'') Returns square root of ''x''.
717 srand(''expr'') srand()
718
719
720 Seeds the random number generator, using the clock if
721 ''expr'' is omitted, and returns the value of the
722 previous seed. __mawk__ seeds the random number generator
723 from the clock at startup so there is no real need to call
724 srand(). Srand(''expr'') is useful for repeating pseudo
725 random sequences.
726
727
728 __9. Input and output__
729
730
731 There are two output statements, __print__ and
732 __printf__.
733
734
735 print
736
737
738 writes __$0 ORS__ to standard output.
739
740
741 print ''expr''1, ''expr''2, ...,
742 ''expr''n
743
744
745 writes ''expr''1 __OFS__ ''expr''2 __OFS__ ...
746 ''expr''n __ORS__ to standard output. Numeric
747 expressions are converted to string with
748 __OFMT__.
749
750
751 printf ''format, expr-list''
752
753
754 duplicates the printf C library function writing to standard
755 output. The complete ANSI C format specifications are
756 recognized with conversions %c, %d, %e, %E, %f, %g, %G, %i,
757 %o, %s, %u, %x, %X and %%, and conversion qualifiers h and
758 l.
759
760
761 The argument list to print or printf can optionally be
762 enclosed in parentheses. Print formats numbers using
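763 __OFMT__ or "%d" for exact integers. "%c" with a numeric argument prints the corresponding 8 bit character, with a string argument it prints the first character of the string. The output of print and printf can be redirected to a file or command by appending
764 > ''file'', >> ''file'' or |
765 ''command'' to the end of the print statement.
766 Redirection opens ''file'' or ''command'' only once,
767 subsequent redirections append to the already open stream.
768 By convention, __mawk__ associates the filename "/dev/stderr" with stderr which allows print and printf to be redirected to stderr.
769 __mawk__ also
770 associates "-" and "/dev/stdout" with stdin and stdout which allows these streams to be passed to functions.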
772
773
774 The input function __getline__ has the following
775 variations.
776
777
778 getline
779
780
781 reads into __$0__, updates the fields, __NF__,
782 __NR__ and __FNR__.
783
784
785 getline < ''file''
786
787
788 reads into __$0__ from ''file'', updates the fields
789 and __NF__.
790
791
792 getline ''var''
793
794
795 reads the next record into ''var'', updates __NR__ and
796 __FNR__.
797
798
799 getline ''var'' < ''file''
800
801
802 reads the next record of ''file'' into
803 ''var''.
804
805
806 ''command'' | getline
807
808
809 pipes a record from ''command'' into __$0__ and
810 updates the fields and __NF__.
811
812
813 ''command'' | getline ''var''
814
815
816 pipes a record from ''command'' into
817 ''var''.
818
819
820 Getline returns 0 on end-of-file, -1 on error, otherwise
821 1.
822
823
824 Commands on the end of pipes are executed by
825 /bin/sh.
826
827
828 The function __close__(''expr'') closes the file or
829 pipe associated with ''expr''. Close returns 0 if
830 ''expr'' is an open file, the exit status if ''expr''
831 is a piped command, and -1 otherwise. Close is used to
832 reread a file or command, make sure the other end of an
833 output pipe is finished or conserve file
834 resources.
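Reading from a command with getline and then closing the pipe might look like the sketch below. It assumes a POSIX awk such as mawk and a /bin/sh that provides echo; the command string and the variables cmd, line and n are made up for the example.

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)    # 1 per record, 0 at end of file
        n++
    close(cmd)                          # lets the command be run again later
    print n, line
}')
printf '%s\n' "$out"
```

The parentheses around the getline expression keep the comparison with 0 from being parsed as part of the redirection.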
835
836
837 The function __fflush__(''expr'') flushes the output
838 file or pipe associated with ''expr''. Fflush returns 0
839 if ''expr'' is an open output stream else -1. Fflush
840 without an argument flushes stdout. Fflush with an empty
841 argument ("") flushes all open output.
843
844
845 The function __system__(''expr'') uses /bin/sh to
846 execute ''expr'' and returns the exit status of the
847 command ''expr''. Changes made to the __ENVIRON__
848 array are not passed to commands executed with __system__
849 or pipes.
850
851
852 __10. User defined functions__
853
854
855 The syntax for a user defined function is
856
857
858 __ function__ name( ''args'' ) { ''statements'' }
859 The function body can contain a return statement
860
861
862 __return__ ''opt_expr''
863 A return statement is not required. Function calls may be nested or recursive. Functions are passed expressions by value and arrays by reference. Extra arguments serve as local variables and are initialized to ''null''. For example, csplit(''s,A'') puts each character of ''s'' into array ''A'' and returns the length of ''s''.
864
865
866 function csplit(s, A,    n, i)
867 {
868 n = length(s)
869 for( i = 1 ; i <= n ; i++ ) A[[i] = substr(s, i, 1)
    return n
    }
870 Putting extra space between passed arguments and local variables is conventional. Functions can be referenced before they are defined, but the function name and the '(' of the arguments must touch to avoid confusion with concatenation.
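Recursion, pass-by-reference arrays and local variables can be seen in one small program. A sketch assuming a POSIX awk such as mawk; the functions fact and fill and the array sq are hypothetical names.

```shell
out=$(awk '
# fact() calls itself recursively; n is passed by value.
function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }
# fill() changes the array of the caller because arrays are passed by
# reference; the extra parameter i serves as a local variable.
function fill(A, n,    i) { for (i = 1; i <= n; i++) A[i] = i * i }
BEGIN { fill(sq, 3); print fact(5), sq[3] }
')
printf '%s\n' "$out"
```

The array sq does not exist before the call; fill() creates its elements in the caller's array.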
871
872
873 __11. Splitting strings, records and files__
874
875
876 Awk programs use the same algorithm to split strings into
877 arrays with split(), and records into fields on __FS__.
878 __mawk__ uses essentially the same algorithm to split
879 files into records on __RS__.
880
881
882 Split(''expr,A,sep'') works as follows:
883
884
885 (1)
886
887
888 If ''sep'' is omitted, it is replaced by __FS__.
889 ''Sep'' can be an expression or regular expression. If it
890 is an expression of non-string type, it is converted to
891 string.
892
893
894 (2)
895
896
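897 If ''sep'' = " " (a single space), then <SPACE> is trimmed from the front and back of
898 ''expr'', and ''sep'' becomes <SPACE>.
899 __mawk__ defines <SPACE> as the regular expression /[[ \t\n]+/.
900 Otherwise ''sep'' is treated as a regular
901 expression, except that meta-characters are ignored for a
902 string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/)
903 are the same.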
904
905
906 (3)
907
908
909 If ''expr'' is not string, it is converted to string. If
910 ''expr'' is then the empty string "", split() returns 0 and
911 ''A'' is set empty. Otherwise, all
912 non-overlapping, non-null and longest matches of ''sep''
913 in ''expr'', separate ''expr'' into fields which are
914 loaded into ''A''. The fields are placed in A[[1], A[[2],
915 ..., A[[n] and split() returns n, the number of fields which
916 is the number of matches plus one. Data placed in ''A''
917 that looks numeric is typed number and string.
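The splitting rules in steps (1) to (3) can be sketched as follows, assuming a POSIX awk such as mawk; the strings being split and the array names A and B are illustrative.

```shell
out=$(awk 'BEGIN {
    n = split("a::b:", A, ":+")       # two matches of :+ give three fields
    print n, A[1], A[2], "[" A[3] "]"
    m = split("  x  y ", B, " ")      # single space: trim, then split on blanks
    print m, B[1], B[2]
}')
printf '%s\n' "$out"
```

The trailing separator in the first call yields an empty third field, while the single-space separator in the second call discards leading and trailing blanks.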
918
919
920 Splitting records into fields works the same except the
921 pieces are loaded into __$1__, __$2__,..., __$NF__.
922 If __$0__ is empty, __NF__ is set to 0 and all
923 __$i__ to "".
924
925
926 __mawk__ splits files into records by the same algorithm,
927 but with the slight difference that __RS__ is really a
928 terminator instead of a separator. (__ORS__ is really a
929 terminator too).
930
931
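932 E.g., if __FS__ = ":+" and __$0__ = "a::b:", then
933 __NF__ = 3 and __$1__ = "a",
934 __$2__ = "b" and __$3__ = "", but if "a::b:" is the contents of an input file and
935 __RS__ = ":+", then there are two records
936 "a" and "b".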
937
938
939 __RS__ = " " is not special.
940
941
942 If __FS__ = "", then __mawk__ breaks the
943 record into individual characters, and, similarly,
944 split(''s,A,''"") places the individual characters of
945 ''s'' into ''A''.
946
947
948 __12. Multi-line records__
949
950
951 Since __mawk__ interprets __RS__ as a regular
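952 expression, multi-line records are easy. Setting __RS__ = "\n\n+",
953 makes one or more blank lines separate records. If __FS__ = " " (the default), then single newlines,
954 by the rules for <SPACE> above, become space and single newlines are field separators.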
955
956
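957 For example, if a file is "a b\nc\n\n", __RS__ = "\n\n+" and
958 __FS__ = " ", then there is one record "a b\nc" with three fields "a", "b" and "c". Changing
959 __FS__ = "\n", gives two fields "a b" and "c"; changing
960 __FS__ = "", gives one field identical to the record.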
962
963
964 If you want lines with spaces or tabs to be considered
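965 blank, set __RS__ = "\n([[ \t]*\n)+". For compatibility with other awks, setting
966 __RS__ = "" has the same effect as if blank lines are stripped from the front and back of files and then records are determined as if
967 __RS__ = "\n\n+". Posix requires that "\n" always separates fields when
968 __RS__ = "" regardless of the value of
969 __FS__. __mawk__ does not support this convention,
970 because defining "\n" as <SPACE> makes it unnecessary.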
972
973
974 Most of the time when you change __RS__ for multi-line
975 records, you will also want to change __ORS__ to
976 "\n\n" so the record spacing is preserved on output.
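Blank-line separated records can be tried with the portable setting RS = "", which should behave the same in mawk and other POSIX awks with the default FS. This is a sketch with made-up input; the regular expression form of RS is not used here because it is not guaranteed outside mawk.

```shell
# Two records separated by a blank line; with the default FS the single
# newline inside the first record acts as a field separator.
out=$(printf 'a b\nc\n\nd\n' | awk 'BEGIN { RS = "" } { print NR ": " NF }')
printf '%s\n' "$out"
```

The first record holds three fields and the second holds one.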
977
978
979 __13. Program execution__
980
981
982 This section describes the order of program execution. First
983 __ARGC__ is set to the total number of command line
984 arguments passed to the execution phase of the program.
985 __ARGV[[0]__ is set to the name of the AWK interpreter and
986 __ARGV[[1]__ ... __ARGV[[ARGC-1]__ holds the remaining
987 command line arguments exclusive of options and program
988 source. For example with
989
990
991 mawk -f prog v=1 A t=hello B
992 __ARGC__ = 5 with __ARGV[[0]__ = "mawk", __ARGV[[1]__ = "v=1", __ARGV[[2]__ = "A", __ARGV[[3]__ = "t=hello" and __ARGV[[4]__ = "B".
993
994
995 Next, each __BEGIN__ block is executed in order. If the
996 program consists entirely of __BEGIN__ blocks, then
997 execution terminates, else an input stream is opened and
998 execution continues. If __ARGC__ equals 1, the input
999 stream is set to stdin, else the command line arguments
1000 __ARGV[[1]__ ... __ARGV[[ARGC-1]__ are examined for a
1001 file argument.
1002
1003
1004 The command line arguments divide into three sets: file
1005 arguments, assignment arguments and empty strings "". An assignment has the form
1006 ''var''=''string''. When an __ARGV[[i]__ is examined
1007 as a possible file argument, if it is empty it is skipped;
1008 if it is an assignment argument, the assignment to
1009 ''var'' takes place and __i__ skips to the next
1010 argument; else __ARGV[[i]__ is opened for input. If it
1011 fails to open, execution terminates with exit code 2. If no
1012 command line argument is a file argument, then input comes
1013 from stdin. Getline in a __BEGIN__ action opens input.
1014 "-" as a file argument denotes stdin.
1015
1016
1017 Once an input stream is open, each input record is tested
1018 against each ''pattern'', and if it matches, the
1019 associated ''action'' is executed. An expression pattern
1020 matches if it is boolean true (see the end of section 2). A
1021 __BEGIN__ pattern matches before any input has been read,
1022 and an __END__ pattern matches after all input has been
1023 read. A range pattern, ''expr''1,''expr''2 , matches
1024 every record between the match of ''expr''1 and the match
1025 of ''expr''2 inclusively.
1026
1027
1028 When end of file occurs on the input stream, the remaining
1029 command line arguments are examined for a file argument, and
1030 if there is one it is opened, else the __END__
1031 ''pattern'' is considered matched and all __END__
1032 ''actions'' are executed.
1033
1034
1035 In the example, the assignment v=1 takes place after the
1036 __BEGIN__ ''actions'' are executed, and the data
1037 placed in v is typed number and string. Input is then read
1038 from file A. On end of file A, t is set to the string "hello", and B is opened for input. On end of file B, the
1039 __END__ ''actions'' are executed.
1040
1041
1042 Program flow at the ''pattern {action}'' level can be
1043 changed with the
1044
1045
1046 __next__
1047 __exit__ ''opt_expr''
1048 statements. A __next__ statement causes the next input record to be read and pattern testing to restart with the first ''pattern {action}'' pair in the program. An __exit__ statement causes immediate execution of the __END__ actions or program termination if there are none or if the __exit__ occurs in an __END__ action. The ''opt_expr'' sets the exit value of the program unless overridden by a later __exit__ or subsequent error.
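The execution order described in this section (BEGIN actions, command line assignments, main loop, exit, END actions) can be traced with a short run. A sketch assuming a POSIX awk such as mawk; the variable v, the exit value 3 and the shell variables out and status are chosen only for illustration.

```shell
status=0
# v=1 is an assignment argument and "-" is a file argument for stdin.
out=$(printf 'x\ny\n' | awk '
BEGIN { print "BEGIN: v=[" v "]" }         # runs before v=1 is assigned
      { print "main: v=[" v "]"; exit 3 }  # exit jumps to the END actions
END   { print "END reached" }
' v=1 -) || status=$?
printf '%s\n' "$out"
printf 'exit status %s\n' "$status"
```

The second input record is never processed because exit skips the rest of the input, and the exit value set in the main action survives the END action.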
1049 !!EXAMPLES
1050
1051
1052 1. emulate cat.
1053 { print }
1054 2. emulate wc.
1055 { chars += length($0) + 1 # add one for the \n
1056 words += NF
1057 }
1058 END{ print NR, words, chars }
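1059 3. count the number of unique "real words".
     BEGIN { FS = "[[^A-Za-z]+" }
     { for(i = 1 ; i <= NF ; i++) word[[$i] = "" }
     END { delete word[[""]
     for ( i in word ) cnt++
     print cnt
     }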
1060 4. sum the second field of every record based on the first field.
1061
1062
1063 $1 ~ /credit|gain/ { sum += $2 }
1064 $1 ~ /debit|loss/ { sum -= $2 }
1065 END { print sum }
1066 5. sort a file, comparing as string
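1067 { line[[NR] = $0 "" } # make sure of comparison type
     # in case some lines look numeric
     END { isort(line, NR)
     for(i = 1 ; i <= NR ; i++) print line[[i]
     }
     #insertion sort of A[[1..n]
     function isort( A, n,    i, j, hold)
     {
     for( i = 2 ; i <= n ; i++)
     {
     hold = A[[j = i]
     while ( A[[j-1] > hold )
     { j-- ; A[[j+1] = A[[j] }
     A[[j] = hold
     }
     # sentinel A[[0] = "" will be created if needed
     }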
1068 !!COMPATIBILITY ISSUES
1069
1070
1071 The Posix 1003.2(draft 11.3) definition of the AWK language
1072 is AWK as described in the AWK book with a few extensions
1073 that appeared in SystemVR4 nawk. The extensions
1074 are:
1075
1076
1077 New functions: toupper() and tolower().
1078
1079
1080 New variables: ENVIRON[[] and CONVFMT.
1081
1082
1083 ANSI C conversion specifications for printf() and
1084 sprintf().
1085
1086
1087 New command options: -v var=value, multiple -f options and
1088 implementation options as arguments to -W.
1089
1090
1091 Posix AWK is oriented to operate on files a line at a time.
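1092 __RS__ can be changed from "\n" to another single character, but it is hard to find any use for this; there are no examples in the AWK book. By convention,
1093 __RS__ = "", makes one or more blank lines separate records, allowing multi-line records. When
1094 __RS__ = "", "\n" is always a field separator regardless of the value in
1095 __FS__.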
1096
1097
1098 __mawk__, on the other hand, allows __RS__ to be a
1099 regular expression. When "\n" appears in records, it is treated as space, and
1100 __FS__ always determines
1101 fields.
1102
1103
1104 Removing the line at a time paradigm can make some programs
1105 simpler and can often improve performance. For example,
1106 redoing example 3 from above,
1107
1108
1109 BEGIN { RS = "[[^A-Za-z]+" }
     { word[[ $0 ] = "" }
     END { delete word[[ "" ]
     for( i in word ) cnt++
     print cnt
     }
1110 counts the number of unique words by making each word a record. On moderate size files, __mawk__ executes twice as fast, because of the simplified inner loop.
1111
1112
1113 The following program replaces each comment by a single
1114 space in a C program file,
1115
1116
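1117 BEGIN {
1118 RS = "/\*([[^*]|\*+[[^/*])*\*+/"
     # comment is record separator
     ORS = " "
     getline hold
     }
     { print hold ; hold = $0 }
     END { printf "%s" , hold }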
1119 Buffering one record is needed to avoid terminating the last record with a space.
1120
1121
1122 With __mawk__, the following are all
1123 equivalent,
1124
1125
1126 x ~ /a\+b/ x ~ "a\+b" x ~ "a\\+b"
1127 The strings get scanned twice, once as string and once as regular expression. On the string scan, __mawk__ ignores the escape on non-escape characters while the AWK book advocates \''c'' be recognized as ''c'' which necessitates the double escaping of meta-characters in strings. Posix explicitly declines to define the behavior which passively forces programs that must run under a variety of awks to use the more portable but less readable, double escape.
1128
1129
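1130 Posix AWK does not recognize "/dev/std{out,err}" or \x hex escape sequences in strings. Unlike ANSI C,
1131 __mawk__ limits the number of digits that follow \x to
1132 two as the current implementation only supports 8 bit
1133 characters. The built-in __fflush__ first appeared in a
1134 recent (1993) AT&T awk released to netlib, and is not part of the posix standard. Aggregate deletion with
1135 __delete__ ''array'' is not part of the posix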
1136 standard.
1137
1138
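1139 Posix explicitly leaves the behavior of __FS__ = "" undefined,
1140 and mentions splitting the record into characters as a possible interpretation, but currently this use is not portable across implementations.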
1141
1142
1143 Finally, here is how __mawk__ handles exceptional cases
1144 not discussed in the AWK book or the Posix draft. It is
1145 unsafe to assume consistency across awks and safe to skip to
1146 the next section.
1147
1148
1149 substr(s, i, n) returns the characters of s in the
1150 intersection of the closed interval [[1, length(s)] and the
1151 half-open interval [[i, i+n). When this intersection is
1152 empty, the empty string is returned; so
1153 substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) = "A".
1154
1155
1156 Every string, including the empty string, matches the empty
1157 string at the front so, s ~ // and s ~ "", are always 1 as is match(s, //) and match(s, ""). The last two set
1158 __RLENGTH__ to 0.
1159
1160
1161 index(s, t) is always the same as match(s, t1) where t1 is
1162 the same as t with metacharacters escaped. Hence consistency
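1163 with match requires that index(s, "") always returns 1. Also the condition, index(s,t) != 0 if and only if t is a substring of s, requires index("","") = 1.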
1164
1165
1166 If getline encounters end of file, getline var, leaves var
1167 unchanged. Similarly, on entry to the __END__ actions,
1168 __$0__, the fields and __NF__ have their value
1169 unaltered from the last record.
1170 !!SEE ALSO
1171
1172
1173 egrep(1)
1174
1175
1176 Aho, Kernighan and Weinberger, ''The AWK Programming
1177 Language'', Addison-Wesley Publishing, 1988, (the AWK
1178 book), defines the language, opening with a tutorial and
1179 advancing to many interesting programs that delve into
1180 issues of software design and analysis relevant to
1181 programming in any language.
1182
1183
1184 ''The GAWK Manual'', The Free Software Foundation, 1991,
1185 is a tutorial and language reference that does not attempt
1186 the depth of the AWK book and assumes the reader may be a
1187 novice programmer. The section on AWK arrays is excellent.
1188 It also discusses Posix requirements for AWK.
1189 !!BUGS
1190
1191
1192 __mawk__ cannot handle ascii NUL \0 in the source or data
1193 files. You can output NUL using printf with %c, and any
1194 other 8 bit character is acceptable input.
1195
1196
1197 __mawk__ implements printf() and sprintf() using the C
1198 library functions, printf and sprintf, so full ANSI
1199 compatibility requires an ANSI C library. In practice this
1200 means the h conversion qualifier may not be available. Also
1201 __mawk__ inherits any bugs or limitations of the library
1202 functions.
1203
1204
1205 Implementors of the AWK language have shown a consistent
1206 lack of imagination when naming their programs.
1207 !!AUTHOR
1208
1209
1210 Mike Brennan (brennan@whidbey.com).
1211 ----