Annotated edit history of flex(1) version 2, including all changes.
Rev Author # Line
1 perry 1 FLEX
2 !!!FLEX
3 NAME
4 SYNOPSIS
5 OVERVIEW
6 DESCRIPTION
7 SOME SIMPLE EXAMPLES
8 FORMAT OF THE INPUT FILE
9 PATTERNS
10 HOW THE INPUT IS MATCHED
11 ACTIONS
12 THE GENERATED SCANNER
13 START CONDITIONS
14 MULTIPLE INPUT BUFFERS
15 END-OF-FILE RULES
16 MISCELLANEOUS MACROS
17 VALUES AVAILABLE TO THE USER
18 INTERFACING WITH YACC
19 OPTIONS
20 PERFORMANCE CONSIDERATIONS
21 GENERATING C++ SCANNERS
22 INCOMPATIBILITIES WITH LEX AND POSIX
23 DIAGNOSTICS
24 FILES
25 DEFICIENCIES / BUGS
26 SEE ALSO
27 AUTHOR
28 ----
29 !!NAME
30
31
32 flex - fast lexical analyzer generator
33 !!SYNOPSIS
34
35
36 __flex [[-bcdfhilnpstvwBFILTV78+? -C[[aefFmr] -ooutput
37 -Pprefix -Sskeleton] [[--help --version]__ ''[[filename
38 ...]''
39 !!OVERVIEW
40
41
42 This manual describes ''flex,'' a tool for generating
43 programs that perform pattern-matching on text. The manual
44 includes both tutorial and reference sections:
45
46
47 Description
48 a brief overview of the tool
49 Some Simple Examples
50 Format Of The Input File
51 Patterns
52 the extended regular expressions used by flex
53 How The Input Is Matched
54 the rules for determining what has been matched
55 Actions
56 how to specify what to do when a pattern is matched
57 The Generated Scanner
58 details regarding the scanner that flex produces;
59 how to control the input source
60 Start Conditions
61 introducing context into your scanners, and
62 managing
63 !!DESCRIPTION
64
65
66 ''flex'' is a tool for generating ''scanners:''
67 programs which recognize lexical patterns in text.
68 ''flex'' reads the given input files, or its standard
69 input if no file names are given, for a description of a
70 scanner to generate. The description is in the form of pairs
71 of regular expressions and C code, called ''rules.'' ''flex''
72 generates as output a C source file, __lex.yy.c,__ which
73 defines a routine __yylex().__ This file is compiled and
74 linked with the __-lfl__ library to produce an
75 executable. When the executable is run, it analyzes its
76 input for occurrences of the regular expressions. Whenever
77 it finds one, it executes the corresponding C
78 code.
79 !!SOME SIMPLE EXAMPLES
80
81
82 First some simple examples to get the flavor of how one uses
83 ''flex.'' The following ''flex'' input specifies a
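84 scanner which, whenever it encounters the string "username", will replace it with the user's login name:
86
87
88 %%
89 username printf( "%s", getlogin() );
90 By default, any text not matched by a ''flex'' scanner is copied to the output, so the net effect of this scanner is to copy its input file to its output with each occurrence of "username" expanded. In this input, there is just one rule: "username" is the ''pattern'' and the "printf" is the ''action.'' The "%%" marks the beginning of the rules.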
91
92
93 Here's another simple example:
94
95
96 int num_lines = 0, num_chars = 0;
97 %%
98 \n ++num_lines; ++num_chars;
99 . ++num_chars;
100 %%
101 main()
102 {
103 yylex();
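104 printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars );
}
105 This scanner counts the number of characters and the number of lines in its input (it produces no output other than the final report on the counts). The first line declares two globals, "num_lines" and "num_chars", which are accessible both inside __yylex()__ and in the __main()__ routine declared after the second "%%". There are two rules: one which matches a newline ("\n") and increments both the line count and the character count, and one which matches any character other than a newline (indicated by the "." regular expression).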
106
107
108 A somewhat more complicated example:
109
110
111 /* scanner for a toy Pascal-like language */
112 %{
113 /* need this for the call to atof() below */
114 #include
115 This is the beginnings of a simple scanner for a language like Pascal. It identifies different types of ''tokens'' and reports on what it has seen.
116
117
118 The details of this example will be explained in the
119 following sections.
120 !!FORMAT OF THE INPUT FILE
121
122
123 The ''flex'' input file consists of three sections,
124 separated by a line with just __%%__ in it:
125
126
127 definitions
128 %%
129 rules
130 %%
131 user code
132 The ''definitions'' section contains declarations of simple ''name'' definitions to simplify the scanner specification, and declarations of ''start conditions,'' which are explained in a later section.
133
134
135 Name definitions have the form:
136
137
138 name definition
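139 The ''name'' is a word beginning with a letter or an underscore ('_') followed by zero or more letters, digits, '_', or '-' (dash). The definition is taken to begin at the first non-white-space character following the name and to continue to the end of the line. The definition can subsequently be referred to using "{name}", which will expand to "(definition)". For example,
140
141
142 DIGIT [[0-9]
143 ID [[a-z][[a-z0-9]*
144 defines "DIGIT" to be a regular expression which matches a single digit, and "ID" to be a regular expression which matches a letter followed by zero-or-more letters-or-digits. A subsequent reference to
145
146
147 {DIGIT}+"."{DIGIT}*
148 is identical to
149
150
151 ([[0-9])+"."([[0-9])*
152 and matches one-or-more digits followed by a '.' followed by zero-or-more digits.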
153
154
155 The ''rules'' section of the ''flex'' input contains a
156 series of rules of the form:
157
158
159 pattern action
160 where the pattern must be unindented and the action must begin on the same line.
161
162
163 See below for a further description of patterns and
164 actions.
165
166
167 Finally, the user code section is simply copied to
168 __lex.yy.c__ verbatim. It is used for companion routines
169 which call or are called by the scanner. The presence of
170 this section is optional; if it is missing, the second
171 __%%__ in the input file may be skipped,
172 too.
173
174
175 In the definitions and rules sections, any ''indented''
176 text or text enclosed in __%{__ and __%}__ is copied
177 verbatim to the output (with the %{}'s removed). The %{}'s
178 must appear unindented on lines by themselves.
179
180
181 In the rules section, any indented or %{} text appearing
182 before the first rule may be used to declare variables which
183 are local to the scanning routine and (after the
184 declarations) code which is to be executed whenever the
185 scanning routine is entered. Other indented or %{} text in
186 the rule section is still copied to the output, but its
187 meaning is not well-defined and it may well cause
188 compile-time errors (this feature is present for
189 ''POSIX'' compliance; see below for other such
190 features).
191
192
193 In the definitions section (but not in the rules section),
194 an unindented comment (i.e., a line beginning with "/*") is also copied verbatim to the output up to the next "*/".
195 !!PATTERNS
196
197
198 The patterns in the input are written using an extended set
199 of regular expressions. These are:
200
201
202 x match the character 'x'
203 . any character (byte) except newline
204 [[xyz] a "character class"; in this case, the pattern matches either an 'x', a 'y', or a 'z'
205 Note that inside of a character class, all regular expression operators lose their special meaning except escape ('\') and the character class operators, '-', ']', and, at the beginning of the class, '^'.
206
207
208 The regular expressions listed above are grouped according
209 to precedence, from highest precedence at the top to lowest
210 at the bottom. Those grouped together have equal precedence.
211 For example,
212
213
214 foo|bar*
215 is the same as
216
217
218 (foo)|(ba(r*))
219 since the '*' operator has higher precedence than concatenation, and concatenation higher than alternation ('|'). This pattern therefore matches ''either'' the string "foo" ''or'' the string "ba" followed by zero-or-more r's. To match "foo" or zero-or-more "bar"'s, use:
220
221
222 foo|(bar)*
223 and to match zero-or-more "foo"'s-or-"bar"'s:
224
225
226 (foo|bar)*
227 In addition to characters and ranges of characters, character classes can also contain character class ''expressions.'' These are expressions enclosed inside __[[:__ and __:]__ delimiters (which themselves must appear between the '[[' and ']' of the character class; other elements may occur inside the character class, too). The valid expressions are:
228
229
230 [[:alnum:] [[:alpha:] [[:blank:]
231 [[:cntrl:] [[:digit:] [[:graph:]
232 [[:lower:] [[:print:] [[:punct:]
233 [[:space:] [[:upper:] [[:xdigit:]
234 These expressions all designate a set of characters equivalent to the corresponding standard C __isXXX__ function. For example, __[[:alnum:]__ designates those characters for which __isalnum()__ returns true - i.e., any alphabetic or numeric. Some systems don't provide __isblank(),__ so flex defines __[[:blank:]__ as a blank or a tab.
235
236
237 For example, the following character classes are all
238 equivalent:
239
240
241 [[[[:alnum:]]
242 [[[[:alpha:][[:digit:]]
243 [[[[:alpha:][[0-9]]
244 [[a-zA-Z0-9]
245 If your scanner is case-insensitive (the __-i__ flag), then __[[:upper:]__ and __[[:lower:]__ are equivalent to __[[:alpha:].__
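
The equivalences above can be checked directly against the C library. The following is only a sketch: the helper names are hypothetical, and it assumes an ASCII character set and the default "C" locale, where the four spellings of the class agree.

```c
#include <ctype.h>

/* Hypothetical helper: the fallback meaning of [:blank:] described above. */
int in_blank_class(int c)
{
    return c == ' ' || c == '\t';
}

/* Hypothetical helper: membership in the explicit class [a-zA-Z0-9]. */
int in_explicit_alnum(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Count the 7-bit characters on which the spellings of the class disagree. */
int count_alnum_mismatches(void)
{
    int c, mismatches = 0;

    for (c = 0; c < 128; ++c) {
        int by_alnum = isalnum(c) != 0;
        int by_parts = isalpha(c) != 0 || isdigit(c) != 0;

        if (by_alnum != by_parts || by_alnum != in_explicit_alnum(c))
            ++mismatches;
    }
    return mismatches;
}
```

In other locales the library functions may accept additional characters, so the explicit range form is not guaranteed to stay equivalent there.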
246
247
248 Some notes on patterns:
249
250
251 -
252
253
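254 A negated character class such as "[[^A-Z]"
255 will match a newline ''unless'' "\n" (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., "[[^A-Z\n]").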
257
258
259 -
260
261
262 A rule can have at most one instance of trailing context
263 (the '/' operator or the '$' operator). The start condition,
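264 '^', and "<<EOF>>" patterns can only occur at the beginning of a pattern and, like '/' and '$', cannot be grouped inside parentheses.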
265
266
267 The following are illegal:
268
269
270 foo/bar$
<sc1>foo<sc2>bar
271 Note that the first of these can be written "foo/bar\n".
272
273
274 The following will result in '$' or '^' being treated as a
275 normal character:
276
277
278 foo|(bar$)
279 foo|^bar
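280 If what's wanted is a "foo" or a bar-followed-by-a-newline, the following could be used (the special '|' action is explained below):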
281
282
283 foo |
284 bar$ /* action goes here */
285 A similar trick will work for matching a foo or a bar-at-the-beginning-of-a-line.
286 !!HOW THE INPUT IS MATCHED
287
288
289 When the generated scanner is run, it analyzes its input
290 looking for strings which match any of its patterns. If it
291 finds more than one match, it takes the one matching the
292 most text (for trailing context rules, this includes the
293 length of the trailing part, even though it will then be
294 returned to the input). If it finds two or more matches of
295 the same length, the rule listed first in the ''flex''
296 input file is chosen.
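
The selection rule just described (longest match wins, ties go to the earlier rule) can be sketched in plain C. This is an illustration of the rule only, not how a generated scanner is implemented; `pick_rule` and its arguments are hypothetical names, and a length of 0 is assumed to mean that the rule did not match.

```c
#include <stddef.h>

/*
 * match_len[i] is how many characters rule i matched at the current
 * position (0 means no match); rules are indexed in file order.
 * Returns the chosen rule, or -1 when nothing matched, in which case
 * the scanner's default rule would apply.
 */
int pick_rule(const size_t *match_len, int n_rules)
{
    int i, best = -1;
    size_t best_len = 0;

    for (i = 0; i < n_rules; ++i) {
        /* Strictly greater: an earlier rule keeps its place on a tie. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = i;
        }
    }
    return best;
}
```

For trailing context rules, the length passed in would include the trailing part, as noted above.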
297
298
299 Once the match is determined, the text corresponding to the
300 match (called the ''token)'' is made available in the
301 global character pointer __yytext,__ and its length in
302 the global integer __yyleng.__ The ''action''
303 corresponding to the matched pattern is then executed (a
304 more detailed description of actions follows), and then the
305 remaining input is scanned for another match.
306
307
308 If no match is found, then the ''default rule'' is
309 executed: the next character in the input is considered
310 matched and copied to the standard output. Thus, the
311 simplest legal ''flex'' input is:
312
313
314 %%
315 which generates a scanner that simply copies its input (one character at a time) to its output.
316
317
318 Note that __yytext__ can be defined in two different
319 ways: either as a character ''pointer'' or as a character
320 ''array.'' You can control which definition ''flex''
321 uses by including one of the special directives
322 __%pointer__ or __%array__ in the first (definitions)
323 section of your flex input. The default is __%pointer,__
324 unless you use the __-l__ lex compatibility option, in
325 which case __yytext__ will be an array. The advantage of
326 using __%pointer__ is substantially faster scanning and
327 no buffer overflow when matching very large tokens (unless
328 you run out of dynamic memory). The disadvantage is that you
329 are restricted in how your actions can modify __yytext__
330 (see the next section), and calls to the __unput()__
331 function destroy the present contents of __yytext,__
332 which can be a considerable porting headache when moving
333 between different ''lex'' versions.
334
335
336 The advantage of __%array__ is that you can then modify
337 __yytext__ to your heart's content, and calls to
338 __unput()__ do not destroy __yytext__ (see below).
339 Furthermore, existing ''lex'' programs sometimes access
340 __yytext__ externally using declarations of the
341 form:
342
343
344 extern char yytext[[];
345 This definition is erroneous when used with __%pointer,__ but correct for __%array.__
346
347
348 __%array__ defines __yytext__ to be an array of
349 __YYLMAX__ characters, which defaults to a fairly large
350 value. You can change the size by simply #define'ing
351 __YYLMAX__ to a different value in the first section of
352 your ''flex'' input. As mentioned above, with
353 __%pointer__ yytext grows dynamically to accommodate
354 large tokens. While this means your __%pointer__ scanner
355 can accommodate very large tokens (such as matching entire
356 blocks of comments), bear in mind that each time the scanner
357 must resize __yytext__ it also must rescan the entire
358 token from the beginning, so matching such tokens can prove
359 slow. __yytext__ presently does ''not'' dynamically
360 grow if a call to __unput()__ results in too much text
361 being pushed back; instead, a run-time error
362 results.
363
364
365 Also note that you cannot use __%array__ with C++ scanner
366 classes (the __c++__ option; see below).
367 !!ACTIONS
368
369
370 Each pattern in a rule has a corresponding action, which can
371 be any arbitrary C statement. The pattern ends at the first
372 non-escaped whitespace character; the remainder of the line
373 is its action. If the action is empty, then when the pattern
374 is matched the input token is simply discarded. For example,
375 here is the specification for a program which deletes all
376 occurrences of "zap me" from its input:
377
378
379 %%
"zap me"
380 (It will copy all other characters in the input to the output since they will be matched by the default rule.)
381
382
383 Here is a program which compresses multiple blanks and tabs
384 down to a single blank, and throws away whitespace found at
385 the end of a line:
386
387
388 %%
389 [[ \t]+ putchar( ' ' );
390 [[ \t]+$ /* ignore this token */
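
For comparison, here is roughly what those two rules do, written as a hand-coded C function. It is a sketch of the observable behaviour under the matching rules described earlier, not generated scanner code, and `squeeze_blanks` is a hypothetical name.

```c
#include <stddef.h>

/*
 * Copy `in` to `out`, replacing each run of blanks and tabs with one
 * blank, except that a run immediately followed by a newline is dropped.
 * `out` must have room for strlen(in) + 1 bytes. Returns the output length.
 */
size_t squeeze_blanks(const char *in, char *out)
{
    size_t n = 0;

    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            /* Consume the whole run, as [ \t]+ would. */
            while (*in == ' ' || *in == '\t')
                ++in;
            /* Like [ \t]+$: a run just before a newline is discarded. */
            if (*in != '\n')
                out[n++] = ' ';
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

A run of whitespace at the very end of the input, with no newline after it, is reduced to one blank rather than dropped, because '$' only matches before a newline.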
391 If the action contains a '{', then the action spans till the balancing '}' is found, and the action may cross multiple lines. ''flex'' knows about C strings and comments and won't be fooled by braces found within them, but also allows actions to begin with __%{__ and will consider the action to be all the text up to the next __%}__ (regardless of ordinary braces inside the action).
392
393
394 An action consisting solely of a vertical bar ('|') means "same as the action for the next rule." See below for an illustration.
395
396
397 Actions can include arbitrary C code, including
398 __return__ statements to return a value to whatever
399 routine called __yylex().__ Each time __yylex()__ is
400 called it continues processing tokens from where it last
401 left off until it either reaches the end of the file or
402 executes a return.
403
404
405 Actions are free to modify __yytext__ except for
406 lengthening it (adding characters to its end--these will
407 overwrite later characters in the input stream). This
408 however does not apply when using __%array__ (see above);
409 in that case, __yytext__ may be freely modified in any
410 way.
411
412
413 Actions are free to modify __yyleng__ except they should
414 not do so if the action also includes use of __yymore()__
415 (see below).
416
417
418 There are a number of special directives which can be
419 included within an action:
420
421
422 -
423
424
425 __ECHO__ copies yytext to the scanner's
426 output.
427
428
429 -
430
431
432 __BEGIN__ followed by the name of a start condition
433 places the scanner in the corresponding start condition (see
434 below).
435
436
437 -
438
439
440 __REJECT__ directs the scanner to proceed on to the "second best" rule which matched the input (or a prefix of the input). The rule is chosen as described above in How The Input Is Matched, with
441 __yytext__
442 and __yyleng__ set up appropriately. It may either be one
443 which matched as much text as the originally chosen rule but
444 came later in the ''flex'' input file, or one which
445 matched less text. For example, the following will both
446 count the words in the input and call the routine special()
447 whenever "frob" is seen:
448
449
450 int word_count = 0;
451 %%
452 frob special(); REJECT;
453 [[^ \t\n]+ ++word_count;
454 Without the __REJECT,__ any "frob"'s in the input would not be counted as words, since the scanner normally executes only one action per token. Multiple __REJECT's__ are allowed, each one finding the next best choice to the currently active rule. For example, when the following scanner scans the token "abcd", it will write "abcdabcaba" to the output:
455
456
457 %%
458 a |
459 ab |
460 abc |
461 abcd ECHO; REJECT;
462 .|\n /* eat up any unmatched character */
463 (The first three rules share the fourth's action since they use the special '|' action.) __REJECT__ is a particularly expensive feature in terms of scanner performance; if it is used in ''any'' of the scanner's actions it will slow down ''all'' of the scanner's matching. Furthermore, __REJECT__ cannot be used with the ''-Cf'' or ''-CF'' options (see below).
464
465
466 Note also that unlike the other special actions,
467 __REJECT__ is a ''branch;'' code immediately following
468 it in the action will ''not'' be executed.
469
470
471 -
472
473
474 __yymore()__ tells the scanner that the next time it
475 matches a rule, the corresponding token should be
476 ''appended'' onto the current value of __yytext__
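477 rather than replacing it. For example, given the input
478 "mega-kludge" the following will write "mega-mega-kludge" to the output:
479
480
481 %%
482 mega- ECHO; yymore();
483 kludge ECHO;
484 First "mega-" is matched and echoed to the output. Then "kludge" is matched, but the previous "mega-" is still hanging around at the beginning of __yytext__ so the __ECHO__ for the "kludge" rule will actually write "mega-kludge".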
485
486
487 Two notes regarding use of __yymore().__ First,
488 __yymore()__ depends on the value of ''yyleng''
489 correctly reflecting the size of the current token, so you
490 must not modify ''yyleng'' if you are using
491 __yymore().__ Second, the presence of __yymore()__ in
492 the scanner's action entails a minor performance penalty in
493 the scanner's matching speed.
494
495
496 -
497
498
499 __yyless(n)__ returns all but the first ''n''
500 characters of the current token back to the input stream,
501 where they will be rescanned when the scanner looks for the
502 next match. __yytext__ and __yyleng__ are adjusted
503 appropriately (e.g., __yyleng__ will now be equal to
504 ''n'' ). For example, on the input "foobar"
505 the following will write out "foobarbar":
506
507
508 %%
509 foobar ECHO; yyless(3);
510 [[a-z]+ ECHO;
511 An argument of 0 to __yyless__ will cause the entire current input string to be scanned again. Unless you've changed how the scanner will subsequently process its input (using __BEGIN,__ for example), this will result in an endless loop.
512
513
514 Note that __yyless__ is a macro and can only be used in
515 the flex input file, not from other source
516 files.
517
518
519 -
520
521
522 __unput(c)__ puts the character ''c'' back onto the
523 input stream. It will be the next character scanned. The
524 following action will take the current token and cause it to
525 be rescanned enclosed in parentheses.
526
527
528 {
529 int i;
530 /* Copy yytext because unput() trashes yytext */
531 char *yycopy = strdup( yytext );
532 unput( ')' );
533 for ( i = yyleng - 1; i >= 0; --i )
unput( yycopy[[i] );
unput( '(' );
free( yycopy );
}
534 Note that since each __unput()__ puts the given character back at the ''beginning'' of the input stream, pushing back strings must be done back-to-front.
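
The back-to-front ordering can be seen with a small stand-alone model of a pushback stack. This is a sketch only: `pb_unput`, `pb_input` and the fixed-size buffer are hypothetical stand-ins, not the scanner's own routines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the scanner's input: a small pushback stack. */
static char pushback[64];
static size_t pushed = 0;

/* Like unput(c): c becomes the next character to be read.
   (This sketch silently ignores pushes once its buffer is full.) */
void pb_unput(char c)
{
    if (pushed < sizeof pushback)
        pushback[pushed++] = c;
}

/* Like input(): returns the next character, or -1 when nothing is left. */
int pb_input(void)
{
    return pushed > 0 ? (unsigned char)pushback[--pushed] : -1;
}

/* Push `token` back wrapped in parentheses: last character first. */
void pb_unput_parenthesized(const char *token)
{
    size_t i = strlen(token);

    pb_unput(')');
    while (i > 0)
        pb_unput(token[--i]);
    pb_unput('(');
}
```

Reading the characters back with `pb_input()` yields the opening parenthesis, the token in its original order, and then the closing parenthesis.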
535
536
537 An important potential problem when using __unput()__ is
538 that if you are using __%pointer__ (the default), a call
539 to __unput()__ ''destroys'' the contents of
540 ''yytext,'' starting with its rightmost character and
541 devouring one character to the left with each call. If you
542 need the value of yytext preserved after a call to
543 __unput()__ (as in the above example), you must either
544 first copy it elsewhere, or build your scanner using
545 __%array__ instead (see How The Input Is
546 Matched).
547
548
549 Finally, note that you cannot put back __EOF__ to attempt
550 to mark the input stream with an end-of-file.
551
552
553 -
554
555
556 __input()__ reads the next character from the input
557 stream. For example, the following is one way to eat up C
558 comments:
559
560
561 %%
562 (Note that if the scanner is compiled using __C++,__ then __input()__ is instead referred to as __yyinput(),__ in order to avoid a name clash with the __C++__ stream by the name of ''input.)''
563
564
565 -
566
567
568 __YY_FLUSH_BUFFER__ flushes the scanner's internal buffer
569 so that the next time the scanner attempts to match a token,
570 it will first refill the buffer using __YY_INPUT__ (see
571 The Generated Scanner, below). This action is a special case
572 of the more general __yy_flush_buffer()__ function,
573 described below in the section Multiple Input
574 Buffers.
575
576
577 -
578
579
580 __yyterminate()__ can be used in lieu of a return
581 statement in an action. It terminates the scanner and
582 returns a 0 to the scanner's caller, indicating "all done".
583 By default, __yyterminate()__ is also called
584 when an end-of-file is encountered. It is a macro and may be
585 redefined.
586 !!THE GENERATED SCANNER
587
588
589 The output of ''flex'' is the file __lex.yy.c,__ which
590 contains the scanning routine __yylex(),__ a number of
591 tables used by it for matching tokens, and a number of
592 auxiliary routines and macros. By default, __yylex()__ is
593 declared as follows:
594
595
596 int yylex()
597 {
598 ... various definitions and the actions in here ...
599 }
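600 (If your environment supports function prototypes, then it will be "int yylex( void )".) This definition may be changed by defining the "YY_DECL" macro. For example, you could use:
601
602
603 #define YY_DECL float lexscan( a, b ) float a, b;
604 to give the scanning routine the name ''lexscan,'' returning a float, and taking two floats as arguments. Note that if you give arguments to the scanning routine using a K&R-style/non-prototyped function declaration, you must terminate the definition with a semi-colon (;).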
605
606
607 Whenever __yylex()__ is called, it scans tokens from the
608 global input file ''yyin'' (which defaults to stdin). It
609 continues until it either reaches an end-of-file (at which
610 point it returns the value 0) or one of its actions executes
611 a ''return'' statement.
612
613
614 If the scanner reaches an end-of-file, subsequent calls are
615 undefined unless either ''yyin'' is pointed at a new
616 input file (in which case scanning continues from that
617 file), or __yyrestart()__ is called. __yyrestart()__
618 takes one argument, a __FILE *__ pointer (which can be
619 nil, if you've set up __YY_INPUT__ to scan from a source
620 other than ''yyin),'' and initializes ''yyin'' for
621 scanning from that file. Essentially there is no difference
622 between just assigning ''yyin'' to a new input file or
623 using __yyrestart()__ to do so; the latter is available
624 for compatibility with previous versions of ''flex,'' and
625 because it can be used to switch input files in the middle
626 of scanning. It can also be used to throw away the current
627 input buffer, by calling it with an argument of ''yyin;''
628 but better is to use __YY_FLUSH_BUFFER__ (see above).
629 Note that __yyrestart()__ does ''not'' reset the start
630 condition to __INITIAL__ (see Start Conditions,
631 below).
632
633
634 If __yylex()__ stops scanning due to executing a
635 ''return'' statement in one of the actions, the scanner
636 may then be called again and it will resume scanning where
637 it left off.
638
639
640 By default (and for purposes of efficiency), the scanner
641 uses block-reads rather than simple ''getc()'' calls to
642 read characters from ''yyin.'' The nature of how it gets
643 its input can be controlled by defining the __YY_INPUT__
644 macro. YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
645 action is to place up to ''max_size'' characters in the character
646 array ''buf'' and return in the integer variable
647 ''result'' either the number of characters read or the
648 constant YY_NULL (0 on Unix systems) to indicate EOF. The
649 default YY_INPUT reads from the global file-pointer
650 "yyin".
651
652
653 A sample definition of YY_INPUT (in the definitions section
654 of the input file):
655
656
657 %{
658 #define YY_INPUT(buf,result,max_size) \
659 { \
660 int c = getchar(); \
661 result = (c == EOF) ? YY_NULL : (buf[[0] = c, 1); \
662 }
663 %}
664 This definition will change the input processing to occur one character at a time.
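
The same contract can be met by a source other than a file. The sketch below writes it as an ordinary function over an in-memory string so it can be read and tried in isolation; `src_open` and `src_read` are hypothetical names, and wiring such a function into a YY_INPUT definition is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory source standing in for yyin. */
static const char *src_text = "";
static size_t src_pos = 0;

/* Start reading from the beginning of `text`. */
void src_open(const char *text)
{
    src_text = text;
    src_pos = 0;
}

/*
 * Same contract as YY_INPUT(buf,result,max_size), written as a function:
 * store up to max_size characters in buf and return how many were stored,
 * or 0 (the value of YY_NULL on Unix systems) at end of input.
 * max_size is assumed to be positive.
 */
int src_read(char *buf, int max_size)
{
    size_t left = strlen(src_text) - src_pos;
    size_t n = left < (size_t)max_size ? left : (size_t)max_size;

    memcpy(buf, src_text + src_pos, n);
    src_pos += n;
    return (int)n;
}
```

Unlike the one-character-at-a-time definition, this version hands back as many characters as fit in each call.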
665
666
667 When the scanner receives an end-of-file indication from
668 YY_INPUT, it then checks the __yywrap()__ function. If
669 __yywrap()__ returns false (zero), then it is assumed
670 that the function has gone ahead and set up ''yyin'' to
671 point to another input file, and scanning continues. If it
672 returns true (non-zero), then the scanner terminates,
673 returning 0 to its caller. Note that in either case, the
674 start condition remains unchanged; it does ''not'' revert
675 to __INITIAL.__
676
677
678 If you do not supply your own version of __yywrap(),__
679 then you must either use __%option noyywrap__ (in which
680 case the scanner behaves as though __yywrap()__ returned
681 1), or you must link with __-lfl__ to obtain the default
682 version of the routine, which always returns 1.
683
684
685 Three routines are available for scanning from in-memory
686 buffers rather than files: __yy_scan_string(),
687 yy_scan_bytes(),__ and __yy_scan_buffer().__ See the
688 discussion of them below in the section Multiple Input
689 Buffers.
690
691
692 The scanner writes its __ECHO__ output to the
693 ''yyout'' global (default, stdout), which may be
694 redefined by the user simply by assigning it to some other
695 __FILE__ pointer.
696 !!START CONDITIONS
697
698
699 ''flex'' provides a mechanism for conditionally
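700 activating rules. Any rule whose pattern is prefixed with
701 "<sc>" will only be active when the scanner is in the start condition named "sc". For example,
702
703
<STRING>[[^"]* { /* eat up the string body ... */
...
}
704 will be active only when the scanner is in the "STRING" start condition, and
705
706
<INITIAL,STRING,QUOTE>\. { /* handle an escape ... */
...
}
707 will be active only when the current start condition is either "INITIAL", "STRING", or "QUOTE".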
708
709
710 Start conditions are declared in the definitions (first)
711 section of the input using unindented lines beginning with
712 either __%s__ or __%x__ followed by a list of names.
713 The former declares ''inclusive'' start conditions, the
714 latter ''exclusive'' start conditions. A start condition
715 is activated using the __BEGIN__ action. Until the next
716 __BEGIN__ action is executed, rules with the given start
717 condition will be active and rules with other start
718 conditions will be inactive. If the start condition is
719 ''inclusive,'' then rules with no start conditions at all
720 will also be active. If it is ''exclusive,'' then
721 ''only'' rules qualified with the start condition will be
722 active. A set of rules contingent on the same exclusive
723 start condition describe a scanner which is independent of
724 any of the other rules in the ''flex'' input. Because of
725 this, exclusive start conditions make it easy to specify
726 ''
727
728
729 If the distinction between inclusive and exclusive start
730 conditions is still a little vague, here's a simple example
731 illustrating the connection between the two. The set of
732 rules:
733
734
735 %s example
736 %%
<example>foo do_something();
bar something_else();
737 is equivalent to
738
739
740 %x example
741 %%
<example>foo do_something();
<INITIAL,example>bar something_else();
742 Without the __<INITIAL,example>__ qualifier, the ''bar'' pattern in the second example wouldn't be active (i.e., couldn't match) when in start condition __example.__ If we just used __<example>__ to qualify ''bar,'' though, then it would only be active in __example__ and not in __INITIAL,__ while in the first example it's active in both, because in the first example the __example__ start condition is an ''inclusive'' __(%s)__ start condition.
743
744
745 Also note that the special start-condition specifier
746 __<*>__ matches every start condition. Thus, the
747 above example could also have been written:
748
749
750 %x example
751 %%
<example>foo do_something();
<*>bar something_else();
752 The default rule (to __ECHO__ any unmatched character) remains active in start conditions. It is equivalent to:
753
754
<*>.|\n ECHO;
755 __BEGIN(0)__ returns to the original state where only the rules with no start conditions are active. This state can also be referred to as the start-condition "INITIAL", so __BEGIN(INITIAL)__ is equivalent to __BEGIN(0).__ (The parentheses around the start condition name are not required but are considered good style.)
756
757
758 __BEGIN__ actions can also be given as indented code at
759 the beginning of the rules section. For example, the
760 following will cause the scanner to enter the "SPECIAL" start condition whenever
761 __yylex()__
762 is called and the global variable ''enter_special'' is
763 true:
764
765
766 int enter_special;
767 %x SPECIAL
768 %%
769 if ( enter_special )
770 BEGIN(SPECIAL);
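771 To illustrate the uses of start conditions, here is a scanner which provides two different interpretations of a string like "123.456". By default it will treat it as three tokens: the integer "123", a dot ('.'), and the integer "456". But if the string is preceded earlier in the line by the string "expect-floats", it will treat it as a single token, the floating-point number 123.456: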
772
773
774 %{
775 #include
776 Here is a scanner which recognizes (and discards) C comments while maintaining a count of the current input line.
777
778
779 %x comment
780 %%
781 int line_num = 1;
782 This scanner goes to a bit of trouble to match as much text as possible with each rule. In general, when attempting to write a high-speed scanner, try to match as much as possible in each rule, as it's a big win.
783
784
785 Note that start-conditions names are really integer values
786 and can be stored as such. Thus, the above could be extended
787 in the following fashion:
788
789
790 %x comment foo
791 %%
792 int line_num = 1;
793 int comment_caller;
794 Furthermore, you can access the current start condition using the integer-valued __YY_START__ macro. For example, the above assignments to ''comment_caller'' could instead be written
795
796
797 comment_caller = YY_START;
798 Flex provides __YYSTATE__ as an alias for __YY_START__ (since that is what's used by AT&T ''lex'').
799
800
801 Note that start conditions do not have their own name-space;
802 %s's and %x's declare names in the same fashion as
803 #define's.
804
805
806 Finally, here's an example of how to match C-style quoted
807 strings using exclusive start conditions, including expanded
808 escape sequences (but not including checking for a string
809 that's too long):
810
811
812 %x str
813 %%
814 char string_buf[[MAX_STR_CONST];
815 char *string_buf_ptr;
816 Often, such as in some of the examples above, you wind up writing a whole bunch of rules all preceded by the same start condition(s). Flex makes this a little easier and cleaner by introducing a notion of start condition ''scope.'' A start condition scope is begun with:
817
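818
<SCs>{
819 where ''SCs'' is a list of one or more start conditions. Inside the start condition scope, every rule automatically has the prefix ''<SCs>'' applied to it, until a '''}''' which matches the initial '''{'.'' So, for example,
820
821
<ESC>{
"\\n" return '\n';
"\\r" return '\r';
"\\f" return '\f';
"\\0" return '\0';
}
822 is equivalent to:
823
824
<ESC>"\\n" return '\n';
<ESC>"\\r" return '\r';
<ESC>"\\f" return '\f';
<ESC>"\\0" return '\0';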
825 Start condition scopes may be nested.
826
827
828 Three routines are available for manipulating stacks of
829 start conditions:
830
831
832 __void yy_push_state(int new_state)__
833
834
835 pushes the current start condition onto the top of the start
836 condition stack and switches to ''new_state'' as though
837 you had used __BEGIN new_state__ (recall that start
838 condition names are also integers).
839
840
841 __void yy_pop_state()__
842
843
844 pops the top of the stack and switches to it via
845 __BEGIN.__
846
847
848 __int yy_top_state()__
849
850
851 returns the top of the stack without altering the stack's
852 contents.
853
854
855 The start condition stack grows dynamically and so has no
856 built-in size limitation. If memory is exhausted, program
857 execution aborts.
858
859
860 To use start condition stacks, your scanner must include a
861 __%option stack__ directive (see Options
862 below).
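
The behaviour of the three stack routines can be modelled in a few lines of C. This is a minimal sketch, not flex's implementation: all names are hypothetical, the growth policy is arbitrary, and aborting on underflow and returning 0 from `top_state()` on an empty stack are assumptions made for the illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical model of the scanner's state; not flex's own code. */
static int current_state = 0;      /* 0 plays the role of INITIAL */
static int *state_stack = NULL;
static size_t stack_depth = 0, stack_cap = 0;

/* Like yy_push_state(): save the current state, then switch. */
void push_state(int new_state)
{
    if (stack_depth == stack_cap) {
        size_t cap = stack_cap ? stack_cap * 2 : 8;
        int *grown = realloc(state_stack, cap * sizeof *grown);

        if (grown == NULL) {
            fputs("out of memory expanding start-condition stack\n", stderr);
            abort();
        }
        state_stack = grown;
        stack_cap = cap;
    }
    state_stack[stack_depth++] = current_state;
    current_state = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved state. */
void pop_state(void)
{
    if (stack_depth == 0) {
        fputs("start-condition stack underflow\n", stderr);
        abort();
    }
    current_state = state_stack[--stack_depth];
}

/* Like yy_top_state(): peek at the saved state without popping it. */
int top_state(void)
{
    return stack_depth > 0 ? state_stack[stack_depth - 1] : 0;
}

/* The state the model is currently in. */
int get_state(void)
{
    return current_state;
}
```

Because start condition names are integers, a state saved this way can later be restored without knowing which condition it names.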
863 !!MULTIPLE INPUT BUFFERS
864
865
866 Some scanners (such as those which support "include" files) require reading from several input streams. As
867 ''flex'' scanners do a large amount of
868 buffering, one cannot control where the next input will be
869 read from by simply writing a __YY_INPUT__ which is
870 sensitive to the scanning context. __YY_INPUT__ is only
871 called when the scanner reaches the end of its buffer, which
872 may be a long time after scanning a statement such as an
873 "include" which requires switching the input source.
874
875
876 To negotiate these sorts of problems, ''flex'' provides a
877 mechanism for creating and switching between multiple input
878 buffers. An input buffer is created by using:
879
880
881 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
882 which takes a ''FILE'' pointer and a size and creates a buffer associated with the given file and large enough to hold ''size'' characters (when in doubt, use __YY_BUF_SIZE__ for the size). It returns a __YY_BUFFER_STATE__ handle, which may then be passed to other routines (see below). The __YY_BUFFER_STATE__ type is a pointer to an opaque __struct yy_buffer_state__ structure, so you may safely initialize YY_BUFFER_STATE variables to __((YY_BUFFER_STATE) 0)__ if you wish, and also refer to the opaque structure in order to correctly declare input buffers in source files other than that of your scanner. Note that the ''FILE'' pointer in the call to __yy_create_buffer__ is only used as the value of ''yyin'' seen by __YY_INPUT;__ if you redefine __YY_INPUT__ so it no longer uses ''yyin,'' then you can safely pass a nil ''FILE'' pointer to __yy_create_buffer.__ You select a particular buffer to scan from using:
883
884
885 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
886 switches the scanner's input buffer so subsequent tokens will come from ''new_buffer.'' Note that __yy_switch_to_buffer()__ may be used by yywrap() to set things up for continued scanning, instead of opening a new file and pointing ''yyin'' at it. Note also that switching input sources via either __yy_switch_to_buffer()__ or __yywrap()__ does ''not'' change the start condition.
887
888
889 void yy_delete_buffer( YY_BUFFER_STATE buffer )
890 is used to reclaim the storage associated with a buffer. ( __buffer__ can be nil, in which case the routine does nothing.) You can also clear the current contents of a buffer using:
891
892
893 void yy_flush_buffer( YY_BUFFER_STATE buffer )
894 This function discards the buffer's contents, so the next time the scanner attempts to match a token from the buffer, it will first fill the buffer anew using __YY_INPUT.__
895
896
897 __yy_new_buffer()__ is an alias for
898 __yy_create_buffer(),__ provided for compatibility with
899 the C++ use of ''new'' and ''delete'' for creating and
900 destroying dynamic objects.
901
902
903 Finally, the __YY_CURRENT_BUFFER__ macro returns a
904 __YY_BUFFER_STATE__ handle to the current
905 buffer.
906
907
908 Here is an example of using these features for writing a
909 scanner which expands include files (the
910 __<<EOF>>__ feature is discussed
911 below):
912
913
914 /* the
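The include-file technique described above needs somewhere to remember the buffers that are suspended while an included file is scanned. The following is a minimal sketch of such a stack, not code taken from flex. The names include_stack, include_push, include_pop and the limit MAX_INCLUDE_DEPTH are assumptions made for illustration, and void * stands in for YY_BUFFER_STATE so the sketch compiles without a generated scanner.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INCLUDE_DEPTH 10   /* hypothetical nesting limit */

/* Stack of suspended input buffers.  In a real scanner each slot would
 * hold a YY_BUFFER_STATE; void * stands in for it here. */
typedef struct {
    void *saved[MAX_INCLUDE_DEPTH];
    int depth;
} include_stack;

/* Remember the buffer that is about to be suspended.
 * Returns 0 on success, -1 if includes are nested too deeply. */
int include_push(include_stack *s, void *current)
{
    assert(s != NULL);
    if (s->depth >= MAX_INCLUDE_DEPTH)
        return -1;
    s->saved[s->depth++] = current;
    return 0;
}

/* Return the most recently suspended buffer, or NULL when no
 * buffer is suspended (the outermost input has ended). */
void *include_pop(include_stack *s)
{
    assert(s != NULL);
    if (s->depth == 0)
        return NULL;
    return s->saved[--s->depth];
}
```

In a scanner, the action that has just read an include file name would pass __YY_CURRENT_BUFFER__ to include_push() and then call __yy_switch_to_buffer()__ on the result of __yy_create_buffer()__ for the newly opened file. The end-of-file action would call __yy_delete_buffer()__ on the current buffer and switch back to whatever include_pop() returns, or finish scanning when it returns NULL.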
915 Three routines are available for setting up input buffers for scanning in-memory strings instead of files. All of them create a new input buffer for scanning the string, and return a corresponding __YY_BUFFER_STATE__ handle (which you should delete with __yy_delete_buffer()__ when done with it). They also switch to the new buffer using __yy_switch_to_buffer(),__ so the next call to __yylex()__ will start scanning the string.
916
917
918 __yy_scan_string(const char *str)__
919
920
921 scans a NUL-terminated string.
922
923
924 __yy_scan_bytes(const char *bytes, int
925 len)__
926
927
928 scans ''len'' bytes (including possibly NUL's) starting
929 at location ''bytes.''
930
931
932 Note that both of these functions create and scan a
933 ''copy'' of the string or bytes. (This may be desirable,
934 since __yylex()__ modifies the contents of the buffer it
935 is scanning.) You can avoid the copy by using:
936
937
938 __yy_scan_buffer(char *base, yy_size_t
939 size)__
940
941
942 which scans in place the buffer starting at ''base,''
943 consisting of ''size'' bytes, the last two bytes of which
944 ''must'' be __YY_END_OF_BUFFER_CHAR__ (ASCII NUL).
945 These last two bytes are not scanned; thus, scanning
946 consists of __base[[0]__ through __base[[size-2],__
947 inclusive.
948
949
950 If you fail to set up ''base'' in this manner (i.e.,
951 forget the final two __YY_END_OF_BUFFER_CHAR__ bytes),
952 then __yy_scan_buffer()__ returns a nil pointer instead
953 of creating a new input buffer.
954
955
956 The type __yy_size_t__ is an integral type to which you
957 can cast an integer expression reflecting the size of the
958 buffer.
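Because __yy_scan_buffer()__ refuses a buffer that does not end in two NUL bytes, the caller has to lay the memory out before the call. The helper below is a sketch under that assumption. The name make_scan_buffer is hypothetical and is not part of flex.

```c
#include <stdlib.h>
#include <string.h>

/* Copy len bytes of src into a freshly allocated buffer that ends with
 * the two NUL bytes yy_scan_buffer() requires.  On success *size receives
 * the total size to hand to yy_scan_buffer(), which is len + 2.
 * Returns NULL if the allocation fails.  The caller frees the buffer
 * once the scanner no longer uses it. */
char *make_scan_buffer(const char *src, size_t len, size_t *size)
{
    char *base = malloc(len + 2);
    if (base == NULL)
        return NULL;
    memcpy(base, src, len);
    base[len] = '\0';       /* first YY_END_OF_BUFFER_CHAR */
    base[len + 1] = '\0';   /* second YY_END_OF_BUFFER_CHAR */
    *size = len + 2;
    return base;
}
```

The result would then be passed on as yy_scan_buffer( base, (yy_size_t) size ). Since the scanner works on this memory in place, it should stay allocated until the corresponding buffer has been released with __yy_delete_buffer().__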
959 !!END-OF-FILE RULES
960
961
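962 The special rule __<<EOF>>__ indicates actions which are to be taken when an end-of-file is encountered and __yywrap()__ returns non-zero (i.e., indicates no further files to process). The action must finish by doing one of four things: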
963
964
965 -
966
967
968 assigning ''yyin'' to a new input file (in previous
969 versions of flex, after doing the assignment you had to call
970 the special action __YY_NEW_FILE;__ this is no longer
971 necessary);
972
973
974 -
975
976
977 executing a ''return'' statement;
978
979
980 -
981
982
983 executing the special __yyterminate()__
984 action;
985
986
987 -
988
989
990 or, switching to a new buffer using
991 __yy_switch_to_buffer()__ as shown in the example
992 above.
993
994
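995 __<<EOF>>__ rules may not be used with other patterns; they may only be qualified with a list of start conditions. If an unqualified __<<EOF>>__ rule is given, it applies to ''all'' start conditions which do
996 not already have
997 __<<EOF>>__ actions.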
998
999
1000 These rules are useful for catching things like unclosed comments. An example:
1001
1002
1003 %x quote
1004 %%
1005 ...other rules for dealing with quotes...
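One of the ways listed above for an end-of-file action to finish is to point ''yyin'' at a new input file. A sketch of a helper that picks that file from a NULL-terminated list of names follows. The name open_next_input and the decision to skip unreadable files are assumptions for illustration only.

```c
#include <stdio.h>

/* Open the next readable file named in a NULL-terminated list.
 * *next points at the first name not yet tried and is advanced past
 * every name that is consumed.  Names that cannot be opened are
 * skipped.  Returns the open stream, or NULL when the list is
 * exhausted. */
FILE *open_next_input(const char ***next)
{
    while (**next != NULL) {
        FILE *f = fopen(**next, "r");
        ++*next;
        if (f != NULL)
            return f;
    }
    return NULL;
}
```

An end-of-file action could assign a non-NULL result to ''yyin'' and otherwise call __yyterminate().__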
1006 !!MISCELLANEOUS MACROS
1007
1008
1009 The macro __YY_USER_ACTION__ can be defined to provide an
1010 action which is always executed prior to the matched rule's
1011 action. For example, it could be #define'd to call a routine
1012 to convert yytext to lower-case. When __YY_USER_ACTION__
1013 is invoked, the variable ''yy_act'' gives the number of
1014 the matched rule (rules are numbered starting with 1).
1015 Suppose you want to profile how often each of your rules is
1016 matched. The following would do the trick:
1017
1018
1019 #define YY_USER_ACTION ++ctr[[yy_act]
1020 where ''ctr'' is an array to hold the counts for the different rules. Note that the macro __YY_NUM_RULES__ gives the total number of rules (including the default rule, even if you use __-s),__ so a correct declaration for ''ctr'' is:
1021
1022
1023 int ctr[[YY_NUM_RULES];
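Counting matches is only half of a profile; the counts also have to be printed once scanning is done. A small sketch of such a report follows. The name report_rule_counts is hypothetical, and the array length is taken from the caller rather than assumed.

```c
#include <stdio.h>

/* Print one line for every rule that matched at least once.
 * ctr is the array incremented by YY_USER_ACTION and n is its number
 * of elements.  Returns the total number of matches counted. */
long report_rule_counts(FILE *out, const int *ctr, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        if (ctr[i] == 0)
            continue;            /* rule never matched: stay quiet */
        fprintf(out, "rule %d: %d matches\n", i, ctr[i]);
        total += ctr[i];
    }
    return total;
}
```

It could be called after __yylex()__ returns, for example as report_rule_counts( stderr, ctr, sizeof ctr / sizeof ctr[0] ).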
1024 The macro __YY_USER_INIT__ may be defined to provide an action which is always executed before the first scan (and before the scanner's internal initializations are done). For example, it could be used to call a routine to read in a data table or open a logging file.
1025
1026
1027 The macro __yy_set_interactive(is_interactive)__ can be
1028 used to control whether the current buffer is considered
1029 ''interactive.'' An interactive buffer is processed more
1030 slowly, but must be used when the scanner's input source is
1031 indeed interactive to avoid problems due to waiting to fill
1032 buffers (see the discussion of the __-I__ flag below). A
1033 non-zero value in the macro invocation marks the buffer as
1034 interactive, a zero value as non-interactive. Note that use
1035 of this macro overrides __%option always-interactive__ or
1036 __%option never-interactive__ (see Options below).
1037 __yy_set_interactive()__ must be invoked prior to
1038 beginning to scan the buffer that is (or is not) to be
1039 considered interactive.
1040
1041
1042 The macro __yy_set_bol(at_bol)__ can be used to control
1043 whether the current buffer's scanning context for the next
1044 token match is done as though at the beginning of a line. A
1045 non-zero macro argument makes rules anchored with '^'
1046 active, while a zero argument makes '^' rules
1047 inactive.
1048
1049
1050 The macro __YY_AT_BOL()__ returns true if the next token
1051 scanned from the current buffer will have '^' rules active,
1052 false otherwise.
1053
1054
1055 In the generated scanner, the actions are all gathered in
1056 one large switch statement and separated using
1057 __YY_BREAK,__ which may be redefined. By default, it is
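1058 simply a ''break,'' to separate each rule's action from the following rule's. Redefining
1059 __YY_BREAK__
1060 allows, for example, C++ users to #define YY_BREAK to do
1061 nothing (while being very careful that every rule ends with
1062 a ''break'' or a ''return''!) to avoid suffering from unreachable-statement warnings where, because a rule's action ends with ''return,'' the
1063 __YY_BREAK__ is inaccessible.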
1064 !!VALUES AVAILABLE TO THE USER
1065
1066
1067 This section summarizes the various values available to the
1068 user in the rule actions.
1069
1070
1071 -
1072
1073
1074 __char *yytext__ holds the text of the current token. It
1075 may be modified but not lengthened (you cannot append
1076 characters to the end).
1077
1078
1079 If the special directive __%array__ appears in the first
1080 section of the scanner description, then __yytext__ is
1081 instead declared __char yytext[[YYLMAX],__ where
1082 __YYLMAX__ is a macro definition that you can redefine in
1083 the first section if you don't like the default value
1084 (generally 8KB). Using __%array__ results in somewhat
1085 slower scanners, but the value of __yytext__ becomes
1086 immune to calls to ''input()'' and ''unput(),'' which
1087 potentially destroy its value when __yytext__ is a
1088 character pointer. The opposite of __%array__ is
1089 __%pointer,__ which is the default.
1090
1091
1092 You cannot use __%array__ when generating C++ scanner
1093 classes (the __-+__ flag).
1094
1095
1096 -
1097
1098
1099 __int yyleng__ holds the length of the current
1100 token.
1101
1102
1103 -
1104
1105
1106 __FILE *yyin__ is the file which by default ''flex''
1107 reads from. It may be redefined but doing so only makes
1108 sense before scanning begins or after an EOF has been
1109 encountered. Changing it in the midst of scanning will have
1110 unexpected results since ''flex'' buffers its input; use
1111 __yyrestart()__ instead. Once scanning terminates because
1112 an end-of-file has been seen, you can point ''yyin'' at
1113 the new input file and then call the scanner again to
1114 continue scanning.
1115
1116
1117 -
1118
1119
1120 __void yyrestart( FILE *new_file )__ may be called to
1121 point ''yyin'' at the new input file. The switch-over to
1122 the new file is immediate (any previously buffered-up input
1123 is lost). Note that calling __yyrestart()__ with
1124 ''yyin'' as an argument thus throws away the current
1125 input buffer and continues scanning the same input
1126 file.
1127
1128
1129 -
1130
1131
1132 __FILE *yyout__ is the file to which __ECHO__ actions
1133 are done. It can be reassigned by the user.
1134
1135
1136 -
1137
1138
1139 __YY_CURRENT_BUFFER__ returns a __YY_BUFFER_STATE__
1140 handle to the current buffer.
1141
1142
1143 -
1144
1145
1146 __YY_START__ returns an integer value corresponding to
1147 the current start condition. You can subsequently use this
1148 value with __BEGIN__ to return to that start
1149 condition.
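Since __yytext__ cannot be lengthened, and in the default __%pointer__ mode can be clobbered by ''input()'' and ''unput(),'' an action that needs to keep or extend a token usually copies it first. A minimal sketch follows; the helper name copy_token is an assumption, not a flex facility.

```c
#include <stdlib.h>
#include <string.h>

/* Return a NUL-terminated heap copy of the first len bytes of text,
 * or NULL if len is negative or the allocation fails.
 * The caller owns the result and releases it with free(). */
char *copy_token(const char *text, int len)
{
    if (len < 0)
        return NULL;
    char *copy = malloc((size_t)len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

Inside an action this would be called as copy_token( yytext, yyleng ).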
1150 !!INTERFACING WITH YACC
1151
1152
1153 One of the main uses of ''flex'' is as a companion to the
1154 ''yacc'' parser-generator. ''yacc'' parsers expect to
1155 call a routine named __yylex()__ to find the next input
1156 token. The routine is supposed to return the type of the
1157 next token as well as putting any associated value in the
1158 global __yylval.__ To use ''flex'' with ''yacc,''
1159 one specifies the __-d__ option to ''yacc'' to
1160 instruct it to generate the file __y.tab.h__ containing
1161 definitions of all the __%tokens__ appearing in the
1162 ''yacc'' input. This file is then included in the
1163 ''flex'' scanner. For example, the first section of the
1164 scanner can pull in those definitions as follows:
1165
1166
1167 %{
1168 #include "y.tab.h"
%}
1169 !!OPTIONS
1170
1171
1172 ''flex'' has the following options:
1173
1174
1175 __-b__
1176
1177
1178 Generate backing-up information to ''lex.backup.'' This
1179 is a list of scanner states which require backing up and the
1180 input characters on which they do so. By adding rules one
1181 can remove backing-up states. If ''all'' backing-up
1182 states are eliminated and __-Cf__ or __-CF__ is used,
1183 the generated scanner will run faster (see the __-p__
1184 flag). Only users who wish to squeeze every last cycle out
1185 of their scanners need worry about this option. (See the
1186 section on Performance Considerations below.)
1187
1188
1189 __-c__
1190
1191
1192 is a do-nothing, deprecated option included for POSIX
1193 compliance.
1194
1195
1196 __-d__
1197
1198
1199 makes the generated scanner run in ''debug'' mode.
1200 Whenever a pattern is recognized and the global
1201 __yy_flex_debug__ is non-zero (which is the default), the
1202 scanner will write to ''stderr'' a line of the
1203 form:
1204
1205
1206 --accepting rule at line 53 ("the matched text")
1207 The line number refers to the location of the rule in the file defining the scanner (i.e., the file that was fed to flex). Messages are also generated when the scanner backs up, accepts the default rule, reaches the end of its input buffer (or encounters a NUL; at this point, the two look the same as far as the scanner's concerned), or reaches an end-of-file.
1208
1209
1210 __-f__
1211
1212
1213 specifies ''fast scanner.'' No table compression is done
1214 and stdio is bypassed. The result is large but fast. This
1215 option is equivalent to __-Cfr__ (see
1216 below).
1217
1218
1219 __-h__
1220
1221
1222 generates a ''help'' summary of ''flex's''
1223 options to ''stdout'' and then exits. __-?__ and
1224 __--help__ are synonyms for __-h.__
1225
1226
1227 __-i__
1228
1229
1230 instructs ''flex'' to generate a ''case-insensitive''
1231 scanner. The case of letters given in the ''flex'' input
1232 patterns will be ignored, and tokens in the input will be
1233 matched regardless of case. The matched text given in
1234 ''yytext'' will have the preserved case (i.e., it will
1235 not be folded).
1236
1237
1238 __-l__
1239
1240
1241 turns on maximum compatibility with the original AT&T
1242 ''lex'' implementation. Note that this does not mean
1243 ''full'' compatibility. Use of this option costs a
1244 considerable amount of performance, and it cannot be used
1245 with the __-+, -f, -F, -Cf,__ or __-CF__ options. For
1246 details on the compatibilities it provides, see the section
1247 Incompatibilities With Lex And POSIX below. This option also results in the name __YY_FLEX_LEX_COMPAT__
1248 being #define'd in the generated scanner.
1249
1250
1251 __-n__
1252
1253
1254 is another do-nothing, deprecated option included only for
1255 POSIX compliance.
1256
1257
1258 __-p__
1259
1260
1261 generates a performance report to stderr. The report
1262 consists of comments regarding features of the ''flex''
1263 input file which will cause a serious loss of performance in
1264 the resulting scanner. If you give the flag twice, you will
1265 also get comments regarding features that lead to minor
1266 performance losses.
1267
1268
1269 Note that the use of __REJECT, %option yylineno,__ and
1270 variable trailing context (see the Deficiencies / Bugs
1271 section below) entails a substantial performance penalty;
1272 use of ''yymore(),'' the __^__ operator, and the
1273 __-I__ flag entail minor performance
1274 penalties.
1275
1276
1277 __-s__
1278
1279
1280 causes the ''default rule'' (that unmatched scanner input
1281 is echoed to ''stdout)'' to be suppressed. If the scanner
1282 encounters input that does not match any of its rules, it
1283 aborts with an error. This option is useful for finding
1284 holes in a scanner's rule set.
1285
1286
1287 __-t__
1288
1289
1290 instructs ''flex'' to write the scanner it generates to
1291 standard output instead of __lex.yy.c.__
1292
1293
1294 __-v__
1295
1296
1297 specifies that ''flex'' should write to ''stderr'' a
1298 summary of statistics regarding the scanner it generates.
1299 Most of the statistics are meaningless to the casual
1300 ''flex'' user, but the first line identifies the version
1301 of ''flex'' (same as reported by __-V),__ and the next
1302 line the flags used when generating the scanner, including
1303 those that are on by default.
1304
1305
1306 __-w__
1307
1308
1309 suppresses warning messages.
1310
1311
1312 __-B__
1313
1314
1315 instructs ''flex'' to generate a ''batch'' scanner,
1316 the opposite of ''interactive'' scanners generated by
1317 __-I__ (see below). In general, you use __-B__ when
1318 you are ''certain'' that your scanner will never be used
1319 interactively, and you want to squeeze a ''little'' more
1320 performance out of it. If your goal is instead to squeeze
1321 out a ''lot'' more performance, you should be using the
1322 __-Cf__ or __-CF__ options (discussed below), which
1323 turn on __-B__ automatically anyway.
1324
1325
1326 __-F__
1327
1328
1329 specifies that the ''fast'' scanner table representation
1330 should be used (and stdio bypassed). This representation is
1331 about as fast as the full table representation __(-f),__
1332 and for some sets of patterns will be considerably smaller
1333 (and for others, larger). In general, if the pattern set
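1334 contains both ''keywords'' and a catch-all ''identifier'' rule,
1338 then you're better off using the full table representation. If only the ''identifier'' rule is present and you then use a hash table or some such to detect the keywords, you're better off using __-F.__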
1339
1340
1341 This option is equivalent to __-CFr__ (see below). It
1342 cannot be used with __-+.__
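One way to keep a pattern set down to a single identifier rule is to match every word with that rule and tell keywords apart afterwards by table lookup. The sketch below uses a sorted table and bsearch() rather than a hash table. The token values and the name classify_identifier are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real scanner would take them from y.tab.h. */
enum { TOK_ID = 256, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, because bsearch() relies on the order. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH },
};

/* bsearch() comparison: key is the identifier text, elem a table entry. */
static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct keyword *)elem)->name);
}

/* Return the keyword token for text, or TOK_ID if it is not a keyword. */
int classify_identifier(const char *text)
{
    const struct keyword *k = bsearch(text, keywords,
                                      sizeof keywords / sizeof keywords[0],
                                      sizeof keywords[0], compare_keyword);
    return k != NULL ? k->token : TOK_ID;
}
```

The action of the identifier rule would then simply return classify_identifier( yytext ).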
1343
1344
1345 __-I__
1346
1347
1348 instructs ''flex'' to generate an ''interactive''
1349 scanner. An interactive scanner is one that only looks ahead
1350 to decide what token has been matched if it absolutely must.
1351 It turns out that always looking one extra character ahead,
1352 even if the scanner has already seen enough text to
1353 disambiguate the current token, is a bit faster than only
1354 looking ahead when necessary. But scanners that always look
1355 ahead give dreadful interactive performance; for example,
1356 when a user types a newline, it is not recognized as a
1357 newline token until they enter ''another'' token, which
1358 often means typing in another whole line.
1359
1360
1361 ''Flex'' scanners default to ''interactive'' unless
1362 you use the __-Cf__ or __-CF__ table-compression
1363 options (see below). That's because if you're looking for
1364 high-performance you should be using one of these options,
1365 so if you didn't, ''flex'' assumes you'd rather trade off
1366 a bit of run-time performance for intuitive interactive
1367 behavior. Note also that you ''cannot'' use __-I__ in
1368 conjunction with __-Cf__ or __-CF.__ Thus, this option
1369 is not really needed; it is on by default for all those
1370 cases in which it is allowed.
1371
1372
1373 You can force a scanner to ''not'' be interactive by
1374 using __-B__ (see above).
1375
1376
1377 __-L__
1378
1379
1380 instructs ''flex'' not to generate __#line__
1381 directives. Without this option, ''flex'' peppers the
1382 generated scanner with #line directives so error messages in
1383 the actions will be correctly located with respect to either
1384 the original ''flex'' input file (if the errors are due
1385 to code in the input file), or __lex.yy.c__ (if the
1386 errors are ''flex's'' fault -- you should report these
1387 sorts of errors to the email address given
1388 below).
1389
1390
1391 __-T__
1392
1393
1394 makes ''flex'' run in ''trace'' mode. It will generate
1395 a lot of messages to ''stderr'' concerning the form of
1396 the input and the resultant non-deterministic and
1397 deterministic finite automata. This option is mostly for use
1398 in maintaining ''flex.''
1399
1400
1401 __-V__
1402
1403
1404 prints the version number to ''stdout'' and exits.
1405 __--version__ is a synonym for __-V.__
1406
1407
1408 __-7__
1409
1410
1411 instructs ''flex'' to generate a 7-bit scanner, i.e., one
1412 which can only recognize 7-bit characters in its input. The
1413 advantage of using __-7__ is that the scanner's tables
1414 can be up to half the size of those generated using the
1415 __-8__ option (see below). The disadvantage is that such
1416 scanners often hang or crash if their input contains an
1417 8-bit character.
1418
1419
1420 Note, however, that unless you generate your scanner using
1421 the __-Cf__ or __-CF__ table compression options, use
1422 of __-7__ will save only a small amount of table space,
1423 and make your scanner considerably less portable.
1424 ''Flex's'' default behavior is to generate an 8-bit
1425 scanner unless you use the __-Cf__ or __-CF,__ in
1426 which case ''flex'' defaults to generating 7-bit scanners
1427 unless your site was always configured to generate 8-bit
1428 scanners (as will often be the case with non-USA sites). You
1429 can tell whether flex generated a 7-bit or an 8-bit scanner
1430 by inspecting the flag summary in the __-v__ output as
1431 described above.
1432
1433
1434 Note that if you use __-Cfe__ or __-CFe__ (those table
1435 compression options, but also using equivalence classes as
1436 discussed below), flex still defaults to generating an
1437 8-bit scanner, since usually with these compression options
1438 full 8-bit tables are not much more expensive than 7-bit
1439 tables.
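Since a 7-bit scanner may misbehave on 8-bit input, a program that cannot rule such input out might check the data before handing it to the scanner. A sketch of such a check follows; the name has_8bit_char is an assumption, not something flex provides.

```c
#include <stddef.h>

/* Return 1 if any of the len bytes at buf has its high bit set,
 * i.e. is not a 7-bit character; return 0 otherwise. */
int has_8bit_char(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] & 0x80)
            return 1;
    return 0;
}
```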
1440
1441
1442 __-8__
1443
1444
1445 instructs ''flex'' to generate an 8-bit scanner, i.e.,
1446 one which can recognize 8-bit characters. This flag is only
1447 needed for scanners generated using __-Cf__ or
1448 __-CF,__ as otherwise flex defaults to generating an
1449 8-bit scanner anyway.
1450
1451
1452 See the discussion of __-7__ above for flex's default
1453 behavior and the tradeoffs between 7-bit and 8-bit
1454 scanners.
1455
1456
1457 __-+__
1458
1459
1460 specifies that you want flex to generate a C++ scanner
1461 class. See the section on Generating C++ Scanners below for
1462 details.
1463
1464
1465 __-C[[aefFmr]__
1466
1467
1468 controls the degree of table compression and, more
1469 generally, trade-offs between small scanners and fast
1470 scanners.
1471
1472
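1473 __-Ca__ (''align'') instructs ''flex'' to trade off larger tables in the
1474 generated scanner for faster performance because the elements of the tables are better aligned for memory access and computation.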
1475
1476
1477 __-Ce__ directs ''flex'' to construct ''equivalence
1478 classes,'' i.e., sets of characters which have identical
1479 lexical properties (for example, if the only appearance of
1480 digits in the ''flex'' input is in the character class
1481 ''[[0-9]'' then the digits '0', '1', ..., '9' will all be put in the same equivalence class).
1482
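The idea behind equivalence classes can be shown in a few lines: two characters belong to the same class when no character set in the input distinguishes them. The code below is only an illustration of that idea and makes no claim about how ''flex'' builds its classes internally. The name build_equivalence_classes, the flat membership array and the limit of 32 sets are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NCHARS 256

/* Partition the 256 byte values into equivalence classes.
 * member[i * NCHARS + c] is non-zero if byte c belongs to set i.
 * Two bytes share a class when they belong to exactly the same sets.
 * On return class_of[c] is the class number (from 0) of byte c.
 * Returns the number of classes.  At most 32 sets are supported,
 * because memberships are packed into one unsigned long. */
int build_equivalence_classes(const unsigned char *member, int nsets,
                              int class_of[NCHARS])
{
    unsigned long signature[NCHARS];
    int nclasses = 0;

    assert(nsets >= 0 && nsets <= 32);

    /* Pack each byte's set memberships into one bitmask. */
    for (int c = 0; c < NCHARS; c++) {
        signature[c] = 0;
        for (int i = 0; i < nsets; i++)
            if (member[(size_t)i * NCHARS + c])
                signature[c] |= 1UL << i;
    }

    /* Bytes with equal signatures share a class; classes are
     * numbered in order of first appearance. */
    for (int c = 0; c < NCHARS; c++) {
        int found = -1;
        for (int d = 0; d < c; d++) {
            if (signature[d] == signature[c]) {
                found = class_of[d];
                break;
            }
        }
        class_of[c] = (found >= 0) ? found : nclasses++;
    }
    return nclasses;
}
```

With a single set containing the ten digits, the function reports two classes: one for the digits and one for every other byte. A transition table indexed by class then needs 2 columns instead of 256.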
1483
1484 __-Cf__ specifies that the ''full'' scanner tables
1485 should be generated - ''flex'' should not compress the
1486 tables by taking advantage of similar transition functions
1487 for different states.
1488
1489
1490 __-CF__ specifies that the alternate fast scanner
1491 representation (described above under the __-F__ flag)
1492 should be used. This option cannot be used with
1493 __-+.__
1494
1495
1496 __-Cm__ directs ''flex'' to construct
1497 ''meta-equivalence classes,'' which are sets of
1498 equivalence classes (or characters, if equivalence classes
1499 are not being used) that are commonly used together.
1500 Meta-equivalence classes are often a big win when using
1501 compressed tables, but they have a moderate performance
1502 impact (one or two
1503 ''if'' tests and one array look-up per character scanned).
1504
1505
1506 __-Cr__ causes the generated scanner to ''bypass'' use
1507 of the standard I/O library (stdio) for input. Instead of
1508 calling __fread()__ or __getc(),__ the scanner will
1509 use the __read()__ system call, resulting in a
1510 performance gain which varies from system to system, but in
1511 general is probably negligible unless you are also using
1512 __-Cf__ or __-CF.__ Using __-Cr__ can cause strange
1513 behavior if, for example, you read from ''yyin'' using
1514 stdio prior to calling the scanner (because the scanner will
1515 miss whatever text your previous reads left in the stdio
1516 input buffer).
1517
1518
1519 __-Cr__ has no effect if you define __YY_INPUT__ (see
1520 The Generated Scanner above).
1521
1522
1523 A lone __-C__ specifies that the scanner tables should be
1524 compressed but neither equivalence classes nor
1525 meta-equivalence classes should be used.
1526
1527
1528 The options __-Cf__ or __-CF__ and __-Cm__ do not
1529 make sense together - there is no opportunity for
1530 meta-equivalence classes if the table is not being
1531 compressed. Otherwise the options may be freely mixed, and
1532 are cumulative.
1533
1534
1535 The default setting is __-Cem,__ which specifies that
1536 ''flex'' should generate equivalence classes and
1537 meta-equivalence classes. This setting provides the highest
1538 degree of table compression. You can trade larger
1539 tables for faster-executing scanners, with
1540 the following generally being true:
1541
1542
1543 slowest & smallest   -Cem   -Cm   -Ce   -C   -C{f,F}e   -C{f,F}   -C{f,F}a   fastest & largest
1544 Note that scanners with the smallest tables are usually generated and compiled the quickest, so during development you will usually want to use the default, maximal compression.
1545
1546
1547 __-Cfe__ is often a good compromise between speed and
1548 size for production scanners.
1549
1550
1551 __-ooutput__
1552
1553
1554 directs flex to write the scanner to the file __output__
1555 instead of __lex.yy.c.__ If you combine __-o__ with
1556 the __-t__ option, then the scanner is written to
1557 ''stdout'' but its __#line__ directives (see the
1558 __-L__ option above) refer to the file
1559 __output.__
1560
1561
1562 __-Pprefix__
1563
1564
1565 changes the default ''yy'' prefix used by ''flex'' for
1566 all globally-visible variable and function names to instead
1567 be ''prefix.'' For example, __-Pfoo__ changes the name
1568 of __yytext__ to __footext.__ It also changes the name
1569 of the default output file from __lex.yy.c__ to
1570 __lex.foo.c.__ Here are all of the names
1571 affected:
1572
1573
1574 yy_create_buffer
1575 yy_delete_buffer
1576 yy_flex_debug
1577 yy_init_buffer
1578 yy_flush_buffer
1579 yy_load_buffer_state
1580 yy_switch_to_buffer
1581 yyin
1582 yyleng
1583 yylex
1584 yylineno
1585 yyout
1586 yyrestart
1587 yytext
1588 yywrap
1589 (If you are using a C++ scanner, then only __yywrap__ and __yyFlexLexer__ are affected.) Within your scanner itself, you can still refer to the global variables and functions using either version of their name; but externally, they have the modified name.
1590
1591
1592 This option lets you easily link together multiple
1593 ''flex'' programs into the same executable. Note, though,
1594 that using this option also renames __yywrap(),__ so you
1595 now ''must'' either provide your own
1596 (appropriately-named) version of the routine for your
1597 scanner, or use __%option noyywrap,__ as linking with
1598 __-lfl__ no longer provides one for you by
1599 default.
1600
1601
1602 __-Sskeleton_file__
1603
1604
1605 overrides the default skeleton file from which ''flex''
1606 constructs its scanners. You'll never need this option
1607 unless you are doing ''flex'' maintenance or
1608 development.
1609
1610
1611 ''flex'' also provides a mechanism for controlling
1612 options within the scanner specification itself, rather than
1613 from the flex command-line. This is done by including
1614 __%option__ directives in the first section of the
1615 scanner specification. You can specify multiple options with
1616 a single __%option__ directive, and multiple directives
1617 in the first section of your flex input file.
1618
1619
1620 Most options are given simply as names, optionally preceded
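1621 by the word ''no'' (with no intervening whitespace) to negate their meaning. A number are equivalent to flex flags or their negation: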
1622
1623
1624 7bit -7 option
1625 8bit -8 option
1626 align -Ca option
1627 backup -b option
1628 batch -B option
1629 c++ -+ option
1630 caseful or
1631 case-sensitive opposite of -i (default)
1632 case-insensitive or
1633 caseless -i option
1634 debug -d option
1635 default opposite of -s option
1636 ecs -Ce option
1637 fast -F option
1638 full -f option
1639 interactive -I option
1640 lex-compat -l option
1641 meta-ecs -Cm option
1642 perf-report -p option
1643 read -Cr option
1644 stdout -t option
1645 verbose -v option
1646 warn opposite of -w option
1647 (use __%option nowarn__ for __-w__)
1648 Some __%option's__ provide features otherwise not available:
1649
1650
1651 __always-interactive__
1652
1653
1654 instructs flex to generate a scanner which always considers
1655 its input ''interactive.'' Normally, on each new input file the scanner calls
1656 __isatty()__ in an attempt
1657 to determine whether the scanner's input source is
1658 interactive and thus should be read a character at a time.
1659 When this option is used, however, then no such call is
1660 made.
1661
1662
1663 __main__
1664
1665
1666 directs flex to provide a default __main()__ program for
1667 the scanner, which simply calls __yylex().__ This option
1668 implies __noyywrap__ (see below).
1669
1670
1671 __never-interactive__
1672
1673
1674 instructs flex to generate a scanner which never considers
1675 its input ''interactive'' (again, no call is made to
1676 __isatty()).__ This is the opposite of
1677 __always-interactive.__
1678
1679
1680 __stack__
1681
1682
1683 enables the use of start condition stacks (see Start
1684 Conditions above).
1685
1686
1687 __stdinit__
1688
1689
1690 if set (i.e., __%option stdinit)__ initializes
1691 ''yyin'' and ''yyout'' to ''stdin'' and
1692 ''stdout,'' instead of the default of ''nil.'' Some
1693 existing ''lex'' programs depend on this behavior, even
1694 though it is not compliant with ANSI C, which does not
1695 require ''stdin'' and ''stdout'' to be compile-time
1696 constant. In a reentrant scanner, however, this is not a
1697 problem since initialization is performed in
1698 ''yylex_init'' at runtime.
1699
1700
1701 __yylineno__
1702
1703
1704 directs ''flex'' to generate a scanner that maintains the
1705 number of the current line read from its input in the global
1706 variable __yylineno.__ This option is implied by
1707 __%option lex-compat.__
1708
1709
1710 __yywrap__
1711
1712
1713 if unset (i.e., __%option noyywrap),__ makes the scanner
1714 not call __yywrap()__ upon an end-of-file, but simply
1715 assume that there are no more files to scan (until the user
1716 points ''yyin'' at a new file and calls __yylex()__
1717 again).
1718
1719
1720 ''flex'' scans your rule actions to determine whether you
1721 use the __REJECT__ or __yymore()__ features. The
1722 __reject__ and __yymore__ options are available to
1723 override its decision as to whether you use the options,
1724 either by setting them (e.g., __%option reject)__ to
1725 indicate the feature is indeed used, or unsetting them to
1726 indicate it actually is not used (e.g., __%option
1727 noyymore).__
1728
1729
1730 Three options take string-delimited values, offset with
1731 '=':
1732
1733
1734 %option outfile="ABC"
1735 is equivalent to __-oABC,__ and
1736
1737
1738 %option prefix="XYZ"
1739 is equivalent to __-PXYZ.__ Finally,
1740
1741
1742 %option yyclass="foo"
1743 only applies when generating a C++ scanner ( __-+__ option). It informs ''flex'' that you have derived __foo__ as a subclass of __yyFlexLexer,__ so ''flex'' will place your actions in the member function __foo::yylex()__ instead of __yyFlexLexer::yylex().__ It also generates a __yyFlexLexer::yylex()__ member function that emits a run-time error (by invoking __yyFlexLexer::!LexerError())__ if called. See Generating C++ Scanners, below, for additional information.
1744
1745
1746 A number of options are available for lint purists who want
1747 to suppress the appearance of unneeded routines in the
1748 generated scanner. Each of the following, if unset (e.g.,
1749 __%option nounput__ ), results in the corresponding
1750 routine not appearing in the generated scanner:
1751
1752
1753 input, unput
1754 yy_push_state, yy_pop_state, yy_top_state
1755 yy_scan_buffer, yy_scan_bytes, yy_scan_string
1756 (though __yy_push_state()__ and friends won't appear anyway unless you use __%option stack).__
1757 !!PERFORMANCE CONSIDERATIONS
1758
1759
1760 The main design goal of ''flex'' is that it generate
1761 high-performance scanners. It has been optimized for dealing
1762 well with large sets of rules. Aside from the effects on
1763 scanner speed of the table compression __-C__ options
1764 outlined above, there are a number of options/actions which
1765 degrade performance. These are, from most expensive to
1766 least:
1767
1768
1769 REJECT
1770 %option yylineno
1771 arbitrary trailing context
1772 pattern sets that require backing up
1773 %array
1774 %option interactive
1775 %option always-interactive
1776 '^' beginning-of-line operator
1777 yymore()
1778 with the first three all being quite expensive and the last two being quite cheap. Note also that __unput()__ is implemented as a routine call that potentially does quite a bit of work, while __yyless()__ is a quite-cheap macro; so if just putting back some excess text you scanned, use __yyless().__
1779
1780
1781 __REJECT__ should be avoided at all costs when
1782 performance is important. It is a particularly expensive
1783 option.
1784
1785
1786 Getting rid of backing up is messy and often may be an
1787 enormous amount of work for a complicated scanner. In
1788 principle, one begins by using the __-b__ flag to
1789 generate a ''lex.backup'' file. For example, on the
1790 input
1791
1792
1793 %%
1794 foo return TOK_KEYWORD;
1795 foobar return TOK_KEYWORD;
1796 the file looks like:
1797
1798
1799 State #6 is non-accepting -
1800 associated rule line numbers:
1801 2 3
1802 out-transitions: [[ o ]
1803 jam-transitions: EOF [[ 001-n p-177 ]
1804 State #8 is non-accepting -
1805 associated rule line numbers:
1806 3
1807 out-transitions: [[ a ]
1808 jam-transitions: EOF [[ 001-` b-177 ]
1809 State #9 is non-accepting -
1810 associated rule line numbers:
1811 3
1812 out-transitions: [[ r ]
1813 jam-transitions: EOF [[ 001-q s-177 ]
1814 Compressed tables always back up.
1815 The first few lines tell us that there's a scanner state in which it can make a transition on an 'o' but not on any other character, and that in that state the currently scanned text does not match any rule. The state occurs when trying to match the rules found at lines 2 and 3 in the input file. If the scanner is in that state and then reads something other than an 'o', it will have to back up to find a rule which is matched. With a bit of headscratching one can see that this must be the state it's in when it has seen ''fo.'' When this has happened, if anything other than another 'o' is seen, the scanner will have to back up to simply match the 'f' (by the default rule).
1816
1817
1818 The comment regarding State #8 indicates there's a problem
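1819 when ''foob'' has been scanned. Indeed, on any character other than an 'a', the scanner will have to back up to accept ''foo.'' Similarly, the comment for State #9 concerns when ''fooba'' has been scanned and an 'r' does not follow.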
1820
1821
1822 The final comment reminds us that there's no point going to
1823 all the trouble of removing backing up from the rules unless
1824 we're using __-Cf__ or __-CF,__ since there's no
1825 performance gain doing so with compressed
1826 scanners.
1827
1828
1829 The way to remove the backing up is to add ''error'' rules:
1830
1831
1832 %%
1833 foo return TOK_KEYWORD;
1834 foobar return TOK_KEYWORD;
1835 fooba |
1836 foob |
1837 fo {
1838 /* false alarm, not really a keyword */
1839 return TOK_ID;
1840 }
1841 Eliminating backing up among a list of keywords can also be done using a
1842
1843
1844 %%
1845 foo return TOK_KEYWORD;
1846 foobar return TOK_KEYWORD;
1847 [[a-z]+ return TOK_ID;
1848 This is usually the best solution when appropriate.
1849
1850
Backing-up messages tend to cascade. With a complicated set
of rules it's not uncommon to get hundreds of messages. If
one can decipher them, though, it often takes only a dozen
or so rules to eliminate the backing up (though it's easy to
make a mistake and have an error rule accidentally match a
valid token). A possible future ''flex'' feature will be
to automatically add rules to eliminate backing
up.
1859
1860
1861 It's important to keep in mind that you gain the benefits of
1862 eliminating backing up only if you eliminate ''every''
1863 instance of backing up. Leaving just one means you gain
1864 nothing.
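For rule sets that consist only of literal keywords, the states that still need backing up can be enumerated by hand. The sketch below is a hypothetical helper (count_backup_states is not part of ''flex'') and assumes that the default single-character rule is present, so every prefix of two or more characters that is not itself a rule is a non-accepting state reached after an accepting one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first len characters of prefix are exactly one rule. */
static int is_rule(const char *const rules[], size_t nrules,
                   const char *prefix, size_t len)
{
    size_t i;

    for (i = 0; i < nrules; ++i)
        if (strlen(rules[i]) == len && strncmp(rules[i], prefix, len) == 0)
            return 1;
    return 0;
}

/*
 * Hypothetical helper: counts the distinct prefixes (length >= 2) of the
 * given literal rules that are not rules themselves.  Each one is a
 * non-accepting state from which the scanner may have to back up.
 */
size_t count_backup_states(const char *const rules[], size_t nrules)
{
    size_t count = 0;
    size_t i, j, len;

    assert(rules != NULL);

    for (i = 0; i < nrules; ++i) {
        size_t n = strlen(rules[i]);

        for (len = 2; len < n; ++len) {
            int seen = 0;

            if (is_rule(rules, nrules, rules[i], len))
                continue;            /* accepting state, no backing up */

            /* Skip prefixes already counted for an earlier rule. */
            for (j = 0; j < i; ++j) {
                if (strlen(rules[j]) > len &&
                    strncmp(rules[j], rules[i], len) == 0) {
                    seen = 1;
                    break;
                }
            }
            if (!seen)
                ++count;
        }
    }
    return count;
}
```

With the rules foo and foobar alone the helper reports three such states; adding the error rules fooba, foob and fo brings the count to zero, while leaving out any one of them leaves a state that still backs up.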
1865
1866
1867 ''Variable'' trailing context (where both the leading and
1868 trailing parts do not have a fixed length) entails almost
1869 the same performance loss as __REJECT__ (i.e.,
1870 substantial). So when possible a rule like:
1871
1872
    %%
    mouse|rat/(cat|dog)   run();

is better written:

    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();

or as

    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();

Note that here the special '|' action does ''not'' provide any savings, and can even make things worse (see Deficiencies / Bugs below).
1888
1889
1890 Another area where the user can increase a scanner's
1891 performance (and one that's easier to implement) arises from
1892 the fact that the longer the tokens matched, the faster the
1893 scanner will run. This is because with long tokens the
1894 processing of most input characters takes place in the
1895 (short) inner scanning loop, and does not often have to go
1896 through the additional work of setting up the scanning
1897 environment (e.g., __yytext)__ for the action. Recall the
1898 scanner for C comments:
1899
1900
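    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[[^*\n]*
    <comment>"*"+[[^*/\n]*
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

This could be sped up by writing it as:

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[[^*\n]*
    <comment>[[^*\n]*\n      ++line_num;
    <comment>"*"+[[^*/\n]*
    <comment>"*"+[[^*/\n]*\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

Now instead of each newline requiring the processing of another action, recognizing the newlines is "distributed" over the other rules to keep the matched text as long as possible. Note that ''adding'' rules does ''not'' slow down the scanner! The speed of the scanner is independent of the number of rules or (modulo the considerations given at the beginning of this section) how complicated the rules are with regard to operators such as '*' and '|'.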
1911
1912
1913 A final example in speeding up a scanner: suppose you want
1914 to scan through a file containing identifiers and keywords,
1915 one per line and with no other extraneous characters, and
1916 recognize all the keywords. A natural first approach
1917 is:
1918
1919
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\n     /* it's not a keyword */

To eliminate the back-tracking, introduce a catch-all rule:
1929
1930
    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [[a-z]+  |
    .|\n     /* it's not a keyword */

Now, if it's guaranteed that there's exactly one word per line, then we can reduce the total number of matches by half by merging the recognition of newlines with that of the other tokens:
1941
1942
    %%
    asm\n      |
    auto\n     |
    break\n    |
    ... etc ...
    volatile\n |
    while\n    /* it's a keyword */

    [[a-z]+\n  |
    .|\n       /* it's not a keyword */
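The halving of the number of matches can be checked with a small counting sketch. The function below is a hypothetical stand-in for the scanner (it is not generated by ''flex''): it treats a run of lowercase letters as one token, any other character as a token of its own, and optionally absorbs a newline that directly follows a word, as the merged rules do.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration: counts how many rule matches (actions) a
 * scanner would perform on input.  Assumes a character set in which
 * 'a'..'z' are contiguous, as in ASCII.  If merge_newline is nonzero, a
 * newline directly after a word becomes part of the word's token.
 */
size_t count_matches(const char *input, int merge_newline)
{
    size_t matches = 0;
    size_t i = 0;

    assert(input != NULL);

    while (input[i] != '\0') {
        if (input[i] >= 'a' && input[i] <= 'z') {
            /* One match for the whole run of letters. */
            while (input[i] >= 'a' && input[i] <= 'z')
                ++i;
            if (merge_newline && input[i] == '\n')
                ++i;
        } else {
            ++i;                     /* one character per match */
        }
        ++matches;
    }
    return matches;
}
```

On three lines of one word each, the count drops from six matches to three once the newline is merged into the word token.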
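One has to be careful here, as we have now reintroduced backing up into the scanner. In particular, while ''we'' know that there will never be any characters in the input stream other than letters or newlines, ''flex'' can't figure this out, and it will plan for possibly needing to back up when it has scanned a token like "auto" and then the next character is something other than a newline or a letter. Previously it would then just match the "auto" rule and be done, but now it has no "auto" rule, only an "auto\n" rule. To eliminate the possibility of backing up, we could either duplicate all rules but without final newlines, or, since we never expect to encounter such an input and therefore don't care how it's classified, we can introduce one more catch-all rule, this one which doesn't include a newline: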
1953
1954
    %%
    asm\n      |
    auto\n     |
    break\n    |
    ... etc ...
    volatile\n |
    while\n    /* it's a keyword */

    [[a-z]+\n  |
    [[a-z]+    |
    .|\n       /* it's not a keyword */

Compiled with __-Cf,__ this is about as fast as one can get a ''flex'' scanner to go for this particular problem.
1966
1967
1968 A final note: ''flex'' is slow when matching NUL's,
1969 particularly when a token contains multiple NUL's. It's best
1970 to write rules which match ''short'' amounts of text if
1971 it's anticipated that the text will often include
1972 NUL's.
1973
1974
Another final note regarding performance: as mentioned above
in the section How the Input is Matched, dynamically
resizing __yytext__ to accommodate huge tokens is a slow
process because it presently requires that the (huge) token
be rescanned from the beginning. Thus if performance is
vital, you should attempt to match "large" quantities of
text but not "huge" quantities, where the cutoff between the
two is at about 8K characters/token.
1982 !!GENERATING C++ SCANNERS
1983
1984
1985 ''flex'' provides two different ways to generate scanners
1986 for use with C++. The first way is to simply compile a
1987 scanner generated by ''flex'' using a C++ compiler
1988 instead of a C compiler. You should not encounter any
compilation errors (please report any you find to the email
1990 address given in the Author section below). You can then use
1991 C++ code in your rule actions instead of C code. Note that
1992 the default input source for your scanner remains
1993 ''yyin,'' and default echoing is still done to
1994 ''yyout.'' Both of these remain ''FILE *'' variables
1995 and not C++ ''streams.''
1996
1997
1998 You can also use ''flex'' to generate a C++ scanner
1999 class, using the __-+__ option (or, equivalently,
2000 __%option c++),__ which is automatically specified if the
2001 name of the flex executable ends in a '+', such as
2002 ''flex++.'' When using this option, flex defaults to
2003 generating the scanner to the file __lex.yy.cc__ instead
2004 of __lex.yy.c.__ The generated scanner includes the
header file ''!FlexLexer.h,'' which defines the interface
to two C++ classes.
2007
2008
The first class, __!FlexLexer,__ provides an abstract base
class defining the general scanner class interface. It
2011 provides the following member functions:
2012
2013
2014 __const char* YYText()__
2015
2016
2017 returns the text of the most recently matched token, the
2018 equivalent of __yytext.__
2019
2020
2021 __int YYLeng()__
2022
2023
2024 returns the length of the most recently matched token, the
2025 equivalent of __yyleng.__
2026
2027
2028 __int lineno() const__
2029
2030
2031 returns the current input line number (see __%option
2032 yylineno),__ or __1__ if __%option yylineno__ was
2033 not used.
2034
2035
2036 __void set_debug( int flag )__
2037
2038
2039 sets the debugging flag for the scanner, equivalent to
2040 assigning to __yy_flex_debug__ (see the Options section
2041 above). Note that you must build the scanner using
2042 __%option debug__ to include debugging information in
2043 it.
2044
2045
2046 __int debug() const__
2047
2048
2049 returns the current setting of the debugging
2050 flag.
2051
2052
2053 Also provided are member functions equivalent to
2054 __yy_switch_to_buffer(), yy_create_buffer()__ (though the
2055 first argument is an __istream*__ object pointer and not
2056 a __FILE*), yy_flush_buffer(), yy_delete_buffer(),__ and
2057 __yyrestart()__ (again, the first argument is a
2058 __istream*__ object pointer).
2059
2060
The second class defined in ''!FlexLexer.h'' is
__yyFlexLexer,__ which is derived from __!FlexLexer.__
It defines the following additional member
2064 functions:
2065
2066
2067 __yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout =
2068 0 )__
2069
2070
2071 constructs a __yyFlexLexer__ object using the given
2072 streams for input and output. If not specified, the streams
2073 default to __cin__ and __cout,__
2074 respectively.
2075
2076
2077 __virtual int yylex()__
2078
2079
performs the same role as __yylex()__ does for ordinary
flex scanners: it scans the input stream, consuming tokens,
until a rule's action returns a value. If you derive a
subclass __S__ from __yyFlexLexer__ and want to access
the member functions and variables of __S__ inside
__yylex(),__ then you need to use __%option
yyclass="S"__ to inform ''flex'' that you
will be using that subclass instead of __yyFlexLexer.__
In this case, rather than generating
__yyFlexLexer::yylex(),__ ''flex'' generates
__S::yylex()__ (and also generates a dummy
__yyFlexLexer::yylex()__ that calls
__yyFlexLexer::!LexerError()__ if called).

2094
__virtual void switch_streams(istream* new_in = 0,
ostream* new_out = 0)__


reassigns __yyin__ to __new_in__ (if non-nil) and
__yyout__ to __new_out__ (ditto), deleting the
previous input buffer if __yyin__ is reassigned.
2103
2104
2105 __int yylex( istream* new_in, ostream* new_out = 0
2106 )__
2107
2108
2109 first switches the input streams via __switch_streams(
2110 new_in, new_out )__ and then returns the value of
2111 __yylex().__
2112
2113
2114 In addition, __yyFlexLexer__ defines the following
2115 protected virtual functions which you can redefine in
2116 derived classes to tailor the scanner:
2117
2118
__virtual int !LexerInput( char* buf, int max_size )__


reads up to __max_size__ characters into __buf__ and
returns the number of characters read. To indicate
end-of-input, return 0 characters. Note that "interactive"
scanners (see the __-B__ and __-I__ flags) define the
macro __YY_INTERACTIVE.__ If you redefine
__!LexerInput()__ and need to take different
actions depending on whether or not the scanner might be
scanning an interactive input source, you can test for the
presence of this name via __#ifdef.__
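The read contract described above (fill the buffer with at most max_size characters, return the count, return 0 at end-of-input) can be sketched in plain C. This is only an analogue of the contract, not the C++ member function itself; the struct mem_source and the function mem_input are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source used only for this illustration. */
struct mem_source {
    const char *data;   /* bytes to hand to the scanner */
    size_t      len;    /* total number of bytes in data */
    size_t      pos;    /* number of bytes already delivered */
};

/*
 * Copies up to max_size bytes into buf and returns how many were copied.
 * A return value of 0 signals end-of-input, mirroring the contract
 * described for LexerInput().
 */
int mem_input(struct mem_source *src, char *buf, int max_size)
{
    size_t remaining, n;

    assert(src != NULL && buf != NULL && max_size >= 0);

    remaining = src->len - src->pos;
    n = (size_t)max_size < remaining ? (size_t)max_size : remaining;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return (int)n;
}
```

Reading the five bytes of "hello" three at a time yields 3, then 2, then 0, at which point a scanner following this contract would treat the input as exhausted.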
2132
2133
__virtual void !LexerOutput( const char* buf, int size )__


writes out __size__ characters from the buffer
__buf,__ which, while NUL-terminated, may also contain
"internal" NUL's if the scanner's rules can match text with
NUL's in them.
2141
2142
__virtual void !LexerError( const char* msg )__
2145
2146
2147 reports a fatal error message. The default version of this
2148 function writes the message to the stream __cerr__ and
2149 exits.
2150
2151
2152 Note that a __yyFlexLexer__ object contains its
2153 ''entire'' scanning state. Thus you can use such objects
2154 to create reentrant scanners. You can instantiate multiple
2155 instances of the same __yyFlexLexer__ class, and you can
2156 also combine multiple C++ scanner classes together in the
2157 same program using the __-P__ option discussed
2158 above.
2159
2160
2161 Finally, note that the __%array__ feature is not
2162 available to C++ scanner classes; you must use
2163 __%pointer__ (the default).
2164
2165
2166 Here is an example of a simple C++ scanner:
2167
2168
    // An example of using the flex C++ scanner class.
    %{
    int mylineno = 0;
    %}
    string

If you want to create multiple (different) lexer classes, you use the __-P__ flag (or the __prefix=__ option) to rename each __yyFlexLexer__ to some other __xxFlexLexer.__ You then can include __<!FlexLexer.h>__ in your other sources once per lexer class, first renaming __yyFlexLexer__ as follows:
2175
2176
    #undef yyFlexLexer
    #define yyFlexLexer xxFlexLexer
    #include <FlexLexer.h>

    #undef yyFlexLexer
    #define yyFlexLexer zzFlexLexer
    #include <FlexLexer.h>

if, for example, you used __%option prefix="xx"__ for one of your scanners and __%option prefix="zz"__ for the other.
2181
2182
2183 IMPORTANT: the present form of the scanning class is
2184 ''experimental'' and may change considerably between
2185 major releases.
2186 !!INCOMPATIBILITIES WITH LEX AND POSIX
2187
2188
''flex'' is a rewrite of the AT&T Unix ''lex''
2190 tool (the two implementations do not share any code,
2191 though), with some extensions and incompatibilities, both of
2192 which are of concern to those who wish to write scanners
2193 acceptable to either implementation. Flex is fully compliant
2194 with the POSIX ''lex'' specification, except that when
2195 using __%pointer__ (the default), a call to
2196 __unput()__ destroys the contents of __yytext,__ which
2197 is counter to the POSIX specification.
2198
2199
In this section we discuss all of the known areas of
incompatibility between flex, AT&T lex, and the POSIX
specification.
2202
2203
''flex's'' __-l__ option turns on maximum
compatibility with the original AT&T ''lex''
2206 implementation, at the cost of a major loss in the generated
2207 scanner's performance. We note below which incompatibilities
2208 can be overcome using the __-l__ option.
2209
2210
2211 ''flex'' is fully compatible with ''lex'' with the
2212 following exceptions:
2213
2214
2215 -
2216
2217
2218 The undocumented ''lex'' scanner internal variable
2219 __yylineno__ is not supported unless __-l__ or
2220 __%option yylineno__ is used.
2221
2222
2223 __yylineno__ should be maintained on a per-buffer basis,
2224 rather than a per-scanner (single global variable)
2225 basis.
2226
2227
2228 __yylineno__ is not part of the POSIX
2229 specification.
2230
2231
2232 -
2233
2234
2235 The __input()__ routine is not redefinable, though it may
2236 be called to read characters following whatever has been
2237 matched by a rule. If __input()__ encounters an
2238 end-of-file the normal __yywrap()__ processing is done. A
"real" end-of-file is returned by __input()__ as
2240 ''EOF.''
2241
2242
2243 Input is instead controlled by defining the __YY_INPUT__
2244 macro.
2245
2246
2247 The ''flex'' restriction that __input()__ cannot be
2248 redefined is in accordance with the POSIX specification,
2249 which simply does not specify any way of controlling the
2250 scanner's input other than by making an initial assignment
2251 to ''yyin.''
2252
2253
2254 -
2255
2256
2257 The __unput()__ routine is not redefinable. This
2258 restriction is in accordance with POSIX.
2259
2260
2261 -
2262
2263
2264 ''flex'' scanners are not as reentrant as ''lex''
2265 scanners. In particular, if you have an interactive scanner
2266 and an interrupt handler which long-jumps out of the
2267 scanner, and the scanner is subsequently called again, you
2268 may get the following message:
2269
2270
    fatal flex scanner internal error--end of buffer missed

To reenter the scanner, first use

    yyrestart( yyin );

Note that this call will throw away any buffered input; usually this isn't a problem with an interactive scanner.

Also note that flex C++ scanner classes ''are''
reentrant, so if using C++ is an option for you, you should
use them instead. See "Generating C++ Scanners" above for
details.
2283
2284
2285 -
2286
2287
2288 __output()__ is not supported. Output from the
2289 __ECHO__ macro is done to the file-pointer ''yyout''
2290 (default ''stdout).''
2291
2292
2293 __output()__ is not part of the POSIX
2294 specification.
2295
2296
2297 -
2298
2299
2300 ''lex'' does not support exclusive start conditions (%x),
2301 though they are in the POSIX specification.
2302
2303
2304 -
2305
2306
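When definitions are expanded, ''flex'' encloses them in
parentheses. With lex, the following:

    NAME    [[A-Z][[A-Z0-9]*
    %%
    foo{NAME}?      printf( "Found it\n" );
    %%

will not match the string "foo" because when the macro is expanded the rule is equivalent to "foo[[A-Z][[A-Z0-9]*?" and the precedence is such that the '?' is associated with "[[A-Z0-9]*". With ''flex,'' the rule will be expanded to "foo([[A-Z][[A-Z0-9]*)?" and so the string "foo" will match.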
2315
2316
2317 Note that if the definition begins with __^__ or ends
2318 with __$__ then it is ''not'' expanded with
2319 parentheses, to allow these operators to appear in
definitions without losing their special meanings. But the
__<s>, /,__ and __<<EOF>>__
operators cannot be used in a ''flex''
definition.
2324
2325
2326 Using __-l__ results in the ''lex'' behavior of no
2327 parentheses around the definition.
2328
2329
2330 The POSIX specification is that the definition be enclosed
2331 in parentheses.
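The difference between the two expansion styles can be sketched as plain string substitution. The function below is a hypothetical illustration, not the code ''flex'' uses internally; it wraps the definition in parentheses unless the definition begins with a caret or ends with a dollar sign, following the exception noted above, and passing 0 for parenthesize models the ''lex'' behavior.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch of definition expansion.  Writes
 * before + definition + after into out.  When parenthesize is nonzero
 * the definition is wrapped in parentheses unless it begins with '^' or
 * ends with '$'.  Returns 0 on success, -1 if out is too small.
 */
int expand_definition(char *out, size_t outsize,
                      const char *before, const char *definition,
                      const char *after, int parenthesize)
{
    size_t dlen;
    int n;

    assert(out != NULL && before != NULL);
    assert(definition != NULL && after != NULL);

    dlen = strlen(definition);
    if (dlen > 0 && (definition[0] == '^' || definition[dlen - 1] == '$'))
        parenthesize = 0;            /* keep the anchors meaningful */

    if (parenthesize)
        n = snprintf(out, outsize, "%s(%s)%s", before, definition, after);
    else
        n = snprintf(out, outsize, "%s%s%s", before, definition, after);

    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```

Called with the pieces of the rule from the example (the text before the definition, the definition of NAME, and the trailing question mark), the parenthesized form makes the whole definition optional, whereas the unparenthesized form leaves the question mark attached to the last part of the definition only.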
2332
2333
2334 -
2335
2336
2337 Some implementations of ''lex'' allow a rule's action to
2338 begin on a separate line, if the rule's pattern has trailing
2339 whitespace:
2340
2341
    %%
    foo|bar<space here>
      { foobar_action(); }

''flex'' does not support this feature.
2345
2346
2347 -
2348
2349
2350 The ''lex'' __%r__ (generate a Ratfor scanner) option
2351 is not supported. It is not part of the POSIX
2352 specification.
2353
2354
2355 -
2356
2357
2358 After a call to __unput(),__ ''yytext'' is undefined
2359 until the next token is matched, unless the scanner was
2360 built using __%array.__ This is not the case with
2361 ''lex'' or the POSIX specification. The __-l__ option
2362 does away with this incompatibility.
2363
2364
2365 -
2366
2367
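The precedence of the __{}__ (numeric range) operator is
different. ''lex'' interprets "abc{1,3}" as "match one, two,
or three occurrences of 'abc'", whereas ''flex'' interprets
it as "match 'ab' followed by one, two, or three occurrences
of 'c'". The latter is in agreement with the POSIX
specification.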
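The two readings of the numeric range operator can be tried out with the POSIX regex library, whose extended regular expressions bind an interval to the single preceding character, the same reading as ''flex''. The sketch below uses regcomp() and regexec() from the system C library rather than ''flex'' itself, and the wrapper name ere_match is hypothetical; the ''lex'' reading is written out with explicit parentheses.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Returns 1 if text matches the POSIX extended regular expression
 * pattern, 0 if it does not, and -1 if the pattern does not compile.
 * Callers anchor the pattern with ^ and $ to require a full match.
 */
int ere_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With the pattern ^abc{1,3}$ the text abccc matches and abcabc does not; with ^(abc){1,3}$ the outcome is reversed.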
2372
2373
2374 -
2375
2376
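The precedence of the __^__ operator is different.
''lex'' interprets "^foo|bar" as "match either 'foo' at the
beginning of a line, or 'bar' anywhere", whereas ''flex''
interprets it as "match either 'foo' or 'bar' if they come
at the beginning of a line". The latter is in agreement with
the POSIX specification.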
2381
2382
2383 -
2384
2385
2386 The special table-size declarations such as __%a__
2387 supported by ''lex'' are not required by ''flex''
2388 scanners; ''flex'' ignores them.
2389
2390
2391 -
2392
2393
2394 The name FLEX_SCANNER is #define'd so scanners may be
2395 written for use with either ''flex'' or ''lex.''
2396 Scanners also include __YY_FLEX_MAJOR_VERSION__ and
2397 __YY_FLEX_MINOR_VERSION__ indicating which version of
2398 ''flex'' generated the scanner (for example, for the 2.5
2399 release, these defines would be 2 and 5
2400 respectively).
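Code shared between ''flex'' and ''lex'' scanners can test these names with the preprocessor. The following is a minimal sketch; the function name flex_version_code and the major * 100 + minor encoding are assumptions made for illustration, while the three macro names are the ones listed above.

```c
#include <assert.h>

/*
 * Reports the flex version that generated the enclosing scanner as
 * major * 100 + minor, or 0 when the code is not compiled as part of a
 * flex-generated scanner (FLEX_SCANNER undefined).
 */
int flex_version_code(void)
{
    int code = 0;

#if defined(FLEX_SCANNER) && defined(YY_FLEX_MAJOR_VERSION) && \
    defined(YY_FLEX_MINOR_VERSION)
    code = YY_FLEX_MAJOR_VERSION * 100 + YY_FLEX_MINOR_VERSION;
#endif

    assert(code >= 0);   /* version components are never negative */
    return code;
}
```

Placed in the user code section of a scanner generated by the 2.5 release, this would yield 205; compiled on its own or inside a ''lex'' scanner it yields 0.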
2401
2402
2403 The following ''flex'' features are not included in
2404 ''lex'' or the POSIX specification:
2405
2406
    C++ scanners
    %option
    start condition scopes
    start condition stacks
    interactive/non-interactive scanners
    yy_scan_string() and friends
    yyterminate()
    yy_set_interactive()
    yy_set_bol()
    YY_AT_BOL()
    <<EOF>>
    <*>
    YY_DECL
    YY_START
    YY_USER_ACTION
    YY_USER_INIT
    #line directives
    %{}'s around actions
    multiple actions on a line

plus almost all of the flex flags. The last feature in the list refers to the fact that with ''flex'' you can put multiple actions on the same line, separated with semi-colons, while with ''lex,'' the following

    foo    handle_foo(); ++num_foos_seen;

is (rather surprisingly) truncated to

    foo    handle_foo();

''flex'' does not truncate the action. Actions that are not enclosed in braces are simply terminated at the end of the line.
2426 !!DIAGNOSTICS
2427
2428
''warning, rule cannot be matched'' indicates that the
given rule cannot be matched because it follows other rules
that will always match the same text as it. For example, in
the following "foo" cannot be matched because it comes after
an identifier "catch-all" rule:

    [[a-z]+    got_identifier();
    foo        got_foo();

Using __REJECT__ in a scanner suppresses this warning.
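Why the second rule can never win follows from the matching policy: the longest match is chosen, and among matches of equal length the rule listed first is used. The sketch below is a hand-written illustration of that policy for these two rules only; the function names are hypothetical and none of this is code emitted by ''flex''.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length matched at the start of text by the rule [a-z]+ (0 if none).
 * Assumes 'a'..'z' are contiguous, as in ASCII. */
static size_t match_identifier(const char *text)
{
    size_t n = 0;

    while (text[n] >= 'a' && text[n] <= 'z')
        ++n;
    return n;
}

/* Length matched at the start of text by the literal rule foo (3 or 0). */
static size_t match_foo(const char *text)
{
    return strncmp(text, "foo", 3) == 0 ? 3 : 0;
}

/*
 * Returns the index of the rule that is chosen: 0 for the identifier
 * rule, 1 for the foo rule, -1 if neither matches.  The longest match
 * wins; on a tie the rule listed first is kept.
 */
int choose_rule(const char *text)
{
    size_t len[2];
    int best = -1;
    int i;

    assert(text != NULL);

    len[0] = match_identifier(text);
    len[1] = match_foo(text);

    for (i = 0; i < 2; ++i)
        if (len[i] > 0 && (best < 0 || len[i] > len[(size_t)best]))
            best = i;
    return best;
}
```

Whenever the foo rule matches, the identifier rule matches at least as much text and is listed first, so choose_rule() never returns 1.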
2439
2440
2441 ''warning,'' __-s__ ''option given but default rule
2442 can be matched'' means that it is possible (perhaps only
2443 in a particular start condition) that the default rule
2444 (match any single character) is the only one that will match
2445 a particular input. Since __-s__ was given, presumably
2446 this is not intended.
2447
2448
2449 ''reject_used_but_not_detected undefined'' or
2450 ''yymore_used_but_not_detected undefined -'' These errors
2451 can occur at compile time. They indicate that the scanner
2452 uses __REJECT__ or __yymore()__ but that ''flex''
2453 failed to notice the fact, meaning that ''flex'' scanned
2454 the first two sections looking for occurrences of these
2455 actions and failed to find any, but somehow you snuck some
2456 in (via a #include file, for example). Use __%option
2457 reject__ or __%option yymore__ to indicate to flex that
2458 you really do use these features.
2459
2460
2461 ''flex scanner jammed -'' a scanner compiled with
2462 __-s__ has encountered an input string which wasn't
2463 matched by any of its rules. This error can also occur due
2464 to internal problems.
2465
2466
2467 ''token too large, exceeds YYLMAX -'' your scanner uses
2468 __%array__ and one of its rules matched a string longer
2469 than the __YYLMAX__ constant (8K bytes by default). You
2470 can increase the value by #define'ing __YYLMAX__ in the
2471 definitions section of your ''flex'' input.
2472
2473
2474 ''scanner requires -8 flag to use the character 'x' -''
2475 Your scanner specification includes recognizing the 8-bit
2476 character '''x''' and you did not specify the -8 flag,
2477 and your scanner defaulted to 7-bit because you used the
2478 __-Cf__ or __-CF__ table compression options. See the
2479 discussion of the __-7__ flag for details.
2480
2481
2482 ''flex scanner push-back overflow -'' you used
2483 __unput()__ to push back so much text that the scanner's
2484 buffer could not hold both the pushed-back text and the
2485 current token in __yytext.__ Ideally the scanner should
2486 dynamically resize the buffer in this case, but at present
2487 it does not.
2488
2489
2490 ''input buffer overflow, can't enlarge buffer because
2491 scanner uses REJECT -'' the scanner was working on
2492 matching an extremely large token and needed to expand the
2493 input buffer. This doesn't work with scanners that use
2494 __REJECT.__
2495
2496
''fatal flex scanner internal error--end of buffer missed
-'' This can occur in a scanner which is reentered after
a long-jump has jumped out (or over) the scanner's
activation frame. Before reentering the scanner,
use:

    yyrestart( yyin );

or, as noted above, switch to using the C++ scanner class.
2506
2507
''too many start conditions in <> construct! -''
you listed more start conditions in a <> construct than
exist (so you must have listed at least one of them twice).
2511 !!FILES
2512
2513
2514 __-lfl__
2515
2516
2517 library with which scanners must be linked.
2518
2519
2520 ''lex.yy.c''
2521
2522
2523 generated scanner (called ''lexyy.c'' on some
2524 systems).
2525
2526
2527 ''lex.yy.cc''
2528
2529
2530 generated C++ scanner class, when using
2531 __-+.__
2532
2533
''<!FlexLexer.h>''


header file defining the C++ scanner base class,
__!FlexLexer,__ and its derived class,
__yyFlexLexer.__
2540
2541
2542 ''flex.skl''
2543
2544
2545 skeleton scanner. This file is only used when building flex,
2546 not when flex executes.
2547
2548
2549 ''lex.backup''
2550
2551
2552 backing-up information for __-b__ flag (called
2553 ''lex.bck'' on some systems).
2554 !!DEFICIENCIES / BUGS
2555
2556
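Some trailing context patterns cannot be properly matched
and generate warning messages ("dangerous trailing
context"). These are patterns where the ending of the first
part of the rule matches the beginning of the second part,
such as "zx*/xy*", where the 'x*' matches the 'x' at the
beginning of the trailing context. (Note that the POSIX
draft states that the text matched by such patterns is
undefined.)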
2559
2560
For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the
abovementioned performance loss. In particular, parts using
'|' or {n} (such as "foo{3}") are always considered
variable-length.
2565
2566
2567 Combining trailing context with the special '|' action can
2568 result in ''fixed'' trailing context being turned into
2569 the more expensive ''variable'' trailing context. For
2570 example, in the following:
2571
2572
    %%
    abc      |
    xyz/def

Use of __unput()__ invalidates yytext and yyleng, unless the __%array__ directive or the __-l__ option has been used.
2577
2578
2579 Pattern-matching of NUL's is substantially slower than
2580 matching other characters.
2581
2582
2583 Dynamic resizing of the input buffer is slow, as it entails
2584 rescanning all the text matched so far by the current
2585 (generally huge) token.
2586
2587
Due to both buffering of input and read-ahead, you cannot
intermix calls to <stdio.h> routines, such as, for example,
__getchar(),__ with ''flex'' rules and expect
it to work. Call __input()__ instead.
2592
2593
2594 The total table entries listed by the __-v__ flag
2595 excludes the number of table entries needed to determine
2596 what rule has been matched. The number of entries is equal
2597 to the number of DFA states if the scanner does not use
2598 __REJECT,__ and somewhat greater than the number of
2599 states if it does.
2600
2601
2602 __REJECT__ cannot be used with the __-f__ or __-F__
2603 options.
2604
2605
2606 The ''flex'' internal algorithms need
2607 documentation.
2608 !!SEE ALSO
2609
2610
2611 lex(1), yacc(1), sed(1), awk(1).
2612
2613
John Levine, Tony Mason, and Doug Brown, ''Lex & Yacc,''
O'Reilly and Associates. Be sure to get the 2nd
edition.
2617
2618
2619 M. E. Lesk and E. Schmidt, ''LEX - Lexical Analyzer
2620 Generator''
2621
2622
2623 Alfred Aho, Ravi Sethi and Jeffrey Ullman, ''Compilers:
2624 Principles, Techniques and Tools,'' Addison-Wesley (1986).
2625 Describes the pattern-matching techniques used by
2626 ''flex'' (deterministic finite automata).
2627 !!AUTHOR
2628
2629
2630 Vern Paxson, with the help of many ideas and much
2631 inspiration from Van Jacobson. Original version by Jef
2632 Poskanzer. The fast table representation is a partial
2633 implementation of a design done by Van Jacobson. The
2634 implementation was done by Kevin Gong and Vern
2635 Paxson.
2636
2637
2638 Thanks to the many ''flex'' beta-testers, feedbackers,
2639 and contributors, especially Francois Pinard, Casey Leedom,
2640 Robert Abramovitz, Stan Adermann, Terry Allen, David
2641 Barker-Plummer, John Basrai, Neal Becker, Nelson H.F. Beebe,
2642 benson@odi.com, Karl Berry, Peter A. Bigot, Simon Blanchard,
2643 Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick
2644 Christopher, Brian Clapper, J.T. Conklin, Jason Coughlin,
2645 Bill Cox, Nick Cropper, Dave Curtis, Scott David Daniels,
2646 Chris G. Demetriou, Theo Deraadt, Mike Donahue, Chuck
2647 Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Chris
2648 Flatters, Jon Forrest, Jeffrey Friedl, Joe Gayda, Kaveh R.
2649 Ghazi, Wolfgang Glunz, Eric Goldman, Christopher M. Gould,
2650 Ulrich Grepel, Peer Griebel, Jan Hajic, Charles Hemphill,
2651 NORO Hideo, Jarkko Hietaniemi, Scott Hofmann, Jeff Honig,
2652 Dana Hudes, Eric Hughes, John Interrante, Ceriel Jacobs,
2653 Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, Henry
2654 Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
2655 Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Steve Kirsch,
2656 Winfried Koenig, Marq Kole, Ronald Lamprecht, Greg Lee,
2657 Rohan Lenard, Craig Leres, John Levine, Steve Liddle, David
2658 Loffredo, Mike Long, Mohamed el Lozy, Brian Madsen, Malte,
2659 Joe Marshall, Bengt Martensson, Chris Metcalf, Luke Mewburn,
2660 Jim Meyering, R. Alexander Milowski, Erik Naggum, G.T.
2661 Nicol, Landon Noll, James Nordby, Marc Nozell, Richard
2662 Ohnemus, Karsten Pahnke, Sven Panne, Roland Pesch, Walter
2663 Pelissero, Gaumond Pierre, Esmond Pitt, Jef Poskanzer, Joe
2664 Rahmeh, Jarmo Raiha, Frederic Raimbault, Pat Rankin, Rick
2665 Richardson, Kevin Rodgers, Kai Uwe Rommel, Jim Roskind,
2666 Alberto Santini, Andreas Scherer, Darrell Schiebel, Raf
2667 Schietekat, Doug Schmidt, Philippe Schnoebelen, Andreas
2668 Schwab, Larry Schwimmer, Alex Siegel, Eckehard Stolz,
2669 Jan-Erik Strvmquist, Mike Stump, Paul Stuart, Dave Tallman,
2670 Ian Lance Taylor, Chris Thewalt, Richard M. Timoney, Jodi
2671 Tsai, Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard
2672 Wilhelms, Kent Williams, Ken Yap, Ron Zellar, Nathan Zelle,
2673 David Zuhn, and those whose names have slipped my marginal
2674 mail-archiving skills but whose contributions are
2675 appreciated all the same.
2676
2677
2678 Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John
2679 Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol,
2680 Francois Pinard, Rich Salz, and Richard Stallman for help
2681 with various distribution headaches.
2682
2683
2684 Thanks to Esmond Pitt and Earle Horton for 8-bit character
2685 support; to Benson Margulies and Fred Burke for C++ support;
2686 to Kent Williams and Tom Epperly for C++ class support; to
2687 Ove Ewerlid for support of NUL's; and to Eric Hughes for
2688 support of multiple buffers.
2689
2690
2691 This work was primarily done when I was with the Real Time
2692 Systems Group at the Lawrence Berkeley Laboratory in
2693 Berkeley, CA. Many thanks to all there for the support I
2694 received.
2695
2696
2697 Send comments to vern@ee.lbl.gov.
2698 ----