Differences between current version and revision by previous author of HowToLexYACCHOWTO.
Other diffs: Previous Major Revision, Previous Revision, or view the Annotated Edit History
Newer page: | version 3 | Last edited on Tuesday, October 26, 2004 10:23:21 am | by AristotlePagaltzis | |
Older page: | version 2 | Last edited on Friday, June 7, 2002 1:06:57 am | by perry | Revert |
@@ -1,1672 +1 @@
-
-
-
-Lex and YACC primer/HOWTO
-
-
-
-----
-
-!!!Lex and YACC primer/HOWTO
-
-!!PowerDNS BV (bert hubert <bert@powerdns.com>)v0.8 $Date: 2001/05/20 19:35:59 $
-
-
-----
-''This document tries to help you get started using Lex and YACC''
-----
-
-
-
-
-!!1. Introduction
-
-
-*1.1 What this document is NOT
-
-*1.2 Downloading stuff
-
-*1.3 License
-
-
-
-
-
-!!2. What Lex & YACC can do for you
-
-
-*2.1 What each program does on its own
-
-
-
-
-
-!!3. Lex
-
-
-*3.1 Regular expressions in matches
-
-*3.2 A more complicated example for a C like syntax
-
-*3.3 What we've seen
-
-
-
-
-
-!!4. YACC
-
-
-*4.1 A simple thermostat controller
-
-*4.2 Expanding the thermostat to handle parameters
-
-*4.3 Parsing a configuration file
-
-
-
-
-
-!!5. Making a Parser in C++
-
-
-
-
-!!6. How do Lex and YACC work internally
-
-
-*6.1 Token values
-
-*6.2 Recursion: 'right is wrong'
-
-*6.3 Advanced yylval: %union
-
-
-
-
-
-!!7. Debugging
-
-
-*7.1 The state machine
-
-*7.2 Conflicts: 'shift/reduce', 'reduce/reduce'
-
-
-
-
-
-!!8. Further reading
-
-
-
-
-!!9. Acknowledgements & Thanks
-----
-
-!!1. Introduction
-
-
-Welcome, gentle reader.
-
-
-If you have been programming for any length of time in a Unix environment,
-you will have encountered the mystical programs Lex & YACC, or as they
-are known to GNU/Linux users worldwide, Flex & Bison, where Flex is a
-Lex implementation by Vern Paxon and Bison the GNU version of YACC. We will
-call these programs Lex and YACC throughout - the newer versions are
-upwardly compatible, so you can use Flex and Bison when trying our examples.
-
-
-These programs are massively useful, but as with your C compiler, their
-manpage does not explain the language they understand, nor how to use them.
-YACC is really amazing when used in combination with Lex, however, the Bison
-manpage does not describe how to integrate Lex generated code with your
-Bison program.
-
-
-
-
-!!1.1 What this document is NOT
-
-
-
-There are several great books which deal with Lex & YACC. By all means
-read these books if you need to know more. They provide far more information
-than we ever will. See the 'Further Reading' section at the end. This
-document is aimed at bootstrapping your use of Lex
-& YACC, to allow you to create your first programs.
-
-
-The documentation that comes with Flex and BISON is also excellent, but no
-tutorial. They do complement my HOWTO very well though. They too are
-referenced at the end.
-
-
-I am by no means a YACC/Lex expert. When I started writing this document, I
-had exactly two days of experience. All I want to accomplish is to make
-those two days easier for you.
-
-
-In no way expect the HOWTO to show proper YACC and Lex style. Examples
-have been kept very simple and there may be better ways to write them. If
-you know how to, please let me know.
-
-!!1.2 Downloading stuff
-
-
-
-Please note that you can download all the examples shown, which are in
-machine readable form. See the
-homepage for details.
-
-
-
-
-!!1.3 License
-
-
-
-Copyright (c) 2001 by bert hubert. This material may be
-distributed only subject to the terms and conditions set forth in the Open
-Publication License, vX.Y or later (the latest version is presently
-available at http://www.opencontent.org/openpub/).
-----
-
-!!2. What Lex & YACC can do for you
-
-
-When properly used, these programs allow you to parse complex languages with
-ease. This is a great boon when you want to read a configuration file, or
-want to write a compiler for any language you (or anyone else) might have
-invented.
-
-
-With a little help, which this document will hopefully provide, you will
-find that you will never write a parser again by hand - Lex & YACC are
-the tools to do this.
-
-
-
-
-!!2.1 What each program does on its own
-
-
-
-Although these programs shine when used together, they each serve a
-different purpose. The next chapter will explain what each part does.
-
-
-
-----
-
-!!3. Lex
-
-
-The program Lex generates a so called `Lexer'. This is a function that takes
-a stream of characters as its input, and whenever it sees a group of
-characters that match a key, takes a certain action. A very simple example:
-
-
-
-
-
-%{
-#include <stdio.h>
-%}
-%%
-stop printf("Stop command received\n");
-start printf("Start command received\n");
-%%
-
-
-
-
-The first section, in between the %{ and %} pair is included directly in the
-output program. We need this, because we use printf later on, which is
-defined in stdio.h.
-
-
-Sections are separated using '%%', so the first line of the second section
-starts with the 'stop' key. Whenever the 'stop' key is encountered in the
-input, the rest of the line (a printf() call) is executed.
-
-
-Besides 'stop', we've also defined 'start', which otherwise does mostly the
-same.
-
-
-We terminate the code section with '%%' again.
-
-
-To compile Example 1, do this:
-
-
-lex example1.l
-cc lex.yy.c -o example1 -ll
-
-
-
-
-
-
-NOTE: If you are using flex, instead of lex, you may have to change '-ll'
-to '-lfl' in the compilation scripts. !RedHat 6.x and SuSE need this, even when
-you invoke 'flex' as 'lex'!
-
-
-
-This will generate the file 'example1'. If you run it, it waits for you to
-type some input. Whenever you type something that is not matched by any of
-the defined keys (ie, 'stop' and 'start') it's output again. If you
-enter 'stop' it will output 'Stop command received';
-
-
-Terminate with a EOF (^D).
-
-
-You may wonder how the program runs, as we didn't define a main() function.
-This function is defined for you in libl (liblex) which we compiled in with
-the -ll command.
-
-
-
-
-!!3.1 Regular expressions in matches
-
-
-
-This example wasn't very useful in itself, and our next one won't be either.
-It will however show how to use regular expressions in Lex, which are
-massively useful later on.
-
-
-Example 2:
-
-
-%{
-#include <stdio.h>
-%}
-%%
-
[[0123456789
]+ printf("NUMBER\n");
-[[a-zA-Z][[a-zA-Z0-9]* printf("WORD\n");
-%%
-
-
-
-
-This Lex file describes two kinds of matches (tokens): WORDs and NUMBERs.
-Regular expressions can be pretty daunting but with only a little work it is
-easy to understand them. Let's examine the NUMBER match:
-
-
-[[0123456789]+
-
-
-This says: a sequence of one or more characters from the group 0123456789.
-We could also have written it shorter as:
-
-
-[[-9]+
-
-
-Now, the WORD match is somewhat more involved:
-
-
-[[a-zA-Z][[a-zA-Z0-9]*
-
-
-The first part matches 1 and only 1 character that is between 'a' and 'z',
-or between 'A' and 'Z'. In other words, a letter. This initial letter then
-needs to be followed by zero or more characters which are either a letter or
-a digit. Why use an asterisk
here? The '+' signifies 1 or more matches, but
-a WORD might very well consist of only one character, which we've already
-matched. So the second part may have zero matches, so we write a '*'.
-
-
-This way, we've mimicked the behaviour of many programming languages which
-demand that a variable name *must* start with a letter, but can contain
-digits afterwards. In other words, 'temperature1' is a valid name,
-but '1temperature' is not.
-
-
-Try compiling Example 2, lust like Example 1, and feed it some text. Here is
-a sample session:
-
-
-
-
-
-$ ./example2
-foo
-WORD
-bar
-WORD
-123
-NUMBER
-bar123
-WORD
-123bar
-NUMBER
-WORD
-
-
-
-
-You may also be wondering where all this whitespace is coming from in the
-output. The reason is simple: it was in the input, and we don't match on it
-anywhere, so it gets output again.
-
-
-The Flex manpage documents its regular expressions in detail. Many people
-feel that the perl regular expression manpage (perlre) is also very useful,
-although Flex does not implement everything perl does.
-
-
-Make sure that you do not create zero length matches like '[[-9]*' - your
-lexer might get confused and start matching empty strings repeatedly.
-
-!!3.2 A more complicated example for a C like syntax
-
-
-
-Let's say we want to parse a file that looks like this:
-
-
-logging {
-category lame-servers { null; };
-category cname { null; };
-};
-zone "." {
-type hint;
-file "/etc/bind/db.root";
-};
-
-
-
-
-We clearly see a number of categories (tokens) in this file:
-
-
-* WORDs, like 'zone' and 'type'
-*
-
-* FILENAMEs, like '/etc/bind/db.root'
-*
-
-* QUOTEs, like those surrounding the filename
-*
-
-* OBRACEs, {
-*
-
-* EBRACEs, }
-*
-
-* SEMICOLONs, ;
-*
-
-
-
-The corresponding Lex file is Example 3:
-
-
-%{
-#include <stdio.h>
-%}
-%%
-[[a-zA-Z][[a-zA-Z0-9]* printf("WORD ");
-[[a-zA-Z0-9\/.-]+ printf("FILENAME ");
-\" printf("QUOTE ");
-\{ printf("OBRACE ");
-\} printf("EBRACE ");
-; printf("SEMICOLON ");
-\n printf("\n");
-[[ \t]+ /* ignore whitespace */;
-%%
-
-
-
-
-When we feed our file to the program this Lex file generates (using
-example3.compile), we get:
-
-
-
-
-
-WORD OBRACE
-WORD FILENAME OBRACE WORD SEMICOLON EBRACE SEMICOLON
-WORD WORD OBRACE WORD SEMICOLON EBRACE SEMICOLON
-EBRACE SEMICOLON
-WORD QUOTE FILENAME QUOTE OBRACE
-WORD WORD SEMICOLON
-WORD QUOTE FILENAME QUOTE SEMICOLON
-EBRACE SEMICOLON
-
-
-
-
-When compared with the configuration file mentioned above, it is clear that
-we have neatly 'Tokenized' it. Each part of the configuration file has been
-matched, and converted into a token.
-
-
-And this is exactly what we need to put YACC to good use.
-
-
-
-
-!!3.3 What we've seen
-
-
-
-We've seen that Lex is able to read arbitrary input, and determine what each
-part of the input is. This is called 'Tokenizing'.
-
-
-
-----
-
-!!4. YACC
-
-
-YACC can parse input streams consisting of tokens with certain values. This
-clearly describes the relation YACC has with Lex, YACC has no idea
-what 'input streams' are, it needs preprocessed tokens. While you can write your
-own Tokenizer, we will leave that entirely up to Lex.
-
-
-A note on grammars and parsers. When YACC saw the light of day, the tool was
-used to parse input files for compilers: programs. Programs written in a
-programming language for computers are typically *not* ambiguous - they have
-just one meaning. As such, YACC does not cope with ambiguity and will
-complain about shift/reduce or reduce/reduce conflicts. More about
-ambiguity and YACC "problems" can be found in 'Conflicts' chapter.
-
-
-
-
-
-
-
-!!4.1 A simple thermostat controller
-
-
-
-Let's say we have a thermostat that we want to control using a simple
-language. A session with the thermostat may look like this:
-
-
-
-
-
-heat on
-Heater on!
-heat off
-Heater off!
-target temperature 22
-New temperature set!
-
-
-
-
-The tokens we need to recognize are: heat, on/off (STATE), target, temperature,
-NUMBER.
-
-
-The Lex tokenizer (Example 4) is:
-
-
-%{
-#include <stdio.h>
-#include "y.tab.h"
-%}
-%%
-[[-9]+ return NUMBER;
-heat return TOKHEAT;
-on|off return STATE;
-target return TOKTARGET;
-temperature return TOKTEMPERATURE;
-\n /* ignore end of line */;
-[[ \t]+ /* ignore whitespace */;
-%%
-
-
-
-
-We note two important changes. First, we include the file 'y.tab.h', and
-secondly, we no longer print stuff, we return names of tokens. This change
-is because we are now feeding it all to YACC, which isn't interested in
-what we output to the screen. Y.tab.h has definitions for these tokens.
-
-
-But where does y.tab.h come from? It is generated by YACC from the Grammar
-File we are about to create. As our language is very basic, so is the grammar:
-
-
-
-
-
-commands: /* empty */
-| commands command
-;
-command:
-heat_switch
-|
-target_set
-;
-heat_switch:
-TOKHEAT STATE
-{
-printf("\tHeat turned on or off\n");
-}
-;
-target_set:
-TOKTARGET TOKTEMPERATURE NUMBER
-{
-printf("\tTemperature set\n");
-}
-;
-
-
-
-
-The first part is what I call the 'root'. It tells us that we
-have 'commands', and that these commands consist of individual 'command'
-parts. As you can see this rule is very recursive, because it again
-contains the word 'commands'. What this means is that the program is now
-capable of reducing a series of commands one by one. Read the chapter 'How
-do Lex and YACC work internally' for important details on recursion.
-
-
-The second rule defines what a command is. We support only two kinds of
-commands, the 'heat_switch' and the 'target_set'. This is what the |-symbol
-signifies - 'a command consists of either a heat_switch or a target_set'.
-
-
-A heat_switch consists of the HEAT token, which is simply the word 'heat',
-followed by a state (which we defined in the Lex file as 'on' or 'off').
-
-
-Somewhat more complicated is the target_set, which consists of the TARGET
-token (the word 'target'), the TEMPERATURE token (the word 'temperature')
-and a number.
-
-
-
-
-!A complete YACC file
-
-
-The previous section only showed the grammar part of the YACC file, but
-there is more. This is the header that we omitted:
-
-
-
-
-
-%{
-#include <stdio.h>
-#include <string.h>
-void yyerror(const char *str)
-{
-fprintf(stderr,"error: %s\n",str);
-}
-int yywrap()
-{
-return 1;
-}
-main()
-{
-yyparse();
-}
-%}
-%token NUMBER TOKHEAT STATE TOKTARGET TOKTEMPERATURE
-
-
-The yyerror() function is called by YACC if it finds an error. We simply
-output the message passed, but there are smarter things to do. See
-the 'Further reading' section at the end.
-
-
-The function yywrap() can be used to continue reading from another file. It
-is called at EOF and you can than open another file, and return . Or you
-can return 1, indicating that this is truly the end. For more about this,
-see the 'How do Lex and YACC work internally' chapter.
-
-
-Then there is the main() function, that does nothing but set everything in
-motion.
-
-
-The last line simply defines the tokens we will be using. These are output
-using y.tab.h if YACC is invoked with the '-d' option.
-
-
-
-
-!Compiling & running the thermostat controller
-
-
-
-
-
-lex example4.l
-yacc -d example4.y
-cc lex.yy.c y.tab.c -o example4
-
-
-A few things have changed. We now also invoke YACC to compile our grammar,
-which creates y.tab.c and y.tab.h. We then call Lex as usual. When
-compiling, we remove the -ll flag: we now have our own main() function and
-don't need the one provided by libl.
-
-
-
-
-NOTE: if you get an error about your compiler not being able to
-find 'yylval', add this to example4.l, just beneath #include
-<y.tab.h>:
-
-extern YYSTYPE yylval;
-
-This is explained in the 'How Lex and YACC work internally' section.
-
-
-
-A sample session:
-
-
-$ ./example4
-heat on
-Heat turned on or off
-heat off
-Heat turned on or off
-target temperature 10
-Temperature set
-target humidity 20
-error: parse error
-$
-
-
-
-
-This is not quite what we set out to achieve, but in the interest of keeping
-the learning curve manageable, not all cool stuff can be presented at once.
-
-
-
-
-!!4.2 Expanding the thermostat to handle parameters
-
-
-
-As we've seen, we now parse the thermostat commands correctly, and even flag
-mistakes properly. But as you might have guessed by the weasely wording, the
-program has no idea of what it should do, it does not get passed any of the
-values you enter.
-
-
-Let's start by adding the ability to read the new target temperature. In
-order to do so, we need to learn the NUMBER match in the Lexer to convert
-itself into an integer value, which can then be read in YACC.
-
-
-Whenever Lex matches a target, it puts the text of the match in the
-character string 'yytext'. YACC in turn expects to find a value in the
-variable 'yylval'. In Example 5, we see the obvious solution:
-
-
-
-
-
-%{
-#include <stdio.h>
-#include "y.tab.h"
-%}
-%%
-[[-9]+ yylval=atoi(yytext); return NUMBER;
-heat return TOKHEAT;
-on|off yylval=!strcmp(yytext,"on"); return STATE;
-target return TOKTARGET;
-temperature return TOKTEMPERATURE;
-\n /* ignore end of line */;
-[[ \t]+ /* ignore whitespace */;
-%%
-
-
-
-
-As you can see, we run atoi() on yytext, and put the result in yylval, where
-YACC can see it. We do much the same for the STATE match, where we compare
-it to 'on', and set yylval to 1 if it is equal. Please note that having a
-separate 'on' and 'off' match in Lex would produce faster code, but I wanted
-to show a more complicated rule and action for a change.
-
-
-Now we need to learn YACC how to deal with this. What is called 'yylval' in
-Lex has a different name in YACC. Let's examine the rule setting the new
-temperature target:
-
-
-
-
-
-target_set:
-TOKTARGET TOKTEMPERATURE NUMBER
-{
-printf("\tTemperature set to %d\n",$3);
-}
-;
-
-
-
-
-To access the value of the third part of the rule (ie, NUMBER), we need to
-use $3. Whenever yylex() returns, the contents of yylval are attached to the
-terminal, the value of which can be accessed with the $-construct.
-
-
-To expound on this further, let's observe the new 'heat_switch' rule:
-
-
-
-
-
-heat_switch:
-TOKHEAT STATE
-{
-if($2)
-printf("\tHeat turned on\n");
-else
-printf("\tHeat turned off\n");
-}
-;
-
-
-
-
-If you now run example5, it properly outputs what you entered.
-
-
-
-
-!!4.3 Parsing a configuration file
-
-
-
-Let's repeat part of the configuration file we mentioned earlier:
-
-
-zone "." {
-type hint;
-file "/etc/bind/db.root";
-};
-
-
-
-
-Remember that we already wrote a Lexer for this file. Now all we need to do
-is write the YACC grammar, and modify the Lexer so it returns values in
-a format YACC can understand.
-
-
-In the lexer from Example 6 we see:
-
-
-%{
-#include <stdio.h>
-#include "y.tab.h"
-%}
-%%
-zone return ZONETOK;
-file return FILETOK;
-[[a-zA-Z][[a-zA-Z0-9]* yylval=strdup(yytext); return WORD;
-[[a-zA-Z0-9\/.-]+ yylval=strdup(yytext); return FILENAME;
-\" return QUOTE;
-\{ return OBRACE;
-\} return EBRACE;
-; return SEMICOLON;
-\n /* ignore EOL */;
-[[ \t]+ /* ignore whitespace */;
-%%
-
-
-
-
-If you look carefully, you can see that yylval has changed! We no longer
-expect it to be an integer, but in fact assume that it is a char *. In the
-interest of keeping things simple, we invoke strdup and waste a lot of
-memory. Please note that this may not be a problem in many areas where you
-only need to parse a file once, and then exit.
-
-
-We want to store character strings because we are now mostly dealing with
-names: file names and zone names. In a later chapter we will explain how to
-deal with multiple types of data.
-
-
-In order to tell YACC about the new type of yylval, we add this line to the
-header of our YACC grammar:
-
-
-
-
-#define YYSTYPE char *
-
-
-
-The grammar itself is again more complicated. We chop it in parts to make it
-easier to digest.
-
-
-
-
-
-commands:
-|
-commands command SEMICOLON
-;
-command:
-zone_set
-;
-zone_set:
-ZONETOK quotedname zonecontent
-{
-printf("Complete zone for '%s' found\n",$2);
-}
-;
-
-
-
-
-This is the intro, including the aforementioned recursive 'root'. Please
-note that we specify that commands are terminated (and separated) by ;'s. We
-define one kind of command, the 'zone_set'. It consists of the ZONE token
-(the word 'zone'), followed by a quoted name and the 'zonecontent'. This
-zonecontent starts out simple enough:
-
-
-
-
-
-zonecontent:
-OBRACE zonestatements EBRACE
-
-
-
-
-It needs to start with an OBRACE, a {. Then follow the zonestatements,
-followed by an EBRACE, }.
-
-
-
-
-
-quotedname:
-QUOTE FILENAME QUOTE
-{
-$$=$2;
-}
-
-
-
-
-This section defines what a 'quotedname' is: a FILENAME between QUOTEs.
-Then it says something special: the value of a quotedname token is the value
-of the FILENAME. This means that the quotedname has as its value the
-filename without quotes.
-
-
-This is what the magic '$$=$2;' command does. It says: my value is the value
-of my second part. When the quotedname is now referenced in other rules, and
-you access its value with the $-construct, you see the value that we set
-here with $$=$2.
-
-
-
-
-NOTE: this grammar chokes on filenames without either a '.' or a '/' in
-them.
-
-
-
-
-
-
-
-
-
-zonestatements:
-|
-zonestatements zonestatement SEMICOLON
-;
-zonestatement:
-statements
-|
-FILETOK quotedname
-{
-printf("A zonefile name '%s' was encountered\n", $2);
-}
-;
-
-
-
-
-This is a generic statement that catches all kinds of statements within
-the 'zone' block. We again see the recursiveness.
-
-
-
-
-
-block:
-OBRACE zonestatements EBRACE SEMICOLON
-;
-statements:
-| statements statement
-;
-statement: WORD | block | quotedname
-
-
-
-
-This defines a block, and 'statements' which may be found within.
-
-
-When executed, the output is like this:
-
-
-
-
-
-$ ./example6
-zone "." {
-type hint;
-file "/etc/bind/db.root";
-type hint;
-};
-A zonefile name '/etc/bind/db.root' was encountered
-Complete zone for '.' found
-
-
-
-
-
-----
-
-!!5. Making a Parser in C++
-
-
-Although Lex and YACC predate C++, it is possible to generate a C++ parser.
-While Flex includes an option to generate a C++ lexer, we won't be using
-that, as YACC doesn't know how to deal with it directly.
-
-
-My preferred way to make a C++ parser is to have Lex generate a plain C
-file, and to let YACC generate C++ code. When you then link your
-application, you may run into some problems because the C++ code by default
-won't be able to find C functions, unless you've told it that those
-functions are extern "C".
-
-
-To do so, make a C header in YACC like this:
-
-
-
-
-
-extern "C"
-{
-int yyparse(void);
-int yylex(void);
-int yywrap()
-{
-return 1;
-}
-}
-
-
-
-
-If you want to declare or change yydebug, you must now do it like this:
-
-
-
-
-
-extern int yydebug;
-main()
-{
-yydebug=1;
-yyparse();
-}
-
-
-
-
-This is because C++'s One Definition Rule, which disallows multiple
-definitions of yydebug.
-
-
-You may also find that you need to repeat the #define of YYSTYPE in your Lex
-file, because of C++'s stricter type checking.
-
-
-To compile, do something like this:
-
-
-lex bindconfig2.l
-yacc --verbose --debug -d bindconfig2.y -o bindconfig2.cc
-cc -c lex.yy.c -o lex.yy.o
-c++ lex.yy.o bindconfig2.cc -o bindconfig2
-
-
-
-
-Because of the -o statement, y.tab.h is now called bindconfig2.cc.h, so take
-that into account.
-
-
-To summarize: don't bother to compile your Lexer in C++, keep it in C. Make
-your Parser in C++ and explain your compiler that some functions are C
-functions with extern "C" statements.
-
-
-
-----
-
-!!6. How do Lex and YACC work internally
-
-
-In the YACC file, you write your own main() function, which calls yyparse()
-at one point. The function yyparse() is created for you by YACC, and ends up
-in y.tab.c.
-
-
-yyparse() reads a stream of token/value pairs from yylex(), which needs to
-be supplied. You can code this function yourself, or have Lex do it for you.
-In our examples, we've chosen to leave this task to Lex.
-
-
-The yylex() as written by Lex reads characters from a FILE * file pointer
-called yyin. If you do not set yyin, it defaults to standard input. It
-outputs to yyout, which if unset defaults to stdout. You can also modify
-yyin in the yywrap() function which is called at the end of a file. It
-allows you to open another file, and continue parsing.
-
-
-If this is the case, have it return . If you want to end parsing at this
-file, let it return 1.
-
-
-Each call to yylex() returns an integer value which represents a token type.
-This tells YACC what kind of token it has read. The token may optionally
-have a value, which should be placed in the variable yylval.
-
-
-By default yylval is of type int, but you can override that from the YACC
-file by re#defining YYSTYPE.
-
-
-The Lexer needs to be able to access yylval. In order to do so, it must be
-declared in the scope of the lexer as an extern variable. The original YACC
-neglects to do this for you, so you should add the following to your lexter,
-just beneath #include <y.tab.h>:
-
-
-
-
-extern YYSTYPE yylval;
-
-
-
-Bison, which most people are using these days, does this for you
-automatically.
-
-
-
-
-!!6.1 Token values
-
-
-
-As mentioned before, yylex() needs to return what kind of token it
-encountered, and put its value in yylval. When these tokens are defined with
-the %token command, they are assigned numerical id's, starting from 256.
-
-
-Because of that fact, it is possible to have all ascii characters as a
-token. Let's say you are writing a calculator, up till now we would have
-written the lexer like this:
-
-
-
-
-
-[[-9]+ yylval=atoi(yytext); return NUMBER;
-[[ \n]+ /* eat whitespace */;
-- return MINUS;
-\* return MULT;
-\+ return PLUS;
-...
-
-
-
-
-Our YACC grammer would then contain:
-
-
-
-
-
-exp: NUMBER
-|
-exp PLUS exp
-|
-exp MINUS exp
-|
-exp MULT exp
-
-
-
-
-This is needlessly complicated. By using characters as shorthands for
-numerical token id's, we can rewrite our lexer like this:
-
-
-
-
-[[-9]+ yylval=atoi(yytext); return NUMBER;
-[[ \n]+ /* eat whitespace */;
-. return (int) yytext[[];
-
-
-
-This last dot matches all single otherwise unmatched characters.
-
-
-Our YACC grammer would then be:
-
-
-
-
-
-exp: NUMBER
-|
-exp '+' exp
-|
-exp '-' exp
-|
-exp '*' exp
-
-
-
-
-This is lots shorter and also more obvious. You do not need to declare these
-ascii tokens with %token in the header, they work out of the box.
-
-
-One other very good thing about this construct is that Lex will now match
-everything we throw at it - avoiding the default behaviour of echoing
-unmatched input to standard output. If a user of this calculator uses a ^,
-for example, it will now generate a parsing error, instead of being echoed
-to standard output.
-
-
-
-
-!!6.2 Recursion: 'right is wrong'
-
-
-
-Recursion is a vital aspect of YACC. Without it, you can't specify that a
-file consists of a sequence of independent commands or statements. Out of
-its own accord, YACC is only interested in the first rule, or the one you
-designate as the starting rule, with the '%start' symbol.
-
-
-Recursion in YACC comes in two flavours: right and left. Left recursion,
-which is the one you should use most of the time, looks like this:
-
-commands: /* empty */
-|
-commands command
-
-This says: a command is either empty, or it consists of more commands,
-followed by a command. They way YACC works means that it can now easily chop
-off individual command groups (from the front) and reduce them.
-
-
-Compare this to right recursion, which confusingly enough looks better to
-many eyes:
-
-commands: /* empty */
-|
-command commands
-
-But this is expensive. If used as the %start rule, it requires YACC to keep
-all commands in your file on the stack, which may take a lot of memory. So
-by all means, use left recursion when parsing long statements, like entire
-files. Sometimes it is hard to avoid right recursion but if your statements
-are not too long, you do not need to go out of your way to use left
-recursion.
-
-
-If you have something terminating (and therefore separating) your commands,
-right recursion looks very natural, but is still expensive:
-
-commands: /* empty */
-|
-command SEMICOLON commands
-
-
-
-The right way to code this is using left recursion (I didn't invent this
-either):
-
-commands: /* empty */
-|
-commands command SEMICOLON
-
-
-
-Earlier versions of this HOWTO mistakenly used right recursion. Markus
-Triska kindly informed us of this.
-
-
-
-
-!!6.3 Advanced yylval: %union
-
-
-
-Currently, we need to define *the* type of yylval. This however is not
-always appropriate. There will be times when we need to be able to handle
-multiple data types. Returning to our hypothetical thermostat, perhaps we
-want to be able to choose a heater to control, like this:
-
-
-
-
-
-heater mainbuiling
-Selected 'mainbuilding' heater
-target temperature 23
-'mainbuilding' heater target temperature now 23
-
-
-
-
-What this calls for is for yylval to be a union, which can hold both strings
-and integers - but not simultaneously.
-
-
-Remember that we told YACC previously what type yylval was supposed to by by
-defining YYSTYPE. We could conceivably define YYSTYPE to be a union this
-way, by YACC has an easier method for doing this: the %union statement.
-
-
-Based on Example 4, we now write the Example 7 YACC grammar. First the
-intro:
-
-
-%token TOKHEATER TOKHEAT TOKTARGET TOKTEMPERATURE
-%union
-{
-int number;
-char *string;
-}
-%token <number> STATE
-%token <number> NUMBER
-%token <string> WORD
-
-
-
-
-We define our union, which contains only a number and a string. Then using
-an extended %token syntax, we explain to YACC which part of the union each
-token should access.
-
-
-In this case, we let the STATE token use an integer, as before. Same goes
-for the NUMBER token, which we use for reading temperatures.
-
-
-New however is the WORD token, which is declared to need a string.
-
-
-The Lexer file changes a bit too:
-
-
-%{
-#include <stdio.h>
-#include <string.h>
-#include "y.tab.h"
-%}
-%%
-[[-9]+ yylval.number=atoi(yytext); return NUMBER;
-heater return TOKHEATER;
-heat return TOKHEAT;
-on|off yylval.number=!strcmp(yytext,"on"); return STATE;
-target return TOKTARGET;
-temperature return TOKTEMPERATURE;
-[[a-z0-9]+ yylval.string=strdup(yytext);return WORD;
-\n /* ignore end of line */;
-[[ \t]+ /* ignore whitespace */;
-%%
-
-
-
-
-As you can see, we don't access the yylval directly anymore, we add a suffix
-indicating which part we want to access. We don't need to do that in the
-YACC grammar however, as YACC performs the magic for us:
-
-
-
-
-
-heater_select:
-TOKHEATER WORD
-{
-printf("\tSelected heater '%s'\n",$2);
-heater=$2;
-}
-;
-
-
-
-
-Because of the %token declaration above, YACC automatically picks
-the 'string' member from our union. Note also that we store a copy of $2,
-which is later used to tell the user which heater he is sending commands to:
-
-
-
-
-
-target_set:
-TOKTARGET TOKTEMPERATURE NUMBER
-{
-printf("\tHeater '%s' temperature set to %d\n",heater,$3);
-}
-;
-
-
-
-
-For more details, read example7.y.
-
-
-
-----
-
-!!7. Debugging
-
-
-Especially when learning, it is important to have debugging facilities.
-Luckily, YACC can give a lot of feedback. This feedback comes at the cost of
-some overhead, so you need to supply some switches to enable it.
-
-
-When compiling your grammar, add --debug and --verbose to the YACC
-commandline. In your grammar C heading, add the following:
-
-
-int yydebug=1;
-
-
-This will generate the file 'y.output' which explains the state machine that
-was created.
-
-
-When you now run the generated binary, it will output a *lot* of what is
-happening. This includes what state the state machine currently has, and
-what tokens are being read.
-
-
-Peter Jinks wrote a page on
-debugging which
-contains some common errors and how to solve them.
-
-
-
-
-!!7.1 The state machine
-
-
-
-Internally, your YACC parser runs a so called 'state machine'. As the name
-implies, this is a machine that can be in several states. Then there are
-rules which govern transitions from one state to another. Everything starts
-with the so called 'root' rule I mentioned earlier.
-
-
-To quote from the output from the Example 7 y.output:
-
-
-state
-ZONETOK , and go to state 1
-$default reduce using rule 1 (commands)
-commands go to state 29
-command go to state 2
-zone_set go to state 3
-
-
-
-
-By default, this state reduces using the 'commands' rule. This is the
-aforementioned recursive rule that defines 'commands' to be built up from
-individual command statements, followed by a semicolon, followed by possibly
-more commands.
-
-
-This state reduces until it hits something it understands, in this case, a
-ZONETOK, ie, the word 'zone'. It then goes to state 1, which deals further
-with a zone command:
-
-
-
-
-
-state 1
-zone_set -> ZONETOK . quotedname zonecontent (rule 4)
-QUOTE , and go to state 4
-quotedname go to state 5
-
-
-
-
-The first line has a '.' in it to indicate where we are: we've just seen a
-ZONETOK and are now looking for a 'quotedname'. Apparently, a quotedname
-starts with a QUOTE, which sends us to state 4.
-
-
-To follow this further, compile Example 7 with the flags mentioned in the
-Debugging section.
-
-
-
-
-!!7.2 Conflicts: 'shift/reduce', 'reduce/reduce'
-
-
-
-Whenever YACC warns you about conflicts, you may be in for trouble. Solving
-these conflicts appears to be somewhat of an art form that may teach you a
-lot about your language. More than you possibly would have wanted to know.
-
-
-The problems revolve around how to interpret a sequence of tokens. Let's
-suppose we define a language that needs to accept both these commands:
-
-
-
-
-
-delete heater all
-delete heater number1
-
-
-
-
-To do this, we define this grammar:
-
-
-
-
-
-delete_heaters:
-TOKDELETE TOKHEATER mode
-{
-deleteheaters($3);
-}
-mode: WORD
-delete_a_heater:
-TOKDELETE TOKHEATER WORD
-{
-delete($3);
-}
-
-
-
-
-You may already be smelling trouble. The state machine starts by reading the
-word 'delete', and then needs to decide where to go based on the next token.
-This next token can either be a mode, specifying how to delete the heaters,
-or the name of a heater to delete.
-
-
-The problem however is that for both commands, the next token is going to be
-a WORD. YACC has therefore no idea what to do. This leads to
-a 'reduce/reduce' warning, and a further warning that the 'delete_a_heater'
-node is never going to be reached.
-
-
-In this case the conflict is resolved easily (ie, by renaming the first
-command to 'delete heaters all', or by making 'all' a separate token), but
-sometimes it is harder. The y.output file generated when you pass yacc the
---verbose flag can be of tremendous help.
-
-
-
-----
-
-!!8. Further reading
-
-
-GNU YACC (Bison) comes with a very nice info-file (.info) which documents
-the YACC syntax very well. It mentions Lex only once, but otherwise it's
-very good. You can read .info files with Emacs or with the very nice
-tool 'pinfo'. It is also available on the GNU site:
-BISON Manual.
-
-
-Flex comes with a good manpage which is very useful if you already
-have a rough understanding of what Flex does. The
-Flex Manual is also
-available online.
-
-
-After this introduction to Lex and YACC, you may find that you need more
-information. I haven't read any of these books yet, but they sound good:
-
-; __Bison-The Yacc-Compatible Parser Generator__:
-
-By Charles Donnelly and Richard Stallman. An
-Amazon
-user found it useful.
-
-
-
-; __Lex & Yacc__:
-
-By John R. Levine, Tony Mason and Doug Brown. Considered to be the standard
-work on this subject, although a bit dated. Reviews over at
-Amazon.
-; __Compilers : Principles, Techniques, and Tools__:
-
-By Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. The 'Dragon Book'. From
-1985 and they just keep printing it. Considered the standard work on
-constructing compilers.
-Amazon
-
-
-
-
-
-Thomas Niemann wrote a document discussing how to write compilers and
-calculators with Lex & YACC. You can find it
-here.
-
-
-The moderated usenet newsgroup comp.compilers can also be very useful but
-please keep in mind that the people there are not a dedicated parser
-helpdesk! Before posting, read their interesting
-page and especially the
-FAQ.
-
-
-Lex - A Lexical Analyzer Generator by M. E. Lesk and E. Schmidt is one of
-the original reference papers. It can be found
-here.
-
-
-Yacc: Yet Another Compiler-Compiler by Stephen C. Johnson is one of the
-original reference papers for YACC. It can be found
-here.
-It contains useful hints on style.
-
-
-
-----
-
-!!9. Acknowledgements & Thanks
-
-
-
-
-
-*Pete Jinks <pjj%cs.man.ac.uk>
-*
-
-*Chris Lattner <sabre%nondot.org>
-*
-
-*John W. Millaway <johnmillaway%yahoo.com>
-*
-
-*Martin Neitzel <neitzel%gaertner.de>
-*
-
-*Esmond Pitt <esmond.pitt%bigpond.com>
-*
-
-*Eric S. Raymond
-*
-
-*Bob Schmertz <schmertz%wam.umd.edu>
-*
-
-*Adam Sulmicki <adam%cfar.umd.edu>
-*
-
-*Markus Triska <triska%gmx.at>
-*
-
-*Erik Verbruggen <erik%road-warrior.cs.kun.nl>
-*
-
-*Gary V. Vaughan <gary%gnu
.org> (read his awesome
-Autobook)
-*
-
-*Ivo van der Wijk (Amaze Internet)
-*
-
-
-
-
-----
+Describe
[HowToLexYACCHOWTO
] here.