# To unbundle, run this file echo ss.. sed 's/.//' >ss.. <<'//GO.SYSIN DD ss..' -.RP -.ND "July 31, 1978" -.TL -Yacc: -Yet Another Compiler-Compiler -.AU "MH 2C-559" 3968 -Stephen C. Johnson -.AI -.MH -.AB -.PP -Computer program input generally has some structure; -in fact, every computer program that does input can be thought of as defining -an ``input language'' which it accepts. -An input language may be as complex as a programming language, or as simple as -a sequence of numbers. -Unfortunately, usual input facilities -are limited, difficult to use, -and often are lax about checking their inputs for validity. -.PP -Yacc provides a general tool for describing -the input to a computer program. -The Yacc user specifies the structures -of his input, together with code to be invoked as -each such structure is recognized. -Yacc turns such a specification into a subroutine that -handles the input process; -frequently, it is convenient and appropriate to have most -of the flow of control in the user's application -handled by this subroutine. -.PP -The input subroutine produced by Yacc calls a user-supplied routine to -return the next basic input item. -Thus, the user can specify his input in terms of individual input characters, or -in terms of higher level constructs such as names and numbers. -The user-supplied routine may also handle idiomatic features such as -comment and continuation conventions, which typically defy easy grammatical specification. -.PP -Yacc is written in portable C. -The class of specifications accepted is a very general one: LALR(1) -grammars with disambiguating rules. -.PP -In addition to compilers for C, APL, Pascal, RATFOR, etc., Yacc -has also been used for less conventional languages, -including a phototypesetter language, several desk calculator languages, a document retrieval system, -and a Fortran debugging system. -.AE -.OK -Computer Languages -Compilers -Formal Language Theory -.CS 23 11 34 0 0 8 //GO.SYSIN DD ss.. 
echo ss0 sed 's/.//' >ss0 <<'//GO.SYSIN DD ss0' -.SH -0: Introduction -.PP -Yacc provides a general tool for imposing structure on the input to a computer program. -The Yacc user prepares a -specification of the input process; this includes rules -describing the input structure, code to be invoked when these -rules are recognized, and a low-level routine to do the -basic input. -Yacc then generates a function to control the input process. -This function, called a -.I parser , -calls the user-supplied low-level input routine -(the -.I "lexical analyzer" ) -to pick up the basic items -(called -.I tokens ) -from the input stream. -These tokens are organized according to the input structure rules, -called -.I "grammar rules" \|; -when one of these rules has been recognized, -then user code supplied for this rule, an -.I action , -is invoked; actions have the ability to return values and -make use of the values of other actions. -.PP -Yacc is written in a portable dialect of C -.[ -Ritchie Kernighan Language Prentice -.] -and the actions, and output subroutine, are in C as well. -Moreover, many of the syntactic conventions of Yacc follow C. -.PP -The heart of the input specification is a collection of grammar rules. -Each rule describes an allowable structure and gives it a name. -For example, one grammar rule might be -.DS -date : month\_name day \',\' year ; -.DE -Here, -.I date , -.I month\_name , -.I day , -and -.I year -represent structures of interest in the input process; -presumably, -.I month\_name , -.I day , -and -.I year -are defined elsewhere. -The comma ``,'' is enclosed in single quotes; this implies that the -comma is to appear literally in the input. -The colon and semicolon merely serve as punctuation in the rule, and have -no significance in controlling the input. -Thus, with proper definitions, the input -.DS -July 4, 1776 -.DE -might be matched by the above rule. -.PP -An important part of the input process is carried out by the -lexical analyzer. 
-This user routine reads the input stream, recognizing the lower level structures, -and communicates these tokens -to the parser. -For historical reasons, a structure recognized by the lexical analyzer is called a -.I "terminal symbol" , -while the structure recognized by the parser is called a -.I "nonterminal symbol" . -To avoid confusion, terminal symbols will usually be referred to as -.I tokens . -.PP -There is considerable leeway in deciding whether to recognize structures using the lexical -analyzer or grammar rules. -For example, the rules -.DS -month\_name : \'J\' \'a\' \'n\' ; -month\_name : \'F\' \'e\' \'b\' ; - - . . . - -month\_name : \'D\' \'e\' \'c\' ; -.DE -might be used in the above example. -The lexical analyzer would only need to recognize individual letters, and -.I month\_name -would be a nonterminal symbol. -Such low-level rules tend to waste time and space, and may -complicate the specification beyond Yacc's ability to deal with it. -Usually, the lexical analyzer would -recognize the month names, -and return an indication that a -.I month\_name -was seen; in this case, -.I month\_name -would be a token. -.PP -Literal characters such as ``,'' must also be passed through the lexical -analyzer, and are also considered tokens. -.PP -Specification files are very flexible. -It is relatively easy to add to the above example the rule -.DS -date : month \'/\' day \'/\' year ; -.DE -allowing -.DS -7 / 4 / 1776 -.DE -as a synonym for -.DS -July 4, 1776 -.DE -In most cases, this new rule could be ``slipped in'' to a working system with minimal effort, -and little danger of disrupting existing input. -.PP -The input being read may not conform to the -specifications. -These input errors are detected as early as is theoretically possible with a -left-to-right scan; -thus, not only is the chance of reading and computing with bad -input data substantially reduced, but the bad data can usually be quickly found. 
-Error handling, -provided as part of the input specifications, -permits the reentry of bad data, -or the continuation of the input process after skipping over the bad data. -.PP -In some cases, Yacc fails to produce a parser when given a set of -specifications. -For example, the specifications may be self contradictory, or they may -require a more powerful recognition mechanism than that available to Yacc. -The former cases represent design errors; -the latter cases -can often be corrected -by making -the lexical analyzer -more powerful, or by rewriting some of the grammar rules. -While Yacc cannot handle all possible specifications, its power -compares favorably with similar systems; -moreover, the -constructions which are difficult for Yacc to handle are -also frequently difficult for human beings to handle. -Some users have reported that the discipline of formulating valid -Yacc specifications for their input revealed errors of -conception or design early in the program development. -.PP -The theory underlying Yacc has been described elsewhere. -.[ -Aho Johnson Surveys LR Parsing -.] -.[ -Aho Johnson Ullman Ambiguous Grammars -.] -.[ -Aho Ullman Principles Compiler Design -.] -Yacc has been extensively used in numerous practical applications, -including -.I lint , -.[ -Johnson Lint -.] -the Portable C Compiler, -.[ -Johnson Portable Compiler Theory -.] -and a system for typesetting mathematics. -.[ -Kernighan Cherry typesetting system CACM -.] -.PP -The next several sections describe the -basic process of preparing a Yacc specification; -Section 1 describes the preparation of grammar rules, -Section 2 the preparation of the user supplied actions associated with these rules, -and Section 3 the preparation of lexical analyzers. -Section 4 describes the operation of the parser. -Section 5 discusses various reasons why Yacc may be unable to produce a -parser from a specification, and what to do about it. 
-Section 6 describes a simple mechanism for -handling operator precedences in arithmetic expressions. -Section 7 discusses error detection and recovery. -Section 8 discusses the operating environment and special features -of the parsers Yacc produces. -Section 9 gives some suggestions which should improve the -style and efficiency of the specifications. -Section 10 discusses some advanced topics, and Section 11 gives -acknowledgements. -Appendix A has a brief example, and Appendix B gives a -summary of the Yacc input syntax. -Appendix C gives an example using some of the more advanced -features of Yacc, and, finally, -Appendix D describes mechanisms and syntax -no longer actively supported, but -provided for historical continuity with older versions of Yacc. //GO.SYSIN DD ss0 echo ss1 sed 's/.//' >ss1 <<'//GO.SYSIN DD ss1' -.tr *\(** -.tr |\(or -.SH -1: Basic Specifications -.PP -Names refer to either tokens or nonterminal symbols. -Yacc requires -token names to be declared as such. -In addition, for reasons discussed in Section 3, it is often desirable -to include the lexical analyzer as part of the specification file; -it may be useful to include other programs as well. -Thus, every specification file consists of three sections: -the -.I declarations , -.I "(grammar) rules" , -and -.I programs . -The sections are separated by double percent ``%%'' marks. -(The percent ``%'' is generally used in Yacc specifications as an escape character.) -.PP -In other words, a full specification file looks like -.DS -declarations -%% -rules -%% -programs -.DE -.PP -The declaration section may be empty. -Moreover, if the programs section is omitted, the second %% mark may be omitted also; -thus, the smallest legal Yacc specification is -.DS -%% -rules -.DE -.PP -Blanks, tabs, and newlines are ignored except -that they may not appear in names or multi-character reserved symbols. -Comments may appear wherever a name is legal; they are enclosed -in /* . . . */, as in C and PL/I. 
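To make the three-part layout concrete, here is a sketch of a complete (if trivial) specification. The token names and the single rule are hypothetical, adapted from the date example of the Introduction, and the programs section is only indicated:

```yacc
/* declarations section */
%token MONTH_NAME DAY YEAR

%%  /* rules section */

date : MONTH_NAME DAY ',' YEAR ;

%%  /* programs section */

/* yylex() and any other supporting C code would go here */
```

Here the lexical analyzer is assumed to recognize month names, days, and years itself, so all three appear as tokens rather than as nonterminal symbols.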
-.PP -The rules section is made up of one or more grammar rules. -A grammar rule has the form: -.DS -A : BODY ; -.DE -A represents a nonterminal name, and BODY represents a sequence of zero or more names and literals. -The colon and the semicolon are Yacc punctuation. -.PP -Names may be of arbitrary length, and may be made up of letters, dot ``.'', underscore ``\_'', and -non-initial digits. -Upper and lower case letters are distinct. -The names used in the body of a grammar rule may represent tokens or nonterminal symbols. -.PP -A literal consists of a character enclosed in single quotes ``\'''. -As in C, the backslash ``\e'' is an escape character within literals, and all the C escapes -are recognized. -Thus -.DS -\'\en\' newline -\'\er\' return -\'\e\'\' single quote ``\''' -\'\e\e\' backslash ``\e'' -\'\et\' tab -\'\eb\' backspace -\'\ef\' form feed -\'\exxx\' ``xxx'' in octal -.DE -For a number of technical reasons, the -\s-2NUL\s0 -character (\'\e0\' or 0) should never -be used in grammar rules. -.PP -If there are several grammar rules with the same left hand side, the vertical bar ``|'' -can be used to avoid rewriting the left hand side. -In addition, -the semicolon at the end of a rule can be dropped before a vertical bar. -Thus the grammar rules -.DS -A : B C D ; -A : E F ; -A : G ; -.DE -can be given to Yacc as -.DS -A : B C D - | E F - | G - ; -.DE -It is not necessary that all grammar rules with the same left side appear together in the grammar rules section, -although it makes the input much more readable, and easier to change. -.PP -If a nonterminal symbol matches the empty string, this can be indicated in the obvious way: -.DS -empty : ; -.DE -.PP -Names representing tokens must be declared; this is most simply done by writing -.DS -%token name1 name2 . . . -.DE -in the declarations section. -(See Sections 3, 5, and 6 for much more discussion). -Every name not defined in the declarations section is assumed to represent a nonterminal symbol. 
-Every nonterminal symbol must appear on the left side of at least one rule. -.PP -Of all the nonterminal symbols, one, called the -.I "start symbol" , -has particular importance. -The parser is designed to recognize the start symbol; thus, -this symbol represents the largest, -most general structure described by the grammar rules. -By default, -the start symbol is taken to be the left hand side of the first -grammar rule in the rules section. -It is possible, and in fact desirable, to declare the start -symbol explicitly in the declarations section using the %start keyword: -.DS -%start symbol -.DE -.PP -The end of the input to the parser is signaled by a special token, called the -.I endmarker . -If the tokens up to, but not including, the endmarker form a structure -which matches the start symbol, the parser function returns to its caller -after the endmarker is seen; it -.I accepts -the input. -If the endmarker is seen in any other context, it is an error. -.PP -It is the job of the user-supplied lexical analyzer -to return the endmarker when appropriate; see section 3, below. -Usually the endmarker represents some reasonably obvious -I/O status, such as ``end-of-file'' or ``end-of-record''. //GO.SYSIN DD ss1 echo ss2 sed 's/.//' >ss2 <<'//GO.SYSIN DD ss2' -.SH -2: Actions -.PP -With each grammar rule, the user may associate actions to be performed each time -the rule is recognized in the input process. -These actions may return values, and may obtain the values returned by previous -actions. -Moreover, the lexical analyzer can return values -for tokens, if desired. -.PP -An action is an arbitrary C statement, and as such can do -input and output, call subprograms, and alter -external vectors and variables. -An action is specified by -one or more statements, enclosed in curly braces ``{'' and ``}''. 
-For example, -.DS -A : \'(\' B \')\' - { hello( 1, "abc" ); } -.DE -and -.DS -XXX : YYY ZZZ - { printf("a message\en"); - flag = 25; } -.DE -are grammar rules with actions. -.PP -To facilitate easy communication between the actions and the parser, the action statements are altered -slightly. -The symbol ``dollar sign'' ``$'' is used as a signal to Yacc in this context. -.PP -To return a value, the action normally sets the -pseudo-variable ``$$'' to some value. -For example, an action that does nothing but return the value 1 is -.DS - { $$ = 1; } -.DE -.PP -To obtain the values returned by previous actions and the lexical analyzer, the -action may use the pseudo-variables $1, $2, . . ., -which refer to the values returned by the -components of the right side of a rule, reading from left to right. -Thus, if the rule is -.DS -A : B C D ; -.DE -for example, then $2 has the value returned by C, and $3 the value returned by D. -.PP -As a more concrete example, consider the rule -.DS -expr : \'(\' expr \')\' ; -.DE -The value returned by this rule is usually the value of the -.I expr -in parentheses. -This can be indicated by -.DS -expr : \'(\' expr \')\' { $$ = $2 ; } -.DE -.PP -By default, the value of a rule is the value of the first element in it ($1). -Thus, grammar rules of the form -.DS -A : B ; -.DE -frequently need not have an explicit action. -.PP -In the examples above, all the actions came at the end of their rules. -Sometimes, it is desirable to get control before a rule is fully parsed. -Yacc permits an action to be written in the middle of a rule as well -as at the end. -This action is assumed to return a value, accessible -through the usual \$ mechanism by the actions to -the right of it. -In turn, it may access the values -returned by the symbols to its left. -Thus, in the rule -.DS -A : B - { $$ = 1; } - C - { x = $2; y = $3; } - ; -.DE -the effect is to set -.I x -to 1, and -.I y -to the value returned by C. 
-.PP -Actions that do not terminate a rule are actually -handled by Yacc by manufacturing a new nonterminal -symbol name, and a new rule matching this -name to the empty string. -The interior action is the action triggered off by recognizing -this added rule. -Yacc actually treats the above example as if -it had been written: -.DS -$ACT : /* empty */ - { $$ = 1; } - ; - -A : B $ACT C - { x = $2; y = $3; } - ; -.DE -.PP -In many applications, output is not done directly by the actions; -rather, a data structure, such as a parse tree, is constructed in memory, -and transformations are applied to it before output is generated. -Parse trees are particularly easy to -construct, given routines to build and maintain the tree -structure desired. -For example, suppose there is a C function -.I node , -written so that the call -.DS -node( L, n1, n2 ) -.DE -creates a node with label L, and descendants n1 and n2, and returns the index of -the newly created node. -Then a parse tree can be built by supplying actions such as: -.DS -expr : expr \'+\' expr - { $$ = node( \'+\', $1, $3 ); } -.DE -in the specification. -.PP -The user may define other variables to be used by the actions. -Declarations and definitions can appear in -the declarations section, -enclosed in the marks ``%{'' and ``%}''. -These declarations and definitions have global scope, -so they are known to the action statements and the lexical analyzer. -For example, -.DS -%{ int variable = 0; %} -.DE -could be placed in the declarations section, -making -.I variable -accessible to all of the actions. -The Yacc parser uses only names beginning in ``yy''; -the user should avoid such names. -.PP -In these examples, all the values are integers: a discussion of -values of other types will be found in Section 10. 
//GO.SYSIN DD ss2 echo ss3 sed 's/.//' >ss3 <<'//GO.SYSIN DD ss3' -.SH -3: Lexical Analysis -.PP -The user must supply a lexical analyzer to read the input stream and communicate tokens -(with values, if desired) to the parser. -The lexical analyzer is an integer-valued function called -.I yylex . -The function returns an integer, the -.I "token number" , -representing the kind of token read. -If there is a value associated with that token, it should be assigned -to the external variable -.I yylval . -.PP -The parser and the lexical analyzer must agree on these token numbers in order for -communication between them to take place. -The numbers may be chosen by Yacc, or chosen by the user. -In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer -to return these numbers symbolically. -For example, suppose that the token name DIGIT has been defined in the declarations section of the -Yacc specification file. -The relevant portion of the lexical analyzer might look like: -.DS -yylex(){ - extern int yylval; - int c; - . . . - c = getchar(); - . . . - switch( c ) { - . . . - case \'0\': - case \'1\': - . . . - case \'9\': - yylval = c\-\'0\'; - return( DIGIT ); - . . . - } - . . . -.DE -.PP -The intent is to return a token number of DIGIT, and a value equal to the numerical value of the -digit. -Provided that the lexical analyzer code is placed in the programs section of the specification file, -the identifier DIGIT will be defined as the token number associated -with the token DIGIT. -.PP -This mechanism leads to clear, -easily modified lexical analyzers; the only pitfall is the need -to avoid using any token names in the grammar that are reserved -or significant in C or the parser; for example, the use of -token names -.I if -or -.I while -will almost certainly cause severe -difficulties when the lexical analyzer is compiled. -The token name -.I error -is reserved for error handling, and should not be used naively -(see Section 7). 
-.PP -As mentioned above, the token numbers may be chosen by Yacc or by the user. -In the default situation, the numbers are chosen by Yacc. -The default token number for a literal -character is the numerical value of the character in the local character set. -Other names are assigned token numbers -starting at 257. -.PP -To assign a token number to a token (including literals), -the first appearance of the token name or literal -.I -in the declarations section -.R -can be immediately followed by -a nonnegative integer. -This integer is taken to be the token number of the name or literal. -Names and literals not defined by this mechanism retain their default definition. -It is important that all token numbers be distinct. -.PP -For historical reasons, the endmarker must have token -number 0 or negative. -This token number cannot be redefined by the user; thus, all -lexical analyzers should be prepared to return 0 or negative as a token number -upon reaching the end of their input. -.PP -A very useful tool for constructing lexical analyzers is -the -.I Lex -program developed by Mike Lesk. -.[ -Lesk Lex -.] -These lexical analyzers are designed to work in close -harmony with Yacc parsers. -The specifications for these lexical analyzers -use regular expressions instead of grammar rules. -Lex can be easily used to produce quite complicated lexical analyzers, -but there remain some languages (such as FORTRAN) which do not -fit any theoretical framework, and whose lexical analyzers -must be crafted by hand. //GO.SYSIN DD ss3 echo ss4 sed 's/.//' >ss4 <<'//GO.SYSIN DD ss4' -.SH -4: How the Parser Works -.PP -Yacc turns the specification file into a C program, which -parses the input according to the specification given. -The algorithm used to go from the -specification to the parser is complex, and will not be discussed -here (see -the references for more information). 
-The parser itself, however, is relatively simple, -and understanding how it works, while -not strictly necessary, will nevertheless make -treatment of error recovery and ambiguities much more -comprehensible. -.PP -The parser produced by Yacc consists -of a finite state machine with a stack. -The parser is also capable of reading and remembering the next -input token (called the -.I lookahead -token). -The -.I "current state" -is always the one on the top of the stack. -The states of the finite state machine are given -small integer labels; initially, the machine is in state 0, -the stack contains only state 0, and no lookahead token has been read. -.PP -The machine has only four actions available to it, called -.I shift , -.I reduce , -.I accept , -and -.I error . -A move of the parser is done as follows: -.IP 1. -Based on its current state, the parser decides -whether it needs a lookahead token to decide -what action should be done; if it needs one, and does -not have one, it calls -.I yylex -to obtain the next token. -.IP 2. -Using the current state, and the lookahead token -if needed, the parser decides on its next action, and -carries it out. -This may result in states being pushed onto the stack, or popped off of -the stack, and in the lookahead token being processed -or left alone. -.PP -The -.I shift -action is the most common action the parser takes. -Whenever a shift action is taken, there is always -a lookahead token. -For example, in state 56 there may be -an action: -.DS - IF shift 34 -.DE -which says, in state 56, if the lookahead token is IF, -the current state (56) is pushed down on the stack, -and state 34 becomes the current state (on the -top of the stack). -The lookahead token is cleared. -.PP -The -.I reduce -action keeps the stack from growing without -bounds. 
-Reduce actions are appropriate when the parser has seen -the right hand side of a grammar rule, -and is prepared to announce that it has seen -an instance of the rule, replacing the right hand side -by the left hand side. -It may be necessary to consult the lookahead token -to decide whether to reduce, but -usually it is not; in fact, the default -action (represented by a ``.'') is often a reduce action. -.PP -Reduce actions are associated with individual grammar rules. -Grammar rules are also given small integer -numbers, leading to some confusion. -The action -.DS - \fB.\fR reduce 18 -.DE -refers to -.I "grammar rule" -18, while the action -.DS - IF shift 34 -.DE -refers to -.I state -34. -.PP -Suppose the rule being reduced is -.DS -A \fB:\fR x y z ; -.DE -The reduce action depends on the -left hand symbol (A in this case), and the number of -symbols on the right hand side (three in this case). -To reduce, first pop off the top three states -from the stack -(In general, the number of states popped equals the number of symbols on the -right side of the rule). -In effect, these states were the ones -put on the stack while recognizing -.I x , -.I y , -and -.I z , -and no longer serve any useful purpose. -After popping these states, a state is uncovered -which was the state the parser was in before beginning to -process the rule. -Using this uncovered state, and the symbol -on the left side of the rule, perform what is in -effect a shift of A. -A new state is obtained, pushed onto the stack, and parsing continues. -There are significant differences between the processing of -the left hand symbol and an ordinary shift of a token, -however, so this action is called a -.I goto -action. -In particular, the lookahead token is cleared by a shift, and -is not affected by a goto. -In any case, the uncovered state contains an entry such as: -.DS - A goto 20 -.DE -causing state 20 to be pushed -onto the stack, and become the current state. 
-.PP -In effect, the reduce action ``turns back the clock'' in the parse, -popping the states off the stack to go back to the -state where the right hand side of the rule was first seen. -The parser then behaves as if it had seen the left side at that time. -If the right hand side of the rule is empty, -no states are popped off of the stack: the uncovered state -is in fact the current state. -.PP -The reduce action is also important in the treatment of user-supplied -actions and values. -When a rule is reduced, the code supplied with the rule is executed -before the stack is adjusted. -In addition to the stack holding the states, another stack, -running in parallel with it, holds the values returned -from the lexical analyzer and the actions. -When a shift takes place, the external variable -.I yylval -is copied onto the value stack. -After the return from the user code, the reduction is carried out. -When the -.I goto -action is done, the external variable -.I yyval -is copied onto the value stack. -The pseudo-variables $1, $2, etc., refer to the value stack. -.PP -The other two parser actions are conceptually much simpler. -The -.I accept -action indicates that the entire input has been seen and -that it matches the specification. -This action appears only when the lookahead token is -the endmarker, and indicates that the parser has successfully -done its job. -The -.I error -action, on the other hand, represents a place where the parser -can no longer continue parsing according to the specification. -The input tokens it has seen, together with the lookahead token, -cannot be followed by anything that would result -in a legal input. -The parser reports an error, and attempts to recover the situation and -resume parsing: the error recovery (as opposed to the detection of error) -will be covered in Section 7. -.PP -It is time for an example! 
-Consider the specification -.DS -%token DING DONG DELL -%% -rhyme : sound place - ; -sound : DING DONG - ; -place : DELL - ; -.DE -.PP -When Yacc is invoked with the -.B \-v -option, a file called -.I y.output -is produced, with a human-readable description of the parser. -The -.I y.output -file corresponding to the above grammar (with some statistics -stripped off the end) is: -.DS -state 0 - $accept : \_rhyme $end - - DING shift 3 - . error - - rhyme goto 1 - sound goto 2 - -state 1 - $accept : rhyme\_$end - - $end accept - . error - -state 2 - rhyme : sound\_place - - DELL shift 5 - . error - - place goto 4 - -state 3 - sound : DING\_DONG - - DONG shift 6 - . error - -state 4 - rhyme : sound place\_ (1) - - . reduce 1 - -state 5 - place : DELL\_ (3) - - . reduce 3 - -state 6 - sound : DING DONG\_ (2) - - . reduce 2 -.DE -Notice that, in addition to the actions for each state, there is a -description of the parsing rules being processed in each -state. The \_ character is used to indicate -what has been seen, and what is yet to come, in each rule. -Suppose the input is -.DS -DING DONG DELL -.DE -It is instructive to follow the steps of the parser while -processing this input. -.PP -Initially, the current state is state 0. -The parser needs to refer to the input in order to decide -between the actions available in state 0, so -the first token, -.I DING , -is read, becoming the lookahead token. -The action in state 0 on -.I DING -is ``shift 3'', so state 3 is pushed onto the stack, -and the lookahead token is cleared. -State 3 becomes the current state. -The next token, -.I DONG , -is read, becoming the lookahead token. -The action in state 3 on the token -.I DONG -is ``shift 6'', -so state 6 is pushed onto the stack, and the lookahead is cleared. -The stack now contains 0, 3, and 6. -In state 6, without even consulting the lookahead, -the parser reduces by rule 2. 
-.DS - sound : DING DONG -.DE -This rule has two symbols on the right hand side, so -two states, 6 and 3, are popped off of the stack, uncovering state 0. -Consulting the description of state 0, looking for a goto on -.I sound , -.DS - sound goto 2 -.DE -is obtained; thus state 2 is pushed onto the stack, -becoming the current state. -.PP -In state 2, the next token, -.I DELL , -must be read. -The action is ``shift 5'', so state 5 is pushed onto the stack, which now has -0, 2, and 5 on it, and the lookahead token is cleared. -In state 5, the only action is to reduce by rule 3. -This has one symbol on the right hand side, so one state, 5, -is popped off, and state 2 is uncovered. -The goto in state 2 on -.I place , -the left side of rule 3, -is state 4. -Now, the stack contains 0, 2, and 4. -In state 4, the only action is to reduce by rule 1. -There are two symbols on the right, so the top two states are popped off, -uncovering state 0 again. -In state 0, there is a goto on -.I rhyme -causing the parser to enter state 1. -In state 1, the input is read; the endmarker is obtained, -indicated by ``$end'' in the -.I y.output -file. -The action in state 1 when the endmarker is seen is to accept, -successfully ending the parse. -.PP -The reader is urged to consider how the parser works -when confronted with such incorrect strings as -.I "DING DONG DONG" , -.I "DING DONG" , -.I "DING DONG DELL DELL" , -etc. -A few minutes spent with this and other simple examples will -probably be repaid when problems arise in more complicated contexts. //GO.SYSIN DD ss4 echo ss5 sed 's/.//' >ss5 <<'//GO.SYSIN DD ss5' -.SH -5: Ambiguity and Conflicts -.PP -A set of grammar rules is -.I ambiguous -if there is some input string that can be structured in two or more different ways. 
-For example, the grammar rule -.DS -expr : expr \'\-\' expr -.DE -is a natural way of expressing the fact that one way of forming an arithmetic expression -is to put two other expressions together with a minus sign between them. -Unfortunately, this grammar rule does not -completely specify the way that all complex inputs -should be structured. -For example, if the input is -.DS -expr \- expr \- expr -.DE -the rule allows this input to be structured as either -.DS -( expr \- expr ) \- expr -.DE -or as -.DS -expr \- ( expr \- expr ) -.DE -(The first is called -.I "left association" , -the second -.I "right association" ). -.PP -Yacc detects such ambiguities when it is attempting to build the parser. -It is instructive to consider the problem that confronts the parser when it is -given an input such as -.DS -expr \- expr \- expr -.DE -When the parser has read the second expr, the input that it has seen: -.DS -expr \- expr -.DE -matches the right side of the grammar rule above. -The parser could -.I reduce -the input by applying this rule; -after applying the rule, -the input is reduced to -.I expr (the -left side of the rule). -The parser would then read the final part of the input: -.DS -\- expr -.DE -and again reduce. -The effect of this is to take the left associative interpretation. -.PP -Alternatively, when the parser has seen -.DS -expr \- expr -.DE -it could defer the immediate application of the rule, and continue reading -the input until it had seen -.DS -expr \- expr \- expr -.DE -It could then apply the rule to the rightmost three symbols, reducing them to -.I expr -and leaving -.DS -expr \- expr -.DE -Now the rule can be reduced once more; the effect is to -take the right associative interpretation. -Thus, having read -.DS -expr \- expr -.DE -the parser can do two legal things, a shift or a reduction, and has no way of -deciding between them. -This is called a -.I "shift / reduce conflict" . 
-It may also happen that the parser has a choice of two legal reductions; -this is called a -.I "reduce / reduce conflict" . -Note that there are never any ``Shift/shift'' conflicts. -.PP -When there are shift/reduce or reduce/reduce conflicts, Yacc still produces a parser. -It does this by selecting one of the valid steps wherever it has a choice. -A rule describing which choice to make in a given situation is called -a -.I "disambiguating rule" . -.PP -Yacc invokes two disambiguating rules by default: -.IP 1. -In a shift/reduce conflict, the default is to do the shift. -.IP 2. -In a reduce/reduce conflict, the default is to reduce by the -.I earlier -grammar rule (in the input sequence). -.PP -Rule 1 implies that reductions are deferred whenever there is a choice, -in favor of shifts. -Rule 2 gives the user rather crude control over the behavior of the parser -in this situation, but reduce/reduce conflicts should be avoided whenever possible. -.PP -Conflicts may arise because of mistakes in input or logic, or because the grammar rules, while consistent, -require a more complex parser than Yacc can construct. -The use of actions within rules can also cause conflicts, if the action must -be done before the parser can be sure which rule is being recognized. -In these cases, the application of disambiguating rules is inappropriate, -and leads to an incorrect parser. -For this reason, Yacc -always reports the number of shift/reduce and reduce/reduce conflicts resolved by Rule 1 and Rule 2. -.PP -In general, whenever it is possible to apply disambiguating rules to produce a correct parser, it is also -possible to rewrite the grammar rules so that the same inputs are read but there are no -conflicts. -For this reason, most previous parser generators -have considered conflicts to be fatal errors. -Our experience has suggested that this rewriting is somewhat unnatural, -and produces slower parsers; thus, Yacc will produce parsers even in the presence of conflicts. 
-.PP -As an example of the power of disambiguating rules, consider a fragment from a programming -language involving an ``if-then-else'' construction: -.DS -stat : IF \'(\' cond \')\' stat - | IF \'(\' cond \')\' stat ELSE stat - ; -.DE -In these rules, -.I IF -and -.I ELSE -are tokens, -.I cond -is a nonterminal symbol describing -conditional (logical) expressions, and -.I stat -is a nonterminal symbol describing statements. -The first rule will be called the -.ul -simple-if -rule, and the -second the -.ul -if-else -rule. -.PP -These two rules form an ambiguous construction, since input of the form -.DS -IF ( C1 ) IF ( C2 ) S1 ELSE S2 -.DE -can be structured according to these rules in two ways: -.DS -IF ( C1 ) { - IF ( C2 ) S1 - } -ELSE S2 -.DE -or -.DS -IF ( C1 ) { - IF ( C2 ) S1 - ELSE S2 - } -.DE -The second interpretation is the one given in most programming languages -having this construct. -Each -.I ELSE -is associated with the last preceding -``un-\fIELSE'\fRd'' -.I IF . -In this example, consider the situation where the parser has seen -.DS -IF ( C1 ) IF ( C2 ) S1 -.DE -and is looking at the -.I ELSE . -It can immediately -reduce -by the simple-if rule to get -.DS -IF ( C1 ) stat -.DE -and then read the remaining input, -.DS -ELSE S2 -.DE -and reduce -.DS -IF ( C1 ) stat ELSE S2 -.DE -by the if-else rule. -This leads to the first of the above groupings of the input. -.PP -On the other hand, the -.I ELSE -may be shifted, -.I S2 -read, and then the right hand portion of -.DS -IF ( C1 ) IF ( C2 ) S1 ELSE S2 -.DE -can be reduced by the if-else rule to get -.DS -IF ( C1 ) stat -.DE -which can be reduced by the simple-if rule. -This leads to the second of the above groupings of the input, which -is usually desired. -.PP -Once again the parser can do two valid things \- there is a shift/reduce conflict. -The application of disambiguating rule 1 tells the parser to shift in this case, -which leads to the desired grouping. 
-.PP -This shift/reduce conflict arises only when there is a particular current input symbol, -.I ELSE , -and particular inputs already seen, such as -.DS -IF ( C1 ) IF ( C2 ) S1 -.DE -In general, there may be many conflicts, and each one -will be associated with an input symbol and -a set of previously read inputs. -The previously read inputs are characterized by the -state -of the parser. -.PP -The conflict messages of Yacc are best understood -by examining the verbose (\fB\-v\fR) option output file. -For example, the output corresponding to the above -conflict state might be: -.DS L -23: shift/reduce conflict (shift 45, reduce 18) on ELSE - -state 23 - - stat : IF ( cond ) stat\_ (18) - stat : IF ( cond ) stat\_ELSE stat - - ELSE shift 45 - . reduce 18 - -.DE -The first line describes the conflict, giving the state and the input symbol. -The ordinary state description follows, giving -the grammar rules active in the state, and the parser actions. -Recall that the underline marks the -portion of the grammar rules which has been seen. -Thus in the example, in state 23 the parser has seen input corresponding -to -.DS -IF ( cond ) stat -.DE -and the two grammar rules shown are active at this time. -The parser can do two possible things. -If the input symbol is -.I ELSE , -it is possible to shift into state -45. -State 45 will have, as part of its description, the line -.DS -stat : IF ( cond ) stat ELSE\_stat -.DE -since the -.I ELSE -will have been shifted in this state. -Back in state 23, the alternative action, described by ``\fB.\fR'', -is to be done if the input symbol is not mentioned explicitly in the above actions; thus, -in this case, if the input symbol is not -.I ELSE , -the parser reduces by grammar rule 18: -.DS -stat : IF \'(\' cond \')\' stat -.DE -Once again, notice that the numbers following ``shift'' commands refer to other states, -while the numbers following ``reduce'' commands refer to grammar -rule numbers. 
-In the -.I y.output -file, the rule numbers are printed after those rules which can be reduced. -In most states, there will be at most one reduce action possible in the -state, and this will be the default command. -The user who encounters unexpected shift/reduce conflicts will probably want to -look at the verbose output to decide whether the default actions are appropriate. -In really tough cases, the user might need to know more about -the behavior and construction of the parser than can be covered here. -In this case, one of the theoretical references -.[ -Aho Johnson Surveys Parsing -.] -.[ -Aho Johnson Ullman Deterministic Ambiguous -.] -.[ -Aho Ullman Principles Design -.] -might be consulted; the services of a local guru might also be appropriate. //GO.SYSIN DD ss5 echo ss6 sed 's/.//' >ss6 <<'//GO.SYSIN DD ss6' -.SH -6: Precedence -.PP -There is one common situation -where the rules given above for resolving conflicts are not sufficient; -this is in the parsing of arithmetic expressions. -Most of the commonly used constructions for arithmetic expressions can be naturally -described by the notion of -.I precedence -levels for operators, together with information about left -or right associativity. -It turns out that ambiguous grammars with appropriate disambiguating rules -can be used to create parsers that are faster and easier to -write than parsers constructed from unambiguous grammars. -The basic notion is to write grammar rules -of the form -.DS -expr : expr OP expr -.DE -and -.DS -expr : UNARY expr -.DE -for all binary and unary operators desired. -This creates a very ambiguous grammar, with many parsing conflicts. -As disambiguating rules, the user specifies the precedence, or binding -strength, of all the operators, and the associativity -of the binary operators.
-This information is sufficient to allow Yacc to resolve the parsing conflicts -in accordance with these rules, and construct a parser that realizes the desired -precedences and associativities. -.PP -The precedences and associativities are attached to tokens in the declarations section. -This is done by a series of lines beginning with a Yacc keyword: %left, %right, -or %nonassoc, followed by a list of tokens. -All of the tokens on the same line are assumed to have the same precedence level -and associativity; the lines are listed in -order of increasing precedence or binding strength. -Thus, -.DS -%left \'+\' \'\-\' -%left \'*\' \'/\' -.DE -describes the precedence and associativity of the four arithmetic operators. -Plus and minus are left associative, and have lower precedence than -star and slash, which are also left associative. -The keyword %right is used to describe right associative operators, -and the keyword %nonassoc is used to describe operators, like -the operator .LT. in Fortran, that may not associate with themselves; thus, -.DS -A .LT. B .LT. C -.DE -is illegal in Fortran, and such an operator would be described with the keyword -%nonassoc in Yacc. -As an example of the behavior of these declarations, the description -.DS -%right \'=\' -%left \'+\' \'\-\' -%left \'*\' \'/\' - -%% - -expr : expr \'=\' expr - | expr \'+\' expr - | expr \'\-\' expr - | expr \'*\' expr - | expr \'/\' expr - | NAME - ; -.DE -might be used to structure the input -.DS -a = b = c*d \- e \- f*g -.DE -as follows: -.DS -a = ( b = ( ((c*d)\-e) \- (f*g) ) ) -.DE -When this mechanism is used, -unary operators must, in general, be given a precedence. -Sometimes a unary operator and a binary operator -have the same symbolic representation, but different precedences. -An example is unary and binary \'\-\'; unary minus may be given the same -strength as multiplication, or even higher, while binary minus has a lower strength than -multiplication. 
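The grouping shown above for a = b = c*d \- e \- f*g can be checked by hand. The following C sketch is independent of Yacc (the function names group and expr are invented for illustration); it uses the precedence climbing technique with the same declarations: \'=\' right associative at the lowest level, and \'+\' \'\-\' and \'*\' \'/\' left associative at successively higher levels.

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* Sketch, not Yacc output: regroup an expression according to the
   precedences and associativities declared in the text above.
   A NAME is a single lower case letter. */

static const char *src;

static int prec(char op) {
	switch (op) {
	case '=':           return 1;
	case '+': case '-': return 2;
	case '*': case '/': return 3;
	default:            return 0;	/* not an operator */
	}
}

static int right_assoc(char op) { return op == '='; }

/* parse operators of precedence >= min, leaving a parenthesized copy in buf */
static void expr(int min, char *buf) {
	char lhs[128], rhs[128];
	sprintf(lhs, "%c", *src++);	/* operand: one letter */
	while (prec(*src) >= min) {
		char op = *src++;
		/* a right associative operator admits operators of equal
		   precedence on its right; a left associative one does not */
		expr(right_assoc(op) ? prec(op) : prec(op) + 1, rhs);
		sprintf(buf, "(%s%c%s)", lhs, op, rhs);
		strcpy(lhs, buf);
	}
	strcpy(buf, lhs);
}

const char *group(const char *s, char *buf) {
	src = s;
	expr(1, buf);
	return buf;
}
```

Applied to "a=b=c*d-e-f*g", group produces (a=(b=(((c*d)-e)-(f*g)))), the same structure described above.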
-The keyword, %prec, changes the precedence level associated with a particular grammar rule. -%prec appears immediately after the body of the grammar rule, before the action or closing semicolon, -and is followed by a token name or literal. -It -causes the precedence of the grammar rule to become that of the following token name or literal. -For example, to make unary minus have the same precedence as multiplication the rules might resemble: -.DS -%left \'+\' \'\-\' -%left \'*\' \'/\' - -%% - -expr : expr \'+\' expr - | expr \'\-\' expr - | expr \'*\' expr - | expr \'/\' expr - | \'\-\' expr %prec \'*\' - | NAME - ; -.DE -.PP -A token declared -by %left, %right, and %nonassoc need not be, but may be, declared by %token as well. -.PP -The precedences and associativities are used by Yacc to -resolve parsing conflicts; they give rise to disambiguating rules. -Formally, the rules work as follows: -.IP 1. -The precedences and associativities are recorded for those tokens and literals -that have them. -.IP 2. -A precedence and associativity is associated with each grammar rule; it is the precedence -and associativity of the last token or literal in the body of the rule. -If the %prec construction is used, it overrides this default. -Some grammar rules may have no precedence and associativity associated with them. -.IP 3. -When there is a reduce/reduce conflict, or there is a shift/reduce conflict -and either the input symbol or the grammar rule has no precedence and associativity, -then the two disambiguating rules given at the beginning of the section are used, -and the conflicts are reported. -.IP 4. -If there is a shift/reduce conflict, and both the grammar rule and the input character -have precedence and associativity associated with them, then the conflict is resolved -in favor of the action (shift or reduce) associated with the higher precedence. 
-If the precedences are the same, then the associativity is used; left -associative implies reduce, right associative implies shift, and nonassociating -implies error. -.PP -Conflicts resolved by precedence are not counted in the number of shift/reduce and reduce/reduce -conflicts reported by Yacc. -This means that mistakes in the specification of precedences may -disguise errors in the input grammar; it is a good idea to be sparing -with precedences, and use them in an essentially ``cookbook'' fashion, -until some experience has been gained. -The -.I y.output -file -is very useful in deciding whether the parser is actually doing -what was intended. //GO.SYSIN DD ss6 echo ss7 sed 's/.//' >ss7 <<'//GO.SYSIN DD ss7' -.SH -7: Error Handling -.PP -Error handling is an extremely difficult area, and many of the problems are semantic ones. -When an error is found, for example, it may be necessary to reclaim parse tree storage, -delete or alter symbol table entries, and, typically, set switches to avoid generating any further output. -.PP -It is seldom acceptable to stop all processing when an error is found; it is more useful to continue -scanning the input to find further syntax errors. -This leads to the problem of getting the parser ``restarted'' after an error. -A general class of algorithms to do this involves discarding a number of tokens -from the input string, and attempting to adjust the parser so that input can continue. -.PP -To allow the user some control over this process, -Yacc provides a simple, but reasonably general, feature. -The token name ``error'' is reserved for error handling. -This name can be used in grammar rules; -in effect, it suggests places where errors are expected, and recovery might take place. -The parser pops its stack until it enters a state where the token ``error'' is legal. -It then behaves as if the token ``error'' were the current lookahead token, -and performs the action encountered. 
-The lookahead token is then reset to the token that caused the error. -If no special error rules have been specified, the processing halts when an error is detected. -.PP -In order to prevent a cascade of error messages, the parser, after -detecting an error, remains in error state until three tokens have been successfully -read and shifted. -If an error is detected when the parser is already in error state, -no message is given, and the input token is quietly deleted. -.PP -As an example, a rule of the form -.DS -stat : error -.DE -would, in effect, mean that on a syntax error the parser would attempt to skip over the statement -in which the error was seen. -More precisely, the parser will -scan ahead, looking for three tokens that might legally follow -a statement, and start processing at the first of these; if -the beginnings of statements are not sufficiently distinctive, it may make a -false start in the middle of a statement, and end up reporting a -second error where there is in fact no error. -.PP -Actions may be used with these special error rules. -These actions might attempt to reinitialize tables, reclaim symbol table space, etc. -.PP -Error rules such as the above are very general, but difficult to control. -Somewhat easier are rules such as -.DS -stat : error \';\' -.DE -Here, when there is an error, the parser attempts to skip over the statement, but -will do so by skipping to the next \';\'. -All tokens after the error and before the next \';\' cannot be shifted, and are discarded. -When the \';\' is seen, this rule will be reduced, and any ``cleanup'' -action associated with it performed. -.PP -Another form of error rule arises in interactive applications, where -it may be desirable to permit a line to be reentered after an error. 
-A possible error rule might be -.DS -input : error \'\en\' { printf( "Reenter last line: " ); } input - { $$ = $4; } -.DE -There is one potential difficulty with this approach; -the parser must correctly process three input tokens before it -admits that it has correctly resynchronized after the error. -If the reentered line contains an error -in the first two tokens, the parser deletes the offending tokens, -and gives no message; this is clearly unacceptable. -For this reason, there is a mechanism that -can be used to force the parser -to believe that an error has been fully recovered from. -The statement -.DS -yyerrok ; -.DE -in an action -resets the parser to its normal mode. -The last example is better written -.DS -input : error \'\en\' - { yyerrok; - printf( "Reenter last line: " ); } - input - { $$ = $4; } - ; -.DE -.PP -As mentioned above, the token seen immediately -after the ``error'' symbol is the input token at which the -error was discovered. -Sometimes, this is inappropriate; for example, an -error recovery action might -take upon itself the job of finding the correct place to resume input. -In this case, -the previous lookahead token must be cleared. -The statement -.DS -yyclearin ; -.DE -in an action will have this effect. -For example, suppose the action after error -were to call some sophisticated resynchronization routine, -supplied by the user, that attempted to advance the input to the -beginning of the next valid statement. -After this routine was called, the next token returned by yylex would presumably -be the first token in a legal statement; -the old, illegal token must be discarded, and the error state reset. 
-This could be done by a rule like -.DS -stat : error - { resynch(); - yyerrok ; - yyclearin ; } - ; -.DE -.PP -These mechanisms are admittedly crude, but do allow for a simple, fairly effective recovery of the parser -from many errors; -moreover, the user can get control to deal with -the error actions required by other portions of the program. //GO.SYSIN DD ss7 echo ss8 sed 's/.//' >ss8 <<'//GO.SYSIN DD ss8' -.SH -8: The Yacc Environment -.PP -When the user inputs a specification -to Yacc, the output is a file of C programs, called -.I y.tab.c -on most -systems -(due to local file system conventions, the names may differ from -installation to installation). -The function produced by Yacc is called -.I yyparse \|; -it is an integer valued function. -When it is called, it in turn repeatedly calls -.I yylex , -the lexical analyzer -supplied by the user (see Section 3) -to obtain input tokens. -Eventually, either an error is detected, in which case -(if no error recovery is possible) -.I yyparse -returns the value 1, -or the lexical analyzer returns the endmarker token -and the parser accepts. -In this case, -.I yyparse -returns the value 0. -.PP -The user must provide a certain amount of environment for this -parser in order to obtain a working program. -For example, as with every C program, a program called -.I main -must be defined, that eventually calls -.I yyparse . -In addition, a routine called -.I yyerror -prints a message -when a syntax error is detected. -.PP -These two routines must be supplied in one form or another by the -user. -To ease the initial effort of using Yacc, a library has been -provided with default versions of -.I main -and -.I yyerror . -The name of this library is system dependent; -on many systems the library is accessed by a -.B \-ly -argument to the loader. 
-To show the triviality of these default programs, the source is -given below: -.DS -main(){ - return( yyparse() ); - } -.DE -and -.DS -# include <stdio.h> - -yyerror(s) char *s; { - fprintf( stderr, "%s\en", s ); - } -.DE -The argument to -.I yyerror -is a string containing an error message, usually -the string ``syntax error''. -The average application will want to do better than this. -Ordinarily, the program should keep track of the input line number, and print it -along with the message when a syntax error is detected. -The external integer variable -.I yychar -contains the lookahead token number at the time the error was detected; -this may be of some interest in giving better diagnostics. -Since the -.I main -program is probably supplied by the user (to read arguments, etc.) -the Yacc library is useful only in small -projects, or in the earliest stages of larger ones. -.PP -The external integer variable -.I yydebug -is normally set to 0. -If it is set to a nonzero value, the parser will output a -verbose description of its actions, including -a discussion of which input symbols have been read, and -what the parser actions are. -Depending on the operating environment, -it may be possible to set this variable by using a debugging system. //GO.SYSIN DD ss8 echo ss9 sed 's/.//' >ss9 <<'//GO.SYSIN DD ss9' -.SH -9: Hints for Preparing Specifications -.PP -This section contains miscellaneous hints on preparing efficient, easy to change, -and clear specifications. -The individual subsections are more or less -independent. -.SH -Input Style -.PP -It is difficult to -provide rules with substantial actions -and still have a readable specification file. -The following style hints owe much to Brian Kernighan. -.IP a. -Use all capital letters for token names, all lower case letters for -nonterminal names. -This rule comes under the heading of ``knowing who to blame when -things go wrong.'' -.IP b. -Put grammar rules and actions on separate lines.
-This allows either to be changed without -an automatic need to change the other. -.IP c. -Put all rules with the same left hand side together. -Put the left hand side in only once, and let all -following rules begin with a vertical bar. -.IP d. -Put a semicolon only after the last rule with a given left hand side, -and put the semicolon on a separate line. -This allows new rules to be easily added. -.IP e. -Indent rule bodies by two tab stops, and action bodies by three -tab stops. -.PP -The example in Appendix A is written following this style, as are -the examples in the text of this paper (where space permits). -The user must make up his own mind about these stylistic questions; -the central problem, however, is to make the rules visible through -the morass of action code. -.SH -Left Recursion -.PP -The algorithm used by the Yacc parser encourages so called ``left recursive'' -grammar rules: rules of the form -.DS -name : name rest_of_rule ; -.DE -These rules frequently arise when -writing specifications of sequences and lists: -.DS -list : item - | list \',\' item - ; -.DE -and -.DS -seq : item - | seq item - ; -.DE -In each of these cases, the first rule -will be reduced for the first item only, and the second rule -will be reduced for the second and all succeeding items. -.PP -With right recursive rules, such as -.DS -seq : item - | item seq - ; -.DE -the parser would be a bit bigger, and the items would be seen, and reduced, -from right to left. -More seriously, an internal stack in the parser -would be in danger of overflowing if a very long sequence were read. -Thus, the user should use left recursion wherever reasonable. 
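The stack behavior can be made concrete with a small simulation. This C sketch is a model only, not Yacc's actual machinery; it counts the maximum parse stack depth needed to recognize a sequence of n items under each shape of rule.

```c
#include <assert.h>

/* Model of the parser's stack discipline for the two sequence rules. */

/* seq : item | seq item  --  each item is reduced as soon as it is read,
   so the stack never holds more than seq and one item at a time */
int max_depth_left(int n) {
	int depth = 0, max = 0, i;
	for (i = 0; i < n; i++) {
		depth++;		/* shift the item */
		if (depth > max)
			max = depth;
		depth = 1;		/* reduce item, or seq item, to seq */
	}
	return max;
}

/* seq : item | item seq  --  no reduction is possible until the last
   item has been shifted, so all n items sit on the stack at once */
int max_depth_right(int n) {
	return n;
}
```

For a sequence of 10,000 items the left recursive form never needs more than two stack entries, while the right recursive form needs 10,000; this is the overflow danger mentioned above.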
-.PP -It is worth considering whether a sequence with zero -elements has any meaning, and if so, consider writing -the sequence specification with an empty rule: -.DS -seq : /* empty */ - | seq item - ; -.DE -Once again, the first rule would always be reduced exactly once, before the -first item was read, -and then the second rule would be reduced once for each item read. -Permitting empty sequences -often leads to increased generality. -However, conflicts might arise if Yacc is asked to decide -which empty sequence it has seen, when it hasn't seen enough to -know! -.SH -Lexical Tie-ins -.PP -Some lexical decisions depend on context. -For example, the lexical analyzer might want to -delete blanks normally, but not within quoted strings. -Or names might be entered into a symbol table in declarations, -but not in expressions. -.PP -One way of handling this situation is -to create a global flag that is -examined by the lexical analyzer, and set by actions. -For example, suppose a program -consists of 0 or more declarations, followed by 0 or more statements. -Consider: -.DS -%{ - int dflag; -%} - ... other declarations ... - -%% - -prog : decls stats - ; - -decls : /* empty */ - { dflag = 1; } - | decls declaration - ; - -stats : /* empty */ - { dflag = 0; } - | stats statement - ; - - ... other rules ... -.DE -The flag -.I dflag -is now 0 when reading statements, and 1 when reading declarations, -.ul -except for the first token in the first statement. -This token must be seen by the parser before it can tell that -the declaration section has ended and the statements have -begun. -In many cases, this single token exception does not -affect the lexical scan. -.PP -This kind of ``backdoor'' approach can be elaborated -to a noxious degree. -Nevertheless, it represents a way of doing some things -that are difficult, if not impossible, to -do otherwise. 
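The lexical analyzer's side of such a tie-in might look as follows. This is a sketch only: the symbol table routine install, the table itself, and the token code NAME are hypothetical stand-ins for whatever the application actually uses.

```c
#include <string.h>
#include <assert.h>

/* Sketch of a lexical analyzer fragment consulting the dflag tie-in
   described above.  install() and NAME are hypothetical. */

int dflag;			/* set by the parser actions shown above */

#define NAME	257		/* hypothetical token code */
#define NSYMS	100

char symtab[NSYMS][32];
int nsyms;

void install(const char *name) {	/* hypothetical symbol table entry */
	strcpy(symtab[nsyms++], name);
}

int lex_name(const char *name) {
	if (dflag)		/* reading declarations: record the name */
		install(name);
	return NAME;		/* in either case, hand the token on */
}
```

With dflag set, a name is entered into the table as it is scanned; with dflag clear, the same name is simply passed through as a token.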
-.SH -Reserved Words -.PP -Some programming languages -permit the user to -use words like ``if'', which are normally reserved, -as label or variable names, provided that such use does not -conflict with the legal use of these names in the programming language. -This is extremely hard to do in the framework of Yacc; -it is difficult to pass information to the lexical analyzer -telling it ``this instance of `if' is a keyword, and that instance is a variable''. -The user can make a stab at it, using the -mechanism described in the last subsection, -but it is difficult. -.PP -A number of ways of making this easier are under advisement. -Until then, it is better that the keywords be -.I reserved \|; -that is, be forbidden for use as variable names. -There are powerful stylistic reasons for preferring this, anyway. //GO.SYSIN DD ss9 echo ssA sed 's/.//' >ssA <<'//GO.SYSIN DD ssA' -.SH -10: Advanced Topics -.PP -This section discusses a number of advanced features -of Yacc. -.SH -Simulating Error and Accept in Actions -.PP -The parsing actions of error and accept can be simulated -in an action by use of macros YYACCEPT and YYERROR. -YYACCEPT causes -.I yyparse -to return the value 0; -YYERROR causes -the parser to behave as if the current input symbol -had been a syntax error; -.I yyerror -is called, and error recovery takes place. -These mechanisms can be used to simulate parsers -with multiple endmarkers or context-sensitive syntax checking. -.SH -Accessing Values in Enclosing Rules. -.PP -An action may refer to values -returned by actions to the left of the current rule. -The mechanism is simply the same as with ordinary actions, -a dollar sign followed by a digit, but in this case the -digit may be 0 or negative. -Consider -.DS -sent : adj noun verb adj noun - { \fIlook at the sentence\fR . . . } - ; - -adj : THE { $$ = THE; } - | YOUNG { $$ = YOUNG; } - . . . 
- ; - -noun : DOG - { $$ = DOG; } - | CRONE - { if( $0 == YOUNG ){ - printf( "what?\en" ); - } - $$ = CRONE; - } - ; - . . . -.DE -In the action following the word CRONE, a check is made that the -preceding token shifted was not YOUNG. -Obviously, this is only possible when a great deal is known about -what might precede the symbol -.I noun -in the input. -There is also a distinctly unstructured flavor about this. -Nevertheless, at times this mechanism will save a great -deal of trouble, especially when a few combinations are to -be excluded from an otherwise regular structure. -.SH -Support for Arbitrary Value Types -.PP -By default, the values returned by actions and the lexical analyzer are integers. -Yacc can also support -values of other types, including structures. -In addition, Yacc keeps track of the types, and inserts -appropriate union member names so that the resulting parser will -be strictly type checked. -The Yacc value stack (see Section 4) -is declared to be a -.I union -of the various types of values desired. -The user declares the union, and associates union member names -to each token and nonterminal symbol having a value. -When the value is referenced through a $$ or $n construction, -Yacc will automatically insert the appropriate union name, so that -no unwanted conversions will take place. -In addition, type checking commands such as -.I Lint\| -.[ -Johnson Lint Checker 1273 -.] -will be far more silent. -.PP -There are three mechanisms used to provide for this typing. -First, there is a way of defining the union; this must be -done by the user since other programs, notably the lexical analyzer, -must know about the union member names. -Second, there is a way of associating a union member name with tokens -and nonterminals. -Finally, there is a mechanism for describing the type of those -few values where Yacc can not easily determine the type. 
-.PP -To declare the union, the user includes in the declaration section: -.DS -%union { - body of union ... - } -.DE -This declares the Yacc value stack, -and the external variables -.I yylval -and -.I yyval , -to have type equal to this union. -If Yacc was invoked with the -.B \-d -option, the union declaration -is copied onto the -.I y.tab.h -file. -Alternatively, -the union may be declared in a header file, and a typedef -used to define the variable YYSTYPE to represent -this union. -Thus, the header file might also have said: -.DS -typedef union { - body of union ... - } YYSTYPE; -.DE -The header file must be included in the declarations -section, by use of %{ and %}. -.PP -Once YYSTYPE is defined, -the union member names must be associated -with the various terminal and nonterminal names. -The construction -.DS -< name > -.DE -is used to indicate a union member name. -If this follows -one of the -keywords %token, -%left, %right, and %nonassoc, -the union member name is associated with the tokens listed. -Thus, saying -.DS -%left <optype> \'+\' \'\-\' -.DE -will cause any reference to values returned by these two tokens to be -tagged with -the union member name -.I optype . -Another keyword, %type, is -used similarly to associate -union member names with nonterminals. -Thus, one might say -.DS -%type <nodetype> expr stat -.DE -.PP -There remain a couple of cases where these mechanisms are insufficient. -If there is an action within a rule, the value returned -by this action has no -.I "a priori" -type. -Similarly, reference to left context values (such as $0 \- see the -previous subsection ) leaves Yacc with no easy way of knowing the type. -In this case, a type can be imposed on the reference by inserting -a union member name, between < and >, immediately after -the first $. -An example of this usage is -.DS -rule : aaa { $<intval>$ = 3; } bbb - { fun( $<intval>2, $<other>0 ); } - ; -.DE -This syntax has little to recommend it, but the situation arises rarely.
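In terms of plain C, the effect of %union can be pictured as follows. The member names here are hypothetical, and the array shown is only a model of the value stack Yacc actually generates; the point is that a $<tag>n reference amounts to a union member selection, which the C compiler can check.

```c
#include <assert.h>

/* Model of what a %union declaration produces; member names are
   hypothetical. */
typedef union {
	int	intval;
	double	dval;
} YYSTYPE;

YYSTYPE yylval;		/* set by the lexical analyzer */
YYSTYPE vstack[100];	/* stands in for the Yacc value stack */

/* A reference such as $<intval>2 selects a member of a value stack
   slot, so every use is checked against the declared union. */
int get_intval(int slot) { return vstack[slot].intval; }
void set_intval(int slot, int v) { vstack[slot].intval = v; }
```

Because every access goes through a declared member, a use of the wrong type is caught at compile time rather than producing a silent misinterpretation of the stack contents.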
-.PP -A sample specification is given in Appendix C. -The facilities in this subsection are not triggered until they are used: -in particular, the use of %type will turn on these mechanisms. -When they are used, there is a fairly strict level of checking. -For example, use of $n or $$ to refer to something with no defined type -is diagnosed. -If these facilities are not triggered, the Yacc value stack is used to -hold -\fIint\fRs, -as was true historically. //GO.SYSIN DD ssA echo ssB sed 's/.//' >ssB <<'//GO.SYSIN DD ssB' -.SH -11: Acknowledgements -.PP -Yacc owes much to a -most stimulating collection of users, who have goaded -me beyond my inclination, and frequently beyond my -ability, in their endless search for ``one more feature''. -Their irritating unwillingness to learn how to -do things my way has usually led to my doing things their way; -most of the time, they have been right. -B. W. Kernighan, P. J. Plauger, S. I. Feldman, C. Imagna, -M. E. Lesk, -and A. Snyder will recognize some of their ideas in the current version -of Yacc. -C. B. Haley contributed to the error recovery algorithm. -D. M. Ritchie, B. W. Kernighan, and M. O. Harris helped translate this document into English. -Al Aho also deserves special credit for bringing -the mountain to Mohammed, and other favors. -.SG "MH-1273-SCJ-unix" -.bp -.[ -$LIST$ -.] -.bp //GO.SYSIN DD ssB echo ssa sed 's/.//' >ssa <<'//GO.SYSIN DD ssa' -.SH -Appendix A: A Simple Example -.PP -This example gives the complete Yacc specification for a small desk calculator; -the desk calculator has 26 registers, labeled ``a'' through ``z'', and accepts -arithmetic expressions made up of the operators +, \-, *, /, -% (mod operator), & (bitwise and), | (bitwise or), and assignment. -If an expression at the top level is an assignment, the value is not -printed; otherwise it is. -As in C, an integer that begins with 0 (zero) is assumed to be octal; -otherwise, it is assumed to be decimal.
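The octal and decimal convention just described amounts to a few lines of arithmetic. This C helper is an illustration, not part of the specification: the first digit fixes the base (a leading zero means octal), and each further digit d extends the value as value = base*value + d, exactly as the number rules of the grammar do.

```c
#include <assert.h>

/* Illustration of the arithmetic in the grammar's number rules:
   a leading 0 selects base 8, anything else base 10. */
int read_number(const char *digits) {
	int value = digits[0] - '0';
	int base = (value == 0) ? 8 : 10;
	int i;
	for (i = 1; digits[i] != '\0'; i++)
		value = base * value + (digits[i] - '0');
	return value;
}
```

Thus "101" reads as one hundred one, while "017" reads as octal 17, that is, fifteen.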
-.PP -As an example of a Yacc specification, the desk calculator -does a reasonable job of showing how precedences and ambiguities -are used, and demonstrating simple error recovery. -The major oversimplifications are that the -lexical analysis phase is much simpler than for most applications, and the -output is produced immediately, line by line. -Note the way that decimal and octal integers are read in by the grammar rules; -this job is probably better done by the lexical analyzer. -.sp -.nf -.ta .5i 1i 1.5i 2i 2.5i - -%{ -# include <stdio.h> -# include <ctype.h> - -int regs[26]; -int base; - -%} - -%start list - -%token DIGIT LETTER - -%left \'|\' -%left \'&\' -%left \'+\' \'\-\' -%left \'*\' \'/\' \'%\' -%left UMINUS /* supplies precedence for unary minus */ - -%% /* beginning of rules section */ - -list : /* empty */ - | list stat \'\en\' - | list error \'\en\' - { yyerrok; } - ; - -stat : expr - { printf( "%d\en", $1 ); } - | LETTER \'=\' expr - { regs[$1] = $3; } - ; - -expr : \'(\' expr \')\' - { $$ = $2; } - | expr \'+\' expr - { $$ = $1 + $3; } - | expr \'\-\' expr - { $$ = $1 \- $3; } - | expr \'*\' expr - { $$ = $1 * $3; } - | expr \'/\' expr - { $$ = $1 / $3; } - | expr \'%\' expr - { $$ = $1 % $3; } - | expr \'&\' expr - { $$ = $1 & $3; } - | expr \'|\' expr - { $$ = $1 | $3; } - | \'\-\' expr %prec UMINUS - { $$ = \- $2; } - | LETTER - { $$ = regs[$1]; } - | number - ; - -number : DIGIT - { $$ = $1; base = ($1==0) ? 
8 : 10; } - | number DIGIT - { $$ = base * $1 + $2; } - ; - -%% /* start of programs */ - -yylex() { /* lexical analysis routine */ - /* returns LETTER for a lower case letter, yylval = 0 through 25 */ - /* return DIGIT for a digit, yylval = 0 through 9 */ - /* all other characters are returned immediately */ - - int c; - - while( (c=getchar()) == \' \' ) { /* skip blanks */ } - - /* c is now nonblank */ - - if( islower( c ) ) { - yylval = c \- \'a\'; - return ( LETTER ); - } - if( isdigit( c ) ) { - yylval = c \- \'0\'; - return( DIGIT ); - } - return( c ); - } -.fi -.bp //GO.SYSIN DD ssa echo ssb sed 's/.//' >ssb <<'//GO.SYSIN DD ssb' -.SH -Appendix B: Yacc Input Syntax -.PP -This Appendix has a description of the Yacc input syntax, as a Yacc specification. -Context dependencies, etc., are not considered. -Ironically, the Yacc input specification language -is most naturally specified as an LR(2) grammar; the sticky -part comes when an identifier is seen in a rule, immediately -following an action. -If this identifier is followed by a colon, it is the start of the -next rule; otherwise -it is a continuation of the current rule, which just happens to have -an action embedded in it. -As implemented, the lexical analyzer looks -ahead after seeing an identifier, and -decides whether the next token (skipping blanks, newlines, comments, etc.) -is a colon. -If so, it returns the token C_IDENTIFIER. -Otherwise, it returns IDENTIFIER. -Literals (quoted strings) are also returned as IDENTIFIERs, -but never as part of C_IDENTIFIERs. -.sp -.nf -.ta .6i 1.2i 1.8i 2.4i 3i 3.6i - - /* grammar for the input to Yacc */ - - /* basic entities */ -%token IDENTIFIER /* includes identifiers and literals */ -%token C_IDENTIFIER /* identifier (but not literal) followed by colon */ -%token NUMBER /* [0-9]+ */ - - /* reserved words: %type => TYPE, %left => LEFT, etc. 
*/ - -%token LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION - -%token MARK /* the %% mark */ -%token LCURL /* the %{ mark */ -%token RCURL /* the %} mark */ - - /* ascii character literals stand for themselves */ - -%start spec - -%% - -spec : defs MARK rules tail - ; - -tail : MARK { \fIIn this action, eat up the rest of the file\fR } - | /* empty: the second MARK is optional */ - ; - -defs : /* empty */ - | defs def - ; - -def : START IDENTIFIER - | UNION { \fICopy union definition to output\fR } - | LCURL { \fICopy C code to output file\fR } RCURL - | rword tag nlist - ; - -rword : TOKEN - | LEFT - | RIGHT - | NONASSOC - | TYPE - ; - -tag : /* empty: union tag is optional */ - | \'<\' IDENTIFIER \'>\' - ; - -nlist : nmno - | nlist nmno - | nlist \',\' nmno - ; - -nmno : IDENTIFIER /* NOTE: literal illegal with %type */ - | IDENTIFIER NUMBER /* NOTE: illegal with %type */ - ; - - /* rules section */ - -rules : C_IDENTIFIER rbody prec - | rules rule - ; - -rule : C_IDENTIFIER rbody prec - | '|' rbody prec - ; - -rbody : /* empty */ - | rbody IDENTIFIER - | rbody act - ; - -act : \'{\' { \fICopy action, translate $$, etc.\fR } \'}\' - ; - -prec : /* empty */ - | PREC IDENTIFIER - | PREC IDENTIFIER act - | prec \';\' - ; -.fi -.bp //GO.SYSIN DD ssb echo ssc sed 's/.//' >ssc <<'//GO.SYSIN DD ssc' -.SH -Appendix C: An Advanced Example -.PP -This Appendix gives an example of a grammar using some -of the advanced features discussed in Section 10. -The desk calculator example in Appendix A is -modified to provide a desk calculator that -does floating point interval arithmetic. -The calculator understands floating point -constants, the arithmetic operations +, \-, *, /, -unary \-, and = (assignment), and has 26 floating -point variables, ``a'' through ``z''. -Moreover, it also understands -.I intervals , -written -.DS - ( x , y ) -.DE -where -.I x -is less than or equal to -.I y . -There are 26 interval valued variables ``A'' through ``Z'' -that may also be used. 
-The usage is similar to that in Appendix A; assignments -return no value, and print nothing, while expressions print -the (floating or interval) value. -.PP -This example explores a number of interesting features -of Yacc and C. -Intervals are represented by a structure, consisting of the -left and right endpoint values, stored as -.I double 's. -This structure is given a type name, INTERVAL, by using -.I typedef . -The Yacc value stack can also contain floating point scalars, and -integers (used to index into the arrays holding the variable values). -Notice that this entire strategy depends strongly on being able to -assign structures and unions in C. -In fact, many of the actions call functions that return structures -as well. -.PP -It is also worth noting the use of YYERROR to handle error conditions: -division by an interval containing 0, and an interval presented in -the wrong order. -In effect, the error recovery mechanism of Yacc is used to throw away the -rest of the offending line. -.PP -In addition to the mixing of types on the value stack, -this grammar also demonstrates an interesting use of syntax to -keep track of the type (e.g. scalar or interval) of intermediate -expressions. -Note that a scalar can be automatically promoted to an interval if -the context demands an interval value. -This causes a large number of conflicts when the grammar is run through -Yacc: 18 Shift/Reduce and 26 Reduce/Reduce. -The problem can be seen by looking at the two input lines: -.DS - 2.5 + ( 3.5 \- 4. ) -.DE -and -.DS - 2.5 + ( 3.5 , 4. ) -.DE -Notice that the 2.5 is to be used in an interval valued expression -in the second example, but this fact is not known until -the ``,'' is read; by this time, 2.5 is finished, and the parser cannot go back -and change its mind. -More generally, it might be necessary to look ahead an arbitrary number of -tokens to decide whether to convert a scalar to an interval. 
-This problem is evaded by having two rules for each binary interval -valued operator: one when the left operand is a scalar, and one when -the left operand is an interval. -In the second case, the right operand must be an interval, -so the conversion will be applied automatically. -Despite this evasion, there are still many cases where the -conversion may be applied or not, leading to the above -conflicts. -They are resolved by listing the rules that yield scalars first -in the specification file; in this way, the conflicts will -be resolved in the direction of keeping scalar -valued expressions scalar valued until they are forced to become -intervals. -.PP -This way of handling multiple types is very instructive, but not very general. -If there were many kinds of expression types, instead of just two, -the number of rules needed would increase dramatically, and the conflicts -even more dramatically. -Thus, while this example is instructive, it is better practice in a -more normal programming language environment to -keep the type information as part of the value, and not as part -of the grammar. -.PP -Finally, a word about the lexical analysis. -The only unusual feature is the treatment of floating point constants. -The C library routine -.I atof -is used to do the actual conversion from a character string -to a double precision value. -If the lexical analyzer detects an error, -it responds by returning a token that -is illegal in the grammar, provoking a syntax error -in the parser, and thence error recovery. 
-.DS L - -%{ - -# include <stdio.h> -# include <ctype.h> - -typedef struct interval { - double lo, hi; - } INTERVAL; - -INTERVAL vmul(), vdiv(); - -double atof(); - -double dreg[ 26 ]; -INTERVAL vreg[ 26 ]; - -%} - -%start lines - -%union { - int ival; - double dval; - INTERVAL vval; - } - -%token <ival> DREG VREG /* indices into dreg, vreg arrays */ - -%token <dval> CONST /* floating point constant */ - -%type <dval> dexp /* expression */ - -%type <vval> vexp /* interval expression */ - - /* precedence information about the operators */ - -%left \'+\' \'\-\' -%left \'*\' \'/\' -%left UMINUS /* precedence for unary minus */ - -%% - -lines : /* empty */ - | lines line - ; - -line : dexp \'\en\' - { printf( "%15.8f\en", $1 ); } - | vexp \'\en\' - { printf( "(%15.8f , %15.8f )\en", $1.lo, $1.hi ); } - | DREG \'=\' dexp \'\en\' - { dreg[$1] = $3; } - | VREG \'=\' vexp \'\en\' - { vreg[$1] = $3; } - | error \'\en\' - { yyerrok; } - ; - -dexp : CONST - | DREG - { $$ = dreg[$1]; } - | dexp \'+\' dexp - { $$ = $1 + $3; } - | dexp \'\-\' dexp - { $$ = $1 \- $3; } - | dexp \'*\' dexp - { $$ = $1 * $3; } - | dexp \'/\' dexp - { $$ = $1 / $3; } - | \'\-\' dexp %prec UMINUS - { $$ = \- $2; } - | \'(\' dexp \')\' - { $$ = $2; } - ; - -vexp : dexp - { $$.hi = $$.lo = $1; } - | \'(\' dexp \',\' dexp \')\' - { - $$.lo = $2; - $$.hi = $4; - if( $$.lo > $$.hi ){ - printf( "interval out of order\en" ); - YYERROR; - } - } - | VREG - { $$ = vreg[$1]; } - | vexp \'+\' vexp - { $$.hi = $1.hi + $3.hi; - $$.lo = $1.lo + $3.lo; } - | dexp \'+\' vexp - { $$.hi = $1 + $3.hi; - $$.lo = $1 + $3.lo; } - | vexp \'\-\' vexp - { $$.hi = $1.hi \- $3.lo; - $$.lo = $1.lo \- $3.hi; } - | dexp \'\-\' vexp - { $$.hi = $1 \- $3.lo; - $$.lo = $1 \- $3.hi; } - | vexp \'*\' vexp - { $$ = vmul( $1.lo, $1.hi, $3 ); } - | dexp \'*\' vexp - { $$ = vmul( $1, $1, $3 ); } - | vexp \'/\' vexp - { if( dcheck( $3 ) ) YYERROR; - $$ = vdiv( $1.lo, $1.hi, $3 ); } - | dexp \'/\' vexp - { if( dcheck( $3 ) ) YYERROR; - $$ = vdiv( $1, $1, $3 ); } - | \'\-\' vexp %prec 
UMINUS - { $$.hi = \-$2.lo; $$.lo = \-$2.hi; } - | \'(\' vexp \')\' - { $$ = $2; } - ; - -%% - -# define BSZ 50 /* buffer size for floating point numbers */ - - /* lexical analysis */ - -yylex(){ - register c; - - while( (c=getchar()) == \' \' ){ /* skip over blanks */ } - - if( isupper( c ) ){ - yylval.ival = c \- \'A\'; - return( VREG ); - } - if( islower( c ) ){ - yylval.ival = c \- \'a\'; - return( DREG ); - } - - if( isdigit( c ) || c==\'.\' ){ - /* gobble up digits, points, exponents */ - - char buf[BSZ+1], *cp = buf; - int dot = 0, exp = 0; - - for( ; (cp\-buf)<BSZ ; ++cp,c=getchar() ){ - - *cp = c; - if( isdigit( c ) ) continue; - if( c==\'.\' ){ - if( dot++ || exp ) return( \'.\' ); /* will cause syntax error */ - continue; - } - - if( c==\'e\' ){ - if( exp++ ) return( \'e\' ); /* will cause syntax error */ - continue; - } - - /* end of number */ - break; - } - *cp = \'\e0\'; - if( (cp\-buf) >= BSZ ) printf( "constant too long: truncated\en" ); - else ungetc( c, stdin ); /* push back last char read */ - yylval.dval = atof( buf ); - return( CONST ); - } - return( c ); - } - -INTERVAL hilo( a, b, c, d ) double a, b, c, d; { - /* returns the smallest interval containing a, b, c, and d */ - /* used by *, / routines */ - INTERVAL v; - - if( a>b ) { v.hi = a; v.lo = b; } - else { v.hi = b; v.lo = a; } - - if( c>d ) { - if( c>v.hi ) v.hi = c; - if( d<v.lo ) v.lo = d; - } - else { - if( d>v.hi ) v.hi = d; - if( c<v.lo ) v.lo = c; - } - return( v ); - } - -INTERVAL vmul( a, b, v ) double a, b; INTERVAL v; { - return( hilo( a*v.hi, a*v.lo, b*v.hi, b*v.lo ) ); - } - -dcheck( v ) INTERVAL v; { - if( v.hi >= 0. && v.lo <= 0. ){ - printf( "divisor interval contains 0.\en" ); - return( 1 ); - } - return( 0 ); - } - -INTERVAL vdiv( a, b, v ) double a, b; INTERVAL v; { - return( hilo( a/v.hi, a/v.lo, b/v.hi, b/v.lo ) ); - } -.DE -.bp //GO.SYSIN DD ssc echo ssd sed 's/.//' >ssd <<'//GO.SYSIN DD ssd' -.SH -Appendix D: Old Features Supported but not Encouraged -.PP -This Appendix mentions synonyms and features which are supported for historical -continuity, but, for various reasons, are not encouraged. -.IP 1. -Literals may also be delimited by double quotes ``"''. -.IP 2. -Literals may be more than one character long. -If all the characters are alphabetic, numeric, or \_, the type number of the literal is defined, -just as if the literal did not have the quotes around it. -Otherwise, it is difficult to find the value for such literals. 
-.IP -The use of multi-character literals is likely to mislead those unfamiliar with -Yacc, since it suggests that Yacc is doing a job which must be actually done by the lexical analyzer. -.IP 3. -Most places where % is legal, backslash ``\e'' may be used. -In particular, \e\e is the same as %%, \eleft the same as %left, etc. -.IP 4. -There are a number of other synonyms: -.DS -%< is the same as %left -%> is the same as %right -%binary and %2 are the same as %nonassoc -%0 and %term are the same as %token -%= is the same as %prec -.DE -.IP 5. -Actions may also have the form -.DS -={ . . . } -.DE -and the curly braces can be dropped if the action is a -single C statement. -.IP 6. -C code between %{ and %} used to be permitted at the -head of the rules section, as well as in the -declaration section. //GO.SYSIN DD ssd