Regular grammar generators, like Lex, are often coupled with tools,
such as Yacc and Bison, that can generate parsers for more powerful
languages, namely (a subset of) context-free languages. These tools
take as input a description of the language to be recognized and
generate a parser for that language, written in some other language
(for example, Yacc and Bison generate parsers written in C). The user
must always be aware of the generated parser and that is a nuisance.
Bigloo provides such a tool that overcomes this annoyance. It
generates parsers for the class of Lalr(1) grammars in a more
opaque way.
An lalr(1) grammar is defined by the form:
lalr-grammar term-def non-term-def... | bigloo syntax |
term-def is a list of terminal elements of the grammar. Terminals can
grouped together to form precedence groups by including the related symbols
in a sub-lists of the term-def list. Each precedence group must start
with one of the keywords left: , right: or none: -- this
indicates the associativity of the terminal symbol. Here is a sample
term-def which declares eight terminals:
(terminal-1 terminal-2
(left: terminal-3 terminal-4)
terminal-5
(right: terminal-6)
(none: terminal-7)
terminal-8)
|
In this case, terminal-3 and terminal-4 both have the same
precedence, which is greater than the precedence assigned to
terminal-6 . No precedence was assigned to symbols terminal-1 ,
terminal-2 , terminal-5 or terminal-8 .
Each non-term-def is a list whose first element is the
non-terminal being defined, i.e. a symbol. The remaining elements are
the production rules associated with this non-terminal. Each rule is a
list whose first element is the rule itself (a list of symbols) and
the other elements are the semantic actions associated with that
particular rule.
For example, consider the following grammar:
E ==> E1 + id {E .val := E1 .val + id .val}
| id {E .val := id .val}
|
With Bigloo, it would be written:
(lalr-grammar
(plus id)
(e
((e plus id) (+ e id))
((id) id)))
|
The semantic value of a symbol in a rule can be accessed by simply
using the name of the symbol in the semantic action associated with
the rule. Because a rule can contain multiple occurrences of the same
symbol, Bigloo provides a way to access these occurrences
separately. To do so, the name of each occurrence must be suffixed by
@ var where var is the name of a variable that will be
bound to the semantic value of the occurrence. For example, if the
rule is
ifstmt ==> if E then Stmt else Stmt
|
then, in Bigloo, it would look like
(if-stmt
((if e then stmt@conseq else stmt@altern)
(if (eval e)
(eval conseq)
(eval altern))))
|
|
12.2 Precedence and associativity
|
The bigloo lalr(1) parser generator supports operator precedence and
associativity. The method for specifying the precedence for terminal symbols
is described in
Grammar Definition. Precedence is assigned to each
non-terminal production from the precedence of the last terminal symbol
appearing in that production.
Typically, when the parser generator encounters a shift/reduce conflict, it
produces a warning message, then chooses to reduce. When a parser generator
has precedence and associativity information, it can make a much more
sophisticated decision.
Let's use this simple calculator grammar as an example:
(lalr-grammar
((left: op-mult op-div)
(left: op-add op-sub)
op-lparen op-rparen
op-semicolon
number)
(file
(())
((file stmt)))
(stmt
((expr op-semicolon) (print expr)))
(expr
((number) number)
((expr@a op-add expr@b) (+ a b))
((expr@a op-sub expr@b) (- a b))
((expr@a op-mult expr@b) (* a b))
((expr@a op-div expr@b) (/ a b))
((op-lparen expr op-rparen) expr))))
|
Let's start with this input:
At the point where the parser has read
1 + 2
and the lookahead symbol
is
*
, the parser encounters a shift/reduce conflict. Should it first
reduce by the
(expr op-add expr)
production or shift the
*
in
the hopes of reducing the latter expression first?
The
(expr op-add expr)
production has gotten its precedence from the
op-add
terminal symbol. This is the precedence of the reduce. The
precedence of the shift comes from the precedence assigned to the lookahead
terminal symbol, which is
op-mult
. Since
op-mult
has higher
precedence, the parser generator in this state chooses to shift and does not
produce a warning.
Here's an example which we can use to demonstrate associativity:
The parser generator encounters a similar shift/reduce conflict this time,
except that when it tries to determine whether to shift or reduce, it finds
that both actions have the same precedence. In this case, the parser
generator looks at the associativity of the precedence group containing the
op-add
and
op-sub
. Since these are declared to be
left-associative, the parser generator chooses to reduce from this state,
effectively calculating the
1 + 2
. Had these symbols been
right-associative, the parser would have chosen to shift, effectively
calculating
2 - 3
first. If these symbols had been declared
non-associative with the
none:
keyword, the parser would generate an
error if it ever encountered this state.
12.3 The parsing function
|
Once a grammar has been defined, it can be used to parse some input
using the following function:
read/lalrp lg rg port [emptyp] | bigloo procedure |
This function takes three, possibly four, arguments. The first, lg , is
the Lalr(1) grammar. The second, rg , is the lexical analyzer that feeds
the grammar with tokens. The third argument, port , is the port that
contains the input to be parsed. The last argument, emptyp , if
provided, should be a function of one argument. It is called with each new
token read from the port and should return #t if the token denotes the
end of input. The result of the call is the value computed by the semantic
actions of the production rules.
|
In order to work properly, the regular grammar used with an
Lalr(1) grammar should follow some conventions:
- If a semantic value is to be associated with the token just
parsed, the regular grammar should return a pair whose
car
is the
token name (a symbol) and the cdr
is the semantic value.
- If there is no value associated with the token, the regular
grammar can return just the token name. When used in conjunction with
an Lalr grammar, regular grammar should never return
#f
as a token
value. This is specially true when the regular grammar detects the end of
parsing. In that case, the regular grammar must not return the
#f
value. A good way to handle end-of-file is illustrated in the
following example:
(let ((g (regular-grammar ()
...
(else
(let ((c (the-failure)))
(if (eof-object? c)
c
(error 'rgc "Illegal character" c))))))
(l (lalr-grammar ...)))
(read/lalrp l g (current-input-port)))
|
This way, the Lalr grammar will automatically handles the end-of-file.
12.5 Debugging Lalr Grammars
|
Currently the debugging facility for debugging Lalr grammars is very
limited. When the parameter
bigloo-debug
is set to a value
greater or equal to 100, the Lalr engine outputs all of the state
changes the parser is going through.
Here is the code for a simple calculator implemented by an Lalr(1)
grammar:
(begin
(read/lalrp
(lalr-grammar
(nl plus mult minus div const lpar rpar)
(lines
(())
((lines expression nl) (display "--> ")
(display expression)
(newline))
((lines nl)))
(expression
((expression plus term) (+ expression term))
((expression minus term) (- expression term))
((term) term))
(term
((term mult factor) (* term factor))
((term div factor) (/ term factor))
((factor) factor))
(factor
((lpar expression rpar) expression)
((const) const)))
(regular-grammar ()
((+ (or #\tab #\space)) (ignore))
(#\newline 'nl)
((+ digit) (cons 'const (string->number (the-string))))
(#\+ 'plus)
(#\- 'minus)
(#\* 'mult)
(#\/ 'div)
(#\( 'lpar)
(#\) 'rpar))
(current-input-port))
(reset-eof (current-input-port)))
|