Bigloo

Regular grammar generators, like Lex, are often coupled with tools, such as Yacc and Bison, that can generate parsers for more powerful languages, namely (a subset of) context-free languages. These tools take as input a description of the language to be recognized and generate a parser for that language, written in some other language (for example, Yacc and Bison generate parsers written in C). The user must always be aware of the generated parser and that is a nuisance. Bigloo provides such a tool that overcomes this annoyance. It generates parsers for the class of Lalr(1) grammars in a more opaque way.

12.1 Grammar definition

An lalr(1) grammar is defined by the form:

lalr-grammar term-def non-term-def...

bigloo syntax

term-def is a list of terminal elements of the grammar. Terminals can grouped together to form precedence groups by including the related symbols in a sub-lists of the term-def list. Each precedence group must start with one of the keywords left:, right: or none:-- this indicates the associativity of the terminal symbol. Here is a sample term-def which declares eight terminals:

(terminal-1 terminal-2
 (left: terminal-3 terminal-4)
 terminal-5
 (right: terminal-6)
 (none: terminal-7)
 terminal-8)

In this case, terminal-3 and terminal-4 both have the same precedence, which is greater than the precedence assigned to terminal-6. No precedence was assigned to symbols terminal-1, terminal-2, terminal-5 or terminal-8.

Each non-term-def is a list whose first element is the non-terminal being defined, i.e. a symbol. The remaining elements are the production rules associated with this non-terminal. Each rule is a list whose first element is the rule itself (a list of symbols) and the other elements are the semantic actions associated with that particular rule.

For example, consider the following grammar:

E ==> E1 + id {E.val := E1.val + id.val}
      | id {E.val := id.val}

With Bigloo, it would be written:

(lalr-grammar
  (plus id)
  (e
   ((e plus id)   (+ e id))
   ((id)          id)))

The semantic value of a symbol in a rule can be accessed by simply using the name of the symbol in the semantic action associated with the rule. Because a rule can contain multiple occurrences of the same symbol, Bigloo provides a way to access these occurrences separately. To do so, the name of each occurrence must be suffixed by @var where var is the name of a variable that will be bound to the semantic value of the occurrence. For example, if the rule is

   ifstmt ==> if E then Stmt else Stmt

then, in Bigloo, it would look like

(if-stmt
 ((if e then stmt@conseq else stmt@altern)
  (if (eval e) 
      (eval conseq) 
      (eval altern))))

12.2 Precedence and associativity

The bigloo lalr(1) parser generator supports operator precedence and associativity. The method for specifying the precedence for terminal symbols is described in Grammar Definition. Precedence is assigned to each non-terminal production from the precedence of the last terminal symbol appearing in that production.

Typically, when the parser generator encounters a shift/reduce conflict, it produces a warning message, then chooses to reduce. When a parser generator has precedence and associativity information, it can make a much more sophisticated decision.

Let's use this simple calculator grammar as an example:

(lalr-grammar
 ((left: op-mult op-div)
  (left: op-add op-sub)
  op-lparen op-rparen
  op-semicolon
  number)

 (file
   (())
   ((file stmt)))
 (stmt
   ((expr op-semicolon) (print expr)))
 (expr
   ((number) number)
   ((expr@a op-add expr@b) (+ a b))
   ((expr@a op-sub expr@b) (- a b))
   ((expr@a op-mult expr@b) (* a b))
   ((expr@a op-div expr@b) (/ a b))
   ((op-lparen expr op-rparen) expr))))

Let's start with this input:

1 + 2 * 3;

At the point where the parser has read 1 + 2 and the lookahead symbol is *, the parser encounters a shift/reduce conflict. Should it first reduce by the (expr op-add expr) production or shift the * in the hopes of reducing the latter expression first?

The (expr op-add expr) production has gotten its precedence from the op-add terminal symbol. This is the precedence of the reduce. The precedence of the shift comes from the precedence assigned to the lookahead terminal symbol, which is op-mult. Since op-mult has higher precedence, the parser generator in this state chooses to shift and does not produce a warning.

Here's an example which we can use to demonstrate associativity:

1 + 2 - 3;

The parser generator encounters a similar shift/reduce conflict this time, except that when it tries to determine whether to shift or reduce, it finds that both actions have the same precedence. In this case, the parser generator looks at the associativity of the precedence group containing the op-add and op-sub. Since these are declared to be left-associative, the parser generator chooses to reduce from this state, effectively calculating the 1 + 2. Had these symbols been right-associative, the parser would have chosen to shift, effectively calculating 2 - 3 first. If these symbols had been declared non-associative with the none: keyword, the parser would generate an error if it ever encountered this state.

12.3 The parsing function

Once a grammar has been defined, it can be used to parse some input using the following function:

read/lalrp lg rg port [emptyp]

bigloo procedure

This function takes three, possibly four, arguments. The first, lg, is the Lalr(1) grammar. The second, rg, is the lexical analyzer that feeds the grammar with tokens. The third argument, port, is the port that contains the input to be parsed. The last argument, emptyp, if provided, should be a function of one argument. It is called with each new token read from the port and should return #t if the token denotes the end of input. The result of the call is the value computed by the semantic actions of the production rules.

12.4 The regular grammar

In order to work properly, the regular grammar used with an Lalr(1) grammar should follow some conventions:

If a semantic value is to be associated with the token just parsed, the regular grammar should return a pair whose car is the token name (a symbol) and the cdr is the semantic value.

If there is no value associated with the token, the regular grammar can return just the token name. When used in conjunction with an Lalr grammar, regular grammar should never return #f as a token value. This is specially true when the regular grammar detects the end of parsing. In that case, the regular grammar must not return the #f value. A good way to handle end-of-file is illustrated in the following example:

(let ((g (regular-grammar ()
             ...
             (else 
              (let ((c (the-failure)))
                 (if (eof-object? c)
                     c
                     (error 'rgc "Illegal character" c))))))
      (l (lalr-grammar ...)))
   (read/lalrp l g (current-input-port)))

This way, the Lalr grammar will automatically handles the end-of-file.

12.5 Debugging Lalr Grammars

Currently the debugging facility for debugging Lalr grammars is very limited. When the parameter bigloo-debug is set to a value greater or equal to 100, the Lalr engine outputs all of the state changes the parser is going through.

12.6 A simple example

Here is the code for a simple calculator implemented by an Lalr(1) grammar:

(begin
  (read/lalrp
   (lalr-grammar
    (nl plus mult minus div const lpar rpar)
    (lines
     (())
     ((lines expression nl)    (display "--> ") 
                               (display expression) 
                               (newline))
     ((lines nl)))
    (expression
     ((expression plus term)   (+ expression term))
     ((expression minus term)  (- expression term))
     ((term)                   term))
    (term
     ((term mult factor)       (* term factor))
     ((term div factor)        (/ term factor))
     ((factor)                 factor))
    (factor
     ((lpar expression rpar)   expression)
     ((const)                  const)))

   (regular-grammar ()
    ((+ (or #\tab #\space)) (ignore))
    (#\newline              'nl)
    ((+ digit)              (cons 'const (string->number (the-string))))
    (#\+                    'plus)
    (#\-                    'minus)
    (#\*                    'mult)
    (#\/                    'div)
    (#\(                    'lpar)
    (#\)                    'rpar))

   (current-input-port))
  (reset-eof (current-input-port)))