Bigloo

Programming languages have poor reading libraries since the lexical information that can be specified is directly tied to the structure of the language. For example, in C it's hard to read a rational number because there is no type rational. Programs have been written to circumvent this problem: Lex [Lesk75], for example, is one of them. We choose to incorporate in Bigloo a set of new functions to assist in such parsing. The syntax for regular grammar (also known as regular analyser) of Bigloo 2.0 (the one described in this document) is not compatible with former Bigloo versions.

11.1 A new way of reading

There is only one way in Bigloo to read text, regular reading, which is done by the new form:

read/rp regular-grammar input-port

bigloo procedure

The first argument is a regular grammar (also known as a regular analyser) and the second is a Scheme port. This way of reading is almost the same as Lex's. The reader tries to match the longest possible input, from the stream pointed to by input-port, against one of the regular expressions contained in regular-grammar. If several rules match the same longest input, the reader takes the first one defined in the grammar. Once the matching rule has been found, the corresponding Scheme expression is evaluated.

remark: The traditional read Scheme function is implemented as:

(define-inline (read port)
   (read/rp scheme-grammar port))
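The matching policy described above (longest match first, then rule order) can be illustrated with a minimal sketch. The grammar and the helper first-two-tokens are hypothetical names introduced only for this illustration; they are not part of Bigloo.

```scheme
(module rgc-longest-match-sketch)

;; Hypothetical grammar: a keyword rule listed before a general
;; identifier rule.
(define *kw-or-ident*
   (regular-grammar ()
      ("if"
       'keyword)
      ((+ alpha)
       (cons 'ident (the-string)))
      ((+ (in #\space #\tab #\newline))
       (ignore))
      (else
       (the-failure))))

;; Hypothetical helper: reads two tokens from a string.
(define (first-two-tokens str)
   (let* ((port (open-input-string str))
          (t1 (read/rp *kw-or-ident* port))
          (t2 (read/rp *kw-or-ident* port)))
      (list t1 t2)))

;; (first-two-tokens "if iffy")
;; "if"   : both rules match two characters, the first rule wins
;; "iffy" : (+ alpha) matches four characters, "if" only two,
;;          so the longest match selects the identifier rule
```

Listing the more specific rule first is what makes the keyword win when both rules match the same input.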

11.2 The syntax of the regular grammar

A regular grammar is built by the means of the form regular-grammar:

regular-grammar (binding ...) rule ...

bigloo syntax

The binding and rule are defined by the following grammar:

<binding>  ==> (<variable> <re>)
             | <option>
<option>   ==> <variable>
<rule>     ==> <define>
             | (<cre> <s-expression> <s-expression> ...)
             | (else <s-expression> <s-expression> ...)
<define>   ==> (define <s-expression>)
<cre>      ==> <re>
             | (context <symbol> <re>)
             | (when <s-expr> <re>)
             | (bol <re>)
             | (eol <re>)
             | (bof <re>)
             | (eof <re>)
<re>       ==> <variable>
             | <char>
             | <string>
             | (: <re> ...)
             | (or <re> ...)
             | (* <re>)
             | (+ <re>)
             | (? <re>)
             | (= <integer> <re>)
             | (>= <integer> <re>)
             | (** <integer> <integer> <re>)
             | (... <integer> <re>)
             | (uncase <re>)
             | (in <cset> ...)
             | (out <cset> ...)
             | (and <cset> <cset>)
             | (but <cset> <cset>)
             | (posix <string>)
<variable> ==> <symbol>
<cset>     ==> <string>
             | <char>
             | (<string>)
             | (<char> <char>)

Here is a description of each construction.

(context <symbol> <re>)

This allows us to protect an expression. A protected expression matches (or accepts) a word only if the grammar has been set to the corresponding context. See The Semantics Actions, for more details.

(when <s-expr> <re>)

This allows us to protect an expression. A protected expression matches (or accepts) a word only if the evaluation of <s-expr> is #t. For instance,

(define *g*
   (let ((armed #f))
      (regular-grammar ()
         ((when (not armed) (: "#!" (+ (or #\/ alpha))))
          (set! armed #t)
          (print "start [" (the-string) "]")
          (ignore))
         ((+ (in #\Space #\Tab))
          (ignore))
         (else
          (the-failure)))))
   
(define (main argv)
   (let ((port (open-input-string "#!/bin/sh #!/bin/zsh")))
      (print (read/rp *g* port))))

(bol <re>)

Matches <re> at the beginning of line.

(eol <re>)

Matches <re> at the end of line.

(bof <re>)

Matches <re> at the beginning of file.

(eof <re>)

Matches <re> at the end of file.

<variable>

This is the name of a variable bound by a <binding> construction. In addition to user defined variables, some already exist. These are:

all    <=> (out #\Newline)
lower  <=> (in ("az"))
upper  <=> (in ("AZ"))
alpha  <=> (or lower upper)
digit  <=> (in ("09"))
xdigit <=> (uncase (in ("af09")))
alnum  <=> (uncase (in ("az09")))
punct  <=> (in ".,;!?")
blank  <=> (in #" \t\n")
space  <=> #\Space

It is an error to reference a variable that is not bound by a <binding>. Defining a variable that already exists is acceptable and causes the former variable definition to be erased. Here is an example of a grammar that binds two variables, one called ident and one called number. These two variables are used within the grammar to match identifiers and numbers.

(regular-grammar ((ident  (: alpha (* alnum)))
                  (number (+ digit)))
   (ident  (cons 'ident (the-string)))
   (number (cons 'number (the-string)))
   (else   (cons 'else (the-failure))))

<char>

The regular language described by one unique character. Here is an example of a grammar that accepts either the character #\a or the character #\b:

(regular-grammar ()
   (#\a (cons 'a (the-string)))
   (#\b (cons 'b (the-string)))
   (else (cons 'else (the-failure))))

<string>

This simple form of regular expression denotes the language represented by the string. For instance the regular expression "Bigloo" matches only the string composed of #\B #\i #\g #\l #\o #\o. The regular expression ".*[" matches the string #\. #\* #\[.

(: <re> ...)

This form constructs a sequence of regular expressions. That is, a form (: <re1> <re2> ... <ren>) matches the language constructed by concatenating the languages described by <re1>, <re2>, ..., <ren>. Thus, (: "x" all "y") matches all words of three letters starting with the character #\x and ending with the character #\y.

(or <re> ...)

This construction denotes alternatives. The language described by (or re1 re2) accepts words accepted by either re1 or re2.

(* <re>)

This is the Kleene operator. The language described by (* <re>) is the language containing 0 or more occurrences of <re>. Thus, the language described by (* "abc") accepts the empty word and any word composed of repetitions of abc (abc, abcabc, abcabcabc, ...).

(+ <re>)

This expression describes non-empty repetitions. The form (+ re) is equivalent to (: re (* re)). Thus, (+ "abc") matches the words abc, abcabc, etc.

(? <re>)

This expression describes zero or one occurrence. Thus, (? "abc") matches the empty word or the word abc.
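As a sketch of how these repetition operators combine, the hypothetical grammar below reads integers made of an optional minus sign, written with ?, followed by one or more digits, written with +. The names *int-grammar* and read-ints are assumptions made for the illustration.

```scheme
(module rgc-repetition-sketch)

;; Hypothetical grammar: an optional minus sign followed by one or
;; more digits; blank characters between numbers are skipped.
(define *int-grammar*
   (regular-grammar ()
      ((: (? #\-) (+ digit))
       (string->number (the-string)))
      ((+ (in #\space #\tab #\newline))
       (ignore))
      (else
       (let ((c (the-failure)))
          (if (eof-object? c)
              c
              (error "read-ints" "Illegal character" c))))))

;; Hypothetical helper: collects every integer found in STR.
(define (read-ints str)
   (let ((port (open-input-string str)))
      (let loop ((acc '()))
         (let ((n (read/rp *int-grammar* port)))
            (if (eof-object? n)
                (reverse acc)
                (loop (cons n acc)))))))

;; (read-ints "7 -12 42") returns the list (7 -12 42)
```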

(= <integer> <re>)

This expression describes a fixed number of repetitions. The form (= num re) is equivalent to (: re re ... re). Thus, the expression (= 3 "abc") matches only the word abcabcabc. In order to avoid code size explosion when compiling, <integer> must be smaller than an arbitrary constant. In the current version that value is 81.

(>= <integer> <re>)

The language described by the expression (>= int re) accepts words that are at least int repetitions of re. For instance, (>= 10 #\a) accepts words composed of at least 10 occurrences of the character #\a. In order to avoid code size explosion when compiling, <integer> must be smaller than an arbitrary constant. In the current version that value is 81.

(** <integer> <integer> <re>)

The language described by the expression (** min max re) accepts words that are repetitions of re; the number of repetitions is in the range min, max. For instance, (** 10 20 #\a) accepts words composed of 10 to 20 occurrences of the character #\a. In order to avoid code size explosion when compiling, each <integer> must be smaller than an arbitrary constant. In the current version that value is 81.
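The counted repetitions can be sketched with a hypothetical grammar that recognizes dates written as four digits, a dash, two digits, a dash, and two digits, and falls back on short numbers of one to three digits. The names *date-grammar* and parse-date are assumptions made for the illustration.

```scheme
(module rgc-counted-repetition-sketch)

;; Hypothetical grammar using = and ** (names are illustrative only).
(define *date-grammar*
   (regular-grammar ()
      ;; exactly 4 digits, a dash, 2 digits, a dash, 2 digits
      ((: (= 4 digit) #\- (= 2 digit) #\- (= 2 digit))
       (let ((s (the-string)))
          (list (string->number (substring s 0 4))
                (string->number (substring s 5 7))
                (string->number (substring s 8 10)))))
      ;; between 1 and 3 digits
      ((** 1 3 digit)
       (cons 'short-number (the-string)))
      (else
       #f)))

;; Hypothetical helper: parses the beginning of STR.
(define (parse-date str)
   (read/rp *date-grammar* (open-input-string str)))

;; (parse-date "2024-03-09") returns (2024 3 9): the date rule matches
;; ten characters, which is longer than the three digits matched by
;; the second rule.
;; (parse-date "42") returns (short-number . "42")
```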

(... <integer> <re>)

The subexpression <re> has to be a sequence of characters. Sequences are built by the operator : or by string literals. The language described by (... int re) contains the prefixes of re that are shorter than int characters: the first letter of re, the first two letters of re, and so on. Thus, (... 3 "begin") is equivalent to (or "b" "be" "beg").

(uncase <re>)

The subexpression <re> has to be a sequence construction. The language described by (uncase re) is the same as re except that letters may be upper case or lower case. For instance, (uncase "begin") accepts the words "begin", "beGin", "BEGIN", "BegiN", etc.
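A typical use of uncase is the recognition of case-insensitive keywords. The sketch below assumes a hypothetical grammar *keyword-grammar* and a hypothetical helper read-keywords; the keyword names are chosen only for the illustration.

```scheme
(module rgc-uncase-sketch)

;; Hypothetical grammar: two keywords recognized whatever their case.
(define *keyword-grammar*
   (regular-grammar ()
      ((uncase "select")
       'select)
      ((uncase "from")
       'from)
      ((+ (in #\space #\tab #\newline))
       (ignore))
      (else
       (the-failure))))

;; Hypothetical helper: reads the first two tokens of STR.
(define (read-keywords str)
   (let* ((port (open-input-string str))
          (k1 (read/rp *keyword-grammar* port))
          (k2 (read/rp *keyword-grammar* port)))
      (list k1 k2)))

;; (read-keywords "SeLeCt FROM") returns (select from)
```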

(in <cset> ...)

Denotes a union of characters. Characters may be described individually, as in (in #\a #\b #\c #\d). They may be described by strings: the expression (in "abcd") is equivalent to (in #\a #\b #\c #\d). Characters may also be described using a range notation, that is, a list of two characters: the expression (in (#\a #\d)) is equivalent to (in #\a #\b #\c #\d). Ranges may also be expressed using lists of strings: the expression (in ("ad")) is equivalent to (in #\a #\b #\c #\d).
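The different ways of writing character sets can be mixed inside one in form. As a minimal sketch, the hypothetical grammar below accepts identifiers that start with a letter or an underscore and continue with letters, digits, or underscores. The names *ident-grammar* and read-ident are assumptions made for the illustration.

```scheme
(module rgc-in-sketch)

;; Hypothetical grammar: ("azAZ") denotes the ranges a..z and A..Z,
;; ("azAZ09") adds the range 0..9, and #\_ is an individual character.
(define *ident-grammar*
   (regular-grammar ()
      ((: (in ("azAZ") #\_) (* (in ("azAZ09") #\_)))
       (the-string))
      (else
       #f)))

;; Hypothetical helper: returns the identifier starting STR, or #f.
(define (read-ident str)
   (read/rp *ident-grammar* (open-input-string str)))

;; (read-ident "foo_bar1=3") returns "foo_bar1"
;; (read-ident "9lives") returns #f
```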

(out <cset> ...)

The language described by (out cset ...) is the complement of the one described by (in cset ...). For instance, (out ("azAZ") (#\0 #\9)) accepts all words of one character that are neither letters nor digits. Note that although the character numbered zero may be used inside a regular grammar, the out construction never matches it. Thus, to write a rule that, for instance, matches every character but #\Newline, including the character zero, one should write:

(or (out #\Newline) #a000)

(and <cset> <cset>)

The language described by (and cset1 cset2) accepts words made of characters that are in both cset1 and cset2.

(but <cset> <cset>)

The language described by (but cset1 cset2) accepts words made of characters of cset1 that are not members of cset2.
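As a sketch of these two set operations, the hypothetical grammar below splits a lower case word into runs of consonants and runs of vowels. It assumes that the operands may be written with the same character set notations as for in; the names *letters-grammar* and letter-runs are introduced only for the illustration.

```scheme
(module rgc-and-but-sketch)

;; Hypothetical grammar: consonants are the lower case letters that
;; are not vowels; vowels are the characters that are both lower case
;; letters and members of the string "aeiouAEIOU".
(define *letters-grammar*
   (regular-grammar ()
      ((+ (but ("az") "aeiou"))
       (cons 'consonants (the-string)))
      ((+ (and ("az") "aeiouAEIOU"))
       (cons 'vowels (the-string)))
      (else
       (the-failure))))

;; Hypothetical helper: reads the first two runs of STR.
(define (letter-runs str)
   (let* ((port (open-input-string str))
          (r1 (read/rp *letters-grammar* port))
          (r2 (read/rp *letters-grammar* port)))
      (list r1 r2)))

;; (letter-runs "strength") is expected to start with the consonant
;; run "str" followed by the vowel run "e".
```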

(posix <string>)

The expression (posix string) allows one to use Posix string notation for regular expressions. So, for example, the following two expressions are equivalent:

(posix "[a-z]+|x*|y{3,5}")

(or (+ (in ("az"))) (* "x") (** 3 5 "y"))

string-case string rule ...

bigloo syntax

This form dispatches on strings. It opens an input port on string and reads from it according to the regular grammar defined by the rules. Example:

(define (suffix string)
   (string-case string
      ((: (* all) ".")
       (ignore))
      ((+ (out #\.))
       (the-string))
      (else
       "")))
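Another minimal sketch of string-case is a function that classifies the beginning of a string. The name classify and the returned symbols are assumptions made for the illustration.

```scheme
(module rgc-string-case-sketch)

;; Hypothetical helper: tells whether STR starts with an integer or
;; with an identifier. Only a prefix of STR has to match a rule.
(define (classify str)
   (string-case str
      ((: (? #\-) (+ digit))
       'integer)
      ((: alpha (* alnum))
       'identifier)
      (else
       'other)))

;; (classify "42")  returns integer
;; (classify "x1")  returns identifier
;; (classify "?")   returns other
```

Since the rules are matched against the beginning of the string, (classify "42abc") also returns integer; a stricter classifier would need to check that the whole string has been consumed.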

11.3 The semantics actions

The semantics actions are regular Scheme expressions. These expressions appear in an environment where some "extra procedures" are defined. These procedures are:

the-port

bigloo rgc procedure

Returns the input port currently in use.

the-length

bigloo rgc procedure

Returns the length of the longest matching string.

the-string

bigloo rgc procedure

Get a copy of the last matching string. The function the-string returns a fresh copy of the match each time it is called. Consequently,

(let ((l1 (the-string)) (l2 (the-string)))
   (eq? l1 l2))
   => #f

the-substring start len

bigloo rgc procedure

Get a copy of a substring of the last matching string. If len is negative, its absolute value is subtracted from the whole match length. Here is an example of a rule extracting the part of a match enclosed between double quotes:

(regular-grammar ()
   ((: #\" (* (out #\")) #\")
    (the-substring 1 (-fx (the-length) 1))))

Which can also be written:

(regular-grammar ()
   ((: #\" (* (out #\")) #\")
    (the-substring 1 -1)))

the-character

bigloo rgc procedure

the-byte

bigloo rgc procedure

Returns the first character of a match (respectively, the first byte).

the-byte-ref n

bigloo rgc procedure

Returns the n-th byte of the matching string.

the-symbol

bigloo rgc procedure

the-downcase-symbol

bigloo rgc procedure

the-upcase-symbol

bigloo rgc procedure

the-subsymbol start length

bigloo rgc procedure

Convert the last matching string into a symbol. The function the-subsymbol obeys the same rules as the-substring.

the-keyword

bigloo rgc procedure

the-downcase-keyword

bigloo rgc procedure

the-upcase-keyword

bigloo rgc procedure

Convert the last matching string into a keyword.

the-fixnum

bigloo rgc procedure

The conversion of the last matching string to fixnum.

the-flonum

bigloo rgc procedure

The conversion of the last matching string to flonum.
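The conversion procedures are typically combined in one grammar, each rule returning the value that fits the token it matched. The sketch below assumes a hypothetical grammar *token-grammar* and a hypothetical helper three-tokens.

```scheme
(module rgc-conversion-sketch)

;; Hypothetical grammar: each rule converts the match to a value.
(define *token-grammar*
   (regular-grammar ()
      ;; digits, a dot, digits: converted to a flonum
      ((: (+ digit) #\. (+ digit))
       (the-flonum))
      ;; digits only: converted to a fixnum
      ((+ digit)
       (the-fixnum))
      ;; identifier: converted to a lower case symbol
      ((: alpha (* alnum))
       (the-downcase-symbol))
      ((+ (in #\space #\tab #\newline))
       (ignore))
      (else
       (the-failure))))

;; Hypothetical helper: reads the first three tokens of STR.
(define (three-tokens str)
   (let* ((port (open-input-string str))
          (t1 (read/rp *token-grammar* port))
          (t2 (read/rp *token-grammar* port))
          (t3 (read/rp *token-grammar* port)))
      (list t1 t2 t3)))

;; (three-tokens "Foo 42 3.5") returns the symbol foo, the fixnum 42,
;; and the flonum 3.5. The longest match rule makes "3.5" a flonum
;; rather than the fixnum 3.
```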

the-failure

bigloo rgc procedure

Returns the first char that the grammar can't match or the end of file object.
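A common pattern, sketched below under the assumption of a hypothetical grammar *word-grammar*, is to call the-failure in the else rule and to distinguish the end of file from an illegal character.

```scheme
(module rgc-failure-sketch)

;; Hypothetical grammar: words are accepted; anything else is
;; reported either as the end of file or as an illegal character.
(define *word-grammar*
   (regular-grammar ()
      ((+ alpha)
       (the-string))
      (else
       (let ((c (the-failure)))
          (if (eof-object? c)
              'eof
              (cons 'illegal c))))))

;; Hypothetical helper: reads the first token of STR.
(define (first-token str)
   (read/rp *word-grammar* (open-input-string str)))

;; (first-token "ab")  returns "ab"
;; (first-token "")    returns eof
;; (first-token "?")   returns (illegal . #\?)
```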

ignore

bigloo rgc procedure

Ignore the parsing, keep reading. It's better to use (ignore) rather than an expression like (read/rp grammar port) in semantics actions since the (ignore) call will be done in a tail recursive way. For instance,

(let ((g (regular-grammar ()
            (")" 
             '())
            ("(" 
             (let* ((car (ignore))
                    (cdr (ignore)))
                (cons car cdr)))
            ((+ (out "()"))
             (the-string))))
      (p (open-input-string "(foo(bar(gee)))")))
   (read/rp g p))
   => ("foo" "bar" "gee")

rgc-context [context]

bigloo rgc procedure

If no context is provided, this procedure resets the reader context state; that is, the reader is in no context. With one argument, it sets the reader in the context context. For instance,

(let ((g (regular-grammar ()
            ((context foo "foo") (print 'foo-bis))
            ("foo" (rgc-context 'foo) (print 'foo) (ignore))
            (else 'done)))
      (p (open-input-string "foofoo")))
   (read/rp g p))
   -| foo
      foo-bis

Note that RGC contexts are preserved across different uses of read/rp.

the-context

bigloo rgc procedure

Returns the value of the current RGC context.

11.4 Options and user definitions

Options act as parameters that are transmitted to the parser on the call to read/rp. Local defines are user functions inserted in the produced parser, at the same level as the pre-defined ignore function.

Here is an example of a grammar using both:

(define gram
   (regular-grammar (x y)
      
      (define (foo s)
         (cons* 'foo x s (ignore)))
      (define (bar s)
         (cons* 'bar y s (ignore)))

      ((+ #\a) (foo (the-string)))
      ((+ #\b) (bar (the-string)))
      (else '())))

This grammar uses two options, x and y. Hence, when invoked, it takes two additional values, such as:

(with-input-from-string "aabb"
   (lambda ()
      (read/rp gram (current-input-port) 'option-x 'option-y)))
   => (foo option-x aa bar option-y bb)

11.5 Examples of regular grammar

The reader who wants to find a real example should read the code of Bigloo's reader. Here are some small examples.

11.5.1 Word count

The first example presents a grammar that simulates the Unix program wc.

(let ((*char* 0)
      (*word* 0)
      (*line* 0))
   (regular-grammar ()
      ((+ #\Newline)
       (set! *char* (+ *char* (the-length)))
       (set! *line* (+ *line* (the-length)))
       (ignore))
      ((+ (in #\space #\tab))
       (set! *char* (+ *char* (the-length)))
       (ignore))
      ((+ (out #\newline #\space #\tab))
       (set! *char* (+ *char* (the-length)))
       (set! *word* (+ 1 *word*))
       (ignore))))
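The grammar above only updates its counters. A minimal sketch of how it could be wrapped into a function that returns the three counts is given below. The function name wc, the explicit else rule, and the shape of the result are assumptions made for the illustration.

```scheme
(module rgc-wc-sketch)

;; Hypothetical wrapper: counts the lines, words, and characters read
;; from PORT and returns them as a list (lines words chars).
(define (wc port)
   (let ((chars 0)
         (words 0)
         (lines 0))
      (let ((g (regular-grammar ()
                  ((+ #\Newline)
                   (set! chars (+ chars (the-length)))
                   (set! lines (+ lines (the-length)))
                   (ignore))
                  ((+ (in #\space #\tab))
                   (set! chars (+ chars (the-length)))
                   (ignore))
                  ((+ (out #\newline #\space #\tab))
                   (set! chars (+ chars (the-length)))
                   (set! words (+ 1 words))
                   (ignore))
                  ;; Assumption: an explicit else rule stops the
                  ;; reading when no other rule applies, which is the
                  ;; case at the end of file.
                  (else
                   (the-failure)))))
         (read/rp g port)
         (list lines words chars))))

;; (wc (open-input-string "hello world\nfoo\n")) returns (2 3 16)
```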

11.5.2 Roman numbers

The second example presents a grammar that reads Arabic and Roman numbers.

(let ((par-open 0))
   (regular-grammar ((arabic (in ("09")))
                     (roman  (uncase (in "ivxlcdm"))))
      ((+ (in #" \t\n"))
       (ignore))
      ((+ arabic)
       (string->integer (the-string)))
      ((+ roman)
       (roman->arabic (the-string)))
      (#\(
       (let ((open-key par-open))
          (set! par-open (+ 1 par-open))
          (rgc-context 'pair)
          (let loop-pair ((walk (ignore))) 
             (cond
                ((= open-key par-open)
                 '())
                (else
                 (cons walk (loop-pair (ignore))))))))
      (#\)
       (set! par-open (- par-open 1))
       (if (< par-open 0)
           (begin
              (set! par-open 0)
              (ignore))
           #f))
      ((in "+-*\\")
       (string->symbol (the-string)))
      (else
       (let ((char (the-failure)))
          (if (eof-object? char)
              char
              (error "grammar-roman" "Illegal char" char))))))
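The grammar above calls roman->arabic, which is not defined in this section. One possible definition is sketched below; it is an assumption made for the illustration, not necessarily the definition used by the original example. It scans the numeral from right to left and subtracts a digit when it is smaller than the digit on its right. The helper roman-digit-value is also hypothetical.

```scheme
(module rgc-roman-sketch)

;; Hypothetical helper: value of one roman digit, whatever its case.
(define (roman-digit-value c)
   (case (char-downcase c)
      ((#\i) 1)
      ((#\v) 5)
      ((#\x) 10)
      ((#\l) 50)
      ((#\c) 100)
      ((#\d) 500)
      ((#\m) 1000)
      (else (error "roman->arabic" "Illegal roman digit" c))))

;; Converts a roman numeral, given as a string, into an integer.
;; The string is scanned from right to left; PREV is the value of the
;; digit located on the right of the current one.
(define (roman->arabic str)
   (let loop ((i (- (string-length str) 1))
              (prev 0)
              (sum 0))
      (if (< i 0)
          sum
          (let ((v (roman-digit-value (string-ref str i))))
             (loop (- i 1)
                   v
                   (if (< v prev)
                       (- sum v)
                       (+ sum v)))))))

;; (roman->arabic "xiv")     returns 14
;; (roman->arabic "MCMXCIV") returns 1994
```

This sketch does not reject malformed numerals such as "iiii"; it only computes a value for any sequence of roman digits.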