Regular parsing

Programming languages have poor reading libraries since the lexical information that can be specified is directly tied to the structure of the language. For example, in C it's hard to read a rational number because there is no type rational. Programs have been written to circumvent this problem: Lex [Lesk75], for example, is one of them. We choose to incorporate in Bigloo a set of new functions to assist in such parsing. The syntax for regular grammar (also known as regular analyser) of Bigloo 2.0 (the one described in this document) is not compatible with former Bigloo versions.

A new way of reading

There is only one way in Bigloo to read text, regular reading, which is done by the new form:

read/rp regular-grammar input-portbigloo procedure

The first argument is a regular grammar (also known as regular analyser) and the second a Scheme port. This way of reading is almost the same as the Lex's one. The reader tries to match the longest input, from the stream pointed to by input-port, with one of several regular expressions contained in regular-grammar. If many rules match, the reader takes the first one defined in the grammar. When the regular rule has been found the corresponding Scheme expression is evaluated.

remark: The traditional read Scheme function is implemented as:

(define-inline (read port)
   (read/rp scheme-grammar port))

.keep

The syntax of the regular grammar

A regular grammar is built by the means of the form regular-grammar:

regular-grammar (binding ...) rule ...bigloo syntax

The binding and rule are defined by the following grammar:

<binding>  ⇒ (<variable> <re>)
             | <option>
<option>   ⇒ <variable>
<rule>     ⇒ <define>
             | (<cre> <s-expression> <s-expression> ...)
             | (else <s-expression> <s-expression> ...)
<define>   ⇒ (define <s-expression>)
<cre>      ⇒ <re>
             | (context <symbol> <re>)
             | (when <s-expr> <re>)
             | (bol <re>)
             | (eol <re>)
             | (bof <re>)
             | (eof <re>)
<re>       ⇒ <variable>
             | <char>
             | <string>
             | (: <re> ...)
             | (or <re> ...)
             | (* <re>)
             | (+ <re>)
             | (? <re>)
             | (= <integer> <re>)
             | (>= <integer> <re>)
             | (** <integer> <integer> <re>)
             | (... <integer> <re>)
             | (uncase <re>)
             | (in <cset> ...)
             | (out <cset> ...)
             | (and <cset> <cset>)
             | (but <cset> <cset>)
             | (posix <string>)
<variable> ⇒ <symbol>
<cset>     ⇒ <string>
             | <char>
             | (<string>)
             | (<char> <char>)

Here is a description of each construction.

(context <symbol> <re>)

protect

protected

The Semantics Actions

(when <s-expr> <re>)

protect

protected

<s-expr>

#t

(define *g*
   (let ((armed #f))
      (regular-grammar ()
         ((when (not armed) (: "#!" (+ (or #\/ alpha))))
          (set! armed #t)
          (print "start [" (the-string) "]")
          (ignore))
         ((+ (in #\Space #\Tab))
          (ignore))
         (else
          (the-failure)))))
   
(define (main argv)
   (let ((port (open-input-string "#!/bin/sh #!/bin/zsh")))
      (print (read/rp *g* port))))

(bol <re>)

<re>

(eol <re>)

<re>

(bof <re>)

<re>

(eof <re>)

<re>

<variable>

all    ⇔ (out #\Newline)
lower  ⇔ (in ("az"))
upper  ⇔ (in ("AZ"))
alpha  ⇔ (or lower upper)
digit  ⇔ (in ("09"))
xdigit ⇔ (uncase (in ("af09")))
alnum  ⇔ (uncase (in ("az09")))
punct  ⇔ (in ".,;!?")
blank  ⇔ (in #" \t\n")
space  ⇔ #\Space

(regular-grammar ((ident  (: alpha (* alnum)))
                  (number (+ digit)))
   (ident  (cons 'ident (the-string)))
   (number (cons 'number (the-string)))
   (else   (cons 'else (the-failure))))

<char>

#\a

#\b

(regular-grammar ()
   (#\a (cons 'a (the-string)))
   (#\b (cons 'b (the-string)))
   (else (cons 'else (the-failure))))

<string>

"Bigloo"

#\B #\i #\g #\l #\o #\o

".*["

#\. #\* #\[

(: <re> ...)

<re1> <re2> ... <ren>

<re1>

<re2>

<ren>

(: "x" all "y")

#\x

#\y

(or <re> ...)

(or re1 re2)

re1

re2

(* <re>)

(* <re>)

<re>

(* "abc")

abc

abcabc

abcabcabc

(+ <re>)

(+ re)

(: re (* re))

(+ "abc")

abc

abcabc

(? <re>)

(? "abc")

abc

(= <integer> <re>)

(= num re)

(: re re ... re)

(= 3 "abc")

abcabcabc

<integer>

81

(>= <integer> <re>)

(>= int re)

int

re

(>= 10 #\a)

#\a

<integer>

81

(** <integer> <integer> <re>)

(** min max re)

re

min

max

(** 10 20 #\a)

<integer>

81

(... <integer> <re>)

<re>

:

(... int re)

re

int

re

(... 3 "begin")

(or "b" "be" "beg")

(uncase <re>)

<re>

(uncase re)

re

(uncase "begin")

"begin"

"beGin"

"BEGIN"

"BegiN"

(in <cset> ...)

(in #\a #\b #\c #\d)

(in "abcd")

(in
#\a #\b #\c #\d)

(in (#\a
#\d))

(in #\a #\b #\c #\d)

(in ("ad"))

(in #\a #\b #\c #\d)

(out <cset> ...)

(out cset ...)

(in cset ...)

(out ("azAZ") (#\0 #\9))

out

#\Newline

(or (out #\Newline) #a000)

(and <cset> <cset>)

(and cset1 cset2)

cset1

cset2

(but <cset> <cset>)

(but cset1 cset2)

cset1

cset2

(posix <string>)

(posix string)

(posix "[az]+|x*|y{3,5}")

(or (+ (in ("az"))) (* "x") (** 3 5 "y"))

.keep

string-case string rule ...bigloo syntax

This form dispatches on strings. it opens an input on string a read into it according to the regular grammar defined by the binding and rule. Example:

(define (suffix string)
   (string-case string
      ((: (* all) ".")
       (ignore))
      ((+ (out #\.))
       (the-string))
      (else
       "")))

.keep

The semantics actions

The semantics actions are regular Scheme expressions. These expressions appear in an environment where some ``extra procedures'' are defined. These procedures are:

the-portbigloo rgc procedure

Returns the input port currently in used.

.keep

the-lengthbigloo rgc procedure

Get the length of the biggest matching string.

.keep

the-stringbigloo rgc procedure

Get a copy of the last matching string. The function the-string returns a fresh copy of the matching each time it is called. In consequence,

(let ((l1 (the-string)) (l2 (the-string)))
   (eq? l1 l2))
   ⇒ #f

.keep

the-substring start lenbigloo rgc procedure

Get a copy of a substring of the last matching string. If the len is negative, it is subtracted to the whole match length. Here is an example of a rule extracting a part of a match:

(regular-grammar ()
   ((: #\" (* (out #\")) #\")
    (the-substring 1 (-fx (the-length) 1))))

Which can also be written:

(regular-grammar ()
   ((: #\" (* (out #\")) #\")
    (the-substring 1 -1)))

.keep

the-characterbigloo rgc procedure

the-bytebigloo rgc procedure

Returns the first character of a match (respectively, the first byte).

.keep

the-byte-ref nbigloo rgc procedure

Returns the n-th bytes of the matching string.

.keep

the-symbolbigloo rgc procedure

the-downcase-symbolbigloo rgc procedure

the-upcase-symbolbigloo rgc procedure

the-subsymbol start lengthbigloo rgc procedure

Convert the last matching string into a symbol. The function the-subsymbol obeys the same rules as the-substring.

.keep

the-keywordbigloo rgc procedure

the-downcase-keywordbigloo rgc procedure

the-upcase-keywordbigloo rgc procedure

Convert the last matching string into a keyword.

.keep

the-fixnumbigloo rgc procedure

The conversion of the last matching string to fixnum.

.keep

the-flonumbigloo rgc procedure

The conversion of the last matching string to flonum.

.keep

the-failurebigloo rgc procedure

Returns the first char that the grammar can't match or the end of file object.

.keep

ignorebigloo rgc procedure

Ignore the parsing, keep reading. It's better to use (ignore) rather than an expression like (read/rp grammar port) in semantics actions since the (ignore) call will be done in a tail recursive way. For instance,

(let ((g (regular-grammar ()
            (")" 
             '())
            ("(" 
             (let* ((car (ignore))
                    (cdr (ignore)))
                (cons car cdr)))
            ((+ (out "()"))
             (the-string))))
      (p (open-input-string "(foo(bar(gee)))")))
   (read/rp g p))
   ⇒ ("foo" ("bar" ("gee")))

.keep

rgc-context [context]bigloo rgc procedure

If no context is provide, this procedure reset the reader context state. That is the reader is in no context. With one argument, context set the reader in the context context. For instance,

(let ((g (regular-grammar ()
            ((context foo "foo") (print 'foo-bis))
            ("foo" (rgc-context 'foo) (print 'foo) (ignore))
            (else 'done)))
      (p (open-input-string "foofoo")))
   (read/rp g p))
   ⇥ foo
      foo-bis

Note that RGC context are preserved across different uses of read/rp.

.keep

the-contextbigloo rgc procedure

Returns the value of the current Rgc context.

.keep

Options and user definitions

Options act as parameters that are transmitted to the parser on the call to read/rp. Local defines are user functions inserted in the produced parser, at the same level as the pre-defined ignore function.

Here is an example of grammar using both

(define gram
   (regular-grammar (x y)
      
      (define (foo s)
         (cons* 'foo x s (ignore)))
      (define (bar s)
         (cons* 'bar y s (ignore)))

      ((+ #\a) (foo (the-string)))
      ((+ #\b) (bar (the-string)))
      (else '())))

This grammar uses two options x and y. Hence when invokes it takes two additional values such as:

(with-input-from-string "aabb"
   (lambda ()
      (read/rp gram (current-input-port) 'option-x 'option-y)))
   ⇒ (foo option-x aa bar option-y bb)

Examples of regular grammar

The reader who wants to find a real example should read the code of Bigloo's reader. But here are small examples

Word count

The first example presents a grammar that simulates the Unix program wc.

(let ((*char* 0)
      (*word* 0)
      (*line* 0))
   (regular-grammar ()
      ((+ #\Newline)
       (set! *char* (+ *char* (the-length)))
       (set! *line* (+ *line* (the-length)))
       (ignore))
      ((+ (in #\space #\tab))
       (set! *char* (+ *char* (the-length)))
       (ignore))
      ((+ (out #\newline #\space #\tab))
       (set! *char* (+ *char* (the-length)))
       (set! *word* (+ 1 *word*))
       (ignore))))

Roman numbers

The second example presents a grammar that reads Arabic and Roman number.

(let ((par-open 0))
   (regular-grammar ((arabic (in ("09")))
                     (roman  (uncase (in "ivxlcdm"))))
      ((+ (in #" \t\n"))
       (ignore))
      ((+ arabic)
       (string-&gt;integer (the-string)))
      ((+ roman)
       (roman-&gt;arabic (the-string)))
      (#\(
       (let ((open-key par-open))
          (set! par-open (+ 1 par-open))
          (context 'pair)
          (let loop-pair ((walk (ignore))) 
             (cond
                ((= open-key par-open)
                 '())
                (else
                 (cons walk (loop-pair (ignore))))))))
      (#\)
       (set! par-open (- par-open 1))
       (if (&lt; par-open 0)
           (begin
              (set! par-open 0)
              (ignore))
           #f))
      ((in "+-*\\")
       (string-&gt;symbol (the-string)))
      (else
       (let ((char (the-failure)))
          (if (eof-object? char)
              char
              (error "grammar-roman" "Illegal char" char))))))

Bigloo / Regular parsing

Regular parsing

A new way of reading

read/rp regular-grammar input-portbigloo procedure

The syntax of the regular grammar

regular-grammar (binding ...) rule ...bigloo syntax

string-case string rule ...bigloo syntax

The semantics actions

the-portbigloo rgc procedure

the-lengthbigloo rgc procedure

the-stringbigloo rgc procedure

the-substring start lenbigloo rgc procedure

the-characterbigloo rgc procedure

the-bytebigloo rgc procedure

the-byte-ref nbigloo rgc procedure

the-symbolbigloo rgc procedure

the-downcase-symbolbigloo rgc procedure

the-upcase-symbolbigloo rgc procedure

the-subsymbol start lengthbigloo rgc procedure

the-keywordbigloo rgc procedure

the-downcase-keywordbigloo rgc procedure

the-upcase-keywordbigloo rgc procedure

the-fixnumbigloo rgc procedure

the-flonumbigloo rgc procedure

the-failurebigloo rgc procedure

ignorebigloo rgc procedure

rgc-context [context]bigloo rgc procedure

the-contextbigloo rgc procedure

Options and user definitions

Examples of regular grammar

Word count

Roman numbers