Regular parsing
Programming languages have poor reading libraries since the lexical information that can be specified is directly tied to the structure of the language. For example, in C it's hard to read a rational number because there is no type rational. Programs have been written to circumvent this problem: Lex [Lesk75], for example, is one of them. We choose to incorporate in Bigloo a set of new functions to assist in such parsing. The syntax for regular grammar (also known as regular analyser) of Bigloo 2.0 (the one described in this document) is not compatible with former Bigloo versions.A new way of reading
There is only one way in Bigloo to read text, regular reading, which is done by the new form:read/rp regular-grammar input-portbigloo procedure
The first argument is a regular grammar (also known as regular analyser) and the second a Scheme port. This way of reading is almost the same as the Lex's one. The reader tries to match the longest input, from the stream pointed to by input-port, with one of several regular expressions contained in regular-grammar. If many rules match, the reader takes the first one defined in the grammar. When the regular rule has been found the corresponding Scheme expression is evaluated. remark: The traditionalread
Scheme function is implemented as:
(define-inline (read port) (read/rp scheme-grammar port))
.keep
The syntax of the regular grammar
A regular grammar is built by the means of the formregular-grammar
:
regular-grammar (binding ...) rule ...bigloo syntax
The binding and rule are defined by the following grammar:<binding> ⇒ (<variable> <re>) | <option> <option> ⇒ <variable> <rule> ⇒ <define> | (<cre> <s-expression> <s-expression> ...) | (Here is a description of each construction.else
<s-expression> <s-expression> ...) <define> ⇒ (define <s-expression>) <cre> ⇒ <re> | (context
<symbol> <re>) | (when
<s-expr> <re>) | (bol
<re>) | (eol
<re>) | (bof
<re>) | (eof
<re>) <re> ⇒ <variable> | <char> | <string> | (:
<re> ...) | (or
<re> ...) | (*
<re>) | (+
<re>) | (?
<re>) | (=
<integer> <re>) | (>=
<integer> <re>) | (**
<integer> <integer> <re>) | (...
<integer> <re>) | (uncase
<re>) | (in
<cset> ...) | (out
<cset> ...) | (and
<cset> <cset>) | (but
<cset> <cset>) | (posix
<string>) <variable> ⇒ <symbol> <cset> ⇒ <string> | <char> | (<string>) | (<char> <char>)
(context <symbol> <re>)
This allows us to protect an expression. A protected
expression matches (or accepts) a word only if the grammar has been set to
the corresponding context. The Semantics Actions, for more details.
(when <s-expr> <re>)
This allows us to protect an expression. A protected
expression matches (or accepts) a word only if the evaluation of
<s-expr>
is #t
. For instance,
(define *g* (let ((armed #f)) (regular-grammar () ((when (not armed) (: "#!" (+ (or #\/ alpha)))) (set! armed #t) (print "start [" (the-string) "]") (ignore)) ((+ (in #\Space #\Tab)) (ignore)) (else (the-failure))))) (define (main argv) (let ((port (open-input-string "#!/bin/sh #!/bin/zsh"))) (print (read/rp *g* port))))
(bol <re>)
<re>
at the beginning of line.
(eol <re>)
<re>
at the end of line.
(bof <re>)
<re>
at the beginning of file.
(eof <re>)
<re>
at the end of file.
<variable>
all ⇔ (out #\Newline) lower ⇔ (in ("az")) upper ⇔ (in ("AZ")) alpha ⇔ (or lower upper) digit ⇔ (in ("09")) xdigit ⇔ (uncase (in ("af09"))) alnum ⇔ (uncase (in ("az09"))) punct ⇔ (in ".,;!?") blank ⇔ (in #" \t\n") space ⇔ #\SpaceIt is a error to reference a variable that it is not bound by a <binding>. Defining a variable that already exists is acceptable and causes the former variable definition to be erased. Here is an example of a grammar that binds two variables, one called ident and one called number. These two variables are used within the grammar to match identifiers and numbers.
(regular-grammar ((ident (: alpha (* alnum))) (number (+ digit))) (ident (cons 'ident (the-string))) (number (cons 'number (the-string))) (else (cons 'else (the-failure))))
<char>
#\a
or the character
#\b
:
(regular-grammar () (#\a (cons 'a (the-string))) (#\b (cons 'b (the-string))) (else (cons 'else (the-failure))))
<string>
"Bigloo"
matches
only the string composed of #\B #\i #\g #\l #\o #\o
. The regular
expression ".*["
matches the string #\. #\* #\[
.
(: <re> ...)
<re1> <re2> ... <ren>
matches the language construction by
concatenation of the language described by <re1>
, <re2>
,
<ren>
. Thus, (: "x" all "y")
matches all words of three
letters, started by character the #\x
and ended with the character
#\y
.
(or <re> ...)
(or re1 re2)
accepts words accepted by either re1
or re2
.
(* <re>)
(* <re>)
is
the language containing, 0 or more occurrences of <re>
. Thus,
the language described by (* "abc")
accepts the empty word and
any word composed by a repetition of the abc
(abc
,
abcabc
, abcabcabc
, ...).
(+ <re>)
(+ re)
is
equivalent to (: re (* re))
. Thus, (+ "abc")
matches the
words abc
, abcabc
, etc.
(? <re>)
(? "abc")
matches the empty word or the words abc
.
(= <integer> <re>)
(= num re)
is equivalent to (: re re ... re)
. Thus,
the expression (= 3 "abc")
matches the only word abcabcabc
.
In order to avoid code size explosion when compiling, <integer>
must be smaller than an arbitrary constant. In the current version that
value is 81
.
(>= <integer> <re>)
(>= int re)
accepts word
that are, at least, int
repetitions of re
. For instance,
(>= 10 #\a)
, accepts words compound of, at least, 10 times the
character #\a
. In order to avoid code size explosion when compiling,
<integer>
must be smaller than an arbitrary constant. In the current
version that value is 81
.
(** <integer> <integer> <re>)
(** min max re)
accepts
word that are repetitions of re
; the number of repetition is in
the range min
, max
. For instance, (** 10 20 #\a)
.
In order to avoid code size explosion when compiling,
<integer>
must be smaller than an arbitrary constant. In the current
version that value is 81
.
(... <integer> <re>)
<re>
has to be a sequence
of characters. Sequences are build by the operator :
or by string
literals. The language described by (... int re)
, denotes, the
first letter of re
, or the two first letters of re
, or the
three first letters of re
or the int
first letters of
re
. Thus, (... 3 "begin")
is equivalent to
(or "b" "be" "beg")
.
(uncase <re>)
<re>
has to be a sequence
construction. The language described by (uncase re)
is the
same as re
where letters may be upper case or lower case. For
instance, (uncase "begin")
, accepts the words "begin"
,
"beGin"
, "BEGIN"
, "BegiN"
, etc.
(in <cset> ...)
(in #\a #\b #\c #\d)
. They may be described by
strings. The expression (in "abcd")
is equivalent to (in
#\a #\b #\c #\d)
. Characters may also be described using a range
notation that is a list of two characters. The expression (in (#\a
#\d))
is equivalent to (in #\a #\b #\c #\d)
. The Ranges may be
expresses using lists of string. The expression (in ("ad"))
is equivalent to (in #\a #\b #\c #\d)
.
(out <cset> ...)
(out cset ...)
is opposite to
the one described by (in cset ...)
. For instance,
(out ("azAZ") (#\0 #\9))
accepts all words of one character
that are neither letters nor digits. One should not that if the character
numbered zero may be used inside regular grammar, the out
construction never matches it. Thus to write a rule that, for instances,
matches every character but #\Newline
including the character
zero, one should write:
(or (out #\Newline) #a000)
(and <cset> <cset>)
(and cset1 cset2)
accepts words
made of characters that are in both cset1
and cset2
.
(but <cset> <cset>)
(but cset1 cset2)
accepts words
made of characters of cset1
that are not member of cset2
.
(posix <string>)
(posix string)
allows one to use Posix string
notation for regular expressions. So, for example, the following two
expressions are equivalent:
(posix "[az]+|x*|y{3,5}") (or (+ (in ("az"))) (* "x") (** 3 5 "y"))
.keep
string-case string rule ...bigloo syntax
This form dispatches on strings. it opens an input onstring
a read into it according to the regular grammar defined by the
binding
and rule
. Example:
(define (suffix string) (string-case string ((: (* all) ".") (ignore)) ((+ (out #\.)) (the-string)) (else "")))
.keep
The semantics actions
The semantics actions are regular Scheme expressions. These expressions appear in an environment where some ``extra procedures'' are defined. These procedures are:the-portbigloo rgc procedure
Returns the input port currently in used..keep
the-lengthbigloo rgc procedure
Get the length of the biggest matching string..keep
the-stringbigloo rgc procedure
Get a copy of the last matching string. The functionthe-string
returns a fresh copy of the matching each time it is called. In consequence,
(let ((l1 (the-string)) (l2 (the-string))) (eq? l1 l2)) ⇒ #f
.keep
the-substring start lenbigloo rgc procedure
Get a copy of a substring of the last matching string. If the len is negative, it is subtracted to the whole match length. Here is an example of a rule extracting a part of a match:(regular-grammar () ((: #\" (* (out #\")) #\") (the-substring 1 (-fx (the-length) 1))))Which can also be written:
(regular-grammar () ((: #\" (* (out #\")) #\") (the-substring 1 -1)))
.keep
the-characterbigloo rgc procedure
the-bytebigloo rgc procedure
Returns the first character of a match (respectively, the first byte)..keep
the-byte-ref nbigloo rgc procedure
Returns the n-th bytes of the matching string..keep
the-symbolbigloo rgc procedure
the-downcase-symbolbigloo rgc procedure
the-upcase-symbolbigloo rgc procedure
the-subsymbol start lengthbigloo rgc procedure
Convert the last matching string into a symbol. The functionthe-subsymbol
obeys the same rules as the-substring
.
.keep
the-keywordbigloo rgc procedure
the-downcase-keywordbigloo rgc procedure
the-upcase-keywordbigloo rgc procedure
Convert the last matching string into a keyword..keep
the-fixnumbigloo rgc procedure
The conversion of the last matching string to fixnum..keep
the-flonumbigloo rgc procedure
The conversion of the last matching string to flonum..keep
the-failurebigloo rgc procedure
Returns the first char that the grammar can't match or the end of file object..keep
ignorebigloo rgc procedure
Ignore the parsing, keep reading. It's better to use(ignore)
rather than an expression like (read/rp grammar port)
in semantics actions since the (ignore)
call will be done in a
tail recursive way. For instance,
(let ((g (regular-grammar () (")" '()) ("(" (let* ((car (ignore)) (cdr (ignore))) (cons car cdr))) ((+ (out "()")) (the-string)))) (p (open-input-string "(foo(bar(gee)))"))) (read/rp g p)) ⇒ ("foo" ("bar" ("gee")))
.keep
rgc-context [context]bigloo rgc procedure
If no context is provide, this procedure reset the reader context state. That is the reader is in no context. With one argument,context
set the reader in the context context. For instance,
(let ((g (regular-grammar () ((context foo "foo") (print 'foo-bis)) ("foo" (rgc-context 'foo) (print 'foo) (ignore)) (else 'done))) (p (open-input-string "foofoo"))) (read/rp g p)) ⇥ foo foo-bisNote that RGC context are preserved across different uses of
read/rp
.
.keep
the-contextbigloo rgc procedure
Returns the value of the current Rgc context..keep
Options and user definitions
Options act as parameters that are transmitted to the parser on the call toread/rp
. Local defines are user functions inserted in the produced
parser, at the same level as the pre-defined ignore
function.
Here is an example of grammar using both
(define gram (regular-grammar (x y) (define (foo s) (cons* 'foo x s (ignore))) (define (bar s) (cons* 'bar y s (ignore))) ((+ #\a) (foo (the-string))) ((+ #\b) (bar (the-string))) (else '())))This grammar uses two options x and y. Hence when invokes it takes two additional values such as:
(with-input-from-string "aabb" (lambda () (read/rp gram (current-input-port) 'option-x 'option-y))) ⇒ (foo option-x aa bar option-y bb)
Examples of regular grammar
The reader who wants to find a real example should read the code of Bigloo's reader. But here are small examplesWord count
The first example presents a grammar that simulates the Unix programwc
.
(let ((*char* 0) (*word* 0) (*line* 0)) (regular-grammar () ((+ #\Newline) (set! *char* (+ *char* (the-length))) (set! *line* (+ *line* (the-length))) (ignore)) ((+ (in #\space #\tab)) (set! *char* (+ *char* (the-length))) (ignore)) ((+ (out #\newline #\space #\tab)) (set! *char* (+ *char* (the-length))) (set! *word* (+ 1 *word*)) (ignore))))
Roman numbers
The second example presents a grammar that reads Arabic and Roman number.(let ((par-open 0)) (regular-grammar ((arabic (in ("09"))) (roman (uncase (in "ivxlcdm")))) ((+ (in #" \t\n")) (ignore)) ((+ arabic) (string->integer (the-string))) ((+ roman) (roman->arabic (the-string))) (#\( (let ((open-key par-open)) (set! par-open (+ 1 par-open)) (context 'pair) (let loop-pair ((walk (ignore))) (cond ((= open-key par-open) '()) (else (cons walk (loop-pair (ignore)))))))) (#\) (set! par-open (- par-open 1)) (if (< par-open 0) (begin (set! par-open 0) (ignore)) #f)) ((in "+-*\\") (string->symbol (the-string))) (else (let ((char (the-failure))) (if (eof-object? char) char (error "grammar-roman" "Illegal char" char))))))