Posix Regular Expressions |
|
|
This whole section was written by Dorai Sitaram.
It consists of the documentation of the pregexp package, which may be
found at http://www.ccs.neu.edu/~dorai/pregexp/pregexp.html.
The regexp notation supported is modeled on Perl's, and includes such
powerful directives as numeric and nongreedy quantifiers, capturing and
non-capturing clustering, POSIX character classes, selective case- and
space-insensitivity, backreferences, alternation, backtrack pruning,
positive and negative lookahead and lookbehind, in addition to the more
basic directives familiar to all regexp users.

A regexp is a string that describes a pattern. A regexp matcher tries
to match this pattern against (a portion of) another string, which we
will call the text string. The text string is treated as raw text and
not as a pattern.

Most of the characters in a regexp pattern are meant to match
occurrences of themselves in the text string. Thus, the pattern
"abc" matches a string that contains the characters a, b, c in
succession.

In the regexp pattern, some characters act as metacharacters, and some
character sequences act as metasequences. That is, they specify
something other than their literal selves. For example, in the
pattern "a.c", the characters a and c do stand for themselves but the
metacharacter . can match any character (other than newline).
Therefore, the pattern "a.c" matches an a, followed by any character,
followed by a c.

If we needed to match the character . itself, we escape it, i.e.,
precede it with a backslash (\). The character sequence \. is thus a
metasequence, since it doesn't match itself but rather just a
literal . character. So, to match a followed by a literal . followed
by c, we use the regexp pattern "a\\.c". 1
Another example of a metasequence is \t, which is a
readable way to represent the tab character.

We will call the string representation of a regexp the
U-regexp, where U can be taken to mean Unix-style or
universal, because this notation for regexps is universally
familiar. Our implementation uses an intermediate tree-like
representation called the S-regexp, where S can stand for Scheme,
symbolic, or s-expression. S-regexps are more verbose
and less readable than U-regexps, but they are much
easier for Scheme's recursive procedures to navigate.
13.1 Regular Expressions Procedures
|
Five procedures, pregexp, pregexp-match-positions,
pregexp-match, pregexp-replace, and
pregexp-replace*, enable compilation and matching of regular
expressions.
pregexp U-regexp . opt-args | bigloo procedure |
The procedure pregexp takes a U-regexp, which is a
string, and returns an S-regexp, which is a tree.
(pregexp "c.r") => (:sub (:or (:seq #\c :any #\r)))
|
There is rarely any need to look at the S-regexps returned by pregexp .
The opt-args specifies how the regular expression is to be matched.
Until documented the argument should be the empty list.
|
pregexp-match-positions regexp string [beg 0] [end -1] | bigloo procedure |
The procedure pregexp-match-positions takes a
regexp pattern and a text string, and returns a match
if the pattern matches the text string.
The pattern may be either a U- or an S-regexp.
(pregexp-match-positions will internally compile a
U-regexp to an S-regexp before proceeding with the
matching. If you find yourself calling
pregexp-match-positions repeatedly with the same
U-regexp, it may be advisable to explicitly convert the
latter into an S-regexp once beforehand, using
pregexp , to save needless recompilation.)
pregexp-match-positions returns #f if the pattern did not
match the string; and a list of index pairs if it
did match. Eg,
(pregexp-match-positions "brain" "bird")
=> #f
(pregexp-match-positions "needle" "hay needle stack")
=> ((4 . 10))
|
In the second example, the integers 4 and 10 identify the substring
that was matched. The 4 is the starting (inclusive) index and the 10
the ending (exclusive) index of the matching substring.
(substring "hay needle stack" 4 10)
=> "needle"
|
Here, pregexp-match-positions 's return list contains only
one index pair, and that pair represents the entire
substring matched by the regexp. When we discuss
subpatterns later, we will see how a single match
operation can yield a list of submatches.
pregexp-match-positions takes optional third
and fourth arguments that specify the indices of
the text string within which the matching should
take place.
(pregexp-match-positions "needle"
"his hay needle stack -- my hay needle stack -- her hay needle stack"
24 43)
=> ((31 . 37))
|
Note that the returned indices are still reckoned
relative to the full text string.
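The advice above about compiling a U-regexp once and reusing it can be sketched as follows. This is only a minimal illustration: the names needle-re and needle-position are hypothetical and are not part of pregexp.

```scheme
;; Compile the U-regexp once; the compiled regexp is reused on every call.
;; needle-re and needle-position are hypothetical names for this sketch.
(define needle-re (pregexp "needle"))

(define (needle-position text)
  ;; Return the (start . end) pair of the first match, or #f.
  (let ((m (pregexp-match-positions needle-re text)))
    (and m (car m))))

(needle-position "hay needle stack")   ; => (4 . 10)
(needle-position "bird")               ; => #f
```

Because the pattern may be either a U- or an S-regexp, the call site looks the same whether or not the regexp was compiled beforehand.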
|
pregexp-match regexp string | bigloo procedure |
The procedure pregexp-match is called like
pregexp-match-positions
but instead of returning index pairs it returns the
matching substrings:
(pregexp-match "brain" "bird")
=> #f
(pregexp-match "needle" "hay needle stack")
=> ("needle")
|
pregexp-match also takes optional third and
fourth arguments, with the same meaning as does
pregexp-match-positions .
|
pregexp-replace regexp string1 string2 | bigloo procedure |
The procedure pregexp-replace replaces the
matched portion of the text string by another
string. The first argument is the regexp,
the second the text string, and the third
is the insert string (string to be inserted).
(pregexp-replace "te" "liberte" "ty")
=> "liberty"
|
If the pattern doesn't occur in the text string, the returned string is
identical (eq? ) to the text string.
|
pregexp-replace* regexp string1 string2 | bigloo procedure |
The procedure pregexp-replace* replaces all matches in the
text string1 by the insert string2 :
(pregexp-replace* "te" "liberte egalite fraternite" "ty")
=> "liberty egality fratyrnity"
|
As with pregexp-replace , if the pattern doesn't occur in the text
string, the returned string is identical (eq? ) to the text string.
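The difference between the two replacement procedures can be seen on a text with several matches. The sketch below assumes only the behaviour described above; squeeze-spaces is a hypothetical helper name.

```scheme
;; pregexp-replace rewrites only the first match;
;; pregexp-replace* rewrites every match.
(pregexp-replace  " +" "split   pea  soup" " ")   ; => "split pea  soup"
(pregexp-replace* " +" "split   pea  soup" " ")   ; => "split pea soup"

;; Hypothetical helper: collapse each run of spaces into a single space.
(define (squeeze-spaces text)
  (pregexp-replace* " +" text " "))

(squeeze-spaces "a  b   c")   ; => "a b c"
```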
|
pregexp-split regexp string | bigloo procedure |
The procedure pregexp-split takes two arguments, a
regexp pattern and a text string, and returns a list of
substrings of the text string, where the pattern identifies the
delimiter separating the substrings.
(pregexp-split ":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
=> ("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")
(pregexp-split " " "pea soup")
=> ("pea" "soup")
|
If the first argument can match an empty string, then
the list of all the single-character substrings is returned.
(pregexp-split "" "smithereens")
=> ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")
|
To identify one-or-more spaces as the delimiter,
take care to use the regexp " +" , not " *" .
(pregexp-split " +" "split pea soup")
=> ("split" "pea" "soup")
(pregexp-split " *" "split pea soup")
=> ("s" "p" "l" "i" "t" "p" "e" "a" "s" "o" "u" "p")
|
|
pregexp-quote string | bigloo procedure |
The procedure pregexp-quote takes an arbitrary string and
returns a U-regexp (string) that precisely represents it. In particular,
characters in the input string that could serve as regexp metacharacters are
escaped with a backslash, so that they safely match only themselves.
(pregexp-quote "cons")
=> "cons"
(pregexp-quote "list?")
=> "list\\?"
|
pregexp-quote is useful when building a composite regexp
from a mix of regexp strings and verbatim strings.
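As a sketch of such a composite regexp, the hypothetical helper below (versioned-name-re is not part of pregexp) quotes a verbatim name and appends a genuine regexp suffix that requires a dot and a run of digits.

```scheme
;; Hypothetical helper: build a U-regexp from a verbatim name plus a
;; regexp suffix. Only the verbatim part goes through pregexp-quote.
(define (versioned-name-re literal-name)
  (string-append "^" (pregexp-quote literal-name) "\\.[0-9]+$"))

;; The dot inside "a.c" is quoted, so it matches only a literal dot.
(pregexp-match (versioned-name-re "a.c") "a.c.12")   ; => ("a.c.12")
(pregexp-match (versioned-name-re "a.c") "abc.12")   ; => #f
```

Without pregexp-quote, the dot in "a.c" would act as a metacharacter and the second call would match.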
|
13.2 Regular Expressions Pattern Language
|
Here is a complete description of the regexp pattern
language recognized by the pregexp procedures.
13.2.1 Basic assertions
|
The assertions ^ and $ identify the beginning and
the end of the text string respectively. They ensure that their
adjoining regexps match at one or other end of the text string.
Examples:
(pregexp-match-positions "^contact" "first contact") => #f
|
The regexp fails to match because contact does not occur at the beginning of the text string.
(pregexp-match-positions "laugh$" "laugh laugh laugh laugh") => ((18 . 23))
|
The regexp matches the last laugh .
The metasequence \b asserts that
a word boundary exists.
(pregexp-match-positions "yack\\b" "yackety yack") => ((8 . 12))
|
The yack in yackety doesn't end at a word boundary so it isn't matched.
The second yack does and is.

The metasequence \B has the opposite effect to \b. It
asserts that a word boundary does not exist.
(pregexp-match-positions "an\\B" "an analysis") => ((3 . 5))
|
The an that doesn't end in a word boundary is matched.
13.2.2 Characters and character classes
|
Typically a character in the regexp matches the same character in the
text string. Sometimes it is necessary or convenient to use a regexp
metasequence to refer to a single character. Thus, metasequences
\n , \r , \t , and \. match the newline,
return, tab and period characters respectively. The metacharacter period ( . ) matches
any character other than newline.
(pregexp-match "p.t" "pet") => ("pet")
|
It also matches pat, pit, pot, put, and p8t but not peat or pfffft.

A character class matches any one character from a set of
characters. A typical format for this is the bracketed character
class [...], which matches any one character from the
non-empty sequence of characters enclosed within the
brackets. 2 Thus "p[aeiou]t" matches
pat, pet, pit, pot, put and nothing else.

Inside the brackets, a hyphen (-) between two characters
specifies the ASCII range between the characters. Eg,
"ta[b-dgn-p]" matches tab, tac, tad, tag, tan, tao, and tap.

An initial caret (^) after the left bracket inverts the set
specified by the rest of the contents, i.e., it specifies the set of
characters other than those identified in the brackets. Eg,
"do[^g]" matches all three-character sequences starting with
do except dog.

Note that the metacharacter ^ inside brackets means something
quite different from what it means outside. Most other metacharacters
(., *, +, ?, etc) cease to be metacharacters
when inside brackets, although you may still escape them for peace of
mind. The hyphen (-) is a metacharacter only when it is inside
brackets and is neither the first nor the last character.

Bracketed character classes cannot contain other bracketed character
classes (although they can contain certain other types of character
classes; see below). Thus a left bracket ([) inside a bracketed
character class doesn't have to be a metacharacter; it can stand for
itself. Eg, "[a[b]" matches a, [, and b.

Furthermore, since empty bracketed character classes are disallowed, a
right bracket (]) immediately occurring after the opening left
bracket also doesn't need to be a metacharacter. Eg, "[]ab]"
matches ], a, and b.
13.2.3 Some frequently used character classes
|
Some standard character classes can be conveniently represented as
metasequences instead of as explicit bracketed expressions. \d
matches a digit ( [0-9] ); \s matches a whitespace
character; and \w matches a character that could be part of a
``word''. 3

The upper-case versions of these metasequences stand for the inversions
of the corresponding character classes. Thus \D matches a
non-digit, \S a non-whitespace character, and \W a
non-``word'' character.

Remember to include a double backslash when putting these metasequences
in a Scheme string:
(pregexp-match "\\d\\d" "0 dear, 1 have 2 read catch 22 before 9") => ("22")
|
These character classes can be used inside
a bracketed expression. Eg,
"[a-z\\d]" matches a lower-case letter
or a digit.
13.2.4 POSIX character classes
|
A POSIX character class is a special metasequence
of the form [: ... :] that can be used only
inside a bracketed expression. The POSIX classes
supported are
[:alnum:] letters and digits
[:alpha:] letters
[:algor:] the letters c , h , a and d
[:ascii:] 7-bit ascii characters
[:blank:] widthful whitespace, ie, space and tab
[:cntrl:] ``control'' characters, viz, those with code < 32
[:digit:] digits, same as \d
[:graph:] characters that use ink
[:lower:] lower-case letters
[:print:] ink-users plus widthful whitespace
[:space:] whitespace, same as \s
[:upper:] upper-case letters
[:word:] letters, digits, and underscore, same as \w
[:xdigit:] hex digits
|
For example, the regexp "[[:alpha:]_]" matches a letter or underscore.
(pregexp-match "[[:alpha:]_]" "--x--") => ("x")
(pregexp-match "[[:alpha:]_]" "--_--") => ("_")
(pregexp-match "[[:alpha:]_]" "--:--") => #f
|
The POSIX class notation is valid only inside a
bracketed expression. For instance, [:alpha:] ,
when not inside a bracketed expression, will not
be read as the letter class.
Rather it is (from previous principles) the character
class containing the characters : , a , l ,
p , h .
(pregexp-match "[[:alpha:]]" "--a--") => ("a")
(pregexp-match "[[:alpha:]]" "--_--") => #f
|
By placing a caret (^) immediately after
[:, you get the inversion of that POSIX
character class. Thus, [:^alpha:]
is the class containing all characters
except the letters.
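A minimal sketch of an inverted POSIX class in use, assuming the inversion syntax behaves as described above. The helper name first-non-letter is hypothetical.

```scheme
;; Hypothetical helper: return the first character of text that is not
;; a letter (as a one-character string), or #f if there is none.
;; Note the double brackets: the POSIX class must sit inside a
;; bracketed expression.
(define (first-non-letter text)
  (let ((m (pregexp-match "[[:^alpha:]]" text)))
    (and m (car m))))

(first-non-letter "ab1cd")   ; => "1"
(first-non-letter "abcd")    ; => #f
```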
13.2.5 Quantifiers
|
The quantifiers * , + , and ? match
respectively: zero or more, one or more, and zero or one instances of
the preceding subpattern.
(pregexp-match-positions "c[ad]*r" "cadaddadddr") => ((0 . 11))
(pregexp-match-positions "c[ad]*r" "cr") => ((0 . 2))
(pregexp-match-positions "c[ad]+r" "cadaddadddr") => ((0 . 11))
(pregexp-match-positions "c[ad]+r" "cr") => #f
(pregexp-match-positions "c[ad]?r" "cadaddadddr") => #f
(pregexp-match-positions "c[ad]?r" "cr") => ((0 . 2))
(pregexp-match-positions "c[ad]?r" "car") => ((0 . 3))
|
13.2.6 Numeric quantifiers
|
You can use braces to specify much finer-tuned quantification than is
possible with *, +, ?.

The quantifier {m} matches exactly m instances of the preceding
subpattern. m must be a nonnegative integer.

The quantifier {m,n} matches at least m and at most
n instances. m and n are nonnegative integers with
m <= n. You may omit either or both numbers, in which case
m defaults to 0 and n to infinity.

It is evident that + and ? are abbreviations for
{1,} and {0,1} respectively. * abbreviates
{,}, which is the same as {0,}.
(pregexp-match "[aeiou]{3}" "vacuous") => ("uou")
(pregexp-match "[aeiou]{3}" "evolve") => #f
(pregexp-match "[aeiou]{2,3}" "evolve") => #f
(pregexp-match "[aeiou]{2,3}" "zeugma") => ("eu")
|
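Numeric quantifiers are convenient for checking the shape of fixed-format fields. The sketch below uses a hypothetical predicate, date-shaped?, that accepts a four-digit year followed by a one- or two-digit month and day; it checks only the shape, not whether the date exists.

```scheme
;; Hypothetical predicate: #t if text looks like YYYY-M-D or YYYY-MM-DD.
;; {4} demands exactly four digits, {1,2} one or two digits.
(define (date-shaped? text)
  (if (pregexp-match "^\\d{4}-\\d{1,2}-\\d{1,2}$" text) #t #f))

(date-shaped? "1970-1-01")     ; => #t
(date-shaped? "70-01-01")      ; => #f  (year has only two digits)
(date-shaped? "1970-001-01")   ; => #f  (month has three digits)
```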
13.2.7 Non-greedy quantifiers
|
The quantifiers described above are greedy, ie, they match the
maximal number of instances that would still lead to an overall match
for the full pattern.
(pregexp-match "<.*>" "<tag1> <tag2> <tag3>")
=> ("<tag1> <tag2> <tag3>")
|
To make these quantifiers non-greedy, append a ? to them.
Non-greedy quantifiers match the minimal number of instances needed to
ensure an overall match.
(pregexp-match "<.*?>" "<tag1> <tag2> <tag3>") => ("<tag1>")
|
The non-greedy quantifiers are respectively:
*? , +? , ?? , {m}? , {m,n}? .
Note the two uses of the metacharacter ? .
13.2.8 Clusters
|
Clustering, ie, enclosure within parens ( ... ) ,
identifies the enclosed subpattern as a single entity. It causes
the matcher to capture the submatch, or the portion of the
string matching the subpattern, in addition to the overall match.
(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1" "1970")
|
Clustering also causes a following quantifier to treat
the entire enclosed subpattern as an entity.
(pregexp-match "(poo )*" "poo poo platter") => ("poo poo " "poo ")
|
The number of submatches returned is always equal to the number of
subpatterns specified in the regexp, even if a particular subpattern
happens to match more than one substring or no substring at all.
(pregexp-match "([a-z ]+;)*" "lather; rinse; repeat;")
=> ("lather; rinse; repeat;" " repeat;")
|
Here the *-quantified subpattern matches three times, but it is the
last submatch that is returned.

It is also possible for a quantified subpattern to
fail to match, even if the overall pattern matches.
In such cases, the failing submatch is represented
by #f.
(define date-re
;match `month year' or `month day, year'.
;subpattern matches day, if present
(pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))
(pregexp-match date-re "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1," "1970")
(pregexp-match date-re "jan 1970")
=> ("jan 1970" "jan" #f "1970")
|
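Since pregexp-match returns the overall match followed by the submatches, the submatches can be picked out with ordinary list accessors. A minimal sketch, in which parse-binding is a hypothetical helper:

```scheme
;; Hypothetical helper: split "name=number" into (name . number),
;; or return #f if the text does not have that shape.
(define (parse-binding text)
  (let ((m (pregexp-match "^([a-z]+)=([0-9]+)$" text)))
    ;; (car m) is the whole match, (cadr m) and (caddr m) the submatches.
    (and m (cons (cadr m) (string->number (caddr m))))))

(parse-binding "width=80")   ; => ("width" . 80)
(parse-binding "width")      ; => #f
```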
13.2.9 Backreferences
|
Submatches can be used in the insert string argument of the procedures
pregexp-replace and pregexp-replace* . The insert string
can use \n as a backreference to refer back to the
nth submatch, ie, the substring that matched the nth
subpattern. \0 refers to the entire match, and it can also be
specified as \& .
(pregexp-replace "_(.+?)_"
"the _nina_, the _pinta_, and the _santa maria_"
"*\\1*")
=> "the *nina*, the _pinta_, and the _santa maria_"
(pregexp-replace* "_(.+?)_"
"the _nina_, the _pinta_, and the _santa maria_"
"*\\1*")
=> "the *nina*, the *pinta*, and the *santa maria*"
;recall: \S stands for non-whitespace character
(pregexp-replace "(\\S+) (\\S+) (\\S+)"
"eat to live"
"\\3 \\2 \\1")
=> "live to eat"
|
Use \\ in the insert string to specify a literal
backslash. Also, \$ stands for an empty string,
and is useful for separating a backreference \n
from an immediately following number.

Backreferences can also be used within the regexp
pattern to refer back to an already matched subpattern
in the pattern. \n stands for an exact repeat
of the nth submatch. 4
(pregexp-match "([a-z]+) and \\1"
"billions and billions")
=> ("billions and billions" "billions")
|
Note that the backreference is not simply a repeat of the previous
subpattern. Rather it is a repeat of the particular substring already
matched by the subpattern.

In the above example, the backreference can only match
billions. It will not match millions, even
though the subpattern it harks back to, ([a-z]+),
would have had no problem doing so:
(pregexp-match "([a-z]+) and \\1"
"billions and millions")
=> #f
|
The following corrects doubled words:
(pregexp-replace* "(\\S+) \\1"
"now is the the time for all good men to to come to the aid of of the party"
"\\1")
=> "now is the time for all good men to come to the aid of the party"
|
The following marks all immediately repeating patterns
in a number string:
(pregexp-replace* "(\\d+)\\1"
"123340983242432420980980234"
"{\\1,\\1}")
=> "12{3,3}40983{24,24}3242{098,098}0234"
|
13.2.10 Non-capturing clusters
|
It is often required to specify a cluster
(typically for quantification) but without triggering
the capture of submatch information. Such
clusters are called non-capturing. In such cases,
use (?: instead of ( as the cluster opener. In
the following example, the non-capturing cluster
eliminates the ``directory'' portion of a given
pathname, and the capturing cluster identifies the
basename.
(pregexp-match "^(?:[a-z]*/)*([a-z]+)$"
"/usr/local/bin/mzscheme")
=> ("/usr/local/bin/mzscheme" "mzscheme")
|
13.2.11 Cloisters
|
The location between the ? and the : of a non-capturing
cluster is called a cloister. 5 You can put modifiers there
that will cause the enclustered subpattern to be treated specially. The
modifier i causes the subpattern to match
case-insensitively:
(pregexp-match "(?i:hearth)" "HeartH") => ("HeartH")
|
The modifier x causes the subpattern to match
space-insensitively, ie, spaces and
comments within the
subpattern are ignored. Comments are introduced
as usual with a semicolon ( ; ) and extend till
the end of the line. If you need
to include a literal space or semicolon in
a space-insensitized subpattern, escape it
with a backslash.
(pregexp-match "(?x: a lot)" "alot")
=> ("alot")
(pregexp-match "(?x: a \\ lot)" "a lot")
=> ("a lot")
(pregexp-match "(?x:
a \\ man \\; \\ ; ignore
a \\ plan \\; \\ ; me
a \\ canal ; completely
)"
"a man; a plan; a canal")
=> ("a man; a plan; a canal")
|
You can put more than one modifier in the cloister.
(pregexp-match "(?ix:
a \\ man \\; \\ ; ignore
a \\ plan \\; \\ ; me
a \\ canal ; completely
)"
"A Man; a Plan; a Canal")
=> ("A Man; a Plan; a Canal")
|
A minus sign before a modifier inverts its meaning.
Thus, you can use -i and -x in a
subcluster to overturn the insensitivities caused by an
enclosing cluster.
(pregexp-match "(?i:the (?-i:TeX)book)"
"The TeXbook")
=> ("The TeXbook")
|
This regexp will allow any casing for the and book but insists that TeX not be
differently cased.
13.2.12 Alternation
|
You can specify a list of alternate
subpatterns by separating them by | . The |
separates subpatterns in the nearest enclosing cluster
(or in the entire pattern string if there are no
enclosing parens).
(pregexp-match "f(ee|i|o|um)" "a small, final fee")
=> ("fi" "i")
(pregexp-replace* "([yi])s(e[sdr]?|ing|ation)"
"it is energising to analyse an organisation
pulsing with noisy organisms"
"\\1z\\2")
=> "it is energizing to analyze an organization
pulsing with noisy organisms"
|
Note again that if you wish
to use clustering merely to specify a list of alternate
subpatterns but do not want the submatch, use (?:
instead of ( .
(pregexp-match "f(?:ee|i|o|um)" "fun for all")
=> ("fo")
|
An important thing to note about alternation is that
the leftmost matching alternate is picked regardless of
its length. Thus, if one of the alternates is a prefix
of a later alternate, the latter may not have
a chance to match.
(pregexp-match "call|call-with-current-continuation"
"call-with-current-continuation")
=> ("call")
|
To allow the longer alternate to have a shot at
matching, place it before the shorter one:
(pregexp-match "call-with-current-continuation|call"
"call-with-current-continuation")
=> ("call-with-current-continuation")
|
In any case, an overall match for the entire regexp is
always preferred to an overall nonmatch. In the
following, the longer alternate still wins, because its
preferred shorter prefix fails to yield an overall
match.
(pregexp-match "(?:call|call-with-current-continuation) constrained"
"call-with-current-continuation constrained")
=> ("call-with-current-continuation constrained")
|
13.2.13 Backtracking
|
We've already seen that greedy quantifiers match
the maximal number of times, but the overriding priority
is that the overall match succeed. Consider
(pregexp-match "a*a" "aaaa")
|
The regexp consists of two subregexps, a* followed by a .
The subregexp a* cannot be allowed to match
all four a 's in the text string "aaaa" , even though
* is a greedy quantifier. It may match only the first
three, leaving the last one for the second subregexp.
This ensures that the full regexp matches successfully.

The regexp matcher accomplishes this via a process
called backtracking. The matcher
tentatively allows the greedy quantifier
to match all four a 's, but then when it becomes
clear that the overall match is in jeopardy, it
backtracks to a less greedy match of
three a 's. If even this fails, as in the
call
(pregexp-match "a*aa" "aaaa")
|
the matcher backtracks even further. Overall failure is conceded only
when all possible backtracking has been tried with no success.

Backtracking is not restricted to greedy quantifiers.
Nongreedy quantifiers match as few instances as
possible, and progressively backtrack to more and more
instances in order to attain an overall match. There
is backtracking in alternation too, as the more
rightward alternates are tried when locally successful
leftward ones fail to yield an overall match.
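The effect of backtracking can be made visible by putting the quantified part in a capturing cluster, so that the submatch shows how much the quantifier finally kept. The following sketch only adds clusters to patterns of the kind discussed above.

```scheme
;; Greedy a* first takes all four a's, then gives one back
;; so that the trailing a can match.
(pregexp-match "(a*)a" "aaaa")    ; => ("aaaa" "aaa")

;; With two trailing a's, the quantifier has to give back two.
(pregexp-match "(a*)aa" "aaaa")   ; => ("aaaa" "aa")

;; Nongreedy a*? starts with zero a's, which already lets the
;; trailing a match, so no further instances are tried.
(pregexp-match "(a*?)a" "aaaa")   ; => ("a" "")
```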
13.2.14 Disabling backtracking
|
Sometimes it is efficient to disable backtracking. For
example, we may wish to commit to a choice, or
we know that trying alternatives is fruitless. A
nonbacktracking regexp is enclosed in (?> ... ) .
(pregexp-match "(?>a+)." "aaaa")
=> #f
|
In this call, the subregexp ?>a+ greedily matches
all four a's, and is denied the opportunity to
backpedal. So the overall match is denied. The effect
of the regexp is therefore to match one or more a's
followed by something that is definitely non-a.
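For comparison, here is the same pattern with and without the nonbacktracking cluster, together with a text on which the nonbacktracking version does succeed. This is a small sketch relying only on the behaviour described above.

```scheme
;; With ordinary backtracking, a+ gives back one a for the dot.
(pregexp-match "a+." "aaaa")        ; => ("aaaa")

;; The nonbacktracking cluster keeps all four a's, leaving nothing
;; for the dot.
(pregexp-match "(?>a+)." "aaaa")    ; => #f

;; It succeeds when a non-a character follows the run of a's.
(pregexp-match "(?>a+)." "aaaab")   ; => ("aaaab")
```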
13.2.15 Looking ahead and behind
|
You can have assertions in your pattern that look
ahead or behind to ensure that a subpattern does
or does not occur. These ``look around'' assertions are
specified by putting the subpattern checked for in a
cluster whose leading characters are: ?= (for positive
lookahead), ?! (negative lookahead), ?<=
(positive lookbehind), ?<! (negative lookbehind).
Note that the subpattern in the assertion does not
generate a match in the final result. It merely allows
or disallows the rest of the match.
Positive lookahead ( ?= ) peeks ahead to ensure that
its subpattern could match.
(pregexp-match-positions "grey(?=hound)"
"i left my grey socks at the greyhound")
=> ((28 . 32))
|
The regexp "grey(?=hound)" matches grey, but only if it is followed by
hound. Thus, the first grey in the text string is not matched.

Negative lookahead ( ?! ) peeks ahead
to ensure that its subpattern could not possibly match.
(pregexp-match-positions "grey(?!hound)"
"the gray greyhound ate the grey socks")
=> ((27 . 31))
|
The regexp "grey(?!hound)" matches grey, but only if it is not
followed by hound. Thus the grey just before socks is matched.
Positive lookbehind ( ?<= ) checks that its subpattern could match
immediately to the left of the current position in
the text string.
(pregexp-match-positions "(?<=grey)hound"
"the hound in the picture is not a greyhound")
=> ((38 . 43))
|
The regexp (?<=grey)hound matches hound, but only if it is preceded by
grey.

Negative lookbehind ( ?<! ) checks that its subpattern
could not possibly match immediately to the left.
(pregexp-match-positions "(?<!grey)hound"
"the greyhound in the picture is not a hound")
=> ((38 . 43))
|
The regexp (?<!grey)hound matches hound, but only if
it is not preceded by grey.

Lookaheads and lookbehinds can be convenient when they
are not confusing.
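Because a lookahead consumes no text, several of them can be stacked at the same position to impose independent conditions. The sketch below is illustrative only; acceptable-token? is a hypothetical predicate that requires at least six characters, including a lower-case letter and a digit.

```scheme
;; Hypothetical predicate. Each lookahead is checked from the start of
;; the text; only the final .{6,} actually consumes characters.
(define (acceptable-token? text)
  (if (pregexp-match "^(?=.*[a-z])(?=.*\\d).{6,}$" text) #t #f))

(acceptable-token? "abc123")   ; => #t
(acceptable-token? "abcdef")   ; => #f  (no digit)
(acceptable-token? "a1")       ; => #f  (too short)
```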
Here's an extended example from Friedl that covers many of the features
described above. The problem is to fashion a regexp that will match all
and only IP addresses or dotted quads, i.e., four numbers separated
by three dots, with each number between 0 and 255. We will use the
commenting mechanism to build the final regexp with clarity. First, a
subregexp n0-255 that matches 0 through 255.
(define n0-255
"(?x:
\\d ; 0 through 9
| \\d\\d ; 00 through 99
| [01]\\d\\d ;000 through 199
| 2[0-4]\\d ;200 through 249
| 25[0-5] ;250 through 255
)")
|
The first two alternates simply get all single- and
double-digit numbers. Since 0-padding is allowed, we
need to match both 1 and 01. We need to be careful
when getting 3-digit numbers, since numbers above 255
must be excluded. So we fashion alternates to get 000
through 199, then 200 through 249, and finally 250
through 255. 6

An IP-address is a string that consists of
four n0-255 s with three dots separating
them.
(define ip-re1
(string-append
"^" ;nothing before
n0-255 ;the first n0-255,
"(?x:" ;then the subpattern of
"\\." ;a dot followed by
n0-255 ;an n0-255,
")" ;which is
"{3}" ;repeated exactly 3 times
"$" ;with nothing following
))
|
Let's try it out.
(pregexp-match ip-re1 "1.2.3.4") => ("1.2.3.4")
(pregexp-match ip-re1 "55.155.255.265") => #f
|
which is fine, except that we also have
(pregexp-match ip-re1 "0.00.000.00") => ("0.00.000.00")
|
All-zero sequences are not valid IP addresses! Lookahead to the rescue.
Before starting to match ip-re1 , we look ahead to ensure we don't
have all zeros. We could use positive lookahead to ensure there
is a digit other than zero.
(define ip-re
(string-append
"(?=.*[1-9])" ;ensure there's a non-0 digit
ip-re1))
|
Or we could use negative lookahead to ensure that what's ahead isn't
composed of only zeros and dots.
(define ip-re
(string-append
"(?![0.]*$)" ;not just zeros and dots
;(note: dot is not metachar inside [])
ip-re1))
|
The regexp ip-re will match all and only valid IP addresses.
(pregexp-match ip-re "1.2.3.4") => ("1.2.3.4")
(pregexp-match ip-re "0.0.0.0") => #f
|
|