guile-user

Re: Stupid module and pregexp questions


From: Tom Lord
Subject: Re: Stupid module and pregexp questions
Date: Wed, 30 Apr 2003 10:11:53 -0700 (PDT)


       > Uh, oh. What I'd like to see would be a `standard' regexp
       > interface for Scheme (and may be a `standard' regexp
       > implementation for Guile).

       > This would boost all those portable programs which nowadays
       > try to parse strange things the hard way.

I realize it's a bit of a cliché, but: the nice thing about regexp
standards is that there are so many to choose from.  Just to throw out
some observations:

Posix regexps are pretty well specified, big subsets of them have the
nice mathematical properties we expect of regular expressions, and
they are _fairly_ friendly to DFA based matchers.  There are a bunch
of decent test suites for them and, as of last year, one huge test
suite that combines all of those and adds some new tests.  Alas, there
are a lot of buggy implementations of them, and one apparently
irreconcilable disagreement about interpreting the spec (is
regexp concatenation left associative or right associative?).

I've always thought of Perl regexps as a kind of moving target, and
their first-found semantics are not so mathematically nice (so are
they really "Schemey"?  and does that matter?).  They are easier to
implement, but first-found doesn't mix too well with DFA techniques.
There are lots of implementations of variations on Perl regexps.
Lots of programmers are familiar with them.   Some of their
non-regular extensions (e.g., look-ahead) are apparently pretty
convenient.
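The first-found semantics can be seen in any Perl-style backtracking
engine; Python's re module, used here purely for illustration, behaves
the same way:

```python
# Perl-style "first-found" alternation: the engine commits to the
# first alternative that matches, not the longest one.  A POSIX
# leftmost-longest matcher would return "ab" here instead.
import re

m = re.match(r"a|ab", "ab")
print(m.group())  # -> "a": the first alternative wins
```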

The Unicode consortium publishes a tech report that almost specifies a
really minimalist, very clean regular expression language
(particularly addressing how character sets should work given the size
and complexity of Unicode).  The regular expression language in the
XML Schema standard is based on this.  It's a bit low on expressive
power compared to Posix and Perl -- but the limitations make it a
DFA-friendly regexp language (you can always see if a given string
matches with a single pass over the string).
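The single-pass property is easy to see in a sketch: once a pattern is
compiled to a DFA transition table, matching consumes each character
exactly once.  This toy matcher (all names invented here, written in
Python for illustration) shows the shape of the idea:

```python
# Minimal single-pass DFA matcher: O(len(s)) time for any string,
# regardless of how complex the pattern that produced the table was.

def dfa_matches(table, start, accepting, s):
    """table: {(state, char): next_state}; a missing entry means reject."""
    state = start
    for ch in s:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in accepting

# Hand-built DFA for the regular expression a*b
table = {(0, "a"): 0, (0, "b"): 1}
print(dfa_matches(table, 0, {1}, "aaab"))  # -> True
print(dfa_matches(table, 0, {1}, "aba"))   # -> False
```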

GNU Emacs has its own little language: roughly a Posix-like syntax and
feature set, with a Perl-like semantics.  Its chief virtue is a
simple and long-lived implementation with syntax options and the
option to do matching with strict Posix semantics.  Its chief
drawback is speed -- it's frustratingly too slow for some applications
that would otherwise be natural in an extensible editor.

--------

I think using Perl or Emacs expressions as the primary type of regexp
in Guile would be a mistake.  Since they're mathematically icky,
there's less you can do to process them automatically.  Since they're
mostly DFA-unfriendly, they don't scale too well to
performance-intensive applications of non-trivial expressions.  If
someone needs Perl expressions (because, say, their application has to
read in expressions from some data file that Perl apps might also
read), can't they just add a trivial Guile binding for them?

Posix regexps would have some advantages.  They can scale _pretty_
well (if you avoid submatch reporting, backrefs, and context
operators).  They're reasonably clean.  You can do tricks like calling
`grep' or `sed' as a subprocess with a Posix expression, then use that
same expression internally.  But also disadvantages: the backreference
operator, context operators, and sub-match position reporting features
are hard to implement correctly and are DFA-unfriendly (meaning that
they can force backtracking to take place either in time (lots of
string passes) or space (lots of ancillary data during a single string
pass)).
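Backreferences are the clearest case of DFA-unfriendliness: the
language they describe isn't regular at all, so the engine has to
backtrack over candidate capture lengths.  A small demonstration,
again using Python's backtracking engine for illustration:

```python
# (a*)\1 matches only strings of a's with even length -- not a regular
# language, so no DFA can recognize it.  The engine first tries the
# greedy capture "aaaa", fails, and backtracks until "aa"+"aa" works.
import re

print(bool(re.fullmatch(r"(a*)\1", "aaaa")))  # -> True  (split as aa+aa)
print(bool(re.fullmatch(r"(a*)\1", "aaa")))   # -> False (odd length)
```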

As a historical note: my impression is that context, sub-match reporting
and backrefs happened to be easy to add to an early NFA-based
implementation.  I doubt that the early forms had the strict
recursively-leftmost-longest semantics of the current standard.  In
other words, I think it's just an unfortunate accident that Posix
regexps have these features, and strongly suspect that when those
features were first standardized, there were exactly 0 correct
implementations.  The standard authors, I'm guessing, simply
underestimated the difficulty and cost of getting them right as
specified.  If one had the freedom to start fresh and design regexp
tools today, I don't think the result would look much like Posix
regexps.

-----

So what does that leave?   Something really minimalist like the XML
Schema regular expression language -- a language that's less
expressive, but mathematically clean and very DFA-friendly.

In a non-lisp language, that minimalism would be a problem.  In
something like Perl, it seems to me, the regexp language grows
essentially to embed control structures that would be hard or at least
tedious to express directly in regular Perl code.  It's as if there's
an embedded language in Perl just for writing pattern-matching "loops" 
that use more traditional regular expressions as primitive
statements.  

But in lisp languages: well, that's part of what macros and
higher-order functions are for -- building little languages.

So my vague idea is: pick a truly minimalist, DFA-friendly pattern
language.  Leave out anchors ("^" and "$"), context operators
(e.g. "match the empty string if followed by whitespace"), sub-match
position reporting, and backreferences.   Provide primitives like
"return the length of the shortest/longest match starting here",
"return #t if any match starts here".   Add features like
state-labeling (let users attach data to NFA states) and interfaces
like "run the DFA for the next 10 characters of this string" and
"return a list of all the NFA state labels of the current state".
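A primitive like "longest match starting here" falls out of a single
forward DFA scan with no backtracking.  This sketch (names and
representation invented here, in Python rather than Scheme for
illustration) shows one plausible shape for such a primitive:

```python
# Sketch of a "longest match starting at pos" primitive over a DFA:
# scan forward once, remembering the last position where the DFA was
# in an accepting state.  No backtracking is ever needed.

def longest_match_at(table, start, accepting, s, pos):
    """Length of the longest match beginning at pos, or None."""
    state = start
    best = 0 if start in accepting else None
    for i in range(pos, len(s)):
        state = table.get((state, s[i]))
        if state is None:
            break
        if state in accepting:
            best = i - pos + 1
    return best

# Hand-built DFA for a+ (one or more a's)
table = {(0, "a"): 1, (1, "a"): 1}
print(longest_match_at(table, 0, {1}, "aaab", 0))  # -> 3
print(longest_match_at(table, 0, {1}, "baaa", 0))  # -> None
```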

Next, invent little languages on top of that to take over the role of
backrefs, sub-match reporting, context operators and so forth.  Things
like sub-match reporting should probably _not_ have the same semantics
as in Posix -- but should be designed so that scans can avoid
backtracking.

Finally, design a surface syntax that can compete with the Posix or
Perl syntax, but that "compiles" fairly trivially into Scheme
expressions over the minimalist expression language.   In other words, 
even though the minimalist language would lack anchors or submatches,
a user typing a pattern into a dialog box for a regexp-search or
regexp-query-replace command should have _some_ access to anchoring
and submatches without having to resort to typing s-exps.
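To make the "compiles trivially into Scheme expressions" idea
concrete, here is a toy sketch (the surface syntax, the s-exp forms
`bol`/`eol`/`pattern`, and the function name are all hypothetical,
and it's written in Python for illustration) that peels anchors off a
familiar-looking pattern and wraps the core in explicit forms:

```python
# Toy "surface syntax -> s-exp" compiler: ^ and $ are not operators in
# the minimalist core; they become wrappers around a core pattern form,
# rendered here as nested Python lists standing in for Scheme lists.

def compile_pattern(pat):
    bol = pat.startswith("^")
    eol = pat.endswith("$")
    core = pat[1 if bol else 0 : len(pat) - 1 if eol else len(pat)]
    sexp = ["pattern", core]
    if eol:
        sexp = ["eol", sexp]
    if bol:
        sexp = ["bol", sexp]
    return sexp

print(compile_pattern("^ab*c$"))  # -> ['bol', ['eol', ['pattern', 'ab*c']]]
print(compile_pattern("ab*c"))    # -> ['pattern', 'ab*c']
```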

Maybe that suggestion, to choose a minimalist, truly regular regular
expression language -- then do the rest in scheme -- satisfies the
spirit of "do as little as possible in C".

Another design dimension to consider: what are Guile's plans re:
Unicode?

-t



