guile-user

Re: Stupid module and pregexp questions


From: Tom Lord
Subject: Re: Stupid module and pregexp questions
Date: Mon, 5 May 2003 10:33:09 -0700 (PDT)



    > If all (or nearly all) of the libs that guile might choose to
    > link against for (ice-9 regex) on various platforms are
    > consistent with each other, 

They are not.

    > then I should perhaps withdraw my
    > suggestions.  I was just under the impression that they vary
    > substantially, and wanted to have at least one familiar regex
    > subsystem in the core that eliminated the variance.

You might want to ask RMS if he has a version of GNU regex that is
reasonably correct these days.  I recall that he did some work on that
last year -- but that might have been just for the version in Emacs.

GNU regex has the virtues of being a very small implementation with
standard Posix entry points, and the drawback (for _some_ apps) of
using a backtracking-based implementation.

That would be a net win for (ice-9 regex): (a correct) GNU regex (if
you can scare one up) is small enough to distribute with Guile, and
then those people who need a faster version need only find one that
offers posix entry points.

Be careful because I think that most versions of GNU regex in
circulation (many slightly different versions, all labeled "0.12") 
have subtle bugs.

In src/hackerlab/tests/rx-posix-tests (e.g., in arch-1.0pre18) there
are some conformance tests for Posix matchers and a driver program
"=generic-test-regex.c" that you can use to run them.

We were slightly talking at cross purposes: when I concluded that you
should really consider starting with a minimalist truly-regular
pattern language, I was thinking beyond just (ice-9 regex) -- for
example, what would we want in a regex srfi?

Then, when people start talking about using Perl regexps to be
"compatible" with Perl and Python: well, what's the point of that?
Some of the biggest wins for using Scheme as the basis of an
app-framework/extension/scripting language are the opportunities to
provide a far superior programming environment, based on a language
that a fair number of implementors work on making quite fast.


         > Also, if one of the main things you're arguing is that perl
         > and emacs-style regexes have extensions that we need to do
         > without if we want good performance, then I'm not trying to
         > argue against that assertion.

That's not quite what I'm saying.  It isn't about "extensions": it's
about the basic semantics of regexp operators like | and *.  Some
regexp languages (Emacs is one example, JavaScript another) are
defined in terms of backtracking implementations.  They aren't
"regular expression" engines in the theoretical CS sense at all.  They
don't provide equivalent DFAs even for very simple patterns like
"else|elseif".   You can't build a program like "lex" to statically
compile their patterns for a linear scanner.   You can't build a
dynamic DFA engine like Rx.    Those pattern languages simply don't
scale to large problems (font lock mode, anyone?) and don't have nice
simple algebraic properties for automatic manipulation of patterns
(SREs, anyone?).
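To see the semantic difference concretely, here is an illustration
using Python's `re` module, which is a backtracking engine in the Perl
tradition (this is purely illustrative -- it is not anything Guile
ships):

```python
import re

# Backtracking alternation is leftmost-FIRST: once "else" matches at a
# position, the engine never tries the longer branch "elseif" there.
m = re.match(r"else|elseif", "elseif")
print(m.group())  # -> else

# A Posix leftmost-LONGEST matcher -- the semantics a DFA provides
# naturally, as in lex -- would report the whole token "elseif" here.
```

The two engines disagree on this trivial pattern, which is exactly why
a backtracking-defined language cannot be statically compiled the way
lex compiles truly regular patterns.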

The extensions in those languages are a little bit relevant: they tend
to be extensions that are easy to implement given a backtracking
matcher as a starting point, and basically impossible to implement
(efficiently) in DFA-based implementations.  So the extensions
compound the problem -- but the root of the problem is the semantics
of very basic operators.  (Subexpression position reporting, anchors,
and backreferences in Posix appear to have started out as similar
backtracking-oriented extensions.  When Posix was standardized, those
extensions were given semantics wrongly presumed to be DFA-friendly.
As a result, Posix regexps are in some sense the worst of both worlds:
they force backtracking matchers to do nearly exhaustive searches and they
force DFA-based implementations to add a complicated layer of
weird-ass recursive backtracking.)
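As a concrete illustration of why backreferences in particular defeat
DFA-based matching (again sketched with Python's backtracking `re`,
for illustration only):

```python
import re

# \1 must match a literal copy of whatever group 1 captured.  The
# language matched by (a+)\1 -- even-length runs of a's split into
# two equal halves -- is not a regular language, so no DFA can
# recognize it; only a backtracking search can.
assert re.fullmatch(r"(a+)\1", "aaaa") is not None  # "aa" + "aa"
assert re.fullmatch(r"(a+)\1", "aaa") is None       # no equal split exists
```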

        > I'm really just ruminating on the advisability of a
        > well-defined, invariant, and reasonably familiar regex
        > syntax for guile's core.

To my ears, that's different from just thinking about what backs up
(ice-9 regex).    That should be more at the level of "What would a
good regex SRFI do?"   And I stand by my answer there: none of the
standard regexp languages are any good, though the XML Schema version
comes closest.   Only a pattern language that manages to be a true
regular expression language in the CS-theory sense is a sane basis
for something like an SRE approach.
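To make the SRE idea concrete, here is a hypothetical sketch (in
Python, with invented names) of patterns as s-expressions that can be
manipulated algebraically as ordinary data and only then compiled down
to a textual regex:

```python
def compile_sre(pat):
    """Compile a tiny, invented SRE-like tuple form to an ERE string."""
    if isinstance(pat, str):                      # literal string
        return "".join("\\" + c if c in ".|*+?()[]\\" else c
                       for c in pat)
    op, *args = pat
    if op == "or":                                # (or p1 p2 ...)
        return "(" + "|".join(compile_sre(a) for a in args) + ")"
    if op == "seq":                               # (seq p1 p2 ...)
        return "".join(compile_sre(a) for a in args)
    if op == "*":                                 # (* p)
        return "(" + compile_sre(args[0]) + ")*"
    raise ValueError("unknown operator: %r" % op)

# "else" followed by an optional "if", written structurally:
print(compile_sre(("seq", "else", ("or", "", "if"))))  # -> else(|if)
```

Because the pattern is plain data, rewrites such as flattening nested
(seq ...) forms become ordinary list manipulation rather than string
surgery -- the "nice simple algebraic properties" mentioned above.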


        > I'd probably be perfectly happy with a good POSIX
        > implementation in the core, perhaps even with a subset of
        > POSIX if dropping certain bits were somehow important...

As I said, ask around for a working GNU regex for (ice-9 regex).
I think that's the only practical choice you (might) have.

But I encourage you also to have higher ambitions and just point to my
experience using Rx in systas.  Having a dynamic DFA engine in there
enabled a bunch of neat applications that would be simply out of reach
in many environments.

-t



