[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Regular expression libraries
From: |
Clément Pit--Claudel |
Subject: |
Regular expression libraries |
Date: |
Thu, 15 Dec 2016 14:00:25 -0500 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 |
Hi emacs-devel,
The top of regex.c says this:
/* Ignore some GCC warnings for now. This section should go away
once the Emacs and Gnulib regex code is merged. */
And indeed I've seen brief discussions of this merge in the past
(https://lists.gnu.org/archive/html/emacs-devel/2002-04/msg00083.html ,
https://lists.gnu.org/archive/html/emacs-devel/2016-07/msg01115.html). Looking
at this is more detail, though, it is not clear how such a merge could happen:
* Emacs has special regexp features based on syntax classes, or the position of
the point. These features don't seem to be supported in gnulib nor glibc
* Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the
subject string
Stefan also expressed a desire to move to a better regular expression engine —
ideally one that uses a non-backtracking implementation when possible, to avoid
the exponential blowups that Emacs currently suffers from on certain regular
expressions.
I've had a quick look at the existing options. Here's what I gathered (the
list below includes only libraries whose licensing is compatible with Emacs):
* There are only two usable, non-backtracking regexp libraries:
- RE2 (from Google, C++) doesn't support any form of backtracking, though, so
"\\(.*\\)\1" won't work anymore. Additionally, RE2 does not support matching
over a gap buffer (the subject string needs to be a contiguous chunk of
memory). Discussion here: https://github.com/google/re2/issues/126
- TRE (http://laurikari.net/tre/) uses backtracking if needed, but otherwise
uses a finite automaton. It's stable, can be relatively easily linked into
Emacs, and performance is similar to what we have (I benchmarked it). It does
support matching on a gap buffer (it has an interface to use the moral
equivalent of a random iterator, reguexec). It doesn't support Emacs' native
encoding out of the box, though, and the library is unmaintained; there are
multiple forks which fix various security vulnerabilities.
In both cases, the syntax isn't compatible with that of Emacs, which would
force us to write a compatibility layer. And Emacs' internal encoding isn't
supported either.
* Backtracking implementations are more common, but none seem to fit the bill:
- Oniguruma and Onigmo (C, used in Ruby, the Atom text editor, and others)
support a wide range of encodings, including allegedly Emacs' internal
encoding, as well as many syntaxes, including Emacs' syntax. But they don't
support matching on a gap buffer either (they take a contiguous string).
Discussion here: https://github.com/k-takata/Onigmo/issues/83
- PCRE (C, used in Perl and others) doesn't have Emacs encoding, nor Emacs
syntax. It requires a contiguous buffer, but it does support partial matches.
Partial matches can apparently be restarted, which could almost work in our
case, but the semantics of restarting a search from a partial match are broken.
- Boost (C++) can use a custom random-access iterator for the subject string,
which should work for gap buffers; but I'm not sure how good the encoding story
is.
If I understand our current implementation, our regexp.c functions take the
moral equivalent of two strings, and match over that, pretending that it's just
one large string. In practice, these two strings are the two halves of the gap
buffer. Correct?
The bottom line is that, even though it should be relatively easy to change
string-match-p to use onigmo, changing re-search-forward sounds tricky. Did I
miss something? Was there a concrete plan for merging with glibc, or for using
an automaton-based matching strategy? The advantages of switching could be
support for a more common syntax, more flexible regular expressions (named
groups, assertions, …), better performance, and better handling of special
cases.
Cheers,
Clément.
smime.p7s
Description: S/MIME Cryptographic Signature
- Regular expression libraries,
Clément Pit--Claudel <=
- Re: Regular expression libraries, Eli Zaretskii, 2016/12/15
- Re: Regular expression libraries, Paul Eggert, 2016/12/15
- Re: Regular expression libraries, Andreas Schwab, 2016/12/15
- Re: Regular expression libraries, Paul Eggert, 2016/12/16
- Re: Regular expression libraries, Clément Pit--Claudel, 2016/12/16
- Re: Regular expression libraries, Clément Pit--Claudel, 2016/12/16
- Re: Regular expression libraries, Lars Ingebrigtsen, 2016/12/16
- Re: Regular expression libraries, Clément Pit--Claudel, 2016/12/16
- Re: Regular expression libraries, Eli Zaretskii, 2016/12/16
- Re: Regular expression libraries, Paul Eggert, 2016/12/16