bug#64128: regexp parser zero-width assertion bugs
From: Mattias Engdegård
Subject: bug#64128: regexp parser zero-width assertion bugs
Date: Mon, 19 Jun 2023 21:52:40 +0200
On 19 June 2023 at 21:21, Paul Eggert <eggert@cs.ucla.edu> wrote:
> If I understand things correctly, this would cause "\b*c" to be treated like
> "\b\*c".
Actually it already works that way. What the patch does is stop AB\b*C from
being parsed as \(?:AB\b\)*C; it is parsed as AB\b\*C instead, which I think we
can all agree is less wrong.
You can check the test cases in the patch:
(should (equal (string-match "q\\b*!" "q*!") 0))
(should (equal (string-match "q\\b*!" "!") nil))
which in current Emacs produce 2 and 0 respectively.
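
To make the two readings concrete, here is a small sketch that spells each
parse out explicitly, so the results do not depend on how a given Emacs build
treats the bare "*". The test name is hypothetical and not part of the patch;
the regexps are just hand-expanded forms of the two interpretations above.

```elisp
(require 'ert)

;; Hypothetical test name, for illustration only; not taken from the patch.
(ert-deftest bug64128-explicit-parses-sketch ()
  ;; Intended reading: q, word boundary, literal star, "!".
  (should (equal (string-match "q\\b\\*!" "q*!") 0))
  (should (equal (string-match "q\\b\\*!" "!") nil))
  ;; Old reading: the star repeats the whole group q\b, so zero
  ;; repetitions are allowed and the match can start at the "!".
  (should (equal (string-match "\\(?:q\\b\\)*!" "q*!") 2))
  (should (equal (string-match "\\(?:q\\b\\)*!" "!") 0)))
```

The second pair gives the 2 and 0 mentioned above; the first pair gives the
values the patched parser is expected to return for the unescaped form.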
> It's long been documented that the only reason "*" is ordinary at the start
> of a regular expression or subexpression is "historical compatibility", and
> it's also long been documented that you shouldn't take advantage of this and
> you should backslash-escape the "*" anyway. In contrast, for constructs like
> \b* there is not a historical compatibility reason, so there's not a good
> argument for treating "*" as an ordinary character after "\b".
Sure, we can turn \b and \B into group B assertions, but the patch was more
conservative in nature.
We also have \` to consider. I think we have to preserve \`* meaning \`\* for
compatibility, historical or not, because it's something we keep seeing in
the wild.
> Instead, \b should not be a special case before "*", and \b* should be
> equivalent to \(\b\)* and should match only the empty string. Similarly for
> the other zero-width backslash escapes. This is what I would expect from
> these constructs from the longstanding documentation.
>
> If we instead added a rule to say that a construct that can only match the
> empty string causes a following "*" to be ordinary, then \b* and \(\b\)* would
> both be equivalent to \*. Although consistent, this would be confusing: it
> would compound the historical-compatibility mistake. Let's keep things simple
> instead.
Yes, I definitely would be confused by such semantics.
> Also, whatever change we make to the behavior should be documented in the
> manual and in etc/NEWS.
Will be happy to oblige, although in this case it really was just a bug fix.
What I really would like to see is the regexp parser somehow separated from the
NFA bytecode generator, which would make both clearer. The parser could then be
re-used for other purposes such as a different back-end (DFA construction) or a
built-in xr-like converter.