bug-gnulib

Re: possible bug in regex and dfa


From: arnold
Subject: Re: possible bug in regex and dfa
Date: Sun, 18 Jul 2021 12:59:24 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Bruno Haible <bruno@clisp.org> wrote:

> Hi Arnold,
>
> > Dot matching newline isn't the issue here.
> > 
> > It's ^ matching in the middle of a string.  For my purposes, ^ should
> > only match at the beginning of a *string* (as $ should only match at
> > the end of a string).  I haven't rechecked POSIX, but this is how awk
> > has behaved since forever.
>
> Hmm. Regarding POSIX: I've read section 9.3.8 and 9.4.9 of [1],
> the description of REG_NOTBOL, REG_NOTEOL in [2], and the description
> of REG_NEWLINE in [3]. If I understand it correctly, within POSIX,
> ".^" should not match a newline because
>   - if REG_NEWLINE is set, '^' matches after the newline but '.' does not
>     match the newline,
>   - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
>     after the newline.

That makes sense.  This is why I felt that, for gawk, ".^" is an invalid
regexp. (Indeed, the original Unix awk rejects it as such.)

REG_NEWLINE is not included in any of the RE_*_AWK definitions since I
want exactly the behavior you describe: dot matches newline but ^ does
not match after the newline.

To me this feels very much like a bug.
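
The two cases in the quoted list can be checked against plain POSIX
regcomp(). What follows is a minimal sketch, not code from gawk or Gnulib.
It assumes an ERE (in a BRE, a '^' that is not at the start of the pattern
is an ordinary character), and ere_matches() is a hypothetical helper name
chosen for illustration. It exercises only the POSIX interface, not the
GNU RE_* syntax bits.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Compile `pattern` as a POSIX ERE with `extra_cflags` added
   (0 or REG_NEWLINE) and search for it anywhere in `text`.
   Returns 1 on a match, 0 on no match, -1 if compilation fails. */
int ere_matches(const char *pattern, const char *text, int extra_cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
        return -1;

    /* eflags == 0: neither REG_NOTBOL nor REG_NOTEOL is set. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

With the text "a\nb", ere_matches("^b", text, 0) reports no match while
ere_matches("^b", text, REG_NEWLINE) reports a match. Conversely,
ere_matches("a.b", text, 0) reports a match and
ere_matches("a.b", text, REG_NEWLINE) does not. Passing ".^" to the same
helper is a way to probe the case under discussion. No expected result is
given for it here, since that is the open question.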

> However, GNU regex.h also has a flag RE_CONTEXT_INDEP_ANCHORS; I don't know
> what effect it has.

In this case it makes things worse, causing gawk to match ".^" literally.

> > (And how I've documented things in the manual, also since forever.)
>
> If you want the behaviour of the GNU regex to be stable over time, you
> should contribute unit tests to tests/test-regex.c.

This is a separate issue. It almost sounds like you're saying "it's your
fault there's a bug here, you didn't contribute unit tests".  I hope
that's not your intent; if it is then sorry, I don't buy it.

In any case, I've supplied a regexp, input data, and in the gawk dist,
a test harness, so that debugging can be done if one of the Gnulib
maintainers will look into this particular issue.

Thanks,

Arnold


