bug-gnulib

Re: Dealing with character ranges in grep


From: Jim Meyering
Subject: Re: Dealing with character ranges in grep
Date: Thu, 16 Jun 2011 07:58:05 +0200

Jim Meyering wrote:

> Bruno Haible wrote:
>> Paolo,
>>
>>> > [=e=] to match "e" as well as accented versions like é, è and ê).
>>> > That is the one feature that you get with glibc, and that you would
>>> > sacrifice when building --with-included-regex.
>>>
>>> I agree.  It's up to distros to choose, of course.
>>
>> If you are on the point of sacrificing a glibc feature in many programs,
>> then IMO you should first talk with the glibc people to see what alternative
>> they can offer.
>
> People who build the tools currently have the choice of using
> --with-included-regex or
> --without-included-regex
>
> Note that putting equivalence classes (and backrefs) aside, the
> interpretation of ranges is done in dfa.c, which means the vast
> majority of range uses never even require use of regexp code.
>
> However, backreferences force these tools to skip the DFA-based
> optimization and resort to running the regexp code.  In that case,
> there is a dichotomy.  Adding a backreference to a range-including
> regexp would have the surprising consequence of changing how that range
> is interpreted when the tool is built to use glibc's regexp code.
>
> Thus, if we go this route, we are effectively saying
> that people who want self-consistent regex-handling
> in our tools must build with --with-included-regex or end
> up causing subtle problems.
>
> That's a big leap.
> I'm not saying I won't take upstream grep over the edge,
> but I'd like to hear what a few distro-maintainers think.
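
As a rough illustration of the dichotomy described in the quoted text,
here is a minimal Python sketch of the dispatch (not grep's actual C
code).  The names choose_matcher and _BACKREF are hypothetical, and the
backreference test is deliberately simplified: it ignores bracket
expressions, where a backslash is literal in POSIX syntax.

```python
import re

# Hypothetical helper: an unescaped backslash followed by a digit 1-9.
# An even run of backslashes before it is treated as escaped literals.
_BACKREF = re.compile(r"(?<!\\)(?:\\\\)*\\[1-9]")


def choose_matcher(pattern: str) -> str:
    """Return which engine would handle the pattern in this sketch.

    "dfa"   -> ranges are interpreted by the DFA code (dfa.c in grep)
    "regex" -> a backreference forces the fallback to the regex library,
               whose range interpretation may differ
    """
    return "regex" if _BACKREF.search(pattern) else "dfa"


if __name__ == "__main__":
    for pat in (r"[a-z]+", r"\([a-z]\)\1", r"a\\1"):
        print(pat, "->", choose_matcher(pat))
```

The point is only that the same [a-z] can end up in two different
engines depending on whether a backreference appears elsewhere in the
pattern.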

To clarify...
I like Arnold's proposal to make regex range handling sane
and locale-independent.

It goes like this (at least for gawk, grep and sed):

  change how dfa.c interprets ranges like [a-z]
  change how gnulib's reg* code handles ranges

Always use the included regex code (the one from gnulib),
so that its interpretation is consistent with that of dfa.c.
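
To make the difference concrete, here is a small Python sketch of the
two ways a range like [a-z] can be interpreted.  The collation sequence
below is a made-up one that interleaves lower and upper case; it is an
assumption for illustration, not the actual ordering of any glibc
locale, and the function names are hypothetical.

```python
import string

# Hypothetical collation sequence: a A b B ... z Z (illustration only).
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_codepoint(ch: str, lo: str, hi: str) -> bool:
    """Locale-independent reading: compare raw code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collation(ch: str, lo: str, hi: str) -> bool:
    """Locale-dependent reading: compare positions in the collation sequence."""
    pos = COLLATION.index
    return pos(lo) <= pos(ch) <= pos(hi)


if __name__ == "__main__":
    for ch in "aBZ":
        print(ch,
              in_range_codepoint(ch, "a", "z"),
              in_range_collation(ch, "a", "z"))
```

Under the code-point reading, [a-z] matches only lower-case ASCII
letters.  Under the made-up collation order, it also matches "B" but
not "Z", which is the kind of surprise the proposal aims to remove.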

Grep's current upstream default is to build --without-included-regex,
which makes grep use glibc's regex code.

To make this proposed change go through, that configure-time option would
have to be eliminated, so that we always build with the gnulib-provided
regex code.  Of course, if glibc ever changes, we can detect that and
automatically prefer it when possible.


