bug-gnulib

Re: Dealing with character ranges in grep


From: Johannes Meixner
Subject: Re: Dealing with character ranges in grep
Date: Thu, 16 Jun 2011 10:07:01 +0200 (CEST)
User-agent: Alpine 2.00 (LNX 1167 2008-08-23)


Hello,

I recently became the openSUSE package maintainer for grep and gawk.

I have added Stanislav Brabec, the openSUSE package maintainer for sed,
to the recipients.


In short:
I support and appreciate anything that leads to consistency.


On Jun 16 07:58 Jim Meyering wrote:
Jim Meyering wrote:
Bruno Haible wrote:
Paolo,

[=e=] to match "e" as well as accented versions like é, è and ê).
That is the one feature that you get with glibc, and that you would
sacrifice when building --with-included-regex.
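
As a rough illustration of what an equivalence class such as [=e=] is
meant to accept, here is a minimal Python sketch. It is not how glibc
implements the feature: the helper name in_equivalence_class is
hypothetical, and Unicode canonical decomposition (NFD) is used only as
a stand-in for the locale's collation data.

```python
import unicodedata


def in_equivalence_class(ch: str, base: str) -> bool:
    """Approximate [=base=]: does ch share its base letter with base?

    Assumption: NFD decomposition stands in for the locale's collation
    data. A real regex implementation consults the locale instead.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0] == base


# Candidates: e, e-acute, e-grave, e-circumflex, f
candidates = ["e", "\u00e9", "\u00e8", "\u00ea", "f"]
matches = [c for c in candidates if in_equivalence_class(c, "e")]
print(matches)
```

The accented forms are accepted because each decomposes to "e" plus a
combining mark, while "f" is rejected.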

I agree.  It's up to distros to choose, of course.

If you are on the point of sacrificing a glibc feature in many programs,
then IMO you should first talk with the glibc people to see what alternative
they can offer.

People who build the tools currently have the choice of using
--with-included-regex or
--without-included-regex

Note that putting equivalence classes (and backrefs) aside, the
interpretation of ranges is done in dfa.c, which means the vast
majority of range uses never even require use of regexp code.

However, backreferences force these tools to skip the DFA-based
optimization and resort to running the regexp code.  In that case,
there is a dichotomy.  Adding a backreference to a range-including
regexp would have the surprising consequence of changing how that range
is interpreted when the tool is built to use glibc's regexp code.
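
To make the dichotomy concrete, here is a small Python sketch of two
ways a matcher might read the range [a-z]. It only models the effect
and is not taken from dfa.c or glibc: the function names are
hypothetical, and COLLATION_ORDER is an assumed toy ordering that
interleaves upper and lower case, as some locales' collation does.

```python
import string

# Assumed toy collation order: a A b B ... z Z.
# Real locale tables are far larger; this only models the ordering effect.
COLLATION_ORDER = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_codepoint(ch: str, lo: str, hi: str) -> bool:
    """Locale-independent reading: compare raw code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collation(ch: str, lo: str, hi: str) -> bool:
    """Collation-based reading: compare positions in the toy ordering."""
    rank = COLLATION_ORDER.index
    return ch in COLLATION_ORDER and rank(lo) <= rank(ch) <= rank(hi)


sample = "aBzZ"
codepoint_hits = [c for c in sample if in_range_codepoint(c, "a", "z")]
collation_hits = [c for c in sample if in_range_collation(c, "a", "z")]
print(codepoint_hits, collation_hits)
```

Under the code-point reading only "a" and "z" fall inside the range;
under the toy collation order "B" falls inside it as well. If one code
path used the first reading and another the second, the same bracket
expression would match different input depending on which path ran.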

Thus, if we go this route, we are effectively saying
that people who want self-consistent regex-handling
in our tools must build with --with-included-regex or end
up causing subtle problems.

That's a big leap.
I'm not saying I won't take upstream grep over the edge,
but I'd like to hear what a few distro-maintainers think.

To clarify...
I like Arnold's proposal to make regex range handling sane
and locale-independent.

It goes like this (at least for gawk, grep and sed):

 change how dfa.c interprets ranges like [a-z]
 change how gnulib's reg* code handles ranges

Always use the included regex code (the one from gnulib),
so that its interpretation is consistent with that of dfa.c.

Grep's current upstream default is to build --without-included-regex,
which makes grep use glibc's regex code.

To make this proposed change go through, that configure-time option would
have to be eliminated, so that we always build with the gnulib-provided
regex code.  Of course, if glibc ever changes, we can detect that and
automatically prefer it when possible.

I don't mind how various tools actually handle regular expressions
(in particular character ranges) and I do not care if this or that
special feature is supported or not.

But I do very much appreciate any effort which lets all those
various tools handle regular expressions in the exact same way.


I would appreciate it even more if all those tools worked in exactly
the same way across all the various Linux distributions.

Therefore I would support eliminating the compile-time choice
of how regex handling is built
(--with-included-regex versus --without-included-regex).


I think these tools are so basic that consistent behaviour
must have top priority: normal users do not understand
inconsistent behaviour, and experts who work on various
Linux systems do not like subtle inconsistencies in
the basic tools.


Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
