bug-gnu-utils

Re: grep - HUGE performance Problems in grep 2.5.1


From: Stepan Kasal
Subject: Re: grep - HUGE performance Problems in grep 2.5.1
Date: Fri, 16 May 2003 09:58:13 +0200
User-agent: Mutt/1.2.5.1i

Hello,

On Thu, May 15, 2003 at 04:02:13PM -0400, Bryant Misener wrote:
> For grep 2.4.2   the commands below takes  0.55 seconds to find 45549 lines
> For grep 2.5.1 the commands below takes  5 minutes and 47 seconds....!!!!
> 
> This test is running under Linux - RedHat 9.0 using the downloaded gz 
> files from your site.

Quick temporary fix: recompile grep, using
         ./configure --with-included-regex

Warning: this may not work for future releases of grep.
        I guess that the next release (2.5.2) will be the last one for
        which this works.
A better explanation is below.

There are several things which cause this in combination:
- grep 2.5.1 links against newer regex library (part of new glibc), which is
  fully POSIX compliant, especially with respect to weird character sets
- that new regex library is slower (that's the price for being fully compliant),
  and it's extremely slow when weird charsets are used
- RedHat uses UTF-8 by default, which is such a "weird charset"

There are several things you might try:

1) --with-included-regex
This configure option tells grep to use the old regex library, which is not
Unicode-aware.

2) export LC_ALL=C
This tells grep to use good old ASCII, not Unicode.
That should help somewhat, though you won't get the performance of the old regex.
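
To check whether the locale is what slows a particular search down, you can
run the same command once with the inherited locale and once with the locale
forced to C for that single invocation.  A minimal sketch; the sample data
and the pattern 'foo' are placeholders, not taken from the original report:

```shell
# Create a small throwaway input file (placeholder data for illustration).
sample=$(mktemp)
printf 'foo bar\nbaz\nfoo again\n' > "$sample"

# Search with whatever locale the environment provides (e.g. a UTF-8 one).
count_default=$(grep -c 'foo' "$sample")

# Same search, but with the locale forced to C for this one command only;
# the surrounding shell's locale is left untouched.
count_c=$(LC_ALL=C grep -c 'foo' "$sample")

echo "default locale: $count_default matches, C locale: $count_c matches"

# Remove the throwaway file.
rm -f "$sample"
```

On a real input, putting your shell's timing facility in front of each grep
call would show how much of the slowdown the locale accounts for.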

OTOH, if you try to download and compile grep-2.4.2 you'll probably find out
that it is as slow as 2.5.1.  The reason is that it defaults to
--without-included-regex, thus uses the "new" regex library, which is slow.

The long term solution is to make new regex.c quick enough, at least for
ASCII, perhaps also for UTF-8.

Since being correct has always been much more important for GNU programs than
speed, grep will adopt the new regex.  So it's quite possible that the
next release 2.5.2 will be the last one which contains the old regex.c
code.

Well, get into the habit of resetting the locale to "C" in each script you
write.
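
One way to do that is to set and export LC_ALL at the top of the script, so
every command it starts inherits the C locale.  A sketch only; the grep call
and its input are made-up stand-ins for the real work of a script:

```shell
#!/bin/sh
# Reset the locale first, so that grep (and any other locale-sensitive tool
# started from this script) works on plain bytes regardless of the caller's
# environment.
LC_ALL=C
export LC_ALL

# Hypothetical example work: count the lines starting with "alpha".
matches=$(printf 'alpha\nbeta\nalpha2\n' | grep -c '^alpha')
echo "matching lines: $matches"
```

Note that this only affects the script and its child processes; the locale of
the shell that invoked the script is not changed.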

Hope this explains the situation,
        Stepan Kasal