bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bug#22357: grep -f not only huge memory usage, but also huge time co


From: Trevor Cordes
Subject: Re: bug#22357: grep -f not only huge memory usage, but also huge time cost
Date: Sun, 11 Dec 2016 05:28:56 -0600

On 2016-12-11 Norihiro Tanaka wrote:
> The changes switch used algorithm.  They convert grep -w -F to grep
> -w.

Hi, thanks for helping!  Sorry, yes, I forgot I was using
--fixed-strings (-F), so yes, my example should have used -F.

> Try following test case and before and after the changes, please.
> 
> $ yes $(printf %040d 0) | head -10000000 >k
> $ time -p env LC_ALL=C grep -w -F 0 k
> 
> It did not finish in 40s on my machine without the changes, but
> finished in 0.54s after the changes, and following case finished in
> 1.35s regardless whether the changes applied or not.
> 
> $ time -p env LC_ALL=C grep -w 0 k

OK, I see now where you are going with your examples.  You are trying
to show why the change was made in the first place.  That makes sense.

> It may be an extreme example, but if a lot of people assume that grep
> -F is faster than grep, I think that it is a bug.

I've read in numerous places (O'Reilly books mostly) that grep or pcre
is often/sometimes faster than fgrep, so I think it is (somewhat) common
knowledge and I wouldn't worry too much about that perception.  (Even
though what you say is intuitive.)

> > grep -w -f /usr/share/dict/words /tmp/greptest
> > (good older version: 2 seconds to complete, minimal memory)
> > (any version after the above commits: 10 or more minutes, never
> > waited for it to finish, 1.2GB RAM usage and 100% cpu)  
> 
> I believe that it can not happen, as the changes work for grep -F
> only. Is it grep -F -w, correctly?  grep -F -w is not good at long
> pattern after the changes.

Yes, you are correct, I mistakenly left out the -F in my example.

> I think that the real issue of bug#21763, bug#22239 and bug#22357 is
> that dfa uses a lot of memory for a long pattern, but We have no idea
> to improve it currently.

Let me digress for a moment... I think it's important you can reproduce
my test case symptom before we spend time on the solution:

> However, the poor performance will be improved partly.  Following case
> is returned in 2.4s after patched on my machine.
> 
> $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null

On my box the above runs for >2m (never completes before I ^C) on the
version **AFTER** the commits (v2.22).  On the test build just *BEFORE*
the commits (2.21.73-8058), it runs in <2s.  So for me, I had a working
command (-F -w -f) that used to run quickly, that suddenly became
slow/useless.  And none of the suggestions I've read can alleviate the
problem for me, not locale settings, nor anything else.  If you can
confirm this is the case on your box, then maybe we can figure out why?

The main question I think for my case is why is "-F -w -f" switching
modes to the slow DFA at all?  AFAIK there's no encoding flaw in the
input, so why is grep even switching modes in this use case?  What was
wrong with just using the old mode?  From what I've read, people don't
want DFA-mode when using -F and -f.

Or, there should be an option to grep that we can specify to turn off
this new behavior to get back the old (<2s) performance.

Thanks!



reply via email to

[Prev in Thread] Current Thread [Next in Thread]