bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bug#22357: grep -f not only huge memory usage, but also huge time co


From: Trevor Cordes
Subject: Re: bug#22357: grep -f not only huge memory usage, but also huge time cost
Date: Mon, 12 Dec 2016 04:36:21 -0600

On 2016-12-12 Norihiro Tanaka wrote:
> > On my box the above runs for >2m (never completes before I ^C) on
> > the version **AFTER** the commits (v2.22).  On the test build just
> > *BEFORE* the commits (2.21.73-8058), it runs in <2s.  So for me, I
> > had a working command (-F -w -f) that used to run quickly, that
> > suddenly became slow/useless.  And none of the suggestions I've
> > read can alleviate the problem for me, not locale settings, nor
> > anything else.  If you can confirm this is the case on your box,
> > then maybe we can figure out why?  
> 
> Can you re-run with current master, as dfa has been improved
> frequently?

env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
ran for 3+ minutes and then quit; no change then.  I had run on HEAD
before my git bisect anyway, just to ensure it hadn't been fixed in the
meantime.

If HEAD helps, then it still isn't good enough, as taking 2+ orders of
magnitude more time is still not a viable solution for me.

> First is faster after the changes, and second is slower after the

The problem, I think, isn't "faster" or "slower", it's usable vs.
unusable. I think it's not possible to change a working-for-decades use
case to increase runtime by 2, 3 or 10 orders of magnitude without
leaving a bunch of previously-happy end users stranded.  Hence all the
bug reports.  Conversely, not many other people are going to notice or
care that their other use case improved by 10 or 20% runtime.  This
change is exact that tradeoff: under 1 order of magnitude improvement
for some people, at the expense of many orders of magnitude detriment
for others.

> If we think simple algorithm as we select kws matcher for grep -F -w
> -f and select grep dfa matcher for -F -w -e, we will get extreamly
> different performance for following two cases.

OK, I looked at the code a lot more closely.  Now I see a more salient
point perhaps we should be looking at:

"In a unibyte locale, switch from fgrep to grep if the pattern matches
words (where grep is typically faster)."

Why is this even something that's being done?  If a user bothered to
specify fgrep or -F then that user knows what they are doing!  Only
advanced users even know that fgrep/-F exists and use it.  What the
above comment says is that "since we think it'll be faster, let's
override the will of the user because we think we know better"!  Why not
give the user credit that they know what they are doing?

And in this case, again consider my above point that you are gaining
tiny improvements for some at the expense of horrendous penalties for
others.

Perhaps undoing the commit and simply expounding on the issue in the
docs (man page, etc) would be a better approach?  Then the user has
control of the situation in their choice of -F or no -F.  What is wrong
with that?

> changes.  It's a trade-off.  Can you have any idea to select the
> better matcher for both two cases?
> 
> - $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
> 
> - $ env LC_ALL=C grep -F -w -e "`cat /usr/share/dict/words`" /dev/null

Yes, you are correct, if this commit is to stand and we want to find
some way to solve my (and the other bug reporters') use case
performance problems, then that question will have to be answered.  I
would say empirical tests would have to be run to discover where the
performance gain/loss crossover point lies (perhaps based on
input/expression size) and switch based on that.

However, my hunch is not trying to outsmart the user in the first place
is the correct approach.

Thanks for your time and input!



reply via email to

[Prev in Thread] Current Thread [Next in Thread]