bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bug#40634: Massive pattern list handling with -E format seems very s


From: Jim Meyering
Subject: Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.
Date: Fri, 11 Sep 2020 14:47:49 +0200

On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> On Sun, 19 Apr 2020 07:41:49 +0900
> Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > On Sat, 18 Apr 2020 00:22:26 +0900
> > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> >
> > >
> > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > >
> > > > >
> > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > Paul Eggert <eggert@cs.ucla.edu> wrote:
> > > > >
> > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > >
> > > > > > > I have had no idea to solve the problem yet.  If we revert it, 
> > > > > > > bug#33357
> > > > > > > will come back.
> > > > > >
> > > > > > Yes, I'd rather not revert if we can help it.
> > > > > >
> > > > > > My own thought was to not analyze the regular expression if we 
> > > > > > discover that the input is empty. :-)
> > > > >
> > > > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > > > including in follows before remove epsilon nodes.
> > > >
> > > >
> > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> > >
> > > It was improved previous patch.
> >
> > Sorry, correct patch is here.
>
> I made the previous patch even simpler.
>
> before:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 7.24
> user 7.14
> sys 0.09
>
> after:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 0.62
> user 0.52
> sys 0.10

Thank you for this patch. I have rebased and made minor syntactic changes.
I'll push it to gnulib soon, if not today, then by Monday.

I am considering creating a test case in grep, but it feels too tight
to be feasible: I would use a relative perf test, requiring that a
passing test incur a perf cost of less than say 100x. Here's the
beginnings of my attempt (note: this is just an outline -- obviously
would not rely on having "time" in path or as a shell builtin):

gen()
{
  local n=$1
  local i=1
  while : ; do
    local pat=$(printf $i | sha1sum | cut -d' ' -f1)
    printf '%s\n' "$pat$pat(\$|$pat)"
    i=$(expr $i + 1)
    test $i = $n && break
  done
}

gen 4000 > pats-4000
head -400 pats-4000 > pats-400

# With fixed code, that a 10x input size increase (n=400 to 4000)
# induces a 40x runtime increase: .05 -> 2.0s
# Just prior to this change, it's 150x: 0.2 -> 30s

env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null



reply via email to

[Prev in Thread] Current Thread [Next in Thread]