[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: bug#22357: grep -f not only huge memory usage, but also huge time co
From: |
Bruno Haible |
Subject: |
Re: bug#22357: grep -f not only huge memory usage, but also huge time cost |
Date: |
Mon, 12 Dec 2016 12:49:29 +0100 |
User-agent: |
KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; ) |
Hi Paul,
> Before installing anything like that, I'd first like to look into improving
> the
> DFA performance, along the lines suggested by Norhiro Tanaka. Part of the
> problem appears to be that position-set merging, even with his latest
> proposed
> changes, is O(N**2) where N is the pattern size. Perhaps we should change the
> representation of position sets to use simple arrays; although this would
> take
> more space, merging would be O(N) and the code would surely be simpler.
I'm confused. Which code are you talking about? gnulib's dfa.c, function merge()
(line 1994)? This merge function operates in O(s1->nelem + s2->nelem), since in
each loop round i+j is being incremented by at least 1. Thus it is
asymptotically
optimal. Or do you mean that merge() is being called too frequently? But then
changing the representation of position_set wouldn't help.
When I run a gprof build of grep with the two files file1, file2, I get:
$ ./grep -v -f <(head -n 2000 ../../../file1) <(head -n 2000 ../../../file2) |
cat > /dev/null
$ gprof grep
...
% cumulative self self total
time seconds seconds calls s/call s/call name
48.84 1.47 1.47 26339 0.00 0.00 dfastate
15.62 1.94 0.47 24365 0.00 0.00 state_index
14.29 2.37 0.43 50101 0.00 0.00 merge
...
$ ./grep -v -f <(head -n 5000 ../../../file1) <(head -n 5000 ../../../file2) |
cat > /dev/null
$ gprof grep
...
% cumulative self self total
time seconds seconds calls s/call s/call name
54.06 10.45 10.45 67061 0.00 0.00 dfastate
13.50 13.06 2.61 62127 0.00 0.00 state_index
12.21 15.42 2.36 126516 0.00 0.00 merge
...
So, the larger the two input files, the larger the relative time eaten by
'dfastate'. IMO this means the bottleneck is 'dfastate'.
And when I run a gcov build of grep with the two files file1, file2, I get:
$ ./grep -v -f <(head -n 5000 ../../../file1) <(head -n 5000 ../../../file2) |
cat > /dev/null
$ cd ../lib
$ gcov dfa.gcda
Find attached the dfa.c.gcov.
Conclusions?
Bruno
dfa.c.gcov.gz
Description: GNU Zip compressed data
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Norihiro Tanaka, 2016/12/10
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Trevor Cordes, 2016/12/11
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Norihiro Tanaka, 2016/12/11
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Bruno Haible, 2016/12/11
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, arnold, 2016/12/11
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Paul Eggert, 2016/12/11
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost,
Bruno Haible <=
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Paul Eggert, 2016/12/14
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Norihiro Tanaka, 2016/12/17
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Paul Eggert, 2016/12/19
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Norihiro Tanaka, 2016/12/19
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Paul Eggert, 2016/12/19
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Norihiro Tanaka, 2016/12/20
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Norihiro Tanaka, 2016/12/12
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Trevor Cordes, 2016/12/12
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Paul Eggert, 2016/12/12
- Re: bug#22357: grep -f not only huge memory usage, but also huge time cost, Trevor Cordes, 2016/12/12