bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bug#22357: grep -f not only huge memory usage, but also huge time co


From: Bruno Haible
Subject: Re: bug#22357: grep -f not only huge memory usage, but also huge time cost
Date: Sun, 11 Dec 2016 18:55:32 +0100
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

Trevor Cordes wrote:
> I've read in numerous places (O'Reilly books mostly) that grep or pcre
> is often/sometimes faster than fgrep, so I think it is (somewhat) common
> knowledge and I wouldn't worry too much about that perception.

It is wrong to put the burden of the algorithm choice on the user.

Many computation tasks can be done through several different algorithms that
produce the same results but with different performance characteristics. The
main benefit of having one program that incorporates both/all algorithms is
that the user does NOT have to thing about which algorithm is better.
For example:

  - When you write a C program, you don't have to choose among
      y = 2 * x;
    and
      y = x + x;
    depending on CPU. The compiler picks the right instruction for you.

  - When you compute a sin(x) value, you don't have to say whether you
    want the Taylor series around 0 or a polynomial approximation.
    The implementation does it for you.

  - You don't need to tell 'sort' whether it should sort in memory or
    use temporary files.

  - When you execute a database query SELECT X WHERE A = 10 AND B = 20
    you don't have to tell the database engine whether to match the A
    column first and then the B column or vice versa - the database engine
    knows better which way to choose.

And so on. Computer science wouldn't be where it is without this paradigm.

Norihiro Tanaka wrote:
> dfa matcher is not always slower than kws matcher. ...
> It's a trade-off.  Can you have any idea to select the better
> matcher for both two cases?

There are at least two approaches.

1) If you have vastly different run times (2 sec vs. 10 min or so), the
program can set up two threads and run each algorithm in one thread. Buffer
the output. When a thread terminates, use its output and kill the other thread.

Now that is still suboptimal, because on average this approach will lose a
factor of 2, just because it does not know which is the best algorithm.

2) You can run a set of benchmarks, indexed by 2 parameters (m,n)
    m = number of keywords,
    n = total number of bytes of the files to be searched,
run each of the two algorithms, and note the best algorithm. This way
you fill a matrix of results (in the (m,n) plane). Now find a formula
that roughly tells you for given (m,n) whether you are in one area of
the plane (where kwset is better) or in the other area of the plane
(where dfa is better). Finally, code this formula into the 'grep' program.

Bruno




reply via email to

[Prev in Thread] Current Thread [Next in Thread]