
Re: [ELPA] New package: find-dups

From: Michael Heerdegen
Subject: Re: [ELPA] New package: find-dups
Date: Wed, 11 Oct 2017 19:56:26 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.0.60 (gnu/linux)

Robert Weiner <address@hidden> writes:

> This seems incredibly complicated.  It would help if you would state
> the general problem you are trying to solve and the performance
> characteristics you need.  It certainly is not a generic duplicate
> removal library.  Why can't you flatten your list and then just apply
> a sequence of predicate matches as needed or use hashing as mentioned
> in the commentary?

I guess the name is misleading; I'll try to find a better one.

Look at the example of finding files with equal contents in your file
system: you have a list or stream of, say, 10000 files in a file
hierarchy.  If you calculate hashes of all of those 10000 files, it will
take hours.

It's wiser to do it in steps: first, look at the sizes of all the
files.  That's a very fast test, and files with equal contents have the
same size.  You can discard all files with unique sizes.
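
To make that first step concrete, here is a minimal sketch of the size
test on its own.  The function name `my-size-candidates' is made up for
illustration and is not part of the library; it only shows the idea of
grouping by a cheap key and dropping the groups with a single member.

#+begin_src emacs-lisp
;; Sketch only: `my-size-candidates' is a hypothetical helper, not part
;; of find-dups.
(defun my-size-candidates (files)
  "Return groups of FILES that share a size, dropping unique sizes."
  (let ((by-size (make-hash-table :test #'eql))
        (groups '()))
    (dolist (file files)
      ;; `file-attributes' is cheap: it does not read the file contents.
      (let ((size (file-attribute-size (file-attributes file))))
        (push file (gethash size by-size))))
    (maphash (lambda (_size group)
               ;; A file whose size is unique cannot have a duplicate.
               (when (cdr group)
                 (push group groups)))
             by-size)
    groups))
#+end_src

Only the files in the returned groups need to be looked at again.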

In a second step, we have fewer files.  We could look at the first N
bytes of the files.  That's still quite fast.  What is left are groups of
files with equal sizes and equal heads.  For those it's worth calculating
a hash sum to see which also have equal contents.

The idea of the library is to abstract over the type of elements and the
number and kinds of tests.  So you can write the above as

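#+begin_src emacs-lisp
(find-dups my-sequence-of-file-names
           (list (list (lambda (file)
                         ;; Step 1: file size, very cheap.
                         (file-attribute-size (file-attributes file))))
                 (list (lambda (file)
                         ;; Step 2: head of the file (run the command,
                         ;; don't just build the command string).
                         (shell-command-to-string
                          (format "head %s"
                                  (shell-quote-argument file)))))
                 (list (lambda (file)
                         ;; Step 3: hash sum of the whole contents.
                         (shell-command-to-string
                          (format "md5sum %s | awk '{print $1;}'"
                                  (shell-quote-argument file)))))))
#+end_src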

and `find-dups' executes the algorithm with the steps as specified.  You
only need to specify the tests; you don't need to write out the code
yourself.
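
To illustrate the abstraction, here is a rough sketch of how such a
stepwise refinement could look.  It is not the code of the library:
`my-refine-groups' is a made-up name, it takes a plain list of key
functions rather than the argument format shown above, and it assumes
keys can be compared with `equal'.

#+begin_src emacs-lisp
;; Sketch only: `my-refine-groups' is a hypothetical name and does not
;; reproduce the find-dups implementation or its argument format.
(defun my-refine-groups (elements key-functions)
  "Group ELEMENTS that agree under every function in KEY-FUNCTIONS.
Keys are compared with `equal'.  Groups of size one are dropped after
each step, so later (more expensive) functions see fewer elements."
  (let ((groups (list elements)))
    (dolist (key-fn key-functions)
      (let ((refined '()))
        (dolist (group groups)
          ;; Split this group by the key computed in the current step.
          (let ((table (make-hash-table :test #'equal)))
            (dolist (item group)
              (push item (gethash (funcall key-fn item) table)))
            (maphash (lambda (_key members)
                       ;; Keep only subgroups that still have candidates.
                       (when (cdr members)
                         (push members refined)))
                     table)))
        (setq groups refined)))
    groups))

;; Usage with plain data instead of files: strings are grouped first by
;; length (cheap), then by their full contents.
(my-refine-groups '("ab" "cd" "ab" "xyz" "q")
                  (list #'length #'identity))
;; The only surviving group holds the two "ab" strings.
#+end_src

The point is that the expensive key functions only ever run on elements
that survived all the cheaper steps before them.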

Do you need a mathematical formulation of the abstract problem that the
algorithm solves, and how it works?  I had hoped the example in the
header is a good explanation...


