Re: [ELPA] New package: find-dups

From: Eli Zaretskii
Subject: Re: [ELPA] New package: find-dups
Date: Wed, 11 Oct 2017 21:56:43 +0300

> From: Michael Heerdegen <address@hidden>
> Date: Wed, 11 Oct 2017 19:56:26 +0200
> Cc: address@hidden, Emacs Development <address@hidden>
> #+begin_src emacs-lisp
> (find-dups my-sequence-of-file-names
>            (list (list (lambda (file)
>                          (file-attribute-size (file-attributes file)))
>                        #'eq)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "head %s"
>                                   (shell-quote-argument file))))
>                        #'equal)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "md5sum %s | awk '{print $1;}'"
>                                   (shell-quote-argument file))))
>                        #'equal)))
> #+end_src

Apologies for barging into the middle of a discussion, but starting
processes and making strings out of their output to process just a
portion of a file is sub-optimal, because process creation is not
cheap.  It is easier to simply read a predefined number of bytes into
a buffer; insert-file-contents-literally supports that.  Likewise with
md5sum: we have the md5 primitive for that.
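
To make that concrete, here is a rough sketch of what the two
shell-based key functions could look like without starting any
processes.  The names my-file-head-bytes and my-file-md5, and the
1024-byte default, are made up for illustration and are not part of
find-dups; the explicit 'binary argument to md5 is my assumption about
how to keep it hashing the raw bytes.

```emacs-lisp
;; Hypothetical helpers, not part of find-dups.

(defun my-file-head-bytes (file &optional nbytes)
  "Return the first NBYTES (default 1024, an arbitrary choice) raw bytes of FILE.
The result is a unibyte string; no decoding takes place."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    ;; BEG = 0 and END = NBYTES restrict the read to the file's head.
    (insert-file-contents-literally file nil 0 (or nbytes 1024))
    (buffer-string)))

(defun my-file-md5 (file)
  "Return the MD5 hex digest of the raw bytes of FILE, without running md5sum."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    ;; Hash the buffer directly; 'binary avoids any re-encoding.
    (md5 (current-buffer) nil nil 'binary)))
```

Something like (list #'my-file-md5 #'equal) could then take the place
of the md5sum entry in the list above.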

In general, working with buffers is much more efficient in Emacs than
working with strings, so avoid strings, let alone large strings, as
much as you can.

One other comment is that shell-command-to-string decodes the output
from the shell command, which is not something you want here, because
AFAIU you are looking for files whose contents are identical on the
byte-stream level, i.e. two files which have the same characters, but
are encoded differently on disk (like one UTF-8, the other Latin-1)
should be considered different in this context, whereas
shell-command-to-string will/might produce identical strings for them.
(Decoding is also expensive run-time wise.)
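
A small sketch of the decoding issue, under the assumption that the
same text is saved once as UTF-8 and once as Latin-1.  The function
names my-raw-bytes and my-encoding-demo are invented for this example.
It compares the two files first as raw bytes and then after decoding:

```emacs-lisp
;; Hypothetical demo, not part of find-dups.

(defun my-raw-bytes (file)
  "Return the undecoded contents of FILE as a unibyte string."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (buffer-string)))

(defun my-encoding-demo ()
  "Write the same text in two encodings; return (RAW-EQUAL DECODED-EQUAL)."
  (let ((utf8 (make-temp-file "dups-utf8"))
        (latin1 (make-temp-file "dups-latin1")))
    (unwind-protect
        (progn
          ;; Same characters, different bytes on disk.
          (let ((coding-system-for-write 'utf-8-unix))
            (write-region "caf\u00e9" nil utf8 nil 'silent))
          (let ((coding-system-for-write 'iso-latin-1-unix))
            (write-region "caf\u00e9" nil latin1 nil 'silent))
          (list
           ;; Byte-level comparison: the files differ.
           (equal (my-raw-bytes utf8) (my-raw-bytes latin1))
           ;; After decoding, the strings compare equal.
           (equal (decode-coding-string (my-raw-bytes utf8) 'utf-8)
                  (decode-coding-string (my-raw-bytes latin1) 'iso-latin-1))))
      (delete-file utf8)
      (delete-file latin1))))
```

Evaluating (my-encoding-demo) should give (nil t): the files are
different as byte streams, yet the decoded strings are equal.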
