Re: Using hash instead of timestamps to check for changes.
From: Paul Smith
Subject: Re: Using hash instead of timestamps to check for changes.
Date: Fri, 27 Mar 2015 11:45:08 -0400
On Fri, 2015-03-27 at 14:42 +0100, Glen Stark wrote:
> Is this planned? Has the idea already been rejected, and if so could
> you point me to the discussion so I can inform myself?
There is no formal planning around it right now, and it's not at the top
of my TODO list for GNU make.
> If it is planned, or you agree it's worth doing, how can I help? I'm
> willing to write the code if someone is willing to help me work into the
> code a little. Until now I'm only a user, not maintainer of Make, and
> would need some tips about how to fit the functionality into the overall
> design of Make. Someone to bounce ideas off, and direct questions to
> would be wonderful. If someone else is working on it already, I'd like
> to help however I can -- testing, debugging, etc.
I'm not aware of anyone working on it. It sounds like a simple thing,
but actually there are a lot of issues that need to be considered before
any implementation can be started. The important thing to remember is
that currently make is completely stateless... or rather, it uses the
filesystem to maintain its state (in the form of modification times).
Any change to a method of determining "out-of-date-ness" such as a hash
of the file content means introducing a separate state that make has to
maintain: this adds a lot of complexity and corner cases to work
through.
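To make the "filesystem as state" point concrete, here is a rough sketch of the modification-time rule in Python. It is an illustration only, not make's actual code; the name needs_rebuild is made up for this example, and it assumes every prerequisite already exists.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite.

    Hypothetical helper: the only "state" consulted is what the
    filesystem already records, namely modification times.
    """
    try:
        target_mtime = os.stat(target).st_mtime_ns
    except FileNotFoundError:
        return True  # a missing target always has to be built
    # Assumes each prerequisite exists; real make would try to build it.
    return any(os.stat(p).st_mtime_ns > target_mtime for p in prerequisites)
```

Nothing is remembered between runs here, which is why any scheme that needs to know what a file looked like "last time" has to store that somewhere new.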
Before anyone can consider writing code of this magnitude, they should
familiarize themselves with the FSF's requirements for contributing to
the GNU project; you'll need to assign copyright to the FSF for the work
contributed to GNU make, which involves some legal paperwork on your
part. If your employer has rights to your work (which most do, at
least in the U.S., even if you don't do the work on the job), your
employer will have to agree as well.
On the technical side, there are various things to consider:
* What form will the extra state be kept in? One file per
directory? One file per target? Something else?
  * If we use one file per target things are simpler, although that
  adds up to a LOT of files in bigger builds and some platforms
  might have problems.
  * If we use one file per directory, there are lots of issues:
    * When is the file written? Every time a target is
    updated? Once at the end of the build?
    * How will make handle the state file if it's killed in
    the middle of a build?
    * How will make handle missing/corrupted state files?
    Will it fall back on modification times, or just rebuild
    everything?
    * How do we handle recursion, where multiple instances of
    make could be running in the same directory?
  * We need to consider platform-specific issues; for example on
  UNIX systems a cheap/fast method of keeping per-file metadata
  might be to make a symbolic link containing the data, but that
  won't work on Windows or VMS, etc.
* What type of extra state will we use? My suspicion is that
md5sum is not the best. We don't really need it: we want
fingerprinting not a cryptographic hash. We don't even need to
do de-dup so we won't run into the birthday paradox: we only
want to know if the file has changed since the last time we saw
it. Probably a straightforward, well-distributed hash like
xxhash would be sufficient. If you combine both mod time AND
the hash that's pretty definitive; you can probably get away
with a 32-bit hash.
* What are the performance implications? You're committing to
having make read the entire content of every single file
involved in the build into memory, just to decide what to
update! That's definitely going to hurt: a simple "nothing to
do" build will suffer a big performance penalty. In fact, in a
way the fewer jobs make needs to run the slower it will be,
since it will have to check the hash of every target where the
mod time doesn't give an answer. Maybe the hashing could be
done per-block instead of on the entire file so you could fail
faster, or something. But now you're storing more state per
target (multiple hashes per target).
* Do we really need to hash the file? Maybe simply expanding the
current checking is sufficient. For example, if in addition to
mod time we also considered the size of the file (and maybe
other things maintained by the filesystem like inode, for tools
which don't just overwrite the same file) we could increase our
accuracy WITHOUT resorting to a separate state file. Is that
good enough?
* What if people want to define their own "out-of-date-ness" test?
Maybe someone wants to integrate with inotify, or they want to
check the preprocessor output so that files are not considered
changed just because a comment changes, or something.
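To show how a few of these pieces might fit together, here is a minimal Python sketch of one possible answer: one state file per directory, size and mod time checked first, and a 32-bit fingerprint computed only when the mod time doesn't settle the question. Everything named here (STATE_FILE, fingerprint, load_state, save_state, has_changed) is hypothetical and not part of GNU make. zlib.crc32 stands in for xxhash only because it ships with Python, and JSON is an arbitrary choice of format.

```python
import json
import os
import zlib

STATE_FILE = ".make-state.json"  # hypothetical name, not a GNU make feature


def fingerprint(path, chunk_size=1 << 16):
    """32-bit CRC of the file content, read in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def load_state(directory):
    """Load the per-directory state; a missing or corrupt file means empty."""
    try:
        with open(os.path.join(directory, STATE_FILE)) as f:
            state = json.load(f)
    except (OSError, ValueError):
        return {}
    return state if isinstance(state, dict) else {}


def save_state(directory, state):
    """Write to a temporary file, then rename it over the old state file."""
    final = os.path.join(directory, STATE_FILE)
    tmp = final + ".tmp"  # fixed name: NOT safe with concurrent makes
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, final)


def has_changed(path, state):
    """Check cheap metadata first; hash only when the mod time differs."""
    st = os.stat(path)
    name = os.path.basename(path)
    old = state.get(name)
    if not isinstance(old, dict):
        old = None  # unknown or malformed entry: treat the file as new
    if (old is not None and old.get("size") == st.st_size
            and old.get("mtime_ns") == st.st_mtime_ns):
        return False  # size and mod time match: skip reading the file
    new = {"size": st.st_size, "mtime_ns": st.st_mtime_ns,
           "crc": fingerprint(path)}
    changed = (old is None or old.get("size") != new["size"]
               or old.get("crc") != new["crc"])
    state[name] = new
    return changed
```

With this arrangement a file that was merely touched is reported as unchanged, at the cost of reading it once. The sketch writes the state only when save_state is called, so a build killed midway keeps the previous state file intact but loses any updates made since. It does nothing about recursion or multiple make instances sharing a directory; those questions remain open.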