Re: Using hash instead of timestamps to check for changes.

bug-make

From:	Paul Smith
Subject:	Re: Using hash instead of timestamps to check for changes.
Date:	Sat, 04 Apr 2015 14:11:45 -0400

On Thu, 2015-04-02 at 13:20 +0200, Glen Stark wrote:
>   You asked "what if people want to define their own "out-of-date-ness"
>   test?".  I found that a really exciting idea.  As I thought about
>   this, I realized what I really want is not to replace Make's current
>   behavior, but to add an additional check to the existing timestamp
>   check.
> 
>    My thinking is that the timestamp is in fact an overly conservative
>   test.  We never have the case that the timestamp indicates something
>   *has not* been changed when in fact it has (i.e. we always build if
>   something has changed),

That's interesting, because in my experience the main reason people are
upset about timestamps these days is the exact opposite: with the
increase in capabilities of systems, in particular larger build servers,
it is possible to have situations where targets are updated too quickly
to reliably determine out-of-date-ness based solely on timestamps.
Filesystems which support sub-second modified time stamping mitigate the
issue somewhat, but not completely, and not all users can use these
filesystems.
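
As a rough illustration of that failure mode, here is a minimal Python sketch. It is not make's implementation: the helper names and the one-second granularity are assumptions chosen for the example. It only models the "prerequisite newer than target" comparison and what a coarse filesystem clock does to it.

```python
import math


def needs_rebuild(target_mtime: float, prereq_mtime: float) -> bool:
    """Timestamp test: rebuild only if the prerequisite is strictly newer."""
    return prereq_mtime > target_mtime


def stored_mtime(t: float, granularity: float = 1.0) -> float:
    """Round a time down to what a filesystem with this granularity records."""
    return math.floor(t / granularity) * granularity


# Hypothetical timeline: the target is built, then its prerequisite is
# edited half a second later, within the same clock second.
target_built = 100.20
prereq_edited = 100.70

# With exact timestamps the edit is noticed.
precise = needs_rebuild(target_built, prereq_edited)

# With one-second timestamps both files record the same time, so the
# edit is missed and the target looks up to date.
coarse = needs_rebuild(stored_mtime(target_built), stored_mtime(prereq_edited))
```

Finer granularity shrinks the window in which this can happen but does not close it, which matches the "mitigate, but not completely" point above.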

At the same time, it's rare that I (at least) modify the timestamp on a
file unless I've changed it.  Sure, sometimes it might happen (mostly by
accident) but this is rare enough to not be a big problem.  And as you
point out, this is annoying in that it could result in extra rebuilds,
but it's safe: it's much more significant to have the problem that make
decides NOT to rebuild things which DO need to be rebuilt.

For targets which OFTEN have timestamps incorrectly updated (say, for
example, autogenerated files which end up not changing) there are
well-defined methods for dealing with this, already used by autoconf,
etc.: they just generate the file to a temporary location, compare it,
and only replace the target if it's really different.
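
A minimal sketch of that compare-then-replace idiom, written in Python rather than the shell that autoconf-style makefiles would normally use. The function name and paths are hypothetical; the point is only that an unchanged target keeps its old timestamp.

```python
import filecmp
import os


def replace_if_changed(tmp_path: str, target_path: str) -> bool:
    """Install tmp_path as target_path only if the contents differ.

    Returns True if the target was created or replaced, False if it was
    left alone (and therefore kept its old modification time).
    """
    # shallow=False forces a byte-by-byte comparison instead of trusting
    # the size and mtime reported by stat().
    if os.path.exists(target_path) and filecmp.cmp(
        tmp_path, target_path, shallow=False
    ):
        os.remove(tmp_path)  # identical: discard the freshly generated copy
        return False
    os.replace(tmp_path, target_path)  # new or different: install it
    return True
```

A generator rule would write its output to something like `config.h.tmp` and then call `replace_if_changed("config.h.tmp", "config.h")`, so files depending on `config.h` are rebuilt only when its contents really change.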

Possibly your environment has a higher-than-normal incidence of this,
for some reason, but maybe thinking about ways to address that situation
might be simpler?

I'm not saying that alternative methods of "file changed" detection are
not interesting to me, but it's a big, big problem to address in a
holistic way.

>   So I propose modifying Make to accept a tool to perform additional
>   checks, the first being a hash checker.  Any additional checkers
>   should have the property that while they may return a false positive,
>   they never return a false negative (they never incorrectly say no,
>   nothing important was changed).

I don't agree with this.  You are looking at this in only one direction:
how to avoid builds when timestamps indicate they should happen but
other, specialized results would show that the build is not needed.

But in fact we already know that our current timestamp model is
insufficient in the opposite direction: how to know that a build is
needed, even though a timestamp says it's not.  Any new support should
make it possible to help with that problem as well, which is in a way
much more serious.

>   My off-the-cuff suggestion for the interface of the external tool
>   would be a simple executable, returning 0 if no rebuild is needed, 1 if
>   one is needed, and perhaps another number(s) for error cases.

You haven't specified the INPUT to this tool.  What does "a rebuild is
needed" mean?  Are you suggesting that make would invoke this tool with
targets and prerequisites and ask the tool to decide whether the targets
are out of date?  Or are you suggesting that the tool would take one
file as an argument and determine whether that file has been updated
since the last time make was run?
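
To make the second reading concrete, here is a hedged Python sketch of a tool that takes a single file and reports whether it differs from the last recorded state. Everything in it is an assumption for illustration: the `.sum` sidecar file, the use of `zlib.crc32`, and the choice to update the record as soon as a change is seen. It follows the proposed convention of 0 for "no rebuild needed", 1 for "rebuild needed", and another value for errors.

```python
import os
import zlib


def file_changed(path: str, record_path: str) -> bool:
    """Return True if path's checksum differs from the one in record_path.

    The record is rewritten whenever a change is detected, so the next
    call with unchanged contents returns False.
    """
    with open(path, "rb") as f:
        current = format(zlib.crc32(f.read()), "08x")
    previous = None
    if os.path.exists(record_path):
        with open(record_path) as f:
            previous = f.read().strip()
    if current == previous:
        return False
    with open(record_path, "w") as f:
        f.write(current + "\n")
    return True


def main(argv: list) -> int:
    """Exit status: 0 = unchanged, 1 = changed, 2 = usage error."""
    if len(argv) != 2:
        return 2
    path = argv[1]
    return 1 if file_changed(path, path + ".sum") else 0
```

Note what this sketch leaves unanswered: it says nothing about which targets depend on the file, which is exactly the part of the interface that still needs to be specified.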

>   The only downside I see is the performance cost of starting and
>   terminating the executable, but I'm assuming this will be small in
>   comparison to the file-access operations, and non-existent compared
>   to the cost of unnecessary builds.  I guess the relevant benchmark will
>   be increase in clean build time, which I imagine will be negligible for
>   most real cases.

Another option is to take advantage of the loadable object and/or Guile
support capabilities in newer versions of make.  Or some combination.

>   - One file per target would mean approximately factor 2 increase in
>     the number of build targets.  Not beautiful, but only systems which
>     are already approaching their limits would be affected.  These
>     systems could continue using the default Make (timestamp based)
>     behavior.

Well, it's not clear what you are defining as a "target" here.  Remember
that for your model to work it must keep records for not just the files
people typically think of as targets (.o files, libraries, etc.) but
also every prerequisite: so every .c, .h, etc. file.  That's basically
doubling the number of files in a built version of your source tree.

> 4 What kind of state?
> =====================
> 
>   Based on the performance and reliability of Git, I'm inclined to
>   suggest using SHA1 stored on a one-file-per-target basis.  To start
>   with I think making it a text file is reasonable.  I'm unfamiliar with
>   xxhash, but I'm open to trying anything.  With the right
>   implementation it should be trivial to evaluate a few possibilities.

I recommend against a cryptographically secure algorithm like SHA.
First, it's slow (comparatively speaking).  Second, its output is large,
per file.  And finally, it's just not needed.  Git has excellent reasons
for wanting this, but none of them apply in this situation.  A simple,
well-distributed hashing function will be significantly faster and the
resulting value much smaller, and it will be just as reliable for what
you want, which is just to know if the file is different than it was
before.
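
A small Python sketch of the size difference, using only the standard library. `zlib.crc32` stands in here for whichever fast non-cryptographic function one would actually pick (xxhash, mentioned in the quoted text, is not in the standard library); that substitution and the function names are assumptions of the example, and no speed measurement is implied.

```python
import hashlib
import zlib


def fast_fingerprint(data: bytes) -> int:
    """32-bit non-cryptographic checksum: fits in 4 bytes."""
    return zlib.crc32(data)


def sha1_fingerprint(data: bytes) -> bytes:
    """Cryptographic digest: 20 bytes per file."""
    return hashlib.sha1(data).digest()


old = b"int main(void) { return 0; }\n"
new = b"int main(void) { return 1; }\n"

# For "is the file different than it was before?" the small checksum
# distinguishes these two versions just as the SHA-1 digest does.
changed_fast = fast_fingerprint(old) != fast_fingerprint(new)
changed_sha1 = sha1_fingerprint(old) != sha1_fingerprint(new)
```

The stored state per file is 4 bytes in one case and 20 in the other, multiplied by every target and prerequisite in the tree.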

Finally, Eddy Welbourne's followup has this critical observation:

On Thu, 2015-04-02 at 17:48 +0000, Edward Welbourne wrote:
> The problem with any "is this change material" check, to evade doing
> downstream build steps, is that you have to do the check on every make
> run, once there is a maybe-material change present that it's saving you
> from responding to.  You can use a timestamp check as a cheap pre-test
> to that (file hasn't changed since last time, so can't contain a
> material change) but once it *has* saved you doing some downstream work,
> you are doing some checking that you must repeat each time make runs.

This is an excellent point and needs to be considered.  Suppose that
computing the hash takes 1/10th the time of doing the compile.  That
means that after 10 builds of your system the cumulative time spent
hashing equals the time of the compile you avoided, and on every build
after that the total is actually LARGER than if you'd just bitten the
bullet and rebuilt it the first time.
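
The arithmetic can be sketched as follows, assuming a fixed per-run hash cost and a one-time compile cost in the same (arbitrary) time units. The function names and numbers are illustrative only.

```python
def cumulative_check_cost(runs: int, hash_cost: int) -> int:
    """Total time spent re-hashing after `runs` invocations of make."""
    return runs * hash_cost


def break_even_runs(compile_cost: int, hash_cost: int) -> float:
    """Number of runs at which repeated hashing has cost one compile."""
    return compile_cost / hash_cost


compile_cost = 10  # one compile, in arbitrary time units
hash_cost = 1      # one hash check: 1/10th of the compile

at_ten = cumulative_check_cost(10, hash_cost)     # same as one compile
at_eleven = cumulative_check_cost(11, hash_cost)  # now more than one compile
```

With these numbers the repeated check stops paying for itself after the tenth run, unless something else (such as updating the recorded state) removes the need to keep checking.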
