Re: Using hash instead of timestamps to check for changes.

From: Glen Stark
Subject: Re: Using hash instead of timestamps to check for changes.
Date: Thu, 02 Apr 2015 13:20:21 +0200

Hello Paul

Sorry to take so long to reply.  I wanted to think your input over, and
I've had a pretty heavy load lately.

Signing over the copyright, and any other legal steps won't be a
problem.  My company has no rights to work I do in my own time.  I'm
mainly worried about the technical issues, and finding the time to do
the work.  Until now I've been pretty happy to let Make run in the
background, and haven't put a lot of thought into how it works.
Obviously that will have to change.

I'd like to thank you for your thoughtful response.  I'm gratified that
you took the time to engage in a technical analysis, and start the ball
rolling on the design discussion.  The points you raised merit thought
and discussion.

After reading over your mail a couple of times, I realized that I hadn't
thought things through very well.  In fact, rather than saying "hash
instead of time", I should have said "optional additional hash check
when timestamp has changed".  I think this fixes all the performance
concerns, and opens the door to adding additional checks (like the
is-a-comment-only check), which I think is an exciting idea.

Here are my additional thoughts:

1 Maintaining the state.

  Your point about Make not maintaining any external state beyond what
  the filesystem tracks is well made.  I'm reluctant to add the extra
  complexity of tracking extra state, and it's clear to me that this
  will likely be the source of some "Oh, I hadn't thought of that"
  moments.  But in this case I think the benefit is worth the cost.

2 Adding additional "is-changed" checks.

  You asked "what if people want to define their own 'out-of-date-ness'
  test?".  I found that a really exciting idea.  As I thought about
  this, I realized that what I really want is not to replace Make's
  current behavior, but to add an additional check to the existing
  timestamp check.

  My thinking is that the timestamp is in fact an overly conservative
  test.  We never have the case that the timestamp indicates something
  *has not* been changed when in fact it has (i.e. we always build if
  something has changed), but we do have the problem that builds are
  performed unnecessarily, causing an undue performance penalty -- the
  cost of building the target and its dependants.  Thus we get a big
  build-time win whenever the additional test takes less time than
  building the target and its dependants.

  I think it's very important that Make remain reliable from the point
  of view that if something *should* be built, it *will* be built.
  Unnecessarily rebuilding something is less of a failure than failing
  to rebuild something which should be rebuilt.

  So I propose modifying Make to accept a tool to perform additional
  checks, the first being a hash checker.  Any additional checkers
  should have the property that while they may return a false positive,
  they never return a false negative (they never incorrectly say no,
  nothing important was changed).

  We need only specify the interface of that tool, and people can write
  tools which satisfy their needs -- I'm interested in exploring the
  hash tool first, but might be interested in making further such
  'plugins', and projects with special needs could specify their own.
  Very exciting.

  As I see it, like this, the project becomes a way of simplifying the
  syntax of Yukimasa Sugizaki's suggestion, and officially supporting
  that workflow.

  My off-the-cuff suggestion for the interface of the external tool
  would be a simple executable, returning 0 if no rebuild is needed, 1
  if one is needed, and perhaps other numbers for error cases.  This
  strikes me as having several advantages -- the biggest being the
  flexibility it offers Make users.  For the case where users want to
  apply multiple additional criteria requiring state, this could be
  done in a single file.

  The only downside I see is the performance cost of starting and
  terminating the executable, but I'm assuming this will be small in
  comparison to the file-access operations, and negligible compared to
  the cost of unnecessary builds.  I guess the relevant benchmark will
  be the increase in clean build time, which I imagine will be
  negligible for most real cases.
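
  As a rough sketch of the interface described above, the following
  shows how a caller might consult such a tool only after the timestamp
  test has already said "out of date".  This is an illustration, not
  anything Make implements: the function name `needs_rebuild` and the
  constant names are hypothetical, and treating error codes (or a tool
  that cannot be started) as "rebuild" is an assumption chosen so that
  the check never produces a false negative.

```python
import subprocess
import sys

# Exit codes proposed in the mail: 0 = no rebuild needed, 1 = rebuild.
REBUILD_NOT_NEEDED = 0
REBUILD_NEEDED = 1


def needs_rebuild(timestamp_newer, checker_cmd):
    """Decide whether to rebuild, given the timestamp result.

    timestamp_newer: True if the ordinary timestamp test says the
        target is out of date.
    checker_cmd: argument list for the external checker (hypothetical).
    """
    # Timestamp says up to date: the checker is never started, so a
    # "nothing to do" build pays no extra cost.
    if not timestamp_newer:
        return False

    try:
        result = subprocess.run(checker_cmd)
    except OSError:
        # Assumption: a checker that cannot be started means "rebuild".
        return True

    if result.returncode == REBUILD_NOT_NEEDED:
        return False
    # 1 means rebuild.  Assumption: any other code (an error) is also
    # treated as "rebuild", which errs on the safe side.
    return True


if __name__ == "__main__":
    always_clean = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(needs_rebuild(True, always_clean))
```

  The checker here is just a Python one-liner standing in for a real
  tool; any executable that follows the exit-code convention would do.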

3 One file per target

  - The issues you raised regarding one-file-per-directory are tricky
    and would significantly slow development.  I especially think the
    concurrency issues would be nice to avoid, at least in a first
  - One file per target would mean roughly a factor 2 increase in
    the number of build targets.  Not beautiful, but only systems which
    are already approaching their limits would be affected.  These
    systems could continue using the default Make (timestamp based)
  - This somehow seems more consistent with Make's current behavior to
    me, which in turn seems lower risk.
  - I don't have any better ideas.
  - Projects or teams for which 2n build targets is impractical can
    use the default, timestamp-only behavior.

4 What kind of state?

  Based on the performance and reliability of Git, I'm inclined to
  suggest using SHA-1 stored on a one-file-per-target basis.  To start
  with I think making it a text file is reasonable.  I'm unfamiliar with
  xxhash, but I'm open to trying anything.  With the right
  implementation it should be trivial to evaluate a few possibilities.
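
  To make the state-file idea concrete, here is a minimal sketch of a
  SHA-1 check backed by one small text file per tracked file.  It is
  only an illustration under stated assumptions: the `.sha1` suffix,
  the helper names (`file_sha1`, `state_path`, `is_unchanged`,
  `record_hash`) and the one-line file format are all hypothetical,
  and exactly which file the hash should cover is not settled here.

```python
import hashlib
import os


def file_sha1(path):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def state_path(path):
    # Hypothetical naming scheme: the state file sits next to the file.
    return path + ".sha1"


def is_unchanged(path):
    """True only if a recorded hash exists and matches the content."""
    try:
        with open(state_path(path), "r", encoding="ascii") as f:
            recorded = f.read().strip()
    except OSError:
        # No readable state (e.g. first build): report "changed" so
        # the check can never cause a needed rebuild to be skipped.
        return False
    return recorded == file_sha1(path)


def record_hash(path):
    """Store the current hash; meant to run after a successful build."""
    tmp = state_path(path) + ".tmp"
    with open(tmp, "w", encoding="ascii") as f:
        f.write(file_sha1(path) + "\n")
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp, state_path(path))
```

  Keeping `record_hash` separate from `is_unchanged` is a deliberate
  choice in this sketch: if the hash were updated during the check and
  the build then failed, the next run would wrongly report "unchanged".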

5 Performance implications

  As mentioned earlier, if we change the goal from replacing the
  time-stamp to supplementing the time-stamp, I think a lot of the
  performance implications fall away.  The 'nothing-to-do' build will
  remain unchanged.

  The worst-case scenario, I'm thinking, is a full build where no
  hashes have yet been written.  As long as hash generation and file
  saving are negligible compared to build time, that should be no
  problem.  In the use-cases I deal with on a daily basis (building big
  ugly C++ files), this will be easily satisfied.  If you can think of
  some good test-cases where this might not be satisfied, let me know,
  and I'll run some benchmarks.  Again though, if we keep the timestamp
  as default, projects can decide based on their circumstances if the
  tradeoff is worthwhile.

  Hashing per block sounds like a good idea as a later optimization, if
  we, or someone else, determine it would be valuable.  To start with I
  would keep it simple.

6 Next steps

  My tentative suggestion, depending on your next feedback, is to do
  something like the following:

  - Determine a syntax for makefiles to specify which additional checks
    (and perhaps in what order) should be performed.  I think this
    should be easy to use for one additional test, but open to adding
    additional tests later.  It should be easy for Makefile generators
    (like autotools and cmake) to take advantage of.  I could see using
    an environment variable, but I could also imagine being able to
    steer the behavior on a Makefile-by-Makefile, or target-by-target
    basis.  I ask for input from the experts here.
  - Hash out in rough strokes how the call would be made -- my ad-hoc
    approach would be a separate executable with an integer return
    value indicating needs-rebuild, doesn't-need, or error, but again I
    ask for input from the experts.

  If that sounds reasonable, I should probably start poking around the
  Make codebase so I can get started at some point.

  Again, many thanks for your time,

  Glen Stark

On Fri, 2015-03-27 at 11:48 -0400, Paul Smith wrote:
> On Fri, 2015-03-27 at 11:45 -0400, Paul Smith wrote:
> >       * Do we really need to hash the file?  Maybe simply expanding the
> >         current checking is sufficient.  For example, if in addition to
> >         mod time we also considered the size of the file (and maybe
> >         other things maintained by the filesystem like inode, for tools
> >         which don't just overwrite the same file) we could increase our
> >         accuracy WITHOUT resorting to a separate state file.  Is that
> >         good enough?
> Actually I typed faster than my brain: we still need a state file of
> course to compare sizes.  But at least it's still based on filesystem
> metadata and doesn't require make to hash the contents of every file in
> the build.
