Re: Using hash instead of timestamps to check for changes.
From: Glen Stark
Subject: Re: Using hash instead of timestamps to check for changes.
Date: Thu, 02 Apr 2015 13:20:21 +0200
Hello Paul
Sorry to take so long to reply. I wanted to think your input over, and
I've had a pretty heavy load lately.
Signing over the copyright, and any other legal steps won't be a
problem. My company has no rights to work I do in my own time. I'm
mainly worried about the technical issues, and finding the time to do
the work. Until now I've been pretty happy to let Make run in the
background, and haven't put a lot of thought into how it works.
Obviously that will have to change.
I'd like to thank you for your thoughtful response. I'm gratified that
you took the time to engage in a technical analysis, and start the ball
rolling on the design discussion. The points you raised merit thought
and discussion.
After reading over your mail a couple of times, I realized that I hadn't
thought things through very well. In fact, rather than saying "hash
instead of time", I should have said "optional additional hash check
when timestamp has changed". I think this fixes all the performance
concerns, and opens the door to adding additional checks (like the
is-a-comment-only check), which I think is an exciting idea.
Here are my additional thoughts:
1 Maintaining the state.
========================
Your point about Make not maintaining any external state beyond what
the filesystem tracks is well made. I'm reluctant to add the extra
complexity of tracking extra state, and it's clear to me that this
will likely be the source of some "Oh, I hadn't thought of that"
moments. But in this case I think the benefit is worth the cost.
2 Adding additional "is-changed" checks.
========================================
You asked "what if people want to define their own "out-of-date-ness"
test?". I found that a really exciting idea. As I thought about
this, I realized that what I really want is not to replace Make's
current behavior, but to add an additional check to the existing
timestamp check.
My thinking is that the timestamp is in fact an overly conservative
test. We never have the case that the timestamp indicates something
*has not* been changed when in fact it has (i.e. we always build if
something has changed), but we do have an issue that building is
unnecessarily performed, causing an undue performance penalty -- the
cost of building the target and its dependants. Thus we get a big
build-time win whenever the additional test takes less time than
building the target and its dependants.
I think it's very important that Make remain reliable from the point
of view that if something *should* be built, it *will* be built.
Unnecessarily rebuilding something is less of a fail than failing to
rebuild something which should be.
So I propose modifying Make to accept a tool to perform additional
checks, the first being a hash checker. Any additional checkers
should have the property that while they may return a false positive,
they never return a false negative (they never incorrectly say no,
nothing important was changed).
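
The layering described above can be sketched as follows. This is only
an illustration, not how Make is implemented: the function names and
the checker signature are hypothetical, and the timestamp test is a
simplified stand-in for Make's real one. Each checker returns True for
"may have changed" and is trusted only when it says False, so a build
is skipped as soon as one checker says nothing relevant changed.

```python
import os
from typing import Callable, Iterable

# Hypothetical checker signature: (target, prerequisite) -> bool.
# True means "may have changed" (false positives are acceptable);
# False must only be returned when nothing relevant changed.
Checker = Callable[[str, str], bool]


def timestamp_says_rebuild(target: str, prereq: str) -> bool:
    """Simplified stand-in for Make's test: target missing or older."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(prereq) > os.path.getmtime(target)


def needs_rebuild(target: str, prereq: str, checkers: Iterable[Checker]) -> bool:
    """Consult the extra checkers only when the timestamp test says rebuild."""
    if not timestamp_says_rebuild(target, prereq):
        return False  # the nothing-to-do path never runs a checker
    # Rebuild unless some checker vouches that nothing relevant changed.
    # With no checkers configured this is the plain timestamp behavior.
    return all(check(target, prereq) for check in checkers)
```

With an empty checker list the result is exactly the timestamp
decision, which is what keeps the default behavior unchanged.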
We need only specify the interface of that tool, and people can write
tools which satisfy their needs -- I'm interested in exploring the
hash tool first, but might be interested in making further such
'plugins', and projects with special needs could specify their own.
Very exciting.
As I see it, like this, the project becomes a way of simplifying the
syntax of Yukimasa Sugizaki's suggestion, and officially supporting
that workflow.
My off-the-cuff suggestion for the interface of the external tool
would be a simple executable, returning 0 if no rebuild is needed, 1 if
one is needed, and perhaps other values for error cases. This
strikes me as having several advantages -- the biggest being the
flexibility it offers Make users. For the case where users want to
apply multiple additional criteria requiring state, this could be
done in a single file.
The only downside I see is the performance cost of starting and
terminating the executable, but I'm assuming this will be small in
comparison to the file-access operations, and non-existent compared
to the cost of unnecessary builds. I guess the relevant benchmark will
be the increase in clean build time, which I imagine will be negligible
for most real cases.
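
To make the proposed exit-status convention concrete, here is a rough
sketch of the calling side. Nothing here exists in Make today: the
function name is hypothetical, and treating an error status as "rebuild"
is an assumption chosen to stay on the conservative side.

```python
import subprocess
import sys
from typing import Sequence


def run_checker(command: Sequence[str]) -> bool:
    """Run an external checker and return True if a rebuild is needed.

    Assumed convention, taken from the proposal above:
    exit status 0 = no rebuild needed, 1 = rebuild needed,
    any other status = the checker itself failed.
    """
    status = subprocess.run(command).returncode
    if status == 0:
        return False
    if status == 1:
        return True
    # Assumption: on checker failure, rebuild rather than risk a stale target.
    print(f"checker {command[0]!r} exited with status {status}; rebuilding",
          file=sys.stderr)
    return True
```

The per-call cost is one process start, which is the overhead
discussed above.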
3 One file per target
=====================
- The issues you raised regarding one-file-per-directory are tricky
and would significantly slow development. I especially think the
concurrency issues would be nice to avoid, at least in a first
iteration.
- One file per target would mean approximately factor 2 increase in
the number of build targets. Not beautiful, but only systems which
are already approaching their limits would be affected. These
systems could continue using the default Make (timestamp based)
behavior.
- This somehow seems more consistent with Make's current behavior to
me, which in turn seems lower risk.
- I don't have any better ideas.
- Projects or teams for which 2n build targets is impractical can
use the default, timestamp-only behavior.
4 What kind of state?
=====================
Based on the performance and reliability of Git, I'm inclined to
suggest using SHA1 stored on a one-file-per-target basis. To start
with I think making it a text file is reasonable. I'm unfamiliar with
xxhash, but I'm open to trying anything. With the right
implementation it should be trivial to evaluate a few possibilities.
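
As a rough sketch of what such a text state file could look like, the
code below keeps one file per target, holding one "digest  path" line
per prerequisite. The ".sha1" suffix, the line format, and all function
names are assumptions for illustration only. The hashes are recorded
only after a successful build, so a failed build cannot leave behind
state that later hides a needed rebuild.

```python
import hashlib
from typing import Dict, Iterable

STATE_SUFFIX = ".sha1"  # hypothetical name for the per-target state file


def file_sha1(path: str) -> str:
    """Return the hex SHA-1 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_state(target: str) -> Dict[str, str]:
    """Load the recorded prerequisite hashes; empty if none exist yet."""
    recorded: Dict[str, str] = {}
    try:
        with open(target + STATE_SUFFIX, "r", encoding="utf-8") as f:
            for line in f:
                digest, _, path = line.rstrip("\n").partition("  ")
                recorded[path] = digest
    except FileNotFoundError:
        pass
    return recorded


def prereqs_changed(target: str, prereqs: Iterable[str]) -> bool:
    """True if any prerequisite's content differs from its recorded hash.

    A missing record counts as changed, so the check errs toward rebuilding.
    """
    recorded = read_state(target)
    return any(recorded.get(p) != file_sha1(p) for p in prereqs)


def record_state(target: str, prereqs: Iterable[str]) -> None:
    """Write the current hashes; call only after the target built successfully."""
    with open(target + STATE_SUFFIX, "w", encoding="utf-8") as f:
        for p in prereqs:
            f.write(f"{file_sha1(p)}  {p}\n")
```

Swapping in another hash for evaluation would only mean changing
`file_sha1`.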
5 Performance implications
==========================
As mentioned earlier, if we change the goal from replacing the
time-stamp to supplementing the time-stamp, I think a lot of the
performance implications fall away. The 'nothing-to-do' build will
remain unchanged.
The worst-case scenario, I'm thinking, is a full build where no hashes
have yet been written. As long as hash-generation and file-saving is
negligible compared to build-time, that should be no problem. In the
use-cases I deal with on a daily basis (building big ugly C++ files),
this will be easily satisfied. If you can think of some good
test-cases where this might not be satisfied, let me know, and I'll
run some benchmarks. Again though, if we keep the timestamp as
default, projects can decide based on their circumstances if the
tradeoff is worthwhile.
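
A first rough number for the hashing cost could come from something
like the sketch below. It is a hypothetical micro-benchmark, not a
measurement from this thread: it times only in-memory hashing, ignores
file I/O, and the 16 MiB payload of zeros is an arbitrary stand-in for
real sources.

```python
import hashlib
import time


def time_hash(data: bytes, algorithm: str = "sha1", repeats: int = 5) -> float:
    """Return the best-of-N wall-clock time in seconds to hash `data`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(algorithm, data).digest()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    payload = bytes(16 * 1024 * 1024)  # arbitrary 16 MiB stand-in payload
    seconds = time_hash(payload)
    print(f"sha1: {len(payload) / seconds / 1e6:.0f} MB/s")
```

Comparing that throughput with the compile time of the same sources
would show whether the "negligible" assumption holds for a given
project.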
Per block sounds like a good idea as a later optimization, if we or
someone else determine it would be valuable. To start with I would
keep it simple.
6 Next steps
============
My tentative suggestion, depending on your next feedback, is to do
something like the following:
- Determine a syntax for makefiles to specify which additional checks
(and perhaps in what order) should be performed. I think this should
be easy to use for one additional test, but open to adding
additional tests later. It should be easy for Makefile generators
(like autotools and cmake) to take advantage of. I could see using
an environment variable, but I could also imagine being able to
steer the behavior on a Makefile to Makefile, or target to target
basis. I ask for input from the experts here.
- Hash out in rough strokes how the call would be made -- my ad-hoc
approach would be a separate executable with integer return value
indicating needs-rebuild, doesn't-need, or error, but again I ask
for input from the experts.
If that sounds reasonable, I should probably start poking around the
Make codebase so I can get started at some point.
Again, many thanks for your time,
Glen Stark
On Fri, 2015-03-27 at 11:48 -0400, Paul Smith wrote:
> On Fri, 2015-03-27 at 11:45 -0400, Paul Smith wrote:
> > * Do we really need to hash the file? Maybe simply expanding the
> > current checking is sufficient. For example, if in addition to
> > mod time we also considered the size of the file (and maybe
> > other things maintained by the filesystem like inode, for tools
> > which don't just overwrite the same file) we could increase our
> > accuracy WITHOUT resorting to a separate state file. Is that
> > good enough?
>
> Actually I typed faster than my brain: we still need a state file of
> course to compare sizes. But at least it's still based on filesystem
> metadata and doesn't require make to hash the contents of every file in
> the build.
>