[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Gnu-arch-users] [BUG] FEATURE PLANS: "perfect" summary deltas
From: |
Tom Lord |
Subject: |
Re: [Gnu-arch-users] [BUG] FEATURE PLANS: "perfect" summary deltas |
Date: |
Fri, 9 Jul 2004 09:40:59 -0700 (PDT) |
> From: Matthew Dempsky <address@hidden>
> Tom Lord <address@hidden> writes:
> > [ summary delta description ]
> Out of curiousity, I decided to test out how well this idea works with
> a copy of Tom's tla--devo--1.2 tree.
Neat. Thanks.
> The test was conducted by adding cacherevs to every revision in
> tla--devo--1.2 along with conducting a delta between each basis-spine
> and spine-leaf pair and then tar'ing the result.
> The cacherevs ranged in size from about 1.4 to 1.6 MB, however, none
> of the summary patches exceeded 200 kB (well under 1/4th of even the
> smallest cacherevs). In fact, only 28 of the 115 patches weighed in
> over 100 kB. Therefor, the entire version fits in a single phase,
> giving a very nice alternative to cacherevs if even a single revision
> is available.
Good news.
> As to the builder algorithm, why not simply use Dijkstra's shortest
> path algorithm using patch size as edge weight and an edge anywhere a
> patch or patch summary exists? (Also a starting node representing an
> empty directory, so zero-weight edges between it and revlib'd or
> pristine revisions and cacherev-weight edges between it and
> cacherevs'd revisions.) It would be general enough to still meet
> Aaron's hypothetical smart server, while simple enough to fulfill
> Tom's no-more-complicated-than-necessary goal.
> This actually seems pretty similar to the algorithms described, but it
> seems to differ that a) I actually called it Dijkstra's shortest path,
> and b) I'm not sure if the other described algorithms would allow
> mixing regular patches and summary patches.
Well....
There are three main points to the "perfect" summary delta proposal:
1. We always construct graphs having the same topology. We don't
need Dijkstra's algorithm in its general form: we can specialize it
for this topology.
Why does this matter? Because in this graph, given just a node,
_if_we_don't_know_the_graph_toplogy_, requires querying the
archive. We want to minimize such roundrtrips and we want to
minimize the bandwidth consumed by each edge-querying roundtrip.
If we _do_ know the graph topology (and it has some handy
properties) then we can minimize the edge queries (to about 4, in
this case). You don't need a fully general shortest path
algorithm.
aaron's alternative idea (the "separate delta directory") also aims
to minimize roundtrips: just get a listing of that one directory
and now you know a big chunk of the graph. First: I think it would
interact poorly with smart servers because a smart server may be
willing to offer up a complete graph of deltas (so, while we only
have one roundtrip, the bandwidth can start to look pretty
interesting in large versions). Second: I think it would interact
poorly with smart servers because it requires smart servers to
eagerly describe what deltas are available rather than seeing if,
upon demand for a specific delta, it's handy to provide it.
2. You're right that delta and cacherev sizes give you weights for
the edges of graphs.
Alas, our least-common-denominator dumb-file-system server model
doesn't have any way to determine the size of a delta or cacherev
other than by fetching it and measuring it.
On the other hand, at the time that you _store_ a delta or cacherev
in an archive, you can measure its size _without_ fetching it
(since you've got it right there).
So, absent modifying the archive format to record sizes (see
below), why not split the shortest path algorithm into two
asynchronous parts: writing clients evaluate edge weights and use
that to decide which variation of a fixed-topology graph to build;
reading clients assume that all edges on the graph have weight 1
because they trust that the writing clients would not have
created those edges if that were not a reasonable assumption.
Putting 1 and 2 together (and ignoring some of the later parts of the
thread between abentley and i where we talked about variations on the
idea): yes, Dijkstra's algorithm is there but more as a "ghost" than a
literal rendition; it's a more specialized algorithm that gives the
same answer Dijkstra's would but for a limited class of graphs. Part
of the general algorithm is being computed at write time; part of it
at read time.
3. Do we or do we not muck with the archive format?
Actually, I shouldn't say "archive format".
Do we or do we not muck with the archive _abstraction_ because,
with a few exceptions (like checksums and signatures) when we
change the archive format, we're implying a change to the archive
abstraction.
Without some fancy footwork, some of Abentley's ideas would change
the archive abstraction in some deep ways that impact things such
as what smart servers can do. I see a negative impact in the
particulars.
The "perfect" summary delta doesn't change the archive abstraction
at all. If tla hadn't dropped support for "commit --base"
revisions, you'd be able to trick the builder into using summary
deltas without any changes at all to tla and the only changes we'd
need to make would be to make the interface to summaries more
convenient.
Of course, the discussion of the elegence (imo) of "perfect" summary
deltas is hampered by three things:
A. I dropped "commit --base" (aka "commit --tag") in tla and
therefore, the current builder knows nothing about it.
So it _looks_ like we're adding something that requires new
work on builder algorithms but, at core, it really only requires
finishing the port from shell-script larch to C tla in this
one, otherwise obscure area.
B. The discussion is concurrent with Abentley's work on the
backwards-builder and so there's plenty of opportunity
for discussion to get hung up on silly details about the
impact on that work, making the "perfect" summary delta
idea sound weirder than it really is.
C. Your experiment is great work and very comforting.
Something that is absolutely not a priority but that might help
shed some light at some point is an empirically supported
characterization of "typical" changerates and change-natures and
their determining variables, combined with analysis about how
effective "perfect" summaries (or any alternative) is predicted to
be. (For now, this not being rocket science, and especially given
checks such as yours, I trust my intuition filtered through
feedback from others thinking about the same topic.)
-t
Re: [Gnu-arch-users] [BUG] FEATURE PLANS: "perfect" summary deltas, Aaron Bentley, 2004/07/10