From: Tom Lord
Subject: Re: [Gnu-arch-users] RFC: arch protocol, smart server, and tla implementation prototypes
Date: Sat, 31 Jan 2004 15:54:59 -0800 (PST)


    > From: Aaron Bentley <address@hidden>

    > Tom Lord wrote:
    >> Excellent.  So we have been using phrases like "streaming the archive
    >> protocol" in different ways.  And what you suggest there is roughly
    >> what I propose as well.  Slight difference: I think that pfs_get_files
    >> should be asynchronous and let you retrieve the files one at a time

    > How would that work?  You'd say give me these files, then return when 
    > the first one's ready?

We're talking about the pfs.h protocol.

More like you'd say "give me these N files" then you'd make N calls
for "next file".   It probably makes sense for the "give me these N"
call to return a request ID which the "next file" request takes as an
argument, and allow two such request/reply sequences to be interleaved
(and to be interleaved with other pfs.h requests).
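
To make the shape of that concrete, here's a minimal sketch of such an
interface.  The names (pfs_request_files, pfs_next_file) and the
in-memory buffering are invented for illustration -- this is not the
actual pfs.h API, and a real implementation would be pulling data off
the wire behind the "next file" call:

        #include <stdio.h>
        #include <string.h>

        #define MAX_REQS 4
        #define MAX_FILES 8

        struct pfs_request
        {
          const char * names[MAX_FILES];
          int n_files;
          int next;                     /* index of next file to hand back */
        };

        static struct pfs_request requests[MAX_REQS];
        static int n_requests = 0;

        /* "Give me these N files"; returns a request id.  */
        static int
        pfs_request_files (const char ** names, int n)
        {
          int id = n_requests++;
          requests[id].n_files = n;
          requests[id].next = 0;
          memcpy (requests[id].names, names, n * sizeof (*names));
          return id;
        }

        /* "Next file" for a pending request (here just its name; a real
           implementation would be reading the file from the server).  */
        static const char *
        pfs_next_file (int request_id)
        {
          struct pfs_request * req = &requests[request_id];
          if (req->next >= req->n_files)
            return 0;
          return req->names[req->next++];
        }

        int
        main (void)
        {
          const char * delta_patches[] = { "patch-1", "patch-2", "patch-3" };
          const char * other_files[] = { "README", "ChangeLog" };
          int ra = pfs_request_files (delta_patches, 3);
          int rb = pfs_request_files (other_files, 2);
          const char * f;

          /* The two request/reply sequences can be interleaved.  */
          printf ("%s\n", pfs_next_file (ra));   /* patch-1 */
          printf ("%s\n", pfs_next_file (rb));   /* README  */
          while ((f = pfs_next_file (ra)))
            printf ("%s\n", f);                  /* patch-2, patch-3 */
          printf ("%s\n", pfs_next_file (rb));   /* ChangeLog */
          return 0;
        }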

In archive-pfs.c, implementing (say) arch_archive_delta, I'd expect
that to eventually be used as:

        request_the_delta ();
        patch = get_next_file ();
        while (not_done ())
          {
            patch = compose (patch, get_next_file ());
          }

If I'm right that compose can be fast enough, get_next_file can
actually read from the server.   If I'm wrong -- request_the_delta can
just read the lot up front and get_next_file can just give the
buffered files back.


    > You'd get background processing, and the behavior would get less
    > deterministic.  I thought this was specifically what you didn't
    > want!  I must be confused.

The place where I don't want async or non-deterministic behavior is
above the archive.h layer.  I don't want it in build_revision, for
example.   I couldn't care less about what happens in pfs-*.c and only
care a little bit about pfs.c (in this regard).


    >> so that client-side space requirements don't have to be the sum of the
    >> file sizes.

    > According to the thread, the aggregate storage requirement will
    > be roughly twice that of the equivalent DELTA.  I'd be willing
    > to make the sacrifice of storing that data temporarily to retain
    > the appearance of a single-threaded program.

I think that it varies wildly depending on the project.  For example,
I can believe that the compose of 100 linux kernel patches, mostly
working on different parts of the kernel, isn't much smaller than the
largest of those patches.   On the other hand, I'd expect the compose
of 100 tla patches to be much smaller than the sum of the 100.

Also, one problem that is really worth solving in the medium term, I
think, is the "huge number of clients requesting `update' problem".
For example, the linux kernel is in arch, a gazillion people track the
arch sources pretty passively, the 2.9 release announcement comes out,
and a gazillion people run `update'.  If 70% of those gazillion people
are asking for the same delta, and that delta is even only 50%
smaller than the sum of the patches constituting it, that's a pretty
big savings in server bandwidth if the server can compose-and-cache
that delta.


    > If you're really serious about this, I guess you could use callbacks to 
    > compose and delete the changesets as they come in, but it looks hard to 
    > me.  What if the first changeset to complete retrieval is the  one in 
    > the middle?

Such non-determinism as you describe would be contained in the
pfs-{dav,sftp,...} layer.   I doubt there's any reason to really go
that far but if one did, it would be contained in that layer.

In the archive-pfs layer, you're still reading files serially,
probably with no intervening operations --- you just happen to be
composing the retrieved changesets between reads of individual files.


    > I believe all supported protocols can be treated as streamy.  e.g. 
    > pfs_get_files can "stream" on the native filesystem using
    > fork().  

For which there is unlikely to be any substantial benefit.   I don't
think tossing in forks client-side is a good idea at this time.
_Maybe_ later.

    > Oh, so compose_changeset is performed on a pair of changesets, not a 
    > sequence?  

Right.

    > I imagined an operation that would merge a series of 
    > changesets into a single one.  For that, we'd have to wait for all 
    > downloads to complete.

I don't think there's any performance advantage to that assuming that
compose can be fast enough to more or less keep up with incoming
changesets.   I suspect it can.

    >>   if the proposed archive-cachedelta
    >>   command were added, then archive-pfs.c could take advantage of that
    >>   in `delta' in order to help with bandwidth issues.

    > I don't understand this.

Yes you do, just not the way I said it.   archive-pfs has to implement
DELTA.  In a generic (lower-level-transport-independent)
implementation, if cachedelta is present, it can take cachedelta into
account when generating the list of patches to request (the ones it
will compose).
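
Purely as an illustration of "take cachedelta into account when
generating the list of patches" -- the helper names here are invented,
not part of archive.h or the cachedelta proposal:

        #include <stdio.h>

        /* Hypothetical predicate, made up for this sketch: does the
           archive have a pre-composed (cached) delta from revision
           FROM to revision MID?  Pretend deltas up to revision 7
           have been cached.  */
        static int
        archive_has_cached_delta (int from, int mid)
        {
          return from == 3 && mid <= 7;
        }

        /* Plan the requests for DELTA (FROM, TO): use the largest
           cached delta available, then individual patches (the ones
           the client will compose) for the remainder.  */
        static void
        plan_delta_requests (int from, int to)
        {
          int mid = from;
          int i;

          for (i = from + 1; i <= to; ++i)
            if (archive_has_cached_delta (from, i))
              mid = i;

          if (mid > from)
            printf ("request cached delta %d -> %d\n", from, mid);

          for (i = mid + 1; i <= to; ++i)
            printf ("request patch-%d\n", i);
        }

        int
        main (void)
        {
          plan_delta_requests (3, 10);
          return 0;
        }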


    > In theory, tla could use an internal archd all the time via a
    > pipe.  

That'd be one way to structure the code, yes.


    > That would make the archd easier to develop 

Not sure about that.   Not sure it's wrong either.

    > >About backwards building:

    > >Even within a single archive there's choices such as
    > >when to use an archive cached rev vs. when to build-by-patching.

    > >For example, untar'ing a cacherev from a local-fs archive is generally
    > >faster than copying one from a library.   An untar plus a couple of
    > >changeset applications may be preferable to copying a revision already
    > >present in a library.

    > Yes.  It gets to a place where good estimates depend on
    > 1. local hardware
    > 2. throughput to archive
    > 3. availability of cacherevs
    > 4. presence of local revisions

    > We could tweak that for years.  (Have I missed any?)

Yup.  And (undoubtedly, of course :-) yes, you've probably missed some.

(I do think that a few simple tweaks that might take a little
experimentation to find will likely prove to be "good enough".  Simple
heuristics are probably sufficient -- but which ones?  By saying that
the heuristics will likely need tuning I only meant that when the
first version has some weird edge cases that cause someone to gripe,
don't act too surprised.)
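
(To illustrate just how simple "simple heuristics" can be, here's a toy
cost model along the lines of the factors listed above.  Every name and
number in it is made up; it only shows the shape of the decision, not a
proposed implementation.)

        #include <stdio.h>

        /* Toy cost model: choose between copying a revision already in
           a revision library vs. untarring an archive cacherev and
           applying a few changesets.  The constants stand in for the
           hardware/throughput measurements mentioned above.  */
        struct build_costs
        {
          double copy_from_library;     /* copy a library revision         */
          double untar_cacherev;        /* fetch + untar a cacherev        */
          double apply_one_changeset;   /* fetch + apply one changeset     */
        };

        static const char *
        choose_build_plan (const struct build_costs * c, int patches_needed)
        {
          double by_patching = c->untar_cacherev
            + patches_needed * c->apply_one_changeset;

          return (by_patching < c->copy_from_library
                  ? "untar cacherev + apply changesets"
                  : "copy from revision library");
        }

        int
        main (void)
        {
          struct build_costs local_fs = { 9.0, 2.0, 0.5 };  /* made-up */

          printf ("%s\n", choose_build_plan (&local_fs, 3));   /* patching */
          printf ("%s\n", choose_build_plan (&local_fs, 20));  /* copying  */
          return 0;
        }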


    > >We _don't_ in an empirical sense know that composing changesets will
    > >be significantly cheaper however I think it's a safe bet.  Changeset
    > >size is (roughly, and only in the expected case) a function of the
    > >amount of human work that goes into one commit -- relatively constant
    > >across projects.   Tree-size is arbitrary.    Faced with some
    > >O(tree-size) algorithms that can be avoided by changeset composition
    > >there's two strategies: try to eliminate the O(tree-size) steps from
    > >the algorithms or use changeset composition.    You and others have
    > >done a lot of work so far on eliminating O(tree-size) steps (and
    > >that's very good) --- the other path wants exploring, too.

    > >It will be a little tricky.

It's starting to look like compose_changesets should be the next "big
hard thing" I work on for tla.

(That and/or partial commits.)

    > >    > - may, depending on implementation, require users to install
    > >    > interdiff

    > >It will add new code for diff composition to tla but I don't see any
    > >good reason to add a new dependency.

    > So do you propose simple concatenation of diffs?  Or do we absorb 
    > interdiff into arch?  We could write our own, but I don't see why.

Grab code from patchutils or interdiff or write it from scratch.
That stuff is detail oriented but otherwise trivial.

It really shouldn't be much harder than a simple implementation of
join(1) but with a slightly different set of parameters.  (Stream two
files with feedback from each about which one to read a line from
next, compare lines in some way, keep track of line numbers....).
(Solving it for the "inexact case" where the diffs aren't guaranteed
to compose --- that's much harder just because of the need to
introduce all kinds of heuristics.   Larry Wall [author of patch]
should write that part.   But it isn't needed for the arch hacks we're
talking about.)
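
Just to show the skeleton I mean -- this is the join(1)-ish control
structure (two streams, feedback about which one to read from next,
running line counters), merging two sorted files of integers.  It is
not diff composition; the hunk and offset bookkeeping would hang off
this loop:

        #include <stdio.h>

        int
        main (int argc, char ** argv)
        {
          FILE * a;
          FILE * b;
          int va, vb;
          int have_a, have_b;
          long line_a = 0, line_b = 0;

          if (argc != 3)
            {
              fprintf (stderr, "usage: %s FILE-A FILE-B\n", argv[0]);
              return 1;
            }

          a = fopen (argv[1], "r");
          b = fopen (argv[2], "r");
          if (!a || !b)
            {
              fprintf (stderr, "could not open inputs\n");
              return 1;
            }

          have_a = (fscanf (a, "%d", &va) == 1);
          have_b = (fscanf (b, "%d", &vb) == 1);

          while (have_a || have_b)
            {
              /* The "feedback" step: compare the heads of the two
                 streams and decide which one to consume next.  */
              if (have_a && (!have_b || va <= vb))
                {
                  printf ("a:%ld\t%d\n", ++line_a, va);
                  have_a = (fscanf (a, "%d", &va) == 1);
                }
              else
                {
                  printf ("b:%ld\t%d\n", ++line_b, vb);
                  have_b = (fscanf (b, "%d", &vb) == 1);
                }
            }

          fclose (a);
          fclose (b);
          return 0;
        }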

(There's also the tree-delta stuff for which we can't borrow code.)


    > I know not this "changeset-utils" of which you speak.  Would this 
    > include standalone tools for creating and applying changesets?

You know -- I did have to make a decision at one point about whether
to factor out mkpatch/dopatch and distribute it separately.    That
separate distribution would also want to include inventory.
Architecturally it makes perfect sense.

-t




