
Re: [Gnu-arch-users] arch lkml


From: Eric W. Biederman
Subject: Re: [Gnu-arch-users] arch lkml
Date: 12 Dec 2003 09:04:54 -0700
User-agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/21.2

Tom Lord <address@hidden> writes:

>     Eric:
> 
> I'll reply with perhaps more than you wanted to know :-)

Actually I really enjoyed it.  It took a little while to digest
but it is good information.  

Let me give some background about where I am coming from before I get
into replies.  Open source projects, when writing archivers, seem to
reach the local optimum of what can be done with their archive format
and then stop making progress.  So it is my intention to review source
code control systems until I find one, or write one, whose archive
format does not look like it will be a bottleneck in the future.  If
your archive format is not a bottleneck, then the rest will come with
time, as people scratch itches.

> A big theme in the preceding is "dirt simple".  An example of what
> that implies can be seen in recent discussions about "archive
> signing".  There's a minor crisis going on the free software world
> because project hosts keep getting cracked and people want to be sure
> that their source archives are not corrupted -- hence an interest in
> signing.  Contrast recent discussions on the dev@ list for svn with
> those on arch.  Here in arch-land we're ~30% done implementing signing
> after just a few days -- I could (and not officially, but ...) commit
> to having it up and running by the end of the month.  Now read the
> dev@ for svn (and think through the performance and complexity
> implications of their most rapid-route suggestion of a hook-based
> answer).   Meanwhile, we're not worrying about "how to possibly do it"
> over here -- but "how to do it in any of the several possible ways
> while not screwing up the security concerns and while keeping a
> long-term view of the architecture."
> 

And this I see as a very hopeful sign.  The more I understand about
arch, the better I feel about it.  At the same time, there are things
that feel like they could be made simpler.  A similar issue to signing
is long-term reliability, which, last I looked, had not yet been
strongly addressed in arch.

As one of my coworkers likes to say, given the complexity of computers
it is not a surprise that they fail; it is a surprise that they work at
all.

>     >>> The one very obvious potential issue I see with arch as it
>     >>> currently stands is that it does not use one of the more
>     >>> sophisticated storage formats for storing deltas.  Which means
>     >>> as your archive size increases the work can increase.  I think
>     >>> with a different backend format cacherev would not be
>     >>> necessary.  But I may be wrong.
> 
> I don't think that you are "wrong", per se -- just that you're worried
> about solving imaginary/outdated problems.
> 
> The classic (so-called:) "sophisticated" formats (e.g., SCCS) come to
> us from a distant time (by industry timescales).  They were designed
> to solve a problem that arose in:
> 
>   (a) a non-networked environment
>   (b) an environment in which disk-space was precious
> 
> and so consequently they:
> 
>   (1) [re: (a)] assume computation close to the repository for 
>       every operation
> 
>   (2) [re: (b)] are aimed at balls-to-the-wall optimized compression
>       of revisions rather than just a comparable "big-O" (loosely
>       speaking) optimization

This sounds right.  Given that a lot of those formats never even made
it into common use in the open source world, I need to play with the
sophisticated techniques instead of just reading about them, to see
whether that is really the case or whether something very useful has
been overlooked.
 
> When those formats (or their more modern design-thinking copy-cats
> like svn) are used in a more modern environment, that means:
> 
>   (i) [re: (1)] they need to introduce smart servers and thus create
>       both a performance bottleneck and a single point of failure --
>       obstacles that can be overcome only with ridiculous amounts of
>       complexity (which of course creates new problems)
> 
>   (ii) [re: (2)] they create systems which: flop over dead upon a 
>        media failure;  add absurd complexity (hence problems) to try
>        to optimize "common case" access patterns;  aren't 
>        easily extensible, and so forth.

Right.  The one thing that I like is that with a good distributed
system, while backups can still be important, you get a lot of that
kind of effort for free, as everyone you are working with mirrors your
changes into their archives.

> Here is a better way to understand the arch format:
> 
> * separation of archive host and archive client
> 
>   We recognize that the general case is that these two hosts are
>   different and that when performance becomes really interesting, 
>   it's going to be a matter of clients vastly outnumbering hosts.
> 
>   Therefore, we want to push as much work as possible to the client
>   side.

Agreed.

> * minimization of network bandwidth
> 
>   We recognize that the important case is of a developer who
>   is active over a prolonged period of time in some project whose
>   mainline archive is hosted remotely.   We also recognize that the
>   internet is basically slow.
> 
>   Therefore, we want to minimize network bandwidth primarily, and 
>   latency secondarily, between our developer and his project.

Sounds good.

>   It is damn hard to beat compressed changesets for that purpose
>   other than by increments.

Maybe.  There are certain types of historical research that look like
they would have issues, and a historical perspective on what has
happened with the code can at times be very important.
 
>   To be sure, arch's push-mirror could do better on latency issues --
>   it should eventually work more like rsync and _maybe_ even ask for a
>   little help from server-side computation.
> 
>   But in general, for our important-case developer, arch makes 
>   network performance about as close to irrelevant as it can be.

Arch likes to grab entire archives by default instead of just a
relevant slice of them, so there is a penalty there.

> * serendipitous archive-as-payloads
> 
>   So we're going to have long-lived client sites that pull
>   updates or push updates to or from some archive host.
> 
>   In fact, we'll have _many_ such clients for our most interesting
>   hosts.
> 
>   One idea is that we should, archive-side, just cache the payloads 
>   that we'll send to these clients -- so that when a client needs some
>   "thing" we just find it on disk and stream it out.
> 
>   It turns out, then, that if instead of "caching" we simply "memoize" 
>   those payloads ("caching" can discard stuff -- "memoizing" holds on
>   to it forever) that the memo we create contains, in a _reasonably_
>   compact form, all of the information we would have in an archive.
> 
>   So: the memo _is_ the archive.  Disregarding archive-cached
>   revisions for the moment, an arch archive _is_ just a memo of the
>   payloads that a client might ask for.   
> 
>   That's dirt-simple to implement, allows arch to use "dumb servers",
>   and is the same "big-O" space complexity as the more sophisticated
>   formats.  

I need to read through this one a little more, as I don't have a
clear idea yet of what you mean by memoizing payloads.
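
My best guess, as a toy sketch (the names and layout here are mine,
not necessarily how arch lays things out): the first time a client
asks for a payload you compute it and write it out, you never throw it
away, and the resulting pile of write-once files is the archive.

    import os

    def get_payload(memo_dir, name, compute):
        """Return the payload called `name`, computing it at most once.

        Unlike a cache, nothing is ever evicted: every payload a client
        has ever asked for stays on disk as a write-once file, and that
        accumulated memo is the archive itself."""
        path = os.path.join(memo_dir, name)
        if not os.path.exists(path):
            data = compute()              # e.g. build the changeset tarball
            tmp = path + '.tmp'
            with open(tmp, 'wb') as f:
                f.write(data)
            os.rename(tmp, path)          # publish atomically, never rewritten
        with open(path, 'rb') as f:
            return f.read()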


> * rely on client-side caches, but keep it simple, stupid
> 
>   Client-side, our interesting developer is going to have some 
>   pretty interesting access patterns -- focused mainly on recent
>   revisions, semi-random-access within those.
> 
>   Revision libraries provide that in, again, a dirt-simple way that
>   has just about optimal performance characteristics.   The recent
>   hacks for --hard-links, library paths, sparse libraries and so 
>   forth just make that sweeter and sweeter:  our interesting developer
>   can these days (using the head revision at least -- you'll see it in
>   the next release) check out a tree just as fast as he can hard-link
>   to it.   Whole tree comparisons take, in the expected case, the 
>   time of `find' traversals of the two trees plus the time to diff the
>   files that actually changed.
> 
>   In terms of _time_ complexity, this romps all over the old
>   "sophisticated" formats and their modern descendents.
> 
>   In terms of _space_ complexity, it's awfully damn competitive with
>   the "sophisticated" formats, especially when you multiply the
>   difference between them by the price-per-byte of disk storage.

Right, and this looks very nice.  I suspect many more of the cases in
arch where I have performance concerns can be optimized with similar
caches.  So what I need to concentrate on is:
a) making certain the semantics of the basic archive are not
   artificially limited.
b) helping to find performance problems.
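
To make sure I follow the hard-link trick, here is my own rough
illustration (not arch's code) of why whole-tree comparison can be
that cheap, assuming both trees were populated by hard-linking from
the same revision library: files that still share an inode cannot
differ, so only files that really changed ever get read.

    import filecmp
    import os

    def changed_files(tree_a, tree_b):
        """List files under tree_a that differ from tree_b.

        Unchanged files still share an inode with the library copy and
        are skipped without being read; the total cost is a find-style
        walk plus a real comparison of only the files that actually
        changed.  (Files present only in tree_b are ignored here.)"""
        changed = []
        for root, dirs, files in os.walk(tree_a):
            rel = os.path.relpath(root, tree_a)
            for name in files:
                a = os.path.join(root, name)
                b = os.path.join(tree_b, rel, name)
                if not os.path.exists(b):
                    changed.append(os.path.join(rel, name))
                    continue
                sa, sb = os.stat(a), os.stat(b)
                if (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino):
                    continue              # same hard link, so identical
                if not filecmp.cmp(a, b, shallow=False):
                    changed.append(os.path.join(rel, name))
        return changed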

> * cache-rev is for patzers
> 
>   Archive cached revisions pretty much don't matter to our
>   interesting-case developer -- they mostly help the more casual
>   users of arch archives.   They're a courtesy in that regard.
>   If you're working with some project, make a local mirror, pass
>   --no-cached to push-mirror, and make sure you set up your libraries 
>   well.   Arch would _arguably_ be better off not even bothering with
>   cached revs.
> 
>   I exaggerate slightly:  if your ongoing relationship to a project
>   is that you want to keep current with the head revision but don't
>   care about keeping the history of it locally (perhaps you are a
>   "tester"), then cached-revs (and, later, cached-summary-deltas) are
>   a win for you.   But these things really aren't of central
>   importance to arch.

Arch makes a lot of distinctions between caches, mirrors, and normal
archives.  My feeling is that this shows a limitation of arch.  Why
can't these all be handled in the same way: as code in an archive?

>     > As I understand the literature recent work on version control has
>     > used what is a variation on the gzip format for storing multiple
>     > versions.  The idea is you compress the first file like normal.
>     > But for the second and subsequent files you look back into your
>     > archive (which you are simply appending to) and use previous text
>     > for compression.  This makes both appends and random access fast
>     > and in addition this happens to work for random binary files.
> 
> arch is not heavily optimized for binary files, that's true.   I've
> yet to see any evidence that this matters all that much for software
> development and given today's disk economics.

The important thing with binary files is that they are handled at all;
heavy optimization is not necessary.
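
To make the idea quoted above concrete, here is a minimal sketch of
compressing each new revision against the previous one, using zlib's
preset-dictionary support as a stand-in for the scheme described.
This is only my illustration of the flavor of the technique, not what
any particular system does; a zlib dictionary only covers the last
32 KB, while a real archiver would look much further back.

    import zlib

    WINDOW = 32768   # zlib preset dictionaries see at most the last 32 KB

    def compress_against(previous, new):
        """Compress `new` using the tail of `previous` as a preset
        dictionary, so text shared with the prior revision costs almost
        nothing.  Works on arbitrary binary data."""
        c = zlib.compressobj(level=9, zdict=previous[-WINDOW:])
        return c.compress(new) + c.flush()

    def decompress_against(previous, blob):
        """Recover the new revision given the same prior revision."""
        d = zlib.decompressobj(zdict=previous[-WINDOW:])
        return d.decompress(blob) + d.flush()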

>     > Thinking about the implications of this I wonder why no one has built
>     > a general purpose archiver with a structure like that.  It should
>     > give roughly the same compression ratio as tar | gzip while still
>     > giving you the random access properties of zip. 
> 
>     > So it looks like the very practical thing to do is to build a general
>     > purpose, compressed, random access file archiver, which can be used
>     > for backups or file archiving whichever makes most sense.  And then
>     > come back and look at improving arch.
> 
> _Possibly_.  It's not obvious that it'll be worth it (nor that it
> won't).  But if you are going to go that route, google around for Josh
> MacDonald first.

I think the research will be worth it for me, so that I can see
whether there is something real in the sophisticated ideas.  What I
can conclude at this point is that there is a lot of value in using
the directory structure of a server to hold the pieces of your
archive, so that the individual files in the archive never need to
change.

Eric



