gnu-arch-users

Re: [Gnu-arch-users] arch lkml


From: Eric W. Biederman
Subject: Re: [Gnu-arch-users] arch lkml
Date: 13 Dec 2003 01:26:35 -0700
User-agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/21.2

Tom Lord <address@hidden> writes:

>     > From: address@hidden (Eric W. Biederman)
> 
>     > Arch makes a lot of distinctions about caches and mirrors and
>     > normal archives.  It is my feel that this is showing the
>     > limitations of arch.  Why can't these all be handled in the same
>     > way?  Code in an archive.

I liked the response; it clarifies a lot of the design of arch.  But
it did not address the part that leaves me unsettled.
 
> Scalability.

My feeling is that if you had the concept of a distributed archive,
where each instance holds some subset of the entire archive, it would
be quite similar to the current case.  Different instances would play
the roles of caches, mirrors, and development archives.

As far as I can tell, because of the current structure of arch
repositories, I cannot have two copies of the same repository on two
separate machines: there would be no way to prevent a collision when
two simultaneous revisions of the same base version are checked into
the different archives.
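
To make the collision concrete, here is a minimal sketch in Python.  It
is a toy model and not arch's actual code: the ArchiveCopy class, the
sequential "patch-N" naming rule, and find_collisions are all
assumptions made for illustration.  Each copy picks the next revision
name from its own local history, so two independent commits on the same
base end up with the same name but different contents.

```python
# Toy model, not arch's actual code: each archive copy names the next
# revision using only its own local view of the version's history.

class ArchiveCopy:
    def __init__(self, revisions):
        # revisions: mapping of revision name -> change description
        self.revisions = dict(revisions)

    def commit(self, change):
        # The next name depends only on what this copy has already seen.
        name = f"patch-{len(self.revisions)}"
        self.revisions[name] = change
        return name


def find_collisions(a, b):
    """Return revision names both copies hold with different contents."""
    shared = a.revisions.keys() & b.revisions.keys()
    return sorted(n for n in shared if a.revisions[n] != b.revisions[n])


base = {"base-0": "initial import"}
machine_a = ArchiveCopy(base)
machine_b = ArchiveCopy(base)

# Two simultaneous commits against the same base revision.
name_a = machine_a.commit("fix scheduler bug")
name_b = machine_b.commit("add new driver")

collisions = find_collisions(machine_a, machine_b)
print(name_a, name_b, collisions)
```

Under these assumptions neither copy can detect the conflict at commit
time; it only shows up later, when the two histories are compared.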

> Look at it this way: we have some set of "raw data" (all the data stored
> by `commit', `tag', or `import' across all of the archives in some
> domain of consideration (e.g., "the free software developer community"
> or "the developers employed by XYZZY Corp.")).
> 
> That raw data expands over time according to some simple, core,
> transactional rules (i.e., what `commit', `import', and `tag' mean).

And this is fundamentally where my concern lies: what `commit',
`import', and `tag' mean.  These things are tied very closely to the
archive design and format.  If the semantics are too limited, an
archive can hit a wall.  If the semantics are simply abstract
specifications, the result is likely to be technically impractical.
So a careful balance must be struck.

The problem is that once an archive system is a fundamental part of
your process, it becomes very hard to change.

The best practical test I can think of for having semantics that are a
superset of other systems is whether you can import and reexport other
archives without loss of information.  That is one of the things
Unicode got right.  Of course, once you have done sophisticated things
you may no longer be able to reexport into a lesser format without
data loss, but that does not apply to the plain import/reexport case.
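
The round-trip test can be sketched as follows.  This is only an
illustration under stated assumptions: ForeignRevision, the field
names, and the import/export functions are hypothetical and do not
correspond to any real arch or CVS interface.  The check is simply
that exporting what was imported gives back the original record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignRevision:
    # Hypothetical record as another system might store it.
    author: str
    date: str
    log: str
    files: tuple  # sorted tuple of (path, contents) pairs


def import_revision(rev):
    # Hypothetical native record; lossless only if every field is kept.
    return {
        "creator": rev.author,
        "timestamp": rev.date,
        "summary": rev.log,
        "tree": dict(rev.files),
    }


def lossy_import(rev):
    # A deliberately weaker importer that has no place for the date.
    record = import_revision(rev)
    record["timestamp"] = ""
    return record


def export_revision(record):
    return ForeignRevision(
        author=record["creator"],
        date=record["timestamp"],
        log=record["summary"],
        files=tuple(sorted(record["tree"].items())),
    )


def round_trips(history, importer=import_revision):
    """True if import followed by reexport reproduces every revision."""
    return all(export_revision(importer(r)) == r for r in history)


history = [
    ForeignRevision(
        author="alice",
        date="2003-12-01",
        log="initial import",
        files=(("Makefile", "all:\n"), ("main.c", "int main(void) { return 0; }\n")),
    ),
]

print(round_trips(history))
print(round_trips(history, importer=lossy_import))
```

An importer that fails this check for some other system is evidence
that the native semantics are not a superset of that system's.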

> At the same time, we have an _open_ended_ number of access patterns
> for people reading that data.  Extremely open-ended with variations on
> access patterns reflecting network topologies, what parts of the data
> are needed more quickly than others, what kind of indexing is needed,
> who's doing what concurrently, etc.
> 
> One idea is to look for a "silver bullet" archive format:  one that
> will satisfy all of those access patterns and, at the same time,
> preserve the transactional semantics of the three core operations.
> This is, of course, a quixotic quest.
> 
> A better idea, in my opinion, is to (a) optimize the heck out of those
> three core operations;  (b) make it as cheap as practical to move
> data around, especially across networks;   (c) keep the data in a
> format that makes it very _easy_ to create _ancillary_ data structures
> that optimize the access patterns of a particular situation;  (d)
> start building those ancillary data structures;  (e) do all this using
> techniques that are simple enough to get Right.

I agree in principle but disagree in practice.  A "silver bullet"
solution is the wrong thing to search for.  But a solution that is
good enough to solve my problems as a project maintainer, to solve
my problems in working with the kernel maintainers, and that has a
base that is good enough to last decades is very important.

Looking for a base format that can do everything efficiently is
certainly the wrong thing.  But at the same time it is stupid to
stop considering other formats just because you have found one.  If a
better layout can be found that meets all of your original
requirements but can do more things simply, that is a better thing.

What are my requirements and problems?
I wish I could easily give a list and make this problem easy to solve,
but I can't.  The best I can do is think things through one piece
at a time and look.  At least for now.

The open questions I have for making my decision are:

1) Is there truly a benefit to the binary data structures used by
   xdelta, svn, and discussed in several academic papers?

2) Are the semantics of arch rich enough to keep it from running
   into a wall I care about in the future?

Eric



