[Arx-users] The Future (long)

arx-users
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Arx-users] The Future (long)

From:	Walter Landry
Subject:	[Arx-users] The Future (long)
Date:	Wed, 07 Dec 2005 00:35:37 -0800 (PST)
Greetings,

I have opened up a completely new, incompatible branch of ArX
(arx.3.0). There are and will be a number of changes in it.  I have
gotten the arx-changes list working again, so you can follow my
progress there.  This email describes in some detail what is going on
there.  If something is confusing, please speak up.

First of all, I have changed over from the term "archive" to
"repository".  Tla and related kin are the only systems that use the
term archive, so I thought it needlessly confusing to have a different
name.  Moreover, the third party serialization library uses the term
"archive" to mean something totally different.

I have also converted ArX to only use url's (see [1]).  So repos no
longer have names.  There is still a weak form of registration, in the
same sense that ssh registers hosts.  There is nothing that the user
has to do.  ArX just records what public keys are being used by the
repo, and is indexed by location.  So if you access a repo two
different ways (e.g. via sftp and http), then you will have two
different registrations.

Earlier, I discussed changing the format for revision names to
something like

  url#branch.subbranch,revision

It turns out that if you type this into a browser, the part of the url
after "#" is not sent to the server.  All work is done on the client.
It would be nice to be able to set up a cgi script like ViewCVS that
could just use these urls directly.  So, unless someone objects, I
would like to change the "#" to "," and the "," to "@".  So it would be

  url,address@hidden

There are some other minor changes that I've done, but the big one is
going to be a major change in the repository format.  After much
consideration, I have come up with the following format.  First, an
outline of the directory structure.

repo/
  README
  keys
  version
  pending/
  index
  listing
  branch1./
    index
    listing
    0/
      listing
      aaaaaaaaaaaaaaa
        rev.tgz
        log
        log.sig
      cccccccccccccccaaaaaaaaaaaaaaa
        patch.tgz
        log
        log.sig
      dddddddddddddddccccccccccccccc
        patch.tgz
        log
        log.sig
      ...
     256/
      qqqqqqqqqqqqqqqppppppppppppppp
        patch.tgz
        log
        log.sig

At the root of the repo, there are two files: "keys" which has the
public gpg keys, and README which tells you the format of the repo and
not to touch anything :)

For every branch, there is a directory with the same name as the
branch, but with a period "." appended.  The period "." makes it easy
to distinguish branch names (which can be almost anything) from
everything else.  Within branches, there can be subbranches,
subsubbranches, etc., using the same format.

Also within (sub)*branches, there can be revisions.  They are
divided in chunks of 256.  So revisions 0-255 are in directory 0,
256-511 are in directory 256, etc.

Within each revisions directory, there are directories for each
revision.  Each revision is named by its sha256 and its parent.  The
idea is that a simple directory listing can give us all of the
revisions and how they are pieced together.

This sha256 is calculated the same way they are currently calculated
for each revision (taking the hash of the manifest).  We use the first
15 characters of the base-16 encoding of the sha256 to give a
directory name with 30 characters.  This is to fit within a 31
character limit imposed by a number of filesystems.

This is not the complete hash, but it is 60 bits of it.  To get an
accidental collision, we would have to have 2^30 different revisions
(about 1 billion) from the same parent revision.  The only danger
of accidental collisions is that it would cause you to be unable to
commit or mirror.  The signatures and full hashes are still checked,
so there is no danger from a malicious replacement.

Within the revision subdirectory, we can have up to four files:

  1) "rev.tgz"

    This is a full copy of the project tree at this revision.  This is
    _only_ present for the first branch of a revision.  So it is
    present when someone does an initial import, but also when you run
    "fork" and "commit".  This allows you to truncate history, so that
    you don't have to carry history back to the end of time.
    Otherwise, any repo format will eventually get unwieldy and
    difficult to carry around.

    Incidentally, this should dramatically improve ArX's abysmal
    performance on import.  For imports, ArX currently creates a patch
    from nothing, which entails a lot of extra work.  Making "rev.tgz"
    is mostly copying files and tar'ing it up.

  2) "patch.tgz"

    This is a patch against the previous revision.  It will basically
    be the same as the current patch format.

  3) "log"

    This contains all of the usual things (author, date, commit
    message, etc.) plus two hashes.  One is the revision hash, and the
    other is the sha256 of "patch.tgz".  This allows us to sign
    the revision, the patch, and the log all with one signature.

    One thing it does not have is the branch name.  It only has the
    revision hash.  This will allow you to move branches around
    without having to recalculate hashes or signatures.

  4) "log.sig"

    This is the gpg signature of "log".


One thing of note is that, since the log is completely separate from
the revision, it is easy to rewrite logs to correct misspellings etc.

I am considering adding a non-authoritative "index" file to every
directory except the revision directories.  It would contain a single
hash of its subdirectories.  So you would be able to read a single
short file to see if anything has changed in the repo.  Updating the
"index" file atomically is tricky over remote connections.  So if the
file is missing or corrupt, ArX will look for changes the
old-fashioned way.

The "index" file is separate from a "listing" file, which will still
be maintained for systems that read over plain old http.  It is
"listing" instead of ".listing" to get around some restrictive ftp
upload policies.  Also, it will become a serialized list of
directories instead of the one-file-per-line format, so that you can
have special characters (e.g. carriage return, NULL) in revision
names.

To add a revision, we create the revision directory in a toplevel
"pending" directory.  We then rename the directory to its destination.
Instead of the "break-lock" command, ArX might have a "clean-repo"
command that can remove pending revisions.  I might not even bother
with "clean-repo", instead putting a note in the manual.

Once the revision is written, we update the "index" files, starting
from the bottom.  To do so, we write a completely new file and put it
in "index.new".  On filesystems where renaming a file on top of an
existing file is atomic (e.g. most Unix filesystems, recent Windows
filesystems), we simply rename the new file over the old.  For most
remote protocols and some filesytems, it is not atomic, so instead we
remove "index" and move "index.new" to "index".

The revisions are numbered by their maximum distance from the root.
For example, a graph with numbering

     aaa   (0)
  /       \
 |         |
 |         |
bbb (1)    |
 |        eee (1)
 |         /
ccc (2)   /
  \      /
    ddd     (3)

So to get ddd, you can type

  arx get url,address@hidden

To get eee, you have to disambiguate it

  arx get url,address@hidden

In this case, there is a merge between ccc and eee, resulting in ddd.
For merges, ArX will actually store two patches: one is a patch of ccc
to ddd, and the other is a patch from eee to ddd.  So there would be
two directories within branch/0: eeeccc/ and eeeddd/.  This allows us
to update from either branch to ddd.  It also simplifies annotating
down both sides of the merge.

One thing to note is that you don't really have to have branch names
at all.  You can just have revisions at the top level.  In that case,
you could think of that branch as the default.  So running

  arx get url

would give you the head of that branch (if there was a unique one).
So for simple cases, you don't even have to name your project.  For
more complicated cases, you could use the default branch as a tag that
downloads all of the components for the current head.

It is fairly simple to go from there to using the branch==repo
paradigm that hg, bzr, darcs, etc. have.  My thought right now is that
that paradigm is sufficiently different from the separate repo and
tree paradigm that I would want a different command for it.  You would
have to implement pull, unpull, push, and clone, modify init, get rid
of make-repo, get, browse, delete-branch, and tree-branch, plus some
other things that I am surely forgetting.  But the new functionality
would be mostly piecing together existing components.

For the actual patches, I am thinking of using skip-deltas [2].
Subversion uses them on individual files, while we will use them on
whole trees.  The main benefit is log(N) access to any revision, so
cached revisions are no longer necessary.

Unforunately, it also requires O(N log(N)) space, instead of O(N).  I
ran some tests on the ArX tree, and there was a 50% space penalty.
This is for revisions with lots of PDF's, which may make the penalty
smaller than normal.  I am planning to look at the gcc repo (100,000
revisions) to see what kind of space penalty it has.  The gcc
subversion repo takes up 8.8 GB, so space is important.

But the big downside to skip-deltas is that they make the code more
complex and slow other things down.  In order to get a particular
patch, I have to compose a number of different patches.  This will
make annotate, in particular, somewhat slower.  It will also make
get-patch more complicated and slower.  Finally, it will slow down
commit, because I have to compute a skip delta instead of an ordinary
delta.  In practice, that means applying log(N) patches to a pristine
tree before we can make the diff.  For local archives, I don't know
that it will be a big deal, but I will have to measure.

Skip-deltas is probably the least firm part of the new format.  I just
can't think of anything else that is going to work well.

For the project tree-format, the only real difference is that I will
get rid of the patch logs in the tree, and instead just have a file
which contains all of the patches that have been applied to the tree.
The patchlogs take up way too much space and are duplicated from the
repo.  The file would just be a serialized graph of the ancestry,
making it easy to read and write.

Merges are detected by seeing that there is a new, complete line of
patches in the patch log.  So if you individually apply all of the
patches, it will still know that you merged.

In any case, in thinking about the various formats, I came up with a
list of desirable characteristics.  Not all of them are possible at
the same time, but they are all desired.  If you know of any that I
missed, let me know.  In no particular order, and how the new format
scores


* Reliable: Won't break if interrupted at arbitrary times.
  -
  The only bad things that can happen are that there are pending
  revisions left in the repo, or index files are missing.  Missing
  index files merely slow down ArX, and the next commit to the repo
  will fix it.

* Repairable: If broken, it is easy to fix.  This argues for storing
  everything in multiple files with simple formats.
  -
  The patches are still the simple tarballs of patches that they were
  before.  You may now be able to ignore some corruption, because
  skip-deltas won't need them to construct any new revisions.

* Fast annotate:
  -
  This requires going through all of the patch logs to find out which
  ones affect a given file, then combining the appropriate patches to
  get a real delta between versions.  Not particularly fast, though it
  does scale as O(Number of revisions).  So long histories will
  suffer.

  It won't be anywhere near as fast as a format that uses weaves
  storage.

* Fast merging, including perhaps strategies like fast weave merges:
  -
  Regular, 3-way merges should not be too bad, especially with
  O(log(N)) access to any revision.  Weave merge will not be
  particularly fast, but perhaps fast enough for the rare cases when
  you need it.

* Fast access to any revision, in particular the latest revision, even
  remotely:
  -
  O(log(N)) access to any revision, which is not bad.  For 60000
  revisions, that is about 16 patches.

* No need to download the entire history just to check out the latest
  version (monotone, bzr, hg, and git are all bad in this respect).
  -
  Yes.  You can also commit into the remote repository directly.

* Easy to specify any revision: darcs is bad in this respect, because
  there is no universal number for every revision.
  -
  For ordinary, linear development, you can use the sequence number.
  Once there is parallel development, you only need to use enough of
  the hash's name to uniquify it.

* Fast diffs against any revision, in particular the latest revision:
  -
  The latest revision is fast, and other revisions are O(log(N)).

* Fast commits:
  -
  Not as fast as before, because of the need to compute a skip-delta.
  Applying exact patches to a tree is actually quite fast, but it is
  extra work.  Testing will tell whether that is a problem.

* Fast imports:
  -
  Much better than before.  It should be mostly equivalent to moving
  all of the files and tarring them up.

* Fast repo verification (e.g. svn has the "svnadmin verify" command):
  -
  Nothing really planned here.

* No extra directories: (Subversion has .svn subdirectories in every
  directory, most other systems have a single special directory at the
  top, and svk has no special directories, instead keeping that
  information elsewhere)
  -
  There is still a _arx directory at the top of every project tree.

* Efficient storage of repo and project tree: (unpacked git is
  terrible here.  tla/baz/arx all store the complete patch logs of all
  revisions in separate files in the project tree, bloating the space
  requirements for projects with long histories)
  -
  The project tree is very efficient since we have gotten rid of the
  patch logs.  The repo is somewhat efficient, although it could be
  better.  It depends on the number of revisions.

* Can move repo and project tree around in filesystem or between
  machines with tar:
  -
  Yes

* Fast repo syncing for both CPU, latency, and bandwidth:
  -
  The "index" files allow you to figure out a null sync in just one
  read of a file of 32 bytes.  A non-null sync has to recurse down,
  requiring two round trips (one for "index", one for the directory
  listing, though I may be able to do the listing asynchronously) for
  every level.  Certainly faster than the current method, which has to
  recurse down into every branch in the repo.

* Complete history by default, truncated history when desired:
  -
  If you don't have write permissions on the original repo, then you
  must either "mirror" or "fork" to commit.  In either case, you will
  not need to contact the initial repo except for updates.  It then
  becomes a matter of educating users about the proper method to use.

  However, in the future, it would not be hard to implement
  branch==repo, which will give you complete history by default.

* Checksums on patches and revisions.
  -
  sha256 on the revision and patch.

* Signatures on patches and revisions:
  -
  The signature on the patch log covers the sha256 of the revision and
  patch.  Sha256 should be good for the next 50 years or so, barring
  unforseen developments.  The same can not be said for sha1.

* Convergence when merging, so repeated merges don't create repeated
  commits:
  -
  Yes.  Patches from all inputs to the merge are stored, so updates
  from the branches bring you to the merged revision.

* push/pull over dumb protocols:
  -
  As before, ftp and webdav servers work for free.  Plain http servers
  must use update-listing.

* Distributed:
  -
  Using hashes to disambiguate revisions allows people to work in
  parallel, and then pull in revisions from each other.

* Easy branching (e.g. you don't have to come up with a new name every
  time you want to make a branch):
  -
  Yes.  A branch does not have to lock anything or use a different
  name.  It just uses a different hash.

* handles collections of projects:
  -
  Yes, same as before with "arx tag".

* No repo maintenance (no archive caches or git's packing, or even
  make-archive (as in darcs and bzr)):
  -
  make-repo is still required, but archive caches are not.

* Handles a large number of revisions (~60000):
  -
  When updating, you know which group of 256 revisions you are in, so
  you usually only need to list one directory.  Basically you
  have to list (Number of new revisions)/256 directories,
  although you may have to list a directory with (Number of
  total revisions)/256 entries.  for ~60000 revisions, that is ~256
  entries, making a total of about 1 KB to read.
  
  When doing an initial get, you have to list all of revision
  directories.  That is about 60000*32 bytes=2MB.  That is proabaly a
  small number compared to the size of the project tree.  You also
  have to get and apply about 16 patches, as well as the initial
  rev.tgz.

* Handles a large number of branches(~100):
  -
  When updating, you may spuriously notice changes that happened in a
  parallel branch.  However, that will only result in an extra listing
  of a revision directory, which we might do anyway to cut down on
  latency.  The directory listings will have about (Number of
  branches)*256*32 bytes.  For 100 branches (a pretty extreme
  example), that would be about 800 KB.

* Human readable revision names:
  -
  The default sequence numbers are human readable.  Hashes only come
  in when there is a need to disambiguate.

* No need to name repos or projects: (darcs/bzr/hg/git is good, tla is
  ultra bad)
  -
  Naming is optional

* Handles cherry picking, and makes it a merge when you have applied
  all of the patches:
  -
  Yes

* Can disapprove patches:
  -
  No

* Can host on any filesystem including 8.3 systems.  This includes
  running a server on a filesystem that can not store files in the
  repo.  Using a single file database would be one solution, although
  that causes other problems.

  8.3 filesystems are not so common, so you might want to try to get
  away with only 31 character, case-insensitive filenames, with a max
  path length of 255 and max directory depth of 8.

  Also, some ftp sites have restrictive policies about what kind of
  files can be uploaded.  From the comcast website:

   NOTE: File names must consist of characters from "a-z", "A-Z",
   "0-9", '_' (underscore), '.' (period), '-' (hyphen). No other
   characters (including spaces) can be included in the file
   name. File names must not start with '.' or '-'.

  -
  8.3 filesystems will not work because of the 30 character revision
  names.  Similarly, illegal characters in a branch name or overly
  long branch names can cause problems.  I considered url-encoding
  branch names, but that will make non-ascii names much longer,
  possibly causing problems with length.

  An interesting note is that if two branches differ only in case,
  they will end up stored in the same place on case-insensitive
  filesystems.  Because of the hashes, the ancestry will not get
  confused.  But running a simple "arx get branch" will notice that
  there are two heads.

* Easy to backup. DB's have their own backup scripts, but being able
  to use rsync and having it do the right thing is awfully nice:
  -
  Yes

* Works with write-once media. (No one really has this, although
  tla/baz/arx and subversion (with fsfs) could be modified to do so.
  We just need a place to put the lock files.)
  -
  No

* Remote, multi-user, auditable, restricted repos like what CVS and
  subversion offers. Then there is only one person who needs to
  manage the repo, and a random user can't delete or modify old
  revisions.  Doing this without a smart server is painful.
  -
  No, but it could be added later with a smart server.

* Lightweight branches
  -
  Microbranches are as light as they can be, since they only have a
  small patch file.  However, that requires you to have the rest of
  the branches' revisions.

  Branches without history are not as lightweight as they used to be,
  because you always have a "rev.tgz" file.  However, the size of
  "rev.tgz" is usually much smaller than the size of a project tree,
  so it won't actually be that bad.  My WAG is a 30% addition over the
  size of a project tree.

* Able to remove revisions, even after they have had child revisions:
  -
  yes, although you will have to remove the children first.  You no
  longer get into the situation where you have two revisions with the
  same name.

Comments?

Cheers,
Walter

[1] http://lists.gnu.org/archive/html/arx-users/2005-07/msg00018.html
[2] http://svn.collab.net/repos/svn/trunk/notes/skip-deltas
[Prev in Thread]
Current Thread
[Next in Thread]
[Arx-users] The Future (long), Walter Landry <=
- Re: [Arx-users] The Future (long), Kevin Smith, 2005/12/07
  - Re: [Arx-users] The Future (long), Walter Landry, 2005/12/07
    - Re: [Arx-users] The Future (long), Kevin Smith, 2005/12/07
    - Re: [Arx-users] The Future (long), Walter Landry, 2005/12/08
    - Re: [Arx-users] The Future (long), Kevin Smith, 2005/12/08
    - Re: [Arx-users] The Future (long), Walter Landry, 2005/12/09
    - Re: [Arx-users] The Future (long), Kevin Smith, 2005/12/09
    - Re: [Arx-users] The Future (long), Walter Landry, 2005/12/09
    - Re: [Arx-users] The Future (long), Kevin Smith, 2005/12/09
    - Re: [Arx-users] The Future (long), Walter Landry, 2005/12/11
Prev by Date: Re: [Arx-users] Arx 2.2.4 mv/rm bombs on commit
Next by Date: Re: [Arx-users] The Future (long)
Previous by thread: [Arx-users] Arx 2.2.4 mv/rm bombs on commit
Next by thread: Re: [Arx-users] The Future (long)
Index(es):
- Date
- Thread