
Re: [Arx-users] Further thoughts on ArX and simplicity


From: Walter Landry
Subject: Re: [Arx-users] Further thoughts on ArX and simplicity
Date: Tue, 19 Jul 2005 17:57:16 -0700 (PDT)

Kevin Smith <address@hidden> wrote:
> Walter Landry wrote:
> > 
> > This process will require changing the archive format.  The places
> > that I can think of are
> > 
> >   1) Continuations
> > 
> >     In logs, the "Continuation-of" header tells us where this revision
> >     branched from.  This is used for all branching.  It currently has
> >     the archive and revision.  It could be changed to have just the
> >     revision, and then there would be a "FORK_URL" file which would
> >     have the url of the previous branch.  If the location of the
> >     archive changes, we only need to change that one file.  Putting it
> >     in a separate file means that it won't interfere with checksums.
> 
> Can you point to any design docs that describe the existing archive 
> format? Poking around a bit, it looks like everything is stored binary 
> (ick). What are the main types of files in _arx, and at a very high 
> level, what are their intents and contents? Such information, even if 
> it's just an email posted to the list, will be a valuable resource for 
> anyone interested in fiddling with the code.

First of all, ArX uses the Boost serialization library to write many
things to disk.  This makes it easy to, for example, read and write a
map, list, or string to a file.

-------
ARCHIVE
-------

For the archive, at the top level there are two special entries

1) ,archive_version

  This file just contains a string describing the version of the
  archive.  I don't think that ArX currently checks the archive format.

2) ,meta-info/

  This directory contains the serialized name of the archive in "name"
  and associated public keys in "public_keys".  The public key file is
  just a gpg public key ring.

The rest of the archive is the actual data.  For a revision with the
name branch.subbranch,revision, that patch will be in the directory
branch/subbranch/,revision.  The patch itself can contain up to 5 files

1) branch.subbranch,revision.patches.tar.gz
2) branch.subbranch,revision.patches.tar.gz.sig

  These are the actual patch and its associated gpg detached signature.

3) log

  This is a copy of the patch log.  It is also included in the patch
  itself, but it is out here so that ArX doesn't have to download and
  unpack a whole patch.

  The log is a serialization of these elements.

  map<string,string> headers;
    Things like "Summary", "Date", etc.

  map<string,list<string> > header_lists;
    Things like "New-patches", etc.

  map<string,list<pair<string,string> > > rename_lists;
    Things like "Moved-files", "Moved-directories".

  string body;
    The body of the log.


4) sha256
5) sha256.sig

  These are the SHA256 of the revision and associated gpg detached
  signature.  These are just the hex representation of the SHA256, not
  using serialization.
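
As a rough illustration of the log elements listed under item 3, they
could be gathered into one structure along these lines.  The member
names and types are the ones given above; the struct name `PatchLog`
and the `get_header` helper are hypothetical, and the Boost
serialization hook is left out so the sketch depends only on the
standard library.

```cpp
#include <list>
#include <map>
#include <string>
#include <utility>

// Hypothetical container for the four log elements described above.
// Only the member names and types come from the post.
struct PatchLog
{
  // Single-valued headers such as "Summary" and "Date".
  std::map<std::string, std::string> headers;
  // Multi-valued headers such as "New-patches".
  std::map<std::string, std::list<std::string> > header_lists;
  // Pairs of names for "Moved-files" and "Moved-directories".
  std::map<std::string, std::list<std::pair<std::string, std::string> > >
      rename_lists;
  // Free-form body of the log.
  std::string body;
};

// Illustrative helper: return a header value, or "" if it is absent.
inline std::string get_header(const PatchLog &log, const std::string &key)
{
  std::map<std::string, std::string>::const_iterator it =
      log.headers.find(key);
  return it == log.headers.end() ? std::string() : it->second;
}
```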

In the archive, there is also a directory in branch/subbranch/,cache,
which contains tarballs of project trees created with archive-cache.
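
The naming scheme above can be sketched as a pair of small functions.
This is only an illustration, not code from ArX: the function names
are made up, and it assumes that every '.' before the comma separates
branch components.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Map "branch.subbranch,revision" to "branch/subbranch/,revision".
// Assumption: each '.' before the comma is a branch separator.
inline std::string revision_dir(const std::string &name)
{
  const std::string::size_type comma = name.find(',');
  if (comma == std::string::npos || comma == 0)
    throw std::invalid_argument("expected branch.subbranch,revision");
  std::string branch = name.substr(0, comma);
  std::replace(branch.begin(), branch.end(), '.', '/');
  // name.substr(comma) keeps the leading comma of ",revision".
  return branch + "/" + name.substr(comma);
}

// Path of the patch tarball inside the revision directory.
inline std::string patch_tarball(const std::string &name)
{
  return revision_dir(name) + "/" + name + ".patches.tar.gz";
}

// Path of the copy of the patch log kept next to the tarball.
inline std::string log_copy(const std::string &name)
{
  return revision_dir(name) + "/log";
}
```

With these definitions, a revision named "branch.subbranch,3" would
map to the directory "branch/subbranch/,3".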


------------
PROJECT TREE
------------

For the project tree, the following items are in _arx:

1) ++cache

  This is a directory containing the cached trees used for diff,
  commit, etc.  It has the structure

    archive/branch/subbranch/,revision/

  If you used the --link-tree option when creating the project tree,
  then the project tree will be hard linked to the cached tree.

2) ++default-branch

  This is just a file containing the serialized name of the current
  branch of the project tree.

3) ++edit

  This has a list of paths that have been marked with "arx edit".  The
  serialized names are simply appended to the file, one after another,
  so you have to detect EOF when reading it.

4) ++manifest

  This is a partial list of the paths under version control.  This
  file, along with the ++changes file, has all paths.  It generally
  corresponds to the list of paths in the latest revision, although
  partial commits will alter that.  It also contains the properties
  for each path.  The format has the manifest version and the
  complete revision name (hmm, another place to remove an archive
  name, as well as a problem for hashes) at the beginning.  Then it is a
  list of file_attributes and sha256.  The file_attributes have the
  type of path (link, dir, control, file), name, property, and
  inventory id.  The SHA256 is raw binary to save space and decoding
  time.  The ++manifest file gets read every time you run diff, so I
  wanted it to be fast.  Now that I think about it, I don't think that
  I did timings to see the real difference.

5) ++changes

  This is a list of the files that have been changed via "mv", "rm",
  "add", and "property".  The format is similar to ++manifest, but
  with the keywords "add", "delete", "move", "set" and "unset".

6) patch-log

  This directory contains all of the logs for patches which have ever
  been applied to this tree (unless removed with "history -d").  So the file

    _arx/patch_log/archive/branch/subbranch/,revision

  has the patch log for archive/branch.subbranch,revision.

7) ++sha256

  This is the SHA256 of the ++manifest file.  It must be updated
  whenever the ++manifest file is updated.

> Obviously a "log" here is not what I think of as a "log". Perhaps we 
> could come up with a better name for it to avoid confusion.

I am confused by your confusion.  What do you think a log is?

<snip>
> > Unfortunately, I have been thinking over the hashes for revisions
> > work, and I found one problem: we don't know what the hash will be
> > before we create the patch.  That means that we don't know how to name
> > the patch log.  Systems that don't support cherry-picking can get away
> > with it, because there is always a context for a log.
> 
> Can you describe why we need to name the patch log before creating the 
> patch?

Part of the patch is adding a patch log.  We have to know where to put
the patch log.  The only way to guarantee that there are no conflicts
is to use the hash.  If we did not support cherry-picking, then it
would be pretty simple to just order the logs in one big file.  With
cherry-picking, it is no longer determined whether one patch comes
before another.

We have to have the patch log in the tree because we have to record
which patches have been applied to the tree.  So when we run "get", we
will get a tree that knows that certain patches have been applied.
The revision hash incorporates the location and contents of the patch
log, so we have to know the location before we can compute the hash.

[random thoughts]

I _think_ we could work around this by creating a random number that
we use to uniquify the patch log, and then have some metadata in the
patch that maps "random numbers" -> "hashes".  We would have to
remember to generate this metadata whenever we run "diff", which could
get complicated if we are doing diffs of only part of the tree.  It
would also be a performance annoyance, because we would have to read
the mapping file before doing anything like printing out logs, looking
at ancestry, etc.

But then the mapping is not hashed, and so it could be replaced.  It
seems like a minor thing, but it does open up a hole.  Minor things
can sometimes be exploited into big things, especially with such a
complicated mechanism.

Alternatively, we could just not include the location of the patch log
in the information that gets hashed.  Then we could put it in the
right place after applying a patch.  That would mean that patches
created with "diff" would be different from patches created with
"commit".  It also means that it would be a little difficult to
manually verify a hash (but that is really not a big deal).  The
mapping would still not be hashed, but that is ok, because when you
get the patch, you presumably know which patch you want.  Hmm.  That
might work.

In any case, as noted above, the ++manifest file has the complete
revision name.  We could use only the partial revision name
(branch.subbranch,revision, not branch.subbranch,revision-hash) or
even take it out entirely.  I don't think that it will cause problems
with attackers replacing patches, since patches are referenced by
hash.  But I need to consider this more.

> > One solution is to not use checksums in the names, and instead use
> > random numbers.  This has the same collision-resistant properties of a
> > hash, but it doesn't have the self-verifying properties of a hash.
> > Normally, you don't care because you're checking crypto signatures.
> > But if your key is stolen, then an attacker could change revisions
> > that have already been published.
> > 
> > It might be possible to combine random numbers with hashes to get real
> > hash-based revisions, but whatever I might come up with will be an
> > ugly hack.  Just using random numbers will be rather straightforward.
> 
> Hm. Random numbers seem slightly more prone to collision (due to bad 
> random number generators or insufficient entropy). Probably ok if they 
> are long enough. I still want to really understand the stuff above so I 
> can see why this is necessary.
> 
> I guess my hope would be that my "fresh eyes" might be able to spot some 
> potential design simplifications, where your years in the arch/ArX world 
> may have biased you in certain directions.
> 
> >>So how can we get from here to there? I'm a huge fan of incremental 
> >>development, so I would really like to have a series of changes where 
> >>the system is never broken.
> >>
> >>Perhaps the first step would be to do the work described below. That 
> >>would dramatically cut down the number of places that would need to be 
> >>changed to reflect the big archive naming paradigm shift.
> 
> When I said "work below" I was referring to simplifying the UI to use 
> appropriate defaults and inferences to avoid every possible need for the 
> user to specify an archive. I think that's a big win even if we never 
> take the additional steps we're discussing.

I agree.

> >>After that, could the remote branch work be done first (switching to be 
> >>URL-based), without affecting other parts of the system? If so, that 
> >>seems like a good first step.
> 
> I would love to figure out a way to implement remote branching without 
> having to overhaul the archive format.

Archive names are pervasive, so I think modifying the archive format
is inevitable.  On the other hand, since we have to modify the format
anyway, we can sneak in a few other improvements.

Cheers,
Walter



