arx-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Arx-users] Hashes for revisions


From: Walter Landry
Subject: [Arx-users] Hashes for revisions
Date: Sat, 02 Apr 2005 17:04:42 -0500 (EST)

Greetings,

There is one use case that ArX does not handle well, and I think it
should.  Suppose you have a project that you are working on, and the
project is stored on machine A.  Then you mirror that archive to
laptop B, disconnect, and hack on the project.  Right now, the safest
way to do this is to make a new branch with a new archive and work on
that branch.  Otherwise, you might forget that you were hacking on the
laptop, and accidently commit a revision on machine A with the same
name (e.g. foo.bar,14).

But this is annoying, because where you are working on the project
should not influence which branch the work goes on.  If I am working
on feature xyz, then all of the work should go into the xyz branch.
Martin Pool expressed it rather well with

  As much as possible, the accidental difference of the location of
  the repository should not effect[sic] the semantics of branches.

The usual way to handle this is to use cryptographic hashes to
identify the revisions.  Then you are guaranteed (more or less) not to
have collisions.  This is what monotone does.  However, it leads to
long, ugly, non-intuitive revision names like a38f3b4bed7...  You also
can't tell if 32ff4ad4... came before or after this revision.

So here are my incomplete and perhaps incoherent thoughts.  I have not
figured out everything, especially the implementation.

In ArX, we already compute the hash of a revision.  So we can use both
the hash and the ordinary revision number to identify a revision.  You
would only use the hash to resolve ambiguities.  So a revision might
have the name foo/bar.baz,14,76a72bc9d.  But if you type "arx get
foo/bar.baz,14", you will get that revision unless there is another
revision with a different hash.

There are still some semantics to work out.  First of all, what does
"merge" do in the default case, when we are just updating a tree.
Suppose that we have two lines of development

  1 -> 2 -> 3 -> 4a -> 5a -> 6a -> 7a
              |
              -> 4b -> 5b -> 6b -> 7b

If I have a tree at 5a and I type merge, should I get a merge with 7a
or 7b?  I would assume 7a, since that is along the same microbranch.
But what if I am at revision 3?  I think that ArX should complain about an
ambiguity.

The problems get more complicated with "get".  Should a plain "get"
get 7a or 7b?  What if the (a) branch has not been touched for 3 years
while the (b) branch has 1000 more revisions following it?

One possibility is for there to be a way to mark a branch as no longer
interesting.  But what do we do when mirroring?  Should that
"uninteresting" mark carry over to the mirror?  What if the person
pulling the mirror is actively working on that branch?

So maybe the best thing is to have "get" retrieve the highest numbered
revision by default?  If that gets annoying for someone, then perhaps
they should create a new branch?  That does open up the possibility of
"revision number pollution", where one branch may create a large
number of revision numbers which mess up the main branch.  This is the
one thing that I have yet to figure out.

So that is the basic interface.  For implementation, my over riding
concern is that it is still possible to serve an ArX archive using a
simple ftp or webdav server.

One change to handle this is that patches would have to reference the
hash of the previous patch level.  That is easy.

Another is that we have to store the current revision in the project
tree, so that we know what revision to diff against.  No longer can we
assume that the highest patch number of the current branch is correct.
That is also easy.

We also have to change how revisions are stored in the archive.  Right
now, for the revision foo.bar,14 with hash a23d23aeb4, it is stored in
the directory foo/bar/,14.  I would change this to
foo/bar/,14,a23d23aeb4.  So the hash would be appended to the end of
the revision number.

However, if I append the full hash to the end, we will run into path
length problems, especially on Windows.  It takes 64 hex characters to
express a sha256 hash.  But we don't really need to append the entire
hash.  We are just trying to avoid accidental collisions, not
intentional ones.  A malicious attacker can always create files in the
wrong place, but those files revisions will not have the correct hash.

So how many bits should we have?  Since the revisions are already
uniquified by the archive, branch and revision number, we only need to
worry about how many different branches of the same code there are.
For the linux kernel, as I understand it, there are hundreds of
contributors.  If they each have 10 branches, then we are talking
about thousands of possible microbranches.  And that is only if no one
bothers to make a real branch instead.

Eight hex characters is 32 bits, which means you need about
2^(32/2)=65536 versions of the same revision to create an accidental
conflict.  So my current plan is to truncate the sha256 to 8
characters when storing it in the archive.  Eight characters is also a
nice length in that it is unlikely to be confused with a revision
number.

Currently, whenever we need the latest revision (for merge or get), we
list all of the revisions and pick the largest number.  We can still
list all of the revisions and partially order them, because they all
have revision numbers.  So if you are doing a merge, and you are at
revision 5 and merging up to revision 10, you don't have to worry
about any ambiguities arising from revisions before revision 5.

There are two nice side effects of these changes:

  1) We can have a real "delete-revision" that completely removes all
     traces of a revision.  Because of the hashes, there is no danger
     that someone will accidently get the wrong revision.

  2) We can have a "fix log" command that changes the log message in a
     revision by creating a new revision with a new hash.  The old
     revision is then deleted.

It is also nice that, if you don't use this feature, the workflow will
remain exactly the same as before.  There will still be revision
numbers that you can use to identify revisions.  One annoying part is
that merges can take a bit longer.  Given the previous example

  1 -> 2 -> 3 -> 4a -> 5a -> 6a -> 7a
              |
              -> 4b -> 5b -> 6b -> 7b

If you have a tree at 5a, you have to read the patch logs for 6a, 7a,
6b, and 7b to ascertain that you want to update to 7a.

These changes would make it possible to have microbranches, but I am
still not completely sold on how it all works.  Other systems that
tackle this problem all carry around the entire revision history,
which makes many decisions easier.  In any case, these changes are not
going to happen in the short term.  It is just something that I have
been thinking about for a while.

Cheers,
Walter




reply via email to

[Prev in Thread] Current Thread [Next in Thread]