monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] how to merge trees


From: Nathaniel Smith
Subject: [Monotone-devel] how to merge trees
Date: Tue, 16 Aug 2005 03:38:30 -0700
User-agent: Mutt/1.5.9i

Here's a scheme for doing tree merging, using as a foundation the
algorithm defined and justified in
  http://thread.gmane.org/gmane.comp.version-control.monotone.devel/4297
Really, we could swap in any good merge algorithm for scalar values;
probably there are better ones than that particular one, and it's an
area of ongoing research.  But the one there is very simple, better
justified analytically than any other existing merge algorithm, and
has no known fatal flaws.

I'll speak in terms of the new cset code, which has much simpler
representations than the current change_set.cc.

Also, I'm going to assume that we don't support file suturing -- a
file is born exactly once, and in any particular downwards path
through the ancestry, dies exactly once or not at all.  This isn't
because file suturing isn't a nice thing to support -- it is -- but I
don't know how to solve that problem, and there's not much point in
trying to figure it out until there's some reason to be believe that
having solved it at the tree-merging level, we'll be able to sanely
perform content merges on such files.

(Aside from that note, I'm going to completely ignore the problem of
merging text files from here on; it's also a difficult problem, but
if the initial version of the new code still has to fall back on 3-way
for merging file contents, then we're still doing better than we are
now...)

More precisely, we enforce the following lifecycle for files: 
  -- files are born exactly once.  For every file, there is a unique
     earliest revision which contains that file.  Call this birth
     revision B.  Every revision which contains that file is a
     descendent of B.
  -- once files are dead, they stay dead.  If some revision R is a
     descendent of B, and R does not contain the file, that file will
     not be contained in any descendent of R.
(It's potentially possible we could support more complex lifecycles
that include resurrection; the basic primitive you need is just a
binary scalar merging algorithm, and we know roughly how to write such
things.  But let's defer that for now; the above is good enough and
tractable.)

Okay.  Say we have two revisions we want to merge, A and B.  The first
thing we need to do is identify which files in A correspond to which
files in B.  I thought this was really hard, until I remembered that
in the Brave New csets World, a cset is just a mapping between two
trees.  Duh; to find corresponding files, all we have to do is
calculate csets between A and B.  The one danger is things like:

      R
     / \
    S   Q    (I bet everyone's getting really sick
    |\ /|    of this diagram by now)
    | X |
    |/ \|
    A   B

where a file was added in S and exists in both A and B.  If we
calculate a cset via S, we'll get the versions in A and B as
corresponding; if we calculate it via Q, we'll get them as two files
that have no corresponding version in the other revision.

Let's define an interesting boundary set.  The idea is that it's a
(generally small) set of common ancestors that we can calculate our
csets via, and if we do so, we'll know we found all possible
correspondences.  Once we've calculated all such csets, we "union"
their maps in the following sense:
  -- if map A has 'file1 <-> dead' and 'dead <-> file2', and map B has
     'file1 <-> file2', then go with map B. 
  -- otherwise, the maps will be identical (modulo bugs)

It turns out that a sufficiently interesting boundary set is simple:
"all least common ancestors".  In other words, the set of all common
ancestors such that no other common ancestor is a descendent of them.
In other words, what you'd get if you ran erase_ancestors on the set
of common ancestors.

The reason is our super-simple lifecycle model.  Assume that some file
exists in both A and B.  Then it has a unique originating revision R.
By our lifecycle rules above, R must be an ancestor of both A and B.
Thus it is an ancestor of at least one of the least common ancestors
(by definition of "least").  Call this common ancestor C.  The file
exists in both A and B, and C is ancestor of both, so the file must
exist in C, and in every other revision which is between C and either
A or B (again, by the lifecycle axioms).  Therefore, the file is
continuously alive on any changeset concatenation path from A to C,
and from C to A, and thus its identity will be correctly inferred.

Summary of above discussion/proof: if you find all the least common
ancestors of your two revisions to be merged, use _all_ of them to
calculate a changeset between the two revisions, and then combine the
resulting changesets using the simple rule above, you get perfectly
accurate file identification.

(Tangential note: in principle, 'diff' should be using similar tricks
to identify files, or it can give slightly inaccurate results.)

So, now we have a map between files in A and B.  On with merging.

First, we enforce the lifecycle rules.  Every file/directory that is
dead on one side or the either, is dead in the merge.  This makes life
simpler; we can ignore such files from now on.

Now, scan through all files.  For each file/directory, there are a
number of independent scalars:
  -- its basename          }
  -- its parent directory  } [1]
  -- its sha
  -- zero or more user-defined attributes; each unique keyname creates
     such a scalar.
[1] it likely is best to manage (basename, parent directory) as a
    single atomic scalar; the question is whether merging <rename
    foo/a foo/b> with <rename foo/a bar/a> should be a conflict.

For each scalar, check to see whether its values in the two revisions
differ.  (For these purposes, we treat attributes that are undefined
on one side as having the magical value 'undef'; this more-or-less
magically gets attribute lifecycle management right.)  Make a note of
each such scalars.

Now we use these notes to implement our scalar merge algorithm:
  http://thread.gmane.org/gmane.comp.version-control.monotone.devel/4297
The goal is, for each scalar, figure out whether or not it has a
marked ancestor that is closer than the LCA boundary.  

I describe the algorithm for a single scalar.  In reality, one wants
to do this with all scalars simultaneously, keeping a set of scalars
one is tracking, and splitting that set when the search forks.

So we have a single scalar, in two revisions A and B, and we want to
find out whether A or B wins.  Let's define a routine that, given a
revision R, finds *(R).  Actually, that's not quite what it does;
let's also give it another, optional, revision S.  Our routine will
either find *(R), or discover that *(R) is an ancestor of S.
Algorithm: Set our current revision C to R.  Repeat forever:
  -- if C is an ancestor of S, return that *(R) is an ancestor of S.
  -- if C has no parents, return C
  -- if C has 1 parent:
     -- if that parent's value is the same as the value of C, set C to
        that parent and iterate
     -- else, return C
  -- if C has 2 parents:
     -- if both values are different than C, return C
     -- if one parent has the same value and one parent has a
        different value, call this routine to calculate *() of the
        differently-valued parent, with S set to the same-valued
        parent.  If the result is an ancestor of the same-valued
        parent, set C to the same-valued parent and iterate;
        otherwise, return C.
     -- if both parents have the same value, call this routine on
        each, in each case setting S to the other parent.  Three
        possible cases:
        -- if one side clearly wins, set C to it and iterate
        -- if both sides win, return C
        -- if neither side wins, it can only be because they share the
           same *, in which case C's * is also the same.  set C to the
           LCA of the two sides and iterate.
If you found a mark, repeat with B.  If you didn't find a mark, then B
just wins -- because the above email proves that at least one side
always wins.  If B does have a mark, though, that's not a common
ancestor, then this is a conflict.

Most such conflicts are real conflicts.  The exception is if the
scalar in question is a file hash.  File hash conflicts should not be
escalated to the user directly, but should instead invoke the textual
merging algorithm; if _it_ raises conflicts too, _then_ we escalate to
user.  (But by running a scalar merge pass first, we can very often
avoid even touching the (expensive) textual merger; often only one
side of a merge will have even touched a given file, so scalar merging
on the hash alone suffices.)

For the real conflicts, we can either handle them immediately (e.g.,
prompt user, start xxdiff), or we can write down all such conflicts to
deal with later (the latter is probably better, since it's strictly
more general and allows much better interfaces).

Either way, after we've merged all scalars using the above rules, we
need to do one more thing -- scan for what I've been calling
structural conflicts.  It's possible that after we've done the local
scalar merging, we may still have global inconsistencies in our tree
structure.  Such problems include:
  -- multiple items with the same name, in the same directory
  -- circular directory structures
I think that is all, but I'm sure someone will correct me if not.

These sorts of conflicts also need some strategy for resolution; the
comments above apply.

I think that covers everything, in sufficiently exhausting detail.

-- Nathaniel

-- 
So let us espouse a less contested notion of truth and falsehood, even
if it is philosophically debatable (if we listen to philosophers, we
must debate everything, and there would be no end to the discussion).
  -- Serendipities, Umberto Eco




reply via email to

[Prev in Thread] Current Thread [Next in Thread]