Re: [Monotone-devel] arch web pages


From: graydon hoare
Subject: Re: [Monotone-devel] arch web pages
Date: 23 Aug 2003 13:05:10 -0400
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2

Tom Tromey <address@hidden> writes:

> I looked at the arch web pages a little.  There are a couple
> interesting things there.
> 
>     http://arch.fifthvision.net/bin/view/Arx/GccHackers
> 
> This is a requirements list for a future gcc revision control system.
> It is probably pretty close to gcc consensus (if that exists).

ok. I can respond to these point-by-point, though I probably ought to
put up a page with this as a canned answer too:

GCC CRITERIA:
~~~~~~~~~~~~~

- data integrity guarantees: in design, monotone is better than CVS
  in this regard (transactional, distributed). in practice it is young
  so will have some unexpected bugs. we're moving to self-hosting soon,
  which will likely help shake things out.

- portability: monotone is probably not as portable as CVS, since it
  is written in C++ and uses some modern features, but should be close
  to "as portable" as g++ and boost, which is a fair number of modern
  platforms.

- end-to-end checksumming: monotone uses strong hashes for identifying
  everything; you can't get much more checksummy than monotone.

- anonymous read-only access by UID with only read-only privs: doable.
  "repositories" don't really exist much -- it's a distributed system
  after all -- but both news servers and depots can be set up about as
  read-only as you can imagine. even if someone does "write" where
  they're not allowed, it does no damage to data integrity (which is
  at the ends of the network, in RSA cert validation).

- remote write operations use strong crypto: there is no such thing as
  a "remote write operation" in monotone, generally. but everything
  that touches your database -- even local operations -- uses strong
  crypto. integrity is as strong as SHA1-derived RSA signatures;
  authority is distributed and client-evaluated.

- data cannot be modified by unprivileged users without using the VC
  system: well, it's a file, so you can twiddle its bits. but bit
  twiddling will very likely be noticed by the endless hashing and
  cert checking.

- must be at least as fast as CVS: depends on the operation. I'm
  within an order of magnitude of local RCS when reconstructing file
  versions; remote file access doesn't happen in monotone so there's
  no other comparable algorithm to benchmark against. I expect to
  close the RCS gap a bit more but it certainly "feels" pretty snappy,
  since nearly every operation is local.

- efficient network protocol: all the networky stuff does
  transmissions of size proportional to the deltas. the NNTP
  transmission system is currently lockstep rather than pipelined, but
  the HTTP depot stuff is effectively pipelined (one request + long
  streaming send, each way)

- efficient tags and branches: a tag or branch-making command involves
  adding a single fixed-size cert to your database. it's nearly
  instant. transmitting it to another machine means transmitting a
  few hundred bytes (cryptographic data after all).

- efficient delta storage: delta storage is currently done with
  bring-to-front, so the retrieval time mirrors your access patterns
  (say if you're working on a branch, those branch tips will move to
  the front of the delta store, at a cost of possibly-redundant copies
  of similar heads). but the storage system is totally decoupled from
  all other metadata about ancestry or versions, so you can play with
  the storage algorithm to suit your needs. so long as it can produce
  a version with the right SHA1, it doesn't matter how.

- efficient method of extracting a logical change "after the fact":
  yes. build any 2 manifests, take their setwise difference, fetch all
  the deltas between files which changed in the manifest
  difference. this is the standard way of doing every delta
  computation.

- atomic application of logical change, one changelog msg: yes.

- atomic backout: not yet, but if I add it, sure. all it involves is
  deleting the newly-committed manifest cert describing the state as a
  descendant of its parent. I haven't written a 'backout' command yet
  but there's no design reason not to have one. note, though, that
  such a thing wouldn't back out a change from *other people's
  databases*, if you've transmitted the change already, since the
  system is distributed. I'm considering adding a "nullify" cert to
  indicate "dumb mistake" nodes you wish to backout.

- renames: yes, though in an interesting sense. files don't have
  permanent "inode"-like identities that last past their current
  version. identification of files is done either by pathname
  identification, or SHA1 identification, or explicit certs tying one
  file to another. renaming is only really relevant when exploring
  history to see what committers were intending; mechanical operations
  like checkouts or updates don't really care whether the new file
  version is a "renaming" or "creation", so long as it has the right
  SHA1. when (or "if") monotone or the user fails to notice or
  register a change as a rename, it just safely degrades to a
  delete+add pair, and full file data is transmitted rather than a
  delta, which isn't deadly.

- when merging branch A->B, remember last mergepoint and start from
  there. yes, definitely, this is always how "monotone merge" works.

- single-delta merge: this is also called cherrypicking. I don't have
  a "great" way of doing this now, but it's not outside the range of
  things it's relatively easy to implement. in the worst case you can
  diff the two tree states and pipe that to patch. it's not currently
  as easy as saying "add patch 33, remove patch 45" though.

- perform conflict resolution by formation of microbranches:
  yes. monotone makes no distinction between forks, conflicts and
  branches, save that branches have *names* and are supposed to stay
  forked, whereas forks and conflicts are intended to eventually
  converge.

- should allow different users to generate patches vs. apply them, and
  still smoothly function when the author updates: yes. using SHA1
  values means a file is identified by contents; doesn't matter where
  the contents came from. in fact, the model in monotone is even
  stronger: the unprivileged person generates the patch, and the
  "privileged" person (read: important, trusted) just generates a
  cert which rubber-stamps the patch. then it is automatically applied
  by anyone who trusts that rubber stamp. no "double-committing" stuff.

- efficient on-disk representation: I'm benchmarking against the GCC
  repository. currently I am more space-efficient in delta storage,
  but not as space-efficient overall, as CVS. more like 4 times the
  size. however, most of that is relations between very verbose and
  incompressible cryptographic metadata; I think I can make it much
  smaller. 

  in any case, a major advantage monotone has in this area is that you
  can work off of arbitrary subsets of a database without getting it
  upset -- say "trim all but the last 2 years, and delete everything
  related to ada or fortran" -- and carry that around with you on
  disk. or say "aggregate pre-2.95 versions into large, sparse,
  per-release deltas", and work off that. versions are just SHA1
  codes, and the database is relational; there is no need for
  continuity or completeness in each database.

- generating ChangeLog entries: doable with a lua hook. not presently
  there, but easy enough to add.
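
the "logical change after the fact" item above can be sketched roughly
as follows. this is only an illustration, not monotone's code: it
assumes a manifest can be modelled as a set of (pathname, SHA1) pairs,
and the names file_id, build_manifest and logical_change are made up
for the example.

```python
import hashlib


def file_id(data):
    """Identify file contents by their SHA1, as the text describes."""
    return hashlib.sha1(data).hexdigest()


def build_manifest(tree):
    """Hypothetical manifest: a set of (pathname, sha1) pairs."""
    return {(path, file_id(data)) for path, data in tree.items()}


def logical_change(old, new):
    """Setwise difference of two manifests, split by what happened."""
    gone = dict(old - new)    # entries present only in the old manifest
    fresh = dict(new - old)   # entries present only in the new manifest
    changed = sorted(gone.keys() & fresh.keys())   # same path, new sha1
    deleted = sorted(gone.keys() - fresh.keys())
    added = sorted(fresh.keys() - gone.keys())
    return changed, added, deleted


old = build_manifest({"a.c": b"one", "b.c": b"two", "c.c": b"three"})
new = build_manifest({"a.c": b"one", "b.c": b"two!", "d.c": b"four"})
changed, added, deleted = logical_change(old, new)
```

only the paths in "changed" would need a delta fetched; an unnoticed
rename shows up here as one entry in "deleted" plus one in "added",
which matches the delete+add degradation described in the renames item.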

LINUX KERNEL CRITERIA (excluding stuff already mentionned):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- advanced merge conflict tool: monotone does a merge3 from the least
  common ancestor (last mergepoint) and can drop you into an external
  merge tool (via lua hook) of your choice if that fails. it knows how
  to invoke ediff and xxdiff. I haven't written my own GUI for this. 

- remote branch repositories: not clear what this means, but monotone
  is fully distributed and branches can be made by anyone, at any time,
  on any machine, connected or disconnected.

- per-file checkin comments: if you like, yes. you can attach changelog
  certs to file versions, manifest versions, or both.

- storage of select inode metadata: you can extend the cert vocabulary
  with anything you like (links, pipes, devices, owners, ...) but
  you'll have to add hooks to interpret these values since they have
  system-specific meaning. the default file metadata vocabulary is a
  "portable interpretation" of merely file pathnames and their
  contents.

- "dontcommit file.c" to mark a private change: .. uh .. doable, but
  it's completely a UI issue. I haven't added that. is it really
  important?

- disconnected / distributed repositories: yes.

- ability to exchange changesets by email: yes, by any transport.

- patch splitting: eh.. perhaps. not obvious what to consider a
  splittable entity. if you can select inbetween-version-codes, I can
  certainly split edges on those. if not, it's not clear where to put
  the boundary. if you *do* split a patch / changeset though, the
  system will automatically identify the endpoints of the "one big
  patch" with the endpoints of "lots of little bitty patches", since
  they have the same SHA1 either way you arrive there. so splitting
  or aggregating doesn't break other people's work.

- archival of directories: no. I don't archive empty directories. I
  archive pathnames of files. directories are implied by file
  pathnames. I didn't feel like it was terribly worth writing code to
  manage the coming and going of empty directories -- do you really
  want your version to be considered different from mine just because
  it has an extra *empty* directory? -- but in theory it can be added
  with no fuss, just don't see a strong reason to care.

- a magical bk usage story about smooth and easy pushing and pulling
  and exchanging stuff with linus: not sure. I haven't used it with
  linus yet :) the theory is that this sort of scenario will work, but
  who knows about the practice? let's try.
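
the patch-splitting point above (endpoints are identified by SHA1, so
splitting or aggregating lands on the same version) can be illustrated
with a toy model. everything here is an assumption for the sake of the
example: a "patch" is just a dict of line index to replacement text,
and version_id and apply_edits are hypothetical helpers, not monotone
commands.

```python
import hashlib


def version_id(lines):
    """Name a version purely by the SHA1 of its contents."""
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def apply_edits(lines, edits):
    """Apply a toy patch: a dict mapping line index -> replacement line."""
    out = list(lines)
    for index, text in edits.items():
        out[index] = text
    return out


start = ["int x;", "int y;", "int z;"]

# one big patch
big = {0: "long x;", 2: "long z;"}
via_big = apply_edits(start, big)

# the same change split into two little patches
little = [{0: "long x;"}, {2: "long z;"}]
via_little = start
for patch in little:
    via_little = apply_edits(via_little, patch)

same_endpoint = version_id(via_big) == version_id(via_little)
```

the intermediate version on the split path gets its own id, but both
routes begin and end at the same ids, which is why someone working
from either endpoint would not see a difference.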

> I've talked to Graydon a bit about merging.  I suppose these different
> things in arch -- star-merge, replay -- are just different ways of
> deciding how to apply patches when merging.  I think that could all be
> done, in theory.

suppose we have P=parent, US=our working copy, OTHER=some other
change. "arch update" applies diff(P,US) to OTHER and writes the
result into US. "arch replay" applies diff(P,OTHER) to US. as near as
I can tell "star-merge" does a 3-way merge using the fact that US and
OTHER share P as a parent.

monotone always does a 3-way merge when it can find a parent,
regardless of branch boundaries or anything; otherwise it does a 2-way
merge. 3-way merge is the generalization (and correction) of both
"replay" and "update" described above: it means taking X=diff(P,OTHER)
and Y=diff(P,US), adjusting all the coordinates of edits in Y so that
they are made in terms of the coordinates after X, and applying the
adjusted Y to OTHER. I don't know why tom lord has chosen to implement update
operations using weaker merge operators when there's a known parent;
replay is strictly *more* likely to fail than a merge3, since it's
attempting to apply patches to blocks of data which may be in new
places. maybe with unidiff context matching and a sloppy "patch"
program (hashing lines, accepting fuzz-factors) it will often work,
but why? you have the parent; you ought to use it.
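
to make the 3-way merge idea concrete, here is a minimal line-based
sketch. it is not monotone's merge3: it assumes versions are lists of
lines, uses python's difflib to find the stretches where P, US and
OTHER all agree, and resolves each gap between those stretches by
asking which side changed it. the names sync_regions and merge3, and
the choice to keep our side on a conflict, are assumptions made for
the example.

```python
from difflib import SequenceMatcher


def sync_regions(base, ours, other):
    """Return (base_lo, base_hi, ours_lo, other_lo) for every stretch of
    base that is unchanged in both ours and other, plus an end sentinel."""
    ma = SequenceMatcher(None, base, ours, autojunk=False).get_matching_blocks()
    mb = SequenceMatcher(None, base, other, autojunk=False).get_matching_blocks()
    regions, ia, ib = [], 0, 0
    while ia < len(ma) and ib < len(mb):
        abase, amatch, alen = ma[ia]
        bbase, bmatch, blen = mb[ib]
        lo = max(abase, bbase)
        hi = min(abase + alen, bbase + blen)
        if lo < hi:  # overlap: these base lines survive on both sides
            regions.append((lo, hi, amatch + lo - abase, bmatch + lo - bbase))
        if abase + alen < bbase + blen:
            ia += 1
        else:
            ib += 1
    regions.append((len(base), len(base), len(ours), len(other)))
    return regions


def merge3(base, ours, other):
    """Merge ours and other relative to base; returns (merged, conflicts)."""
    merged, conflicts = [], []
    pb = pa = po = 0
    for lo, hi, a_lo, o_lo in sync_regions(base, ours, other):
        base_chunk = base[pb:lo]
        our_chunk = ours[pa:a_lo]
        other_chunk = other[po:o_lo]
        if our_chunk == other_chunk or other_chunk == base_chunk:
            merged += our_chunk          # same edit, or only we changed it
        elif our_chunk == base_chunk:
            merged += other_chunk        # only the other side changed it
        else:
            conflicts.append((our_chunk, other_chunk))
            merged += our_chunk          # assumption: keep ours, report it
        merged += base[lo:hi]
        pb, pa, po = hi, a_lo + (hi - lo), o_lo + (hi - lo)
    return merged, conflicts


P = ["a", "b", "c", "d", "e"]
US = ["a", "B", "c", "d", "e"]
OTHER = ["a", "b", "c", "d", "E"]
merged, conflicts = merge3(P, US, OTHER)
```

because both diffs are taken against P, an edit on one side never has
to be located by guessing inside the other side's moved text, which is
the weakness of replaying a patch onto shifted data.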

anyways, if you develop a magical better merge operator, I don't think
it'll be hard to wedge it into monotone. I already have a lua hook for
handling a failed merge3. I will likely add one for overriding the
initial attempt at merge3 and supplying your own, if you've a
preference (eg. a ChangeLog merger or something). since you're always
doing merge work on your local database, you should feel reasonably
comfortable tinkering with this stuff; it won't disrupt other users
if you play with custom merge operators on your own.

> Does monotone handle file permissions and symlinks well?  Those are
> actually useful to handle.

no, it doesn't. by default I didn't want to add an interpretation of
these, as they don't strike me as either (a) terribly important or (b)
terribly portable (they may have variable semantics on different
platforms). maybe it's not a hard thing to add -- either some new
certs or a change to the manifest format -- but my aim is to err on
the side of simplicity at this stage. same reason I'm not handling
empty directories at the moment. I don't see it as "in general
demand". feel free to add this if you think it's a big feature.

-graydon




