monotone-devel

Re: [Monotone-devel] Re: results of mercurial user survey


From: Justin Patrin
Subject: Re: [Monotone-devel] Re: results of mercurial user survey
Date: Thu, 27 Apr 2006 18:45:19 -0700

On 4/27/06, Nathaniel Smith <address@hidden> wrote:
On Thu, Apr 27, 2006 at 03:36:20PM -0700, Justin Patrin wrote:
> For those of you who don't know, we've "solved" this problem for
> OpenEmbedded by having a "snapshot" database for download on a system
> which is updated every night. This way, when people are just starting,
> they can grab the snapshot and pull only a few revisions. Having
> someone pull the entire OpenEmbedded database over netsync is really
> problematic, as it takes hours and hours to finish and most people want
> it *now*.

Right, this is basically the "cheating" idea I mentioned --
essentially doing the same thing as pulling over HTTP would do, just
doing it internally inside netsync.  Certainly the idea has occurred
to us (it would be so easy to do!).  Oh, well, I guess I'll write one
of those long responses that may be useful to point to later...


I should have foreseen this... ;-)

> To be fair, 0.26 does look to be faster, but an initial pull is still
> going to take far too long on a decent sized database. IMHO an initial
> pull into an empty DB could be changed to not verify revisions (as
> this is the extremely long and complex part), although if there is a
> possibility for corruption in netsync this could be a problem
> (although I don't see how there could be, assuming it's over TCP).

It's not so much corruption in netsync -- it's corruption _anywhere_.
Monotone, of course, always validates the data it is handling, so
things like checkouts always check that the version being checked out
is fine.  But initial pull is important because it's the only time we
can get away with checking _everything_, and in particular, the bulk
of old history which can go years without being used directly.  (This
is the majority of data, so it's very probable that any accidental
corruption would end up somewhere in it, rather than in the small
percentage of the data that gets commonly hit.)
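The validation described above boils down to content addressing: monotone names each stored object by the SHA-1 of its contents, so a pull can re-verify every object it receives simply by hashing it again. Here is a minimal illustrative sketch in Python; the `verify_file` helper is hypothetical, not monotone's actual API (the real checking is C++ inside netsync):

```python
import hashlib

def verify_file(expected_id: str, data: bytes) -> bool:
    """Recompute the content hash and compare it with the stored id.

    Monotone identifies each file version by the SHA-1 of its contents,
    so any corruption of the bytes changes the hash and is detected.
    """
    return hashlib.sha1(data).hexdigest() == expected_id

data = b"int main() { return 0; }\n"
file_id = hashlib.sha1(data).hexdigest()

# Intact data verifies; a single-byte error (like the RAID incidents
# mentioned above) does not.
assert verify_file(file_id, data)
corrupted = data[:5] + b"\x00" + data[6:]
assert not verify_file(file_id, corrupted)
```

This is why corruption in rarely-touched old history still gets caught on an initial pull: every object, however old, goes through the same hash check.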

This isn't a particularly made-up problem, either.  Lots and lots of
long-running CVS projects have simple disk corruption in their old
history that went unnoticed until backups were gone.  (In particular,
I've heard stories about NFS issues that ended up with random blocks
of null bytes in the middle of the file.)  Monotone has twice had RAID
issues on its shared server that caused single-byte errors in old
files; if not for the checking netsync does, we might _still_ not have
noticed that.  "db check" will also catch this, of course, but it's
rather easy for no-one to think of running that for long periods as
well...

Or, here's what Larry McVoy has to say: "BK is a complicated system,
there are >10,000 replicas of the BK database holding Linux floating
around. If a problem starts moving through those there is no way to
fix them all by hand. This happened once before, a user tweaked the
ChangeSet file, and it cost $35,000 plus a custom release to fix
it."[1]  And this wouldn't be caught by checksumming, either, you need
real semantic validity checking to prevent this source of corruption.
(Similarly, remember: when engineering for reliability, your threat
model includes not just hardware failure and over-eager users, but
_yourself_.  Everyone writes bugs.)
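The distinction drawn here between checksumming and semantic validity checking can be sketched concretely: a hash only proves the bytes are the ones someone wrote, not that they make sense. A hand-tweaked revision hashes perfectly well over its wrong contents; only checking the invariants of the history graph catches it. The revision structure below is hypothetical, purely for illustration, and not monotone's actual format:

```python
def semantic_check(revisions: dict) -> list:
    """Verify a graph invariant: every parent a revision references
    must itself exist in the database. Checksums cannot catch this
    kind of error, because the broken revision hashes correctly."""
    errors = []
    for rev_id, rev in revisions.items():
        for parent in rev["parents"]:
            if parent not in revisions:
                errors.append(f"{rev_id}: missing parent {parent}")
    return errors

graph = {
    "r1": {"parents": []},
    "r2": {"parents": ["r1"]},
    # A "tweaked" revision: its own checksum is fine, but it points
    # at a parent that doesn't exist -- the graph is broken.
    "r3": {"parents": ["r2", "bogus"]},
}
assert semantic_check(graph) == ["r3: missing parent bogus"]
```

Real semantic checking in a VCS validates many more invariants than this (manifests matching file lists, rosters reconstructing cleanly, and so on), but the principle is the same.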

And, well, I don't have $35,000, so...

All very good reasons. Unfortunately, it's hard to tell new users that
this is all for their own good and that they should trust us. The
"slowness of monotone" (note, I'm not one of the slowness critics; I'm
fine with the speed of monotone on a day-to-day basis -- well, it is a
bit slow, but I don't mind because I like what I'm getting) is the
thing that is complained about most of all and, IMHO, is what has
pushed a good number of people away from OpenEmbedded. This is why
there has been this "war of the SCMs" going on: people try the
different SCMs to see if they can find anything which does everything
we want it to.

However, the "war" hasn't, AFAIK, gone very far. A few vocal people
have tried various alternatives and updated tailor to work better for
trying these systems... but most of us just keep plugging away.

What *am* I trying to say here?


Believe me, we're really aware of how much pain the slowness causes on
a day to day basis :-(.  But once you give up safety, there's really
no way to get it back.  I'm not sure how well I would sleep at night if
I were recommending software that could bite you that hard --
especially if I gave up on safety while there were still things to
try.

As it happens, the actual checking proper seems to be almost invisible
in profiles of 0.26's pulls.  There's still some reason to think that
this is all doable.  So, that's why I think we should keep trying the
"improve speed while refusing to give up safety" strategy for now...


I understand all that, and I actually do agree. It's still something
which has to be thought about more, though... old projects are going
to have lots of history, and each individual new user doing a sanity
check on the whole thing is a bit much... especially when it's such a
large bottleneck compared to the actual bandwidth. But then again, we
haven't switched to 0.26 yet, so maybe I'm worrying about something
which is already fixed... and perhaps people should just stop griping
about it and accept that it's done this way for a reason, not just
because.

(This is why I personally like monotone: everything seems to be done
for a good reason, it's its own thing rather than being built on
something else, and there are people working on making it right. Not
just right in an "it works" sense, but in a "here, I'll prove it to
you" sense. :-) )

--
Justin Patrin



