From: Justin Patrin
Subject: Re: [Monotone-devel] Re: results of mercurial user survey
Date: Fri, 28 Apr 2006 08:03:26 -0700

On 4/28/06, Richard Levitte - VMS Whacker <address@hidden> wrote:
> In message <address@hidden> on Fri, 28 Apr 2006 08:29:44 +0200, Jon Bright
> <address@hidden> said:

jon> I'll avoid quoting your whole mail, since I'm not replying to
jon> anything specific.  I see your point about the safety.  Safety is
jon> good, and I like it.  But:
jon>
jon>   - The load of checking the validity of the database being
jon>     shifted to new users doing pulls seems to be the wrong place
jon>     to do it.  Couldn't we instead throw warnings at people on
jon>     start "Your database hasn't been checked for X days!  Run 'db
jon>     check'!"?

> How long will it take before people start simply ignoring it?


Very true.

jon>   - Doing the detection on the user side, we're detecting two
jon>     classes of problems: first, semantic problems in the DB on
jon>     the remote side and second, transmission problems (at
jon>     whichever layer) between remote and user.
jon>
jon>   - The semantic problems aren't under the user's control - if
jon>     you're lucky, he writes a mail about them.  But you're
jon>     relying on something that seems like a pretty poor
jon>     communication path to discover your problems with old data.

> The positive thing with having it done as it is done currently is that
> it's more instantaneous.  Even if some users do not complain, I'm sure
> a sufficiently large (basically, more than 0) percentage will, and
> that should be enough to put responsible people on alert.

Let's say that 100 people pull your database from scratch during the
time that your database is corrupt. Some of them simply won't
understand and will blame monotone. Some more won't understand and
will blame your project (i.e. OpenEmbedded). Some more will simply
lose interest because it's too hard. The remaining few will complain.
I *may* be overstating the number of people who won't say anything,
but it is quite possible that with this kind of workflow you've
alienated any number of potential new users, both to your project and
to monotone, because the responsibility of not only checking but also
knowing what to do about failures is placed squarely on the new
users. This is not a position we want to be in, IMHO.

Perhaps this could be made better by putting some reporting into the
netsync process. When a puller sees a corruption problem, it notifies
the server, and the server notifies the person who owns the server
(this should IMHO come with an option for an e-mail address different
from the one in any key, as some projects, such as OpenEmbedded, do
not use real e-mail addresses for their keys). This would still cause
the new-user problems described above (and I still feel this is a
potential problem), but at least the problem is more likely to be
found on the first bad pull instead of whenever a user decides to let
the owner of the server know.


> The alternative, and this is especially important on servers, is that
> someone will have to spend time reading the logs, or reading some
> report done with "db check".  While that sounds fine and doable,
> experience from a number of sysadmin jobs shows that logs are often
> only checked when a problem has been discovered, or read more and more
> carelessly because it looks like the same damn pattern all the time.
> So we basically get back to discovery.

This is an entirely valid problem. However, the trick is in reporting
failure, not success. Any tool or script that sends an e-mail every
day (week, hour, etc.) saying "Everything OK!" is basically
worthless. In time the people getting such a mail get numbed to it and
are more and more likely to dismiss it (or even delete it out of
hand).

However, such a problem can be solved. If the db check is done in the
background, that means only sending a message when an error is found.
If someone does this through an automated script, it means checking
for error conditions in the output and only then sending a failure
message.
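As a concrete example, something like the following could run from
cron on the server. It is only a sketch under assumptions (the
database path, the contact address, and the assumption that "db
check" exits non-zero when it finds problems), but it shows the
"report failure, not success" shape: no mail at all unless something
is wrong.

import subprocess
import smtplib
from email.mime.text import MIMEText

DB = "/var/monotone/project.mtn"   # assumed database location
ADMIN = "admin@example.org"        # assumed contact address

def main():
    # Run the integrity check; capture everything it prints.
    proc = subprocess.run(
        ["mtn", "--db", DB, "db", "check"],
        capture_output=True, text=True,
    )
    if proc.returncode == 0:
        return  # silence on success; nobody gets numbed by "Everything OK!"
    # Only a failure produces mail, with the check output attached.
    msg = MIMEText(proc.stdout + proc.stderr)
    msg["Subject"] = "mtn db check FAILED for %s" % DB
    msg["From"] = ADMIN
    msg["To"] = ADMIN
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    main()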

<OT>
(I actually wrote a script a little while ago to do such a thing. It
not only checked for error conditions in the output of automated
(daily, etc.) e-mails, but it also checked to make sure that certain
e-mails were received every N time intervals. So if an e-mail came in
with an error condition, a failure message was sent to the admins. If
an e-mail didn't come in in the time that it was supposed to, a
failure message was sent to the admins. It turned out to work very
well.)
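A stripped-down sketch of that kind of watchdog looks roughly like
this; the Maildir location, error keywords, expected senders, and
intervals are all assumptions, not the original script:

import mailbox
import time
from email.utils import parsedate_to_datetime

MAILDIR = "/var/mail/reports"                 # assumed report mailbox
ERROR_WORDS = ("error", "failed", "corrupt")  # assumed failure markers
EXPECTED = {"backup@example.org": 86400}      # sender -> max silence, seconds

def check():
    alerts = []
    last_seen = {}
    for msg in mailbox.Maildir(MAILDIR):
        sender = msg.get("From", "")
        body = (msg.get_payload(decode=True) or b"").decode("utf-8", "replace")
        # Rule 1: an incoming report that mentions an error condition.
        if any(word in body.lower() for word in ERROR_WORDS):
            alerts.append("error condition in mail from %s" % sender)
        date = msg.get("Date")
        if date:
            stamp = parsedate_to_datetime(date).timestamp()
            last_seen[sender] = max(stamp, last_seen.get(sender, 0.0))
    # Rule 2: an expected report that never arrived within its interval.
    now = time.time()
    for wanted, max_age in EXPECTED.items():
        stamps = [t for s, t in last_seen.items() if wanted in s]
        if not stamps or now - max(stamps) > max_age:
            alerts.append("no recent mail from %s" % wanted)
    return alerts  # mail these to the admins only if the list is non-empty

if __name__ == "__main__":
    for alert in check():
        print(alert)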
</OT>


> The thing with a distributed system is exactly that, that it's
> distributed.  If an error occurs somewhere and remains undetected, it
> will be transmitted to everyone involved.  In monotone's case, it would
> be everyone that does a pull.  At that point, it will be very
> difficult to change it, since it might be pushed around, so any
> correction that you do locally might be "corrected" back to the
> erroneous state.

True. And this is where the idea of blacklisted revisions comes up.
This is useful not only for these kinds of corrupt nodes but also for
problem nodes. OpenEmbedded has had several instances of problems
where we had a revision that was nearly impossible to deal with
(read: merge), and so we had everyone kill it locally.
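Checking such a blacklist could even be automated after each pull. A
hedged sketch, assuming a plain file of revision ids is shared
somewhere and that "mtn automate get_revision" exits non-zero for a
revision that isn't in the database (the paths below are invented):

import subprocess

DB = "/home/user/project.mtn"            # assumed local database
BLACKLIST = "blacklisted-revisions.txt"  # assumed shared list, one id per line

def present(revid):
    # Probe the local database for the revision; a non-zero exit is
    # taken to mean "not present" under the assumption above.
    proc = subprocess.run(
        ["mtn", "--db", DB, "automate", "get_revision", revid],
        capture_output=True,
    )
    return proc.returncode == 0

if __name__ == "__main__":
    with open(BLACKLIST) as listing:
        for line in listing:
            revid = line.strip()
            if revid and present(revid):
                print("blacklisted revision %s is in your database; "
                      "consider killing it locally" % revid)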


> From that point of view, it's probably a lot better for any monotone
> instance to detect problems and refuse to store them locally.  That
> makes the problem contained in the database where it resides instead
> of having it propagate all over the place.

jon>   - The transmission problems seem like they should be detectable
jon>     with some simple SHAing.

> I haven't really looked, but I was under the impression that there is
> some kind of HMAC authentication in place.  No?


From what someone said earlier, transmission problems should not
happen. The consistency check is there to make sure the data that was
pulled is correct.
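To make the distinction concrete: a toy illustration of how an HMAC
over the transmitted bytes catches transmission damage (this is not
monotone's netsync code, just the general idea; the key and payload
are placeholders). It says nothing about whether the sender's data was
semantically sound in the first place, which is what db check is for.

import hmac
import hashlib

key = b"session key known to both sides"    # placeholder
payload = b"revision data as sent"          # placeholder

# Sender computes a MAC over the bytes it transmits.
sent_mac = hmac.new(key, payload, hashlib.sha1).hexdigest()

# ...payload and sent_mac travel over the wire...

# Receiver recomputes the MAC; a single flipped bit in the payload
# makes the comparison fail.
received_payload = payload
ok = hmac.compare_digest(
    sent_mac,
    hmac.new(key, received_payload, hashlib.sha1).hexdigest(),
)
print("transmission intact" if ok else "transmission corrupted")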

--
Justin Patrin



