monotone-devel

Re: [Monotone-devel] Hash collisions resiliency


From: William Uther
Subject: Re: [Monotone-devel] Hash collisions resiliency
Date: Thu, 14 Apr 2005 09:49:25 +1000


On Wed, 13 Apr 2005 15:31:46 -0700, Nathan Myers <address@hidden> wrote:

On Thu, Apr 14, 2005 at 08:10:21AM +1000, William Uther wrote:
  I'm a little confused here.  People are saying that hash collisions
are hard to detect - that you'd have to do a full-text check of all the
files.  That doesn't seem right.  Couldn't you just have a table of
hashes in the database.

No.  Suppose you add a file to your repository, and I add one to
mine.  Unbeknownst to us, they hash the same.  When we reconcile our
repositories, there are now two files with the same hash.

Yes, and later in my email I wrote:

   Of course, you might have a distributed hash collision - two
machines, each with a distinct file, but the files have the
same hash.  This would not be detected until the data was sent between
the machines, but it could be trivially detected at that point.

Of course, as I elaborate below, this detection is not as trivial as I thought in all cases...

On Wed, 13 Apr 2005 23:39:53 +0100, Bruce Stephens <address@hidden> wrote:

William Uther <address@hidden> writes:

   Couldn't you just have a table of hashes in the database.  When
   you add a new hash, you add it to the table.  If it is already
   there, *THEN* you do a full-text check of the two files with the
   matching hashes.

Which we already have, of course (which is the whole point).  There
are several tables, but that's OK: it's not a problem if a file
version happens to have the same hash as a changeset, I think?

That's what I thought, which is why I was surprised that people were saying there was a problem with hash collisions.

   If the two files actually are the same, then you're ok.  If the
   two files are different, you have a problem, *and* you have
   detected it.

I suspect the problem is that things with the same hash as something
that's already there are very common: most commits change a few files,
and you know which files have changed by checking their hashes.  And
checking the contents instead is likely to be expensive enough that
you wouldn't want to do it.

When looking for changed files in your working copy, you don't want to look at the file contents at all in the common case. You should use file timestamps - only check file contents or hashes when a timestamp changes or you are asked to do an explicit check.
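A minimal sketch of that timestamp shortcut (in Python; the function and cache layout are mine for illustration, not monotone's actual code):

```python
import hashlib
import os


def changed_files(paths, cache):
    """Return the paths whose content hash differs from the cached hash.

    `cache` maps path -> (mtime, sha1 hex digest).  A file whose
    timestamp is unchanged is assumed unmodified and is never read,
    so the expensive hash only runs when the mtime has moved.
    """
    changed = []
    for path in paths:
        mtime = os.stat(path).st_mtime
        cached = cache.get(path)
        if cached is not None and cached[0] == mtime:
            continue  # common case: timestamp unchanged, skip hashing
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        if cached is None or cached[1] != digest:
            changed.append(path)
        cache[path] = (mtime, digest)
    return changed
```

The second call with an unchanged file returns nothing without reading the file at all, which is the whole point of the shortcut.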

However, over a network, when syncing two databases, you will have this issue. If the hashes of two different diffs, one on each machine, happen to be the same, then each machine might believe it already has the other's diff simply through the exchange of hashes, without the diffs themselves ever being transferred.
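That failure mode can be demonstrated with a toy hash weak enough to collide on purpose (the dict-of-hashes "protocol" here is a stand-in for illustration; monotone's netsync and SHA-1 are far more involved):

```python
def toy_hash(data):
    # Deliberately weak hash (byte sum mod 256) so a collision is easy
    # to construct; with SHA-1 a collision is merely possible, not easy.
    return sum(data) % 256


# Two machines each hold one diff, indexed by its hash, as in a
# sync-by-hash-exchange protocol.
diff_a = b"delete line 3"
diff_b = b"3 enil eteled"  # different content, same bytes in a different order
assert diff_a != diff_b
assert toy_hash(diff_a) == toy_hash(diff_b)  # the collision

machine_a = {toy_hash(diff_a): diff_a}
machine_b = {toy_hash(diff_b): diff_b}

# Each side compares advertised hashes against its own table and
# concludes it is missing nothing, so the diffs are never transferred.
missing_on_a = [h for h in machine_b if h not in machine_a]
missing_on_b = [h for h in machine_a if h not in machine_b]
assert missing_on_a == [] and missing_on_b == []
# ...yet the two databases still hold different content.
```

Nothing in the hash exchange alone can tell the two cases apart; only transferring (or comparing) the actual bytes would reveal the divergence.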

I don't think this is a big issue - rsync has the same problem. Moreover, a hash collision here will lead to strange things happening in the recently synced information, which is probably near head and about to be used, so the corruption will be discovered quickly. It isn't going to silently corrupt old revisions that rarely get checked, right?
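For completeness, the table-of-hashes check proposed earlier in the thread - full-text comparison only when a hash is already present - can be sketched like this (names are mine, not monotone's schema):

```python
import hashlib


def add_file(store, content):
    """Insert content keyed by its hash, comparing full text only on a hit.

    `store` maps sha1 hex digest -> file contents.  Returns True on
    success; raises only in the genuine-collision case (same hash,
    different bytes), which is the only time the expensive full-text
    comparison actually runs.
    """
    digest = hashlib.sha1(content).hexdigest()
    existing = store.get(digest)
    if existing is None:
        store[digest] = content  # common case: new hash, no comparison needed
        return True
    if existing == content:
        return True              # same file already stored: harmless
    raise RuntimeError("hash collision detected for " + digest)
```

Within a single database this detects a collision at insert time; as noted above, it is the hash-only network exchange where the check never gets a chance to run.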

Cheers,

Will          :-}

--
Dr William Uther                           National ICT Australia
Phone: +61 2 9385 6357               Computer Science and Engineering
Email: address@hidden          University of New South Wales
Web: http://www.cse.unsw.edu.au/~willu/     Sydney, Australia




