
Re: [Adonthell-devel] Distributed editing


From: Kai Sterker
Subject: Re: [Adonthell-devel] Distributed editing
Date: Mon, 27 Jun 2011 23:32:40 +0200

On Tue, Jun 21, 2011 at 8:51 PM, Kai Sterker <address@hidden> wrote:

> But here's the problem: if I create some connector templates and
> somebody else does as well, they end up having the same ids and
> things will get ugly when we both try to commit and merge our
> connector template data file.
>
> My idea would be to instead get a hash of, say, the host name where
> the editor runs and XOR that with the simple counter to get values
> that are locally and globally unique. Sounds about right?

[snip]

> I guess my main problem is that I don't really have any experience
> with hashes and such. Will a short string such as the host name
> provide unique enough hashes that XORing with a counter will still
> avoid collisions? How long should the hash be? 32 bits would be
> convenient, and should be plenty. Or play it safe and use 64 bits?
> And is the XOR such a good idea, or would it be better to have the
> hash as a prefix and then additional bits for the counter?
>
> So if anyone has any insights to share, that would be appreciated.
> Otherwise I guess I have some reading to do, since the last thing I
> want to do is change the id generation algorithm after a lot of
> models and stuff have been created.

Well, I did some reading and experimented a bit, and I'd like to
share the results.

First off, there's a standard for globally unique ids: UUIDs [1]. But
these are 128 bits in size and therefore too long as far as I am
concerned. So I looked further into hashes and found that the
probability of a collision depends greatly on the number of distinct
input values relative to the number of possible hash values [2].
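
Just to put numbers on that, here's a quick back-of-the-envelope check
using the standard birthday approximation p ~= 1 - exp(-k*(k-1)/(2*N)).
This is my own toy program, not one of the attached tests, and it
compiles with the same -std=c++0x flag as those:

   // birthday.cc -- rough collision probability for k random 32 bit ids
   #include <cmath>
   #include <cstdio>

   int main ()
   {
       const double N = 4294967296.0;   // 2^32 possible hash values
       const double counts[] = { 1000.0, 5000.0, 50000.0, 500000.0 };

       for (double k : counts)
       {
           // birthday approximation for k ids drawn from N possible values
           double p = 1.0 - std::exp (-k * (k - 1.0) / (2.0 * N));
           std::printf ("%8.0f ids -> collision probability %.4f\n", k, p);
       }

       return 0;
   }

A few thousand ids give a collision chance well below one percent,
while a few hundred thousand push it towards certainty.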

So that got me thinking about how many connectors or models we will
really have, and how many distinct people will be working on them.
Honestly, though I'd wish for many more, I doubt the latter will be
more than a handful, and the former will probably be in the low
thousands, distributed more or less evenly among the few
contributors.

That led me to the conclusion that a 32 bit hash should be
sufficient. So I picked two random implementations [3][4] and wrote a
little test program that generates 1000 distinct strings and, for
each string, integers in the range [0; 16384[. It then hashes the
string, XORs the hash with the integer, and checks the result for
duplicates among the prior results. For comparison, instead of XORing
the integer with the hash, I also used the integer as a seed for the
hashing. This ought to simulate 1000 people generating ~16000 files
each, much more than expected. I then tweaked the numbers and
repeated the exercise. Here are a couple of the results:

Hash | "People" | "Files" | Collisions (seed) | Collisions (xor)
================================================================
[3]  |     1000 |   16384 |             30997 |           16384
[4]  |     1000 |   16384 |             35164 |           16384
----------------------------------------------------------------
[3]  |      100 |  163840 |             31043 |          131072
[4]  |      100 |  163840 |             33574 |               0
----------------------------------------------------------------
[3]  |       10 |    8192 |                 2 |               0
[4]  |       10 |    8192 |                 0 |               0
----------------------------------------------------------------
[3]  |    10000 |    4096 |            194736 |          221184
[4]  |    10000 |    4096 |            219526 |          212992
----------------------------------------------------------------

It seems the xor generally fares better, at least when there are few
contributors. In the more realistic case with very few contributors
and few files, there's hardly any difference.
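
For reference, the attached programs do roughly the following. This is
only a simplified sketch of the xor variant, with a placeholder FNV-1a
hash standing in for the implementations from [3] and [4]:

   // collision_test.cc -- count duplicate ids for the xor variant
   #include <cstdint>
   #include <cstdio>
   #include <string>
   #include <unordered_set>

   // placeholder hash (FNV-1a), not the one from [3] or [4]
   static uint32_t hash32 (const std::string & s)
   {
       uint32_t h = 2166136261u;
       for (char c : s) { h ^= (unsigned char) c; h *= 16777619u; }
       return h;
   }

   int main ()
   {
       const int PEOPLE = 1000;     // distinct "host names"
       const int FILES  = 16384;    // running counter per host

       std::unordered_set<uint32_t> seen;
       long collisions = 0;

       for (int p = 0; p < PEOPLE; p++)
       {
           char host[32];
           std::snprintf (host, sizeof (host), "host-%d", p);
           uint32_t base = hash32 (host);

           // xor the hash of the "host name" with the running counter
           for (uint32_t f = 0; f < (uint32_t) FILES; f++)
               if (!seen.insert (base ^ f).second) collisions++;
       }

       std::printf ("%d people x %d files -> %ld collisions\n",
                    PEOPLE, FILES, collisions);
       return 0;
   }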


That's certainly not exhaustive, but my feeling is that we should be
okay with a 32 bit hash for both connectors and map objects. The
trick will be to keep the running number low, which should be easy
for the connectors, as they will all be stored in one file. For map
objects, which are distributed over a tree of directories, it will be
more difficult. Gotta think of something there.
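
To make the scheme itself a bit more concrete, the id generation I
have in mind looks roughly like the sketch below. hash32(), make_id()
and the in-memory counter are placeholders for illustration, not the
actual editor code; in practice the counter would have to be
persisted with the editor data:

   // id_gen.cc -- sketch of "hash of host name xor running counter"
   #include <cstdint>
   #include <cstdio>
   #include <string>
   #include <unistd.h>     // gethostname

   // placeholder hash (FNV-1a); the real thing would be [3] or [4]
   static uint32_t hash32 (const std::string & s)
   {
       uint32_t h = 2166136261u;
       for (char c : s) { h ^= (unsigned char) c; h *= 16777619u; }
       return h;
   }

   // locally unique and hopefully globally unique 32 bit id
   static uint32_t make_id (uint32_t & counter)
   {
       char name[256] = "localhost";
       gethostname (name, sizeof (name));
       return hash32 (name) ^ counter++;
   }

   int main ()
   {
       uint32_t counter = 0;
       for (int i = 0; i < 5; i++)
           std::printf ("id %d = %08x\n", i, (unsigned) make_id (counter));
       return 0;
   }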

Anyway, I'll attach the two test programs. Feel free to experiment a
bit. Compile with

   g++ hash32.cc -std=c++0x -o hash32
   g++ hashsf.cc -std=c++0x -o hashsf

Thoughts are still welcome. However, I'm finally back in the mood for
programming, so don't wait too long with your input.

Kai

[1] http://en.wikipedia.org/wiki/Universally_unique_identifier
[2] http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
[3] http://burtleburtle.net/bob/hash/evahash.html
[4] http://www.azillionmonkeys.com/qed/hash.html

Attachment: hash32.cc
Description: Text Data

Attachment: hashsf.cc
Description: Text Data

