
[Monotone-devel] some observations about roster node IDs


From: Zack Weinberg
Subject: [Monotone-devel] some observations about roster node IDs
Date: Thu, 28 Jun 2007 15:07:47 -0700

Performance testing on the file_path/split_path changes led me to
thinking about node_id lookups.  Some things I've noticed:

regenerate_caches (on a monotone database) spends nearly seven percent
of its runtime in the dynamic_cast<> implementation, downcasting from
node_t to dir_t and/or file_t.
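To make the hot pattern concrete, here is a minimal sketch of the kind
of downcast that dominates the profile. The class names are simplified
stand-ins for monotone's roster node types (the real ones carry file
content hashes, children, attrs, etc.), and I'm using std::shared_ptr
for self-containment where monotone uses boost:

```cpp
#include <cassert>
#include <memory>

// Hypothetical, stripped-down versions of the roster node hierarchy.
struct node      { virtual ~node() {} };
struct dir_node  : node {};
struct file_node : node {};

typedef std::shared_ptr<node>      node_t;
typedef std::shared_ptr<dir_node>  dir_t;
typedef std::shared_ptr<file_node> file_t;

// The pattern that shows up hot: every time directory-specific
// behavior is needed, we pay for an RTTI walk through dynamic_cast.
dir_t downcast_to_dir_t(node_t const & n)
{
  dir_t d = std::dynamic_pointer_cast<dir_node>(n);
  assert(d);  // caller already "knows" n is a directory
  return d;
}
```

The assert is the telling part: the caller almost always already knows
which kind of node it has, so the RTTI check is pure overhead.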

We currently do node ID lookups with a hybrid_map, which is both an
STL map (i.e. a red-black tree) and a hashtable.  Lookups do not show
up in my profiles, but I suspect they would for other workloads.
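For anyone who hasn't read that code: roughly, the idea is to keep
every entry in both containers, so point lookups go through the hash
table while in-order traversal uses the tree. This is only a sketch of
the concept, not monotone's actual hybrid_map (which also shares the
stored values rather than duplicating them):

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>

typedef uint32_t node_id;

// Sketch of the hybrid_map idea: double bookkeeping, so we pay two
// inserts per entry but get O(1) lookup *and* ordered iteration.
template <typename T>
class hybrid_map_sketch
{
  std::map<node_id, T> ordered;            // for in-order traversal
  std::unordered_map<node_id, T> hashed;   // for point lookups
public:
  void insert(node_id k, T const & v)
  {
    ordered[k] = v;
    hashed[k] = v;
  }
  T const * find(node_id k) const
  {
    typename std::unordered_map<node_id, T>::const_iterator i
      = hashed.find(k);
    return i == hashed.end() ? 0 : &i->second;
  }
  typename std::map<node_id, T>::const_iterator begin() const
  { return ordered.begin(); }
  typename std::map<node_id, T>::const_iterator end() const
  { return ordered.end(); }
};
```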

Node IDs are 32-bit integers, so I decided to figure out how dense
they are in the number space.  The attached graphs show what that
looks like; start by looking at nvm.png, which has one dot for each
node ID appearing in each revision found by 'mtn automate select
b:net.venge.monotone'.  Revisions increase vertically, in toposort
order.  You can clearly see some milestones in development; for
instance, the sudden spike in directories near roster #3200 probably
represents the new test system landing.  'all.png' plots the same
thing but for *all* revisions in my local database, alphabetically by
revision ID; this goes some way to explaining the huge gap between
nids 100 and 4000 (give or take) -- revisions not on nvm have taken
that part of the number space.

So this doesn't look all that dense, but the thing to realize is that
most of the gaps are small.  'gaphist.png' shows a histogram of the
base-10 log of the size of all gaps (in the n.v.m data) and
'runhist.png' shows the same histogram but for the runs (i.e.
consecutive in-use numbers).  Note the differing y-axis scales.

----

Some conclusions:

* It would not be totally crazy to use two vectors for the node ID ->
node_t map: one for true node IDs, one for temp node IDs.  (There are
no temp node IDs in any of the graphs, of course.  Temp node IDs start
at 2^31, so clearly we cannot use one vector for both.)  The
true-node-ID vector would have a lot of unused entries, but I suspect
its memory usage would actually be better than what we are currently
doing, and it doesn't cost too much to skip 3000 unused slots in an
iteration over an 8000-element vector.  One does worry about scaling.
Some sort of sparse array structure might work better.

* We should seriously consider separating dir from file nodes so we're
not blowing so much time in dynamic_cast.  I made a stab at this
locally; it gets messy, but not, I think, *too* messy.

* We might want to think about packing node IDs better.  I can think
of several algorithms of varying degrees of cleverness, but suspect
someone else has a better sense of how to do it.
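To make the first conclusion concrete, here is one way the two-vector
map could look. This is only a sketch under the assumptions stated in
the message (temp node IDs start at 2^31, unused slots are tolerable);
the names and the null-slot convention are mine, not monotone's:

```cpp
#include <cstdint>
#include <vector>

typedef uint32_t node_id;
static const node_id first_temp_node = 1u << 31;  // temp IDs start at 2^31

struct node;  // stand-in for the real roster node type

// Two dense vectors indexed directly by node ID, dispatching on the
// high bit to pick the true-ID or temp-ID vector.  Unused IDs hold a
// null pointer; iteration would simply skip them.
struct two_vector_map
{
  std::vector<node *> true_nodes;  // index = node ID
  std::vector<node *> temp_nodes;  // index = node ID - 2^31

  node * find(node_id n) const
  {
    std::vector<node *> const & v =
      (n & first_temp_node) ? temp_nodes : true_nodes;
    node_id i = n & ~first_temp_node;
    return i < v.size() ? v[i] : 0;
  }

  void insert(node_id n, node * p)
  {
    std::vector<node *> & v =
      (n & first_temp_node) ? temp_nodes : true_nodes;
    node_id i = n & ~first_temp_node;
    if (i >= v.size())
      v.resize(i + 1, 0);  // fill the gap with null slots
    v[i] = p;
  }
};
```

Lookups become a mask, a compare, and an indexed load, which is about
as cheap as it gets; the cost is the null slots in the gaps, which the
gap histogram suggests are mostly small.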

Data sets and/or crazy shell one-liners to extract them from your very
own database on request.  Be warned that they get huge; the data set
corresponding to 'all.png' has 12 billion records!

zw

Attachment: nvm.png
Description: PNG image

Attachment: all.png
Description: PNG image

Attachment: gaphist.png
Description: PNG image

Attachment: runhist.png
Description: PNG image

