From: Nathaniel Smith
Subject: gzipping analysis (was Re: [Monotone-devel] Re: results of mercurial user survey)
Date: Sat, 29 Apr 2006 18:29:37 -0700
User-agent: Mutt/1.5.11

On Sat, Apr 29, 2006 at 01:17:58PM -0700, Graydon Hoare wrote:
> Timothy Brownawell wrote:
> 
> >Netsync (initial pull, counting both server and client) appears to be:
> >   13% libz (56% compression + 22% decompression + housekeeping)
> >   11% Botan::SHA_160::hash
> 
> It's important to differentiate two classes of gzip and sha1 here. There 
> is a portion we do on the networking stream (compressing and hmac'ing 
> packets) and a portion we do on the database.

Right -- call graph profiles should give us some idea what's happening
here.

> We can give up the gzip on either without much hassle: on the protocol 
> stream it'll just result in more overall transmission (which is probably 
> OK since we're not near wire speed yet) and on the database it'll mean 
> more database storage allocated, which might be ok on the xdeltas but 
> probably is unwise for the fulltexts. Especially full rosters; they're 
> pretty big.

But, I bet you it's the fulltexts that are the entire source of the
problem.

Random datums:
  -- on my laptop, gzip of fully random data is ~10MiB/s, and of pure
     zeros is ~44MiB/s (this measurement uses the command-line
     'gzip', like all measurements in this email; see the sketch
     after this list)
  -- this makes it seem like gzipping network data and delta data
     should be a win, since there are only tens of megabytes involved
  -- especially, the network is going to be slower than gzip for the
     foreseeable future (except in very special cases)
  -- for whatever it's worth, mercurial finds it worth its while to
     uncompress and recompress the deltas it is sending over the
     network
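
For anyone who wants to reproduce the throughput numbers, here is a
rough sketch in Python; zlib is the same deflate algorithm the
command-line 'gzip' uses, so the figures should land in the same
ballpark, though not match exactly:

    import os, time, zlib

    def mib_per_sec(data):
        t0 = time.time()
        zlib.compress(data, 6)          # level 6 is gzip's default
        return (len(data) / 2**20) / (time.time() - t0)

    n = 64 * 2**20                      # 64 MiB of input
    print("random: %5.1f MiB/s" % mib_per_sec(os.urandom(n)))
    print("zeros:  %5.1f MiB/s" % mib_per_sec(b"\0" * n))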

The way things work now, where we store each new version in the db as
a fulltext when it arrives and only later demote it to a delta, we
are compressing _much_ more data than we have to.  So my bet is that
this cost is overwhelming the delta and network gzipping.
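
To make that concrete, here is a sketch of the write path as I
understand it (the names are illustrative, not monotone's actual
functions):

    import zlib

    def xdelta(src, dst):
        return dst       # stand-in for the real delta computation

    def receive_version(db, new_id, new_text, old_id=None):
        db[new_id] = zlib.compress(new_text)        # gzip the new fulltext
        if old_id is not None:
            old_text = zlib.decompress(db[old_id])  # pull the old one back out
            delta = xdelta(new_text, old_text)      # reverse delta: new -> old
            db[old_id] = zlib.compress(delta)       # re-gzip; fulltext erased

So every version's full cleartext goes through gzip on arrival, not
just the (much smaller) deltas.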

In my local db of nvm*, looking just at rosters, I have:
  1.8 MiB fulltexts
  4.6 MiB deltas
  --> 6.4 MiB on disk
those are the compressed, on-disk numbers, of course.  If the same
data is uncompressed, it comes out to:
  8.8 MiB fulltexts
  9.0 MiB deltas
  --> 17.8 MiB cleartext
sending all of this data straight through gzip again takes 1.14s on my
desktop.
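
Those on-disk numbers can be pulled straight out of the db with
sqlite3; a sketch, assuming the roster tables are named 'rosters' and
'roster_deltas' with blob columns 'data' and 'delta' (check '.schema'
against your monotone version first):

    import sqlite3

    db = sqlite3.connect("nvm.mtn")   # whatever your db file is called
    for table, col in (("rosters", "data"), ("roster_deltas", "delta")):
        (total,) = db.execute(
            "SELECT coalesce(sum(length(%s)), 0) FROM %s" % (col, table)
        ).fetchone()
        print("%-13s %5.1f MiB compressed" % (table, total / 2**20))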

However!  The full text of all reconstructed rosters, together, comes
to something like 1200 MiB.

(How I got this number: selecting 100 random revisions with 'select i:
| head -n 100', then calling get_roster on each and piping the output
through wc -c, then multiplying the result by the total number of
revisions in my db, divided by 100.  So this is a statistical
estimate; actually reconstructing everything in random order was too
slow.  Strangely, my db has many more roster deltas than revisions,
so there may be something else going on here; but if anything that
makes this number a lower bound.)
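
In code, the estimate amounts to this; get_roster_text here stands in
for whatever roster-dumping hook is handy, it is not a real monotone
command:

    import random

    def estimate_total_bytes(all_revs, get_roster_text, k=100):
        # sum roster sizes over a random sample of k revisions, then
        # scale up by the total number of revisions
        sample = random.sample(all_revs, k)
        sampled = sum(len(get_roster_text(r)) for r in sample)
        return sampled * len(all_revs) // k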

Because we never write a delta to the db without first writing and
then erasing its corresponding compressed fulltext, this means we are
pumping 1200 MiB through gzip.  At the 10-44 MiB/s rates above, that
is something like 27-120 seconds of pure gzip time.

(A little under twice that much is going through SHA1, since we SHA1
both the roster and the manifest for every revision.  Smarter storage
won't reduce this, though if it turns out to be necessary we might
consider using only a CRC of some sort on rosters, since their hash
is only used to detect disk corruption.  SHA1 is much faster than
gzip, though -- the implementation in monotone clocks in at 50MiB/s on
my laptop (same machine as the gzip numbers above), and openssl clocks
in at 150MiB/s, so even more improvement is possible.  That suggests
a lower bound of 1200 / 150 = 8 seconds, versus a current overhead of
roughly 2400 / 50 = 48 seconds, counting just pure SHA1 cost.  These
numbers will vary with the history in question and the CPU in use, of
course.)
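
To get a feel for the SHA1-vs-CRC gap on your own machine, here is a
quick sketch; Python's hashlib generally wraps an openssl-grade SHA1,
and zlib.crc32 is the kind of cheap checksum I mean:

    import hashlib, os, time, zlib

    data = os.urandom(64 * 2**20)       # 64 MiB

    t0 = time.time()
    hashlib.sha1(data).digest()
    print("sha1:  %6.0f MiB/s" % (64 / (time.time() - t0)))

    t0 = time.time()
    zlib.crc32(data)
    print("crc32: %6.0f MiB/s" % (64 / (time.time() - t0)))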

It would be useful to extend this analysis to files; rosters are only
part of the story.  (Though a significant part, since they're
effectively very large files that change at every single revision.)

> >   6% compute_delta_insns (strangely, this is 99% from compute_delta...
> >which gets .09% of total)
> 
> Yeah, one area worth considering is an explicit xdelta inverter. It can 
> be done, sorta, though it produces less-optimal deltas, and needs one of 
> the input fulltexts anyways. It's still probably a win.

I'm being slow today -- can we use this trick to invert a delta in the
middle of a chain?  I'm not sure it will help much, if we have to
actually reconstruct the base version to use it?
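
For concreteness, here is a toy inverter for a simplified copy/insert
delta format (not monotone's actual xdelta code).  It shows both of
the caveats above: stretches of the base that no copy touches have to
be carried as literal inserts, which is why a fulltext is needed, and
the inverse's copies come out fragmented wherever inserts were
interleaved, which is the "less-optimal" part:

    def invert_delta(delta, a):
        # delta: ("copy", a_off, length) / ("insert", data) ops that
        # rebuild B from A, left to right
        copies, b_off = [], 0
        for op in delta:
            if op[0] == "copy":
                copies.append((op[1], b_off, op[2]))  # (a_off, b_off, len)
                b_off += op[2]
            else:
                b_off += len(op[1])
        copies.sort()
        inverse, pos = [], 0
        for a_off, src, length in copies:
            if a_off > pos:                  # bytes of A absent from B:
                inverse.append(("insert", a[pos:a_off]))  # needs A's fulltext
                pos = a_off
            if a_off + length > pos:         # reuse bytes that survive in B
                skip = pos - a_off
                inverse.append(("copy", src + skip, length - skip))
                pos = a_off + length
        if pos < len(a):
            inverse.append(("insert", a[pos:]))
        return inverse

Note it still takes the A fulltext as an input, which is exactly the
worry above.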

-- Nathaniel

-- 
So let us espouse a less contested notion of truth and falsehood, even
if it is philosophically debatable (if we listen to philosophers, we
must debate everything, and there would be no end to the discussion).
  -- Serendipities, Umberto Eco



