[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Monotone-devel] Improving the performance of annotate
From: |
Eric Anderson |
Subject: |
Re: [Monotone-devel] Improving the performance of annotate |
Date: |
Wed, 19 Jul 2006 00:32:34 -0700 |
Daniel Carosone <address@hidden> writes:
> Subject: Re: [Monotone-devel] Improving the performance of annotate
>
> > 1) only parsing the
> > portion of the roster that was relevant to the file being annotated.
>
> my suggestion and preference here would be storing the roster details
> in open sql, or at least caching the relevant details in sql. This is
> similar to the 'per-file DAG' information discussed previously. The
> idea of inventing and storing our own additional indexes when we
> already have a storage layer with these capabilities just seems
> incongruous.
The index would be pretty simple, 4 byte file id, 4 byte offset into
roster, but putting it properly into SQL is definitely the more
elegant path. You could have a per-file, per-revision roster entry,
or you could have a table of (type, name, content_hash, ident, birth,
path_mark, content_mark); the main downside is that you get a lot of
compression out of eliminating the redundant hashes that show up in
the roster text:
> The implications of this on storage size certainly need some
> examination.
./mtn get_roster 341e4a18c594cec49896fa97bd4e74de7bee5827 | gzip -9v | wc -c
==> 61241
./mtn get_roster 341e4a18c594cec49896fa97bd4e74de7bee5827 | perl -e 'use
Compress::Zlib; while(<STDIN>) { if (/^$/o) { print compress($buf,9); $buf =
""; } else { $buf .= $_ } } print compress($buf,9)' | wc ==> 195291
(the second line splits the roster into records and compresses each separately)
195291/61241 = 3.2; which is a substantial size increase; I don't know
how to get the db size used up by the other revision data, but it
could be smaller.
> > 2) skipping the version hash check in database.cc
>
> At the moment, you're skipping the check of the roster as its
> retreived before you start parsing it, right?
I'm skipping the check that roster_id = sha1(roster_data)
> So you save lots of
> checking of rosters you don't necessarily end up using. Furthermore,
> most of your remaining time is spent, uh, "express-parsing" rosters to
> see if they're relevant. If we could find relevant rosters quickly,
> the remaining saving for both changes could be much less significant.
The current annotate code considers every roster to be relevant.
There is a comment in the code about this:
// If we have content-marks which are *not* equal to the current
// rev, we can jump back to them directly. If we have only a
// content-mark equal to the current rev, it means we made a
// decision here, and we must move to the immediate parent revs.
//
// Unfortunately, while this algorithm *could* use the marking
// information as suggested above, it seems to work much better
// (especially wrt. merges) when it goes rev-by-rev, so we leave it
// that way for now.
But indeed, if you could skip the non-relevant rosters it could be a
big win.
> Some thoughts on this to throw into the philosophical debate that may
> follow:
>
> - I'm very supportive of the validate-everything approach taken by
> monotone, the reasons previously stated are and will remain sound.
> It does come at a cost, and some operations may not wish to pay
> that cost or need the assurances it provides.
Me too; I generally consider skipping validation when validation time
is a large fraction of total runtime and the validation isn't doing
much. In this case it's validating that sqllite didn't retrieve the
wrong bits, and that the disk/fs didn't give a silent
corruption. That's why in this case I think that having a option to
disable the check and to turn it off by default for annotate could be
a good choice.
> From: Graydon Hoare <address@hidden>
> Subject: [Monotone-devel] Re: Improving the performance of annotate
>
> [ nifty rewriting of the roster storage approach ]
Any guess as to how difficult that would be? Given your description,
I don't think I would be qualified to make the change. If it would
take a while to make this change, is it worth transitioning through a
relatively ugly hack to make it tolerably fast while work on the
correct fix progresses?
-Eric