[Gnu-arch-users] Re: [PATCH] arch speedups on big trees


From: Chris Mason
Subject: [Gnu-arch-users] Re: [PATCH] arch speedups on big trees
Date: Thu, 08 Jan 2004 18:46:22 -0500

On Thu, 2004-01-08 at 17:51, Miles Bader wrote:
> On Thu, Jan 08, 2004 at 09:27:38AM -0500, Chris Mason wrote:
> > So as soon as the diff program makes output, we open the tmp file?
> > 
> > Because of arch_binary_files_differ(), we are 100% sure diff is going to
> > have output.  So just open the tmp file to start with and make life easy
> 
> I don't know -- the current codebase doesn't seem to use
> arch_binary_files_differ before diffing; maybe your code-base does (but
> honestly, I'm really hoping Tom _doesn't_ merge your branch, because there
> are currently way, way, too many problems with it).
> 

Apparently we're reading different copies of arch_invoke_diff().  I'm
reading vanilla tla-1.1.  As for the mergeability of my patch, we've
gotten a little off track.
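
(For anyone following the quoted exchange above: what I was suggesting
there is roughly the shape below.  This is only a sketch in plain POSIX
C with names of my own, not the actual arch_invoke_diff() code --
the point is just that once arch_binary_files_differ() has said the
files differ, diff is guaranteed to produce output, so the output file
can be opened before diff is even started.)

/* Illustrative sketch only (not tla code): open the output file up
 * front and let diff write straight into it, rather than creating it
 * lazily once diff produces its first output. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "diff -u orig mod" with stdout redirected to out_path.
 * Returns diff's exit status: 0 = identical, 1 = differ, 2 = trouble. */
static int invoke_diff_to_file(const char *orig, const char *mod,
                               const char *out_path)
{
    int out_fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    pid_t pid;
    int status = 0;

    if (out_fd < 0) {
        perror(out_path);
        return 2;
    }

    pid = fork();
    if (pid < 0) {
        perror("fork");
        close(out_fd);
        return 2;
    }
    if (pid == 0) {
        /* child: stdout goes to the pre-opened file, then exec diff */
        dup2(out_fd, STDOUT_FILENO);
        close(out_fd);
        execlp("diff", "diff", "-u", orig, mod, (char *)NULL);
        _exit(2);                      /* exec failed */
    }

    close(out_fd);
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 2;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s orig mod out\n", argv[0]);
        return 2;
    }
    return invoke_diff_to_file(argv[1], argv[2], argv[3]);
}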

1) Do inode signatures actually help performance in their current form?  I
think they make most uses slower (except revision libraries).  For the
sig to help performance, it has to be used as a cache more frequently
than it gets updated.  That either doesn't happen, or only happens
because arch is doing too many inventories anyway.  Once you take out
some of the extra inventories, the inode sigs make less sense.  (There's
a small sketch of the signature check after point 2.)

2) Does a reverse mapping safely allow arch_apply_changeset to skip
whole-tree inventories?  I provided a sample reverse mapping
implementation to help argue that it does.  It's fine if you don't like
the sample implementation; I'd rather discuss the safety of the concept
first.  (A sketch of the id-to-path lookup follows as well.)
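
To make point 1 concrete, here's the sketch I mentioned (struct and
function names are mine, not tla's actual data structures): an inode
signature is basically cheap stat() data cached per file, and the whole
bet is that the check below succeeds more often than the signature data
has to be rewritten.

/* Sketch of the caching idea behind inode signatures: record cheap
 * stat() data per file and treat a later match as "unchanged", so the
 * expensive content work can be skipped.  Illustrative names only. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

struct inode_sig {
    dev_t  dev;
    ino_t  ino;
    off_t  size;
    time_t mtime;
};

/* Record a signature for path; false if the file can't be stat'd. */
static bool inode_sig_record(const char *path, struct inode_sig *sig)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return false;
    sig->dev = st.st_dev;
    sig->ino = st.st_ino;
    sig->size = st.st_size;
    sig->mtime = st.st_mtime;
    return true;
}

/* Return true when the stored signature still matches the file on
 * disk, i.e. a cache hit that lets us skip hashing/diffing the file. */
static bool inode_sig_matches(const char *path, const struct inode_sig *sig)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return false;
    return st.st_dev == sig->dev
        && st.st_ino == sig->ino
        && st.st_size == sig->size
        && st.st_mtime == sig->mtime;
}

int main(int argc, char **argv)
{
    struct inode_sig sig;

    if (argc != 2 || !inode_sig_record(argv[1], &sig))
        return 1;
    /* A real tree walk would do this per file and skip the expensive
     * content work on a hit. */
    printf("%s: %s\n", argv[1],
           inode_sig_matches(argv[1], &sig) ? "cache hit" : "changed");
    return 0;
}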
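And for point 2, the shape of the argument (again just a sketch with
made-up names, not the sample implementation from my patch): keep a
persistent file-id -> path index, and before patching each file named
in the changeset, verify that the recorded path still carries that id;
only fall back to a whole-tree inventory when the check fails.

/* Sketch of the reverse-mapping idea: a file-id -> path index lets
 * apply_changeset find its targets without walking the tree. */
#include <stdio.h>
#include <string.h>

struct id_map_entry {
    const char *file_id;   /* arch file id (explicit tag, implicit, ...) */
    const char *path;      /* path recorded when the index was last built */
};

/* Stub for illustration: a real check would read the file's id the
 * same way inventory does and compare it to file_id. */
static int path_still_has_id(const char *path, const char *file_id)
{
    (void)path;
    (void)file_id;
    return 1;
}

/* Look up the recorded path for a file id; NULL if unknown. */
static const char *id_map_lookup(const struct id_map_entry *map, int n,
                                 const char *file_id)
{
    for (int i = 0; i < n; i++)
        if (strcmp(map[i].file_id, file_id) == 0)
            return map[i].path;
    return NULL;
}

/* Resolve a changeset entry to a path without a tree inventory; a NULL
 * return means the index is stale (rename, lost id, ...) and the
 * caller must fall back to a full inventory. */
static const char *resolve_changeset_target(const struct id_map_entry *map,
                                            int n, const char *file_id)
{
    const char *path = id_map_lookup(map, n, file_id);

    if (path && path_still_has_id(path, file_id))
        return path;
    return NULL;
}

int main(void)
{
    static const struct id_map_entry map[] = {
        { "example-file-id-1234", "src/main.c" },
    };
    const char *path = resolve_changeset_target(map, 1,
                                                "example-file-id-1234");

    printf("patch target: %s\n", path ? path : "(fall back to inventory)");
    return 0;
}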

> Doing a binary comparison before diffing is a solution to this problem,
> but of course ends up reading the files twice.  This sort of thing is
> _normally_ covered up in NFS (and in linux, in local filesystems too) by
> short-term caching, _but_ I'm not really sure how confident I can be about
> this; for instance, what if there are lots of really big files, will only
> parts of them be cached, resulting in redundant reads even when very close
> in time?  Does the added efficiency of not invoking the diff program make
> it worthwhile anyway?  I guess the answer probably depends on what
> filesystem you're using...

How big is big?  For reading a file twice to hurt, the file would have
to be a considerable fraction of system RAM.  The fork+exec cost is
pretty high compared to the overhead of the double read, especially
when you consider that for most commits most of the files are
unchanged.  It's easy to benchmark, though; a rough harness is sketched
below.
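
Something along these lines would do for a first measurement
(throwaway code of mine, not anything from tla): time N in-process
byte-for-byte comparisons of a pair of files against N fork+exec
invocations of "diff -q" on the same pair, and see where the crossover
is on your filesystem.

/* Toy benchmark: in-process double-read comparison vs. fork+exec of
 * "diff -q" on the same pair of files.  Not tla code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Read both files in full and compare them byte for byte. */
static int files_differ(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
    char ba[65536], bb[65536];
    int differ = 0;

    if (!fa || !fb) { differ = 1; goto out; }
    for (;;) {
        size_t na = fread(ba, 1, sizeof ba, fa);
        size_t nb = fread(bb, 1, sizeof bb, fb);
        if (na != nb || memcmp(ba, bb, na) != 0) { differ = 1; break; }
        if (na == 0)
            break;                     /* both hit EOF: identical */
    }
out:
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return differ;
}

/* Run "diff -q a b" with its output discarded. */
static int diff_q(const char *a, const char *b)
{
    pid_t pid = fork();
    int status = 0;

    if (pid == 0) {
        freopen("/dev/null", "w", stdout);
        execlp("diff", "diff", "-q", a, b, (char *)NULL);
        _exit(2);
    }
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 2;
}

static double now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s file_a file_b iterations\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[3]);
    double t;

    t = now();
    for (int i = 0; i < n; i++)
        files_differ(argv[1], argv[2]);
    printf("double read:    %.3fs\n", now() - t);

    t = now();
    for (int i = 0; i < n; i++)
        diff_q(argv[1], argv[2]);
    printf("fork+exec diff: %.3fs\n", now() - t);

    return 0;
}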

-chris





