[Gnu-arch-users] Re: [PATCH] arch speedups on big trees

gnu-arch-users

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gnu-arch-users] Re: [PATCH] arch speedups on big trees

From:	Chris Mason
Subject:	[Gnu-arch-users] Re: [PATCH] arch speedups on big trees
Date:	Fri, 09 Jan 2004 17:16:13 -0500

On Fri, 2004-01-09 at 16:44, Miles Bader wrote:
> On Fri, Jan 09, 2004 at 09:03:06AM -0500, Chris Mason wrote:
> > > Huh?  Inode caches help inventories, but they help changeset-creation even
> > > more.  On source trees I use, changeset-creation _with_ inode caches is
> > > pretty reasonable, but _without_ them, it's a disk-killing operation even
> > > with kernel caching of file contents (because two copies of the tree's
> > > data is [orig+modified] tree is bigger than the RAM available for 
> > > caching).
> > 
> > inode sigs do make tree inventory reads faster, but at the cost of
> > making tree changes much slower.
> 
> The problem is that for large trees, changeset-creation is a _killer_ without
> something like inode-sigs.  On a local disk at least, even doing a full a
> inventory is _extremely_ cheap compared to that (you could probably do
> hundreds of them and still be cheaper than the diffs required by
> changeset-creation without inode-sigs).

Ok, please consider reading the benchmarks I posted, or run some of your
own with my patch.  I'm very interested in examples where it performs
worse than vanilla arch.  All my trees are hard linked, so if you're
looking for performance issues, start with straight copied trees.

Our opinions really aren't that far apart, I completely agree that inode
sigs make reading the tree faster.  I'm saying that vanilla arch reads
the tree so more often than it needs to and has an extremely high cost
for updating the inode sig.  My patches prevent the extra reads, so they
don't need the existing inode sig implementation for performance.

I agree a different inode sig implementation would definitely still
improve things with or without my patches in place.

> > While applying a changeset, change the sigs using an inode sig format
> > that allows for partial updates.  If Tom really doesn't want an indexed
> > database file, the sig checksums and other important metadata could go
> > into the files in ++id-mapping in my current patch.
> 
> Please, not the individual file-per-file implementation!
> 
I'm completely agnostic on the actual implementation of the id mapping,
as long as it isn't O(n) for inserts, deletes and searches. A db file
will be more space efficient, the one file per id will be less complex. 
The decision is Tom's (if he even buys into the reverse mapping at all).

> > > Is a reverse-mapping even necessary?
> > 
> > The early versions of my code didn't have the reverse mapping, and after
> > Tom clued me into some of the issues involved, I took an approach very
> > similar to what you describe below.  The problem is that for changesets
> > that add files, a full inventory would be required to makes sure that id
> > doesn't already exist in the tree.  
> > 
> > The patches that I need to apply frequently add files, so this method
> > was too slow for my usage.
> 
> So all you really need is a big bag of existing ids you can check against for
> conflicts?
> 
Almost.  If someone locally does a tla mv, you want to be able to find
the new location for that id without doing a whole tree scan. So my
patches change tla mv, move, add, delete to update the reverse mapping.

> You can't rely on your DB to catch any conflicts if you're running at a point
> where the user could have changed the tree, so it seems that you've _got_ to
> at least do a single full inventory at the start of a given user-level tla
> command.

See above, all tla commands used to change the tree also update the
mapping.

> 
> Given that, keeping the DB on disk seems pointless.
> 
> Why not just be more careful to maintain a big-bag-of-ids internally to arch,
> based on the first inventory that gets done in a given user command (since
> it's an arch-updated data-structure, this should be no less accurate than
> maintaining an on-disk version, and since it's based on an initial inventory
> instead of a possibly-out-of-date disk DB, it should be more accurate).
> 
> Then use the `changeset based' method I described earlier for everything
> else; if you've already implemented a version of this, it shouldn't be too
> hard...

Because the problem can be solved without reading the whole tree ;-)
There's just no good reason for arch to take significantly longer than
patch for applying the changeset.  

The project tree really is a database, just one stored in the
filesystem.  You've got objects (files, directories) and metadata (inode
information and ids).

Many operations need to connect the id to the file name.  There exists
an O(1) index to retrieve the name of the file's id based on a
manipulation of the file name or storing it internally in the file.

Many operations want to connect the file name to the id.  There exists
an O(N) search operation to find the file corresponding to a given id,
it's called inventory the entire tree.

My patch just adds and maintains a faster index to find the file name
for a given id.  The index could easily be extended to contain other
data such as inode sigs and checksums.

-chris

[Prev in Thread]

Current Thread

[Next in Thread]

[Gnu-arch-users] Re: [PATCH] arch speedups on big trees, (continued)

Prev by Date: Re: [Gnu-arch-users] marking merges in --summary output
Next by Date: Re: [Gnu-arch-users] Re: [PATCH] arch speedups on big trees
Previous by thread: [Gnu-arch-users] Re: [PATCH] arch speedups on big trees
Next by thread: [Gnu-arch-users] Re: [PATCH] arch speedups on big trees
Index(es):
- Date
- Thread