gnu-arch-users


Re: [Gnu-arch-users] arch_inventory_traversal


From: Tom Lord
Subject: Re: [Gnu-arch-users] arch_inventory_traversal
Date: Thu, 23 Oct 2003 10:39:45 -0700 (PDT)



    > From: address@hidden (RedHog (Egil Möller))

    > I have recently been queried about how good Arch (tla) was on
    > handling large inventories. As a test, I imported the Linux
    > kernel tree (about 100Mb), and it turned out a one-file-change
    > took about 20min to commit. This is not acceptable, so I started
    > profiling.

We've certainly seen some reports about rather better performance 
on that very tree but, ok.....  let's look into this.


    > I found out that arch_inventory_traversal was called 6 times with the
    > same root during a normal commit (3 times when making the changeset,
    > and the same number of times when updating the pristine tree), and
    > that it accounts for about 50% of the total running time of a commit.
    > As the inventory(-listing) does not change over a commit, caching the
    > result would be possible, and would save nearly half of the running
    > time (12min is still too much, but much better than 20).

    > My question now is whether someone opposes this, or if there are
    > issues I have overlooked, such as there being commands that actually
    > call arch_inventory_traversal twice with the same root and change
    > the inventory in between?

There certainly are things that can happen that will change the
inventory between calls to arch_inventory_traversal.   In other words,
the _wrong_ change to make would be to modify arch_inventory_traversal
to simply cache its results between calls.

There are many particular code paths where redundant traversals might
occur and where it's easy to prove they're redundant: the _right_
change to make would be to identify those paths as they are noticed
and modify the code to pass a cached inventory along them.

I've done a bit of that already -- no objection to doing more of the
same.
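
To illustrate the difference between a global cache and passing a
cached inventory along one code path, here is a minimal sketch in
Python (not tla's C code). Every function name in it is hypothetical;
the only point is that the caller, who knows the tree does not change
between the steps, traverses once and hands the result down.

```python
import os

def take_inventory(root):
    """Walk root once and return a sorted list of relative file paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full, root))
    return sorted(paths)

def make_changeset(root, inventory):
    # Hypothetical step: uses the inventory it is handed, never re-walks.
    return {"root": root, "files": list(inventory)}

def update_pristine(root, inventory):
    # Hypothetical step: works from the same caller-provided snapshot.
    return len(inventory)

def commit(root):
    # Nothing on this path modifies the tree between the two steps,
    # so a single traversal is provably enough here.
    inventory = take_inventory(root)
    changeset = make_changeset(root, inventory)
    count = update_pristine(root, inventory)
    return changeset, count
```

The cache lives only as long as the call to commit(), so a code path
that does change the inventory between traversals is unaffected.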

There are two places where global caching, rather than passing a
cached inventory along execution paths, would be (and already is, in
some circumstances) acceptable: when you are taking inventory of a
pristine tree or of a revlib tree.  In both cases, there is sometimes
already a cached inventory in a file in the tree and you can just read
that -- not doing a traversal at all.  The trick is to find any
remaining places where a caller isn't using that cache but could be.
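
As a rough sketch of "read the cached inventory if the tree has one,
otherwise traverse", again in Python rather than tla's C: the file
name and its one-path-per-line format below are assumptions made for
illustration, not tla's actual on-disk layout.

```python
import os

# Hypothetical cache file name and format (one relative path per line).
CACHE_NAME = "inventory.cache"

def walk_inventory(root):
    """Full traversal: the expensive path we would like to avoid."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel != CACHE_NAME:
                paths.append(rel)
    return sorted(paths)

def inventory_of_readonly_tree(root):
    """Use the cached inventory when present, else fall back to a walk."""
    cache = os.path.join(root, CACHE_NAME)
    if os.path.exists(cache):
        with open(cache, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]
    return walk_inventory(root)
```

This is only safe for trees that are never edited in place, which is
why it fits pristine and revlib trees and not a working tree.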

Some things that might be interesting to know or experiment with:

*) how much memory does your machine have?  how much disk cache?  cpu & speed?
*) drive type?
*) what's the time for a cold inventory --tags --source --both --all >
   /dev/null?
*) what's the time for a warm inventory --tags --source --both --all >
   /dev/null?
*) what tagging method are you using?
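
One way to collect the two timings is a small wrapper like the sketch
below.  It assumes the executable is named "tla" and is on your PATH;
the flags are the ones listed above.  It does not empty the disk cache
for you: how to get a genuinely cold run is system-dependent, so the
first measurement is only "cold" if you arrange that yourself (for
example, by running it right after a reboot).

```python
import subprocess
import time

# Assumed executable name; the flags are taken from the list above.
INVENTORY_CMD = ["tla", "inventory", "--tags", "--source", "--both", "--all"]

def time_command(argv, cwd=None):
    """Run argv with stdout discarded; return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start

def cold_and_warm(argv, cwd=None):
    """Time the same command twice: the second run benefits from cache."""
    first = time_command(argv, cwd=cwd)
    second = time_command(argv, cwd=cwd)
    return first, second
```

Calling cold_and_warm(INVENTORY_CMD, cwd="/path/to/tree") from a
freshly booted machine would give both numbers in one go.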

That you're spending 12 minutes doing something _other_ than inventory
suggests that you are doing a lot of file comparisons.   Is that what
your profiling shows?

Doing a lot of file comparisons suggests two possibilities: 

1) the inode-signature optimization is not in play for you.
   This may be a side effect of it needing to be extended for the 
   tagging method you use.  (Very soon now, really.)  It might also
   be a symptom of how you ran your tests.

2) many files have, in fact, changed (either in content or in inode
   signature):  and that's not the expected case in a developer
   commit.
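
For readers unfamiliar with the idea behind an inode-signature check,
here is a minimal Python sketch.  Which stat fields tla actually
records is not stated here, so the tuple below (device, inode, size,
mtime) is an assumption, and the function names are hypothetical.  The
point is only that a matching signature lets you skip reading the file.

```python
import filecmp
import os

def inode_signature(path):
    """Cheap fingerprint from stat() alone (assumed field choice)."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)

def is_unchanged(path, recorded_signature, reference_path):
    """Return True if path still matches its reference copy."""
    if inode_signature(path) == recorded_signature:
        # Fast path: signature matches, so no file contents are read.
        return True
    # Slow path: signature differs or was never recorded, so fall
    # back to a byte-for-byte comparison against the reference copy.
    return filecmp.cmp(path, reference_path, shallow=False)
```

When the fast path is never taken, every file in the tree goes through
the byte-for-byte comparison, which would match possibility 1 above.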

-t

p.s.: I really hate performance-tuning-by-proxy.   This blows.   
      I'm sure we could knock this down pretty quickly if I had
      higher bandwidth access to examples of the problem.




