

From: Gordan Bobic
Subject: Re: [Help-gsl] seeking highly efficient caches scheme for demanding engineering computing?
Date: Fri, 27 Jul 2007 09:20:42 +0100 (BST)

On Fri, 27 Jul 2007, Michael wrote:

> To save the time of costly function evaluation, I want to explore the
> possibility of caching.

LOL! Would you believe I wrote about this in my previous email, before I had even read this? :-)

> Suppose in millions of calls to the function evaluation, there are some
> computations in which only a portion of the parameters change; the others
> remain the same as parameters used in some earlier function evaluations.
> To save time, I can group the varying parameters into sub-groups and cache
> the results for later use. This needs a highly efficient cache and an
> efficient organization and look-up. Also, the decision of whether a group
> of parameters has been seen before should be based on the data in binary
> form, instead of decimal format, so I can compare chunks of memory directly
> using memory comparison.

I know a lot of people are going to choke when I suggest this, but you CAN compare floats for exact equality with good reliability and success when it comes to cache look-ups - the worst a spurious mismatch can do is cost you a recompute.
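For the binary comparison Michael describes, a straight memcmp() over the parameter block does the job. A minimal sketch (SameParams is just a name I made up):

#include <cstring>

// Compare two groups of parameters bit-for-bit. Values produced by the
// same sequence of operations compare equal here; note that +0.0 and -0.0
// have different bit patterns, which only costs an occasional recompute.
bool SameParams (const float *a, const float *b, unsigned int n)
{
        return memcmp (a, b, n * sizeof (float)) == 0;
}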

I find that the best strategy is to use two levels of caching. Level 1 will typically store the Most Recently Used (MRU) sub-expression. If you organize your loops so that the parameter of the inner-most sub-expression changes in the outer-most loop, and the parameter of the outer-most sub-expression changes in the inner-most loop, you will get extremely good results from the MRU caching scheme.

In your evaluation function, do something like:
// Assuming function a * (bX + c) + d

float Eval (float a, float b, float c, float d)
{
        static unsigned int     x;
        static float            xx;
        static float            CacheBX[1024];
        static float            CachedB;
        static float            CacheBXC[1024];
        static float            CachedC;

        bool CacheBDirty = false;
        bool CacheCDirty = false;

        if (CachedB != b)
        {
                CacheBDirty = true;
                CachedB = b;            // remember which b the table was built for
                for (x = 0, xx = 0; x < 1024; x++)
                        CacheBX[x] = b * xx++;          // b * X for X = 0..1023
        }

        if (CacheBDirty || CachedC != c)
        {
                CacheCDirty = true;     // the next stage (for a and d) would check this
                CachedC = c;
                for (x = 0; x < 1024; x++)
                        CacheBXC[x] = CacheBX[x] + c;   // bX + c
        }

        // etc. You get the idea.
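The calling loops are then nested so that b changes least often and d most often, which keeps the cached tables valid for as long as possible. A sketch of what I mean (the parameter ranges are made up):

float Result = 0;
// b drives the inner-most sub-expression (bX), so it changes in the
// outer-most loop; d only affects the final add, so it can change freely
// in the inner-most loop without invalidating any cached table.
for (float b = 1; b <= 10; b++)                 // CacheBX rebuilt 10 times
        for (float c = 1; c <= 10; c++)         // CacheBXC rebuilt 100 times
                for (float a = 1; a <= 10; a++)
                        for (float d = 1; d <= 10; d++)
                                Result += Eval (a, b, c, d);    // cache hit almost every call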


You can improve the cache effectiveness by truncating your floats a bit to cope with rounding errors that may creep in, depending on what the rest of your code does. It will also depend on how much precision you actually need. For example, to truncate your floats down to about 5 significant figures from the roughly 7 a 32-bit float normally carries, you could do something like:

*((unsigned int*)&b) &= 0xFFFFFF00; // clear the low 8 mantissa bits: ~16-bit precision, ~5 significant figures

Don't put this in your evaluator function - do it in the outer loops, as soon as you have the new value for b. There is no need to do it on every call; it only needs to be done once, when the value changes.
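If the pointer cast upsets your compiler (strict-aliasing warnings), the same masking can go through memcpy(). A sketch, assuming 32-bit floats and unsigned ints, just as the cast above already does (TruncateFloat is a name I made up):

#include <cstring>

// Mask off the low 8 mantissa bits of a float, leaving roughly 5
// significant decimal digits. Same effect as the cast above, but safe
// under strict aliasing rules.
float TruncateFloat (float f)
{
        unsigned int bits;
        memcpy (&bits, &f, sizeof (bits));
        bits &= 0xFFFFFF00;
        memcpy (&f, &bits, sizeof (f));
        return f;
}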

You can extend this further with a level 2 cache. If you do this (and it may or may not be effective, depending on your problem, your code and your compiler), make your L1 cache a pointer into the array held by the L2 cache. You will still get the full speed of the L1 cache, without duplicating the current MRU data (remember what I said about data fitting in the CPU caches?).

Don't malloc()/new your data arrays. Allocate a fixed number of data slots and round-robin through them. For example, if you are keeping the 4 most recently used entries and you round-robin through the cache, remember that you don't need to check the MRU entry (that's in your L1 cache), so you only need to check 3 out of the 4. When you do a mod operation to wrap the cache around, don't use MRU % 4, use MRU & 0x3 (this is a good reason to use caches of power-of-2 size! :-) ). See the sketch below.
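To make that concrete, here is a minimal sketch (all names made up) of a 4-slot round-robin L2 cache keyed on b, with the MRU slot acting as the L1 through a plain pointer:

#define L2_SLOTS 4                      // power of 2, so wrap-around is a mask

static float L2Key[L2_SLOTS];           // the b value each slot was built for
static float L2Data[L2_SLOTS][1024];    // pre-computed b * X tables
static unsigned int MRU = 0;            // index of the most recently used slot

// Return a pointer to the cached table for this b, rebuilding on a miss.
// The "L1 cache" is just the MRU slot; no data is duplicated.
float *LookupBX (float b)
{
        if (L2Key[MRU] == b)                            // L1 hit: MRU slot still valid
                return L2Data[MRU];

        for (unsigned int i = 1; i < L2_SLOTS; i++)     // check the other 3 slots
        {
                unsigned int slot = (MRU + i) & (L2_SLOTS - 1);
                if (L2Key[slot] == b)
                {
                        MRU = slot;
                        return L2Data[slot];
                }
        }

        MRU = (MRU + 1) & (L2_SLOTS - 1);               // miss: evict round-robin
        L2Key[MRU] = b;
        for (unsigned int x = 0; x < 1024; x++)
                L2Data[MRU][x] = b * (float) x;
        return L2Data[MRU];
}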

For a quick and dirty implementation you can use the STL map, along the lines of:

#include <map>

typedef std::map<float, float*> Cache1;
Cache1 MyCache;
MyCache[a] = MyFloatArrayPointer;

But this will be slow (trust me, I tried it). It is good enough to establish your cache hit-rates, though. You can even new() your arrays for a hit-rate-check prototype, but make sure you don't iterate over the same data repeatedly without evicting down to a sensible number of MRU entries; otherwise you'll fill the cache once and then get a 100% hit rate for the other 1000 runs you may be doing in your test!
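For measuring the hit rate, a prototype along these lines is all you need (again, names made up, and the arrays are never freed - it's only a prototype):

#include <map>
#include <cstdio>

typedef std::map<float, float*> Cache1;

static Cache1 MyCache;
static unsigned long Hits = 0, Misses = 0;

// Prototype look-up: only good for measuring the hit rate, not for speed.
float *Lookup (float a)
{
        Cache1::iterator it = MyCache.find (a);
        if (it != MyCache.end ())
        {
                Hits++;
                return it->second;
        }
        Misses++;
        float *table = new float[1024];         // fill in the real values here
        MyCache[a] = table;
        return table;
}

// Call after the test run to see the hit rate.
void ReportHitRate ()
{
        printf ("hit rate: %.1f%%\n", 100.0 * Hits / (Hits + Misses));
}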

If your data block sizes are small, caching can easily take more time than it saves - make sure you check this. This is why it is extra vital that the caching implementation is very tight. Caching code can be difficult to vectorize, whereas number-crunching a whole array generally vectorizes nicely. That means that with a decent vectorizing compiler (ICC on a P4), your cache may show rather questionable benefit, whereas the same program compiled with GCC on the same machine will show considerable benefit, because the caching avoids a loop that is expensive (up to 8.5x more expensive on a P4!) when not vectorized.

And remember to listen to your compiler's warnings and vectorizing reports. There is a lot of performance to be gained by making sure your loops vectorize wherever even remotely plausible.

As an aside on the subject of ICC: use the latest v9.1. At the moment v10 seems to be quite broken (slower, vectorizes less, generates broken code in some cases, etc.) - don't use it until they've fixed it.

> Does anybody have good suggestions and pointers on this approach?

I hope this helped. :-)

Gordan



