

From: Gordan Bobic
Subject: Re: [Help-gsl] seeking highly efficient caches scheme for demanding engineering computing?
Date: Fri, 27 Jul 2007 09:20:42 +0100 (BST)

On Fri, 27 Jul 2007, Michael wrote:

> To save the time of costly function evaluation, I want to explore the
> possibility of caching.

LOL! Would you believe I wrote about this in my previous email, before I had even read this? :-)

> Suppose in millions of calls to the function evaluation, there are some
> computations in which only a portion of the parameters change; the others
> remain the same as parameters used in some earlier function evaluations.
> To save time, I can group the varying parameters into sub-groups and cache
> the results for later use. This needs a highly efficient cache and an
> efficient organization and look-up. Also, the decision of whether a group
> of parameters has been seen before should be based on the data in binary
> form, instead of decimal format, so I can compare chunks of memory directly
> using memory comparison.

I know a lot of people are going to choke when I suggest this, but you CAN compare floats for exact equality with good reliability and success when it comes to cache look-ups - the worst a spurious mismatch can do is cost you a recompute.
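For the binary comparison Michael describes, a straight memcmp() over the parameter block does the job. A minimal sketch (SameParams is just a name I made up):

#include <cstring>

// Compare two groups of parameters bit-for-bit. Values produced by the
// same sequence of operations compare equal here; note that +0.0 and -0.0
// have different bit patterns, which only costs an occasional recompute.
bool SameParams (const float *a, const float *b, unsigned int n)
{
        return memcmp (a, b, n * sizeof (float)) == 0;
}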

I find that the best strategy is to use two levels of caching. Level 1 will typically store the Most Recently Used (MRU) sub-expression. If you organize your loops so that the parameter of the inner-most sub-expression changes in the outer-most loop, and the parameter of the outer-most sub-expression changes in the inner-most loop, you will get extremely good results from the MRU caching scheme.

In your evaluation function, do something like:
// Assuming function a * (bX + c) + d

float Eval (float a, float b, float c, float d)
{
        static unsigned int     x;
        static float            xx;
        static float            CacheBX[1024];
        static float            CachedB;
        static float            CacheBXC[1024];
        static float            CachedC;

        bool CacheBDirty = false;
        bool CacheCDirty = false;

        if (CachedB != b)
        {
                CacheBDirty = true;
                CachedB = b;            // remember which b the table was built for
                for (x = 0, xx = 0; x < 1024; x++)
                        CacheBX[x] = b * xx++;          // b * X for X = 0..1023
        }

        if (CacheBDirty || CachedC != c)
        {
                CacheCDirty = true;     // the next stage (for a and d) would check this
                CachedC = c;
                for (x = 0; x < 1024; x++)
                        CacheBXC[x] = CacheBX[x] + c;   // bX + c
        }

        // etc. You get the idea.
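The calling loops are then nested so that b changes least often and d most often, which keeps the cached tables valid for as long as possible. A sketch of what I mean (the parameter ranges are made up):

float Result = 0;
// b drives the inner-most sub-expression (bX), so it changes in the
// outer-most loop; d only affects the final add, so it can change freely
// in the inner-most loop without invalidating any cached table.
for (float b = 1; b <= 10; b++)                 // CacheBX rebuilt 10 times
        for (float c = 1; c <= 10; c++)         // CacheBXC rebuilt 100 times
                for (float a = 1; a <= 10; a++)
                        for (float d = 1; d <= 10; d++)
                                Result += Eval (a, b, c, d);    // cache hit almost every call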


You can improve the cache effectiveness by truncating your floats a bit to cope with rounding errors that may creep in, depending on what the rest of your code does. It will also depend on how much precision you actually need. For example, to truncate your floats down to about 5 significant figures from the roughly 7 a 32-bit float normally carries, you could do something like:

*((unsigned int*)&b) &= 0xFFFFFF00; // clear the low 8 mantissa bits: ~16-bit precision, ~5 significant figures

Don't put this in your evaluator function - do it in the outer loops, as soon as you have the new value for b. There is no need to do it on every call; it only needs to be done once, when the value changes.
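If the pointer cast upsets your compiler (strict-aliasing warnings), the same masking can go through memcpy(). A sketch, assuming 32-bit floats and unsigned ints, just as the cast above already does (TruncateFloat is a name I made up):

#include <cstring>

// Mask off the low 8 mantissa bits of a float, leaving roughly 5
// significant decimal digits. Same effect as the cast above, but safe
// under strict aliasing rules.
float TruncateFloat (float f)
{
        unsigned int bits;
        memcpy (&bits, &f, sizeof (bits));
        bits &= 0xFFFFFF00;
        memcpy (&f, &bits, sizeof (f));
        return f;
}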

You can extend this further with a level 2 cache. If you do this (and it may or may not be effective, depending on your problem, your code and your compiler), make your L1 cache a pointer into the array held by the L2 cache. You will still get the full speed of the L1 cache, without duplicating the current MRU data (remember what I said about data fitting in the CPU caches?).

Don't malloc()/new your data arrays. Allocate a fixed number of data slots and round-robin through them. For example, if you are keeping the 4 most recently used entries and you round-robin through the cache, remember that you don't need to check the MRU entry (that's in your L1 cache), so you only need to check 3 out of the 4. When you do a mod operation to wrap the cache around, don't use MRU % 4, use MRU & 0x3 (this is a good reason to use caches of power-of-2 size! :-) ). See the sketch below.
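To make that concrete, here is a minimal sketch (all names made up) of a 4-slot round-robin L2 cache keyed on b, with the MRU slot acting as the L1 through a plain pointer:

#define L2_SLOTS 4                      // power of 2, so wrap-around is a mask

static float L2Key[L2_SLOTS];           // the b value each slot was built for
static float L2Data[L2_SLOTS][1024];    // pre-computed b * X tables
static unsigned int MRU = 0;            // index of the most recently used slot

// Return a pointer to the cached table for this b, rebuilding on a miss.
// The "L1 cache" is just the MRU slot; no data is duplicated.
float *LookupBX (float b)
{
        if (L2Key[MRU] == b)                            // L1 hit: MRU slot still valid
                return L2Data[MRU];

        for (unsigned int i = 1; i < L2_SLOTS; i++)     // check the other 3 slots
        {
                unsigned int slot = (MRU + i) & (L2_SLOTS - 1);
                if (L2Key[slot] == b)
                {
                        MRU = slot;
                        return L2Data[slot];
                }
        }

        MRU = (MRU + 1) & (L2_SLOTS - 1);               // miss: evict round-robin
        L2Key[MRU] = b;
        for (unsigned int x = 0; x < 1024; x++)
                L2Data[MRU][x] = b * (float) x;
        return L2Data[MRU];
}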

For a quick and dirty implementation you can use the STL map, along the lines of:

#include <map>

typedef std::map<float, float*> Cache1;
Cache1 MyCache;
MyCache[a] = MyFloatArrayPointer;

But this will be slow (trust me, I tried it). It is good enough to establish your cache hit-rates, though. You can even new() your arrays for a hit-rate-check prototype, but make sure you don't iterate over the same data repeatedly without evicting down to a sensible number of MRU entries; otherwise you'll fill the cache once and then get a 100% hit rate for the other 1000 runs you may be doing in your test!
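For measuring the hit rate, a prototype along these lines is all you need (again, names made up, and the arrays are never freed - it's only a prototype):

#include <map>
#include <cstdio>

typedef std::map<float, float*> Cache1;

static Cache1 MyCache;
static unsigned long Hits = 0, Misses = 0;

// Prototype look-up: only good for measuring the hit rate, not for speed.
float *Lookup (float a)
{
        Cache1::iterator it = MyCache.find (a);
        if (it != MyCache.end ())
        {
                Hits++;
                return it->second;
        }
        Misses++;
        float *table = new float[1024];         // fill in the real values here
        MyCache[a] = table;
        return table;
}

// Call after the test run to see the hit rate.
void ReportHitRate ()
{
        printf ("hit rate: %.1f%%\n", 100.0 * Hits / (Hits + Misses));
}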

If your data block sizes are small, caching can easily take more time than it saves - make sure you check this. This is why it is extra vital that the caching implementation is very tight. Caching code can be difficult to vectorize, whereas number-crunching a whole array generally vectorizes nicely. That means that with a decent vectorizing compiler (ICC on a P4), your cache may show rather questionable benefit, whereas the same program compiled with GCC on the same machine will show considerable benefit, because the caching avoids a loop that is expensive (up to 8.5x more expensive on a P4!) when not vectorized.

And remember to listen to your compiler's warnings and vectorizing reports. There is a lot of performance to be gained by making sure your loops vectorize wherever even remotely plausible.

As an aside on the subject of ICC: use the latest v9.1. At the moment v10 seems to be quite broken (slower, vectorizes less, generates broken code in some cases, etc.) - don't use it until they've fixed it.

> Does anybody have good suggestions and pointers on this approach?

I hope this helped. :-)

Gordan



