help-gsl
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Help-gsl] C/C++ speed optimization bible/resources/pointers needed


From: Gordan Bobic
Subject: Re: [Help-gsl] C/C++ speed optimization bible/resources/pointers needed, and about using GSL...
Date: Fri, 27 Jul 2007 08:51:19 +0100 (BST)

On Fri, 27 Jul 2007, Michael wrote:

I am in the middle of programming to solve an engineering problem
where the speed is huge concern. The project involving lots of
numerical integration and then there are several loops/levels of
optimization on top of the function evaluation engine.

A few general rules I've found to help a lot:

- Don't use unnecessary precision if you don't need it. Don't use a double if a float will do. This is particularly important it code that the compiler can vectorize. Even if your SIMD vector unit can handle doubles, it can typically handle twice as many floats as doubles for the same operation in the same amount of time.

- Use static variables in your functions wherever possible. This especially goes for arrays, if you can get away with it. If you know the maximum size you'll need for an array, static declare it once, and just use that. malloc()/new is slow.

- Use a good compiler for your target platform. Sadly, GCC isn't great. It's not even good. When it comes to tight vectorized loops and you don't want to reach for the hand crafted assembler, I have seen performance boosts of between 4.5x (on P3 / Core2) and 8.5x (P4) when using Intel's ICC compiler instead of GCC. GSL certainly compiles and works fine with ICC. IBM have a compiler for the PowerPC, and I believe Sun have their own optimizing compiler for the SPARC. They are almost certainly worth looking into.

As you probably
know, the key to a successful optimization is a fast underlying
objective function evaluator. The faster it is, the more promising the
optimization result(perhaps global optimal). However our project
requires many numerical integrations which prohibits us from making it
super fast. At the heart of the numerical integration is a smart
integrator and a super-fast integrand function evaluator. Even worse,
our function evaluation is in complex-domain. So the key point is how
to arrange our C/C++ code to make it highly efficient in every aspect.
Could anybody give some advice/pointers on how to improve the speed of
C/C++ program? How to arrange code? How to make it highly efficient
and super fast? What options do I have if I don't have luxury to use
multi-threaded, multi-core or distributed computing? But I do have a
P4 at least. Please recommend some good bibles and resources! Thank
you!

On a P4, ICC will utterly annihilate GCC in terms of performance of the resulting code, especially when it comes to heavy maths. Get a copy and try it. Enable -vec-report3 and see which of your loops don't vectorize. Where possible, re-arrange them so that they do vectorize. The compiler often needs a hand with assorted little hacks to help it vectorize the code, but they are generally not ugly, and are most of the time limited to:

1) Copy the object property into a local variable. This will help persuade compiler that there is no vector dependance it needs to worry about.

2) If you have a loop where you are operating with the iterator on an array, have a shadow iterator of a vectorizable type. Remember you cannot use mixed type in vector operations. For example, you can do:

static unsigned int     i;
static float            ii;
static float            baz[16];

for (i = 0, ii = 0; i < 16; i++)
        baz[i] *= ii; // vectorizes
        //baz[i] *= i; // doesn't vectorize

3) If your function parameters are changing partially, evaluate them partially and cache the results for each part so you don't have to re-calculate. For example, if your function is something like:

Y = a * (bX^2 + c) + d;

Arrange your loops so the outer-most one works on the inner-most evaluation (in this case X*X, followed by multiplication by b, followed by addition of c, followed my multiplication by a, followed by addition of d in the innermost loop. You can then cache the values of X*X (which is, incidentally, much faster than (pow(X,2)), b*X^2, b*X^2+c, etc, so when your outer parameters change, you don't have to re-calculate the inner terms. How you design your caches is also important. This can cause more overhead than it saves, so you have to optimize it very carefully while keeping your underlying algorithm structure in mind.

4) Keep your data sizes in mind. If your frequently used data doesn't fit in the CPU caches, you are likely to start experiencing slow-downs on the order of 20x or so due to latencies. Use a float when you don't need a double, as they are half the size.

5) Write the optimized code yourself. GSL and similar libraries are great for a rapid proof of concept prototype, but there is a price to pay in terms of performance when using generic code vs. bespoke code specifically optimized for a particular task.

6) Learn the compiler switches for your compiler. Test the accuracy of your resulting numbers. When you start cutting corners (e.g. "-ffast-math -mfpmath=sse,387" on GCC, "-fp-model fast=2 -rcd" on ICC) you may get more speed, but sometimes the precision on your floats will reduce. This may or may not be acceptable for your application.

7) Think about your algorithms. If you are doing a calculation that is a modulo of a power of 2 on an int, use a bit-wise AND instead. It is an order of magnitude faster. (e.g. instead of X % 4, do X &= 0x3).

There are hundreds of little hacks you can do to speed your code up. It is impossible to simmarize them all, and they will differ from project to project and they won't all be appropriate all the time. I hope this gets you started on the right path, though. :-)

Good luck.

Gordan




reply via email to

[Prev in Thread] Current Thread [Next in Thread]