help-gsl

Re: [Help-gsl] Re: C/C++ speed optimization bible/resources/pointers nee


From: Gordan Bobic
Subject: Re: [Help-gsl] Re: C/C++ speed optimization bible/resources/pointers needed, and about using GSL...
Date: Mon, 06 Aug 2007 19:01:16 +0100
User-agent: Thunderbird 2.0.0.6 (Windows/20070728)

Oliver Jennrich wrote:
On 7/27/07, Gordan Bobic <address@hidden> wrote:
On Fri, 27 Jul 2007, Jochen Küpper wrote:

[...example..]
Using floats instead of doubles can lead to quite significant performance
differences.
On your Pentium 3, which is not the average number cruncher these days.
An Opteron or any of the modern Intel CPUs would be more appropriate.
*sigh*

On an x86-64 Core2/1.9GHz, CentOS/x86-64 v5, ICC v9.1.051/x86-64
Using the small sample program I posted earlier.
Compiled with: icc -msse3 -xP -fp-model fast=2

Using floats: 2.65 seconds
Using doubles: 5.29 seconds

Twice as many floats vectorize per operation as doubles. Thus it goes
twice as fast. How much more evidence do you require?

Now you guys got me interested.

Here is what I tried:

#include <stdio.h>
#include <math.h>
int main ()
{
  const float foo = 29.123;

  unsigned int    j,k;
  unsigned int    i;
  double a[] = {1,2,3,4,5,6,7,8};
  double b[] = {5,6,7,8,9,10,11,12};
  double c[] = {0,0,0,0,0,0,0,0};

  for (k=0;k<100000;k++){
    for (j=0;j<10000;j++){
      for (i = 0; i < 8; i++)
        {
          c[ i ] = (j*k*(a[ i ]+b[ i ]));
        }
    }
  }
  printf("%f", c[3]);
  return 0;
}

with gcc 4.1.1
gcc -O3 -march=pentium-m -malign-double -mfpmath=sse -msse2 -Wall -o vect vect.c -ftree-vectorize -ftree-vectorizer-verbose=5

on an
x86 Family 6 Model 13 Stepping 8 GenuineIntel ~1862 MHz

The multiplication with j and k is just there so that -O3 doesn't optimize
the outer loops to oblivion, and to raise the overall times above the
clock noise.

The results are puzzling:

double, no vectorization: 23.797s
double vectorization: 23.858s
float, no vec: 15.561s
float, vec: 5.843s

long double, no vec (as sse2 is not enough...): 33.344s

Ok, I do understand why long double is slower than double (I think).
But why does vectorization not make the slightest bit of difference
when using doubles?

Assuming that GCC's optimizer doesn't do something daft (and that's a pretty big assumption), you are only getting partial vectorization here. You cannot mix types in vectorizable statements, and this one does: j and k are unsigned ints, while a[], b[] and c[] are doubles. Mixing types makes the statement non-vectorizable. Use a shadow iterator of the same type as your other data elements:

 for (k = 0, kk = 0; k < 100000; k++, kk++)
    for (j = 0, jj = 0; j < 10000; j++, jj++)
      for (i = 0; i < 8; i++)
          c[i] = (jj * kk * (a[i] + b[i]));

where kk and jj are of the same type as a[], b[] and c[], and are stepped
in the loop headers so that they always hold the same values as k and j.
You'll find that goes faster and vectorizes better.

Gordan



