
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries


From: Matt Ettus
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Date: Tue, 11 Dec 2007 19:55:10 -0800
User-agent: Thunderbird 2.0.0.9 (X11/20071115)


General curiosity questions:

 Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++ environment. I hacked my own 'connect_block' function (can't wait for v3.2, where these will be part of native gr). I am measuring performance using a custom block (gr_throughput) that simply reports the average number of samples processed per second.

While pure C++ may be desirable for some reasons, performance is not really one of them. When you use Python, it isn't running anything that is really performance critical.

 Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks.

That isn't surprising. I believe our SSE filtering code was optimized for prior generations of processors, so a new Core2-optimized version would be useful, and likely competitive with IPP. Also, are you sure that when you compile our code with Intel's compiler you are even getting the SSE versions? Or are the pure C++ versions being called?

Another thing, which I believe was mentioned earlier -- if you really care about FIR filter performance, you should be using the FFT versions of the filters. The difference in performance can be huge, making the 2x you get from IPP insignificant.

About a 40% improvement for sine/cosine generation blocks. This includes gr_expj, gr_rotate.
There is definitely room for improvement here.
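
For reference, the rotator's inner loop is basically one complex multiply per sample plus a periodic renormalization, which is exactly the kind of loop that vectorizes well. A rough sketch of that kernel (illustrative, not the actual gr_rotate code):

// Sketch of a phase-rotator kernel (illustrative, not the actual gr_rotate code).
// One complex multiply per sample; the phase is renormalized periodically so
// rounding error does not make the oscillator amplitude drift.
#include <complex>

typedef std::complex<float> gr_complex;   // matches GNU Radio's gr_complex

void rotate(gr_complex *out, const gr_complex *in, int n,
            gr_complex &phase, gr_complex phase_incr)
{
  for (int i = 0; i < n; i++) {
    out[i] = in[i] * phase;
    phase *= phase_incr;
    if ((i & 0x1ff) == 0)          // every 512 samples
      phase /= std::abs(phase);    // renormalize to unit magnitude
  }
}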

 Are your problems caused primarily by lack of CPU cycles, cache
 misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly dsp/comm). My guess is that the SSE instructions are not being used (or not used to their full extent). Even the 'multiply' block is VERY slow compared to a vector-by-vector multiplication in the Intel library. Some of the gr_blocks process each sample using a separate function call, e.g.
for (int n = 0; n < noutput_samples; n++)
        scale(in[n]);

Replacing this with a single vectorized function call is much faster.
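
(For example, the scaling loop above collapses into one library call. A sketch, assuming IPP's ippsMulC_32f is the right routine for a multiply-by-constant; check the IPP signal-processing documentation for the exact variant you need.)

// Sketch: one vectorized IPP call instead of a per-sample function call.
// ippsMulC_32f multiplies a whole float buffer by a constant in one call.
#include <ipps.h>

void scale_block(const float *in, float *out, int n, float k)
{
  ippsMulC_32f(in, k, out, n);   // out[i] = in[i] * k for all i
}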

Those function calls should be inlined if nothing else.

In any case, GCC is not vectorizing this, but it would be trivial to write it in SSE assembly or with intrinsics, which would allow this to be done in open-source code without having to resort to IPP. That would be a very useful contribution.
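
As a starting point, the same scaling loop written with SSE intrinsics might look roughly like this. It's only a sketch; no alignment handling or per-architecture tuning:

// Sketch: scaling a float buffer with SSE intrinsics (4 floats per iteration).
// No IPP required; compiles anywhere SSE is available (e.g. gcc -msse).
#include <xmmintrin.h>

void scale_sse(const float *in, float *out, int n, float k)
{
  __m128 vk = _mm_set1_ps(k);              // broadcast the scale factor
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128 v = _mm_loadu_ps(in + i);       // load 4 samples (unaligned OK)
    _mm_storeu_ps(out + i, _mm_mul_ps(v, vk));
  }
  for (; i < n; i++)                       // scalar tail for the leftovers
    out[i] = in[i] * k;
}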

Matt




