Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
From: Martin Dvh
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Date: Wed, 12 Dec 2007 03:37:54 +0100
User-agent: Icedove 1.5.0.14pre (X11/20071018)
Eric Blossom wrote:
> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>> Please see answers in-line.
>>
>> Thanks!
>
>>> General curiosity questions:
>>>
>>> Are you using oprofile to measure performance?
>>
>> I am a bit of a maverick, and for various reasons am using a pure C++
>> environment. I hacked my own 'connect_block' function (can't wait for
>> v3.2, where these will be part of native GNU Radio).
>
> The trunk contains C++ code for connect, hier_block2, etc. Some of
> the pieces that are still missing include C++ support for the USRP
> daughterboards, but Johnathan Corgan is working on that now.
>
>> I am measuring the performance using a custom block (gr_throughput)
>> that simply reports the average number of samples processed per
>> second.
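The gr_throughput block itself isn't shown in this thread. As a rough sketch of what "average samples per second" needs, the class below just accumulates a sample count and divides by the elapsed time. `throughput_meter` and `now_seconds` are hypothetical names, not GNU Radio API, and the caller is assumed to pass timestamps from a monotonic clock such as `std::chrono::steady_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of GNU Radio): average samples/second
// between construction and the most recent add().
class throughput_meter {
public:
    explicit throughput_meter(double start_seconds)
        : start_(start_seconds), last_(start_seconds) {}

    // Record that n more samples were processed, as of time now_s (seconds).
    void add(std::size_t n, double now_s)
    {
        assert(now_s >= last_);  // timestamps must come from a monotonic clock
        total_ += n;
        last_ = now_s;
    }

    // Average rate so far; 0 if no time has elapsed yet.
    double rate() const
    {
        const double dt = last_ - start_;
        return dt > 0.0 ? static_cast<double>(total_) / dt : 0.0;
    }

private:
    double start_;
    double last_;
    unsigned long long total_ = 0;
};

// Current time in seconds on the steady (monotonic) clock.
inline double now_seconds()
{
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}
```

In a block you would construct it with `now_seconds()`, call `add()` with the number of items produced on each work call, and print `rate()` every so often. Passing the time in as a parameter keeps the arithmetic independent of the clock, so it can be checked with fixed numbers.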
>
>>> What h/w platform are you running on / tuning for?
>>
>> The platform is currently Intel Xeon or Core2 Duo.
>>
>>> You're not trying to run your app on a cache-crippled machine like a
>>> Celeron, are you? ;)
>>
>> No, very high end.
>>
>>> Which blocks are causing you the biggest problem?
>>
>> I got a 2x improvement on all the filtering blocks.
>
> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
> or the gr_fir_filter* blocks? The FFT ones are _much_ faster, with a
> break-even point around 16 taps IIRC.
>
>> About a 40% improvement for sine/cosine generation blocks. These
>> include gr_expj and gr_rotate.
>
> No surprise there, and that's a great example of SIMD code that should
> be in GNU Radio.
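Separately from SIMD, a frequency-shift block can avoid calling sin/cos per sample at all. One common technique is phasor recursion: keep a unit-magnitude phasor, multiply it by a fixed increment exp(jθ) each sample, and renormalize it now and then because rounding lets its magnitude drift. This is a sketch of that general technique, not gr_rotate's source; `phasor_rotator` and the 512-sample renormalization interval are assumptions.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical phasor-recursion rotator (not gr_rotate's actual code):
// the k-th input sample (counted across calls) is multiplied by
// exp(j * k * phase_incr), advancing the phase with one complex multiply
// per sample instead of calling sin/cos.
class phasor_rotator {
public:
    explicit phasor_rotator(float phase_incr, unsigned renorm_every = 512)
        : incr_(std::polar(1.0f, phase_incr)), renorm_every_(renorm_every)
    {
        assert(renorm_every_ > 0);
    }

    void rotate(const std::complex<float>* in, std::complex<float>* out,
                std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = in[i] * phase_;
            phase_ *= incr_;  // advance the phase by phase_incr radians
            // Rounding makes |phase_| drift away from 1; pull it back periodically.
            if (++count_ % renorm_every_ == 0)
                phase_ /= std::abs(phase_);
        }
    }

private:
    std::complex<float> phase_{1.0f, 0.0f};
    std::complex<float> incr_;
    unsigned renorm_every_;
    unsigned long count_ = 0;
};
```

Each output costs two complex multiplies plus an occasional `std::abs`. Because the increment itself is rounded to float, the phase slowly accumulates error over very long runs.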
>
>>> Are your problems caused primarily by lack of CPU cycles, cache
>>> misses or mispredicted branches?
>>
>> I am not sure, since I am not at all a software expert (mostly DSP/comms).
>> My guess is that the SSE instructions are not being used (or not used to
>> their full extent). Even the 'multiply' block is VERY slow compared to a
>> vector-by-vector multiplication in the Intel library.
>
> OK.
>
>> Some of the gr_blocks process each sample using a separate function
>> call, e.g.
>>
>>   for (n = 0; n < noutput_samples; n++)
>>     out[n] = scale(in[n]);
>>
>> Replacing this with a single vectorized function call is much faster.
>
> OK.
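To make the per-sample versus per-buffer difference concrete, here is a minimal sketch of a constant multiply done as one call over the whole buffer, four floats at a time with SSE intrinsics from `<xmmintrin.h>`, plus a scalar loop for the leftover samples. `scale_buffer` is a hypothetical name, not a GNU Radio or Intel library function; when `__SSE__` is not defined it reduces to the plain loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps, ...
#endif

// Hypothetical helper: out[i] = in[i] * k for the whole buffer in one call.
void scale_buffer(const float* in, float* out, std::size_t n, float k)
{
    assert(n == 0 || (in != nullptr && out != nullptr));
    std::size_t i = 0;
#if defined(__SSE__)
    const __m128 vk = _mm_set1_ps(k);            // k broadcast into all 4 lanes
    for (; i + 4 <= n; i += 4) {
        const __m128 v = _mm_loadu_ps(in + i);   // unaligned load of 4 floats
        _mm_storeu_ps(out + i, _mm_mul_ps(v, vk));
    }
#endif
    // Leftover samples (or the whole buffer when SSE is unavailable).
    for (; i < n; ++i)
        out[i] = in[i] * k;
}
```

Whether the per-sample form is really slower depends on whether the compiler can inline `scale()` and auto-vectorize the loop; writing the kernel explicitly removes that dependence.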
>
>>> We would not accept the changes.
>
>> That's what I expected. We'll try to contribute the more DSP-centric
>> blocks, such as demodulators.
>
> That would be great! Or if you want to code up an SSE Taylor series
> expansion for sine/cosine good to 23 bits or so, we'd love that too ;)
I am working on this in the little spare time I have.
I already have an SSE Taylor series for atan2 working in GNU Radio.
The atan2 still needs some code cleanup and wrapper code to switch
implementations (if the processor is x86 and supports SSE2, use the
optimized version; otherwise use the generic one).
The sin/cos is far from ready.
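The atan2 code itself isn't part of this message, so the following is only a sketch of the structure described above, under a few assumptions. It pairs a generic path (`std::atan2` per sample) with a polynomial approximation and chooses between them at run time. The polynomial is a widely used odd 9th-degree fit to atan on [0, 1] (maximum error roughly 1e-5 rad), swapped in for a plain Taylor series because a truncated Taylor series for atan converges slowly near 1. The approximate path is scalar C++ standing in for a hand-written SSE kernel; the CPU check uses the GCC/Clang builtin `__builtin_cpu_supports("sse2")`, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

using atan2_fn = void (*)(const float* y, const float* x, float* out, std::size_t n);

// Generic path: the C++ library's atan2 for every sample.
static void atan2_generic(const float* y, const float* x, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::atan2(y[i], x[i]);
}

// Odd 9th-degree polynomial fit to atan(t) for t in [0, 1] (max error ~1e-5 rad).
static inline float atan_poly(float t)
{
    const float t2 = t * t;
    return t * (0.9998660f + t2 * (-0.3302995f + t2 * (0.1801410f
               + t2 * (-0.0851330f + t2 * 0.0208351f))));
}

// Approximate path: reduce to t in [0, 1], evaluate the polynomial, then
// restore the quadrant. Scalar stand-in for a hand-written SSE kernel.
static void atan2_poly(const float* y, const float* x, float* out, std::size_t n)
{
    assert(n == 0 || (y != nullptr && x != nullptr && out != nullptr));
    const float pi = 3.14159265f;
    const float half_pi = 1.57079633f;
    for (std::size_t i = 0; i < n; ++i) {
        const float ay = std::fabs(y[i]);
        const float ax = std::fabs(x[i]);
        if (ax == 0.0f && ay == 0.0f) {  // origin: return 0 (signed zeros not
            out[i] = 0.0f;               // handled the way std::atan2 does)
            continue;
        }
        float r = (ay <= ax) ? atan_poly(ay / ax)             // |angle| <= 45 deg
                             : half_pi - atan_poly(ax / ay);  // |angle| > 45 deg
        if (x[i] < 0.0f)
            r = pi - r;                    // left half-plane
        out[i] = (y[i] < 0.0f) ? -r : r;   // lower half-plane
    }
}

// Pick an implementation once, e.g. when the block is constructed.
static atan2_fn select_atan2()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse2"))
        return atan2_poly;
#endif
    return atan2_generic;
}
```

A block would call `select_atan2()` once and keep the returned pointer, so the check costs nothing per sample. On non-x86 targets or other compilers the `#if` drops the check and the generic path is always used.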
Greetings,
Martin
> Thanks for telling us about your experience.
>
> Eric
>
>
> _______________________________________________
> Discuss-gnuradio mailing list
> address@hidden
> http://lists.gnu.org/mailman/listinfo/discuss-gnuradio
>