---------- Forwarded message ----------
From: Yu-Hua Yang <address@hidden>
Date: 2009/7/2
Subject: Re: [Discuss-gnuradio] CUDA-Enabled GNURadio gr_benchmark10 possible improvements
To: Martin DvH <address@hidden>
Cc: discuss-gnuradio <address@hidden>
Thanks, Martin, for your generous effort to help me.
It appears only once, so I think I am in the clear.
I decided to abandon and comment out all the cuda.multiply_const_ff
function calls and concentrate on cuda.fir_filter_fff as suggested.
A few questions/concerns:
1. I increased output_multiple by doing "options.output_multiple =
xxx" and this has no effect on the computing time of either CUDA or
CPU. Did I do something wrong?
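While puzzling over this, I sketched a toy model (illustration only, NOT GNU Radio's real scheduler, and I'm not even sure options.output_multiple gets wired through to the block's set_output_multiple) of why raising output_multiple might change nothing: if the scheduler's buffers already hand the block maximal chunks, the per-call work and per-call overhead stay the same.

```python
# Toy model of a scheduler honoring output_multiple (illustration
# only, NOT GNU Radio's real scheduler): each work() call gets the
# largest multiple of output_multiple that fits in the buffer.

def work_calls(total_items, buffer_space, output_multiple):
    per_call = (buffer_space // output_multiple) * output_multiple
    if per_call == 0:
        raise ValueError("output_multiple exceeds buffer space")
    return -(-total_items // per_call)  # ceil division

# If the buffer already delivers maximal chunks, raising
# output_multiple changes nothing: same number of work() calls,
# hence the same total per-call overhead.
print(work_calls(1_000_000, 4096, 512))   # 245
print(work_calls(1_000_000, 4096, 4096))  # 245
```

If that toy picture is right, the knob would only matter once it actually changes the chunk size handed to each work() call.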
2. I increased the taps by doing
"taps = range(1,256)" and also increasing number of blocks of
fir_filter in the code and voila, I am now able to get CUDA to be
faster than just CPU. However, if I implement something like "taps =
range(1,512)" the CUDA part would be extremely slow (~20 seconds) while
the CPU is still fast (~2 sec). Why? This may be related to what
you were saying about the maximum number of taps... although why is
the CPU still able to compute it?
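For my own sanity I did a back-of-envelope shared-memory budget. Both the tiling layout and the 16 KB/multiprocessor figure are my assumptions (I haven't read the kernel closely), not facts about your implementation:

```python
# Back-of-envelope shared-memory budget for a tiled FIR kernel.
# Assumed layout: the taps plus one input tile are cached per thread
# block; 16 KB of shared memory per multiprocessor (G80/GT200 era).
FLOAT_BYTES = 4
SMEM_LIMIT = 16 * 1024

def smem_needed(ntaps, threads_per_block=256):
    taps_bytes = ntaps * FLOAT_BYTES
    # a block computing threads_per_block outputs needs
    # threads_per_block + ntaps - 1 input samples in its tile
    tile_bytes = (threads_per_block + ntaps - 1) * FLOAT_BYTES
    return taps_bytes + tile_bytes

for ntaps in (255, 511):
    print(ntaps, smem_needed(ntaps), smem_needed(ntaps) < SMEM_LIMIT)
```

Under these assumptions even 511 taps fits comfortably in shared memory, so I'd guess the slowdown is not raw shared-memory exhaustion but rather a compiled-in maximum tap count forcing a slower fallback path. Does that match what you meant?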
3. I had to increase the number of fir_filter blocks to 14 before I
could start seeing CUDA outperform the CPU. Experimentally it's fine,
I achieved my objective, but how is this "increased computation"
justified in normal GNURadio operation? I mean, when would a normal
GNURadio operation require a chain of 14 fir_filters? I guess this
goes beyond just "benchmarking" to asking where else I can take
advantage of CUDA's computation power in GNURadio in "normal"
operation?
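The way I currently rationalize the 14-filter crossover is with a toy model where the GPU pays a fixed per-run overhead (host/device copies, kernel launches) but has higher arithmetic throughput. ALL the numbers below (2 GMAC/s CPU, 40 GMAC/s GPU, 1.5 s overhead) are made up for illustration, not measurements:

```python
# Toy crossover model (all constants are invented, illustrative only):
# the GPU pays a fixed per-run overhead but computes each
# multiply-accumulate much faster, so a chain of filters only wins on
# the GPU once it does enough work per sample to amortize the overhead.

def cpu_time(nsamples, ntaps, nfilters, macs_per_sec=2e9):
    return nsamples * ntaps * nfilters / macs_per_sec

def gpu_time(nsamples, ntaps, nfilters, macs_per_sec=40e9,
             fixed_overhead=1.5):
    return nsamples * ntaps * nfilters / macs_per_sec + fixed_overhead

n, taps = 1_000_000, 255
for nf in (1, 14):
    winner = "GPU" if gpu_time(n, taps, nf) < cpu_time(n, taps, nf) else "CPU"
    print(nf, winner)  # 1 -> CPU, 14 -> GPU with these assumed constants
```

If that picture is roughly right, the practical answer would be that CUDA pays off for heavy per-sample math (long filters, channelizers), not for short chains of cheap blocks. Is that also your experience?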
4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of
cuda_fir_filter, it seems you are using shared memory, right? Just
making sure we are not using global or local memory, which would
drastically slow down the CUDA computation.
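To make concrete why I care, here is a rough count of global-memory reads for a tiled FIR kernel versus a naive one. This is my own illustrative model, not your actual cuda_fir_fff_7_kernel:

```python
# Rough global-memory traffic model for an FIR kernel (illustrative
# only, not the actual cuda_fir_fff_7_kernel).

def naive_reads(nsamples, ntaps):
    # every thread fetches its ntaps inputs and ntaps taps straight
    # from global memory
    return nsamples * ntaps * 2

def tiled_reads(nsamples, ntaps, threads_per_block=256):
    # each block loads its input tile and the taps into shared
    # memory exactly once, then all threads reuse them
    nblocks = -(-nsamples // threads_per_block)  # ceil division
    return nblocks * ((threads_per_block + ntaps - 1) + ntaps)

n, t = 1_000_000, 255
# prints the traffic-reduction factor from tiling (on the order of 100x)
print(round(naive_reads(n, t) / tiled_reads(n, t)))
```

So if the kernel really is caching taps and input tiles in shared memory, the difference versus a global-memory version should be enormous, which is why I wanted to double-check.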