
[Discuss-gnuradio] Re: GNU Radio GPGPU WIP Branch Status?


From: Martin DvH
Subject: [Discuss-gnuradio] Re: GNU Radio GPGPU WIP Branch Status?
Date: Sun, 25 Jan 2009 16:10:00 +0100

On Sat, 2009-01-24 at 15:16 -0500, Michael Dickens wrote:
> Hi Martin - I'm hoping you can provide some feedback to me on your GNU  
> Radio GPGPU branch.
> 
> Can you tell me the status of your GPGPU WIP branch?  Are you planning  
> on implementing any of your suggested improvements?
Yes, but I lack the time at the moment.
>   Do you mind if I  
> make a copy of it, and update it to the latest trunk as well as  
> convert it to be a component of the primary configure?  I think this  
> last piece will make it easier to use.
I don't mind.
In fact, you would make me very happy by doing this.
Note that one of the problems is that the CUDA scheduler and circbuf emulation
need changes in the heart of GNU Radio.
Since I created my CUDA branch, a lot of changes were made to
gnuradio-core in this area, and at the moment the CUDA gnuradio-core is
not compatible with trunk gnuradio-core.
If you eliminate the need for emulating a circular buffer, the changes
are much less drastic.
Also note that CUDA kernels can only access memory that is instantiated
in the same CPU thread. Current gnuradio-core trunk uses multiple
threads; you can stop it from doing so with an environment variable (I
don't know off the top of my head what it was; something like
GR_SINGLE_THREADED=1).

See also the email discussion between me and Eric at the bottom of this
mail, and my example for elimination of the need for emulating a
circular buffer below.
 
> It looks like the scheduler runs entirely on the CPU, and certain  
> blocks transfer data to the GPU and perform the computation there.   
> Clearly this is a non-optimal implementation.  Do you consider it  
> "working" or "complete" in any sense (and, if so, what sense)?
With almost all CUDA applications you do the scheduling on the CPU.
In this sense I think my code is a good proof of concept.
More comments below.

> 
> The following is my take; please correct me if/where I'm wrong. - MLD
> 
> I've looked at the README.cuda and the code in gr-cuda, and it looks  
> like the following is true:
> 
> * Created buffering for CUDA;
- created extra CUDA types for gr.io_signature;
- changed scheduling so circular buffers are emulated (on the GPU).
> * Created host <-> GPU data transfer blocks;
> * Created a few CUDA specific blocks.
> 
> The end result is that GNU Radio applications can make use of certain  
> CUDA-specific blocks, e.g., in the WFM receiver example.  But,  
> unfortunately, the performance isn't any better than using just a CPU  
> - generally because (1) the overhead in transferring data between the  
> CPU and GPU; (2) buffering on the GPU doesn't allow for a simple  
> memory mapped circular buffer (as used by most standard GNU Radio  
> installs).
(3) call overhead: even without the CPU-GPU transfers you still have a
very substantial call overhead when running a CUDA kernel.
> 
> (1) The computational intensity (ratio of computation time to data  
> transport time, roughly) can be improved by increasing the amount of  
> data per transfer, but with increased latency in the computation.   
> Clearly there is a trade-off, and by default GNU Radio keeps the  
> amount of "work" done per call on the order of a few 1000 items in  
> order to keep buffering within reason and latency low.
Indeed.
The same goes for the (large) call overhead.
Data is only transferred to the GPU at the beginning of the processing
chain, and transferred back to the CPU at the end of the processing
chain.
Even if that is done in large chunks, you still have a problem when the
processing kernels are only allowed to do a small number of computations
at a time: the call overhead is very substantial.
This is also true for the GPU-GPU memory copies done to emulate a
circular buffer.
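
To make the launch overhead concrete, here is a toy timing sketch (not
from my branch; all names are made up). It launches an empty kernel many
times and measures the average cost per launch with CUDA events; this is
typically on the order of tens of microseconds, which is a lot compared
to the work done on a few thousand items:

        __global__ void empty_kernel () { }

        // Returns the average time per kernel launch in milliseconds.
        float
        time_empty_launches (int nlaunches)
        {
          cudaEvent_t start, stop;
          cudaEventCreate (&start);
          cudaEventCreate (&stop);

          cudaEventRecord (start, 0);
          for (int i = 0; i < nlaunches; i++)
            empty_kernel<<<1, 1>>> ();
          cudaEventRecord (stop, 0);
          cudaEventSynchronize (stop);

          float ms = 0.0f;
          cudaEventElapsedTime (&ms, start, stop);
          cudaEventDestroy (start);
          cudaEventDestroy (stop);
          return ms / nlaunches;
        }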


> 
> (2) Martin designed a circular buffer emulator (of sorts) that runs on  
> the CPU and GPU, but still each block has to copy all memory after  
> each operation, which will reduce the computational intensity for  
> small amounts of data processed per operation.
True. Note that these copies are done entirely on the GPU, but they are
initiated on the CPU. From the point of view of the CPU they are
asynchronous, but the GPU cannot run kernels while doing these copies.
> 
> Improvements are suggested by:
> 
> (A) switching to using pinned memory (on the GPU, I presume);
Yes; see the bandwidthTest example in the CUDA SDK for details.

> 
> (B) do host-to-cuda and cuda-to-host data transfers in the background  
> to allow the host processor to doing other work;
Yes, this needs pinned memory.
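A minimal sketch of (A) and (B) together (standard CUDA API, not code
from my branch): host memory allocated with cudaMallocHost is pinned, so
cudaMemcpyAsync can do the host-to-device transfer in the background
while the CPU keeps working:

        void
        transfer_in_background (size_t nitems)
        {
          gr_complex *h_buf = NULL;   // pinned host buffer
          gr_complex *d_buf = NULL;   // device buffer
          cudaStream_t stream;

          cudaMallocHost ((void **) &h_buf, nitems * sizeof (gr_complex));
          cudaMalloc ((void **) &d_buf, nitems * sizeof (gr_complex));
          cudaStreamCreate (&stream);

          // ... fill h_buf with samples ...

          // Returns immediately; the copy proceeds in the background.
          cudaMemcpyAsync (d_buf, h_buf, nitems * sizeof (gr_complex),
                           cudaMemcpyHostToDevice, stream);

          // ... the CPU can do other work here ...

          cudaStreamSynchronize (stream);   // wait for the transfer

          cudaStreamDestroy (stream);
          cudaFree (d_buf);
          cudaFreeHost (h_buf);
        }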
> (C) restructure the scheduler somewhat.
Change the scheduler so you get larger computational intensity
(more work per call).
Also tune the calculation of num_blocks, num_iterations and
num_threads_per_block, and restructure the CUDA kernel code if needed.
Try to eliminate circular buffer emulation completely.

(D)
restructure the code so the CUDA-specific circular buffer emulation code
is in a new base class gr-block-cuda instead of scattered all over
gnuradio-core/src/lib/runtime
(mainly gr_vmcircbuf.cc, but also see gr_flat_flowgraph,
gr_single_threaded_scheduler).
All gr-cuda blocks would then inherit from gr-block-cuda.

One of the things I thought of, but which is not implemented in the code,
is the use of GR_BUFFER_EXTERNAL_BLOCK_HAS_ITS_OWN_FACTORY
(see gr_flat_flowgraph.cc).
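
What (D) could look like, very roughly (a sketch only; apart from
gr_block and general_work(), every name here is made up and is not the
actual branch code):

        // Hypothetical base class: fixes up buffer offsets (masking or
        // copying) once, in one place, then hands device pointers to the
        // derived block.
        class gr_block_cuda : public gr_block
        {
        protected:
          gr_block_cuda (const std::string &name,
                         gr_io_signature_sptr input_signature,
                         gr_io_signature_sptr output_signature)
            : gr_block (name, input_signature, output_signature) {}

        public:
          // Derived CUDA blocks implement this; input_items and
          // output_items hold device pointers.
          virtual int cuda_work (int noutput_items,
                                 gr_vector_int &ninput_items,
                                 gr_vector_const_void_star &input_items,
                                 gr_vector_void_star &output_items) = 0;

          int general_work (int noutput_items,
                            gr_vector_int &ninput_items,
                            gr_vector_const_void_star &input_items,
                            gr_vector_void_star &output_items)
          {
            // The circular buffer emulation (or index masking) would be
            // handled here instead of in gr_vmcircbuf.cc.
            return cuda_work (noutput_items, ninput_items,
                              input_items, output_items);
          }
        };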



There are a few things that would really make the CUDA code faster:
1. Increase computational intensity (discussed above).
This will definitely give a major performance boost.
2. Tune the instantiation of the kernels (number of kernels run in
parallel, calculation of num_blocks, num_iterations,
num_threads_per_block).
Also check the memory accesses and local memory use (read the CUDA
documentation; this is a complex subject).

3. Eliminate the circular buffer emulation entirely.
This way each CUDA block has to make sure it reads and writes to the
right memory location.
If the size of the buffer is a power of two, this can be done by
masking the indexes like this.
pseudocode:
          unsigned int mask = bufsize - 1;  // bufsize needs to be a power of two
          for (unsigned int i = thread_id; i < size; i += num_threads)
          {
            outbuf[i & mask] = inbuf[i & mask];
          }
Note that this adds computations and eliminates copying.
Benchmarking should be done to see what is faster in which
circumstances.
4. Improve the FIR filter block; it is not optimal (especially when
using decimation) and it can't run with large numbers of taps or large
decimation factors.
5. Automatically or manually combine the code of several sequential
blocks in a flow graph so they end up in a single kernel.
This is no easy task, but it would eliminate a lot of the call overhead
(a toy sketch appears after the example code below).
6. A totally different approach: run several flowgraphs in parallel
instead of trying to parallelize the code inside a block.
For example, run 32 FM receivers by giving every block 32 inputs and 32
outputs instead of one.
A similar approach could be used when a generic block is used multiple
times in the same flowgraph
(a flowgraph usually has several multiply blocks and FIR filters).
A sketch of this also follows the example code below.


example kernel code which does not need a circular buffer:

        __global__ void
        cudai_multiply_cc_kernel (gr_complex *g_odata,
                                  const gr_complex *g_idata_a,
                                  const gr_complex *g_idata_b,
                                  const unsigned int size,
                                  const unsigned int bufsize)
        {
          const unsigned int num_threads = blockDim.x * gridDim.x;
          const unsigned int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
          gr_complex product;
          const unsigned int memmask = bufsize - 1; // bufsize must be a power of two
          for (unsigned int i = thread_id; i < size; i += num_threads)
          {
            // complex multiply:
            // (a.x+ja.y)(b.x+jb.y) = (a.x*b.x - a.y*b.y) + j(a.x*b.y + a.y*b.x)
            product.x = g_idata_a[i & memmask].x * g_idata_b[i & memmask].x
                      - g_idata_a[i & memmask].y * g_idata_b[i & memmask].y;
            product.y = g_idata_a[i & memmask].x * g_idata_b[i & memmask].y
                      + g_idata_a[i & memmask].y * g_idata_b[i & memmask].x;
            g_odata[i & memmask] = product;
          }
        }

        void
        cudai_multiply_cc (gr_complex *device_output,
                           const gr_complex *device_input_a,
                           const gr_complex *device_input_b,
                           unsigned int n, unsigned int bufsize,
                           cudai_general_kernel_params *params)
        {
          cudai_general_kernel_params tmp_params;
          if (NULL == params)
          {
            params = &tmp_params;
            cudai_get_general_kernel_params (params, n);
          }

          LOCAL_CUDA_SYNC ("cudai_multiply_cc");
          cudai_multiply_cc_kernel<<<params->griddim, params->threaddim,
                                     params->dynamic_shared_mem_size>>>
            (device_output, device_input_a, device_input_b, n, bufsize);
        }
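
The toy sketch for point 5: two sequential blocks (a complex multiply
followed by a constant gain) fused into one kernel, so the intermediate
result never touches global memory and only one launch is paid instead
of two. This is my illustration, not code from the branch:

        __global__ void
        fused_multiply_scale_kernel (gr_complex *g_odata,
                                     const gr_complex *g_idata_a,
                                     const gr_complex *g_idata_b,
                                     const float gain,
                                     const unsigned int size)
        {
          const unsigned int num_threads = blockDim.x * gridDim.x;
          const unsigned int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
          gr_complex product;
          for (unsigned int i = thread_id; i < size; i += num_threads)
          {
            // first "block": complex multiply
            product.x = g_idata_a[i].x * g_idata_b[i].x - g_idata_a[i].y * g_idata_b[i].y;
            product.y = g_idata_a[i].x * g_idata_b[i].y + g_idata_a[i].y * g_idata_b[i].x;
            // second "block" fused in: constant gain, no extra kernel launch
            product.x *= gain;
            product.y *= gain;
            g_odata[i] = product;
          }
        }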
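
And the sketch for point 6 (again my illustration, not branch code): 32
independent channels handled by one kernel. The grid gets a second
dimension, so each row of blocks processes one channel and a single
launch does the work of 32:

        #define NUM_CHANNELS 32

        __global__ void
        batched_multiply_cc_kernel (gr_complex *g_odata,          // NUM_CHANNELS*size items
                                    const gr_complex *g_idata_a,  // NUM_CHANNELS*size items
                                    const gr_complex *g_idata_b,
                                    const unsigned int size)      // items per channel
        {
          const unsigned int offset = blockIdx.y * size;   // one channel per grid row
          const unsigned int num_threads = blockDim.x * gridDim.x;
          const unsigned int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
          gr_complex a, b, product;
          for (unsigned int i = thread_id; i < size; i += num_threads)
          {
            a = g_idata_a[offset + i];
            b = g_idata_b[offset + i];
            product.x = a.x * b.x - a.y * b.y;
            product.y = a.x * b.y + a.y * b.x;
            g_odata[offset + i] = product;
          }
        }

        // Launched with a 2-D grid, e.g. dim3 grid (num_blocks, NUM_CHANNELS).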


> 
> I'm wondering if you can elaborate on (A) and (C) above [(B) I  
> understand].  Do you anticipate these changes to be simple /  
> straightforward?  What specific changes do you believe would increase  
> this branch's performance?

Success,
Martin


Discussion about getting CUDA code into gnuradio-core, between me and
Eric Blossom.
-------- Forwarded Message --------
From: Martin DvH <address@hidden>
To: Eric Blossom <address@hidden>
Cc: Johnathan Corgan <address@hidden>
Subject: Re: [Commit-gnuradio] r9522 - in 
gnuradio/branches/developers/nldudok1/gpgpu-wip: . 
gnuradio-core/src/lib/runtime testbed
Date: Thu, 20 Nov 2008 01:17:22 +0100

On Sun, 2008-09-07 at 13:08 -0700, Eric Blossom wrote:
> On Sun, Sep 07, 2008 at 01:41:44PM -0600, address@hidden wrote:
> > Author: nldudok1
> > Date: 2008-09-07 13:41:42 -0600 (Sun, 07 Sep 2008)
> > New Revision: 9522
> > 
> > Added:
> >    gnuradio/branches/developers/nldudok1/gpgpu-wip/testbed/
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/testbed/cuda_test_dial_tone.py
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/testbed/cuda_test_dial_tone_wav.py
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/testbed/cuda_test_vector.py
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/testbed/cuda_test_vector_int.py
> > Modified:
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_buffer.cc
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_buffer.h
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_buffer.i
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_flat_flowgraph.cc
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_flat_flowgraph.h
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_io_signature.h
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_runtime_types.h
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_single_threaded_scheduler.cc
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_vmcircbuf.cc
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_vmcircbuf.h
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_vmcircbuf_cuda.cc
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/gr_vmcircbuf_cuda.h
> >    
> > gnuradio/branches/developers/nldudok1/gpgpu-wip/gnuradio-core/src/lib/runtime/qa_gr_buffer.cc
> > Log:
> > allocate cuda buffers in work thread of block
> 
> Martin,
> 
> The way you're going after the CUDA stuff looks like it's going to
> introduce a set of quite substantial changes into the guts of GNU
> Radio.
I know; I don't like it either.

>   Is there a way to get what you want where you don't have to
> modify these low-level classes?  E.g., creating a new subclass to
> derive CUDA blocks from?  It appears that the CUDA stuff requires a
> bunch of copying.  Couldn't this be done in a new base class, instead
> of in the buffer code?
> 
I have been struggling with this too. I would rather have the CUDA stuff
completely separate from gnuradio-core.

I don't see, however, how this can be done completely.
A block cannot instantiate its own circular buffer.
Even if it could, the other blocks couldn't see that this is a special
kind of circular buffer.

The copying itself, however, is a generic way to emulate a circular
buffer on a platform which does not have mmap.
This could be useful for other platforms too
(e.g., running GNU Radio on an embedded processor without an MMU).
So I also put it in the gnuradio-core buffer class for now.
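
Roughly, the emulation works like this (a simplified sketch of the idea,
not the actual gr_vmcircbuf_cuda code): the device buffer is allocated at
twice the logical size, and after each write the freshly produced region
is mirrored into the second half, so a reader can always read a
contiguous range from any offset:

        void
        mirror_after_write (gr_complex *d_base,    // device buffer, 2*bufsize items
                            unsigned int bufsize,  // logical buffer size in items
                            unsigned int write_index,
                            unsigned int nitems)
        {
          // Copy the newly written span into the mirror half; this is a
          // device-to-device copy, initiated from the CPU. For simplicity
          // this assumes the span itself does not wrap; real code would
          // split the copy in two if it does.
          cudaMemcpy (d_base + bufsize + write_index,
                      d_base + write_index,
                      nitems * sizeof (gr_complex),
                      cudaMemcpyDeviceToDevice);
        }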

I could remove the copying (for emulating the circular buffer) from the
gnuradio-core buffer code and move it to a new special cuda-block
base class.
The new base class would need complete access to the class variables,
pointers and methods of the circbufs, buffer readers and such.



> FYI, there were a lot of mods to gr_buffer and gr_buffer_reader for
> thread safety under the mp-sched.
I saw there were a lot of changes there.

I haven't looked at integrating my CUDA code with this new buffer code
yet.
I first want to see if I can remove the copying altogether.

For outputting data I could add code to split the work up so memory is
never written past the end of the buffer.
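
Something like this (an illustrative helper, not existing code) is all
the splitting would take on the output side:

        // Clamp a work call so the kernel never writes past the physical
        // end of the buffer; the scheduler calls again after the write
        // pointer wraps to zero.
        static unsigned int
        clamp_noutput (unsigned int noutput_items,
                       unsigned int write_index,  // write offset in items
                       unsigned int bufsize)      // buffer size in items
        {
          unsigned int room_to_end = bufsize - write_index;
          return (noutput_items < room_to_end) ? noutput_items : room_to_end;
        }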

For reading from input buffers this is more tricky: blocks are currently
free to read from anywhere in the buffer, at multiple places in the code
(FIR filters would be complicated this way, with the large history they
use).


Greetings,
Martin


> Eric
> 




