Re: [Discuss-gnuradio] SCHED_FIFO and NPTL


From: Stephane Fillod
Subject: Re: [Discuss-gnuradio] SCHED_FIFO and NPTL
Date: Thu, 9 Mar 2006 00:24:30 +0100
User-agent: Mutt/1.5.11

On Wed, Mar 08, 2006 at 12:26:07AM -0800, Eric Blossom wrote:
> On Wed, Mar 08, 2006 at 02:18:35AM -0500, Frank Brickle wrote:
> > Eric Blossom wrote:
> > 
> > >Using LD_ASSUME_KERNEL=2.4.19 effectively forces the old (pre-NPTL)
> > >behavior, which means that acquiring an uncontested mutex requires a
> > >trip to the kernel.  I believe it also means that mutexes won't work
> > >in shared memory across process boundaries.  Those seem like total
> > >losers to me...

Has anyone measured the cost of pre-NPTL (aka LinuxThreads) mutexes in a
typical GNU Radio application?

Right now, do we need mutexes across process boundaries (mutexes across
threads are fine)?

> > This adds up to: futexes don't work (yet). Is that right?
> 
> I guess that's the real question.  
> 
> My understanding is that they work, it's just a question of how they
> work vis-a-vis real time.  In our case I'm not worried about issues
> such as priority inversion on mutexes.  I assume all our signal
> processing threads will be running at the same pri and waking up
> anybody who's waiting will be fine.  I just want to ensure that I run
> before the X-server.

SCHED_FIFO is necessary but not sufficient against things like the X-server.
Indeed, some badly written video drivers call cli/sti directly from
user space on x86 systems (provided iopl succeeded). Binary-only drivers
can hurt the same way. Unfortunately, hard-RT solutions like Xenomai
can't do much in that case, because they can't virtualize the interrupt
masking. Otherwise Xenomai is doing great, even with a loaded X-server
and knee-bending activity.

cli/sti pairs are not the only latency killers. This page[3] tries
to list them for hard-RT systems. One big culprit encountered by
hard-RT developers on common PCs is the SMI subsystem. The interrupt
handler for a System Management Interrupt (SMI) runs from a protected
memory space (SMRAM), so the OS has no access to this handler code.
SMI is primarily used for ACPI and APM support. It can also be used for
USB legacy support (USB keyboard, disk, etc.). SMIs have been seen
eating up several *hundred* *milliseconds*!

[3] http://rtai.dk/cgi-bin/gratiswiki.pl?Latency_Killer

> 
> > I'm still confused about a couple of things. On one hand the holy grail 
> > appears to be getting the number of frames in a jack buffer down to 64 
> > so as to minimize the roundtrip latency. On the other hand you want to 
> > eliminate xruns. They aren't the same by any means.
> 
> Definitely.  In a hard-real-time world, this problem is solvable.

Indeed.

> At this point, I've got my eye more on minimizing the USRP latency
> since it determines the tightest MAC loop we can build on the host.
> Basically, I want to make sure that I'm running at a better pri than
> the X-server and friends.  (This all assumes that I've got sufficient
> cycles to actually get the work done in real time.)
> 
> The smallest chunk of work we can conveniently get across the USB is
> 128 complex samples (512 bytes).  Assuming we're running at 4MS/s
> that's 32uS of samples.  If we're able to get N uS of CPU every
> 32uS, then solving the audio problem shouldn't be a problem ;)

I assume you meant 'us' (microseconds), and not microsamples or
microsiemens :-)
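
The arithmetic behind that 32us figure can be checked in a few lines of
Python. This is only a sketch: the function name is made up for the
example, and the 4 bytes per complex sample is simply inferred from
512 bytes / 128 samples in the quoted text.

```python
def chunk_period_us(samples_per_chunk, sample_rate_hz):
    """Duration of one chunk of samples, in microseconds."""
    return samples_per_chunk * 1e6 / sample_rate_hz

samples = 128        # smallest convenient USB chunk, per the quoted text
chunk_bytes = 512    # size of that chunk in bytes, per the quoted text
rate = 4e6           # 4 MS/s

bytes_per_sample = chunk_bytes // samples
period_us = chunk_period_us(samples, rate)
print(bytes_per_sample, "bytes/sample,", period_us, "us per chunk")
```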

Talking about getting into the 32us range, this is what the Xenomai/RTAI
guys are doing: turning Linux plus average PCs into convenient DSPs. On
non-flawed hardware, one can get better than 10us latency. However, using
only 50% of each 32us period will not cost just 50% of the CPU (think
context switches, which cost more than saving the register file). Besides,
the bigger the chunks of data, the more efficiently the data-crunching
loops can be optimized, let alone the depth (history necessary for
feedback) required by some filtering algorithms. But you know the drill
already ;)
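
To put rough numbers on the chunk-size argument, here is a small sketch.
It assumes a fixed cost per wake-up (context switch, cache refill, etc.);
the 10us value is purely hypothetical, chosen for illustration and not
measured on any system, and the function name is invented for this example.

```python
def useful_fraction(chunk_samples, sample_rate_hz, overhead_us):
    """Fraction of each chunk period left after a fixed per-wakeup overhead."""
    period_us = chunk_samples * 1e6 / sample_rate_hz
    return max(0.0, (period_us - overhead_us) / period_us)

ASSUMED_OVERHEAD_US = 10.0   # hypothetical per-wakeup cost, not a measurement

for n in (128, 512, 2048, 8192):
    frac = useful_fraction(n, 4e6, ASSUMED_OVERHEAD_US)
    print(f"{n:5d} samples: {frac:.1%} of the period left for DSP")
```

Under that assumption the fixed cost dominates at 128 samples and becomes
negligible for large chunks, at the price of proportionally higher latency.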


> (I'm also reviewing chapter 5 of the USB 2.0 spec to see how often
> the h/w arbitrates for access to the USB.)

Oh btw, the regular Linux USB stack probably won't be able to guarantee
32us latencies. The best bet here is the usb4rt[4] project for Xenomai,
a rewrite of the USB stack with hard-RT in mind.

[4] http://developer.berlios.de/projects/usb4rt

> > It's not clear that minimizing roundtrip latency means much when you're 
> > using DSP buffers of 512 frames or more. By the same token, in what I've 
> > observed, the chief culprit for the xruns is the X window system. There 
> > is a very delicate balancing act going on in the process priorities 
> > between the audio subsystem and the video subsystem. I'm not convinced 
> > that running SCHED_FIFO is going to routinely enable the audio subsystem 
> > to slide in under the video action under all circumstances.
> 
> Taking a look at the output of 
> 
>   $ ps -eo pid,tid,class,rtprio,ni,pri,comm 
> 
> on my system indicates that the X-server is not running with real time
> scheduling.  Only the migration tasks are.  I assume these migrate
> tasks across CPUs (this output is from a dual core machine).  The
> stuff with negative niceness (the NI column) also have preferential
> scheduling over other time shared processes, but should be shutout by
> the SCHED_FIFO stuff.

Lucky you, dual-CPU (SMP) systems have an easy solution: dedicate one CPU
to the time-critical tasks, and leave the other CPU to Linux, the X-server, etc.

All in all, this might be the easiest path for people needing realtime
on Linux. These chips are becoming mainstream and, hopefully, cheaper.
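
As a minimal sketch of the "one CPU for the time-critical tasks" idea,
a process can pin itself to a single CPU through the Linux affinity calls
that Python exposes in the os module. The helper name and the choice of
the highest-numbered allowed CPU are assumptions made for this example.

```python
import os

def pin_to_cpu(cpu, pid=0):
    """Restrict a process (default: the caller) to one CPU; return the old mask."""
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, {cpu})
    return previous

rt_cpu = max(os.sched_getaffinity(0))   # assumption: take the highest allowed CPU
old_mask = pin_to_cpu(rt_cpu)
print("was allowed on", sorted(old_mask),
      "now pinned to", sorted(os.sched_getaffinity(0)))
```

Note that this only keeps the calling process on that CPU. Keeping
everything else off it needs something more, such as the isolcpus boot
parameter or cpusets.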

> > Bottom line, it hasn't actually been proved that running SCHED_FIFO will 
> > squash the existing latency and continuity problems, so I'm not at all 
> > sure much is missing without that capability.

IMHO, SCHED_FIFO is necessary, but not enough.
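
For reference, requesting SCHED_FIFO from a Python process might look
like the sketch below. The helper name and the priority of 50 are
arbitrary choices for illustration. Without root (or CAP_SYS_NICE, or a
suitable RLIMIT_RTPRIO) the call is expected to fail, so the helper just
reports that instead of raising.

```python
import os

def try_sched_fifo(priority=50):
    """Try to switch the calling process to SCHED_FIFO; return True on success."""
    lo = os.sched_get_priority_min(os.SCHED_FIFO)
    hi = os.sched_get_priority_max(os.SCHED_FIFO)
    priority = max(lo, min(hi, priority))     # clamp to the valid range
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:                           # typically EPERM when unprivileged
        return False
    return True

if try_sched_fifo():
    print("running SCHED_FIFO")
else:
    print("could not get SCHED_FIFO, still on policy", os.sched_getscheduler(0))
```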

> I'm not sure if the JACK FAQ is up to date or not.  Do you have data
> on the success or failure of folks running JACK on ALSA using NPTL and
> if they are able to get sufficiently good performance?  I guess Ardour
> users would be a good test case.  How about the dttsp stuff under Linux?

Whatever the solution, we're going to need to be able to evaluate its
"fitness" for our particular GNU Radio load. IMO, we should check
that ourselves.

> Returning to the audio xrun problem: I think that in the typical USRP
> + audio situation, xruns are aggravated by differences in the clocks
> between the two domains and the fact that we aren't doing anything to
> handle that situation.

I do agree. Then, how do we distinguish an xrun caused by a missed
deadline from an xrun caused by a clock difference?
What solutions are in sight? "Get to know your clocks"?
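
One possible heuristic, offered only as a sketch and not something
established in this thread: a clock mismatch should produce xruns at a
roughly regular interval set by the buffer depth and the rate offset,
whereas missed deadlines should follow system load. The function names
and the example figures (48 kHz nominal rate, 100 ppm offset, 512-frame
buffer) are hypothetical.

```python
def ppm_offset(measured_rate_hz, nominal_rate_hz):
    """Relative rate error between two clocks, in parts per million."""
    return (measured_rate_hz / nominal_rate_hz - 1.0) * 1e6

def seconds_between_xruns(buffer_frames, nominal_rate_hz, offset_ppm):
    """Time for a constant rate mismatch to fill or drain the buffer."""
    slip_frames_per_s = nominal_rate_hz * abs(offset_ppm) * 1e-6
    if slip_frames_per_s == 0:
        return float("inf")
    return buffer_frames / slip_frames_per_s

ppm = ppm_offset(48004.8, 48000.0)              # hypothetical measurement
interval = seconds_between_xruns(512, 48000.0, ppm)
print(f"offset {ppm:.1f} ppm -> one xrun about every {interval:.0f} s")
```

If the observed xruns are far more frequent or far less regular than
this estimate, the clocks alone are unlikely to be the explanation.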

-- 
Stephane



