
Re: Thread model

From: olafBuddenhagen
Subject: Re: Thread model
Date: Tue, 18 Mar 2008 10:11:18 +0100
User-agent: Mutt/1.5.17+20080114 (2008-01-14)


On Mon, Mar 17, 2008 at 10:47:40AM +0100, Neal H. Walfield wrote:
> At Sun, 16 Mar 2008 08:21:19 +0100, <olafBuddenhagen@gmx.net> wrote:
> > On Wed, Mar 12, 2008 at 05:12:03PM +0100, Marcus Brinkmann wrote:

> > > As for the threading model, more than one kernel thread per real
> > > CPU doesn't seem to make much sense in most cases.
> > 
> > Well, add a "per processing step" to make this statement more
> > generally true. In some cases, it's useful to have a distinct set of
> > worker threads for each processing step, working in an assembly-line
> > fashion: each thread picks a request from one wait queue, does its
> > work, and pushes it to the next wait queue.
> > 
> > This model specifically should improve performance for servers that
> > are constantly busy, processing many requests in parallel; and under
> > the condition that the amount of data touched in the processing is
> > relatively small compared to the amount of code.
> > 
> > It also simplifies control flow and locking. Certain optimisations
> > become obvious.
> What sort of optimizations?  It seems to me that if you have a small
> amount of data, then you'll be able to run almost entirely out of
> the L2 cache.  In this case, switching to another thread will
> introduce the overhead of shifting the cache to the other CPU.

Of course. But as we have one thread per CPU for each processing step,
it is perfectly possible for a request to go through all processing
steps without ever changing CPU. Whether this is always desirable is
another question...

Note that this approach could also be implemented with just a single
kernel thread per CPU, plus some kind of userspace scheduler to switch
between the various processing steps. While this variant gives more
direct control, it also seems more complicated... I'm not sure which is
preferable.

In any case, the general advantage of this approach is that rather than
having a single code path handle all stages of each request in
succession before switching to the next request, a particular piece of
code handles one stage of many requests in a loop, before the scheduler
switches to another stage, which runs in another loop. This improves
code locality, as well as data locality for any data shared between
requests, at the expense of locality in data private to the individual
requests. In many cases, this can be a considerable win.

(This is especially true on Pentium 4 and related "NetBurst"
architecture processors, where reloading the trace cache -- a
first-level code cache storing pretranslated instructions -- is very
expensive. Note that the next generation of Intel processors, to be
introduced by the end of this year -- unlike the current "Core 2"
models -- will again be based on an enhanced NetBurst architecture, and
thus reintroduce the trace cache...)

Besides this inherent advantage, having the code that handles an
individual stage in a relatively tight, regular loop can prompt further
optimizations, which are not possible, not useful, or at least less
obvious otherwise -- things like loop unrolling, using data-driven
algorithms instead of branchy code, etc. Together with the increased
code (and partially data) locality, this allows for much better use of
parallelism in the processor -- it helps work around the fact that
modern processors are extremely fast at executing optimized regular
loops, but very poor at typical application code, which is irregular
and full of unpredictable branches.

I'm not sure how much difference the additional optimizations can make;
but at least the locality part can be very decisive.

