l4-hurd

Re: RPC overhead


From: Jonathan S. Shapiro
Subject: Re: RPC overhead
Date: Tue, 08 Jul 2008 09:57:23 -0400

Neal:

If you are running on an IA32 implementation, you should be able to
confirm this using the hardware performance monitoring counters. Use
them to separately count supervisor-only (S) cache misses and combined
user+supervisor (U+S) cache misses. First measure the ping-pong test,
then yours. Look at the S cache misses in particular. While you are at
it, check TLB behavior as well.
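
Concretely, the measurement loop might look something like the sketch
below. This is only a sketch: it assumes the two counters have already
been programmed from ring 0 (via WRMSR) to count cache misses and DTLB
misses, that CR4.PCE permits RDPMC from user mode, and that
do_rpc_round_trip() is a stand-in for whatever invocation you are
timing.

/* Sketch only: read two IA32 performance counters around an RPC loop.
 * Assumes counters 0 and 1 were already programmed (from ring 0, via
 * WRMSR) to count, say, cache misses and DTLB misses, and that
 * CR4.PCE allows RDPMC from user mode.  Event selection is illustrative. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical: performs one invocation and waits for the reply.  */
extern void do_rpc_round_trip(void);

int main(void)
{
    enum { ITERATIONS = 100000 };
    uint64_t miss0 = rdpmc(0);          /* e.g. cache misses */
    uint64_t tlb0  = rdpmc(1);          /* e.g. DTLB misses  */

    for (int i = 0; i < ITERATIONS; i++)
        do_rpc_round_trip();

    printf("cache misses / RPC: %g\n",
           (double)(rdpmc(0) - miss0) / ITERATIONS);
    printf("DTLB misses / RPC:  %g\n",
           (double)(rdpmc(1) - tlb0) / ITERATIONS);
    return 0;
}

Comparing the per-RPC miss counts between the ping-pong test and your
benchmark should show how much of the gap is cache and TLB refill.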

It was always Jochen's practice to measure cost from the trap
instruction to kernel exit. While the L4 trap interface is very well
designed from the IDL compiler's point of view, the correct measurement
point is at the boundary of the IDL procedure (because the trap-layer
interface design can mandate copies that might be removed with a
different design).
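
To make that concrete, here is a rough sketch of timing at the stub
boundary. The stub name and its 12-word-in/2-word-out signature are
made up to match the message sizes Neal describes, not taken from any
real IDL output.

/* Sketch only: time at the IDL-procedure boundary.  example_invoke()
 * stands in for an IDL-generated stub (12 words in, 2 words out); its
 * name and signature are hypothetical. */
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

extern int example_invoke(const uint32_t args[12], uint32_t reply[2]);

uint64_t timed_invoke(const uint32_t args[12], uint32_t reply[2])
{
    uint64_t start = rdtsc();   /* before the stub marshals its arguments */
    example_invoke(args, reply);
    return rdtsc() - start;     /* after the reply has been unmarshalled  */
}

Measured this way, any copies the trap-layer interface forces on the
stub are charged to the RPC, which is where they belong.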

Further, I have always felt that the microbenchmark approach on this was
a good way to measure the ukernel implementation, but a bad way to
measure systemic behavior. The I/D cache effects and TLB effects are a
sizable component of the cost of separation in real systems.

Not, mind you, that EROS or Coyotos does any better on these issues.
What I'm trying to say is that using microbenchmarks exclusively
constitutes "misleading by omission".

In this respect, it is my opinion that measuring L4Linux does not help,
because my impression is that the IPCs required in that implementation
do not do much string motion (and therefore have minimal cache
footprint). That is a fine measurement of hosted Linux performance, but
if your goal were merely to run Linux quickly, sticking a microkernel
under it would be a step in the wrong direction. There are other reasons
to run Linux over a u-kernel, but in all of those you are going to
sacrifice some performance for what you get (e.g. virtualization).


shap

On Tue, 2008-07-08 at 10:00 +1000, Ben Leslie wrote:
> On Mon Jul 07, 2008 at 17:00:58 +0200, Neal H. Walfield wrote:
> >I ran an application benchmark on Viengoos.  Specifically, the
> >application is derived from the GCbench program.  You can find it
> >here:
> 
> >Each invocation includes approximately 12 words of payload and each
> >reply contains 2 words.  This suggests an RPC overhead of 1350 cycles
> >or 1.2 us.
> >
> >The 4.2 us represents approximately 5000 cycles.  This leaves 3650
> >cycles unaccounted for.  This seems to be a bit more than one can
> >simply attribute to secondary cache effects; however, perhaps
> >ping-pong really measures the very hot case and I'm running with very
> >cold caches.  I hope someone else can suggest how to figure out where
> >these cycles are going, has a theory, or can confirm that these cycle
> >counts are not, in fact, too high.
> 
> That seems about right to me in terms of cache effects.  Ping-pong runs
> very hot.  The next step would be to turn on performance monitoring
> counters and get a count of cache misses etc.
> 
> Cheers,
> 
> Benno
> 
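
For what it is worth, the quoted figures are internally consistent:
both conversions imply a clock in the neighborhood of 1.1-1.2 GHz, and
the unaccounted balance follows directly:

\[
\frac{1350\ \text{cycles}}{1.2\ \mu\text{s}} \approx
\frac{5000\ \text{cycles}}{4.2\ \mu\text{s}} \approx 1.1\text{--}1.2\ \text{GHz},
\qquad
5000 - 1350 = 3650\ \text{cycles unaccounted}.
\]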




