l4-hurd

RPC overhead


From: Neal H. Walfield
Subject: RPC overhead
Date: Mon, 07 Jul 2008 17:00:58 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/21.4 (i486-pc-linux-gnu) MULE/5.0 (SAKAKI)

I ran an application benchmark on Viengoos.  Specifically, the
application is derived from the GCbench program.  You can find it
here:

  
http://cvs.savannah.gnu.org/viewvc/hurd-l4/benchmarks/GCbench.c?root=hurd&view=log

The benchmark takes 239.4 seconds to complete.  During this time, it
aggressively uses Viengoos' services.  (Viengoos is implemented as a
user-level server running on top of Pistachio.)  I disabled all other
threads, so that the only two running threads were the application's
main thread and Viengoos' service thread.  Thus, when the application
makes a call, it should never block.

In Viengoos, I used l4_system_clock to read the time when a message
is received.  Just before Viengoos sends the reply, it reads the time
again and accumulates the difference in a per-method variable.  The
number of calls per method is also recorded.
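
Concretely, the server-side bookkeeping amounts to something like the
sketch below.  The struct, array, and macro names are made up for
illustration; only l4_system_clock comes from the real code, and I am
assuming it returns a 64-bit microsecond count:

  #include <stdint.h>

  /* METHOD_COUNT and struct method_stats are hypothetical names used
     only for this sketch.  */
  #define METHOD_COUNT 32

  struct method_stats
  {
    uint64_t time;   /* Accumulated handling time, in us.  */
    uint64_t calls;  /* Number of invocations.  */
  };
  static struct method_stats stats[METHOD_COUNT];

  /* Invoked by the server loop for each incoming message.  */
  static void
  handle_message (int method)
  {
    /* Read the clock on receipt of the message (assumed to return a
       monotonically increasing microsecond value).  */
    uint64_t start = l4_system_clock ();

    /* ... decode the arguments and do the actual work ...  */

    /* Just before the reply goes out, record the elapsed time and
       bump the per-method call count.  */
    stats[method].time += l4_system_clock () - start;
    stats[method].calls ++;
  }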

In the application, I instrumented the RPC stubs to do the same: just
before l4_call is invoked, I call l4_system_clock.  On return, I again
call l4_system_clock and save the difference in a per-method variable.
The number of calls per method is again recorded.
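
The instrumented stubs then look roughly like the sketch below;
rpc_stats and timed_call are placeholder names, l4_call and
l4_system_clock are the real entry points, and I am assuming the
usual convenience signature where l4_call takes the destination
thread id and returns the reply's message tag:

  /* Hypothetical per-method counters on the client side, reusing the
     method_stats structure from the previous sketch.  */
  static struct method_stats rpc_stats[METHOD_COUNT];

  static l4_msg_tag_t
  timed_call (int method, l4_thread_id_t server)
  {
    /* The message registers are assumed to have been loaded by the
       stub before this point.  */
    uint64_t start = l4_system_clock ();
    l4_msg_tag_t tag = l4_call (server);

    /* Record the round-trip time and the call count per method.  */
    rpc_stats[method].time += l4_system_clock () - start;
    rpc_stats[method].calls ++;

    return tag;
  }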

Below are the four most used system calls:

                  Time (ms)     % Time            us per call
                  User  Kernel   U   K   # Calls   User Kernel  delta
object discard  18,054  15,171   7%  6%  686,960   26.2   22.0    4.2
object alloc       730     567   0%  0%   91,123    8.0    6.2    1.8
cap copy           868     515   0%  0%   90,464    9.5    5.6    3.9
folio alloc         30      27   0%  0%      712   43.1   37.8    5.3


I'd expect the time measured from user space minus the time measured
in Viengoos to correspond to the RPC overhead.  On this machine (an
AMD K7 Duron at 1.2 GHz with a 64 KB L2 cache), ping-pong reports the
following costs associated with Inter-AS IPC:

  IPC ( 0 MRs): 627.01 cycles, 0.52us, 0.00 instrs
  IPC ( 4 MRs): 660.87 cycles, 0.55us, 0.00 instrs
  IPC ( 8 MRs): 670.11 cycles, 0.56us, 0.00 instrs
  IPC (12 MRs): 678.08 cycles, 0.56us, 0.00 instrs
  IPC (16 MRs): 675.67 cycles, 0.56us, 0.00 instrs
  IPC (20 MRs): 683.11 cycles, 0.57us, 0.00 instrs
  IPC (24 MRs): 691.04 cycles, 0.57us, 0.00 instrs
  IPC (28 MRs): 697.73 cycles, 0.58us, 0.00 instrs
  IPC (32 MRs): 697.39 cycles, 0.58us, 0.00 instrs
  IPC (36 MRs): 701.98 cycles, 0.58us, 0.00 instrs
  IPC (40 MRs): 714.57 cycles, 0.59us, 0.00 instrs
  IPC (44 MRs): 718.00 cycles, 0.60us, 0.00 instrs
  IPC (48 MRs): 720.20 cycles, 0.60us, 0.00 instrs
  IPC (52 MRs): 729.10 cycles, 0.60us, 0.00 instrs
  IPC (56 MRs): 736.47 cycles, 0.61us, 0.00 instrs
  IPC (60 MRs): 733.48 cycles, 0.61us, 0.00 instrs

Each invocation includes approximately 12 words of payload and each
reply contains 2 words.  This suggests an RPC overhead of 1350 cycles
or 1.2 us.
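
Spelled out (reading the approximate rows off the ping-pong output
above; the reply is shorter than the smallest payload measured, so
its cost is an interpolation):

  call  (~12 MRs):  ~680 cycles
  reply (~ 2 MRs):  ~650 cycles  (between the 0 and 4 MR rows)
  round trip:       ~1300-1350 cycles, i.e. about 1.1-1.2 us at
                    1.2 GHz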

The 4.2 us corresponds to approximately 5000 cycles, which leaves
about 3650 cycles unaccounted for.  That seems a bit more than one can
simply attribute to secondary cache effects; then again, perhaps
ping-pong really measures the very hot case and I am running with very
cold caches.  I hope someone can suggest how to figure out where these
cycles are going, has a theory, or can confirm that these cycle counts
are not, in fact, too high.

Thanks,
Neal




