Some initial results


From: Neal H. Walfield
Subject: Some initial results
Date: Thu, 26 Jun 2008 15:23:21 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/21.4 (i486-pc-linux-gnu) MULE/5.0 (SAKAKI)

I've ported the Boehm Garbage Collector to Viengoos.  This was quite
straightforward: it basically compiled and worked out of the box once
the required functionality was implemented.  (For the interested: Unix
signals, mmap, munmap and mprotect.  That said, this is already for an
advanced configuration; the collector can work with less.)
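
To give an idea of what that functionality amounts to, here is a rough
sketch of the sort of primitives the collector expects.  This is only
an illustration of the idea, not the Viengoos implementation of it;
the page size and the handler details are simplified:

  #include <signal.h>
  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>
  #include <sys/mman.h>

  #define PAGESIZE 4096

  /* Growing the heap: the collector obtains blocks with mmap and
     returns them with munmap.  */
  static void *
  heap_block_alloc (size_t size)
  {
    void *p = mmap (NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
  }

  /* Dirty-page tracking for the incremental configuration:
     write-protect the heap with mprotect and note which pages
     fault.  */
  static void
  dirty_fault (int sig, siginfo_t *si, void *ctx)
  {
    uintptr_t page = (uintptr_t) si->si_addr & ~(uintptr_t) (PAGESIZE - 1);
    (void) sig; (void) ctx;
    /* ... record PAGE as dirty ...  */
    mprotect ((void *) page, PAGESIZE, PROT_READ | PROT_WRITE);
  }

  static void
  install_dirty_handler (void)
  {
    struct sigaction sa;
    memset (&sa, 0, sizeof sa);
    sa.sa_sigaction = dirty_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction (SIGSEGV, &sa, NULL);
  }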

I then modified the collection scheduler to take advantage of
Viengoos's availability feature so as to reduce the GC overhead.  The
basic idea is to only perform a collection when the amount of
allocated memory approaches the available memory.

Because the amount of available memory may decrease, we need a way to
release memory to the operating system.  The Boehm GC already provides
basic support for this: after a collection completes, it munmaps
chunks that have not been used recently.  I improved this a bit; the
details are not relevant to this discussion.

The scheduler functions as follows:

  if (allocated_memory < 15/16 * available
      && used_memory < 2/3 * allocated_memory)
    try_to_unmap ()

  if (allocated_memory >= 15/16 * available)
    perform_gc ()
    try_to_unmap ()

allocated_memory is the amount of memory allocated from the system.
used_memory is the amount of memory the collector has given out.
try_to_unmap tries to unmap enough memory that the amount of allocated
memory falls below 7/8 of the available memory.
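
In C, the check looks something like this.  GC_get_heap_size,
GC_get_free_bytes and GC_gcollect are real Boehm GC entry points;
viengoos_available_memory and try_to_unmap are stand-ins for the
availability query and the unmapping pass described above, not actual
function names:

  #include <gc.h>
  #include <stddef.h>

  /* Stand-in: query Viengoos for the current availability.  */
  extern size_t viengoos_available_memory (void);
  /* Stand-in: the unmapping pass described above.  */
  extern void try_to_unmap (void);

  static void
  maybe_collect (void)
  {
    size_t available = viengoos_available_memory ();
    size_t allocated = GC_get_heap_size ();
    size_t used = allocated - GC_get_free_bytes ();

    if (allocated < available / 16 * 15 && used < allocated / 3 * 2)
      /* Plenty of headroom but much of the heap is unused: return
         memory without collecting.  */
      try_to_unmap ();

    if (allocated >= available / 16 * 15)
      {
        /* Allocation is approaching availability: collect, then try
           to get back below 7/8 of the available memory.  */
        GC_gcollect ();
        try_to_unmap ();
      }
  }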


To determine the effectiveness of this, I ran a few experiments based
on this benchmark:

  http://www.hpl.hp.com/personal/Hans_Boehm/gc/gc_bench/GCBench.c

The basic idea is that it builds trees, frees and repeats.  I modified
the benchmark to loop 100 times.  I ran the benchmark on Viengoos and
GNU Linux (on an AMD Duron 1.2GHz with 512MB, of which 100MB was
reserved for Pistachio) with the default scheduler and with the
Viengoos scheduler.  On GNU Linux, I approximated the Viengoos
scheduler by fixing the availability a priori.  I also ran the tests
on Viengoos again with a memory hog.  The memory hog starts after
about one minute and allocates and writes to 2.5MB of memory per
second.  After allocating half the memory, it sleeps for a minute and
then releases the memory at 2.5MB per second.  The scheduler adapts
quite well and there is little overhead.  The results are shown in
this graph:

  http://walfield.org/gcbench/gcbench-progress.png
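
For reference, the hog is essentially the following loop (a sketch:
the exact constants and timings in the real program may differ; target
is roughly half of the available memory):

  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define CHUNK (5 * 512 * 1024)    /* 2.5MB allocated or freed per second */

  static void
  hog (size_t target)
  {
    size_t n = target / CHUNK;
    char **blocks = calloc (n, sizeof *blocks);
    size_t i;

    sleep (60);                     /* start after about a minute */

    /* Allocate and touch 2.5MB per second until TARGET is reached.  */
    for (i = 0; i < n; i ++)
      {
        blocks[i] = malloc (CHUNK);
        if (! blocks[i])
          break;
        memset (blocks[i], 1, CHUNK);
        sleep (1);
      }

    sleep (60);                     /* hold the memory for a minute */

    /* Release it again at 2.5MB per second.  */
    while (i > 0)
      {
        free (blocks[-- i]);
        sleep (1);
      }

    free (blocks);
  }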

The summary is:

                          Time (s)  Rel.  # Collections  Overhead (s)

GNU Linux/Fixed/No Hog       216    1.00           89            15
Viengoos/Viengoos/No Hog     342    1.58          108            45
Viengoos/Viengoos/Hog        360    1.66          193            51
GNU Linux/Boehm/No Hog       408    1.88         9201           213
Viengoos/Boehm/No Hog        448    2.07         9206           235
Viengoos/Boehm/Hog           488    2.25         9204           257

Some observations:

1) The adaptive Viengoos scheduler takes about 75% of the time taken
   by the default Boehm scheduler.

2) The hog has little impact.  The adaptation can be seen in this
   graph:

  http://walfield.org/gcbench/gcbench-vg-vg-hog.png

3) Linux is damn fast.  In particular, if you look at the first graph,
   you'll see that the Viengoos runs take 40 seconds to complete the
   first iteration (this is also the time it takes to allocate all
   available memory!).  Each subsequent iteration takes about 3
   seconds.  The problem here is the address space construction.


I also collected some profiling data to see where the slow spots are.
The data is from the run with the Viengoos scheduler and without the
hog.
Unless otherwise noted, all the measurements start once Viengoos has
control and end just before Viengoos replies.  That is, they exclude
L4 IPC time.

 - 68 seconds (19%) in the page ager
    - 45 seconds (12%) making 1821419 l4_unmap's to get reference bits
 - 17 seconds (4%) to handle 1022431 page faults 
 - 25.7 seconds (7%) discarding objects.  797121 calls.
 - 1.5 seconds (<1%) allocating 89718 objects
 - 0.5 seconds (<1%) executing 88504 cap_copy's

All other object invocations were negligible.

Don't forget that, with the exception of the l4_unmap time, these
numbers do not include the time to execute the two IPCs.

The initial ramp-up time is spent constructing the address space.
This is clearly very, very expensive.  The address space is built just
as in EROS: by allocating cappages and inserting capabilities into
them.  On a fault, Viengoos walks the cappages to find the correct
object.  During this time, the process is allocating objects and
inserting them into the address space.  This accounts for all of the
object_alloc calls and the cap copies, which is still only two seconds
in total.  I'm guessing that a lot of the time is spent doing IPC.  I
will soon try to measure this more precisely.
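
To make the per-fault cost concrete, here is a simplified model of the
lookup a fault triggers.  The real Viengoos structures are richer; the
names, slot count and fixed two-level layout here are purely
illustrative:

  #include <stdint.h>
  #include <stddef.h>

  #define PAGE_BITS 12                       /* 4KB pages */
  #define SLOT_BITS 10                       /* 1024 capability slots */
  #define SLOTS     (1 << SLOT_BITS)

  enum cap_type { cap_void, cap_cappage, cap_page };

  struct cap
  {
    enum cap_type type;
    void *object;                            /* a cappage or a page */
  };

  struct cappage
  {
    struct cap caps[SLOTS];
  };

  /* Resolve a faulting 32-bit address by walking two levels of
     cappages; returns the page, or NULL if the fault cannot be
     resolved.  */
  static void *
  as_lookup (struct cappage *root, uint32_t addr)
  {
    /* The top 10 bits index the root cappage.  */
    struct cap *c = &root->caps[addr >> (PAGE_BITS + SLOT_BITS)];
    if (c->type != cap_cappage)
      return NULL;

    /* The next 10 bits index the second-level cappage, whose slot
       should hold a page capability.  */
    struct cappage *dir = c->object;
    c = &dir->caps[(addr >> PAGE_BITS) & (SLOTS - 1)];
    if (c->type != cap_page)
      return NULL;

    return c->object;
  }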

The second major slowdown is the page ager.  The page ager runs 4
times per second and scans about 400MB collecting reference bits.
This latter part takes 12% of the time, even using l4_unmap's batch
mode, which allows the reference bits associated with 32 fpages to be
collected in a single call.  It seems to me that L4 is just bad at
this.
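
For those not familiar with it, the reference-bit collection looks
roughly like this using the Pistachio convenience interface.  This is
the general shape only; the exact batch limit and the encoding of the
status bits should be checked against the L4 X.2 specification and
<l4/space.h>:

  #include <l4/types.h>
  #include <l4/space.h>

  #define BATCH 32                   /* fpages queried per l4_unmap */

  /* Query the reference and dirty status of BATCH consecutive 4KB
     pages starting at START without revoking any access rights.  */
  static void
  age_pages (L4_Word_t start, int *referenced, int *dirty)
  {
    L4_Fpage_t fpages[BATCH];
    int i;

    for (i = 0; i < BATCH; i ++)
      /* No rights are set in the fpage, so nothing is revoked; we only
         want the accumulated status bits back.  */
      fpages[i] = L4_FpageLog2 (start + i * 4096, 12);

    L4_UnmapFpages (BATCH, fpages);

    for (i = 0; i < BATCH; i ++)
      {
        /* On return, the rights bits of each fpage carry the status:
           readable means referenced, writable means written.  */
        referenced[i] = !! (L4_Rights (fpages[i]) & L4_Readable);
        dirty[i] = !! (L4_Rights (fpages[i]) & L4_Writable);
      }
  }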

These two results are strong motivation to port Viengoos to native
hardware rather than continue using L4 as a hardware abstraction.


Neal



