RE: [pooma-dev] timers and performance measurement under Linux
From: Julian C. Cummings
Subject: RE: [pooma-dev] timers and performance measurement under Linux
Date: Mon, 6 Aug 2001 11:20:56 -0700
Hi Jim,
It's probably a good idea to have an option for choosing between CPU time and wallclock time as the benchmark measurement, since they each have their own purpose. Normally CPU time is interesting for measuring on-node performance, and wallclock time is more useful for measuring parallel speedup.
As for threads, I think you can only measure CPU time while a thread is active on a processor. Doing anything else doesn't really make sense. So with clock(), you would want to call clock() when a thread first begins to work (not before the thread is actually launched!) and then call clock() again when the thread is done working (but before the thread exits). If each thread does a tiny amount of work, the resolution issue you mentioned will be a problem. The clock() function has different resolutions on different architectures. But normally the variation in performance measurement from one run to the next is far greater than the timer resolution anyway.
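Something like this, for example (a minimal sketch; doWork() and threadBody() are just hypothetical stand-ins, and the thread launch itself is omitted):

    #include <cstdio>
    #include <ctime>

    // Hypothetical stand-in for the kernel being measured.
    void doWork()
    {
        volatile double sum = 0.0;
        for (long i = 0; i < 10000000L; ++i)
            sum += 1.0 / (i + 1.0);
    }

    // Body executed by each thread.
    void threadBody()
    {
        // Start the clock only once the thread has actually begun
        // working, not when it is launched.
        std::clock_t start = std::clock();

        doWork();

        // Read the clock again when the work is done, but before
        // the thread exits.
        std::clock_t stop = std::clock();

        double cpuSeconds = double(stop - start) / double(CLOCKS_PER_SEC);
        std::printf("thread CPU time: %g s\n", cpuSeconds);
    }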
The gettimeofday() function is probably the best thing to use for wallclock time measurement. This is what we used in the old Timer class in Pooma r1. I haven't looked at your check-in yet, but hopefully you remembered to check for overflow in the microseconds counter and increment the seconds counter accordingly. Other than that, I remember that code as being pretty simple.
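A bare-bones version of that timer might look like this (just a sketch assuming a POSIX gettimeofday(), not the actual Pooma r1 code); the borrow on the microseconds field is the overflow handling I mean:

    #include <sys/time.h>

    // Bare-bones wallclock timer (illustrative, not the Pooma r1 Timer).
    class WallClock
    {
    public:
        void start() { gettimeofday(&t0_, 0); }

        // Seconds elapsed since start().
        double elapsed() const
        {
            timeval t1;
            gettimeofday(&t1, 0);
            long sec  = t1.tv_sec  - t0_.tv_sec;
            long usec = t1.tv_usec - t0_.tv_usec;
            if (usec < 0)       // microseconds underflowed:
            {                   // borrow one second
                usec += 1000000;
                sec  -= 1;
            }
            return sec + usec * 1.0e-6;
        }

    private:
        timeval t0_;
    };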
As for your comments on the PIII performance, I think what you are seeing is correct. The out-of-cache performance is not very good. You will see closer to optimal performance only when the problem size is in-cache, and the caches are much smaller than what we were used to on the SGI boxes. With an optimized C code kernel, you should be able to see the cache effect and stronger flops numbers for small problem sizes. (But of course, it gets harder to measure accurately, too.)

I'm not aware of any profiling tools from KAI, so I think prof/gprof is all there is, unless you know how to access the Pentium hardware counters. Quite a while ago, I fooled around with the performance analysis tools that come with Intel VTune. They were not terribly helpful, other than telling me the obvious: that all the time was being spent in the kernel loop. The real problem is that the optimization and inlining of the compiler were poor enough that there was little to do to improve performance. I remember being surprised to find that Blitz code optimized much better under VTune than Pooma code. I think the reasons had to do with the amount of template complexity in Pooma versus Blitz. Maybe things are better under newer versions of VTune, but Allan didn't seem too impressed with the newest VTune when I asked him about it. In any case, I sort of thought that VTune was not an important target for Pooma and its customers, so I haven't pursued it further.
Julian C.

Jim wrote:
How important is it that the timer used by the benchmark class
measure CPU time? I'm not sure how we've even been using this class for
parallel codes - in fact, I was thinking that back when we were just testing
with threads, we were measuring wall-clock time. But the default
implementation of Utilities/Clock.h uses the "clock()" call, which is supposed
to measure "elapsed processor time", though I'm not completely sure what that
means for a multithreaded program on a multi-processor. On the SGI we were
using the high-performance timers, which I believe access the CPU's hardware
performance registers, so I would think that that was CPU time as well.
Anyway, the problem is that clock() only has 10 ms granularity under Linux and
that's really crappy if you're trying to measure scaling with problem size.
I've written a version that uses gettimeofday, which has a granularity of
something like 15 microseconds under Linux (may depend on processor speed?),
but this is definitely not a measure of CPU time. For serial programs this
just means that you need to test on an unloaded machine to have the results
make sense. For parallel programs, this means that you need to interpret the
results differently. Anyway, I'm using this and it seems generally useful, so
I'd like to check it in, but was wondering if the clock-time versus CPU-time
measurement was a problem for anyone. I suppose we could put an accessor on
Clock that would allow the user to find out what type of time was being
presented, and Benchmark could examine this and calculate appropriately, if
necessary.
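Something like this minimal sketch is what I have in mind (the names and the compile-time switch here are just illustrative, not actual Pooma code):

    #include <ctime>
    #include <sys/time.h>

    // Illustrative only: Clock reports which kind of time value()
    // returns, so callers can interpret measurements appropriately.
    struct Clock
    {
        enum TimeType { CPUTime, WallClockTime };

    #if defined(USE_GETTIMEOFDAY)
        static TimeType timeType() { return WallClockTime; }
        static double value()
        {
            timeval t;
            gettimeofday(&t, 0);
            return t.tv_sec + t.tv_usec * 1.0e-6;
        }
    #else
        static TimeType timeType() { return CPUTime; }
        static double value()
        {
            return double(std::clock()) / double(CLOCKS_PER_SEC);
        }
    #endif
    };

Benchmark could then check Clock::timeType() and label or interpret its numbers accordingly.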
As a side comment, I'm using KCC on one of our 1 GHz PIII
Linux boxes. ABCTest (which is doing a simple non-stencil calculation, so this
is as fast as it gets) is only getting 45 MFlops for the C version, which is
the fastest. This seems pretty pathetic. What have other Linux users seen on
what types of boxes? Also, what is available on Linux for profiling? Is
prof/gprof it? Does KAI sell anything commercial? Does the PIII have any
hardware performance monitoring and is there access to it? Does VTune (under
Windows) inline sufficiently well to do performance testing there? (It allegedly
has good profiling tools, so it might be OK for the serial stuff, but I don't
know if it is capable of "inlining everything".)
Jim