RE: [pooma-dev] timers and performance measurement under Linux
From: James Crotinger
Subject: RE: [pooma-dev] timers and performance measurement under Linux
Date: Mon, 6 Aug 2001 12:47:36 -0600

Julian:
---------------------------------------------------
The gettimeofday() function is probably the best thing to use for wallclock time measurement. This is what we used in the old Timer class in Pooma r1. I haven't looked at your check-in yet, but hopefully you remembered to check for overflow in the microseconds counter and increment the seconds counter accordingly. Other than that, I remember that code as being pretty simple.
-----------------------------------------------------------
I just did:

    return tv.tv_sec + 1.e-6 * tv.tv_usec;

This mirrors what we are doing with clock_gettime. My interpretation of gettimeofday is that tv_usec should always be less than 1e6: it is supposed to return the number of seconds and microseconds since 12:00 am Jan 1, 1970. I checked this under Linux, and tv_usec resets to zero every time tv_sec is incremented. So I don't see a reason to put our own (% 1000000) after it, and indeed if it were over 1000000 I'm not even sure how I'd interpret tv_sec.
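
For reference, here is a minimal sketch of that conversion as a standalone helper. The function names (to_seconds, wall_seconds, elapsed_seconds) are made up for illustration and are not the names in the POOMA check-in; the assert just spells out the assumption that tv_usec stays in [0, 1000000).

```cpp
#include <cassert>
#include <sys/time.h>

// Hypothetical helper (not the POOMA Timer code): convert a timeval to
// seconds, relying on tv_usec already being normalized by the system.
double to_seconds(const timeval& tv)
{
    // Assumption stated in the message: tv_usec is always below one million,
    // so no extra "% 1000000" is applied here.
    assert(tv.tv_usec >= 0 && tv.tv_usec < 1000000);
    return tv.tv_sec + 1.e-6 * tv.tv_usec;
}

// Current wallclock time in seconds since the epoch.
double wall_seconds()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return to_seconds(tv);
}

// Elapsed time between two samples. Subtracting the fields before
// converting keeps the microsecond digits from being rounded away,
// because the absolute time is about 1e9 seconds.
double elapsed_seconds(const timeval& start, const timeval& stop)
{
    return (stop.tv_sec - start.tv_sec)
         + 1.e-6 * (stop.tv_usec - start.tv_usec);
}
```

A double holding roughly 1e9 seconds still resolves a few tenths of a microsecond, so the one-line form is adequate for timing; elapsed_seconds is only worth the trouble when timing very short intervals.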
Julian:
---------------------------------------------------
As for your comments on the PIII performance, I think what you are seeing is correct. The out-of-cache performance is not very good. You will see closer to optimal performance only when the problem size is in-cache, and the caches are much smaller than what we were used to on the SGI boxes. With an optimized C code kernel, you should be able to see the cache effect and stronger flops numbers for small problem sizes. (But of course, it gets harder to measure accurately, too.) I'm not aware of any profiling tools from KAI, so I think prof/gprof is all there is, unless you know how to access Pentium hardware counters.
-----------------------------------------------------------
Oh, this number is definitely memory-bandwidth limited: there are three to four loads and two stores every trip through the loop, which does four flops (two multiplies and two adds). I get a peak C performance of about 390 MFlops for N = 60 or so. The peak POOMA II Brick performance is only 115 MFlops at a slightly higher N, and then it drops off very rapidly to about 30.

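
To make the operation count concrete, here is a rough sketch of a loop with the same mix of memory traffic and arithmetic, plus the MFlops bookkeeping. The kernel below is a made-up stand-in, not the benchmark I am actually running, and the names kernel, flop_count, and mflops are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel (NOT the benchmark from this thread): chosen only
// because each trip does 2 multiplies, 2 adds, 2 stores, and 3-4 loads.
void kernel(std::vector<double>& x, std::vector<double>& y,
            const std::vector<double>& z, double c)
{
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        // loads x[i] and y[i]; one multiply, one add, one store
        x[i] = x[i] + c * y[i];
        // loads z[i] (y[i] may or may not be reloaded, hence "three to
        // four" loads); one multiply, one add, one store
        y[i] = y[i] + c * z[i];
    }
}

// Four flops per loop trip, n trips per sweep, reps sweeps.
double flop_count(std::size_t n, std::size_t reps)
{
    return 4.0 * static_cast<double>(n) * static_cast<double>(reps);
}

// MFlops = flops / (seconds * 1e6).
double mflops(double flops, double seconds)
{
    return flops / (seconds * 1.0e6);
}
```

With five or six 8-byte memory operations for every four flops, the loop needs on the order of 10 to 12 bytes of memory traffic per flop, which is why the rate falls off once the arrays no longer fit in cache.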
I tried gprof with "KCC -pg" generated code this morning, and gprof crashed after about 10 minutes of crunching on the output of a run. Has anyone else out there seen this? I'm going to try compiling with gcc, but I'm not sure it generates good enough code for me to trust the profile results to guide me to the right optimizations.

Jim