freepooma-devel

timings using optimized codewarrior


From: John Hall
Subject: timings using optimized codewarrior
Date: Tue, 5 Jun 2001 21:55:45 -0600

Gang:
Well, my student and I just got an optimized version of a simple little diffusion stencil running under Metrowerks CodeWarrior, and frankly the results are kind of interesting. First, optimized code runs around 6 times faster than unoptimized code. Dave Nystrom and I seem to recall that optimized code under R1 ran 10-20 times faster than unoptimized using KCC. I don't want to attach too much significance to this, though. It is an indicator that either the Metrowerks optimizer is not all that hot (my belief) or that the abstractions of R2 are somehow less onerous in debug mode (also a possibility).

Anyhow, this was a 2-D diffusion stencil, and no matter what we tried we always got a linear response in the timing study. For the first block of data below, this is the expected result, since we were just running the same-size problem for more cycles.

But we also tried running larger and larger problems to push ourselves out of L2 cache (this PIII has a 256K L2 cache), and we were never able to notice a drop in performance. The elapsed time stayed linear in the total number of cells. This will make more sense when we can convert the units to something like MFlops. So either you guys have done something really impressive regarding cache utilization, or we are running so slowly that cache misses are not noticeable.
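To put rough numbers on where L2 should run out, here is a minimal sketch of the working-set arithmetic. It assumes things the message does not state: that Real is an 8-byte double, that only the three fields described later in this message (Temperature, Conductivity, TmpField) are touched, and that guard layers and expression temporaries can be ignored. The names workingSetBytes and exceedsL2 are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions (not from the message): 8-byte Real, three fields only,
// no guard layers, no temporaries.
constexpr std::size_t kBytesPerReal = 8;
constexpr std::size_t kNumFields    = 3;           // Temperature, Conductivity, TmpField
constexpr std::size_t kL2Bytes      = 256 * 1024;  // 256K L2 on the PIII

// Bytes needed to hold all fields for a cellsXY x cellsXY mesh.
constexpr std::size_t workingSetBytes(std::size_t cellsXY) {
    return kNumFields * kBytesPerReal * cellsXY * cellsXY;
}

// True when the estimated working set no longer fits in L2.
constexpr bool exceedsL2(std::size_t cellsXY) {
    return workingSetBytes(cellsXY) > kL2Bytes;
}
```

Under these assumptions a 101 x 101 mesh needs 244,824 bytes, just under 256K, while a 201 x 201 mesh needs 969,624 bytes, so the 201, 401, 501 and 1001 runs should all have spilled out of L2.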

Also, we ran optimized on a Mac and a PC, and the results differed by exactly the ratio of the clock speeds. This was a surprise, since Mac advocates had always claimed that the Motorola floating point performance was better than that of Intel at a given clock rate. (This was for a 650 MHz PIII laptop and a 500 MHz G3 laptop.)

As an aside, the Brick Engine reproducibly ran slightly slower than compressibleBrick (only about 5 percent, so it was basically a dead heat), but I would have expected Brick to be faster than compressibleBrick, and probably by more than a few percent, since it should have less overhead.

Anyhow, here is the raw data for the Brick PIII runs:
cellsXY cycles  elapsed time (secs), per run    Total Cells
101     1000    24      24      24              10201
101     2000    48      48      47              10201
101     3000    72      71      72              10201
101     5000    119     122     121             10201
101     10000   243     243     244             10201
101     25000   608     607     606             10201

25      1000    2       2       2               625
51      1000    6       5       5               2601
101     1000    24      24      24              10201
201     1000    103     103     103             40401
501     1000    635                             251001
1001    1000    2520                            1002001


15      5000    4       4       3               225
25      5000    8       9       8               625
51      5000    29      29      28              2601
75      5000    61      61      61              5625
101     5000    124                             10201
201     5000    518                             40401
401     5000    2054                            160801


3       10000   3                               9
6       10000   3                               36
12      10000   6                               144
24      10000   15                              576

3       100000  28                              9
6       100000  34                              36
12      100000  59                              144
24      100000  150                             576
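One way to read the tables above is to normalize each elapsed time by the number of cell updates. The helper below is only a sketch; the name microsecondsPerCellCycle is made up, and it assumes the mesh is square (Total Cells = cellsXY * cellsXY, as in the tables).

```cpp
#include <cassert>

// Microseconds of wall-clock time per cell per cycle
// for a cellsXY x cellsXY mesh run for `cycles` cycles.
double microsecondsPerCellCycle(int cellsXY, int cycles, double elapsedSeconds) {
    assert(cellsXY > 0 && cycles > 0);
    const double cellCycles = static_cast<double>(cellsXY) * cellsXY * cycles;
    return elapsedSeconds * 1.0e6 / cellCycles;
}
```

For the 1000-cycle runs this gives roughly 2.35 microseconds per cell-cycle at 101 x 101 (24 secs), 2.55 at 201 x 201 (103 secs) and 2.51 at 1001 x 1001 (2520 secs). The smallest meshes are dominated by the one-second timer resolution, so their values are not meaningful.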

This was single processor with no MPI, etc., under Windows 2000. All I/O was turned off within the timed region, so only the Cycle Manager loop overhead (Tecolote loops over models, in this case 1 model) and the floating point calculations were being timed. There were 3 fields involved: Temperature, Conductivity and a TmpField to collect the stencil info. We used difftime and the time_t time functions to collect our data (in seconds), so the timings are only resolved to whole seconds and only coarse-grained behavior can be studied.
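For reference, timing with time() and difftime() looks roughly like the sketch below. This is not the Tecolote Cycle Manager; timeCycles and step are hypothetical names standing in for whatever drives the loop.

```cpp
#include <cassert>
#include <ctime>

// Runs step() `cycles` times and returns the elapsed wall-clock seconds
// as reported by time()/difftime(). Resolution is typically one second.
template <class Step>
double timeCycles(int cycles, Step step) {
    assert(cycles >= 0);
    const std::time_t start = std::time(nullptr);
    for (int c = 0; c < cycles; ++c) {
        step();
    }
    const std::time_t stop = std::time(nullptr);
    return std::difftime(stop, start);
}
```

With a one-second clock, any run shorter than a few tens of seconds carries an error of several percent, which is why the short runs in the tables jump between values like 2 and 3.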

Code:
This is the relation between Conductivity (Lval) and Temperature:

template<class Traits>
void DiffRelation<Traits>::ConFuncT6( const ScalarField& Conductivity,
                                      const ScalarField& Temperature ) {
    Conductivity = (1.0/(2.0*Dim))*pow(Temperature,Real(6.0));
}

This is the relation between TmpField (Lval) and Conductivity and Temperature:
template<class Traits>
void OffsetRelation<Traits>::sumNeighbors( const ScalarField& TmpField,
                                           const ScalarField& Conductivity,
                                           const ScalarField& Temperature ) {
    Interval<Dim> ND = Temperature.domain();
    Loc<Dim> offset;
    TmpField = 0.0;
    for ( int d = 0; d < Dim; ++d ) {
        // Build the unit offset along direction d.
        for ( int off = 0; off < Dim; ++off ) {
            offset[off] = off==d ? 1:0;
        }
        // Accumulate the two neighbors along direction d.
        TmpField(ND) += Temperature(ND+offset)*Conductivity(ND+offset) +
                        Temperature(ND-offset)*Conductivity(ND-offset);
    }
}

So the loop was just over this one line:

Temperature = (1.0-2.0*Conductivity*Dim)*Temperature + TmpField;

which kept causing the two updater dependencies above to be invoked as relations.
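For anyone who wants to compare against hand-written loops, the three statements amount to something like the plain C++ sketch below for Dim = 2. This is not the POOMA code path: the Grid type, the diffusionCycle name, and the boundary treatment (a one-cell guard layer that is simply held fixed) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Square 2-D mesh with a one-cell guard layer, stored row-major.
struct Grid {
    int n;                  // interior cells per side
    std::vector<double> v;  // (n+2)*(n+2) values including guards
    Grid(int n_, double init) : n(n_), v((n_ + 2) * (n_ + 2), init) {}
    double& at(int i, int j) { return v[i * (n + 2) + j]; }
    double at(int i, int j) const { return v[i * (n + 2) + j]; }
};

// One cycle: Conductivity from Temperature, neighbor sum, then the update.
void diffusionCycle(Grid& T) {
    const int Dim = 2;
    const int n = T.n;
    assert(n > 0);
    Grid K(n, 0.0), tmp(n, 0.0);
    // Conductivity = (1/(2*Dim)) * T^6, evaluated everywhere including guards.
    for (int i = 0; i <= n + 1; ++i)
        for (int j = 0; j <= n + 1; ++j)
            K.at(i, j) = (1.0 / (2.0 * Dim)) * std::pow(T.at(i, j), 6.0);
    // TmpField = sum of T*K over the four neighbors of each interior cell.
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            tmp.at(i, j) = T.at(i + 1, j) * K.at(i + 1, j) + T.at(i - 1, j) * K.at(i - 1, j)
                         + T.at(i, j + 1) * K.at(i, j + 1) + T.at(i, j - 1) * K.at(i, j - 1);
    // Temperature = (1 - 2*K*Dim)*T + TmpField on the interior; guards stay fixed.
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            T.at(i, j) = (1.0 - 2.0 * K.at(i, j) * Dim) * T.at(i, j) + tmp.at(i, j);
}
```

A quick sanity check on the sketch: a uniform temperature field is a fixed point of the update, since the (1 - 2*K*Dim)*T term and the neighbor sum add back up to T.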

There are some obvious optimizations that could be made to this code, but it was the relative timings of the optimized and unoptimized executables (a factor of 6), along with these simple scaling studies, that we were interested in.

Hope this is interesting,
John Hall and Richard Williams

