timings using optimized codewarrior

freepooma-devel

From:	John Hall
Subject:	timings using optimized codewarrior
Date:	Tue, 5 Jun 2001 21:55:45 -0600

Gang:

Well, my student and I just got an optimized version of a simple little diffusion stencil using Metrowerks Codewarrior, and frankly the results are kind of interesting. First, optimized code runs around 6 times faster than unoptimized code. Dave Nystrom and I seem to recall that optimized code under R1 ran 10-20 times faster than unoptimized using KCC. I don't want to attach too much significance to this, though. It is an indicator that either the Metrowerks optimizer is not all that hot (my belief) or that the abstractions of R2 are somehow less onerous in debug mode (also a possibility).

Anyhow, this was a 2-D diffusion stencil, and no matter what we tried we always got a linear response in the timing study. For the first block of data below, this is a good result, since we were just running the same size problem for more cycles.

But we tried to run larger and larger problems to get us to go out of L2 cache (on this PIII we had a 256K L2 cache), and we were never able to notice a drop. The timing stayed linear versus the total number of cells. This will make more sense when we can convert the units to something like MFlops. So either you guys have done something really impressive regarding cache utilization, or we are running so slowly that cache misses are not noticeable.

Also, we ran optimized on a Mac and a PC, and the results differed by exactly the difference in clock speed. This was a surprise, since Mac advocates had always claimed that the Motorola floating point performance was better than that of Intel at a given clock rate. (This was for a 650 MHz PIII laptop and a 500 MHz G3 laptop.)

As an aside, the Brick Engine ran reproducibly slightly slower than compressibleBrick (only about 5 percent, so it was basically a dead heat), but I would have expected Brick to be faster than compressibleBrick (and probably by more than a few percent, since it should have less overhead).


Anyhow, here is the raw data for the Brick PIII runs:
cellsXY cycles  elapsed time (secs), runs 1-3   Total Cells
101     1000    24      24      24              10201
101     2000    48      48      47              10201
101     3000    72      71      72              10201
101     5000    119     122     121             10201
101     10000   243     243     244             10201
101     25000   608     607     606             10201

25      1000    2       2       2               625
51      1000    6       5       5               2601
101     1000    24      24      24              10201
201     1000    103     103     103             40401
501     1000    635                             251001
1001    1000    2520                            1002001


15      5000    4       4       3               225
25      5000    8       9       8               625
51      5000    29      29      28              2601
75      5000    61      61      61              5625
101     5000    124                             10201
201     5000    518                             40401
401     5000    2054                            160801


3       10000   3                               9
6       10000   3                               36
12      10000   6                               144
24      10000   15                              576

3       100000  28                              9
6       100000  34                              36
12      100000  59                              144
24      100000  150                             576
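
One way to read the table above is to normalize each elapsed time by cells times cycles. The sketch below does that; for example, the 101x101, 1000-cycle run works out to about 2.35 microseconds per cell per cycle, and the 1001x1001 run to about 2.51, which is the flat (linear) behavior described earlier. The function names are made up for illustration, and `flopsPerCellCycle` is a placeholder you would have to supply yourself, since no flop count for the stencil is given here.

```cpp
#include <cassert>

// Hypothetical helper: time spent per cell per cycle, in microseconds.
// elapsedSeconds, cells and cycles come straight from one row of the table.
double microsecondsPerCellCycle(double elapsedSeconds, long cells, long cycles) {
    assert(cells > 0 && cycles > 0);
    const double work = static_cast<double>(cells) * static_cast<double>(cycles);
    return elapsedSeconds * 1.0e6 / work;
}

// Hypothetical helper: rough MFlops estimate.
// flopsPerCellCycle is an ASSUMED operation count per cell per cycle;
// it is not stated anywhere in the timing data and must be supplied by the caller.
double estimatedMFlops(double elapsedSeconds, long cells, long cycles,
                       double flopsPerCellCycle) {
    assert(elapsedSeconds > 0.0 && cells > 0 && cycles > 0);
    const double work = static_cast<double>(cells) * static_cast<double>(cycles);
    return flopsPerCellCycle * work / (elapsedSeconds * 1.0e6);
}
```

If the per-cell cost stays roughly constant as the problem grows past the cache size, that matches the "no visible drop" observation; a jump in this number at larger cell counts would be the cache effect we were looking for.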

This was single processor with no MPI, etc., under Win 2000. All I/O was turned off within the timed region, so it was just Cycle Manager loop overhead (Tecolote loops over models, in this case 1 model) and floating point calculations being timed. There were 3 fields involved: Temperature, Conductivity, and a TmpField to collect the stencil info. We used difftime and the time_t time functions to collect our data (in seconds), so only coarse-grained timings (whole seconds) can be studied.
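
For reference, timing with time_t and difftime looks roughly like the sketch below. This is not the actual Cycle Manager code; `timeCycles` and the `work` callback are hypothetical names used only to show the pattern. On common platforms time() ticks once per second, which is why short runs in the table show up as 2 or 3 seconds with no finer detail.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: run `work` once per cycle and return the elapsed
// wall-clock time in seconds as reported by time()/difftime().
// The resolution is whatever time_t provides, typically whole seconds.
template <class Work>
double timeCycles(long cycles, Work work) {
    assert(cycles >= 0);
    const std::time_t start = std::time(nullptr);
    for (long c = 0; c < cycles; ++c) {
        work(c);  // stand-in for one pass of the timed loop body
    }
    const std::time_t stop = std::time(nullptr);
    return std::difftime(stop, start);  // seconds, as a double
}
```

A call would look like `timeCycles(1000, [&](long) { /* one cycle */ })`. Anything finer than about a second needs a different clock.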


Code:
This is the relation between Conductivity (Lval) and Temperature:

template<class Traits>
void DiffRelation<Traits>::ConFuncT6( const ScalarField& Conductivity,
                                      const ScalarField& Temperature ) {
    Conductivity = (1.0/(2.0*Dim))*pow(Temperature,Real(6.0));
}

This is the relation between TmpField (Lval) and Conductivity and Temperature:
template<class Traits>
void OffsetRelation<Traits>::sumNeighbors(const ScalarField& TmpField,
        const ScalarField& Conductivity, const ScalarField& Temperature ) {
        Interval<Dim> ND = Temperature.domain();
        Loc<Dim> offset;
        TmpField = 0.0;
        for ( int d = 0; d < Dim; ++d ) {
                for ( int off = 0; off < Dim; ++off ) {
                  offset[off] = off==d ? 1:0;
                }

                TmpField(ND) += Temperature(ND+offset)*Conductivity(ND+offset) +
                                Temperature(ND-offset)*Conductivity(ND-offset);
        }
}

So the loop was just over this one line:

Temperature = (1.0-2.0*Conductivity*Dim)*Temperature + TmpField;

which kept causing the other two updater dependencies above to get called as relations.
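
For anyone who wants to see the arithmetic without the field and relation machinery, here is a plain-array sketch of one cycle in 2-D. It is not the Tecolote/POOMA code: `Grid2D` and `diffusionCycle` are hypothetical names, and the boundary treatment (update interior cells only, hold the outermost layer fixed) is an assumption made so the example is self-contained.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2-D scalar field holding n x n cells.
struct Grid2D {
    std::size_t n;
    std::vector<double> v;
    Grid2D(std::size_t n_, double init) : n(n_), v(n_ * n_, init) {}
    double& operator()(std::size_t i, std::size_t j) { return v[i * n + j]; }
    double operator()(std::size_t i, std::size_t j) const { return v[i * n + j]; }
};

// One cycle: the two relations followed by the Temperature update.
// ASSUMPTION: only interior cells are updated; boundary cells stay fixed.
void diffusionCycle(Grid2D& T, Grid2D& K, Grid2D& tmp) {
    const int Dim = 2;
    assert(T.n == K.n && T.n == tmp.n && T.n >= 3);
    const std::size_t n = T.n;

    // Conductivity = (1/(2*Dim)) * Temperature^6, evaluated on every cell.
    for (std::size_t c = 0; c < T.v.size(); ++c) {
        K.v[c] = (1.0 / (2.0 * Dim)) * std::pow(T.v[c], 6.0);
    }

    // TmpField = sum over the four neighbors of Temperature * Conductivity.
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            tmp(i, j) = T(i + 1, j) * K(i + 1, j) + T(i - 1, j) * K(i - 1, j) +
                        T(i, j + 1) * K(i, j + 1) + T(i, j - 1) * K(i, j - 1);
        }
    }

    // Temperature = (1 - 2*Conductivity*Dim)*Temperature + TmpField.
    // Safe in place: each cell reads only its own T, K and tmp values.
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            T(i, j) = (1.0 - 2.0 * K(i, j) * Dim) * T(i, j) + tmp(i, j);
        }
    }
}
```

A quick sanity check on this sketch: a uniform temperature of 1.0 gives a conductivity of 0.25 and a neighbor sum of 1.0, so the update leaves the field unchanged.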

There are some obvious optimizations which can be performed on this code, but it was the relative timings of optimized and unoptimized executables (factor of 6), along with these simple scaling studies, that we were interested in.


Hope this is interesting,
John Hall and Richard Williams
