Re: musings on performance
From: John Darrington
Subject: Re: musings on performance
Date: Tue, 9 May 2006 14:57:02 +0800
User-agent: Mutt/1.5.9i
Disclaimer: I'm not an expert in this field.
On Mon, May 08, 2006 at 10:14:44PM -0700, Ben Pfaff wrote:
> Over the last few days I've been thinking about the performance
> story for PSPP.
>
> [...]
>
> That said, there are numerous optimizations that are likely to
> have real benefit, that we should consider implementing as time
> goes on.
It would be useful to have some benchmark tests, so that we can see
the effect of each of them.  Benchmarking would also be good for
marketing purposes.
> 3. (This is where it gets more interesting, in my opinion.)
>    Transformations are almost completely parallelizable on a
>    case-by-case basis.  That is, any given case can be
>    transformed without knowledge of any other case.
>
>    [...]
>
>    Furthermore, I suspect that (and Jason can correct me on this)
>    much statistical analysis is susceptible to efficient parallel
>    implementation via "map-reduce", which is a two-step model:
>
>      i. "Map" - A master program partitions the data set into
>         arbitrary sized collections of cases ("chunks").  It
>         hands each chunk to a different CPU or computer, which
>         processes it in some specified manner and hands the
>         result (in whatever format) back to the master
>         program.
>
>     ii. "Reduce" - The master program combines the results,
>         in some manner, to form a single output.
> As a trivial example, imagine that we want the mean of 1e12
> values.  The master program could break the values into 100
> chunks of 1e10 values each, and tell 100 other computers to each
> find the mean of a single chunk.  Those computers would each
> report back their single value and the master program would in
> turn take the mean of those 100 values, yielding the overall
> mean of all 1e12 values.
Do you have access to a 100-machine cluster to test it?
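Even without a cluster to hand, the single-machine version of that idea
needs nothing more exotic than threads.  Here is a minimal, purely
illustrative sketch (made-up data, not PSPP code) of the "map" and
"reduce" steps for the mean, using POSIX threads:

/* Illustrative sketch only: "map-reduce" mean on one SMP machine,
   using POSIX threads in place of a cluster.  Not PSPP code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4

struct chunk
  {
    const double *data;         /* first value in this chunk */
    size_t n;                   /* number of values in this chunk */
    double sum;                 /* "map" result: the partial sum */
  };

static void *
sum_chunk (void *chunk_)
{
  struct chunk *chunk = chunk_;
  double sum = 0.0;
  for (size_t i = 0; i < chunk->n; i++)
    sum += chunk->data[i];
  chunk->sum = sum;
  return NULL;
}

int
main (void)
{
  size_t n = 1000000;
  double *values = malloc (n * sizeof *values);
  for (size_t i = 0; i < n; i++)
    values[i] = i;              /* dummy data */

  /* "Map": hand each chunk of the data to a worker thread. */
  pthread_t workers[NWORKERS];
  struct chunk chunks[NWORKERS];
  size_t per_chunk = n / NWORKERS;
  for (int i = 0; i < NWORKERS; i++)
    {
      chunks[i].data = values + i * per_chunk;
      chunks[i].n = i == NWORKERS - 1 ? n - i * per_chunk : per_chunk;
      pthread_create (&workers[i], NULL, sum_chunk, &chunks[i]);
    }

  /* "Reduce": combine the partial results into the overall mean. */
  double total = 0.0;
  for (int i = 0; i < NWORKERS; i++)
    {
      pthread_join (workers[i], NULL);
      total += chunks[i].sum;
    }
  printf ("mean = %g\n", total / n);

  free (values);
  return 0;
}

Each worker hands back a partial sum rather than a partial mean, so the
chunks don't all have to be exactly the same size.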
> In step 3, we would use threads plus map-reduce or some other
> parallel programming model to improve performance on symmetric
> multiprocessing (SMP) machines.
>
> 4. Take #3 and then extend it, by allowing jobs to be farmed out
>    not just to threads but to processes running on other
>    computers as well.  I won't speculate on how much work this
>    would be, but it's clearly a big job.
If you implement #3 using MPI, then there's nothing to be done for #4.
However, I dabbled in MPI parallel processing a few years ago and
struggled to come up with a real-life problem which was large enough
to overcome the extra overhead.
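For what it's worth, the same reduction in MPI looks roughly like the
sketch below (untested, and not PSPP code).  The point is that the
identical program can run as several processes on one SMP box or be
spread across a cluster, which is why #4 then comes almost for free.

/* Rough MPI sketch of the same partial-sum reduction.  Untested and
   purely illustrative; each process fakes its own chunk of data. */
#include <mpi.h>
#include <stdio.h>

int
main (int argc, char **argv)
{
  MPI_Init (&argc, &argv);

  int rank;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  /* "Map": each process sums the chunk of cases it owns.  Here the
     chunk is just made up from the process's rank. */
  long chunk_n = 1000;
  double chunk_sum = 0.0;
  for (long i = 0; i < chunk_n; i++)
    chunk_sum += rank * chunk_n + i;

  /* "Reduce": combine the partial sums and counts on rank 0. */
  double total_sum = 0.0;
  long total_n = 0;
  MPI_Reduce (&chunk_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM,
              0, MPI_COMM_WORLD);
  MPI_Reduce (&chunk_n, &total_n, 1, MPI_LONG, MPI_SUM,
              0, MPI_COMM_WORLD);

  if (rank == 0)
    printf ("mean of %ld values = %g\n", total_n, total_sum / total_n);

  MPI_Finalize ();
  return 0;
}

Built with mpicc and launched with mpirun, the same binary covers both
the single-machine and the cluster case.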
Some of the math for statistical procedures would need to be
parallelised inside gsl --- I don't know if gsl supports parallel
execution.  For example, most matrix operations can be parallelised,
which is worth doing for very large matrices.
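To illustrate what I mean: one could split an operation such as y = A*x
by rows and let each thread call the ordinary gsl routine on its own
block.  This is only a sketch with made-up names; as far as I know gsl
itself offers no such interface:

/* Hypothetical sketch: parallelising y = A*x by giving each thread a
   block of rows of A and letting it call the ordinary gsl routine on
   that block.  Not gsl or PSPP code; the names are made up. */
#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

struct dgemv_job
  {
    gsl_matrix_const_view a_rows;   /* this thread's block of rows of A */
    const gsl_vector *x;
    gsl_vector_view y_rows;         /* the corresponding block of y */
  };

static void *
dgemv_block (void *job_)
{
  struct dgemv_job *job = job_;
  gsl_blas_dgemv (CblasNoTrans, 1.0, &job->a_rows.matrix, job->x,
                  0.0, &job->y_rows.vector);
  return NULL;
}

/* Computes y = A*x with NTHREADS threads.  For brevity, assumes the
   number of rows of A divides evenly by NTHREADS. */
static void
parallel_dgemv (const gsl_matrix *a, const gsl_vector *x, gsl_vector *y)
{
  pthread_t threads[NTHREADS];
  struct dgemv_job jobs[NTHREADS];
  size_t rows_per = a->size1 / NTHREADS;

  for (int i = 0; i < NTHREADS; i++)
    {
      jobs[i].a_rows = gsl_matrix_const_submatrix (a, i * rows_per, 0,
                                                   rows_per, a->size2);
      jobs[i].x = x;
      jobs[i].y_rows = gsl_vector_subvector (y, i * rows_per, rows_per);
      pthread_create (&threads[i], NULL, dgemv_block, &jobs[i]);
    }

  for (int i = 0; i < NTHREADS; i++)
    pthread_join (threads[i], NULL);
}

int
main (void)
{
  gsl_matrix *a = gsl_matrix_alloc (8, 8);
  gsl_vector *x = gsl_vector_alloc (8);
  gsl_vector *y = gsl_vector_alloc (8);
  gsl_matrix_set_identity (a);
  gsl_vector_set_all (x, 2.0);

  parallel_dgemv (a, x, y);
  printf ("y[0] = %g\n", gsl_vector_get (y, 0));

  gsl_matrix_free (a);
  gsl_vector_free (x);
  gsl_vector_free (y);
  return 0;
}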
On the downside ...
I'm acutely aware that in some places the code performs very badly.
In particular, operations which involve percentiles are currently
implemented in a very suboptimal manner and, in fact, will probably
exhaust memory if passed very large data sets.
Also, there are opportunities to cache things that procedures use.
For example, most parametric procedures make use of the data's
covariance matrix.  If we can let that persist between procedures, we
will avoid repeating a lot of calculations, so long as we invalidate
that cache when appropriate.
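Something along the following lines is what I have in mind.  All of
the names are hypothetical; none of this exists in PSPP today, and the
real computation is just a stub here:

/* Hypothetical sketch of a covariance cache shared between procedures.
   None of these names exist in PSPP today; compute_covariance () is a
   stub standing in for the real calculation over the active file. */
#include <gsl/gsl_matrix.h>
#include <stdbool.h>

struct cov_cache
  {
    gsl_matrix *cov;            /* cached covariance matrix, or NULL */
    bool valid;                 /* false once the data has changed */
  };

/* Stub: pretend to compute the covariance matrix of the active file. */
static gsl_matrix *
compute_covariance (void)
{
  gsl_matrix *m = gsl_matrix_calloc (3, 3);
  gsl_matrix_set_identity (m);
  return m;
}

/* Returns the covariance matrix, recomputing it only if the cache has
   been invalidated since a procedure last asked for it. */
const gsl_matrix *
cov_cache_get (struct cov_cache *c)
{
  if (!c->valid)
    {
      if (c->cov != NULL)
        gsl_matrix_free (c->cov);
      c->cov = compute_covariance ();
      c->valid = true;
    }
  return c->cov;
}

/* Any transformation that modifies the active data must call this, so
   that the next procedure recomputes rather than reuses the matrix. */
void
cov_cache_invalidate (struct cov_cache *c)
{
  c->valid = false;
}

The important part is that every transformation which modifies the
active data calls cov_cache_invalidate (), so a procedure can never
pick up a stale matrix.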
It's not going to be much use having a PSPP which can copy data from A
to B at the speed of light if the procedures take a year to execute.
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.