Re: musings on performance
From: John Darrington
Subject: Re: musings on performance
Date: Tue, 9 May 2006 14:57:02 +0800
User-agent: Mutt/1.5.9i
Disclaimer: I'm not an expert in this field.
On Mon, May 08, 2006 at 10:14:44PM -0700, Ben Pfaff wrote:
> Over the last few days I've been thinking about the performance
> story for PSPP.
>
> [...]
>
> That said, there are numerous optimizations that are likely to
> have real benefit, that we should consider implementing as time
> goes on.
It would be useful to have some benchmark tests, so that we can see
the effect of each of them.  Benchmarking would also be good for
marketing purposes.
> 3. (This is where it gets more interesting, in my opinion.)
>    Transformations are almost completely parallelizable on a
>    case-by-case basis.  That is, any given case can be
>    transformed without knowledge of any other case.
>
>    [...]
>
>    Furthermore, I suspect that (and Jason can correct me on this)
>    much statistical analysis is susceptible to efficient parallel
>    implementation via "map-reduce", which is a two-step model:
>
>      i. "Map" - A master program partitions the data set into
>         arbitrary sized collections of cases ("chunks").  It
>         hands each chunk to a different CPU or computer, which
>         processes it in some specified manner and hands the
>         result (in whatever format) back to the master
>         program.
>
>     ii. "Reduce" - The master program combines the results,
>         in some manner, to form a single output.
> As a trivial example, imagine that we want the mean of 1e12
> values.  The master program could break the values into 100
> chunks of 1e10 values each, and tell 100 other computers to each
> find the mean of a single chunk.  Those computers would each
> report back their single value and the master program would in
> turn take the mean of those 100 values, yielding the overall
> mean of all 1e12 values.
Do you have access to a 100-machine cluster to test it?
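Even without a cluster to hand, the single-machine version of that idea
needs nothing more exotic than threads.  Here is a minimal, purely
illustrative sketch (made-up data, not PSPP code) of the "map" and
"reduce" steps for the mean, using POSIX threads:

/* Illustrative sketch only: "map-reduce" mean on one SMP machine,
   using POSIX threads in place of a cluster.  Not PSPP code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4

struct chunk
  {
    const double *data;         /* first value in this chunk */
    size_t n;                   /* number of values in this chunk */
    double sum;                 /* "map" result: the partial sum */
  };

static void *
sum_chunk (void *chunk_)
{
  struct chunk *chunk = chunk_;
  double sum = 0.0;
  for (size_t i = 0; i < chunk->n; i++)
    sum += chunk->data[i];
  chunk->sum = sum;
  return NULL;
}

int
main (void)
{
  size_t n = 1000000;
  double *values = malloc (n * sizeof *values);
  for (size_t i = 0; i < n; i++)
    values[i] = i;              /* dummy data */

  /* "Map": hand each chunk of the data to a worker thread. */
  pthread_t workers[NWORKERS];
  struct chunk chunks[NWORKERS];
  size_t per_chunk = n / NWORKERS;
  for (int i = 0; i < NWORKERS; i++)
    {
      chunks[i].data = values + i * per_chunk;
      chunks[i].n = i == NWORKERS - 1 ? n - i * per_chunk : per_chunk;
      pthread_create (&workers[i], NULL, sum_chunk, &chunks[i]);
    }

  /* "Reduce": combine the partial results into the overall mean. */
  double total = 0.0;
  for (int i = 0; i < NWORKERS; i++)
    {
      pthread_join (workers[i], NULL);
      total += chunks[i].sum;
    }
  printf ("mean = %g\n", total / n);

  free (values);
  return 0;
}

Each worker hands back a partial sum rather than a partial mean, so the
chunks don't all have to be exactly the same size.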
> In step 3, we would use threads plus map-reduce or some other
> parallel programming model to improve performance on symmetric
> multiprocessing (SMP) machines.
>
> 4. Take #3 and then extend it, by allowing jobs to be farmed out
>    not just to threads but to processes running on other
>    computers as well.  I won't speculate on how much work this
>    would be, but it's clearly a big job.
If you implement #3 using MPI, then there's nothing to be done for #4.
However, I dabbled in MPI parallel processing a few years ago and
struggled to come up with a real-life problem which was large enough
to overcome the extra overhead.
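For what it's worth, the same reduction in MPI looks roughly like the
sketch below (untested, and not PSPP code).  The point is that the
identical program can run as several processes on one SMP box or be
spread across a cluster, which is why #4 then comes almost for free.

/* Rough MPI sketch of the same partial-sum reduction.  Untested and
   purely illustrative; each process fakes its own chunk of data. */
#include <mpi.h>
#include <stdio.h>

int
main (int argc, char **argv)
{
  MPI_Init (&argc, &argv);

  int rank;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  /* "Map": each process sums the chunk of cases it owns.  Here the
     chunk is just made up from the process's rank. */
  long chunk_n = 1000;
  double chunk_sum = 0.0;
  for (long i = 0; i < chunk_n; i++)
    chunk_sum += rank * chunk_n + i;

  /* "Reduce": combine the partial sums and counts on rank 0. */
  double total_sum = 0.0;
  long total_n = 0;
  MPI_Reduce (&chunk_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM,
              0, MPI_COMM_WORLD);
  MPI_Reduce (&chunk_n, &total_n, 1, MPI_LONG, MPI_SUM,
              0, MPI_COMM_WORLD);

  if (rank == 0)
    printf ("mean of %ld values = %g\n", total_n, total_sum / total_n);

  MPI_Finalize ();
  return 0;
}

Built with mpicc and launched with mpirun, the same binary covers both
the single-machine and the cluster case.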
Some of the math for statistical procedures would need to be
parallelised inside gsl --- I don't know if gsl supports parallel
execution.  For example, most matrix operations can be parallelised,
which is worth doing for very large matrices.
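To illustrate what I mean: one could split an operation such as y = A*x
by rows and let each thread call the ordinary gsl routine on its own
block.  This is only a sketch with made-up names; as far as I know gsl
itself offers no such interface:

/* Hypothetical sketch: parallelising y = A*x by giving each thread a
   block of rows of A and letting it call the ordinary gsl routine on
   that block.  Not gsl or PSPP code; the names are made up. */
#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

struct dgemv_job
  {
    gsl_matrix_const_view a_rows;   /* this thread's block of rows of A */
    const gsl_vector *x;
    gsl_vector_view y_rows;         /* the corresponding block of y */
  };

static void *
dgemv_block (void *job_)
{
  struct dgemv_job *job = job_;
  gsl_blas_dgemv (CblasNoTrans, 1.0, &job->a_rows.matrix, job->x,
                  0.0, &job->y_rows.vector);
  return NULL;
}

/* Computes y = A*x with NTHREADS threads.  For brevity, assumes the
   number of rows of A divides evenly by NTHREADS. */
static void
parallel_dgemv (const gsl_matrix *a, const gsl_vector *x, gsl_vector *y)
{
  pthread_t threads[NTHREADS];
  struct dgemv_job jobs[NTHREADS];
  size_t rows_per = a->size1 / NTHREADS;

  for (int i = 0; i < NTHREADS; i++)
    {
      jobs[i].a_rows = gsl_matrix_const_submatrix (a, i * rows_per, 0,
                                                   rows_per, a->size2);
      jobs[i].x = x;
      jobs[i].y_rows = gsl_vector_subvector (y, i * rows_per, rows_per);
      pthread_create (&threads[i], NULL, dgemv_block, &jobs[i]);
    }

  for (int i = 0; i < NTHREADS; i++)
    pthread_join (threads[i], NULL);
}

int
main (void)
{
  gsl_matrix *a = gsl_matrix_alloc (8, 8);
  gsl_vector *x = gsl_vector_alloc (8);
  gsl_vector *y = gsl_vector_alloc (8);
  gsl_matrix_set_identity (a);
  gsl_vector_set_all (x, 2.0);

  parallel_dgemv (a, x, y);
  printf ("y[0] = %g\n", gsl_vector_get (y, 0));

  gsl_matrix_free (a);
  gsl_vector_free (x);
  gsl_vector_free (y);
  return 0;
}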
On the downside ...
I'm acutely aware that in some places the code performs very badly.
In particular, operations which involve percentiles are currently
implemented in a very suboptimal manner and, in fact, will probably
exhaust memory if passed very large data sets.
Also, there are opportunities to cache things that procedures use.
For example, most parametric procedures make use of the data's
covariance matrix.  If we can let that persist between procedures, we
will avoid repeating a lot of calculations, so long as we invalidate
that cache when appropriate.
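Something along the following lines is what I have in mind.  All of
the names are hypothetical; none of this exists in PSPP today, and the
real computation is just a stub here:

/* Hypothetical sketch of a covariance cache shared between procedures.
   None of these names exist in PSPP today; compute_covariance () is a
   stub standing in for the real calculation over the active file. */
#include <gsl/gsl_matrix.h>
#include <stdbool.h>

struct cov_cache
  {
    gsl_matrix *cov;            /* cached covariance matrix, or NULL */
    bool valid;                 /* false once the data has changed */
  };

/* Stub: pretend to compute the covariance matrix of the active file. */
static gsl_matrix *
compute_covariance (void)
{
  gsl_matrix *m = gsl_matrix_calloc (3, 3);
  gsl_matrix_set_identity (m);
  return m;
}

/* Returns the covariance matrix, recomputing it only if the cache has
   been invalidated since a procedure last asked for it. */
const gsl_matrix *
cov_cache_get (struct cov_cache *c)
{
  if (!c->valid)
    {
      if (c->cov != NULL)
        gsl_matrix_free (c->cov);
      c->cov = compute_covariance ();
      c->valid = true;
    }
  return c->cov;
}

/* Any transformation that modifies the active data must call this, so
   that the next procedure recomputes rather than reuses the matrix. */
void
cov_cache_invalidate (struct cov_cache *c)
{
  c->valid = false;
}

The important part is that every transformation which modifies the
active data calls cov_cache_invalidate (), so a procedure can never
pick up a stale matrix.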
It's not going to be much use having a PSPP which can copy data from A
to B at the speed of light if the procedures take a year to execute.
J'
--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.