From: Ben Pfaff
Subject: musings on performance
Date: Mon, 08 May 2006 22:14:44 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

Over the last few days I've been thinking about the performance
story for PSPP.  There are lots of low-level
"micro-optimizations" we can do.  Mostly, those are not
interesting to me, although some may actually be valuable.  In
fact, I've been steadily removing micro-optimizations as I clean
up code, or at least the ones that make the code less readable
for questionable benefit.

That said, there are numerous optimizations that are likely to
have real benefit, that we should consider implementing as time
goes on.  The ones that I've come up with so far are below.
They're roughly in order of increasing difficulty and benefit,
and each one tends to depend on the previous.

1. Eliminate the need to cache procedure output.  Currently,
   every procedure reads the entire active file and writes out a
   new copy of it.  For big data sets that's often a huge waste
   of time and space.

   This was almost impossible to get right before I refactored
   the code for implementing procedures.  Now, it should only be
   a few days of work.
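
   Roughly, I imagine the interface looking something like the
   sketch below.  The names here (case_source_read, run_procedure,
   and so on) are made up for illustration, not the real procedure
   API:

        #include <stdbool.h>
        #include <stddef.h>

        /* One row of the active file (illustrative only). */
        struct ccase { double *values; size_t n_values; };

        /* Pulls the next case into *C; returns false at end. */
        typedef bool case_source_read (struct ccase *c, void *aux);

        /* Streams cases from READ straight into PROC, so no new
           copy of the active file is written out unless a later
           procedure actually needs to re-read the data. */
        static void
        run_procedure (case_source_read *read, void *read_aux,
                       void (*proc) (const struct ccase *, void *),
                       void *proc_aux)
        {
          struct ccase c;
          while (read (&c, read_aux))
            proc (&c, proc_aux);
        }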

2. Implement an optimizer for transformations.  SELECT IF can
   often be moved earlier without changing any semantics.
   Creation of variables whose values are never used can be
   dropped entirely.  I suspect that this is especially valuable
   after #1 is implemented because #1 will retain any
   non-temporary transformations from one procedure to the next
   for re-execution, whereas currently that does not happen.

   Doing a basic job of this should be a few days of work.  It's
   probably not worth putting more into it than that (and maybe
   not that much) unless it is demonstrably useful.
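
   For a flavor of what a basic pass could look like, here is a
   sketch of hoisting SELECT IF.  The xform structure and its
   read/write sets are hypothetical, and a real pass would also
   have to worry about transformations with side effects (WRITE,
   PRINT) and the LAG and $CASENUM issues mentioned under #3
   below:

        #include <stdbool.h>
        #include <stddef.h>

        /* Hypothetical transformation record.  READS and WRITES
           are bit masks of the variables it examines and sets. */
        struct xform
          {
            bool is_filter;         /* Is this a SELECT IF? */
            unsigned long reads;
            unsigned long writes;
          };

        /* Moves each filter ahead of every earlier transformation
           that writes no variable the filter reads, so that cases
           destined to be dropped are dropped before we spend any
           work on them. */
        static void
        hoist_filters (struct xform *x, size_t n)
        {
          for (size_t i = 1; i < n; i++)
            if (x[i].is_filter)
              {
                struct xform f = x[i];
                size_t j = i;
                while (j > 0 && (x[j - 1].writes & f.reads) == 0)
                  j--;
                for (size_t k = i; k > j; k--)
                  x[k] = x[k - 1];
                x[j] = f;
              }
        }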

3. (This is where it gets more interesting, in my opinion.)
   Transformations are almost completely parallelizable on a
   case-by-case basis.  That is, any given case can be
   transformed without knowledge of any other case.  (Only two
   exceptions come to mind, off-hand, which are the LAG function
   and the $CASENUM variable in expressions.  Probably there are
   a few others.)

   Furthermore, I suspect that (and Jason can correct me on this)
   much statistical analysis is susceptible to efficient parallel
   implementation via "map-reduce", which is a two-step model:

        i. "Map" - A master program partitions the data set into
           arbitrarily sized collections of cases ("chunks").  It
           hands each chunk to a different CPU or computer, which
           processes it in some specified manner and hands the
           result (in whatever format) back to the master
           program.

        ii. "Reduce" - The master program combines the results,
            in some manner, to form a single output.

   As a trivial example, imagine that we want the mean of 1e12
   values.  The master program could break the values into 100
   chunks of 1e10 values each, and tell 100 other computers to each
   find the mean of a single chunk.  Those computers would each
   report back their single value and the master program would in
   turn take the mean of those 100 values, yielding the overall
   mean of all 1e12 values.
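
   One caveat: taking the mean of the 100 partial means is only
   correct because the chunks are equal-sized.  A more general
   reduce step combines (sum, count) pairs instead, which works
   however the cases are partitioned.  A minimal sketch (the names
   are mine, not anything in PSPP):

        #include <stddef.h>

        struct partial { double sum; size_t n; };

        /* "Map": each worker boils its chunk down to a partial. */
        static struct partial
        map_chunk (const double *x, size_t n)
        {
          struct partial p = { 0.0, n };
          for (size_t i = 0; i < n; i++)
            p.sum += x[i];
          return p;
        }

        /* "Reduce": combine partials into the overall mean.
           Correct even when the chunks have different sizes. */
        static double
        reduce_mean (const struct partial *p, size_t n_chunks)
        {
          double sum = 0.0;
          size_t n = 0;
          for (size_t i = 0; i < n_chunks; i++)
            {
              sum += p[i].sum;
              n += p[i].n;
            }
          return sum / n;
        }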

   (Google is responsible for popularizing the map-reduce model.
   If you haven't read the paper on it, I'd recommend doing so:
   http://labs.google.com/papers/mapreduce.html
   It is inspirational.)

   In step 3, we would use threads plus map-reduce or some other
   parallel programming model to improve performance on symmetric
   multiprocessing (SMP) machines.

   This is a good deal of work.  The basic implementation of the
   map-reduce model and support code (e.g. casefile partitioning
   and thread-safe access to data) would take a while.  Plus,
   we'd have to modify statistical procedures to take advantage
   of it.  But it could potentially yield a big benefit.
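
   As a rough illustration of the threads half of this, farming
   transformation of a chunked case array out to POSIX threads
   might look like the sketch below.  The flat array of cases and
   the transform_case function are stand-ins for whatever casefile
   interface we would really use:

        #include <pthread.h>
        #include <stddef.h>

        struct ccase;                          /* Opaque here. */
        void transform_case (struct ccase *);  /* Hypothetical
                                                  per-case work. */

        struct chunk { struct ccase **cases; size_t n; };

        /* Worker: transforms every case in one chunk.  Because
           (LAG and $CASENUM aside) each case is independent, no
           locking is needed while transforming. */
        static void *
        worker (void *chunk_)
        {
          struct chunk *c = chunk_;
          for (size_t i = 0; i < c->n; i++)
            transform_case (c->cases[i]);
          return NULL;
        }

        /* Splits CASES into N_THREADS chunks, transforms them in
           parallel, and waits for all of them to finish. */
        static void
        transform_parallel (struct ccase **cases, size_t n_cases,
                            size_t n_threads)
        {
          pthread_t thread[n_threads];
          struct chunk chunk[n_threads];
          size_t per = (n_cases + n_threads - 1) / n_threads;

          for (size_t i = 0; i < n_threads; i++)
            {
              size_t start = i * per < n_cases ? i * per : n_cases;
              size_t end = start + per < n_cases ? start + per : n_cases;
              chunk[i].cases = cases + start;
              chunk[i].n = end - start;
              pthread_create (&thread[i], NULL, worker, &chunk[i]);
            }
          for (size_t i = 0; i < n_threads; i++)
            pthread_join (thread[i], NULL);
        }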

4. Take #3 and then extend it, by allowing jobs to be farmed out
   not just to threads but to processes running on other
   computers as well.  I won't speculate on how much work this
   would be, but it's clearly a big job.

If we implemented all of these, or even just through #3, I bet
PSPP would be one of the best things out there for dealing with
large amounts of data.

Let me add that this is all really just a musing.  I can't
imagine when I'll get a chance to work on it.  Oh, except that #1
is on my list of things to do whenever I get a chance.  The
others are just speculative.
-- 
Ben Pfaff 
email: address@hidden
web: http://benpfaff.org



