[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: musings on performance
From: |
Jason Stover |
Subject: |
Re: musings on performance |
Date: |
Tue, 9 May 2006 16:15:18 -0400 |
User-agent: |
Mutt/1.5.10i |
On Mon, May 08, 2006 at 10:14:44PM -0700, Ben Pfaff wrote:
> Furthermore, I suspect that (and Jason can correct me on this)
> much statistical analysis is susceptible to efficient parallel
> implementation via "map-reduce", which is a two-step model:
>
> i. "Map" - A master program partitions the data set into
> arbitrary sized collections of cases ("chunks"). It
> hands each chunk to a different CPU or computer, which
> processes it in some specified manner and hands the
> result (in whatever format) back to the master
> program.
>
> ii. "Reduce" - The master program combines the results,
> in some manner, to form a single output.
If we want the output from map-reduce and from non-map-reduce to be
indistinguishable, then there are some analyses that can be done this
way, but I do not know of a general rule for deciding how to
proceed. We would need a parallel linear algebra package, as John
mentioned. GSL doesn't do that. I don't know of any GPL'd
high-performance numerical linear algebra package in C (there has been
kvetching about this on the GSL list).
BUT: Many analyses on big data sets that fall under the term
'data-mining' do combine models, each of which could be fit to one
chunk of data, then all of them combined at the end. The output is
not exactly what you would see if you fit a single model to a
single large data set, but the estimated model would make predictions
that are comparable if not superior to a single estimated model. In
this scenario, we wouldn't need anything more than GSL's linear
algebra package since any one thread would have only a small piece of
the data.
I was actually thinking of writing something along this line this
summer, though I was just thinking about aggregating models, rather
than parallelizing their fitting procedures. I guess the two go
together.
One tangential point: estimated statistical models get better as the
number of data increase (assuming the model is a good one and the data
were randomly drawn). Having 1e12 data isn't much better than
having 1e6 data. Using SELECT IF to randomly draw a subset of a big
data set would suffice for most models. In fact, we should probably
write something like casefile_get_random_reader() to randomly read
cases from a large data source.
-Jason