Re: Number of histogram bins
From: Ben Pfaff
Subject: Re: Number of histogram bins
Date: Sun, 05 Dec 2004 12:18:56 -0800
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

John Darrington <address@hidden> writes:
> Does anyone know how spss decides the number of bins to construct a
> histogram? Or can anyone suggest a suitable algorithm for doing so?
PSPP 0.1.0 had vestigial support for plotting histograms. At the
time, if I recall correctly, I looked in some detail at how
SPSS/PC+ chose the number of bins. Here's the code that PSPP 0.1.0
used to decide:

    #define MIN_HIST_BARS 3
    #define MAX_HIST_BARS 20
    ...
    double upper = /* maximum value in data */;
    double lower = /* minimum value in data */;
    if (upper - lower >= 10)
      {
        double l, u;
        u = round_up (upper, 5);
        l = round_down (lower, 5);
        nbars = (u - l) / 5;
        if (nbars * 2 + 1 <= MAX_HIST_BARS)
          {
            nbars *= 2;
            u = round_up (upper, 2.5);
            l = round_down (lower, 2.5);
            if (l + 1.25 <= lower && u - 1.25 >= upper)
              nbars--, lower = l + 1.25, upper = u - 1.25;
            else if (l + 1.25 <= lower)
              lower = l + 1.25, upper = u + 1.25;
            else if (u - 1.25 >= upper)
              lower = l - 1.25, upper = u - 1.25;
            else
              nbars++, lower = l - 1.25, upper = u + 1.25;
          }
        else if (nbars < MAX_HIST_BARS)
          {
            if (l + 2.5 <= lower && u - 2.5 >= upper)
              nbars--, lower = l + 2.5, upper = u - 2.5;
            else if (l + 2.5 <= lower)
              lower = l + 2.5, upper = u + 2.5;
            else if (u - 2.5 >= upper)
              lower = l - 2.5, upper = u - 2.5;
            else
              nbars++, lower = l - 2.5, upper = u + 2.5;
          }
        else
          nbars = MAX_HIST_BARS;
      }
    else
      {
        nbars = /* number of unique values in data */;
        if (nbars > MAX_HIST_BARS)
          nbars = MAX_HIST_BARS;
      }
    if (nbars < MIN_HIST_BARS)
      nbars = MIN_HIST_BARS;
    interval = (upper - lower) / nbars;

It seemed to make some kind of sense at the time, but this was
way back in 1994 or so and I didn't write as many useful comments
then as I do now. I think that the rationale is roughly this:
the upper and lower values should by preference be rounded to
"round" numbers, like multiples of 5, because that makes the graph
easier to read and data tends to be more naturally interpretable
that way. Then it shifts the plotted range by half a bar width
(1.25 or 2.5) to recenter it on the actual lower and upper values.
I'm not sure we want any part of this anymore. The above is a
pretty weak defense of the rationale, and I actually wrote the
code.
A search for "histogram bin width" turned up this webpage:
http://www.fmrib.ox.ac.uk/analysis/techrep/tr00mj2/tr00mj2/node24.html
which gives the formula

    W = 3.49 * s * N^(-1/3)

as an "optimal bin width", where s is the sample standard deviation
and N is the number of samples, as well as

    W = 2 * IQR * N^(-1/3)

where IQR is the interquartile range. Either one of these would be
pretty easy to implement, and the webpage claims the latter is more
robust.
--
On Perl: "It's as if H.P. Lovecraft, returned from the dead and speaking by
seance to Larry Wall, designed a language both elegant and terrifying for his
Elder Things to write programs in, and forgot that the Shoggoths didn't turn
out quite so well in the long run." --Matt Olson