[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Help-gsl] gsl_cdf_poisson_Q: strange values
Re: [Help-gsl] gsl_cdf_poisson_Q: strange values
Sun, 31 Mar 2019 15:15:32 +0100
Two thoughts: First, on the technical side, the numbers you're getting are
correct. They match what you'd get in R, which uses its own math library.
> ppois(5, 4, lower.tail=FALSE)
> ppois(56, 47, lower.tail=FALSE)
> ppois(560, 475, lower.tail=FALSE)
> ppois(5600, 4755, lower.tail=FALSE)
> ppois(56007, 47550, lower.tail=FALSE)
> ppois(560072, 475507, lower.tail=FALSE)
Here's another check, e.g. for the second line:
*As far as GSL is concerned, this is all working as expected.* So far, so
Which brings me to the second point: Why does this not match your
intuition? If I understand correctly, you assume that being the same
relative distance away from the mean should yield the same probability.
E.g. getting 560 instead of expected 475 successes implies a parameter
value for p that overshoots the known value by 560/475*p - p = 4.6%.
Similarly, getting 5600 instead of 4755 successes overshoots the true
parameter p by 4.6%. With the overshoot being constant, why do the
probabilities of exceeding by the observed count or more get smaller? (In
fact, they get exponentially smaller, and I'm using that term advisedly.)
You're using a Poisson approximation to a binomial distribution with large
*n*. To remove one source of doubt: the fact that you're using an
approximation (you could have used a Normal approximation instead, as John
did in his reply) is not a problem. But for clarity, let's plug in the
binomial distribution, which forces us to spell out what *n* is:
> pbinom(5, 18, p=0.256903, lower.tail=FALSE)
> pbinom(56, 185, p=0.256903, lower.tail=FALSE)
> pbinom(560, 1850, p=0.256903, lower.tail=FALSE)
> pbinom(5600, 18509, p=0.256903, lower.tail=FALSE)
> pbinom(56007, 185092, p=0.256903, lower.tail=FALSE)
But the binomial model is the sum of successes in *n* independent and
identically distributed (iid) Bernoulli trials. In your model, the
probability of success (getting the letter C) is known and fixed; and each
time you draw a letter according to that model, you do so with the same
fixed probability, independent of any previous draws. If that model is
correct (and in a way you could say you're refuting the null hypothesis
that it is correct), you have more of a chance to overshoot the expected
value when the number of independent draws is small. You may just get
(un)lucky. But as the number of draws increases, you can't continue to be
consistently (un)lucky. Most of the values have to be closer to the mean,
and that is in fact a function of *n*. Compare and contrast that with your
intuition, which said it's not a function of *n*, only a function of the
In what way does the probability of overshooting the mean depend on the
number of draws *n*? One simple answer that provides an upper bound comes
<https://en.wikipedia.org/wiki/Hoeffding%27s_inequality>, which states that
the sum of bounded iid random variables must be concentrated around the
expected value, with exponential tails on both sides. If you use the form
of Hoeffding's equality for exceeding the mean of *n* iid Bernoulli
variables, you find
Pr[H(n) >= (p+ε) n] <= exp(-2 ε^2 n)
where H(n) is a random variable counting the number of successes in *n* iid
Bernoulli trials, p is your success probability, and we can let ε = k/N-p,
i.e. ε is the probability overshoot of around 4.6% that we had seen
earlier. Then we find:
Pr[H(18) >= 5.44664] <= 0.927607
Pr[H(185) >= 55.9794] <= 0.46193
Pr[H(1850) >= 559.794] <= 0.000442345
Pr[H(18509) >= 5600.66] <= 2.76243e-34
Pr[H(185092) >= 56007.2] <= 0
So the upper bound for the probability of exceeding the mean by a fixed
relative amount in *n* iid Bernoulli trials is a function exp(-C n) for a
fixed constant C>0, i.e. the tail probability decreases exponentially for
In overly simple terms: In independent coin flips, you can't be
consistently more lucky than the true success probability. Everything else
being fixed, your luck will run out exponentially the longer you keep going.
On Thu, Mar 7, 2019 at 3:04 PM Stephan Lorenzen <address@hidden>
> Dear list,
> I have N=1850923 drawings of characters which can be either A,T,C or G.
> The probability of C is p=0.256903, so I would expect
> lambda=N*p=1850923*0.256903=475508 C's.
> Indeed, I got k=560073.
> According to my understanding, the probability to get 560073 or more C's
> in 1850923 drawings should be gsl_cdf_poisson_Q(k-1,
> lambda)=gsl_cdf_poisson_Q(560072, 475508)
> I get slightly but not dramatically more than expected, so I would
> expect a p-value of, whatwever, 0.2 or so.
> However, I got 0.
> I would expect the p-value in the same range than when getting 5 instead
> of 4 drawings, so I did a series:
> gsl_cdf_poisson_Q( 5, 4) = 0.21487
> gsl_cdf_poisson_Q( 56, 47) = 0.0859409
> gsl_cdf_poisson_Q( 560, 475) = 6.65119e-05
> gsl_cdf_poisson_Q( 5600, 4755) = 4.39096e-33
> gsl_cdf_poisson_Q( 56007, 47550) = 1.46886e-311
> gsl_cdf_poisson_Q(560072, 475507) = 0
> The results starting from the second line are certainly not what I expect.
> Probably I have a fundamental misunderstanding of gsl_cdf_poisson_Q?
> What would be the proper way to use it, or which function should I use
> Thanks for any answers.
> P.S. the actual problem concerns longer sequences with way lower
> p-values where the Poisson distribution is more appropriate. I just
> picked a simple example with a single letter and thus high p-value