pspp-dev
[Top][All Lists]

## Re: new algorithm for wilcoxon signed-rank test

 From: Ben Pfaff Subject: Re: new algorithm for wilcoxon signed-rank test Date: Wed, 04 Feb 2009 22:15:26 -0800 User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)

```Jason Stover <address@hidden> writes:

> Very creative, Ben. It's definitely worth a paper if it isn't already
> published.

I'm not aware of a publication.  I invented it myself by
scribbling lots of stuff on paper, but that doesn't mean much.

The page at http://comp9.psych.cornell.edu/Darlington/wilcoxon/wilcox0.htm
doesn't seem to have a good solution.

I don't fully understand what it's doing, but the implementation
in GNU R seems to be slow also.

Where would you publish it, by the way?  I don't know the
statistics field.

> I've read through the code of ranksum6 below, and it passed the few
> tests I ran. Walking through the algorithm, it always seems to give
> the correct value.

Here's an explanation.

For integers N and W, with N >= 1, let the function S(N,W) be the
number of subsets of 1, 2, 3, ..., N that sum to at least W.
There are 2**N subsets of N items, so S(N,W) is in the range
0...2**N.

There are a few trivial cases:

* For W <= 0, S(N,W) = 2**N.

* For W > N*(N+1)/2, S(N,W) = 0.

* S(1,1) = 1.

Notably, these trivial cases include all values of W for N = 1.

Now consider the remaining, nontrivial cases, that is, N > 1 and
1 <= W <= N*(N+1)/2.  In this case, apply the following identity:

S(N,W) = S(N-1, W) + S(N-1, W-N).

The first term on the right hand is the number of subsets that do
not include N that sum to at least W; the second term is the
number of subsets that do include N that sum to at least W.

Then we repeatedly apply the identity to the result, reducing the
value of N by 1 each time until we reach N=1.  Some expansions
yield trivial cases, e.g. if W - N <= 0 (in which case we add a
2**N term to the final result) or if W is greater than the new N.

Here is an example:

S(7,7) = S(6,7) + S(6,0)
= S(6,7) + 64

= (S(5,7) + S(5,1)) + 64

= (S(4,7) + S(4,2)) + (S(4,1) + S(4,0)) + 64
= S(4,7) + S(4,2) + S(4,1) + 80

= (S(3,7) + S(3,3)) + (S(3,2) + S(3,2)) + (S(3,1) + S(3,0)) + 80
= S(3,3) + 2*S(3,2) + S(3,1) + 88

= (S(2,3) + S(2,0)) + 2*(S(2,2) + S(2,0)) + (S(2,1) + S(2,0)) + 88
= S(2,3) + 2*S(2,2) + S(2,1) + 104

= (S(1,3) + S(1,1)) + 2*(S(1,2) + S(1,0)) + (S(1,1) + S(2,0)) + 104
= 2*S(1,1) + 112

= 114

Does this make the technique clear?  The actual implementation is
mildly clever, but it should be possible to puzzle it out after
the explanation above.

> But I would have to think more about the algorithm to convince myself
> it will always be correct. Or see a written explanation. ('sorry Ben,
> I know how you feel about being able figure out what code is supposed
> to do just by reading it.)

Have I said that?  Code can be very obscure, bad enough that it
is unintelligible.  Maybe I meant that *good* code should be
*written* so that it can figured out just from reading.  This
code, on the other hand, is the programming equivalent of hastily
scribbled notes, which is mostly because I wrote it at 5:30 am
after the baby woke me up and I had trouble getting back to
sleep.
--
"...dans ce pays-ci il est bon de tuer de temps en temps un amiral
pour encourager les autres."
--Voltaire, _Candide_

```