[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: new algorithm for wilcoxon signed-rank test
From: |
Ben Pfaff |
Subject: |
Re: new algorithm for wilcoxon signed-rank test |
Date: |
Wed, 04 Feb 2009 22:15:26 -0800 |
User-agent: |
Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux) |
Jason Stover <address@hidden> writes:
> Very creative, Ben. It's definitely worth a paper if it isn't already
> published.
I'm not aware of a publication. I invented it myself by
scribbling lots of stuff on paper, but that doesn't mean much.
The page at http://comp9.psych.cornell.edu/Darlington/wilcoxon/wilcox0.htm
seems to be by someone who has thought a bit about this, but he
doesn't seem to have a good solution.
I don't fully understand what it's doing, but the implementation
in GNU R seems to be slow also.
Where would you publish it, by the way? I don't know the
statistics field.
> I've read through the code of ranksum6 below, and it passed the few
> tests I ran. Walking through the algorithm, it always seems to give
> the correct value.
Here's an explanation.
For integers N and W, with N >= 1, let the function S(N,W) be the
number of subsets of 1, 2, 3, ..., N that sum to at least W.
There are 2**N subsets of N items, so S(N,W) is in the range
0...2**N.
There are a few trivial cases:
* For W <= 0, S(N,W) = 2**N.
* For W > N*(N+1)/2, S(N,W) = 0.
* S(1,1) = 1.
Notably, these trivial cases include all values of W for N = 1.
Now consider the remaining, nontrivial cases, that is, N > 1 and
1 <= W <= N*(N+1)/2. In this case, apply the following identity:
S(N,W) = S(N-1, W) + S(N-1, W-N).
The first term on the right hand is the number of subsets that do
not include N that sum to at least W; the second term is the
number of subsets that do include N that sum to at least W.
Then we repeatedly apply the identity to the result, reducing the
value of N by 1 each time until we reach N=1. Some expansions
yield trivial cases, e.g. if W - N <= 0 (in which case we add a
2**N term to the final result) or if W is greater than the new N.
Here is an example:
S(7,7) = S(6,7) + S(6,0)
= S(6,7) + 64
= (S(5,7) + S(5,1)) + 64
= (S(4,7) + S(4,2)) + (S(4,1) + S(4,0)) + 64
= S(4,7) + S(4,2) + S(4,1) + 80
= (S(3,7) + S(3,3)) + (S(3,2) + S(3,2)) + (S(3,1) + S(3,0)) + 80
= S(3,3) + 2*S(3,2) + S(3,1) + 88
= (S(2,3) + S(2,0)) + 2*(S(2,2) + S(2,0)) + (S(2,1) + S(2,0)) + 88
= S(2,3) + 2*S(2,2) + S(2,1) + 104
= (S(1,3) + S(1,1)) + 2*(S(1,2) + S(1,0)) + (S(1,1) + S(2,0)) + 104
= 2*S(1,1) + 112
= 114
Does this make the technique clear? The actual implementation is
mildly clever, but it should be possible to puzzle it out after
the explanation above.
> But I would have to think more about the algorithm to convince myself
> it will always be correct. Or see a written explanation. ('sorry Ben,
> I know how you feel about being able figure out what code is supposed
> to do just by reading it.)
Have I said that? Code can be very obscure, bad enough that it
is unintelligible. Maybe I meant that *good* code should be
*written* so that it can figured out just from reading. This
code, on the other hand, is the programming equivalent of hastily
scribbled notes, which is mostly because I wrote it at 5:30 am
after the baby woke me up and I had trouble getting back to
sleep.
--
"...dans ce pays-ci il est bon de tuer de temps en temps un amiral
pour encourager les autres."
--Voltaire, _Candide_