You might not have thought there could be yet new material for 2014, but there is: for the first time Sir Harold Jeffreys himself is making an appearance, and his joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler in criticizing p-values. However, even comics try out “new material” with a dry run, say at a neighborhood “open mike night”. So I’m placing it here under rejected posts, knowing maybe 2 or at most 3 people will drop by. I will return with a spiffed up version at my regular gig next Saturday.
Harold Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred.” (1939, 316)
I say it’s funny, so to see why, I’ll strive to give it a generous interpretation.
We can view p-values in terms of rejecting H0, as in the joke, as follows: There’s a test statistic D such that H0 is rejected if the observed D, i.e., d0, reaches or exceeds a cut-off d* where Pr(D > d*; H0) is very small, say .025. Equivalently, in terms of the p-value:
Reject H0 if Pr(D > d0; H0) < .025.
The report might be “reject H0 at level .025”.
Suppose we’d reject H0: The mean light deflection effect is 0, if we observe a 1.96 standard deviation difference (in one-sided Normal testing), reaching a p-value of .025. Had the observation been further into the rejection region, say 3 or 4 standard deviations, it too would have resulted in rejecting the null, and with an even smaller p-value. H0 “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference. Why? Because differences that large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; H0), which includes these more extreme outcomes. This is “a remarkable procedure” indeed! [i]
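For anyone who wants to check the numbers in this example, here is a minimal sketch of the rejection rule, assuming D is a standard Normal statistic under H0 and the test is one-sided at level .025. The names `p_value`, `rejects`, and `ALPHA` are mine, chosen just for illustration.

```python
from statistics import NormalDist

ALPHA = 0.025  # the level used in the example (an assumption of this sketch)


def p_value(d0: float) -> float:
    """One-sided tail area Pr(D > d0; H0), taking D to be standard Normal under H0."""
    return 1.0 - NormalDist().cdf(d0)


def rejects(d0: float, alpha: float = ALPHA) -> bool:
    """Reject H0 only when the tail area beyond the observed d0 falls below alpha."""
    return p_value(d0) < alpha


# Observed differences, in standard deviation units
for d0 in (1.0, 1.96, 3.0, 4.0):
    print(f"d0 = {d0}: p-value = {p_value(d0):.4f}, reject H0: {rejects(d0)}")
```

On this sketch the 1 standard deviation difference has a p-value of about .16 and does not lead to rejection, while 1.96, 3, and 4 do, with successively smaller p-values.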
So much for making out the howler. The only problem is that significance tests do not do this, that is, they do not reject with, say, D = 1 because larger D values, further from H0, might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting H0. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area): since it requires not merely that Pr(D = d0; H0) be small, but that Pr(D > d0; H0) be small. And this is well justified because when this probability is not small, you should not regard it as evidence of discrepancy from the null. Before getting to this, a few comments:
1. The joke talks about outcomes the null does not predict, which is just what we wouldn’t know without an assumed test statistic; but the tail area consideration arises in Fisherian tests in order to determine what outcomes H0 “has not predicted”. That is, it arises to identify a sensible test statistic D (I’ll return to N-P tests in a moment).
In familiar scientific tests, we know the outcomes that are further away from a given hypothesis in the direction of interest, e.g., the more patients show side effects after taking drug Z, the less indicative it is that the drug is benign, not the other way around. But that’s to assume the equivalent of a test statistic. In Fisher’s set-up, one needs to identify a suitable measure of closeness, fit, or directional departure. Any particular outcome can be very improbable in some respect. Improbability of outcomes (under H0) should not indicate discrepancy from H0 if even less probable outcomes would occur under discrepancies from H0. (Note: To avoid confusion, I always use “discrepancy” to refer to the parameter values used in describing the underlying data generation; values of D are “differences”.)
2. N-P tests and tail areas: Now N-P tests do not consider “tail areas” explicitly, but they fall out of the desiderata of good tests and sensible test statistics. N-P tests were developed to provide the tests that Fisher used with a rationale by making explicit alternatives of interest—even if just in terms of directions of departure.
In order to determine the appropriate test and compare alternative tests “Neyman and I introduced the notions of the class of admissible hypotheses and the power function of a test. The class of admissible alternatives is formally related to the direction of deviations—changes in mean, changes in variability, departure from linear regression, existence of interactions, or what you will.” (Pearson 1955, 207)
Under N-P test criteria, tests should rarely reject a null erroneously, and as discrepancies from the null increase, the probability of signaling discordance from the null should increase. In addition to ensuring Pr(D < d*; H0) is high, one wants Pr(D > d*; H’: μ0 + γ) to increase as γ increases. Any sensible distance measure D must track discrepancies from H0. If you’re going to reason, the larger the D value, the worse the fit with H0, then observed differences must occur because of the falsity of H0 (in this connection consider Kadane’s howler).
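To see the two N-P desiderata side by side, here is a rough sketch of a power function. It assumes (my simplification, not anything in Pearson) that the discrepancy γ is measured in standard error units, so that D is Normal with mean γ and standard deviation 1, and that the cut-off is d* = 1.96. The names `power` and `D_STAR` are hypothetical.

```python
from statistics import NormalDist

D_STAR = 1.96  # cut-off d* for a one-sided test at level .025 (assumed)


def power(gamma: float, d_star: float = D_STAR) -> float:
    """Pr(D > d*; mu0 + gamma), assuming D ~ Normal(gamma, 1).

    gamma is the discrepancy from the null in standard error units,
    so gamma = 0 corresponds to H0 itself.
    """
    return 1.0 - NormalDist(mu=gamma, sigma=1.0).cdf(d_star)


gammas = [0.0, 0.5, 1.0, 1.96, 3.0, 4.0]
powers = [power(g) for g in gammas]

# Probability of (correctly) not rejecting when H0 is true: Pr(D < d*; H0)
prob_no_reject_under_null = 1.0 - power(0.0)

for g, pw in zip(gammas, powers):
    print(f"gamma = {g}: Pr(D > d*; mu0 + gamma) = {pw:.3f}")
```

Under these assumptions Pr(D < d*; H0) is about .975, and the probability of exceeding d* climbs steadily as γ grows, reaching one half when γ equals the cut-off itself.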
3. But Fisher, strictly speaking, has only the null distribution, and an implicit interest in tests with sensitivity of a given type. To find out if H0 has or has not predicted observed results, we need a sensible distance measure.
Suppose I take an observed difference d0 as grounds to reject H0 on account of its being improbable under H0, when in fact larger differences (larger D values) are more probable under H0. Then, as Fisher rightly notes, the improbability of the observed difference was a poor indication of underlying discrepancy. This fallacy would be revealed by looking at the tail area; whereas it is readily committed, Fisher notes, with accounts that only look at the improbability of the observed outcome d0 under H0.
4. Even if you have a sensible distance measure D (tracking the discrepancy relevant for the inference), and observe D = d, the improbability of d under H0 should not be indicative of a genuine discrepancy, if it’s rather easy to bring about differences even greater than observed, under H0. Equivalently, we want a high probability of inferring H0 when H0 is true. In my terms, considering Pr(D < d*; H0) is what’s needed to block rejecting the null and inferring H’ when you haven’t rejected it with severity. In order to say that we have “sincerely tried”, to use Popper’s expression, to reject H’ when it is false and H0 is correct, we need Pr(D < d*; H0) to be high.
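A small numerical illustration may help here. It assumes a fair coin null with 1000 tosses and 510 observed heads (my numbers, picked only for the example), and since the outcome is discrete the tail is taken as "at least as many heads as observed". The function names are hypothetical.

```python
from math import comb


def point_prob(k: int, n: int) -> float:
    """Pr(exactly k heads in n tosses; fair coin), computed from exact integers."""
    return comb(n, k) / 2**n


def upper_tail(k: int, n: int) -> float:
    """Pr(k or more heads in n tosses; fair coin), computed from exact integers."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n


N_TOSSES = 1000      # assumed sample size
OBSERVED_HEADS = 510  # assumed outcome: 51% heads

p_point = point_prob(OBSERVED_HEADS, N_TOSSES)
p_tail = upper_tail(OBSERVED_HEADS, N_TOSSES)

print(f"Pr(exactly {OBSERVED_HEADS} heads; H0) = {p_point:.4f}")
print(f"Pr({OBSERVED_HEADS} or more heads; H0) = {p_tail:.4f}")
```

The exact outcome is improbable under the null (its probability is below .025), yet results at least this far from 50% heads come up more than a quarter of the time when the coin is fair. An account that looked only at the point probability would reject here; the tail area blocks it.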
5. Concluding remarks:
The rationale for the tail area is twofold: to get the right direction of departure, but also to ensure Pr(test T does not reject null; H0) is high.
If we don’t have a sensible distance measure D, then we don’t know which outcomes we should regard as those H0 does or does not predict. That’s why we look at the tail area associated with D. Neyman and Pearson make alternatives explicit in order to arrive at relevant test statistics. If we have a sensible D, then Jeffreys’ criticism is equally puzzling because considering the tail area does not make it easier to reject H0 but harder. Harder because it’s not enough that the outcome be improbable under the null; outcomes even greater must also be improbable under the null. And it makes it a lot harder (leading to blocking a rejection) just when it should: because the data could readily be produced by H0 [ii].
Either way, Jeffreys’ criticism, funny as it is, collapses.
When an observation does lead to rejecting the null, it is because of that outcome—not because of any unobserved outcomes. Considering other possible outcomes that could have arisen is essential for determining (and controlling) the capabilities of the given testing method. In fact, understanding the properties of our testing tool just is to understand what it would do under different outcomes, under different conjectures about what’s producing the data.
[i] Jeffreys’ next sentence, remarkably, is: “On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it.” This further supports my reading, as if we’d reject a fair coin null because it would not predict 100% heads, even though we only observed 51% heads. But the allegation has no relation to significance tests of the Fisherian or N-P varieties.
[ii] One may argue it should be even harder, but that is tantamount to arguing the purported error probabilities are close to the actual ones. Anyway, this is a distinct issue.
Ok, so those of you who didn’t want to pay the high cover charge of the better comedy clubs and wound up here, please share corrections and comments. (No rotten tomatoes.)