The nice thing about having a “rejected posts” blog, which I rarely utilize, is that it enables me to park something too long for a comment, but not polished enough to be “accepted” for the main blog. The thing is, I don’t have time to do more now, but would like to share my meanderings after yesterday’s exchange of comments with Gelman.
I entirely agree with Gelman that in studies with wide latitude for data-dependent choices in analyzing the data, we cannot say the study was stringently probing for the relevant error (erroneous interpretation) or giving its inferred hypothesis a hard time.
One should specify what the relevant error is. If it’s merely inferring some genuine statistical discrepancy from a null, that would differ from inferring a causal claim. Weakest of all would be merely reporting an observed association. I will assume the nulls are like those in the examples in the “The Garden of Forking Paths” paper, only I was using his (2013) version. I think they are all mere reports of observed associations except for the BEM ESP study. (That they make causal, or predictive, claims already discredits them).
They fall into the soothsayer’s trick of, in effect, issuing such vague predictions that they are guaranteed not to fail.
Here’s a link to Gelman and Loken’s “The Garden of Forking Paths” http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
I agree entirely: “Once we recognize that analysis is contingent on data, the p-value argument disappears–one can no longer argue that, if nothing were going on, that something as extreme as what was observed would occur less than 5% of the time.” (Gelman 2013, p. 10). The nominal p-values does not reflect the improbability of such an extreme or more extreme result due to random noise or “nothing going on”.
A legitimate p-value of α must be such that
Pr(Test yields P-value < α; Ho: chance) ~ α.
With data dependent hypotheses, the probability the test outputs a small significance level can easily be HIGH, when it’s supposed to be LOW. See this post: “Capitalizing on Chance” reporting on Morrison and Henkel from the 1960s!![i]http://errorstatistics.com/2014/03/03/capitalizing-on-chance-2/
Notice, statistical facts about p-values demonstrate the invalidity of taking these nominal p-values as actual. So statistical facts about p-values are self-correcting or error correcting.
So, just as in my first impression of the “Garden” paper, Gelman’s concern is error statistical: it involves appealing to data that didn’t occur, but might have occurred, in order to evaluate inferences from the data that did occur. There is an appeal to a type of sampling distribution over researcher “degrees of freedom” akin to literal multiple testing, cherry-picking, barn-hunting and so on.
One of Gelman’s suggestions is (or appears to be) to report the nominal p-value, and then consider the prior that would render the p-value = to the resulting posterior. If the prior doesn’t seem believable, I take it you are to replace it with one that does. Then, using whatever prior you have selected, report the posterior probability that the effect is real. (In a later version of the paper, there is only reference to using a “pessimistic prior”.) This is remindful of Greenland’s “dualistic” view. Please search on error statistics.com.
Here are some problems I see with this:
- The supposition is that for the p-value to be indicative of evidence for the alternative (say in a one-sided test of a 0 null), the p-value should be like a posterior probability for the null, (1- p) to the non-null. This is questionable. http://errorstatistics.com/2014/07/14/the-p-values-overstate-the-evidence-against-the-null-fallacy/
Aside: Why even allow using the nominal p-value as a kind of likelihood to go into the Bayesian analysis if its illegitimate? Can we assume the probability model used to compute the likelihood from the nominal p-value?
- One may select the prior in such a way that one reports a low posterior that the effect is real. There’s wide latitude in the selection and it will depend on the framing of the “not-Ho” (non-null).Now one has “critics degrees of freedom” akin to researchers degrees of freedom.
One is not criticizing the study or pinpointing its flawed data dependencies, yet one is claiming to have grounds to criticize it.
Or suppose the effect inferred is entirely believable and now the original result is blessed—even though it should be criticized as having poorly tested the effect. Adjudicating between different assessments by different scientists will become a matter of defending one’s prior, when it should be a matter of identifying the methodological flaws in the study. The researcher will point to many other “replications” in a big field studying similar effects, etc.
There’s a crucial distinction between a poorly tested claim and an implausible claim. An adequate account of statistical testing needs to distinguish these.
I want to be able to say that the effect is quite plausible given all I know, etc., but this was a terrible test of it, and supplies poor grounds for the reality of the effect.
Gelman’s other suggestion, that these experimenters distinguish exploratory from confirmatory experiments, and that they be required to replicate their results is, on the face of it, more plausible. But the only way this would be convincing, as I see it, is if the data analysts were appropriately blinded. Else, they’ll do the same thing with the replication.
I agree of course that a mere nominal p-value “should not be taken literally” (in the sense that it’s not an actual p-value)—but I deny that this is equal to assigning p as a posterior probability to the null.
There are many other cases in which data-dependent hypotheses are well-tested by the same data used in their construction/selection Distinguishing cases has been the major goal of much of my general work in philosophy of science (and it carries over into PhilStat).
One last thing: Gelman is concerned that the p-values based on these data-dependent associations are misleading the journals and misrepresenting the results. This may be so in the “experimental” cases. But if the entire field knows that this is a data-dependent search for associations that seem indicative of supporting one or another conjecture, and that the p-value is merely a nominal or computed measure of fit, then it’s not clear there is misinterpretation. It’s just a reported pattern.
[i] When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)
In this work we propose confidence intervals that account for the data driven selection stage: