“two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value. .. But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value.” (S.Goodman 1999, p. 1010, Annals of Internal Medicine 130 (12)
What defies scientific sense, as I see it, are accounts of evidence that regard biasing techniques, such as data dredging, as “having nothing to do” with the evidence. Since these gambits open the door to handy ways of verification bias and high probability of finding impressive-looking patterns erroneously, they are at the heart of today’s criticisms of unreplicable statistics. The point of registration, admitting multiple testing, multiple modeling, cherry-picking, p-hacking, etc. is to combat the illicit inferences so readily enabled by ignoring them. Yet, we have epidemiologists like Goodman, and many others, touting notions of inference (likelihood ratios and Bayes factors) that proudly declare these ways of cheating irrelevant to evidence. Remember, by declaring them irrelevant to evidence, there is also no onus to mention that one found one’s hypothesis by scouring the data. Whenever defendants need to justify their data-dependent hypotheses (a practice that can even run up against legal statutes for evidence), they know whom to call.[i]
In this connection, consider the “replication crisis in psychology”. They often blame significance tests with permitting p-values too readily. But then, why are they only able to reproduce something like 30% of the previously claimed effects? What’s the answer? The implicit answer is that those earlier studies engaged in p-hacking and data dredging. All the more reason to want an account that picks up on such shenanigans rather than ignore them as irrelevant to evidence. Data dredging got you down? Here’s the cure: Use methods that regard such QRPs as irrelevant to evidence. Of course the replication project goes by the board: what are they going to do, check if they can get as good a likelihood ratio by replicating the data dredging?
[i] There would be no rationale for Joe Simmons, Leif Nelson and Uri Simonsohn’s suggestion for “A 21-word solution”: “Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir. Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve. If you determined sample size in advance, say it. If you did not drop any variables, say it. If you did not drop any conditions, say it. The Fall 2012 Newsletter for the Society for Personality and Social Psychology See my Statistical Dirty Laundry post in my regular blog.
Pingback: What really defies common sense (Msc kvetch on rejected posts) | Error Statistics Philosophy
Click to access mayo-2014-prion-paper.pdf
Pingback: The Paradox of Replication, and the vindication of the P-value (but she can go deeper) (i) | Error Statistics Philosophy
Pingback: In defense of statistical recipes, but with enriched ingredients (scientist sees squirrel) | Error Statistics Philosophy
Click to access fraser_is-bayes-posterior-just-quick-and-dirty-confidence.pdf