A non-statistically significant difference strengthens or confirms a null, says Fisher (1955)



In Fisher (1955) [from “the triad”]: “it is a fallacy, so well known as to be a standard example, to conclude from a test of significance that the null hypothesis is thereby established; at most it may be said to be confirmed or strengthened.”

I just noticed the last part of this sentence, which I think I’ve missed in a zillion readings, or else it didn’t seem very important. People erroneously think Fisherian tests can infer nothing from non-significant results, but I hadn’t remembered that Fisher himself made it blatant, even while he is busy yelling at N-P for introducing the Type 2 error! Neyman and Pearson use power-analytic reasoning to determine how well the null is “confirmed”. If POW(μ’) is high, then a non-statistically significant result indicates μ ≤ μ’ (were μ greater than μ’, the test would very probably have produced a significant result).
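To make the power reasoning concrete, here is a rough sketch of POW(μ’) for a one-sided Z test of H0: μ ≤ μ0 with known σ. The function name and the particular values (μ0 = 0, σ = 1, n = 100, α = 0.025) are assumptions picked for illustration; nothing here comes from Fisher or N-P directly.

```python
from statistics import NormalDist

def power(mu_alt, mu0=0.0, sigma=1.0, n=100, alpha=0.025):
    """Power of the one-sided Z test of H0: mu <= mu0 against the value mu_alt.

    All default values are illustrative assumptions.
    """
    std_normal = NormalDist()
    z_cut = std_normal.inv_cdf(1 - alpha)        # rejection cutoff on the z scale
    shift = (mu_alt - mu0) * n ** 0.5 / sigma    # distance of mu_alt from mu0 in standard errors
    return 1 - std_normal.cdf(z_cut - shift)     # Pr(test rejects; mu = mu_alt)

for mu_alt in (0.1, 0.2, 0.3, 0.4):
    print(f"POW({mu_alt}) = {power(mu_alt):.3f}")
```

On this sketch a non-significant result is a poor indication that μ ≤ 0.1 (the test had little chance of detecting so small a discrepancy) but a good indication that μ ≤ 0.4, where the power is high.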

Categories: phil stat

On what “evidence-based Bayesians” like Gelman really mean: rejected post



Trying to interpret subjective Bayesians who want to be hard-nosed Bayesians is often like swimming round and round in a funnel of currents where there’s nothing to hold on to. Well, I think I’ve recently stopped the flow and pegged it. Christian Hennig and I have often discussed this (on my regular blog), and something Gelman posted today linked me to an earlier exchange between him and Christian.

Christian: I came across an exchange between you and Andrew because it was linked to by Andrew on a current blog post 

It really brings out the confusion I have had, we both have had, and which I am writing about right now (in my book), as to what people like Gelman mean when they talk about posterior probabilities. First:

a posterior of .9 to

H: “θ  is positive”

is identified with giving 9 to 1 odds on H.

Gelman had said: “it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance”

Then Christian says this would be to suggest “I don’t believe it” means “it doesn’t agree with my subjective probability”, and Christian doubts Andrew could mean that. But I say he does mean that. His posterior probability is his subjective (however evidence-based) probability.

Next the question is, what’s the probability assigned to? I think it is assigned to H: θ > 0.

As for the meaning of “this event would occur 90% of the time in the long run under repeated trials”, I’m guessing that “this event” is also H. The repeated “trials” allude to a repeated θ generating mechanism, or to different systems each with a θ. The outputs would be claims of form H (or not-H, or different assertions about the θ for the case at hand), and he’s saying 90% of the time the outputs would be H, or H would be the case. The outputs are not ordinary test results, but states of affairs, namely θ > 0.

Bottom line: It seems to me that all Bayesians who assign posteriors to parameters (aside from empirical Bayesians) really mean the kind of odds statement that you and I and most people associate with partial belief or subjective probability. “Epistemic probability” would do as well, but it is equivocal. It doesn’t matter how terrifically objectively warranted that subjective probability assignment is; we’re talking meaning. And when one finally realizes this is what they meant all along, everything they say is less baffling. What do you think?



Andrew Gelman:

First off, a claimed 90% probability that θ>0 seems too strong. Given that the p-value (adjusted for multiple comparisons) was only 0.2—that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference—it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance,

Christian Hennig says:

May 1, 2015 at 1:06 pm

“Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yields a 90% posterior probability that θ is positive. Do I believe this? No.”

What exactly would it mean to “believe” this? Are you referring to a “true unknown” posterior probability with which you compare the computed one? How would the “true” one be defined?

Later there’s this:
“I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance, …”
…which kind of suggests that “I don’t believe it” means “it doesn’t agree with my subjective probability” – but knowing you a bit I’m pretty sure that’s not what you meant before. But what is it then?
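For readers wondering where the 90% in the quoted exchange comes from, here is a minimal sketch of the arithmetic. It assumes a normal likelihood, a flat prior on θ, and a positive observed estimate; the function name is hypothetical and none of this is taken from Gelman's own code.

```python
from statistics import NormalDist

def posterior_prob_positive(two_sided_p):
    """Pr(theta > 0 | data) under a flat prior and a normal likelihood.

    Assumes the observed estimate is positive (illustrative assumption).
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - two_sided_p / 2)  # observed estimate in standard-error units
    # With a flat prior, theta | data ~ Normal(estimate, se^2),
    # so Pr(theta > 0 | data) equals the standard normal CDF at z.
    return std_normal.cdf(z)

prob = posterior_prob_positive(0.2)
odds = prob / (1 - prob)
print(f"posterior Pr(theta > 0) = {prob:.2f}, odds on H = {odds:.1f} to 1")
```

Under these assumptions the posterior is just 1 minus the one-sided p-value, which is why a two-sided p of 0.2 turns into 9 to 1 odds on H.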

Categories: Bayesian meanings, rejected posts

Fraudulent until proved innocent: Is this really the new “Bayesian Forensics”? (ii) (rejected post)

Objectivity 1: Will the Real Junk Science Please Stand Up?


I saw some tweets last night alluding to a technique for Bayesian forensics, on the basis of which published papers are to be retracted. So far as I can tell, your paper is guilty of being fraudulent so long as the/a prior Bayesian belief in its fraudulence is higher than in its innocence. Klaassen (2015):

“An important principle in criminal court cases is ‘in dubio pro reo’, which means that in case of doubt the accused is favored. In science one might argue that the leading principle should be ‘in dubio pro scientia’, which should mean that in case of doubt a publication should be withdrawn. Within the framework of this paper this would imply that if the posterior odds in favor of hypothesis HF of fabrication equal at least 1, then the conclusion should be that HF is true.”

Now “evidential value” (supposedly, the likelihood ratio of fraud to innocence), called V, must by definition be at least 1. So it follows that any paper for which the prior for fraudulence exceeds that of innocence “should be rejected and disqualified scientifically. Keeping this in mind one wonders what a reasonable choice of the prior odds would be.” (Klaassen 2015)

Yes, one really does wonder!

“V ≥ 1. Consequently, within this framework there does not exist exculpatory evidence. This is reasonable since bad science cannot be compensated by very good science. It should be very good anyway.”

What? I thought the point of the computation was to determine if there is evidence for bad science. So unless it is a good measure of evidence for bad science, this remark makes no sense. Yet even the best case can be regarded as bad science simply because the prior odds in favor of fraud exceed 1. And there’s no guarantee this prior odds ratio is a reflection of the evidence, especially since if it had to be evidence-based, there would be no reason for it at all. (They admit the computation cannot distinguish between QRPs and fraud, by the way.) Since this post is not yet in shape for my regular blog, but I wanted to write down something, it’s here in my “rejected posts” site for now.
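A toy sketch of the decision rule as I read it from the quoted passages: posterior odds of fabrication = V × prior odds, with HF concluded whenever the posterior odds are at least 1. Treating V as the multiplier in Bayes' theorem is my assumption from the "likelihood ratio" description, and the function names are hypothetical; this is not Klaassen's code.

```python
def posterior_odds_of_fabrication(prior_odds, evidential_value):
    """Posterior odds of HF, assuming V acts as the likelihood ratio in Bayes' theorem."""
    if evidential_value < 1:
        # The quoted framework states V >= 1, so smaller values are ruled out.
        raise ValueError("within the quoted framework V is never below 1")
    return evidential_value * prior_odds

def conclude_fabrication(prior_odds, evidential_value):
    """Apply 'in dubio pro scientia': conclude HF when posterior odds are at least 1."""
    return posterior_odds_of_fabrication(prior_odds, evidential_value) >= 1

# With prior odds above 1, no admissible value of V can rescue the paper.
for v in (1.0, 5.0, 100.0):
    print(v, conclude_fabrication(prior_odds=1.2, evidential_value=v))
```

Since V can only multiply the prior odds upward, the verdict for any paper with prior odds above 1 is settled before the data are consulted.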

Added June 9: I realize this is being applied to the problematic case of Jens Forster, but the method should stand or fall on its own. I thought rather strong grounds for concluding manipulation were already given in the Forster case. (See Forster on my regular blog.) Since that analysis could (presumably) distinguish fraud from QRPs, it was more informative than the best this method can do. Thus, the question arises as to why this additional and much shakier method is introduced. (By the way, Forster admitted to QRPs, as normally defined.) Perhaps it’s in order to call for a retraction of other papers that did not admit of the earlier, Fisherian criticisms. It may be little more than formally dressing up the suspicion we’d have in any papers by an author who has retracted one(?) in a similar area. The danger is that it will live a life of its own as a tool to be used more generally. Further, just because someone can treat a statistic “frequentistly” doesn’t place the analysis within any sanctioned frequentist or error statistical home. Including the priors, and even the non-exhaustive, (apparently) data-dependent hypotheses, takes it out of frequentist hypothesis testing. Additionally, this is being used as a decision making tool to “announce untrustworthiness” or “call for retractions”, not merely to analyze warranted evidence.

Klaassen, C. A. J. (2015). Evidential value in ANOVA-regression results in scientific integrity studies. arXiv:1405.4540v2 [stat.ME]. Discussion of the Klaassen method on pubpeer review: https://pubpeer.com/publications/5439C6BFF5744F6F47A2E0E9456703

Categories: danger, junk science, rejected posts

Msc Kvetch: Why isn’t the excuse for male cheating open to women?



In an op-ed in the NYT Sunday Review (May 24, 2015), “Infidelity Lurks in Your Genes,” Richard Friedman states that:

We have long known that men have a genetic, evolutionary impulse to cheat, because that increases the odds of having more of their offspring in the world.

But now there is intriguing new research showing that some women, too, are biologically inclined to wander, although not for clear evolutionary benefits.

I’ve never been sold on this evolutionary explanation for male cheating, but I wonder why it’s assumed women wouldn’t be entitled to it as well. For the male’s odds of having more offspring to increase, the woman has to have the baby, so why wouldn’t the woman also get the increased odds of more offspring? It’s the woman’s offspring too. Moreover, the desire to have babies tends to be greater among women than men.

Categories: Misc Kvetching

Msc kvetch: What really defies scientific sense


Texas sharpshooter

“two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value. … But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P-value.” (S. Goodman 1999, p. 1010, Annals of Internal Medicine 130 (12))

What defies scientific sense, as I see it, are accounts of evidence that regard biasing techniques, such as data dredging, as “having nothing to do” with the evidence. Since these gambits open the door to handy ways of verification bias and high probability of finding impressive-looking patterns erroneously, they are at the heart of today’s criticisms of unreplicable statistics. The point of registration, admitting multiple testing, multiple modeling, cherry-picking, p-hacking, etc. is to combat the illicit inferences so readily enabled by ignoring them. Yet, we have epidemiologists like Goodman, and many others, touting notions of inference (likelihood ratios and Bayes factors) that proudly declare these ways of cheating irrelevant to evidence. Remember, by declaring them irrelevant to evidence, there is also no onus to mention that one found one’s hypothesis by scouring the data. Whenever defendants need to justify their data-dependent hypotheses (a practice that can even run up against legal statutes for evidence), they know whom to call.[i]

In this connection, consider the “replication crisis in psychology”. Critics often blame significance tests for permitting small p-values too readily. But then, why are they only able to reproduce something like 30% of the previously claimed effects? What’s the answer? The implicit answer is that those earlier studies engaged in p-hacking and data dredging. All the more reason to want an account that picks up on such shenanigans rather than ignoring them as irrelevant to evidence. Data dredging got you down? Here’s the cure: Use methods that regard such QRPs as irrelevant to evidence. Of course the replication project goes by the board: what are they going to do, check if they can get as good a likelihood ratio by replicating the data dredging?

[i] There would be no rationale for Joe Simmons, Leif Nelson and Uri Simonsohn’s suggestion for “A 21-word solution”: “Many support our call for transparency, and agree that researchers should fully disclose details of data collection and analysis. Many do not agree. What follows is a message for the former; we begin by preaching to the choir. Choir: There is no need to wait for everyone to catch up with your desire for a more transparent science. If you did not p-hack a finding, say it, and your results will be evaluated with the greater confidence they deserve. If you determined sample size in advance, say it. If you did not drop any variables, say it. If you did not drop any conditions, say it.” (The Fall 2012 Newsletter for the Society for Personality and Social Psychology.) See my Statistical Dirty Laundry post in my regular blog.

Categories: bias, Misc Kvetching

Too swamped to read about ‘the swamping problem’ in epistemology, but…



I was sent an interesting paper that is a quintessential exemplar of analytic epistemology. It’s called “What’s the Swamping Problem?” (by Duncan Pritchard), and was tweeted to me by a philosophy graduate student, George Shiber. I’m too tired and swamped to read the fascinating ins and outs of the story. Still, here are some thoughts off the top of my head that couldn’t be squeezed into a tweet. I realize I’m not explaining the problem; that’s why this is in “rejected posts” (I didn’t accept it for the main blog). (Feel free to comment. Don’t worry, absolutely no one comes here unless I direct them through the swamps.)

  1. First, it deals with a case where the truth of some claim is given, whereas we’d rarely know this. The issue should be relevant to the more typical case. Even then, it’s important to be able to demonstrate and check why a claim is true, and be able to communicate the reasons to others. In this connection, one wants information for finding out more things, and without the method you don’t get this.
  2. Second, the goal isn’t merely knowing isolated factoids but methods. But that reminds me that nothing is said in the paper about learning the method. There’s a huge gap here. If knowing is understood as true belief PLUS something, then we’ve got to hear what that something is. If it’s merely reliability without explanation of the method (as is typical in reliabilist discussions), no wonder it doesn’t add much, at least wrt that one fact. It’s hard even to see the difference, unless the reliable method is spelled out. In particular, in my account, one always wants to know how to recognize and avoid errors in ranges we don’t yet know how to probe reliably. Knowing the method should help extend knowledge into unknown territory.
  3. We don’t want trivial truths. This is what’s wrong with standard confirmation theories, and where Popper was right. We want bold, fruitful theories that interconnect areas in order to learn more things. I’d rather know how to spin off fabulous coffee makers using my 3-D printer, say, than have a single good coffee now. The person who doesn’t care how a truth was arrived at is not a wise person. The issue of “understanding” comes up (one of my favorite notions), but little is said as to what it amounts to.
  4. Also overlooked in philosophical accounts is the crucial importance of moving from unreliable claims to reliable claims (e.g., by averaging, in statistics). I don’t happen to think knowing merely that the method is reliable is of much use, w/o knowing why, w/o learning how specific mistakes were checked, how errors are made to ramify to permit triangulation, etc.
  5. Finally, one wants an epistemic account that is relevant for the most interesting and actual cases, namely when one doesn’t know X or is not told X is a true belief. Since we are not given that here (unless I missed it) it doesn’t go very far.
  6. Extraneous: On my account, x is evidence for H only to the extent that H is well tested by x. That is, if x accords with H, it is only evidence for H to the extent that it’s improbable the method would have resulted in such good accordance if H were false. This goes over into entirely informal cases. One still wants to know how capable and incapable the method was to discern flaws.
  7. A related issue, though it might not be obvious at first, concerns the greater weight given to a data set that results from randomization, as opposed to the same data x arrived at through deliberate selection.

Or consider my favorite example: the relevance of stopping rules. People often say that if data x on 1009 trials achieves statistical significance at the .05 level, then it shouldn’t matter if x arose from a method that planned on doing 1009 trials all along, or one that first sought significance after the first 10, and still not getting it went on to 20, then 10 more and 10 more until finally at trial 1009 significance was found. The latter case involves what’s called optional stopping. In the case of, say, testing or estimating the mean of a Normal distribution, the optional stopping method is unreliable; at any rate, the probability it erroneously infers significance is much higher than .05. It can be shown that this stopping rule is guaranteed to stop in finitely many trials and reject the null hypothesis, even though it is true. (Search optional stopping on errorstatistics.com.)
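The optional stopping point can be checked with a small simulation. This is only a sketch under assumptions of my choosing: N(0,1) data with the null true (μ = 0, σ = 1 known), a two-sided Z test at the .05 level, a look after every 10 observations, and a cap at 1000 trials (a round number close to the 1009 above). The function names and the seed are arbitrary.

```python
import random
from statistics import NormalDist

Z_CUT = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-sided .05 cutoff, about 1.96

def rejects_with_peeking(rng, max_n=1000, step=10):
    """Null is true; test after every `step` observations and stop at the first 'significant' look."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if n % step == 0 and abs(total) / n ** 0.5 > Z_CUT:
            return True
    return False

def rejects_fixed_n(rng, n=1000):
    """Null is true; a single test at the sample size planned in advance."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total) / n ** 0.5 > Z_CUT

rng = random.Random(1)  # arbitrary seed, for reproducibility only
sims = 1000
peeking_rate = sum(rejects_with_peeking(rng) for _ in range(sims)) / sims
fixed_rate = sum(rejects_fixed_n(rng) for _ in range(sims)) / sims
print(f"erroneous rejections with peeking: {peeking_rate:.2f}")
print(f"erroneous rejections with fixed n: {fixed_rate:.2f}")
```

The fixed-n rate should hover near the nominal .05, while the peeking rate comes out several times larger, and it keeps growing as the cap on the number of trials is raised.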

I may add to this later… You can read it: What Is The Swamping Problem

Categories: Misc Kvetching, Uncategorized

Potti Update: “I suspect that we likely disagree with what constitutes validation” (Nevins and Potti)

So there was an internal whistleblower after all (despite denials by the Duke people involved): a med student, Brad Perez. It’s in the Jan. 9, 2015 Cancer Letter. I haven’t studied this update yet, but thought I’d post the letter here on Rejected Posts. (Since my first post on Potti last May, I’ve received various e-mails and phone calls from people wanting to share the inside scoop, but I felt I should wait for some published item.)

Here we have a great example of something I am increasingly seeing: Challenges to the scientific credentials of data analysis are dismissed as mere differences in “statistical philosophies” or as understandable disagreements about stringency of data validation.

If so, then statistical philosophy is of crucial practical importance. While Potti and Nevins concur (with Perez) that data points in disagreement with their model are conveniently removed, they claim the cherry-picked data that do support their model give grounds for ignoring the anomalies. Since the model checks out in the cases it checks out, it is reasonable to ignore those annoying anomalous cases that refuse to get in line with their model. After all, it’s only going to be the basis of your very own “personalized” cancer treatment!
Jan 9, 2015
 Extracts from their letter:
Nevins and Potti Respond To Perez’s Questions and Worries

Dear Brad,

We regret the fact that you have decided to terminate your fellowship in the group here and that your research experience did not turn out in a way that you found to be positive. We also appreciate your concerns about the nature of the work and the approaches taken to the problems. While we disagree with some of the measures you suggest should be taken to address the issues raised, we do recognize that there are some areas of the work that were less than perfect and need to be rectified.


 I suspect that we likely disagree with what constitutes validation.


We recognize that you are concerned about some of the methods used to develop predictors. As we have discussed, the reality is that there are often challenges in generating a predictor that necessitates trying various methods to explore the potential. Clearly, some instances are very straightforward such as the pathway predictors since we have complete control of the characteristics of the training samples. But, other instances are not so clear and require various approaches to explore the potential of creating a useful signature including in some cases using information from initial cross validations to select samples. If that was all that was done in each instance, there is certainly a danger of overfitting and getting overly optimistic prediction results. We have tried in all instances to make use of independent samples for validation of which then puts the predictor to a real test. This has been done in most such cases but we do recognize that there are a few instances where there was no such opportunity. It was our judgment that since the methods used were essentially the same as in other cases that were validated, that it was then reasonable move forward. You clearly disagree and we respect that view but we do believe that our approach is reasonable as a method of investigation.

……We don’t ask you to condone an approach that you disagree with but do hope that you can understand that others might have a different point of view that is not necessarily wrong.

Finally, we would like to once again say that we regret this circumstance. We wish that this would have worked out differently but at this point, it is important to move forward.

Sincerely yours,

Joseph Nevins

Anil Potti

The Med Student’s Memo

Bradford Perez Submits His Research Concerns


Nevins and Potti Respond To Perez’s Questions and Worries


A Timeline of The Duke Scandal


The Cancer Letter’s Previous Coverage



I’ll put this up in my regular blog shortly

Categories: junk science, Potti and Duke controversy

Why are hypothesis tests (often) poorly explained as an “idiot’s guide”?

From Aris Spanos:

“Inadequate knowledge by textbook writers who often do not have the technical skills to read and understand the original sources, and have to rely on second hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot’s guide to combining off-the-shelf formulae with statistical tables.

“A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian statisticians who revel in (unfairly) maligning frequentist inference in their misguided attempt to motivate their preferred viewpoint of statistical inference.” (Aris Spanos)



Categories: frequentists tests

Gelman’s error statistical critique of data-dependent selections–they vitiate P-values: an extended comment

The nice thing about having a “rejected posts” blog, which I rarely utilize, is that it enables me to park something too long for a comment, but not polished enough to be “accepted” for the main blog. The thing is, I don’t have time to do more now, but would like to share my meanderings after yesterday’s exchange of comments with Gelman.

I entirely agree with Gelman that in studies with wide latitude for data-dependent choices in analyzing the data, we cannot say the study was stringently probing for the relevant error (erroneous interpretation) or giving its inferred hypothesis a hard time.

One should specify what the relevant error is. If it’s merely inferring some genuine statistical discrepancy from a null, that would differ from inferring a causal claim. Weakest of all would be merely reporting an observed association. I will assume the nulls are like those in the examples in the “Garden of Forking Paths” paper, only I was using his (2013) version. I think they are all mere reports of observed associations except for the Bem ESP study. (That they make causal, or predictive, claims already discredits them.)

They fall into the soothsayer’s trick of, in effect, issuing such vague predictions that they are guaranteed not to fail.

Here’s a link to Gelman and Loken’s “The Garden of Forking Paths” http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

I agree entirely: “Once we recognize that analysis is contingent on data, the p-value argument disappears–one can no longer argue that, if nothing were going on, that something as extreme as what was observed would occur less than 5% of the time.” (Gelman 2013, p. 10). The nominal p-value does not reflect the improbability of such an extreme or more extreme result due to random noise or “nothing going on”.

A legitimate p-value of α must be such that

Pr(Test yields P-value < α; Ho: chance) ~ α.

With data dependent hypotheses, the probability the test outputs a small significance level can easily be HIGH, when it’s supposed to be LOW. See this post: “Capitalizing on Chance” reporting on Morrison and Henkel from the 1960s!! [i] http://errorstatistics.com/2014/03/03/capitalizing-on-chance-2/
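A quick simulation shows how far the requirement Pr(P-value < α; Ho) ~ α can fail once the reported result is selected after seeing the data. This is a sketch under simplifying assumptions of mine: every null is true, so each p-value is uniform on (0, 1); the k = 20 comparisons are independent; and the researcher reports only the smallest. The names and the seed are arbitrary.

```python
import random

def smallest_p_value(rng, k):
    """Smallest of k independent p-values when every null is true (each p ~ Uniform(0, 1))."""
    return min(rng.random() for _ in range(k))

rng = random.Random(0)  # arbitrary seed, for reproducibility only
alpha, k, sims = 0.05, 20, 20000

# How often does the *reported* (selected) p-value fall below the nominal level?
hits = sum(smallest_p_value(rng, k) < alpha for _ in range(sims))
actual_rate = hits / sims
print(f"nominal level {alpha}; simulated Pr(reported p < {alpha}; Ho) = {actual_rate:.2f}")
```

The simulated rate lands far above the nominal .05, so the nominal p-value is not a legitimate one in the sense just defined.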

Notice, statistical facts about p-values demonstrate the invalidity of taking these nominal p-values as actual. So statistical facts about p-values are self-correcting or error correcting.

So, just as in my first impression of the “Garden” paper, Gelman’s concern is error statistical: it involves appealing to data that didn’t occur, but might have occurred, in order to evaluate inferences from the data that did occur. There is an appeal to a type of sampling distribution over researcher “degrees of freedom” akin to literal multiple testing, cherry-picking, barn-hunting and so on.

One of Gelman’s suggestions is (or appears to be) to report the nominal p-value, and then consider the prior that would render the p-value equal to the resulting posterior. If the prior doesn’t seem believable, I take it you are to replace it with one that does. Then, using whatever prior you have selected, report the posterior probability that the effect is real. (In a later version of the paper, there is only reference to using a “pessimistic prior”.) This is reminiscent of Greenland’s “dualistic” view. Please search on errorstatistics.com.

Here are some problems I see with this:

  1. The supposition is that for the p-value to be indicative of evidence for the alternative (say in a one-sided test of a 0 null), the p-value should behave like a posterior probability for the null, with (1 − p) going to the non-null. This is questionable. http://errorstatistics.com/2014/07/14/the-p-values-overstate-the-evidence-against-the-null-fallacy/

Aside: Why even allow using the nominal p-value as a kind of likelihood to go into the Bayesian analysis if it’s illegitimate? Can we assume the probability model used to compute the likelihood from the nominal p-value?

  2. One may select the prior in such a way that one reports a low posterior that the effect is real. There’s wide latitude in the selection and it will depend on the framing of the “not-Ho” (non-null). Now one has “critic’s degrees of freedom” akin to researcher degrees of freedom.

One is not criticizing the study or pinpointing its flawed data dependencies, yet one is claiming to have grounds to criticize it.

Or suppose the effect inferred is entirely believable and now the original result is blessed—even though it should be criticized as having poorly tested the effect. Adjudicating between different assessments by different scientists will become a matter of defending one’s prior, when it should be a matter of identifying the methodological flaws in the study. The researcher will point to many other “replications” in a big field studying similar effects, etc.

There’s a crucial distinction between a poorly tested claim and an implausible claim. An adequate account of statistical testing needs to distinguish these.

I want to be able to say that the effect is quite plausible given all I know, etc., but this was a terrible test of it, and supplies poor grounds for the reality of the effect.

Gelman’s other suggestion, that these experimenters distinguish exploratory from confirmatory experiments, and that they be required to replicate their results is, on the face of it, more plausible. But the only way this would be convincing, as I see it, is if the data analysts were appropriately blinded. Else, they’ll do the same thing with the replication.

I agree of course that a mere nominal p-value “should not be taken literally” (in the sense that it’s not an actual p-value)—but I deny that this is equal to assigning p as a posterior probability to the null.

There are many other cases in which data-dependent hypotheses are well-tested by the same data used in their construction/selection. Distinguishing cases has been the major goal of much of my general work in philosophy of science (and it carries over into PhilStat).



 One last thing: Gelman is concerned that the p-values based on these data-dependent associations are misleading the journals and misrepresenting the results. This may be so in the “experimental” cases. But if the entire field knows that this is a data-dependent search for associations that seem indicative of supporting one or another conjecture, and that the p-value is merely a nominal or computed measure of fit, then it’s not clear there is misinterpretation. It’s just a reported pattern.

[i] When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)
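Selvin's 64 percent can be reproduced with one line of arithmetic, on the assumption (mine, for illustration) that the twenty differences are independent and each is tested at the nominal 5 percent level when all true differences are zero. The function name is hypothetical.

```python
def actual_level(nominal_alpha, comparisons):
    """Chance that at least one of `comparisons` independent true nulls looks 'significant'."""
    return 1 - (1 - nominal_alpha) ** comparisons

# Twenty differences examined, the largest tested at the nominal 5 percent level.
print(round(actual_level(0.05, 20), 2))  # 0.64
```

With a single pre-specified comparison the function returns the nominal level itself, which is the only case in which the computed and actual levels agree.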



Categories: rejected posts

A SANTA LIVED AS A DEVIL AT NASA! Create an easy peasy Palindrome for December, win terrific books for free!

To avoid boredom, win a free book, and fulfill my birthday request, please ponder coming up with a palindrome for December. Anyone younger than 18 who creates one gets to select two books. All it needs to have is one word: math (aside from Elba, but we all know able/Elba). Now here’s a tip: consider words with “ight”: fight, light, sight, might. Then just add some words around as needed. (See rules; they cannot be identical to mine.)

Night am…. math gin

fit sight am ….math gist if

sat fight am…math gift as

You can search “palindrome” on my regular blog for past winners, and some on this blog too.

Thanx, Mayo

Categories: palindrome, rejected posts
