March | 2016 | Rejected Posts of D. Mayo

I extracted some illuminating gems from the recent discussion on my”Error Statistics Philosophy” blogpost, but I don’t have time to write them up, and won’t for a bit, so I’m parking a list of comments wherein the golden extracts lie here; it may be hard to extricate them from over 120 comments later on. (They’re all my comments, but as influenced by readers.) If you do happen wander into my Rejected Posts blog again, you can expect various unannounced tinkering on this post, and the results may not be altogether grammatical or error free. Don’t say I didn’t warn you.

I’m looking to explain how a frequentist error statistician (and lots of scientists) understand

Pr(Test T produces d(X)>d(x); Ho) ≤ p.

You say ” the probability that the data were produced by random chance alone” is tantamount to assigning a posterior probability to Ho, based on a prior) and I say it is intended to refer to an ordinary error probability. The reason it matters isn’t because 2(b) is an ideal way to phrase the type 1 error prob or the attained significance level. I admit it isn’t ideal But the supposition that it’s a posterior leaves one in the very difficult position of defending murky distinctions, as you’ll see in my next thumb’s up and down comment.

You see, for an error statistician, the probability of a test result is virtually always construed in terms of the HYPOTHETICAL frequency with which such results WOULD occur, computed UNDER the assumption of one or another hypothesized claim about the data generation. These are 3 key words.
Any result is viewed as of a general type, if it is to have any non-trivial probability for a frequentist.
Aside from the importance of the words HYPOTHETICAL and WOULD is the word UNDER.

Computing {d(X) > d(x)} UNDER a hypothesis, here, Ho, is not a conditional probability.** This may not matter very much, but I do think it makes it difficult for some to grasp the correct meaning of the intended error probability.

OK, well try your hand at my next little quiz.

…..
**See double misunderstandings about p-valueshttps://normaldeviate.wordpress.com/2013/03/14/double-misunderstandings-about-p-values/

———————————————-

Thumbs up or down? Assume the p-value of relevance is 1 in 3 million or 1 in 3.5 million. (Hint: there are 2 previous comments of mine in this post of relevance.)

only one experiment in three million would see an apparent signal this strong in a universe [where Ho is adequate].
the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million
The probability of the background alone fluctuating up by this amount or more is about one in three million.
there is only a 1 in 3.5 million chance the signal isn’t real.
the likelihood that their signal would result by a chance fluctuation was less than one chance in 3.5 million
one in 3.5 million is the likelihood of finding a false positive—a fluke produced by random statistical fluctuation
there’s about a one-in-3.5 million chance that the signal they see would appear if there were [Ho adequate].
it is 99.99997 per cent likely to be genuine rather than a fluke.

They use likelihood when they should mean probability, but we let that go.

The answers will reflect the views of the highly respected PVPs–P-value police.

—————————————————

THUMBS UP OR DOWN ACCORDING TO THE P-VALUE POLICE (PVP)

1. only one experiment in three million would see an apparent signal this strong in a universe [where Ho is adequately describes the process].
up

the likelihood that their signal was a result of a chance fluctuation was less than one chance in 3.5 million
down
The probability of the background alone fluctuating up by this amount or more is about one in three million.
up
there is only a 1 in 3.5 million chance the signal isn’t real.
down
the likelihood that their signal would result by a chance fluctuation was less than one chance in 3.5 million
up
one in 3.5 million is the likelihood of finding a false positive—a fluke produced by random statistical fluctuation
down (or at least “not so good”)
there’s about a one-in-3.5 million chance that the signal they see would appear if there were no genuine effect [Ho adequate].
up
it is 99.99997 per cent likely to be genuine rather than a fluke.
down

I find #3 as a thumbs up especially interesting.

The real lesson, as I see it, is that even the thumbs up statements are not quite complete in themselves, in the sense that they need to go hand in hand with the INFERENCES I listed in an earlier comment, and repeat below. These incomplete statements are error probability statements, and they serve to justify or qualify the inferences which are not probability assignments.

In each case, there’s an implicit principle (severity) which leads to inferences which can be couched in various ways such as:

Thus, the results (i.e.,the ability to generate d(X) > d(x)) indicate(s):

the observed signals are not merely “apparent” but are genuine.
the observed excess of events are not due to background
“their signal” wasn’t (due to) a chance fluctuation.
“the signal they see” wasn’t the result of a process as described by Ho.

If you’re a probabilist (as I use that term), and assume that statistical inference must take the form of a posterior probability*, then unless you’re meticulous about the “was/would” distinction you may fall into the erroneous complement that Richard Morey aptly describes. So I agree with what he says about the concerns. But the error statistical inferences are 1,3,5,7 along with the corresponding error statistical qualification.

For this issue, please put aside the special considerations involved in the Higgs case. Also put to one side, for this exercise at least, the approximations of the models. If we’re trying to make sense out of the actual work statistical tools can perform, and the actual reasoning that’s operative and why, we are already allowing the rough and ready nature of scientific inference. It wouldn’t be interesting to block understanding of what may be learned from rough and ready tools by noting their approximative nature–as important as that is.

*I also include likelihoodists under “probabilists”.

****************************************************

Richard and everyone: The thumb’s up/downs weren’t mine!!! The are Spiegelhalter’s!
http://understandinguncertainty.org/explaining-5-sigma-higgs-how-well-did-they-do

I am not saying I agree with them! I wouldn’t rule #6 thumbs down, but he does. This was an exercise in deconstructing his and similar appraisals, (which are behind principle #2) in order to bring out the problem that may be found with 2(b). I can live with all of them except #8.

Please see what I say about “murky distinctions” in the comment from earlier:
http://errorstatistics.com/2016/03/12/a-small-p-value-indicates-its-improbable-that-the-results-are-due-to-chance-alone-fallacious-or-not-more-on-the-asa-p-value-doc/#comment-139716

****************************************

PVP’s explanation of official ruling on #6

****************************************

The insights to take away from this thumb’s up:
3. The probability of the background alone fluctuating up by this amount or more is about one in three million.

Given that the PVP are touchy about assigning probabilities to “the explanation” it is noteworthy that this is doing just that. Isn’t it?*
Abstract away as much as possible from the particularities of the Higg’s case, which involves a “background,” in order to get at the issue.

3′ The probability that chance variability alone (or the perhaps the random assignment of treatments) produces a difference as or larger than this is about one in 3 million. (The numbers don’t matter.)

In the case where p is very small, the “or larger” doesn’t really add any probability. The “or larger” is needed for BLOCKING inferences to real effects by producing p-values that are not small. But we can keep it in.

3” The probability that chance alone produces a difference as larger or larger than observed is 1 in 3 million (or other very small value).

3”’The probability that a difference this large or larger is produced by chance alone is 1 in 3 million (or other very small value).

I see no difference between 3, 3′, 3” and p”’. (The PVP seem forced into murky distinctions.)

For a frequentist who follows Fisher in avoiding isolated significant results, the “results” = the ability to produce such statistically significant results.

*Qualification: It’s never what the PVP called “explanation” alone, nor the data alone,at least for a sampling theorist-error statistician. It’s the overall test procedure,or even better: my ability to reliably bring about results that are very improbable under Ho”. I render it easy to bring about results that would be very difficult under Ho.

See also my comment below:

http://errorstatistics.com/2016/03/12/a-small-p-value-indicates-its-improbable-that-the-results-are-due-to-chance-alone-fallacious-or-not-more-on-the-asa-p-value-doc/comment-page-1/#comment-139772

The mistake is in thinking we start with the probabilistic question Richard states. I say we don’t. I don’t.

*********************************************

Here it is:

Richard: I want to come back to your first comment:

You wrote:
if I ask “What is the probability that the symptoms are due to heart disease?” I’m asking a straightforward question about whether the probability that the symptoms are caused by an actual case of heart disease, not the probability that I would see the symptoms assuming I had heart disease.

My reply: Stop right after the first comma, before “not the probability.” The real question of interest is: are these symptoms caused by heart disease (not the probability they are).

In setting out to answer the question suppose you found that it’s quite common to have even worse symptoms due to indigestion and no heart disease. This indicates it’s readily explainable without invoking heart disease. That is, in setting out to answer a non-probabilistic question* you frame it as of a general type, and start asking how often would this kind of thing be expected under various scenarios. You appeal to probabilistic considerations to answer your non-probabilistic question, and when you amass enough info, you answer it. Non-probabilistically.

Abstract from the specifics of the example on heart disease which brings in many other considerations.

*You know the answers are open to error, but that doesn’t make the actual question probabilistic.

*********************************

Mar 20, 2016: Look at #1:

only one experiment in three million would see an apparent signal this strong in a universe [where Ho is adequate/true].
a. This is a big thumb’s up for the PVP. Now I do have one beef with this, and that’s the fact it doesn’t say that the apparent signal is produced by, or because of, or due to, whatever mechanism is described by Ho. This is important, because unless this production connection is there, it’s not an actual p-value. (I’m thinking of that Kadane chestnut on my regular blog where some “result” (say a tsunami) is very improbable, and it’s alleged one can put any “null” hypothesis Ho at all to the right of “;” (e.g., no Higgs particle), and get a low p-value for Ho.
The p-value has to be computable because of Ho, that is, Ho assigns the probability to the results.

b. Now consider: “an apparent signal this strong in a universe where Ho is the case”. The blue words are what the particle physicist means by a “fluke” of that strength. So we get a thumb’s up to:
only one experiment in three million would see a fluke this strong (i.e., under Ho)
or
the probability of seeing a 5 sigma fluke is one in three million.

Laurie: I get you’re drift, and now I see that it arose because of some very central confusions between how different authors of comments on the ASA doc construed the model. I don’t want to relegate that to a mere problem of nomenclature or convention, but nevertheless, I’d like to take it up another time and return to my vacuous claim.

The Pr(P < p; Ho) = p (very small). Or

(1): Pr(Test T yields d(X) > d(x); Ho) = p (very small) This may be warranted by simulation or analytically, so not mere mathematics, and it’s always approximate, but I’m prepared to grant this.

Or, to allude to Fisher:

The probability of bringing about such statistically significant results “at will”, were Ho the case, is extremely small.

Now for the empirical and non-vacuous part (which we are to spoze is shown):

Spoze

(2): I do reliably bring about stat sig results d(X) > d(x). I’ve shown the capability of bringing about results each of which would occur in, say, 1 in 3 million experiments UNDER the supposition that Ho.

(It’s not even the infrequency that matters, but the use of distinct instruments with different assumptions, where errors in one are known to ramify in the at least one other)

I make the inductive inference (which again, can be put in various terms, but pick any one you like):

(3): There’s evidence of a genuine effect.

The move from (1) and (2) to inferring (3) is based on

(4): Claim (3) has passed a stringent or severe test by dint of (1) and (2). In informal cases, the strongest ones, actually, this is also called a strong argument from coincidence.

(4) is full of content and not at all vacuous, as is (2). I admit it’s “philosophical” but also justifiable on empirical grounds. But I won’t get into those now.

These aren’t all in chronological, but rather (somewhat) logical order. What’s the upshot? I’ll come back to this. Feel free to comment.

Blame it on the nasal spray

Sit down with students or friends and ask them what’s wrong with this study–or just ask yourself–and it will likely leap out. Now I’ve only read the paper quickly, and know nothing of oxytocin (OT) research. That’s why this is in the “Rejected Posts” blog. Plus, I’m writing this really quickly.

You see, I noticed a tweet about how non-statistically significant results are often stored in filedrawers, rarely published, and right away was prepared to applaud the authors for writing on their negative result. Now that I see the nature of the study, and the absence of any critique of the experiment itself (let alone the statistics), I am less impressed. What amazes me about so many studies is not that they fail to replicate but that the glaring flaws in the study aren’t considered! But I’m prepared to be corrected by those who do serious oxytocin research.

In a nutshell: Treateds get OT nasal spray, controls get a placebo spray; you’re told they’re looking for effects on sexual practices, when actually they’re looking for effects on trust bestowed upon experimenters with your answers).

The instructions were the follows: “You will now perform a task on the computer. The instruction concerning this task will appear on screen but if you have any question, do not hesitate. At the end of the computer test, you will have to fill a questionnaire that is in the envelope on your desk. As we want to examine if oxytocin has an influence on sexual practices and fantasies, do not be surprised by the intimate or awkward nature of the questions. Please answer as honestly as possible. You will not be judged. Also do not be afraid of being sincere in your answers, I will not look at your questionnaire, I swear it. It will be handled by one of the guy in charge of the optical reading device who will not be able to identify you (thanks to the coding system). At the end of the experiment, I will bring him all the questionnaires. I will just ask you to put the questionnaire back in the envelope once it is completed. You may close the envelope at the end and, if you want, you may even add tape. There is a tape dispenser on your desk”. There is some examples of questions they were asked to answer: “What was your wildest sex experiment ?”, “Are you satisfied with your sex life? Could you describe it? (frequency, quality,…)” Please report on a 7-point Likert scale (1 = not at all, it disgusts me à 7 = very much, I really like) your willingness to be involved in the following sexual practices: using sex toys, doing a threesome, having sex in public, watch other people having sex, watch porn before or during a sexual intercourse,…”

Imagine you’re a subject in the study. Is there a reason to care if the researcher knows details of your sex life? The presumption is that you do care. But anyone who really cared wouldn’t reveal whatever they deemed so embarrassing. But wait, there’s another crucial element to this experiment.

We’re told: “we want to examine if oxytocin has an influence on sexual practices and fantasies“. You’ve been sprayed with either OT or placebo, and I assume you don’t know which. Suppose OT does influence willingness to engage in wild sex experiments. Being sprayed today couldn’t very well change your previous behavior. So unless they had asked you last week (without spray) and now once again with spray, they can’t be looking for changes on actual practice. But OT spray could make you more willing to say you’re more willing to engage in “the following sexual practices: using sex toys, doing a threesome, having sex in public,…etc. etc.” It could also influence feelings right now, i.e., how satisfied you feel now that you’ve been “treated”. So since the subject reasons this must be the effect they have in mind, only scores on the “willingness” and “current feelings” questions could be picking up on the OT effect. But high numbers on willingness and feelings questions don’t reflect actual behaviors–unless the OT effect extends to exaggerating about past behaviors, that is, lying about them, in which case, once again, your own actual choices and behaviors in life are not revealed by the questionnaire. Given the tendency of subjects to answer as they suppose the researcher wants, I can imagine higher numbers on such questions (than if they weren’t told they’re examining if OT has an influence on sexual practices). But since the numbers don’t, indeed, can’t reflect true effects on sexual behavior, there’s scarce reason to regard them as private information revealed only to experimenters you trust. I’ll bet almost no one uses the tape*.

There are many, alternative criticisms of this study. For example, realizing they can’t be studying the influence of sex practices, you already mistrust the experimenter. Share yours.

Let me be clear: I don’t say OT isn’t related to love and trust––it’s active in childbirth, nursing, and, well…whatever. It is, after all, called the ‘love hormone’. My kvetch is with the capability of this study to discern the intended effect.

I say we need to publish analyses showing what’s wrong with the assumption that a given experiment is capable of distinguishing the “effects” of the “treatment” of interest. And what about those Likert scales! These aren’t exactly genuine measurements merely because they’re quantitative.

*It would be funny to look for a correlation between racy answers and tape.

Monthly Archives: March 2016

Souvenirs from “the improbability of statistically significant results being produced by chance alone”-under construction

Can today’s nasal spray influence yesterday’s sex practices? Non-replication isn’t nearly critical enough: Rejected post

Workshop

Archives

Meta

Follow Blog via Email