I was sent an interesting paper that is a quintessential exemplar of analytic epistemology. It’s called “What’s the Swamping Problem?” (by Duncan Prichard), and was tweeted to me by a philosophy graduate student, George Shiber. I’m too tired and swamped to read the fascinating ins and outs of the story. Still, here are some thoughts off-the-top of my head that couldn’t be squeezed into a tweet. I realize I’m not explaining the problem, that’s why this is in “rejected posts”–I didn’t accept it for the main blog. (Feel free to comment. Don’t worry, absolutely no one comes here unless I direct them through the swamps.)
1.Firstly, it deals with a case where the truth of some claim is given whereas we’d rarely know this. The issue should be relevant to the more typical case. Even then, it’s important to be able to demonstrate and check why a claim is true, and be able to communicate the reasons to others. In this connection, one wants information for finding out more things and without the method you don’t get this.
- Second, the goal isn’t merely knowing isolated factoids but methods. But that reminds me that nothing is said about learning the method in the paper. There’s a huge gap here. If knowing, is understood as true belief PLUS something, then we’ve got to hear what that something is. If it’s merely reliability without explanation of the method,(as is typical in reliabilist discussions) no wonder it doesn’t add much, at least wrt that one fact. It’s hard even to see the difference, unless the reliable method is spelled out. In particular, in my account, one always wants to know how to recognize and avoid errors in ranges we don’t yet know how to probe reliably. Knowing the method should help extend knowledge into unknown territory.
- We don’t want trivial truths. This is what’s wrong with standard confirmation theories, and where Popper was right. We want bold, fruitful, theories that interconnect areas in order to learn more things. I’d rather know how to spin-off fabulous coffee makers using my 3-D printer, say, then have a single good coffee now. The person who doesn’t care how a truth was arrived at is not a wise person. The issue of “understanding” comes up (one of my favorite notions), but little is said as what it amounts to.
- Also overlooked on philosophical accounts is the crucial importance of moving from unreliable claims to reliable claims (e.g., by averaging, in statistics.) . I don’t happen to think knowing merely that the method is reliable is of much use, w/o knowing why, w/o learning how specific mistakes were checked, errors are made to ramify to permit triangulation, etc.
- Finally, one wants an epistemic account that is relevant for the most interesting and actual cases, namely when one doesn’t know X or is not told X is a true belief. Since we are not given that here (unless I missed it) it doesn’t go very far.
- Extraneous: On my account, x is evidence for H only to the extent that H is well tested by x. That is, if x accords with H, it is only evidence for H to the extent that it’s improbable the method would have resulted in so good accordance if H is false. This goes over into entirely informal cases. One still wants to know how capable and incapable the method was to discern flaws.
- Related issues, though it might not be obvious at first, concerns the greater weight given to a data set that results from randomization, as opposed to the same data x arrived at through deliberate selection.
Or consider my favorite example: the relevance of stopping rules. People often say that if data x on 1009 trials achieves statistical significance at the .05 level, then it shouldn’t matter if x arose from a method that planned on doing 1009 trials all along, or one that first sought significance after the first 10, and still not getting it went on to 20, then 10 more and 10 more until finally at trial 1009 significance was found. The latter case involves what’s called optional stopping. In the case of, say, testing or estimating the mean of a Normal distribution the optional stopping method is unreliable, at any rate, the probability it erroneously infers significance is much higher than .05. It can be shown that this stopping rule is guaranteed to stop in finitely trials and reject the null hypothesis, even though it is true. (Search optional stopping on errorstatistics.com)
I may add to this later…You can read it: What Is The Swamping Problem
I was sent a link to “The Bayesian Kitchen” http://www.bayesiancook.blogspot.fr/2014/02/blending-p-values-and-posterior.html and while I cannot tell for sure from from the one post, I’m afraid the kitchen might be open for cookbook statistics. It is suggested (in this post) that real science is all about “science wise” error rates (as opposed to it capturing some early exploratory efforts to weed out associations possibly worth following up on, as in genomics). Here were my comments:
False discovery rates are frequentist but they have very little to do with how well warranted a given hypothesis or model is with data. Imagine the particle physicists trying to estimate the relative frequency with which discoveries in science are false, and using that to evaluate the evidence they had for a Standard Model Higgs on July 4, 2012. What number would they use? What reference class? And why would such a relative frequency be the slightest bit relevant to evaluating the evidential warrant for the Higgs particle, nor for estimating its various properties, nor for the further testing that is now ongoing. Instead physicists use sigma levels (and associated p-values)! They show that the probability is .9999999… that they would have discerned the fact that background alone was responsible for generating the pattern of bumps they repeatedly found (in two labs). This is an error probability. It was the basis for inferring that the SM Higgs hypothesis had passed with high severity, and they then moved on to determining what magnitudes had passed with severity. That’s what science is about! Not cookbooks, not mindless screening (which might be fine for early explorations of gene associations, but don’t use that as your model for science in general).
The newly popular attempt to apply false discovery rates to “science wise error rates” is a hybrid fashion that (inadvertently) institutionalizes cookbook statistics: dichotomous “up-down” tests, the highly artificial point against point hypotheses (a null and some alternative of interest—never mind everything else), identifying statistical and substantive hypotheses, and the supposition that alpha and power can be used as a quasi-Bayesian likelihood ratio. And finally, to top it all off, by plucking from thin air the assignments of “priors” to the null and alternative—on the order of .9 and .1—this hybrid animal reports that more than 50% of results in science are false! I talk about this more on my blog errorstatistics.com
(for just one example:
You might not have thought there could be yet new material for 2014, but there is: for the first time Sir Harold Jeffreys himself is making an appearance, and his joke, I admit, is funny. So, since it’s Saturday night, let’s listen in on Sir Harold’s howler in criticizing p-values. However, even comics try out “new material” with a dry run, say at a neighborhood “open mike night”. So I’m placing it here under rejected posts, knowing maybe 2 or at most 3 people will drop by. I will return with a spiffed up version at my regular gig next Saturday.
Harold Jeffreys: Using p-values implies that “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred.” (1939, 316)
I say it’s funny, so to see why I’ll strive to give it a generous interpretation.
We can view p-values in terms of rejecting H0, as in the joke, as follows:There’s a test statistic D such that H0 is rejected if the observed D,i.e., d0 ,reaches or exceeds a cut-off d* where Pr(D > d*; H0) is very small, say .025. Equivalently, in terms of the p-value:
Reject H0 if Pr(D > d0; H0) < .025.
The report might be “reject H0 at level .025″.
Suppose we’d reject H0: The mean light deflection effect is 0, if we observe a 1.96 standard deviation difference (in one-sided Normal testing), reaching a p-value of .025. Were the observation been further into the rejection region, say 3 or 4 standard deviations, it too would have resulted in rejecting the null, and with an even smaller p-value. H0 “has not predicted” a 2, 3, 4, 5 etc. standard deviation difference. Why? Because differences that large are “far from” or improbable under the null. But wait a minute. What if we’ve only observed a 1 standard deviation difference (p-value = .16)? It is unfair to count it against the null that 1.96, 2, 3, 4 etc. standard deviation differences would have diverged seriously from the null, when we’ve only observed the 1 standard deviation difference. Yet the p-value tells you to compute Pr(D > 1; H0), which includes these more extreme outcomes. This is “a remarkable procedure” indeed! [i]
So much for making out the howler. The only problem is that significance tests do not do this, that is, they do not reject with, say, D = 1 because larger D values, further from might have occurred (but did not). D = 1 does not reach the cut-off, and does not lead to rejecting H0. Moreover, looking at the tail area makes it harder, not easier, to reject the null (although this isn’t the only function of the tail area): since it requires not merely that Pr(D = d0 ; H0 ) be small, but that Pr(D > d0 ; H0 ) be small. And this is well justified because when this probability is not small, you should not regard it as evidence of discrepancy from the null. Before getting to this, a few comments:
1.The joke talks about outcomes the null does not predict–just what we wouldn’t know without an assumed test statistic, but the tail area consideration arises in Fisherian tests in order to determine what outcomes H0 “has not predicted”. That is, it arises to identify a sensible test statistic D (I’ll return to N-P tests in a moment).
In familiar scientific tests, we know the outcomes that are further away from a given hypothesis in the direction of interest, e.g., the more patients show side effects after taking drug Z, the less indicative it is benign, not the other way around. But that’s to assume the equivalent of a test statistic. In Fisher’s set-up, one needs to identify a suitable measure of closeness, fit, or directional departure. Any particular outcome can be very improbable in some respect. Improbability of outcomes (under H0) should not indicate discrepancy from H0 if even less probable outcomes would occur under discrepancies from H0. (Note: To avoid confusion, I always use “discrepancy” to refer to the parameter values used in describing the underlying data generation; values of D are “differences”.)
2. N-P tests and tail areas: Now N-P tests do not consider “tail areas” explicitly, but they fall out of the desiderata of good tests and sensible test statistics. N-P tests were developed to provide the tests that Fisher used with a rationale by making explicit alternatives of interest—even if just in terms of directions of departure.
In order to determine the appropriate test and compare alternative tests “Neyman and I introduced the notions of the class of admissible hypotheses and the power function of a test. The class of admissible alternatives is formally related to the direction of deviations—changes in mean, changes in variability, departure from linear regression, existence of interactions, or what you will.” (Pearson 1955, 207)
Under N-P test criteria, tests should rarely reject a null erroneously, and as discrepancies from the null increase, the probability of signaling discordance from the null should increase. In addition to ensuring Pr(D < d*; H0) is high, one wants Pr(D > d*; H’: μ0 + γ) to increase as γ increases. Any sensible distance measure D must track discrepancies from H0. If you’re going to reason, the larger the D value, the worse the fit with H0, then observed differences must occur because of the falsity of H0 (in this connection consider Kadane’s howler).
3. But Fisher, strictly speaking, has only the null distribution, and an implicit interest in tests with sensitivity of a given type. To find out if H0 has or has not predicted observed results, we need a sensible distance measure.
Suppose I take an observed difference d0 as grounds to reject H0 on account of its being improbable under H0, when in fact larger differences (larger D values) are more probable under H0. Then, as Fisher rightly notes, the improbability of the observed difference was a poor indication of underlying discrepancy. This fallacy would be revealed by looking at the tail area; whereas it is readily committed, Fisher notes, with accounts that only look at the improbability of the observed outcome d0 under H0.
4. Even if you have a sensible distance measure D (tracking the discrepancy relevant for the inference), and observe D = d, the improbability of d under H0 should not be indicative of a genuine discrepancy, if it’s rather easy to bring about differences even greater than observed, under H0. Equivalently, we want a high probability of inferring H0 when H0 is true. In my terms, considering Pr(D < d*; H0) is what’s needed to block rejecting the null and inferring H’ when you haven’t rejected it with severity. In order to say that we have “sincerely tried”, to use Popper’s expression, to reject H’ when it is false and H0 is correct, we need Pr(D < d*; H0) to be high.
5. Concluding remarks:
The rationale for the tail area is twofold: to get the right direction of departure, but also to ensure Pr(test T does not reject null; H0 ) is high.
If we don’t have a sensible distance measure D, then we don’t know which outcomes we should regard as those H0 does or does not predict. That’s why we look at the tail area associated with D. Neyman and Pearson make alternatives explicit in order to arrive at relevant test statistics. If we have a sensible D, then Jeffreys’ criticism is equally puzzling because considering the tail area does not make it easier to reject H0 but harder. Harder because it’s not enough that the outcome be improbable under the null, outcomes even greater must be improbable under the null. And it makes it a lot harder (leading to blocking a rejection) just when it should: because the data could readily be produced by H0 [ii].
Either way, Jeffreys’ criticism, funny as it is, collapses.
When an observation does lead to rejecting the null, it is because of that outcome—not because of any unobserved outcomes. Considering other possible outcomes that could have arisen is essential for determining (and controlling) the capabilities of the given testing method. In fact, understanding the properties of our testing tool just is to understand what it would do under different outcomes, under different conjectures about what’s producing the data.
[i] Jeffreys’ next sentence, remarkably is: “On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it.” This further supports my reading, as if we’d reject a fair coin null because it would not predict 100% heads, even though we only observed 51% heads. But the allegation has no relation to significance tests of the Fisherian or N-P varieties.
[ii] One may argue it should be even harder, but that is tantamount to arguing the purported error probabilities are close to the actual ones. Anyway, this is a distinct issue.
Zachery notes: “Ableton Live is a popular DJ software all the hipster kids use.”
MINIMUM REQUIREMENT**: A palindrome that includes Elba plus procedure.
BIO: Zachary David is a quantitative software developer at a Chicago-based proprietary trading firm and a student at Northwestern University. He infrequently blogs at http://zacharydavid.com.
BOOK SELECTION: “I’d love to get Error and Inference* off of my wish list and onto my desk.”
EDITOR: It’s yours!
STATEMENT: “Finally, after years of living in Wicker Park, my knowledge of hipsters has found its way into poetry and paid out in prizes. I would like to give a special thank you to professor Mayo for being very welcoming to this first time palindromist. I will definitely participate again… I enjoyed the mental work out. Perhaps the competition will pick up in the future.”
*Full title of book choice:
Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (by D. G. Mayo and A. Spanos, eds.,CUP 2010),
Note: The word for January 2014 is “optimal” (plus Elba). See January palindrome page.
**Nor can it repeat or be close to one that Mayo posts. Joint submissions are permitted (1 book); no age requirements. Professional palindromists not permitted to enter. Note: The rules became much easier starting May 2013, because no one was winning, or even severely trying. The requirements had been Elba + two selected words, rather than only one. I hope we can go back to the more severe requirements once people get astute at palindromes—it will increase your verbal IQ, improve mental muscles, and win you free books. (The book selection changes slightly each month).
PhilStock. I haven’t had time for stock research in the past 6 months, but fortunately, no changes to portfolios have been required. With Yellen’s assurances last week that the monthly methadone injections of $85 billion[i] of will continue, it’s bull, bull, bull, with new highs weekly. Even my airlines—generally the worst area to trade in—are, yes flying high (e.g., American from $1.90 to over $11., Delta, Jet Blue, all soaring). But look how low our Diamond Offshore mascot (DO) is [ii]. It is said that small investors typically jump into the market only after the bull has been running:
“The likely outcome is they’ll ride that last-gasp bull market for a short while and experience an enormous loss in personal wealth when the bubble collapses.”(link)
I’m guessing the next 4 months might be safe (T, VZ, WIN?): Remember, though, the one rule on PhilStock: Never ever listen to (i.e., act on) anything I say about the stock market.
[i]in monthly bond market purchases.
[ii] There’s an explanation (of course). It hardly matters with over 5% in special dividends. For why DO is the “mascot” of my regular blog, search rejected posts.
Some related posts:
Bad News is Good News on Wallstreet
I am reminded it is Friday, having just gotten a Skype call from friends back at Elbar; so here’s another little contest. Each of the following statisticians provided useful help on drafts of papers I was writing in response to (Bayesian) critics. One of them, to my surprise, attached the following poem to his remarks:
A toast is due to one who slays
Misguided followers of Bayes,
And in their heart strikes fear and terror,
With probabilities of error!
Without looking this up, guess the author:
1) I.J. Good
2) George Barnard
3) Erich Lehmann
4) Oscar Kempthorne
The first correct guess will receive an amusing picture from “the whaler” sent from Elbar.
(Note: The author wanted me to note that this poem was to be taken in a jocular vein. )