Fetishizing p-Values
Posted by Tom Leinster
The first time I understood the problem was when I read this:
p-values are not a substitute for real measures of effect size, and despite its popularity with researchers and journal editors, testing a null hypothesis is rarely the appropriate model in science [long list of references, from 1969 onwards]. In natural populations, the null hypothesis of zero differentiation is virtually always false, and if sample size is large enough, this can be demonstrated with any desired degree of statistical significance.
(Lou Jost, Molecular Ecology 18 (2009), 2088–2091.)
If Jost’s criticism is valid, it’s shockingly important. He is attacking what is perhaps the most common use of statistics in science.
This goes as follows:
- Formulate a null hypothesis.
- Collect some data.
- Perform a statistical test on the data.
- Conclude that if the null hypothesis is correct, the probability of obtaining data as extreme as, or more extreme than, yours is less than 0.05 (say). That’s a p-value of 0.05, or, sloppily, 95% certainty that the null hypothesis is false.
- Trumpet the high statistical significance of your conclusion.
Low p-values are taken to indicate high certainty, and they’re used all over science. And that, Jost claims, is a problem. I’ll explain that further in a moment.
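To make that recipe concrete, here is a minimal sketch in Python. The coin’s bias of 0.51 and the 10,000 tosses are invented numbers, used only to illustrate the steps.

```python
import math
import random

random.seed(0)

# Step 1: null hypothesis: the coin is fair (P(heads) = 0.5).
p0 = 0.5

# Step 2: collect some data: toss a slightly biased coin many times.
n = 10_000
true_bias = 0.51   # hypothetical tiny bias, for illustration only
heads = sum(random.random() < true_bias for _ in range(n))

# Step 3: perform a statistical test, here a normal-approximation z-test.
z = (heads - n * p0) / math.sqrt(n * p0 * (1 - p0))

# Step 4: the p-value: the probability, under the null hypothesis, of data
# at least as extreme as what we observed (two-sided).
p_value = math.erfc(abs(z) / math.sqrt(2))

# Step 5: trumpet the 'significance'.
print(f"observed proportion of heads: {heads / n:.3f}")
print(f"p-value: {p_value:.4f}")
```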
Now there’s a whole book making the same point: The Cult of Statistical Significance, by two economists, Stephen T. Ziliak and Deirdre N. McCloskey. You can see their argument in this 15-page paper with the same title. Just because they’re economists doesn’t mean their prose is sober: according to one subheading, ‘Precision is Nice but Oomph is the Bomb’.
So, is this really the scandalous state of affairs that Jost, Ziliak and McCloskey seem to believe it to be?
First I’ll explain what I believe the basic point to be. Take a coin from your pocket. A truly unbiased coin has never been made. In particular, your coin is biased. That means that if you toss your coin enough times, you can demonstrate with 99.9% certainty that it is biased. Or, to speak properly, you can reject the null hypothesis that the coin is fair with a p-value of 0.001.
Does that matter? No. The coin in your pocket is almost certainly as fair as it needs to be for the purposes of making coin-toss-based decisions. The precision of the conclusion has nothing to do with the magnitude of the effect.
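To see how the sample size alone drives the p-value, here is a sketch that fixes a tiny observed bias (an assumed 50.1% heads rate, invented for illustration) and simply lets the number of tosses grow:

```python
import math

p0 = 0.5        # null hypothesis: the coin is fair
p_hat = 0.501   # assumed observed proportion of heads, a negligible bias

for n in (10**4, 10**5, 10**6, 10**7, 10**8):
    # Normal-approximation z-test for a proportion.
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    print(f"n = {n:>11,}   p-value = {p_value:.2e}")

# The effect (a 0.1% bias) never changes, but the p-value can be driven as
# low as we like simply by tossing the coin more often.
```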
Coins don’t matter, but drug trials do. Suppose that you and I are each trying to develop a drug to speed up the healing of broken legs. You do some trials and run some statistics and conclude that your drug works—in other words, you reject the null hypothesis that patients taking your drug heal no faster than a control group. I do the same for my drug. Your p-value is 0.05 (a certainty of 95%), and mine is 0.001 (a certainty of 99.9%).
What does this tell us about the comparative usefulness of our drugs? Nothing. That’s because we know nothing about the magnitude of the effect. I can now reveal that my wonder drug for broken legs is… an apple a day. Who knows, maybe this does have a minuscule positive effect; for the sake of argument, let’s say that it does. As with the coin example, given enough time to conduct trials, I can truthfully claim any degree of statistical significance I like.
The danger is that someone who buys into the ‘cult of statistical significance’ might simply look at the numbers and say ‘0.001 is less than 0.05, so the second drug must be better’.
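To illustrate why that comparison says nothing about the comparative usefulness of the drugs, here is a rough simulation; the healing times, effect sizes and trial sizes are all invented for the sake of the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical healing times in days (all numbers invented for illustration).
control_mean, sd = 60.0, 10.0

# Your drug: shaves 5 days off healing, but tested on only 40 patients per arm.
yours_control = rng.normal(control_mean, sd, 40)
yours_treated = rng.normal(control_mean - 5.0, sd, 40)

# My 'drug' (an apple a day): shaves 0.1 days off, tested on 500,000 per arm.
mine_control = rng.normal(control_mean, sd, 500_000)
mine_treated = rng.normal(control_mean - 0.1, sd, 500_000)

p_yours = stats.ttest_ind(yours_treated, yours_control).pvalue
p_mine = stats.ttest_ind(mine_treated, mine_control).pvalue

print(f"your drug: effect ~ 5 days,   p = {p_yours:.3g}")
print(f"my drug:   effect ~ 0.1 days, p = {p_mine:.3g}")
# The apple gets the far smaller p-value, yet it is by far the less useful drug.
```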
Ziliak and McCloskey’s book is reviewed in the Notices of the AMS (fairly positive), the Times Higher Education (positive but not so detailed), and an economics-ish blog called PanCrit (‘a poorly argued rant’). Olle Häggström, author of the Notices review, takes up the story:
A major point in The Cult of Statistical Significance is the observation that many researchers are so obsessed with statistical significance that they neglect to ask themselves whether the detected discrepancies are large enough to be of any subject-matter significance. Ziliak and McCloskey call this neglect sizeless science.
[…]
In one study, they have gone over all of the 369 papers published in the prestigious journal American Economic Review during the 1980s and 1990s that involve regression analysis. In the 1980s, 70 percent of the studied papers committed sizeless science, and in the 1990s this alarming figure had increased to a stunning 79 percent.
This analysis backs up Jost’s swipe at journal editors.
(How many academics would be brave enough to criticize journal editors like that? It may be significant that Jost, whose work I’ve enjoyed very much and rate very highly, is not employed by a university.)
So what should scientists be doing? Jost makes a brief suggestion:
The important scientific question is the real magnitude of the differentiation, not the smallness of the p-value (which confounds the magnitude of the effect with the sample size). Answering the question is a matter of parameter estimation, not hypothesis testing. In this approach, the final result should be an estimate of a meaningful measure of the magnitude of differentiation, accompanied by a confidence interval that describes the statistical uncertainty in this estimate. If the confidence interval includes zero, then the null hypothesis cannot be rejected. If the confidence interval does not include zero, then not only can we reject the null hypothesis, but we can have an idea of whether the real magnitude of the differentiation is large or small.
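Here is a rough sketch of that approach, reusing the invented ‘apple a day’ trial from above: estimate the effect itself and report a 95% confidence interval, rather than a bare p-value.

```python
import numpy as np

rng = np.random.default_rng(0)

# The same invented 'apple a day' trial: a 0.1-day improvement, 500,000 per arm.
control = rng.normal(60.0, 10.0, 500_000)
treated = rng.normal(59.9, 10.0, 500_000)

# Parameter estimation: the quantity of interest is the difference in means.
effect = control.mean() - treated.mean()
se = np.sqrt(control.var(ddof=1) / len(control) + treated.var(ddof=1) / len(treated))

# 95% confidence interval (normal approximation, since the samples are huge).
lo, hi = effect - 1.96 * se, effect + 1.96 * se
print(f"estimated improvement: {effect:.3f} days, 95% CI ({lo:.3f}, {hi:.3f})")

# With samples this large the interval will typically exclude zero, so the null
# hypothesis is rejected; but the interval also makes plain that the improvement
# is a fraction of a day, which is of no clinical interest.
```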
It seems to me that the basic point made by Jost and by Ziliak and McCloskey has to be right. But I’m not very knowledgeable about statistics or its application to science, so I’m less sure that it’s as common a fallacy as they suggest. Maybe you think they’re attacking a straw man. And maybe there are differences between the positions of Jost on the one hand, and Ziliak and McCloskey on the other, that I’m not detecting. I’d be interested to hear from those who know more about this than I do.
Re: Fetishizing p-Values
Do you have a direct link to the paper by Jost that you quote at the start of the post? I can’t find any journal called Molecular Biology that was at volume 18 in 2009.