2014 : WHAT SCIENTIFIC IDEA IS READY FOR RETIREMENT?

Psychologist; Director, Harding Center for Risk Literacy, Max Planck Institute for Human Development; Author, How to Stay Smart in a Smart World

Scientific Inference Via Statistical Rituals

As a young man, Gottfried Wilhelm Leibniz had a beautiful dream: to discover the calculus that could map every single idea in the world into symbols. Such a universal calculus would put an end to all scholarly bickering—every passionate Edge discussion, for one, could be swiftly resolved by dispassionate calculation. Leibniz optimistically estimated that a few skilled persons should be able to work the whole thing out in five years. Yet nobody, Leibniz included, has yet found that holy grail.

Nonetheless, Leibniz's dream is alive and thriving in the social and neurosciences. Because the object of the dream has not been found, "ersatz objects" serve in its place. In some fields, it's multiple regression; in others, Bayesian statistics. But the champ is the null ritual:

1. Set up a null hypothesis of "no mean difference" or "zero correlation." Don't specify the predictions of your own research hypothesis.

2. Use 5 percent as a convention for rejecting the null. If significant, accept your research hypothesis. Report the result as p<.05, p<.01, or p<.001, whichever comes next to the obtained p-value.

3. Always perform this procedure.

Not for a minute should anyone think that this procedure has much to do with statistics proper. Sir Ronald Fisher, to whom it has been wrongly attributed, in fact wrote that no researcher should use the same level of significance from experiment to experiment, while the eminent statisticians Jerzy Neyman & Egon Pearson would roll over in their graves if they knew about its current use. Bayesians too have always detested p-values. Yet open any journal in psychology, business, or neuroscience and you are likely to encounter page after page with p-values. To give just a few illustrations: In 2012, the average number of p-values in the Academy of Management Journal, the flagship empirical journal in its field, was 116 per article, ranging between 19 and 536! Typical of management, you might think. But if you take a look at all behavioral, neuropsychological and medical studies with humans published in 2011 in Nature, 89% of them reported p-values only—without even considering effect size, confidence interval, power, or model estimation.

A ritual is a collective or solemn ceremony consisting of actions performed to a prescribed order. It typically includes (i) sacred numbers or colors, (ii) delusions to avoid thinking about why one is performing the actions, and (iii) fear of being punished if one stops performing them. The null ritual contains all these features.

The number "5 percent" is held sacred, allegedly telling us the difference between a real effect and random noise. In fMRI studies, the numbers are replaced by colors, and the brain is said to light up.

The delusions are striking; if psychiatrists had any appreciation of statistics, they would have entered these aberrations into the DSM. Studies in the US, UK, and Germany showed that most researchers do not (or do not want to) understand what a p-value means. They confuse the p-value with the probability of a hypothesis, that is, p(Data|H0) with p(H0|Data), or with something else that wishful thinking desires, such as the probability that the data can be replicated. Startling errors are published in top journals. For instance, a most elementary point is that in order to investigate whether two means differ, one should test their difference. What should not be done is to test each mean against a common baseline, such as: "Neural activity increased with training (p < .05) but not in the control group (p > .05)." A 2011 paper in Nature Neuroscience analyzed neuroscience articles in Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience and found that although 78 did as they should, 79 used the incorrect procedure.
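To see why "significant in one group, not significant in the other" says little about the difference between the groups, consider a minimal sketch in Python. The effect estimates and standard errors are hypothetical numbers chosen purely for illustration, not data from any study mentioned here, and the sketch assumes independent groups and a normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value for a z-test of `estimate` against zero."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical change from baseline in each group (illustrative values only).
training_effect, training_se = 25.0, 10.0
control_effect, control_se = 10.0, 10.0

# The incorrect procedure: test each group against the baseline separately.
p_training = two_sided_p(training_effect, training_se)
p_control = two_sided_p(control_effect, control_se)

# The correct procedure: test the difference between the two groups directly.
# For independent groups, the variances of the two estimates add.
difference = training_effect - control_effect
difference_se = sqrt(training_se**2 + control_se**2)
p_difference = two_sided_p(difference, difference_se)

print(f"training vs. baseline: p = {p_training:.3f}")
print(f"control vs. baseline:  p = {p_control:.3f}")
print(f"training vs. control:  p = {p_difference:.3f}")
```

With these made-up numbers, the training group clears the 5 percent threshold and the control group does not, yet the direct test of their difference falls well short of it. The pattern in the quoted sentence therefore provides no evidence, by the ritual's own standard, that training did anything the control condition did not.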

Not performing the ritual can provoke great anxiety, even when it makes absolutely no sense. In one study (the authors' names are irrelevant), Internet participants were asked whether there is a difference between heroism and altruism. The vast majority felt so: 2,347 respondents (97.5%) said yes, and 58 said no. What did the authors do with that information? They computed a chi-square test, calculated that χ²(1) = 2178.60, p < .0001, and came to the astounding conclusion that there were indeed more people saying yes than no.
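For the curious, the reported statistic can be reproduced from the two counts alone. The sketch below assumes the authors ran a goodness-of-fit test against a null hypothesis of an even yes/no split, which the source does not state explicitly; the variable names are mine.

```python
# Counts reported in the study.
yes, no = 2347, 58
total = yes + no

# Assumed null hypothesis: "yes" and "no" are equally likely.
expected = total / 2

# Pearson chi-square statistic with one degree of freedom.
chi_square = sum((observed - expected) ** 2 / expected for observed in (yes, no))

print(f"chi-square(1) = {chi_square:.2f}")
print(f"yes outnumber no by {yes - no} of {total} responses")
```

Under that assumption the statistic matches the published value, and the last line of output shows how little it adds: the raw counts already tell the whole story.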

One manifestation of obsessive-compulsive disorder is the ritual of compulsive hand washing, even if there is no reason to do so. Likewise, researchers adhering to the null ritual perform statistical inferences all the time, even in situations where there is no point: that is, when no random sample was taken from a population, or no population was defined in the first place. In those cases, the statistical model of repeated random sampling from a population does not even apply, and good descriptive statistics is called for. So even if a significant p-value has been happily calculated, it's not clear what population is meant. The problem is not statistics, but its mistaken use as an automatic inference machine.

Finally, just as compulsive worrying and hand washing can interfere with the quality of life, the craving for significant p-values can undermine the quality of research. Which it has: Finding significant theories has been largely replaced by finding significant p-values. This surrogate goal encourages questionable research practices such as selectively reporting studies and conditions that "worked", or excluding data after looking at their impact on the results. According to a 2012 survey in Psychological Science of some 2,000 psychologists, over 90% admitted to having engaged in at least one of these or other questionable research practices. This massive borderline cheating in order to produce significant p-values is likely more harmful to progress than the rare cases of outright fraud. One harmful outcome is a flood of published but irreproducible results. Genetic and medical research using big data has encountered similar surprises when trying in vain to replicate published findings.

I do not mean to throw out the baby with the bathwater and get rid of statistics, which offers a highly useful toolbox for researchers. But it is time to get rid of statistical rituals that nurture automatic and mindless inferences.

Scientists should study rituals, not perform rituals themselves.
