Robert Kurzban: P-Hacking and the Replication Crisis
The first three talks this morning have been optimistic. We've heard about the promise of big data, we've heard about advances in emotions, and we've just heard from Fiery, who very cleverly managed to find a way to leave before I gave my remarks about how we're understanding something deep about human nature. There's a risk that my remarks are going to be understood as pessimistic but they're really not. My optimism is embodied in the notion that what we're doing here is important and we can do it better.
I really wanted to take this opportunity to have a chance to speak to the people here about what's been going on in some corners of psychology, mostly in areas like social psychology and decision-making. In fact, Danny Kahneman has chimed in on this discussion, which is really what some people thought about as a crisis in certain parts of psychology, which is that insofar as replication is a hallmark of what science is about, there's not a lot of it and what there is shows that things we thought were true maybe aren't; that's really bad. This is a great setting in which to talk about these things, and I want to talk about it in part from my experience in this because I started to come into contact with this in a way that I'll describe right now.
Let me just give you a quotation from Barack Obama, President of the United States; he says, "I'm trying to pare down decisions. I don't want to make decisions about what I'm eating or wearing." He was discussing the fact that all of his suits are the same, so he doesn't have to actually pick suits each day. He says, "You need to focus your decision making energy." He's relying here on an idea from psychology which is that there's this stuff called willpower, and this connects to Fiery's previous remarks, and you just sort of use it up and if you use up your reservoir of willpower you've got less of it to make a decision. This is some work that was developed by Roy Baumeister and colleagues more than a decade ago. The paper on which it's based has been cited over 1,500 times, which, for the uninitiated in psychology, that's a lot. The modal number of citations on papers tends to be zero, so 1,500 is bigger than that.
I remember coming across this and thinking this is very contrary to the kind of model that I would have predicted. It feels like this kind of hydraulic model, which felt to be very 19th Century, as opposed to a computational model which would've had the properties that Fiery was talking about, which is, okay, what's the demon in there that says: "Okay, how do I figure out if I should be doing my paper or eating ice cream?" and then some kind of process where these two different sorts of systems fight against each other and then one wins and then some kind of behavior pops out. I thought: if this is right, then I've got to step back about what I thought I knew about the way the mind works. It was pretty important to me in that respect. I also think it's an important issue: how do we make these decisions about exerting willpower? Lots of decisions that are really bad in the long run can be understood in this context.
I looked at the literature and one of the first things I did (which was what many of us around this table would do) was plan an empirical agenda, the first part of which was to replicate the finding. I had a first-year graduate student who was just starting and so, of course, I did again what all of us would do: I made him do all the work. He put together a replication of one of the big bedrocks of this literature. In this literature there are two different tasks. I'm looking this way and I'm not supposed to look at some words on the screen over here, and this is supposed to drain my willpower because it's hard not to look. Then I do a task that takes willpower—solving some unsolvable anagrams or doing a Stroop Test, which requires saying the color that some word is in as opposed to what the word says, which is just hard.
We did that and we ran about 100 subjects and we got nothing; we couldn't get the effect. We contacted the people who developed these stimuli. We also read the literature more carefully, because, of course we didn't read it that carefully to begin with, and we specifically went through and we tried to find where the big effect sizes were. Maybe we just picked a weak one. We ran that with their guidance, including their original stimuli—couldn't get it. We ran a third round, and by this time it was the end of the student's first year because, of course, these things take time, he had to write a paper up from his first year project to punch his ticket to get to the second year; we couldn't get that one either.
Then, as all of us do, again, I started just talking informally with colleagues about this. I would go to give talks in places and, lo and behold, it turns out there's this kind of background radiation—there's the dark matter of psychology, which is a few people who fail to replicate and don't publish their work and also don't talk about it because the fact that you've failed to replicate has a reputational effect, right? The person who's in charge of this literature says, "Oh, these guys were going after me," and so maybe you don't talk about it in polite company. Right? It's sort of like sex, it's the thing that we're all doing, we're all replicating, we just don't want to talk about it too much, right?
Once I did that I started getting the sense that I was fishing into literature where there's no there there. There are other things that I won't discuss that lead me to that conclusion, too. I will say one piece of it, which is that more and more work is coming out that's very difficult to interpret under the willpower model. Carol Dweck has this stuff that says it only works if you have this belief that willpower is a limited resource. And that seems like it's pretty bad for the model.
Just a couple of days ago, before I came out here, there was a piece that came out in Psych Science, obviously a prominent place where we'd all like to see our stuff. I'm just going to read the title to you: "Heightened Sensitivity to Temperature Cues in Individuals of High Anxious Attachment." That's not important, what's important is what comes after the colon: "Real or Elusive Phenomenon?" It was a failure to replicate, so basically there was this effect, which does not matter for this conversation, they tried to replicate it, they quadrupled the sample size, they couldn't get it, and then they published this. What they published is that it's either real or elusive, but you can't even say out loud "real or it wasn't there to begin with," right? These are not things that we say in polite company.
What I want to talk about today or what I've already talked about and what I want your ideas on is how do we get rid of bad ideas? Because my experience is a little bit like the pessimism that you see in Max Planck, which is that it happens a funeral at a time. I'm confident we can do better than that; and people discuss this. As an addendum on the willpower stuff, which is really in terms of content where my interest is, one of the things that was striking to me is, I went back and what these authors are saying is that the substrate of willpower is brain glucose. If you run out of sugar in your head, then you can't exert your self-control. And I did two things. The first thing I did after I read that is I just did a back of the envelope computation about how much sugar we're talking about. The theory's off by two orders of magnitude. They're giving people 100 calories of lemonade in these manipulations and, in fact, your whole brain is using like a quarter of a calorie per minute so you just use this massive sledgehammer to talk about this little bitty effect.
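The back-of-the-envelope computation can be written out as a minimal sketch. The 100 calories and the quarter of a calorie per minute are the figures quoted in the talk; the five-minute task length is an assumption added here for illustration and is not a number from the talk.

```python
# Back-of-the-envelope check of the glucose account.

LEMONADE_KCAL = 100.0       # from the talk: calories in the lemonade manipulation
BRAIN_KCAL_PER_MIN = 0.25   # from the talk: whole-brain energy use per minute
TASK_MINUTES = 5.0          # ASSUMPTION: length of a self-control task (illustrative)


def whole_brain_cost(minutes: float) -> float:
    """Energy the entire brain uses over `minutes`, in kcal."""
    return BRAIN_KCAL_PER_MIN * minutes


task_cost = whole_brain_cost(TASK_MINUTES)
ratio = LEMONADE_KCAL / task_cost
minutes_covered = LEMONADE_KCAL / BRAIN_KCAL_PER_MIN

print(f"Whole-brain cost of the task: {task_cost:.2f} kcal")
print(f"Lemonade supplies about {ratio:.0f} times that amount")
print(f"Lemonade covers about {minutes_covered:.0f} minutes of whole-brain activity")
```

Even charging the whole brain's consumption to the task, the drink supplies far more energy than the task could plausibly use, which is the sledgehammer point.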
The other thing I did is I reanalyzed some of the data in the paper on which many of these claims are based. First of all, I should also say that when I asked for the key dataset, I was told that the data were corrupted (this is a paper that was two years old). Already there's something funny going on in our discipline when we can't get the raw data from scholars. But then I looked at the data they did give me and rather than supporting their claims, they undermined them. If you just run the math, which wasn't too hard, it was five minus three (that joke is lost without Fiery), it turns out that their own data undermined this idea. I want to emphasize again, this is work that's continuing to be exciting. I just did a Google search on it before I got here. I just searched "organize my time." Within the last month another dozen papers have come out on this. And it's going to turn out that this is not true, it's not true in an important way, in the sense that, okay, Obama's not making policy today, he's not deciding on whether or not he's going to bomb Libya based on how much glucose he's got in his head or something like that. But he's aware of it. This is penetrating into policy sectors.
My point is not specifically narrowed on the willpower stuff, although it's important we clean this up; this also has public health implications. There's this great advertisement we found from 1971; this is work in collaboration with my good friend and colleague, Angela Duckworth, and Joe Kable, a behavioral economist, a neuroeconomist. We found an advertisement which basically says: If you want to lose weight, what you do is eat an ice cream cone because that's got the sugar, it'll give you the willpower not to keep eating. In my casual conversations with people who are consumers of psychology as opposed to producers of psychology, this is the kind of stuff which has penetrated into the popular consciousness. If I just eat a couple of simple carbohydrates, then I'll lose weight. Not going to turn out to be right.
We have this responsibility to be better scholars, particularly as scholars whose work is consumed by the world. Uri Simonsohn, my colleague at the University of Pennsylvania in the Wharton School, has written extremely well on this. He's discussed all the things, again, like sex that we all do but don't talk about. He calls it P-hacking, where we run enough subjects and the P value sneaks below .05, and then we've decided we're done. That's just one of a large number of things that people do, including selective reporting of dependent measures. I'm not saying people have not addressed this. I understand that people have. I'm saying that in many ways the replication crisis in psychology is a little bit like the weather, right? We all talk about it but no one really does anything about it. We do a little about it here and there. I'm going to kind of wrap up. I'm in the speaker slot between the audience and their food, so I'll keep this brief. But I really do feel like this is a good opportunity for us to talk about it. It's really important.
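The stopping rule described here (keep adding subjects until the P value dips below .05, then stop) can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration: the data have no true effect, the test is a simple z-test with known standard deviation, and the batch sizes (start at 20, add 10 at a time, cap at 100) are arbitrary.

```python
import math
import random


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(rng, peek, start_n=20, step=10, max_n=100, alpha=0.05):
    """Simulate one study with NO true effect; return True if it ends 'significant'."""
    data = [rng.gauss(0, 1) for _ in range(start_n)]
    while True:
        if peek and z_test_p(data) < alpha:
            return True                      # stop as soon as p dips below alpha
        if len(data) >= max_n:
            return z_test_p(data) < alpha    # test at the planned final sample size
        data.extend(rng.gauss(0, 1) for _ in range(step))


def false_positive_rate(peek, n_studies=4000, seed=1):
    """Fraction of null studies that end up reporting p < .05."""
    rng = random.Random(seed)
    hits = sum(run_study(rng, peek) for _ in range(n_studies))
    return hits / n_studies


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"False positives, sample size fixed in advance: {fixed_rate:.3f}")
print(f"False positives, stopping when p < .05:        {peeking_rate:.3f}")
```

With the sample size fixed in advance, the false-positive rate should sit near the nominal 5 percent; checking after every batch and stopping on the first significant result pushes it well above that, even though nothing is there.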
I want to talk about, again, getting back to me, a couple of things that I've done. I have a journal, Evolution and Human Behavior. As of January of this year, mostly influenced by things like this and Uri Simonsohn, we implemented a policy that says you have to post your raw data, all your raw data. Now, there are a couple of exceptions. If there's identifying stuff in there that violates some kind of HIPAA requirement, privacy requirements, also if your data are drawn from a publicly available source, it's already available. And we've had good compliance. At my journal I don't want people to run into the problem where they publish this result and then people who are trying to recreate the statistics can't do it. That's the first thing we've done.
The second thing we've done is we've got a professional statistician on board. All of our action editors know that if there's anything about the stats that feels dicey to them, they can ask this person to take a look at just the stats. This is a person who doesn't have to worry about the content, the theory, what have you, and is more or less indifferent; he's not even in my discipline; he doesn't really care. He's actually a kind of critic of the discipline, which is one of the reasons I feel lucky to have him. And so just the stats can be looked at by a professional. This is a nice innovation which I would love to see elsewhere.
Another thing we're talking about is, (this seems like a separate issue but I don't think it is) I bet all of you have looked at people that cite your work and you look at the sentence, like, "I didn't say that. How did the person take me to be saying that?" Even the reverse in some cases. And, having published in law reviews, one of the things I thought was so interesting about their model, and I see why it's difficult to implement but, on the other hand, I don't think it's impossible, they're law students, they go through every single footnote and make sure the source says what you say it says. Now, they err in the other direction. I got this funny comment from a law student who says, "What's your citation for the claim that correlation coefficients go between positive one and negative one?" Fine, they've gone a little bit too far, right? But, nonetheless, if we think that science is accretive to the extent that people are mis-citing these sources and they don't go back and the thing doesn't say what it says, first of all, that means the author was probably confused or lazy and we should clean that up, but the second thing is we're not really building an edifice anymore, we're doing kind of this weird smorgasbord kind of thing.
I should just say, Evolution and Human Behavior is published by Elsevier. The downside of this is they're big and evil, right? Everyone hates Elsevier because they're making a ton of money off of our labor. I take that point. One of the good points is we get the royalties, right? We get a fraction of that. We have the resources to have graduate students go in and check these citations. As far as I know, in psychology and economics (and this is true in philosophy, too), not only doesn't this happen, it wouldn't really occur to people, right? The onus is on the author. I don't see any reason why these institutions need to stay in place.
My point here is not yet another handwringing, how horrible is this? It's that we need concrete proposals to move the field forward in a way that's going to be productive. I'm optimistic about psychology. I think that in ten years it could be that in addition to a council of economic advisors there's a council of psychological advisors. You're already seeing this a little bit. As people who talk about nudges are getting into policy positions. But what that means is our powder better be dry, like, we better be giving ideas to policymakers that are right. And the way to do that is to do better science.
To take it back to where I started, I'll be honest, the idea that there is a reservoir of willpower, that's just obviously wrong. If you look at it in the context of what we already know about the way the mind works, it wouldn't have passed the 10-second test, but it's a very appealing idea; it has a certain resonance. And I'm willing to say that as an empirical matter, when people start going into this to try to replicate it and they bring together all the replications with the existing data, they're going to find there was no there there. Now, maybe I'm wrong. Maybe that turns out not to be true. We can talk about that if you want. But I do think that it's going to turn out that there's lots of ideas that people like us—not us, obviously we're too careful for that—and, in fact, Uri Simonsohn uses Danny Kahneman as an example when he talks about: what should your curve look like in terms of P values? All of them shouldn't be between .045 and .05. It should actually look like a lot of them are really small and then a few of them are a little bit bigger and Danny's curve looks just like that, which means that it's being drawn from the right distribution, that is, he's looking at actual effects and doing appropriate statistics on them. But many of our colleagues are not, and I think that this is the kind of community that has a responsibility to be more assertive about making positive changes so that we do a better job of doing our jobs.
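The claim about what the curve of P values should look like can be illustrated with a minimal simulation sketch. The effect size (0.5 standard deviations), the sample size (30), and the use of a z-test with known variance are assumptions chosen for illustration; this is not Simonsohn's actual procedure, only the intuition behind it.

```python
import math
import random


def p_value(rng, effect, n):
    """Two-sided z-test p-value for a sample of size n drawn from N(effect, 1)."""
    mean = sum(rng.gauss(effect, 1) for _ in range(n)) / n
    return math.erfc(abs(mean) * math.sqrt(n) / math.sqrt(2))


def p_curve(effect, n=30, n_studies=20000, seed=2):
    """Share of significant p-values falling in each 0.01-wide bin below 0.05."""
    rng = random.Random(seed)
    bins = [0] * 5
    for _ in range(n_studies):
        p = p_value(rng, effect, n)
        if p < 0.05:
            bins[min(int(p * 100), 4)] += 1
    total = sum(bins)
    return [count / total for count in bins]


real_effect_curve = p_curve(effect=0.5)   # ASSUMPTION: a true effect of 0.5 SD
no_effect_curve = p_curve(effect=0.0)     # nothing there at all

labels = ["<.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"]
for label, real, null in zip(labels, real_effect_curve, no_effect_curve):
    print(f"{label:>8}  real effect: {real:.2f}   no effect: {null:.2f}")
```

When the effect is real, most significant P values are very small and few sit just under .05. When nothing is there, the significant ones are spread evenly, so a pile-up just below .05 is not what either honest scenario produces.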
GRUBER: This is a wonderful initiative and I'm happy to see someone not, like you're saying, wallowing in the despair that we see ourselves in, but practically thinking of steps we can take. A couple of questions I had related to this: how do you think we can expand this kind of approach towards datasets that are much more complicated to grapple with? For example, there are issues that often arise in behavioral coding where people worry there are biases within the lab to be coding a behavior in a way that sort of supports their hypothesis. Do you have people submit the videotapes? This is just an issue to think about how we expand it. Or datasets that are far more complex and require a lot of preprocessing—I'm thinking neuroimaging, even psychophysiology to this extent. What do we do to really ensure the integrity of the data we get in those raw data files? SPSS, Excel—how can we really trace it back to some of the biases that really could be problematic earlier on upstream?
KURZBAN: Let me say two things about that. The first thing I would say is that the solutions for cleaning things up are going to turn out to be specific to the area. And that's something that's a challenge. I take that point. And so the short answer to your question is, I don't know. But let me say something that is really important, which is journal incentives. If I did the kind of thing for my authors where I said to them, "If you're going to give me a behavioral coding experiment I need you to send the data to another lab and have them verify it before we're going to look at it," I would stop getting submissions, right? I'm not saying this is an easy problem. As an editor, to the extent I make my authors' lives harder, which is going to improve the science, I'm going to look worse because down the road I'm going to get fewer papers and they're not going to be cited as often. There's no doubt that my incentives have something to do with keeping things easy for my authors, which is going to push down the quality of the work. That's part of what we've got to start talking about. I mean, already there are some suggestions along these lines.
BROCKMAN: What about issues of code? You talked about the data you get from your experiments, but a lot of researchers have proprietary code—some of it corporately financed where there are patent issues. Unless you have the code or the software, you don't have anything. And yet often that never comes up. Victoria Stodden has done a lot of research on this. And how do you handle that?
KURZBAN: Again, my answer to that is ... I'm going to go ahead and punt and say that's a great question for the smart people at the table to figure out. I will say this. In my world, code is usually agent-based simulations. And Rob Boyd has introduced what I think ... it's actually crazy that we didn't do this before, which is that he has two different people simultaneously do the agent-based simulation code to make sure that when they look at the output they're not looking at something idiosyncratic because someone made a mistake. They want to make sure they have two replicants of the building of the code. This is a no-brainer. Yes, it adds work and, again, there's an incentive problem here, which is now he's going to get half as much stuff done because half the time he's doing his buddies' coding. That's the kind of innovation that has been very good. And in terms of what you're talking about, I have to throw up my hands. I don't think these things are insoluble. There might be legal mechanisms that allow people to verify these things. People don't talk about this, they talk about how horrible it is that Elsevier takes our labor, but it makes pools of money available. One of the ways that we can deploy resources might be to ... there's various ways you can incentivize people to behave in a better way or at least to check and verify.
KAHNEMAN: A couple of remarks. One, I don't want to defend the reservoir model of willpower but there is a difference between the way that people operate when they're tired and when they are not tired. My understanding of Baumeister's research is that he's studying fatigue and that looks more plausible than the glucose work.
On the other point of practices, I have a suggestion—an observation and a suggestion. There is a line of research which includes Baumeister's and the priming stuff, where it takes an hour to collect one data point.
Those are between-subject experiments and they're expensive. They're very costly, and the samples that people run are too small, they're too small—I think a reasonable idea is by about a factor of four—and they haven't increased in the last 50 years. There were complaints about that: Jacob Cohen in the 1960s, and it's been replicated by Gigerenzer 15 to 20 years ago, and nothing has changed. I think that's a matter for editors.
Bobby Spellman and I have been thinking of writing a piece where editors of psychology journals would treat a problem as car makers do with respect to fleet gas mileage, that is, that within ten years we have the goal now of achieving that level of power. And that is agreed across all journals so that you don't get competition with the wrong incentives. And it's slow. But if you have that as an objective, that you can measure the average power of your studies against small effects in a reliable and consistent way across time and show improvement. That would reduce the problem very significantly.
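To see what a factor of four in sample size buys, here is a minimal sketch of a power calculation for a two-group, between-subject comparison, using the standard normal approximation. The effect size (0.4 standard deviations) and the starting sample (20 per group) are assumptions chosen for illustration; only the factor of four comes from the remarks above.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    noncentrality = d * math.sqrt(n_per_group / 2)
    return _STD.cdf(noncentrality - z_crit) + _STD.cdf(-noncentrality - z_crit)


def n_for_power(d, power=0.80, alpha=0.05):
    """Approximate subjects needed per group to reach the target power."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    z_power = _STD.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / d) ** 2)


EFFECT = 0.4    # ASSUMPTION: standardized effect size, illustrative only
SMALL_N = 20    # ASSUMPTION: subjects per group in a typical small study

small_power = power_two_groups(EFFECT, SMALL_N)
big_power = power_two_groups(EFFECT, 4 * SMALL_N)
needed = n_for_power(EFFECT)

print(f"Power with {SMALL_N} per group:     {small_power:.2f}")
print(f"Power with {4 * SMALL_N} per group:     {big_power:.2f}")
print(f"Per-group n for 80% power: {needed}")
```

Under these assumptions the small study detects a real effect only about a quarter of the time, while quadrupling the sample brings it to roughly seventy percent. A journal-wide power target of the kind proposed would amount to tracking a number like this across published studies.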
KURZBAN: Let me address the first one very quickly. I'm not denying the empirical phenomenon. There's definitely something going on. I'm arguing with the explanation.
Bobby's done great things with perspective and exactly what you're talking about is the kind of thing that I want to point to, which is we've got to get groups of editors in fields together either physically or in terms of correspondence, or in terms of policy. That kind of initiative is going to wind up being successful. And this speaks to the question of incentives. Because if I'm no longer competing with editors who are willing to go low in terms of power, now I'm able to insist on well-powered studies without putting myself at a disadvantage. This is like hockey helmets. I don't mind if there's a rule that says everyone's got to wear a hockey helmet, but if there's no rule, I don't want to wear one because I'm going to lose that advantage, right?
That is where we've got to go and in order to do that you've got to have leaders who step forward and say, I'm going to gather these editors and I'm going to say, "You guys, you can do better as long as you're not mutually affecting others in a negative way." That's exactly the kind of thing I'm pointing to.
MULLAINATHAN: It's not helpful in the sense that we're not proposing any magical solution or even a non-magical solution. I'm talking a little bit about what the problem is—that is, it's easy to sit and say the problem is replication. That's part of it, but I actually think there are two problems that are actually unrelated to replication or at least weakly related to replication that contribute to the problem, and this is coming from looking—economics has this issue as well—so let me start with the first one. You mentioned that the subtitle of it was "Real or Elusive." There's something to that and here's what I mean. It seems to me a very crude way of describing psychology is that you have quite a bit of control in the lab over the nature of the context you use to show the effect.
What I mean by that is even taking the original willpower studies. You had said you tried to replicate, tried to follow their exact methods, and supposedly it turned out that it wasn't the hand press, that you had just used some other thing and you didn't find it. You went and did hand press and you found it. Oh, good, they were right. But were they right? And, after all, if I replace hand press with this other thing, but in some narrow sense, and that's what I mean by replication. There's a question of if you literally replicated the original studies and found they worked perfectly—you replaced hand press with some other proxy for self-control and found it no longer worked—that's a different kind of failure to replicate. It's almost like a context sensitivity. And the reason I'm emphasizing that is, I feel like there's quite a bit of that that goes on, which has nothing to do with p-hacking generally, it has to do with "Oh, I would like to show this effect. Let me try it with these types of things. Let me try it with this." And to the extent that that's in the DNA of the field to say, "I'm looking for context when certain things happen," then I feel like this type of problem is going to be immense.
Let's contrast it with areas where we're trying to find psychological effects but where everything is already pre-specified—lineup versus showup, or how do we show data—so then it's really clear. Are you picking the right guy? Are you doing this? We're looking for effects. There's still a lot less freedom and replication is much more robust and sensible. Does that make sense? And it seems to me this is as much a problem of what the goal is as it is about the method. I don't know if I'm being clear. But it's that you can really see that.
Take biology, where we're all trying to study this cell and this mechanism, then it's very clear when there's a failure. The fields where I've seen this happen, the biggest is the subfields within psychology is a terrific example. It's a very abstract big concept, and if I happen to show the effect using hand press, I've shown it. I'm only pointing this out because you talked about the council of psychological advisors. I would be exactly as concerned if you told me all of these willpower studies perfectly replicated but they only hold for hand press. Well, what on earth does that have to say about policy? There's a different kind of replication that fields like psychology need to take far more seriously, which has to do with robustness across settings. And there are areas where it's taken very seriously and there are areas where it feels like it's just left to something else.
KAHNEMAN: Sendhil, that's one of the major problems. I feel conceptual replication is what social psychologists do because they go across contexts. And that is one of the major forms of p-hacking. That is, that there are no constraints on how many failures can replicate. Hal Pashler asks, "when has anybody reported a failure of a conceptual replication?" So, p-hacking is mediated in large part by conceptual replication.
KURZBAN: Yes, I completely agree with that. It's a different form of p-hacking. But another way to think about this, again, having one foot in economics and one foot in psychology, it's just a different. The entities in our explanations in psychology tend to be way wigglier than one finds in at least some areas of economics, and it leads to these kinds of problems. It's the nature of the explanations and then some of these methodological practices, which are permanent. So I don't think I disagree.
MULLAINATHAN: The reason I'm pushing is, I understand you use this p-hacking, but it's actually statistically pretty significant.
KAHNEMAN: You're right. It's being misused in a big way. This is right at the core of the debate between social psychology and its critics. Social psychologists say, "We replicate conceptually all the time and, therefore, a failure of literal replication doesn't concern us." That's a big part of the debate.
KURZBAN: You said you had two points.
MULLAINATHAN: The other point is this ... look, to be honest, economics is as prone to this as psychology, and it's terrible. I looked at a lot of medicine and how the ideas in medicine have evolved and here's a striking fact. There are a lot of weird theories that just look crazy. The germ theory is crazy. You're telling me that there are these invisible things called germs that are deadly to us ... oooh. And yet at the same time, here is the anomaly: there are 100 crazy sounding theories and one of them turns out to be right.
The reason I'm emphasizing that is that in contrast, when you work in these fields that have these problems, we're so driven by our intuition about what's right. Like, "That sounds wrong. That just feels like it can't be true." I don't know if you've noticed this, but there's a fundamental dominance given to the theory and the intuition as opposed to: it's a mess, it's too bad that the data suggests this, and who knows what it's going to be saying? Even the structure of a psych paper or an economics paper is of the variety, it's like you're forcing your hand into saying, "Write down the theory. Here's a test." Yet, you look at biology—a vibrant field, it's a mess. And it's that mess from which stuff arises. And I feel like there's something else, forgetting replication, forgetting everything, why is there no space for empirical exercises where you say, "I don't know what this is. I hope in 50 years we figure it out." I mean, that would be the worst. You would say, "What an unintellectual researcher. They just aren't thinking." That's the other fundamental problem, is that we're not allowing empirics to have its own vibrancy and strength and structure.
PIZARRO: We were just talking about this, and that you're confusing the way that we write up our ... psychology is a mess and it is driven a lot by this sort of empiricism that you're describing. We do lots of studies, and we discover an effect that we never would've predicted, and then we write up the paper by saying, "We sought three tests of the hypothesis..." Maybe the error here is in the way that we communicate what we do, but that's never how we do it.
MULLAINATHAN: That is what I mean. Because that is not just narrow communication. All that stuff that you did to arrive at the thing, that's what should be on the other side of the thing.
PIZARRO: Absolutely. I just wanted to point out that it is a mess.
KNOBE: A lot of our discussion has been about ways that we can avoid making errors in the first place. But it seems like a really central question is if we have made an error, so we said something and it's wrong, how can we make it the case that we will recover from that error? And it seems like if we work backwards in the way that Sendhil was suggesting earlier, what we would want to have is the person who initially made the error would change his or her mind—would say, "Oh, now that there's new data on the table, I guess I was wrong." But then if you work backwards from that and we think: what can we do to make people actually change their minds? It seems like a really central thing is something cultural. I feel like right now there's a kind of spirit in our discipline where if someone says something and then other researchers investigate it further and it turns out that they're wrong, that they've been somehow de-bested, they've failed, they're being crushed by this other person who was smarter.
If we as a culture could eliminate that kind of feeling about what happens when you turn out to be wrong, then people would be much more willing to say that they're wrong. Maybe if we all had a different feeling about what happens when people just say, "Oh, I guess it wasn't true," people would be much more likely to do that because we recover from errors. I was thinking maybe—like we have the Nobel Prize—we could also have the "Noble Prize." It could go to that person each year who most publicly admitted that someone else had done an experiment which just … …
KURZBAN: That's a great idea.
KNOBE: Then the field can just move on. That idea—it seemed like a good idea at the time—it's over.
BROCKMAN: What's the name of that prize?
KNOBE: The "Noble Prize."
KURZBAN: Yes, that's right. Scientists are supposed to be the sorts of people who just admit they're wrong. I always remember that nice Arrow quote where someone asked him why he seems to change his mind in print so often. I don't know if this is true or not but he says, "Yes, that's what I do when I realize I'm wrong. What do you do?" And that's good. The rhetorical question is there. I don't want to name names, but in this literature I actually asked someone in this literature, I said, "Well, there's this stuff out there. What pattern of data would cause you to change your mind?" And the reply I got was, "You know what? I should really think about that sometime."