Are social science RCTs headed in the wrong direction? A roundup of the discussion

Last week I posted my worries about the direction of social science experiments: how raising the quality bar is going to shrink the quantity of studies, and what that will sacrifice. I did not expect tens of thousands of visits, tweets, replies, and suggested readings. Even Vox picked it up, causing me to immediately think: “Oh man, why do I say these things out loud?”

Clearly there is an untapped demand for mournful posts about feeling overworked, masquerading as critiques of science. And here all I thought you cared about was Star Wars and Trump.

My basic point was this: each study is like a lamp post. We might want to have a few smaller lamp posts illuminating our path, rather than the world’s largest and most awesome lamp post illuminating just one spot. I worried that our striving for perfect, overachieving studies could make our world darker on average.

There were lots of good comments, and I thought I’d summarize some of them here.

David McKenzie blogged that he thinks my concerns apply more to the big government evaluations than to the newer, smaller, more flexible, lab-like experiments that more and more social scientists are running in the field.

I suppose that’s possible, but I’ve found that the more control I have over an experiment, the easier it is to add bells and whistles. The expectations of what is possible are higher, and I push myself harder, since I’m my own worst enemy. So, in my view, the incentives to over-invest are greatest in the experiments we control the most.

Meanwhile, Rachel Glennerster countered David with tips on how to work with government partners on randomized evaluations.

But most surprising to me, a huge number of commenters said something along the lines of: “I’d much rather have a couple of really good studies than a whole bunch of small and underpowered ones.” On some level: of course. But on another level, this kind of statement is exactly the problem I’m describing.

Think in extremes for a moment. On one extreme, we could have just one experiment in one place to answer an important question, and it would be big and amazing. On the other extreme, we could have thousands of tiny trials in as many different places, each with only a trivial sample.

Obviously neither of these extremes makes any sense. There’s a sweet spot in the middle. I read those commenters as saying “we know where we want to be, and we’re already there.” I’m not so confident in the status quo, and we should always be suspicious when “right here” feels like the best place in the world.
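
To make that concrete, here is a toy simulation (my own sketch with made-up numbers, not anyone’s published model). Suppose the true treatment effect varies across sites, we have a fixed total sample, and every extra site eats some of the budget in fixed overhead. Everything here, from the parameter values to names like N_TOTAL and OVERHEAD, is invented for illustration:

```python
# Toy simulation of the "sweet spot" argument: split a fixed sample
# budget across k sites when (a) the true effect varies by site and
# (b) each extra site eats some of the budget in fixed overhead.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
N_TOTAL = 10_000     # total sample budget across all sites
OVERHEAD = 400       # per-site fixed cost, in "observations"
EFFECT_MEAN = 0.20   # average treatment effect across sites
EFFECT_SD = 0.15     # cross-site heterogeneity in the true effect
OUTCOME_SD = 1.0     # individual-level noise
N_SIMS = 20_000      # Monte Carlo replications

def rmse_for(k: int) -> float:
    """RMSE of the pooled estimate of the average effect using k sites."""
    n = (N_TOTAL - OVERHEAD * k) // k   # usable sample per site
    # Each site has its own true effect...
    effects = rng.normal(EFFECT_MEAN, EFFECT_SD, (N_SIMS, k))
    # ...estimated by a difference in means with n/2 per arm,
    # whose standard error is sigma * sqrt(4 / n).
    noise = rng.normal(0.0, OUTCOME_SD * np.sqrt(4 / n), (N_SIMS, k))
    pooled = (effects + noise).mean(axis=1)
    return float(np.sqrt(np.mean((pooled - EFFECT_MEAN) ** 2)))

for k in (1, 2, 5, 10, 15, 20, 24):
    print(f"{k:>2} site(s): RMSE of the pooled estimate = {rmse_for(k):.3f}")
```

One site estimates the wrong target precisely (that site’s idiosyncratic effect); too many sites leave each trial hopelessly noisy; the error in our estimate of the average effect bottoms out somewhere in between. And where it bottoms out depends entirely on the made-up parameters, which is really my point: the sweet spot is an empirical question, not something to assume we have already found.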

I think economics might be a little too close to the first extreme, the one big awesome study. Why do I say this? Because we obsess over internal validity all the time and almost ignore external validity. The profession is not optimizing. Meanwhile, my sense is that political science paper-writers sit a little too far toward the other extreme, with more emphasis on quantity than quality (this does not apply so much to the books).

Scientific and statistical analysis should be able to guide us to the optimal point. For that we probably need a better science of extrapolation. The fact that I am not aware of any such science doesn’t mean it doesn’t exist. But it’s not something we discuss in the mainstream. This should worry you.

Actually there are a few things out there.

  • Jörg Peters sent me this systematic review of all experimental papers published in top economics journals, with a good discussion of the many external validity problems.
  • My colleague Kiki Pop-Eleches and coauthors have a draft paper that looks at a natural experiment that has happened in every country (the gender combination of your children) and uses it to build models for understanding when we can and cannot extrapolate well, and how to design experiments to maximize generalizability.
  • Lant Pritchett sent me a new paper arguing that, when social programs are complex, with lots of design dimensions, we benefit from testing more elements of their design, even if that means small sample sizes and low statistical power (a toy version of this logic is sketched just after this list).
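
Since Pritchett’s argument is easy to caricature, here is a toy version of the logic as I read it (again my own sketch with invented numbers, not his actual model). A program has many design variants of varying true quality; with a fixed budget we can test a few variants precisely or many variants noisily, then adopt whichever looked best:

```python
# Toy version of the "test more of the design space" logic:
# with a fixed budget, compare testing few variants precisely
# vs. many variants noisily, then adopting the apparent winner.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
N_TOTAL = 2_000      # total sample budget across all variants
QUALITY_SD = 0.30    # how much true quality varies across design variants
OUTCOME_SD = 1.0     # individual-level noise
N_SIMS = 20_000      # Monte Carlo replications

def quality_of_pick(k: int) -> float:
    """Average true quality of the variant we adopt after splitting
    the budget across k variants and picking the apparent winner."""
    n = N_TOTAL // k
    true_q = rng.normal(0.0, QUALITY_SD, (N_SIMS, k))   # true qualities
    # Each variant gets a small trial; its estimated quality is the
    # true quality plus noise with SE = sigma * sqrt(4 / n).
    est_q = true_q + rng.normal(0.0, OUTCOME_SD * np.sqrt(4 / n),
                                (N_SIMS, k))
    picked = np.argmax(est_q, axis=1)                   # adopt the winner
    return float(true_q[np.arange(N_SIMS), picked].mean())

for k in (2, 4, 8, 16, 32):
    print(f"{k:>2} variants tested: mean true quality of the pick = "
          f"{quality_of_pick(k):.3f}")
```

In this toy world, covering more of the design space beats measuring any one variant precisely, because even noisy estimates carry enough signal to steer us away from the bad variants. Whether and when that holds for real programs is exactly the kind of question a science of extrapolation should answer.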

I can’t say whether these papers are right or whether the list is complete, but I enjoyed skimming them and plan to read them carefully soon. All this strikes me as a pretty important area for more research and reflection.

I would love to get pointers to other work in the comments.