Are social science RCTs headed in the wrong direction? A roundup of the discussion

Last week I posted my worries about the direction of social science experiments: how raising the quality bar will reduce the number of studies, and what we sacrifice as a result. I did not expect tens of thousands of visits, tweets, replies, and suggested readings. Even Vox picked it up, causing me to immediately think: “Oh man, why do I say these things out loud?”

Clearly there is an untapped demand for mournful posts about feeling overworked, masquerading as critiques of science. And here all I thought you cared about was Star Wars and Trump.

My basic point was this: each study is like a lamp post. We might want to have a few smaller lamp posts illuminating our path, rather than the world’s largest and most awesome lamp post illuminating just one spot. I worried that our striving for perfect, overachieving studies could make our world darker on average.

There were lots of good comments, and I thought I’d summarize some of them here.

David McKenzie blogged that he thinks my concerns apply more to the big government evaluations than the new, smaller, flexible, lab-like experiments that more and more social scientists are running in the field.

I suppose that’s possible, but I’ve found that the more control I have over an experiment, the easier it is to add bells and whistles. The expectations of what is possible are higher. And I push myself harder, since I’m my own worst enemy. So, in my view, the incentives to over-invest are greatest with the experiments we control the most.

Meanwhile, Rachel Glennerster countered David with tips on how to work with government partners on randomized evaluations.

But most surprising to me, a huge number of commenters said something along the lines of: “I’d much rather have a couple of really good studies than a whole bunch of small and underpowered ones.” On some level: of course. But on another level, this kind of statement is exactly the problem I’m describing.

Think in extremes for a moment. On one extreme, we could have just one experiment in one place to answer an important question, and it would be big and amazing. On the other extreme, we could have thousands of tiny trials in as many different places, each with only a trivial sample.

Obviously neither of these extremes makes any sense. There’s a sweet spot in the middle. I read those commenters as saying “we know where we want to be, and we’re already there.” I’m not so confident in the status quo, and we should always be suspicious when “right here” feels like the best place in the world.
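To make that trade-off concrete, here is a toy simulation. Everything in it is a made-up assumption of mine, not something from any of the papers below: treatment effects vary across sites, outcomes are noisy within sites, each extra site eats a fixed chunk of the budget in setup costs, and the thing we care about is the average effect across many contexts rather than the effect in one spot.

```python
# Toy Monte Carlo sketch of the lamp-post trade-off. Everything here is an
# assumption for illustration: effects vary across sites with sd TAU, outcomes
# are noisy with sd SIGMA, and each extra site costs OVERHEAD subjects' worth
# of budget in fixed setup costs. The target is the average effect across sites.
import numpy as np

rng = np.random.default_rng(0)

MU, TAU, SIGMA = 0.20, 0.15, 1.0   # true average effect, cross-site sd, outcome noise
TOTAL_N = 4000                     # total subject budget
OVERHEAD = 150                     # per-site fixed cost, in "subjects" of budget
REPS = 5000                        # Monte Carlo replications

def rmse_for(n_sites):
    """RMSE of the cross-site average effect when the budget is split n_sites ways."""
    n_per_arm = (TOTAL_N - OVERHEAD * n_sites) // (2 * n_sites)
    assert n_per_arm > 1, "budget exhausted by per-site overhead"
    errs = []
    for _ in range(REPS):
        # each site draws its own true effect, then runs a two-arm trial
        site_effects = MU + TAU * rng.standard_normal(n_sites)
        treat = site_effects[:, None] + SIGMA * rng.standard_normal((n_sites, n_per_arm))
        control = SIGMA * rng.standard_normal((n_sites, n_per_arm))
        errs.append((treat.mean(axis=1) - control.mean(axis=1)).mean() - MU)
    return np.sqrt(np.mean(np.square(errs)))

for k in (1, 2, 5, 10, 20, 25):
    print(f"{k:>2} site(s): RMSE for the average effect across contexts = {rmse_for(k):.3f}")
```

In this made-up world the error for the cross-context average is U-shaped: one giant lamp post does badly because it only tells you about its own street corner, a swarm of tiny ones does badly because each is pure noise, and the sweet spot sits somewhere in between. Where exactly it sits depends on how much effects really vary across sites and how costly each extra site is, which is precisely the empirical question we rarely ask.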

I think economics might be a little too close to the first extreme of the one big, awesome study. Why do I say this? Because we obsess over internal validity all the time and almost ignore external validity. The profession is not optimizing. Meanwhile, my sense is that political science paper-writers sit a little too far toward the other extreme, with more emphasis on quantity than quality (this does not apply to the books so much).

Scientific and statistical analysis should be able to guide us to the optimal point. For that we probably need a better science of extrapolation. The fact that I am not aware of any such science doesn’t mean it doesn’t exist. But it’s not something we discuss in the mainstream. That should worry you.

Actually there are a few things out there.

  • Jorg Peters sent me this systematic review of all experimental papers published in top economics journals, with a good discussion of the many external validity problems.
  • My colleague Kiki Pop-Eleches and coauthors have a draft paper that looks at a natural experiment that has happened in every country (the gender combination of your children) and uses it to build models for understanding when we can and cannot extrapolate well, and how to build experiments to maximize generalizability.
  • Lant Pritchett sent me a new paper arguing that, when social programs are complex, with lots of dimensions, we benefit from testing more elements of their design, even if that leads to small sample sizes and low statistical power.

I can’t say whether these papers are correct, or if the list is complete, but I enjoyed skimming through them, and plan to read them carefully sometime soon. All this strikes me as a pretty important area for more research and reflection.

I would love to get pointers to other work in the comments.

81 Responses

  1. Thanks! This is an interesting perspective. Personally, I have not found RCTs to be useful in the work I do and support, primarily because I find them too rigid and overly simplified to capture anything close to the reality of complex behaviour in systems. Lately, I have come across this tool: http://cognitive-edge.com/sensemaker/. I know some practitioners are using it and are finding it enlightening. One interesting piece is that it removes the bias of the researcher (or the statistical framework and computer algorithm) from the interpretation of the results. What do you think?

  2. I’m skeptical that either proliferation of small-scale experiments or meta-analysis provides a solution to the problem of external validity. In both cases, we only learn about the distribution of treatment effects in places where experiments are being done. And we increasingly have evidence that places with experiments differ from other contexts we may be interested in, and differ in ways that are correlated with the effect of the treatment being evaluated. Hunt Allcott calls this “site selection bias” and provides some of the empirical evidence for it here: https://www.dropbox.com/s/g9bxwzyhsxz7uri/Allcott%202015%20QJE%20-%20Site%20Selection%20Bias%20in%20Program%20Evaluation.pdf?dl=0 .

    So just as we developed tools to account for selection bias in program evaluation (including randomization), we need to develop new tools to account for site selection bias. These tools could be developed either in a reduced-form framework or using a structural model of behavior. My own applied econometrics work, on bounding the average effect of an experimentally-evaluated treatment in a new context (http://www.personal.psu.edu/mdg5396/Gechter_Generalizing_Social_Experiments.pdf), has focused on the former. I believe the latter is a particularly exciting avenue for future research.

  3. There is a science of extrapolation and we do talk about it: it’s meta-analysis. Good meta-analysis should always assess the validity of extrapolating from one site/study to another – that is, it should assess the genuine heterogeneity of the treatment effects (netting out sampling variation). For more detail, I discuss this, with links to papers that are trying to do this in economics, in my current working paper: http://economics.mit.edu/files/10595
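To make that last point concrete, here is a minimal sketch of the heterogeneity check the commenter describes, using the standard DerSimonian-Laird random-effects estimator. The study estimates and standard errors are invented for illustration; nothing here comes from the working paper linked above.

```python
# Minimal random-effects meta-analysis sketch (DerSimonian-Laird).
# The effect estimates and standard errors below are hypothetical.
import numpy as np

effects = np.array([0.30, 0.05, 0.22, -0.10, 0.15])  # made-up site-level estimates
ses = np.array([0.08, 0.06, 0.10, 0.07, 0.09])        # made-up standard errors

w = 1.0 / ses**2                          # fixed-effect (inverse-variance) weights
fixed = np.sum(w * effects) / np.sum(w)   # fixed-effect pooled estimate
Q = np.sum(w * (effects - fixed)**2)      # Cochran's Q: total observed dispersion
df = len(effects) - 1
C = np.sum(w) - np.sum(w**2) / np.sum(w)

# Genuine cross-site heterogeneity, net of sampling variation
tau2 = max(0.0, (Q - df) / C)             # DerSimonian-Laird estimate of tau^2
I2 = max(0.0, (Q - df) / Q)               # share of dispersion that is not sampling noise

# Random-effects pooled estimate: weights flatten toward equality as tau^2 grows
w_re = 1.0 / (ses**2 + tau2)
pooled = np.sum(w_re * effects) / np.sum(w_re)
se_pooled = np.sqrt(1.0 / np.sum(w_re))

print(f"tau^2 = {tau2:.4f}, I^2 = {I2:.0%}")
print(f"random-effects pooled effect = {pooled:.3f} (SE {se_pooled:.3f})")
```

A large tau² or I² is the formal version of the warning above: the pooled number papers over genuinely different effects across sites, which is exactly when extrapolating to a new context gets risky.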