Randomized evaluation is not without its critics, who say that there is little benefit in learning rigorously about one context because the lessons will not translate into other contexts. In effect, the critics ask, can we generalize about human behavior across countries?
That is an empirical question, and the growing body of evidence is helping us answer it scientifically. Hundreds of randomized evaluations of anti-poverty programs are now being conducted all over the world. While each evaluation is carefully crafted to describe one part of the development puzzle, many pieces are starting to come together.
That is Rachel Glennerster and Michael Kremer in the Boston Review. As usual, BR features big thinkers pushing one point of view, and a bunch of other thinkers and practitioners responding. The forum is here. I thought the G&K piece and responses were all excellent.
Pranab Bardhan reminds us that a lot of the questions we want to answer in development won’t be answered by the randomized control trial. Glennerster and Kremer don’t need to be told this (most of their best work is neither randomized nor even micro-empirical), but most of the development world needs to hear this message.
Dan Posner is worried that the current rash of experiments is not designed to see how context matters. I couldn’t agree more, and I would like to see the leaders of the RCT movement, especially 3ie and the World Bank, push this agenda more aggressively. It’s discussed in salons but too seldom implemented in the field.
I think Dan could have gone further: context and complexity may be everything in the realm of politics. I suspect a half dozen deworming experiments or a half dozen vocational training program evaluations will yield somewhat consistent and generalizable results, at least on the same continent (assuming anyone ever gets around to serious and consistent replication). But experiments with community-driven development programs? Corruption control? Electoral reform? Even if done well (most experiments are not), I expect inconsistent and erratic results.
The most serious criticism, in my mind, comes from Eran Bendavid, who warns social science that it’s making the same mistakes as medicine (and then some). We don’t register our experiments, leading to all sorts of selective reporting and post-hoc analysis. Our interventions and populations vary with every trial, often in obscure and undocumented ways. Not to mention rampant publication bias (not to mention citation bias).
Here I was a bit disappointed with G&K’s response. They are right: field experimenters need a little flexibility to examine unregistered relationships and findings, since so much learning happens along the way; also, field experimenters are getting better on all these fronts; and finally, most observational research is worse.
Fair enough. But I still think we are headed down the same dangerous path as medicine, where most published research findings are probably false. I think my response would have differed on a few points:
- The profession’s efforts to register trials, publish null results, and replicate trials are pretty weak so far, with all incentives stacked against them, and this will need to change to make serious progress
- Every trial ought to be registered, and economics journals ought to enforce the practice
- We should welcome unregistered sub-group or post-hoc analysis, so long as it’s clearly labelled as such, and all the sub-group and post-hoc hypotheses tested are disclosed
- We need to stop finding a robust empirical result, writing a model that is consistent, then putting the model at the beginning of the paper and calling the empirical work a test (I throw up in my mouth a little every time I see this)
- Yes, observational research is often much worse, but experimenters should take the high road against the worst practices, rather than simply pointing to a road lower than their own
- The greatest advantage economics holds is theory, and it ought to be wielded more productively in experimental design (especially in the hundreds of atheoretical searches for significance that characterize most program evaluations in the big development agencies)
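
To put rough numbers on why every sub-group and post-hoc hypothesis tested should be disclosed, here is a minimal back-of-the-envelope sketch in Python. It assumes the tests are independent and that no true effect exists anywhere, neither of which holds exactly in a real evaluation. The function name and the 5 percent significance threshold are illustrative choices, not anything taken from the forum.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests
    comes out 'significant' at level `alpha` when every null is true.

    Illustrative assumption: the tests are independent. Real sub-group
    analyses are usually correlated, so treat this as a rough guide.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# How quickly a spurious "finding" becomes likely as more
# sub-groups or outcomes are tested.
for k in (1, 5, 10, 20):
    share = chance_of_false_positive(k)
    print(f"{k:>2} hypotheses tested: {share:.0%} chance of at least one false positive")
```

Under these assumptions, testing twenty hypotheses gives roughly a 64 percent chance of at least one spurious result, which is why a reader needs to know how many were tried, not just which one made it into the paper.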
I expand on these points in a now somewhat outdated (but hopefully still useful) talk transcript. One day I would like to write a book, but perhaps I ought to actually publish one of my experiments first…
P.S. I should mention that most of my own experiments suffer from the worst of these problems, mainly because I didn’t know any better when I started, and the damn things take so long. But it’s getting better each time (I hope).