Randomized evaluation is not without its critics, who say that there is little benefit in learning rigorously about one context because the lessons will not translate into other contexts. In effect, the critics ask, can we generalize about human behavior across countries?
That is an empirical question, and the growing body of evidence is helping us answer it scientifically. Hundreds of randomized evaluations of anti-poverty programs are now being conducted all over the world. While each evaluation is carefully crafted to describe one part of the development puzzle, many pieces are starting to come together.
That is Rachel Glennerster and Michael Kremer in the Boston Review. As usual, BR features big thinkers pushing one point of view, and a bunch of other thinkers and practitioners responding. The forum is here. I thought the G&K piece and responses were all excellent.
Pranab Bardhan reminds us that a lot of the questions we want to answer in development won’t be answered by the randomized control trial. Glennerster and Kremer don’t need to be told this (most of their best work is neither randomized nor even micro-empirical), but most of the development world needs to hear this message.
Dan Posner is worried that the current rash of experiments is not designed to see how context matters. I couldn’t agree more, and I would like to see the leaders of the RCT movement, especially 3ie and the World Bank, push this agenda more aggressively. It’s discussed in salons but too seldom implemented in the field.
I think Dan could have gone farther: context and complexity may be everything in the realm of politics. I suspect a half dozen deworming experiments or a half dozen vocational training program evaluations will yield somewhat consistent and generalizable results, at least on the same continent (assuming anyone ever gets around to serious and consistent replication). But experiments with community-driven development programs? Corruption control? Electoral reform? Even if done well (most experiments are not), I expect inconsistent and erratic results.
The most serious criticism, in my mind, comes from Eran Bendavid, who warns social science that it’s making the same mistakes as medicine (and then some). We don’t register our experiments, leading to all sorts of selective reporting and post-hoc analysis. Our interventions and populations vary with every trial, often in obscure and undocumented ways. Not to mention rampant publication bias (not to mention citation bias).
Here I was a bit disappointed with G&K’s response. They are right: field experimenters need a little flexibility to examine unregistered relationships and findings, since so much learning happens along the way; also, field experimenters are getting better on all these fronts; and finally, most observational research is worse.
Fair enough. But I still think we are headed down the same dangerous path as medicine, where most published research findings are probably false (a back-of-the-envelope version of that arithmetic is sketched after the list below). I think my response would have differed on a few points:
- The profession’s efforts to register trials, publish null results, and replicate trials are pretty weak so far, with all the incentives stacked against them, and this will need to change to make serious progress
- Every trial ought to be registered, and economics journals ought to enforce the practice
- We should welcome unregistered sub-group or post-hoc analysis, so long as it’s clearly labelled as such, and all the sub-group and post-hoc hypotheses tested are disclosed
- We need to stop finding a robust empirical result, writing a model that is consistent, then putting the model at the beginning of the paper and calling the empirical work a test (I throw up in my mouth a little every time I see this)
- Yes, observational research is often much worse, but experiments should take the high road against the worst practices, rather than simply pointing to a road lower than their own
- The greatest advantage economics holds is theory, and it ought to be wielded more productively in experimental design (especially compared to the hundreds of atheoretical searches for significance that characterize most program evaluations in the big development agencies)
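For concreteness, here is the back-of-the-envelope arithmetic behind the “most findings are false” worry, in the spirit of Ioannidis. This is only an illustrative sketch: the prior, power, and significance numbers below are assumptions made up for the example, not estimates of any actual literature.

```python
# Back-of-the-envelope: what share of "significant" findings are true?
# All numbers below are illustrative assumptions, not estimates.

def ppv(prior, power, alpha):
    """Share of statistically significant results that reflect a real effect."""
    true_positives = power * prior          # real effects we detect
    false_positives = alpha * (1 - prior)   # nulls we wrongly reject
    return true_positives / (true_positives + false_positives)

print(ppv(prior=0.5, power=0.8, alpha=0.05))  # well-powered test of a plausible hypothesis: ~0.94
print(ppv(prior=0.1, power=0.2, alpha=0.05))  # underpowered test of a long shot: ~0.31
```

Selective reporting and unregistered post-hoc tests effectively inflate alpha, which pushes that second number down further; that, in a nutshell, is the worry Bendavid raises.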
I expand on these points in a now somewhat outdated (but hopefully still useful) talk transcript. One day I would like to write a book, but perhaps I ought to actually publish one of my experiments first…
P.S. I should mention that most of my own experiments suffer from the worst of these problems, mainly because I didn’t know any better when I started, and the damn things take so long. But it’s getting better each time (I hope).
9 Responses
The claim that “most published research findings are false” is somewhat misleading. Thinking evolves; not all research is treated equally.
By far the majority of genetic mutations are harmful; and yet look how far we have come.
Improving the rate of accurate findings is desirable, but being able to evaluate findings ex post is probably even more important. And my sense is we are doing a decent job of the latter.
There’s a very interesting discussion on precisely the whole RCT/structural models issue from the recent CSAE conference in Oxford with Glenn Harrison, David McKenzie, and Leonard Wantchekon. The bottom line is that we need both, but there is a fair amount of the kind of scepticism you cite above (particularly from Harrison!).
Yes, the discussion at CSAE was provocative, and the video is on the web (just Google “CSAE Oxford” and you’ll find it.) Leonard gave a very interesting presentation on experimentation in poli sci, David talked about the challenge of doing experiments with firms when the universe of firms is small, and Glenn gave a charged critique of RCTs. One of Glenn’s main lines of criticism was the same as Pranab Bardhan’s and one that I’ve heard many times before: experiments can’t answer all the interesting questions. I always feel like this is a straw man argument. Is there anyone out there who has ever made that claim?
great post!
I think you hit the nail on the head with your final comment about theory. It doesn’t help to know that a program has a causal effect in one place at one time if we don’t understand what is in the black box that makes the program work. The best experiments are those that test implications of a preexisting theory, or test competing theories against each other. A theory about human behavior is falsifiable in any location at any time.
In reference to another point you made, we certainly want our theories to be informed by evidence. As long as it is clear that the model isn’t being tested, I don’t see a problem with including an idea about a mechanism in a paper describing an empirical regularity. Of course, this shouldn’t be construed as a test of the theory, just an idea about what might be going on. Is that enough to help you keep your stomach down?
I sent my paper to a journal in exactly the same way. I presented some empirical facts. Then I presented a model that may explain the mechanism behind those facts. One of the referees told me to write the model first and present the empirical facts as a “test” of my model.
Great post. I agree that registration is very important, which leads me to wonder whether you’ve registered your current trials. If so, do you find that most of your colleagues do so as well? If not, what are the disincentives on a personal level?
Also, there are huge differences between clinical trials in (relatively) easy-to-work-in hospital environments, which make up much of the medical literature, and some of the community trials done in global health, such as large-scale randomized community trials in developing settings (see NNIPS, at http://www.jhsph.edu/dept/ih/news/summer2010/nepal.html, for some good examples). In those sorts of trials, there’s more danger of the intervention varying across trials, either by design or by fault, and also more danger of populations varying, which is harder to avoid but important to measure. For example, there’s currently a raging debate over possible gender differentials in the effects of vitamin A supplementation: some trials have shown them, others haven’t, and this may be driven in part by unmeasured differences in the vitamin A deficiency of the populations in which the studies were run. If the original studies had been better designed (or coordinated) to collect that baseline data, there might not be a need for the current crop of expensive trials, but even those may yield inconclusive results because sometimes populations are just different.
All that to say that there seem to be a lot more similarities between the new randomized studies in economics and those sorts of randomized trials in developing communities than with clinical trials in the US and Europe, which seems to be the comparison I see made most often.
I (shamefully) have not registered, but that’s partly because I think the registries ask the wrong questions. Also, they’ve only been around for about a year.
What I do at a minimum is draft a document where I pre-specify the hypothesized impacts, prioritize them into primary, secondary, and tertiary impacts (in terms of expected importance), and also specify key areas of heterogeneity. I try (and mostly succeed) to write my cleaning and analysis code before the results come in, a practice aided by electronic data collection. Where relevant, I also try to guide the intended analysis with theory, which tends to prespecify the analysis to perform.
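A minimal sketch of what that pre-written analysis code might look like (the treatment indicator, outcomes, and subgroup names here are hypothetical placeholders, not from any actual trial):

```python
# Hypothetical pre-analysis script, drafted before endline data arrive.
# Variable names (treated, employed, earnings, hours, female) are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

PRIMARY = ["employed"]                # pre-specified primary outcome
SECONDARY = ["earnings", "hours"]     # pre-specified secondary outcomes
SUBGROUPS = ["female"]                # pre-specified heterogeneity dimensions

def run_prespecified(df: pd.DataFrame) -> dict:
    """Run only the analyses written down before the data came in."""
    results = {}
    for outcome in PRIMARY + SECONDARY:
        # Controls, clustering, and the like would also be fixed in the plan,
        # not chosen after looking at the results.
        results[outcome] = smf.ols(f"{outcome} ~ treated", data=df).fit()
    for dim in SUBGROUPS:
        results[f"{PRIMARY[0]}_by_{dim}"] = smf.ols(
            f"{PRIMARY[0]} ~ treated * {dim}", data=df
        ).fit()
    return results
```

Anything run beyond a script like this gets labelled post hoc in the write-up.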
If there had been more incentives to register and prespecify, I would have thought my experiments through better (something I did not know to do as a grad student), and some of the stuff emerging now would probably be better quality. Those incentives would also force me to be more disciplined even now.
I readily admit to blurring the edges between prespecification and data mining, though I do my best to police the border and label the post hoc analysis clearly.
Thanks. In a sense what you’re doing with the blog — posting about all the experiments you’re running — is a sort of registration because we all know about your experiments and someone would probably notice if you didn’t publish any of them. So it’s more the non-blogging experimenters that we have to watch! Relatedly, I wonder if there’s an expert at some medical journals or ISRCTN who could lend some insight into how registration became standard in clinical trials — I’m not familiar enough with the history, but I imagine it could inform efforts to make registration similarly standard in behavioral econ trials.
Relatedly, I assume you must seek IRB approval before starting human subjects research. But do you have a Data Safety and Monitoring Board (DSMB – http://www.irb.emory.edu/researchers/formstools/docs/other/DSMB%20Notes.pdf)? DSMBs seem like an ethical necessity if you’re doing experiments with human subjects on a question with equipoise (so that harm could result, or the intervention could be extremely beneficial and thus it might be unethical to deny it to the control group after a certain amount of data is gathered). It’s standard in public health, but I’m not sure how the standards are applied across fields.