Impact evaluations, good and bad

In an interesting new working paper, Michael Clemens and Gabriel Demombynes discuss different levels of rigor in the context of evaluating the impact of a specific intervention. Roughly speaking, they compare the current approach being used in that project (before and after) to a better approach (using ex post matched controls) to the ‘best’ approach (randomizing ex ante). For instance: one impact currently claimed is increased cell phone penetration, but of course that has been happening everywhere as time passes, so it mostly goes away (as a direct impact) when controls are used.
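The gap between these levels of rigor can be sketched with a toy example. This is a hypothetical illustration, not numbers from the paper: if cell phone ownership is rising everywhere, a naive before-and-after comparison attributes the whole region-wide trend to the project, while a simple difference-in-differences against a comparison village nets the trend out.

```python
# Toy illustration (hypothetical numbers, not from the paper): why
# before-and-after estimates can mistake a secular trend for project impact.
# Values are cell-phone ownership rates (fraction of households).

treated_before, treated_after = 0.10, 0.55   # project village
control_before, control_after = 0.12, 0.50   # comparison village, no project

# Naive before-and-after "impact": credits the project with the whole change
before_after = treated_after - treated_before           # 0.45

# Difference-in-differences: subtracts the trend seen without the project
secular_trend = control_after - control_before          # 0.38
diff_in_diffs = before_after - secular_trend            # 0.07

print(f"before-and-after estimate:  {before_after:.2f}")
print(f"trend in comparison village: {secular_trend:.2f}")
print(f"difference-in-differences:   {diff_in_diffs:.2f}")
```

With these made-up numbers, before-and-after overstates the project’s effect on ownership by more than a factor of six. Randomizing ex ante goes one step further, guarding against the possibility that the chosen comparison village differs from the project village in ways that also affect the trend.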

Part of what makes the paper interesting is that their chosen example is a large (potentially huge) media-friendly intervention: the Millennium Villages Project. Doing evaluation right is important, especially so for exciting but untested ideas. But part of what I found interesting is that there is no randomization, although they show how it could easily be incorporated. It’s worth keeping in mind that one can fall somewhat short of the gold standard, or very short, and that the difference matters; the world is not binary.

So when, if ever, should we not do a rigorous evaluation? Probably the best answer I’ve heard is: when ambiguity is useful as political cover. For instance, one could argue that conditional-cash-transfer programs (e.g. Oportunidades) are mostly redistributive in nature, and that the conditionality exists to keep conservatives happy. A rigorous evaluation might suggest that these are not cost-effective ways to, say, increase school attendance rates. Not doing the evaluation allows the primary goal to continue without having to defend it solely on cost-effectiveness grounds. Or so one could argue.

11 thoughts on “Impact evaluations, good and bad”

  1. Er, this isn’t very close to the actual facts. Check out the evaluation protocol here: http://www.thelancet.com/protocol-reviews/09PRT-8648. The cell phone issue is a red herring.

    Look at the actual protocol, and look at all the outcomes, and keep in mind the project is at a mid-point, not the end. If you don’t come to a different conclusion then snarkiness has taken over your soul.

  2. @Marc Levy: Thanks for your concern for our souls. The protocol itself lists most of the limitations we mention, as we state it does, so it’s hard to see why you think those concerns arise from “snarkiness” unless the protocol is “snarky” against itself. And calling things “impacts” that are clearly not “impacts” of the project, such as the expansion of cell phone coverage in Africa, is incorrect for any project. Name-calling with terms like “red herring” is easy; explaining why it’s okay for projects to claim impacts that aren’t impacts is harder.

  3. Michael and Julian – Perhaps a small point, but would you consider any evaluation that is not an RCT to be “rigorous”? Since I don’t believe this is the case, I have to say it grates that RCT/impact evaluation is used interchangeably with “rigorous.”

  4. @Marc: The paper has two main sections. The first examines the before-and-after impact estimates in the MVP mid-term evaluation. Cell phone ownership is one of the limited set of indicators reported by the MVP in the report and they highlight the increase in cell phone ownership at the Kenya site as one of the project’s “biggest impacts.” The problems with the before-and-after approach are glaringly obvious with the cell phone case but apply just as much to the other indicators. In a second part of the paper, we give a very detailed critique of the evaluation protocol’s description of future MVP evaluation plans.

    @AndyB: No, I wouldn’t equate RCT with “rigorous” impact evaluation. Here’s what we say in the paper: “As we use the terms here, ‘rigorous’ or ‘careful’ impact evaluation is the measurement of a policy’s effects with great attention to scientifically distinguishing true causal relationships from correlations that may or may not reflect causal relationships, using ‘well-controlled comparisons and/or natural quasi-experiments’ (Angrist and Pischke 2009: xii).” This definition leaves some room for discussion, but it’s clear that it does not encompass before-and-after impact evaluation.

  5. @AndyB: Thanks for your comment. I agree with you 100%. “Rigorous” is not synonymous with RCTs by any stretch. In my own research I’ve used diffs-in-diffs, regression discontinuities, instrumental variables, identification through heteroskedasticity, Conley et al. sensitivity analysis, *and* an RCT, among other methods. To me what distinguishes rigorous impact evaluation from other impact evaluation is what distinguishes the best science from other science: replicability. RCTs are attractive because it’s crystal-clear what’s being compared to what, so replication is (relatively) easy. In other cases one has to make a greater effort to document the counterfactual in a clearly replicable way, and this absolutely can be done well in nonrandomized settings — indeed, must be done in the vast range of situations where randomization is strictly impossible. The Millennium Village Project is not one of those settings where randomization would be impossible, or even difficult.

  6. As someone who has worked extensively in building the monitoring & evaluation capacity of grassroots organizations in Africa, I find the latest trend toward treating randomized control trials as the “gold standard” especially troubling when one is talking about community initiatives. Imposing such incredibly risk-averse behavior, in which every single intervention must be evaluated, on people who are in the process of organizing at the local level is most certainly a drain on their time and scarce resources. And what so many people on the ground have told me again and again is that abstract metrics don’t help them understand their relationship to improving the well-being of the people they serve.

    Yes, let’s pursue and obtain useful data from the ground, but at a scale at which information can be easily generated, utilized, and acted upon by those we are trying to serve. M&E implemented solely for the purpose of accountability fails to result in improved programming and, in many cases, undermines the effectiveness of the very interventions it is trying to measure. Let’s always consider what is the appropriate cost and complexity needed for evaluation (especially given the size and scope of the program) and aim for proportional expectations so we ensure M&E is a tool for learning, not policing.

  7. @Gabriel and @Michael. My comment was aimed at the blog post, not your report. Your report has many useful insights and deserves a more complete response than a simple blog comment. As Kyu notes, one such response has been composed by the MV leadership.

    When I read your report my first thought was that, while it had useful insights, overall it gave a misleading impression of the MV Project and how it is undertaking evaluation. When I read Julian’s blog post this impression was reinforced.

    Underlying my original comment is a simple proposition. If one reads only your report, one doesn’t understand the MV Project and its approach to evaluation. The proposition can be tested. After Julian reads Harvests of Development and the evaluation protocol, let’s see what he has to say. If I am right any post after that “treatment” should be notably different.

  8. @Kyu Thanks for the link. I think the point made regarding RCT on straightforward interventions versus RCT for complex interventions, including complex adaptive systems, is an important one that I’d like to see more discussion on.

  9. @Julian re CCT evaluations: Two small points are worth making. First, a rigorous evaluation is not needed when (a) we know the answer from other ‘rigorous’ studies, or (b) the question is not interesting (such as “if I put millions of dollars into a village, will this improve development outcomes?”). The more interesting questions (for the MV) are the synergies between the investments, the exact pathways to impact, spillovers, general equilibrium effects, etc. So the issue is not doing a rigorous evaluation per se, but designing it properly to answer an important and previously unanswered policy (and/or research) question.

    Second, on the CCTs, you are absolutely right about what technocratic evaluations might show. We have such an evaluation in Malawi (http://ideas.repec.org/p/wbk/wbrwps/5259.html) that shows that the condition to attend school does not improve school enrollment over and above unconditional transfers. In fact, UCTs perform strictly better than CCTs for some other outcomes, such as marriage and pregnancy among adolescent girls. However, you underestimate the political economy argument. If a policy-maker reads our results and still says “the non-beneficiaries (or the median voter) will never allow us to implement a progressive transfer program if we don’t impose some conditions”, she may well be absolutely right. If UCTs are infeasible and CCTs are still better than the counterfactual of standing pat, then a CCT may well still be OK. It is true that the rigorous evidence might make a politician weigh the political economy costs against the relative benefits of a UCT and may force her to explain the reasons behind the program design more explicitly. But such debates about rights and responsibilities of citizens HAVE taken place in some countries that implement CCTs and see the conditions as co-responsibilities of their citizens.

  10. @Marc: I’m confused by your statements. My post was not about the MVP per se, but rather about impact evaluations in general (thoughts inspired by Michael and Gabriel’s paper). My view is that RCTs are indeed the gold standard, but that rigor is on a continuum and that people should be careful to distinguish quality levels among non-RCTs. In that sense I am trying to [softly] rebuke both hard-core defenders of RCTs and those on the other end of the spectrum who believe that a non-RCT can fully capture the impacts.

    I haven’t read Harvests of Development and don’t plan to; it is irrelevant to the points I’m trying to make (and I’m not especially interested in it as an intervention, since my personal focus is on measuring individual preferences, although I respect the ambition). I did read the MV response to this working paper (thanks Kyu) and found it fairly unconvincing, but that again is irrelevant to my post.

    @Berk: Good points, thanks. Often these sorts of things come down to the [usually implicit] choice of counterfactual. Is a CCT a good idea? Well, that depends on whether the alternative use of the money is a UCT, some other development project (possibly with a more proven impact), or reducing taxes on the wealthy. I should also have pointed out in my original post that some projects simply can’t be randomized (e.g. distributing natural resources across countries and comparing growth rates), which is an obvious but important reason not to always do RCTs.