Am I actually sticking up for the Millennium Villages?

These are tough questions that the Millennium Villages Project will leave unanswered. For a huge pilot project with so much money and support behind it, and one that specifically aims to be exemplary (to “show what success looks like”), this is a disappointment, and a wasted opportunity.

On the Aid Watch blog, Laura Freschi takes aim at the Millennium Villages and their lack of rigorous evaluation. She also talks to my advisor and co-author:

Ted Miguel, head of the Center of Evaluation for Global Action at Berkeley, also said he would “hope to see a randomized impact evaluation, as the obvious, most scientifically rigorous approach, and one that is by now a standard part of the toolkit of most development economists. At a minimum I would have liked to see some sort of comparison group of nearby villages not directly affected by MVP but still subject to any relevant local economic/political ‘shocks,’ for use in a difference-in-differences analysis.”

Here’s the thing: I don’t know if rigorous evaluation is feasible with the MVs.

Usually the MVs are a cluster of perhaps 10 villages. This is, in some sense, a sample size of one (or 10, with high levels of cross-village correlation, which is not much of an improvement). Adding a few comparison clusters would be informative, but it wouldn’t provide the rigor or precision we would like. (Josh Angrist and Alan Krueger famously demonstrated this flaw with a difference-in-differences comparison of US cities.)
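To make that concrete, here is a minimal sketch of the standard design-effect arithmetic, using made-up intracluster correlations rather than anything measured in the MVs:

```python
# Hypothetical illustration: how cross-village correlation within one cluster
# shrinks the effective sample size. Numbers are assumptions, not MVP data.

def effective_sample_size(n_units: int, icc: float) -> float:
    """Kish design-effect approximation: n / (1 + (n - 1) * icc)."""
    return n_units / (1 + (n_units - 1) * icc)

n_villages = 10  # one MV cluster of roughly 10 villages, as described above
for icc in (0.0, 0.2, 0.5, 0.8):
    ess = effective_sample_size(n_villages, icc)
    print(f"intracluster correlation {icc:.1f} -> ~{ess:.1f} independent villages")

# With correlation around 0.5, ten villages carry about as much information
# as two independent ones, which is the sense in which this is close to a
# sample size of one.
```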

If we did many more MV clusters, we might be able to test their impact more confidently. I think the MV guys would love to do this, but they have a hard enough time getting funding for a single cluster, let alone plenty.

Also, even if we looked at control villages and saw an impact, what would we learn from it? “A gazillion dollars in aid and lots of government attention produces good outcomes.” Should this be shocking?

We wouldn’t be testing the fundamental premises of the big push theory: that high levels of aid, simultaneously attacking many sectors and bottlenecks, are needed to spur development, and that there are positive interactions and externalities among the multiple interventions.

The alternative hypothesis is that development is a gradual process, that marginal returns to aid may be high at low levels, and that we can also have a big impact with smaller, sector-specific interventions.

To test the big push and all these externalities, we’d need to measure the marginal returns to many single interventions, as well as to these interventions in combination (to get at the externalities). I’m not sure the sample size exists that could do it.
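To put a rough, purely hypothetical number on that: a full factorial design needs an arm for every on/off combination of interventions, so the arm count grows exponentially before you even multiply by the clusters each arm would require. The intervention list and the 30-clusters-per-arm figure below are assumptions for illustration, not MVP parameters:

```python
# Hypothetical sketch of the arms needed to test a "big push" factorially.
from itertools import product

interventions = ["fertilizer", "bed nets", "clinics", "schools", "roads"]  # assumed list
k = len(interventions)

arms = list(product([0, 1], repeat=k))  # every on/off combination of the interventions
clusters_per_arm = 30                   # assumed minimum for a cluster-randomized arm

print(f"{k} interventions -> {len(arms)} treatment cells")
print(f"at {clusters_per_arm} village clusters per cell, "
      f"roughly {len(arms) * clusters_per_arm} clusters are needed")

# Five interventions already imply 32 cells and on the order of a thousand
# clusters, versus the handful of MV clusters that exist.
```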

I once joked with a friend at DFID that we should raise the money to try. I wanted to call them the ‘Villennium Millages’. Now that would be a career maker.

Aid Watch is right to ask for more learning and evaluation. But we shouldn’t ask for rigor that isn’t feasible unless we’re prepared to fund thirty clusters of MVs so they can give it an honest try.

In the meantime, there are other paths to learning what works and why. I’m willing to bet there is a lot of trial-and-error learning in the MVs that could be shared. If they’re writing up these findings, I haven’t seen them. I suspect they could do a much better job, and I suspect they agree. But we shouldn’t hold them to evaluation goals that, from the outset, are bound to fail.

Here is what I suggest: we use the MVs to develop hypotheses about what interventions work and in what combination, especially things we didn’t expect or didn’t know before. Then, if necessary, we go and test these more manageable claims in a nearby environment, to see if they’re worth scaling up. This seems like a more productive debate to have with the MVs.

10 Responses

  1. I am managing a team trying to do this the right way in neighboring Kenya for Nuru International. We have been on the ground for just over a year and have a similarly modeled five-year exit plan. Last year, we collected baseline data, then, 8 months later, brought in third-party evaluators to measure our progress toward our goals for all of our metrics. This year we are beginning scaling efforts, and we hope to use those efforts as a means to begin conducting randomized comparisons of our interventions with neighboring communities. It IS hard, but we are trying to start out the right way, despite the stumbling that we’ve done.

  2. Millennium Villages reports: “By the end of 2008, the MVP… served over 400,000 people in 14 sites in ten countries.” http://www.millenniumvillages.org/docs/MVP_Annual_Report_2008.pdf

    There would have been no problem finding credible comparison groups (even if not randomized) for a sample of 14 clusters with multiple villages and 400,000 people. MVP has forgone a useful opportunity by not collecting baseline and follow-up data on nearby comparison communities. That opportunity would have added credibility to their claims of improvement and would have modeled how to build learning into more development projects.

  3. Also big fail on my part for being anonymous, since Google helpfully shows my picture!

  4. I think I’m with you on this one. This is a big enough push that one might define failure as anything short of overwhelming change that dwarfs *every* plausible counterfactual. In which case, there’s no need to approximate one.

    In 15 years, if the Millennium Villages look like La Jolla, California, and the neighboring villages are all empty because people have voted with their feet, then you really would have to be a pedant to claim the MVP wasn’t responsible for that in every relevant-to-the-real-world sense.

    On the other hand, if 15 years from now the argument is whether we can causally attribute to the MVP whatever small differences are observed relative to whatever crude approximation of the counterfactual a reasonable person might be inclined to point to, well, then the very fact that the question is out there means the project failed.

    1. Certainly, with big effects, randomization is not that important.

      But then again, with careful planning, randomization is not that hard or expensive.

      I believe a project claiming to be a leader in the field should have led by example.

      1. Just a point on the design of MVP. I believe the aim was for these projects to “demonstrate” what can be done with multiple interventions – but I think this means demonstrate in terms of visibility rather than statistically. In this sense they have probably achieved their aim – and for this their design was adequate.

        I agree with Chris that one very useful thing would be for them to document and share their lessons learned from implementation along the way to help inform other programmes about the practical challenges they faced and how they might be overcome – or what surprising things they learned along the way.

  5. You are right that it would be extremely expensive to test the big push theory, but that doesn’t mean that they shouldn’t be testing each piece of the push independently and, where sample size allows it, their interactions. Is Busia, Kenya, not an example of an MV-like area that is essentially getting a big push, but testing each component along the way? Why not have Millennium locations or sub-locations and randomize across villages, schools, households, and individuals within these areas? Idle speculation that the sample size would be too small is unproductive.

  6. I guess what you are saying is that the MV project is so badly designed from the start that it cannot be evaluated with a randomized design. I think that is an indictment of the project, not a defense. It is very hard to argue that proper evaluation would not have been possible if conceived from the start.

    And I disagree completely about sample size to test big push theories. Presumably these effects will be huge (big push) and cumulative over time (take off). We actually don’t need large samples to detect these huge effects.

    Frankly, there is no excuse for a project as visible, important and expensive as this not to have thought more carefully about evaluation from the outset. Least of all when its sponsors will use it to make strongly worded policy recommendations.
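The last commenter’s sample-size intuition can be checked with a crude simulation. This is a purely hypothetical sketch (effects expressed in between-cluster standard deviations, a plain t-test on cluster means, invented numbers throughout), not an analysis of MVP data:

```python
# Crude, hypothetical power simulation at the cluster level: how often does a
# two-sample t-test detect an effect of a given size with 7 clusters per arm?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(effect_sd: float, clusters_per_arm: int,
                    sims: int = 5000, alpha: float = 0.05) -> float:
    """Share of simulated trials in which the t-test rejects at level alpha."""
    hits = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, clusters_per_arm)
        treated = rng.normal(effect_sd, 1.0, clusters_per_arm)
        _, p_value = stats.ttest_ind(treated, control)
        hits += p_value < alpha
    return hits / sims

for effect in (0.2, 1.0, 2.0):  # effect sizes in between-cluster SDs (assumed)
    print(f"effect {effect:.1f} SD, 7 clusters/arm -> power ~ {simulated_power(effect, 7):.2f}")

# Only a truly enormous effect (around 2 SD of between-cluster noise) is
# reliably detectable with so few clusters; modest effects are not. That is
# roughly the commenter's point about "big push"-sized effects and take-off.
```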