The Millennium Villages, evaluated? A skeptical view

Michael Clemens highlights a new paper by Wanjala and Muradian, who do a clandestine evaluation of Kenya’s Millennium Villages.

The result does not look good for the MVs:

While Wanjala and Muradian find that the project caused a 70% increase in agricultural productivity among the treated households, tending to increase household income, it also caused less diversification of household economic activity into profitable non-farm employment, tending to decrease household income. These countervailing effects are precisely what one might expect from a large and intensive subsidy to agricultural activity. On balance, households that received this large and intensive intervention have no more income today than households that did not receive the intervention.

The only problem: I’m not sure I believe it.

This blog has had long conversations on evaluating the MVs. Before I get into my reading of the paper, do I have a secret side? Not really. I think it’s fair to say I’m an MV agnostic. All the folks on both sides of the divide are close friends and colleagues. This could make me a neutral arbiter (but mostly it makes me disappointing to both).

In spite of this, I decided I’d take up the paper as if I were refereeing for a journal, as dispassionately as possible (if somewhat hasty in my reading). So put on your propeller hats, folks, and join me under the fold.

My basic concern: there are a number of things that don’t add up in the paper. They could be just fine once clarified, but I worry they won’t be.

Let’s begin at the beginning: summary statistics. Before the authors make their adjustments, incomes are actually higher and poverty is lower among villagers in the Millennium Villages (MVs).

  • To see this, take a look at the simple differences between MV and non-MV people (Table 3). The MV people are 14 percentage points more likely to have a higher quality home (one of the best quick indicators of poverty, however imperfect) and 12 percentage points more likely to have land. So it looks like durable assets may be greater by a third in the MVs.
  • Also, most sources of income are the same or greater in the MVs than in the non-MVs. At least before the adjustments.
  • There doesn’t appear to be much of a difference in total income, however. But is this an error? The sub-components of income add up to the total among the MV people, but not among the non-MV people. If it is an error, then income is about 10% higher in the MVs on average.
  • Finally, the MV people are 14 percentage points more likely to be engaged in agriculture, but agriculture is not an either/or proposition. Most households engage in many activities, and are underemployed to begin with, and so increased time and productivity in agriculture does not necessarily crowd out small business or wage income. So I’m not even sure what to make of these indicators.

Now, you might say, “Hey, we’re not comparing apples to apples. Maybe the MVP folks were richer to begin with.” You’d be right. This is what motivates the authors’ matching method: Let’s match MV people to similar-looking non-MV folks.

This is a great idea if you can match on the right things. But is that the case?

  • Usually you want to match on pre-program characteristics like initial income or prior agricultural work. Better yet, you want to match on pre-program trends, not levels. This way you avoid matching someone on a downswing to someone on an upswing who happen (at that particular moment in time) to have similar levels. Almost no one does this, but they should.
  • The authors can’t do either without pre-program data. So (as far as I can tell) they match on post-program data, like employment status and housing quality and agricultural employment. What this means is…. wait a second… didn’t we just find out that the MV people are different along most of these characteristics? I think we have a problem.
  • What does this mean? I suspect the authors are (unknowingly) taking non-MV people who are unemployed or in poor quality housing–possibly because they didn’t receive the MV project (who knows?)–and matching them to unemployed and poor quality housing MV people. If so, it’s no surprise that there is no difference in income, since they’ve controlled (in their matching) for the impact of the MVs.
  • Meanwhile, you’re comparing employed and richer people in the MVs to the same kind of people in the non-MVs. But the MVs purportedly bump people on the margins of employment and poverty into slightly more employment and riches (i.e. the matching variables flip). Quite possibly these newly-non-poor have lower incomes than the average non-MV employed person in a nice house. But is that because they have just gotten out of poverty, and have yet to rise?
  • What we would like to know is if previously poor and unemployed people have gotten more employed or less poor. I think the authors omit this possibility altogether, and so lose a lot of the potential power of the projects.
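To make the worry concrete, here is a toy simulation, not the authors' method and not their data. Everything in it is an assumption for illustration: treatment is assigned at random (so the raw difference in means is the right benchmark, which is not true of the real MVs), the true income effect is set to 1.0, and "matching" is plain exact matching on a single post-program variable (housing quality) that the program itself improves.

```python
import random
from statistics import mean

def simulate(n=20000, true_effect=1.0, seed=0):
    """Hypothetical households: (treated, good_house, income)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        wealth = rng.gauss(0, 1)        # pre-program wealth, never observed
        treated = rng.random() < 0.5    # assumption: random assignment
        income = wealth + true_effect * treated + rng.gauss(0, 0.5)
        # Housing is measured AFTER the program, so the program moves it too.
        good_house = wealth + true_effect * treated + rng.gauss(0, 0.5) > 0.5
        rows.append((treated, good_house, income))
    return rows

def diff_in_means(rows):
    treated = [y for t, _, y in rows if t]
    control = [y for t, _, y in rows if not t]
    return mean(treated) - mean(control)

def matched_on_housing(rows):
    """Compare only within housing strata, weighting by treated count."""
    total, weight = 0.0, 0
    for house in (False, True):
        stratum = [r for r in rows if r[1] == house]
        n_treated = sum(1 for r in stratum if r[0])
        total += diff_in_means(stratum) * n_treated
        weight += n_treated
    return total / weight

rows = simulate()
naive = diff_in_means(rows)
matched = matched_on_housing(rows)
print(f"true effect 1.00 | unmatched {naive:.2f} | matched on housing {matched:.2f}")
```

In this setup the matched estimate comes out well below the true effect: a treated household in a good house is, on average, one that started out poorer than an untreated household in a good house, so matching on housing hands part of the program's impact to the comparison group.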

So, it’s not clear to me that the matched estimates really mean anything, since they match on things that we think are affected by the program. If anything, they seem to me to indicate the MV people are better off even with the matching deck stacked against them:

  • First, the direction of the matched estimates suggests that self-employment and wage income are higher among MV people than non-MV people (even if not statistically significant). This seems at odds with a main claim of the paper.
  • Also, remittances from outside are way, way down in MV households. This is probably the best measured portion of income and (in my mind) a pretty good sign that MV households are better off.

I should stress that this is not a vindication of the Millennium Villages. The MV folks could be better off now because they were better off to begin with. Without a credible matching strategy or other research design, it’s very difficult to say.

Also, the real question is not whether the MVs reduce poverty. If you put in more inputs, you’ll get more outputs. That’s something we mostly know in aid at the micro level. If the MVs actually raise incomes by 10% and assets by a third, then they almost certainly pass a cost-benefit test. That is important.

The real assumption behind the MVs, however, is that the different interventions are complementary: the whole of poverty alleviation is greater than the sum of its parts. That is not at all clear from an evaluation like this.

My own theory of poverty is actually the opposite: there are diminishing marginal returns to aid in a single village. I believe in the possibility of increasing returns and complementarities, but mainly through broad, national institutional and technological change. I’m personally not convinced real poverty traps exist, or can be overcome, at the household or village level.

Before ending, a few other red flags in the paper, which may or may not be an issue:

  1. This is a small sample size (just over 400 households), but the real “smallness” comes from the fact that there are just 16 communities. Since the assignment to MVP was done at the community level, in some sense the sample size here is 16 and not 411. That’s not really true, but one does have to account for the fact that people within communities have similar outcomes and reactions to the MV or lack thereof (“clustering of standard errors”, in the lexicon). I can’t tell if this was done, but it looks like not. If not, the statistical significance of any differences is probably overstated, and none of the results are as significant as the paper says. I suspect sample sizes are too small to say whether there is an impact one way or the other.
  2. I’d like to know more about how household income was measured. This is famously difficult to capture when households have multiple, irregular income streams. Especially agricultural income, which should include consumption of own produce. A poor measure of income could be little better than noise, especially in small samples. More worrisome, since the MV people are more likely to be in agriculture, we might be systematically underestimating agricultural income and hence the effect of the MV project.
  3. Consumption and nutrition data would be one way to get around this issue. In fact, there’s a bunch of data I would love to see measured: subjective well-being, distress and anxiety, social cohesion. These are all things I’ve seen impacted by aid in Uganda.
  4. I’d also like to see more detailed employment data, like hours instead of employment indicators. In rural Africa, there’s almost no such thing as “fully employed” or “unemployed”. It’s typically a matter of degree of underemployment.
  5. Matching estimates are famously sensitive to the matching method. These are not. In fact, some estimates change not at all. Probably this is because of the small sample size, but it’s a red flag for coding issues.
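On point 1, a back-of-the-envelope way to see how much 16 communities can shrink 411 households is the standard design-effect approximation, 1 + (m - 1) × ICC, where m is the average number of households per community and ICC is the intra-cluster correlation. The household and community counts come from the paper as described above; the ICC values below are purely hypothetical, and the formula assumes roughly equal-sized clusters. This is a rough sketch, not a substitute for actually clustering the standard errors.

```python
def design_effect(n_households, n_clusters, icc):
    """Approximate variance inflation from clustering, equal-sized clusters."""
    avg_cluster_size = n_households / n_clusters
    return 1 + (avg_cluster_size - 1) * icc

def effective_sample_size(n_households, n_clusters, icc):
    """Number of independent households the clustered sample is 'worth'."""
    return n_households / design_effect(n_households, n_clusters, icc)

# 411 households in 16 communities; ICC values are illustrative guesses.
for icc in (0.0, 0.05, 0.2, 1.0):
    deff = design_effect(411, 16, icc)
    n_eff = effective_sample_size(411, 16, icc)
    print(f"ICC={icc:.2f}  design effect={deff:.2f}  "
          f"effective n={n_eff:.0f}  SE inflation={deff ** 0.5:.2f}x")
```

With no within-community correlation the effective sample stays at 411; with perfect correlation it collapses to 16. Anything in between depends on an ICC that the paper would need to report or estimate.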

I could have been wrong in my reading of the paper, especially as it was hasty. Clarifications and corrections welcome.

11 thoughts on “The Millennium Villages, evaluated? A skeptical view”

  1. Thanks Chris. This is a smart analysis with many good points. One point I’d like to make is that the relevant null hypothesis, if we are assessing whether or not the project meets its goals, is not that the difference between the villages is zero. The relevant null is that the difference between the villages corresponds to a 50% difference in poverty rates, for that is the goal of the project. Clearly it would take massive contortions of these data to imply the very large and broad-based differences in income that would correspond to a 50% difference in poverty rates between the two groups. Whether or not incomes are 10% different between the groups is not relevant in assessing a project that claims to be “a solution to extreme poverty” capable of meeting the Millennium Development Goals.

    I would take issue with some of your points, such as the idea that clustered standard errors are required. All of the households in question, treated and untreated are within Gem district. That plausibly means they belong to a single cluster, and that the standard errors used in the paper are appropriate. I see no reason to believe that there would be the kind of heterogeneous pockets of homogeneity that would require clustered standard errors.

    That said, your analysis is thorough and appropriately skeptical. Have you applied this level of thorough, skeptical scrutiny to the papers produced by the MVP’s internal evaluation? I haven’t seen a post by you comparable to this one about any of their papers. I suggest you do one. You’ll find that they make strong, quantitative claims of ‘impact’ based on much shakier evidence than is used by Wanjala and Muradian — including before-and-after analysis with no comparison group at all, much less a comparison group matched by propensity scores with a common support.

    Gabriel Demombynes and I have undertaken such an analysis of one of the instances of “peer reviewed science” on which the project rests its claims of enormous impacts, in this post on the World Bank’s Africa Can End Poverty blog.

  2. @mclem: Quick thoughts:

    On the change in poverty rates, a 10% increase in current cash income could conceal a larger change in consumption because (1) it probably doesn’t account for home production, and (2) it may not account for consumption of the durables (which appear to have increased by a third).

    Is this a 50% change in poverty? Depends how one measures poverty. The cumulative effects of this change could be quite large, possibly enough to push half the village above a threshold poverty line (the common headcount measure). To know for sure, better measures and added analysis would be needed than this paper allows. But I would not be totally surprised if true.

    On clustering, if there is no intra-cluster correlation like you say, then it won’t matter. Being conservative in this matter is thus a no-lose situation. But the reason these villages are indeed different clusters is because the clustering is determined and set by the unit of randomization. This is standard experimental and quasi-experimental practice, and different than other motivations for clustering. That is a poor explanation, but it is late and I had Amtrak cheese and crackers and beer for dinner.

  3. Thanks very much Chris. These are hypothetically true. But the idea that poverty could be massively lower among hundreds of treated households than untreated households (the project’s goal), when otherwise similar households have similar incomes, is the idea of which we should definitely be skeptical. The burden of proof should lie on anyone making such a claim. Wanjala and Muradian, unlike the MVP’s internal evaluation, have no incentive to find one way or the other.

    And that, you’ll notice, was the main point of my post: independent evaluation is critically important. If you published an evaluation of the Yale Political Science Department, I would have a hard time believing it. If I published an evaluation of CGD, you wouldn’t believe it. Wanjala and Muradian’s results are more credible because they are independent. Apart from that, they are more credible because, despite the imperfections you correctly note, Wanjala and Muradian at least take seriously the need to compare to a credible counterfactual. The MVP, as we have discussed in several settings, does not.

    Finally, is it right to be completely agnostic about the impact? Do you really believe that if poverty rates in the treated villages were 50% lower than in untreated comparison villages, the project either would not know this, or would know it but would have concealed this great triumph? All of the project’s evaluation reports and papers have been silent about income, though clearly they are collecting income data. In other words, there are good reasons not to have a flat prior about the income effect.

    And there are definitely good reasons to subject the MVP’s internal evaluation to at least the same level of skepticism and scrutiny as Wanjala and Muradian’s results. I hope you do that in a future post.

  4. @Michael – I’d like to gently challenge a premise you and Chris seem to agree on, which is this insistence on a benchmark of 50% poverty reduction. Certainly an impact evaluation should measure the extent to which an initiative meets its stated goals, but as people interested in developing world issues, we can take a broader view of the benefits the Millennium Villages might provide.

    A 50% poverty reduction is not the only satisfactory result. Neither should we be happy about any small improvement over the control. Rather than sitting prepared to denounce the MVP if it fails to reduce poverty by 50%, we should be keeping an eye out to see if the Millennium Villages are a cost-effective way to alleviate poverty. If they are, then this model is useful, whether it helps achieve the MDGs or not.

    I’m sure it can be frustrating to deal with the MVP’s overconfident claims, but don’t let that blind you to the fact that the MVP, even if it fails at its goals, might still prove to be an effective anti-poverty program.

  5. Are independent results more likely to be truly unbiased? Let’s think a little bit about standard research bias – usually we’re always looking to show an impact, positive or negative. But now the null hypothesis is: “Millennium Villages are the holy grail of poverty alleviation, prove us wrong.”

    To me, it seems like ‘independent’ researchers have a massive incentive to knock down the MVP’s claims. True independence would have involved a little more transparency at the start than the Wanjala and Muradian study managed.

  6. I have not read the paper as closely as I should, but to clarify one point from your post, Chris:

    You mentioned that matching should be done on individual characteristics that are logically prior to receipt of treatment, and that they can’t do that without pre-program data. That is only partially true. They claim (page 16) to match on these covariates:
    sex, age, educational attainment of household head, number of household members, level of dependence within household, size of land holding, type of house (permanent or temporary), employment status of household head, marital status of household head and source of livelihood of household head (agricultural versus non-agricultural)

    Some of those ARE logically prior (I would say the first four, though the fourth is debatable). The others aren’t–and you’re right to point out that some of them are quite clearly consequences of the treatment.

    Does the MVP have panel or repeated cross-section data for its own villages, including pre-treatment? If so, I wonder if one could create a synthetic control group using nationally representative household surveys (DHS, for instance) at the district or province level for some of these countries. If I recall correctly, part of the problem was always that they didn’t do any real data collection beforehand, in which case this is a moot point.

  7. An equally important aspect of the debate has been ignored altogether: how sustainable are these gains? Are increases in income / consumption of durables partially or largely dependent on concurrent support from MVs? Or do they sustain longer term improvements once support has ceased (this may be true of consumption of durables, depending on the durables, but unlikely in general consumption patterns). This is something good evaluation should address.

  8. “So (as far as I can tell) they match on post-program data, like employment status and housing quality and agricultural employment.”

    I didn’t bother reading any further. If you are right, the paper is useless. If you’re wrong, then one would have to reanalyze the paper.

    Have you gotten a clear answer about this?

    @B Peterson – I’d say only the first 2 covariates are independent. I’d expect a rise in education to be a proposed side effect of the MVs, and that the family size could go either way, but certainly would be affected by economic conditions.

  9. KevinH: The education variable is the education of the household head, which I don’t think is likely to be affected by being an MV–presumably they are adults and are not typically going back to school. But really we’re just quibbling over details here, because the majority of the matching occurs on post-treatment variables, and I think Chris underestimates how big a problem that is in an effort to be nice.

    My point was mainly that you can match on pre-program characteristics using data gathered in a post-sample (which was not clear from the original post)–you just have to ask the right questions, i.e. about pre-program characteristics that would not be affected by treatment. But, again, they didn’t do this except in the first two cases (and arguably #3/4).