Chris Blattman


Randomized evaluations 2.0

In recent years, enthusiasm for applying to foreign development assistance the same high standards of investigation we usually reserve for medical trials has been sweeping donors and development institutions with the intensity and passion of religious conversion.

As a researcher and sometime proponent of such evaluations, however, I worry that (as currently conceived) randomized evaluations are of limited practical use to the practitioners and decision-makers who actually do development.

The proponents of randomization (Angus Deaton wonderfully labels the most ferocious of the breed ‘randomistas’) tend to focus on two main gaps. One is the gap in knowledge about which development programs work. What donors and big institutions seem to want to know is the relative return on investment of some educational, health, or economic interventions over others. This we admittedly do not know.

The second gap often lamented by us randomistas is the poor grasp of evaluation techniques among many policy-makers. The standards of conventional assessment often appear all too low.

However, I am increasingly convinced that the gap between researchers and practitioners is not merely the practitioners’ ignorance of evaluation techniques, but also the researchers’ ignorance of what it means to actually do development.

Suppose we evaluate a vocational training program and it fails to yield a ‘return on investment’. Do we know why? Suppose a microfinance program yields a tremendous return, with magnificent impacts. Do we know who or what part of the process led to the success? Usually the answer to both questions is ‘no’.

Return on investment numbers make excellent academic papers, and they are of great interest and importance to donors and big institutions like USAID and the World Bank. But do they really help government agencies or NGOs improve their programs in a timely way? All of the emphasis seems to be on upward accountability of results to the donor or institution. Real accountability, and real process evolution and improvement, means knowing what worked, who made it happen, and why.

Evaluations tend to look at the difference between having a program and having no program, when what would be most interesting and helpful to an organization is an evaluation of different flavors of a program, or of varying targeting and implementation strategies.

Furthermore, whether the evaluation is done properly and helpfully or not, results often arrive too late for most implementers to feed them back into their programs. Moreover, academic publications tend to be too long, technical, and inaccessible for most decision-makers. Can we really say that NGOs and governments are learning from our evaluations?

Put simply, standard randomized evaluations are not a performance management tool for development NGOs and government agencies.

I’m currently working on a short essay on this topic for CGD. Some preliminary thoughts and recommendations (feedback welcome!) include the following:

  • We need methods that produce more timely results. Interim follow-ups can do this, as can electronic data entry that feeds directly into (1) academic analysis and (2) NGO and government performance management systems.
  • Pay attention to impact heterogeneity: the differential response of different people (poorer, more educated, females, children) to the same program. This is important feedback for implementing organizations (a sketch after this list illustrates one way to estimate it).
  • Vary the process as well as the program. Within treatment groups, implement the program in different ways in order to learn how to run it better, rather than estimating only the simple average effect of the program. This is not only of practical value; academically it can help reveal causal channels (and hence improve accountability).
  • Move away from experimenting only with programs and program design. Experiment with alternative program targeting strategies as well. Development organizations care as much about reaching the poor and vulnerable as they do about being effective once they reach them.
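
To make the last three points concrete, here is a minimal sketch of the kind of analysis they imply. It is mine, not drawn from any actual evaluation: a simulated three-arm trial (assuming Python with pandas and statsmodels, and entirely made-up variable names such as "arm", "female", and "income") in which treatment indicators are interacted with a subgroup covariate, so the regression reports variant-specific effects for each subgroup rather than a single with/without average.

    # Hypothetical sketch: heterogeneous effects across program variants.
    # All data are simulated; names and magnitudes are illustrative only.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 3000

    # A simulated trial with two program variants ("cash", "training") plus
    # a control group, and a baseline covariate that may moderate impact.
    df = pd.DataFrame({
        "arm": rng.choice(["control", "cash", "training"], size=n),
        "female": rng.integers(0, 2, size=n),
    })
    avg_effect = {"control": 0.0, "cash": 0.30, "training": 0.15}
    df["income"] = (
        1.0
        + df["arm"].map(avg_effect)                      # variant-specific average effect
        + 0.20 * df["female"] * (df["arm"] == "cash")    # heterogeneity: cash arm helps women more
        + rng.normal(0, 1, size=n)
    )

    # Interacting the treatment arm with the covariate reports an effect for
    # each variant and subgroup, not just one average treatment effect.
    model = smf.ols("income ~ C(arm, Treatment('control')) * female", data=df).fit()
    print(model.summary())

The same interaction logic applies to targeting experiments: replace the subgroup covariate with an indicator for how participants were selected.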

As I said, feedback is encouraged.

2 Responses

  1. This is interesting and could be somewhat useful to practitioners. Understanding how to target or implement a program more effectively is useful, potentially more useful than just doing with/without analysis. However, it is still tinkering at the margins, and I remain skeptical that it will work.

    I have a Master’s and work as an evaluator in the field. USAID almost never, ever wants to invest the time, money or energy into doing randomized controlled trials for anything. Why would they? They have already decided that they are going to do the program, and a host of delicate political considerations went into that decision. Political considerations may also shape how they target and how they implement the program. So even if you do an RCT on implementation or targeting to show them it could be done better, it may or may not amount to anything, just as a with/without analysis may or may not amount to anything. They may be locked into how they implement or how they target for political reasons.

    I hate to say it, but the work we do at my firm is pretty damn crappy. Some of it is because the firm is entrenched and the older evaluators don’t know the new techniques at all. (I say “randomized controlled trial” and they look upon me with fear, like I hold the secrets of fire or something.) But honestly, it’s because we get money from USAID, which wants us to do things in a crappy way. They want us to do things in a crappy way so that, whatever the true outcome, they look good. They need to look good so Congress gives them money. It’s just an incentive problem, and I am not sure tinkering at the margins of RCTs will work.

    All of this used to keep me up at night, but now it makes me want to get a PhD and do real research.

  2. Chris, what is an example of a best-practice 2.0 evaluation? Pointing to a real example of one such evaluation, and pointing out the exceptional elements, would be helpful.

    Beyond the strictly normative analysis, how about some positive analysis? That is, what is it about the incentives academics face that leads to evaluation 1.0? What other kinds of incentives would lead them to move toward 2.0? Is it just the fact that practitioners won’t want to work with academics unless academics embrace 2.0, so if academics want to run any experiments at all they’ll be forced to adopt 2.0? Or is there some other way that academics’ incentives might be shaped — perhaps even by some mechanism that you propose?

    Michael Woolcock has done some related work that you might want to look at. I don’t know where his paper is, but here is a PowerPoint that summarizes it. Note his recommendation to “reduce distance between researcher and program”.

    If you do scholarly work on this subject, note that your critique of evaluation 1.0 is precisely analogous to the critique of all empirical research known as the Duhem-Quine Thesis. If an empirical observation falsifies a theory, exactly which hypothesis has been falsified: the one the researcher happens to be focusing on, or any one of countless others that have unavoidably been assumed to hold? The classic example is the orbit of Uranus: Newtonian celestial mechanics yielded a clear prediction for it, which turned out to be wrong. Was the theory thereby falsified? No: it turned out that the observation falsified an “auxiliary” hypothesis, the hypothesis that there were only seven planets. The discovery of Neptune explained the anomalies in the orbit of Uranus.

    More generally, Quine wrote that it is impossible to test any single hypothesis in isolation, because there are always auxiliary hypotheses that must simply be assumed in order to make the test possible. There is no way out of the problem, but there are things we can do to take more explicit account of it: either be more humble about the interpretation of any given falsification, or relax as many of the obvious auxiliary assumptions as possible (e.g. try an implementation in various ways, as you suggest). In Quine’s terms, move somewhat further out on the “web of belief”.
