Why “what works?” is the wrong question: Evaluating ideas not programs

The following are some remarks I made at the UK’s Department for International Development in June.

I want to start with a story about Liberia. It was 2008, and the UN Peacebuilding Fund had $15 million to support programs to reduce the risk of a return to war.

They asked NGOs and UN agencies to propose ideas, and set out to fund the most promising. Some were longstanding, locally developed programs, and some were templates ready to be imported and tested on new ground.

I was in Liberia at the time, as a young post-doctoral fellow. When asked for my input, I suggested the Fund do one small thing: in their call for proposals, simply say that they would favor any proposal that created knowledge as a public good, including rigorous program evaluations.

The donor whistled a few opening notes, and applicants continued the tune. A huge amount of evidence was generated, with almost every project funded, including at least four randomized trials: a program that taught conflict resolution skills, a legal aid program, an agricultural skills training program for ex-combatants, and a program of cognitive behavior therapy for criminals and drug dealers and street toughs.

These studies laid the groundwork for more evaluation work. Innovations for Poverty Action (IPA) established a permanent office. More international researchers came to the country. Local organizations that could do research began to form. Liberian staff gradually took on more senior roles, especially as they got hired and promoted in non-research organizations. The government, NGOs, and the UN recognized the value of these skills.

Today Liberia, one of the smallest countries in the world, has more hard evidence on conflict prevention than almost any other country on the planet. In per capita terms, it is off the charts.

For all that, Liberia is not a complete success story. We did one thing badly: we thought too much in terms of “what works?” In a couple of cases, we did one important thing well: we put a fundamental idea or assumption to the test. (Though mostly by accident.) And we did not do one of the most important things at all: set up the research to see if this very local insight told us something about the world more broadly.

I want to go through these three failures and successes one by one.

A mistake: focusing too much on “what works?”

Lately I find myself cringing at the question “what works in development?” I think it’s a mistake to think that way. That is why I now try hard not to talk in terms of “program evaluation”.

“Does it work?” is how I approached at least two of the studies. One example: Would a few months of agricultural skills training coax a bunch of ex-combatants out of illegal gold mining, settle them in villages, and make it less likely they join the next mercenary movement that forms?

But instead of asking, “does the program work?”, I should have asked, “How does the world work?” What we want is a reasonably accurate model of the world: why people or communities or institutions behave the way they do, and how they will respond to an incentive, or a constraint relieved. Randomized trials, designed right, can help move us to better models.

Take the ex-combatant training program as an example. This program stood on three legs.

The first was the leg that every training program stood on: that poor people have high returns to skills, but for some reason lack access to those skills, and if you give them those skills for free then they will be able to work and earn more.

The second leg was that farmers also needed capital inputs, like seeds and tools. The program provided some capital too, though not much in relation to the training.

The third leg was that something about the outcome—a new identity, a new group of peers, disconnection from old commanders, better incomes, respect in the community—would keep the men from returning to their illegal mining, or joining an armed movement in future.

Kick out any one of these legs, and the program doesn’t work. No program like it would work with a wobbly leg.

In our study, we followed a control group of ex-combatants that did not receive the program. That meant people either got the whole package or none of it. So we couldn’t take aim at any one leg in particular, only prod all three at once. That’s not very useful beyond this one program. It’s a long and expensive audit.

Accidentally tackling the right question: “How does the world work?”

Now, if we’d started with the question, “How does the world work?” we would have proceeded differently. We might have played with the balance of training to capital, to find out where the returns were greater. We might have asked, “Are skills or capital even the real constraints holding these men back?” We might have tested other ways to reduce their interest in committing crimes or fighting someone else’s war.

As it happens, we were able to prod each leg by accident.

Some of the men only got the training, not the capital. The men who expressed interest in raising vegetables received their seeds and tools. But the men who said they were interested in raising animals waited more than 18 months for their chicks and piglets to arrive. It turned out you couldn’t source chicks and piglets in Liberia, and so they had to be flown in on unheated UN cargo planes from Guinea, with predictable results for the poor baby animals.

Not surprisingly, the vegetable men had higher farm profits than the men still waiting for their animals to arrive.

Then something even more tragic and more important happened: a war broke out in neighboring Cote d’Ivoire.

Ex-commanders around the country began to hold meetings and recruit. Some of the ex-combatants in our study, about one in ten, said they made some kind of plan or agreement to get involved. It’s unclear how serious they were, and we never found out, as a French military intervention ended the war and that recruitment.

An interesting thing about who did and didn’t say they got involved in recruitment (and an interesting thing about who did and didn’t return to the illegal mines): the vegetable farmers were a little less likely than the control group to express interest in recruitment. They had greater incomes after all. But the men waiting for their piglets and chicks were the least likely to express interest or plans for mercenary work. Apparently, a promise of $100 worth of animals was worth more than the $500 for getting onto the truck to Cote d’Ivoire. Or so they told us.

That’s not something I expected. But it makes a kind of sense: future incentives might matter more than past ones. It sounds obvious, but it’s a lesson that has eluded most rehabilitation programs, whether for US criminals or African ex-combatants.

My first attempt to do better: Theory-driven programs and research design

The pitfalls of simple program evaluation didn’t catch me by complete surprise. Development economists had been talking about these pitfalls for some time. For the most part I knew “whether the package works” was not the best investment of time or money. But I underestimated how much work program evaluation could be. For that kind of money and effort and time, I wanted to do better.

With more credibility and patience, I set out to put an important idea first, and to test an assumption about the world.

I met a group of former combatants turned social workers who called themselves NEPI. NEPI had developed a therapy program, one that did not teach job skills. Rather, NEPI took criminals, drug dealers, and street toughs, and tried, over just eight weeks, to teach them self-control, emotional regulation, and a new self-image that eschewed crime and violence. What they were doing closely resembled cognitive behavior therapy, an approach used widely in the U.S. to tackle a variety of problems, including rehabilitating youth delinquents. A number of small U.S. studies showed real signs of success.

Here was a cheap program rooted in international practice with a slightly unbelievable premise: that eight weeks of group therapy could do as much or more than an employment program to reduce crime, violence, and other anti-social behaviors. So we put that to the test. And because we wanted to test an alternative approach, we also evaluated a cash transfer to the same men. So men could receive one, both, or none of the interventions.

We found that the therapy was incredibly powerful, drastically reducing crime and violence. And we found it was even more powerful (and lasting) when an economic intervention provided petty criminals with even a short period of time in which to practice their new skills of self control and new self image. Crime and aggression fell by half.

These two examples illustrate something that randomized control trials can do well: they can put a crucial idea or an assumption to the test.

Sometimes randomized trials do this by “proof of concept”—something outlandish that we think could be true, but don’t quite believe (or, more often, no one believes us!). Such as whether aggression and criminality can be tempered by short courses of cognitive behavioral therapy.

Sometimes randomized trials do it by testing “fundamental assumptions”—ones that prop up the intervention (especially the most common ones). One of my favorite examples is the growing number of studies showing that the returns to business and vocational skills are not that high, at least without other inputs such as capital.

Personally, I think these theory-driven cases are where randomized trials have been most useful. Accidental insights into how the world works still outnumber the purposeful, theory-driven ones. But the balance is shifting.

That is the first big idea I want you to take away: that we should not set out to evaluate a program and understand “does it work?” but rather “how can we use this program to test a fundamental assumption about the world—one that might change the way we do all programs in future.”

But theory-driven design is not enough; we need to design for generalizability

The second and final big idea I want you to take away: even this theory-driven, “test the big assumptions” approach to randomized trials is not enough.

One cognitive behavioral trial with one population in Liberia is not enough to give us a new model of behavior change among risky men. Nor is one ex-combatant agricultural program. We can be intrigued by these insights, but we cannot responsibly pretend we know something about the wider world.

We need many other ingredients for that, including better theories and understanding of context and cases. Another powerful ingredient, however, is bigger randomized trials: multiple trials, in multiple places, testing some of the same key ideas and assumptions. This strikes me as the only possible path from randomized trials to general knowledge.
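Mechanically, “multiple trials, in multiple places” means pooling site-level effect estimates and asking whether they disagree more than sampling noise allows. The sketch below illustrates the standard inverse-variance approach with invented estimates for hypothetical countries; the numbers are placeholders, not real results.

```python
import math

# Hypothetical treatment-effect estimates and standard errors from trials
# of the same intervention in several countries. All numbers are invented.
sites = {
    "Liberia":   (-0.40, 0.10),
    "Country B": (-0.25, 0.12),
    "Country C": (-0.05, 0.15),
    "Country D": (-0.30, 0.09),
}

# Fixed-effect (inverse-variance) pooled estimate: precise sites get
# proportionally more weight.
weights = {s: 1.0 / se**2 for s, (_, se) in sites.items()}
pooled = sum(weights[s] * est for s, (est, _) in sites.items()) / sum(weights.values())
pooled_se = math.sqrt(1.0 / sum(weights.values()))

# Cochran's Q statistic: do the sites disagree beyond sampling noise?
# A large Q relative to its degrees of freedom suggests the effect
# genuinely varies by context, which is itself a finding.
q = sum(weights[s] * (est - pooled) ** 2 for s, (est, _) in sites.items())
df = len(sites) - 1

print(f"pooled effect: {pooled:+.3f} (se {pooled_se:.3f}), Q = {q:.2f} on {df} df")
```

The point of the heterogeneity check is exactly the one in the text: a single pooled number is not the goal; learning whether and why the effect moves across contexts is.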

Take the therapy example. The next logical step is cognitive behavioral therapy trials in more places. There are enough examples from the US and Liberia that cognitive behavioral therapy is ripe for piloting and testing at scale in several countries, ideally testing implementation approaches in each place: longer or shorter, with or without economic assistance, and so forth.

Another example: The right combination of skills and capital to raise incomes is also ripe for testing in many countries. In fact, we are currently trying to work with DFID and others to launch a six-country study where, in each country, we test different combinations of skills, capital, and other inputs in the same way. We aim to do this with multiple populations of importance, including high-risk men—meaning, we anticipate, some conflict-affected countries in the mix.

Without this knowledge, these countries are spending money and time they cannot afford to waste on anti-poverty programs that probably don’t work at all or as well as they should.

There are a couple of examples of programs that have already been tested in a semi-coordinated fashion.

  • Can dropping money on a village, and requiring democratic decision-making on its use, change how decisions get made in future? This is the premise of many “community driven development” and reconstruction programs. The answer seems to be “no”.
  • Does microfinance help even some of the poor invest and raise their incomes? The answer seems to be “very seldom or not at all.”

Too much money was spent on these programs because of assumptions we waited too long to test.

There are other mature ideas that can and should be tested in a coordinated fashion, where we don’t yet know the answer:

  • When people do become wealthier and more employed, are they less willing to fight? Maybe. The evidence is still scant.
  • Are their votes harder to buy? Are they less tolerant of a corrupt regime? Maybe anti-poverty programs have a hidden impact on governance that we are not taking into account. Maybe, if these programs strengthen autocratic governments, they lessen accountability.

These are questions that are answerable in a short space of time, that don’t require much more money or time than the average large country program.

There are only a few governments and development agencies in the world big enough to carry these out, however. Right now almost none are doing so. Some efforts by JPAL and EGAP are promising. But fully coordinated trials, implemented for the sake of learning about the big questions and assumptions—I haven’t seen these yet.

One reason, I think, is that they are poorly organized to identify, fund, and carry out initiatives that answer questions important to every country.

Most spending, as far as I can tell, is driven by a country’s parochial priorities. Decisions are made in this very siloed way. And that is where the funds lie too.

In structuring themselves to be country driven, development agencies have created a collective action problem. The public good of knowledge is still getting created, but it’s slow, piecemeal, and sometimes accidental.

The term “program evaluation” and the question “what works?” are symptoms of this problem.

If I were interested in creating the public good, I would devote a small percentage of a government or agency’s money to answering questions important to all of us. I would centrally plan and fund projects to be implemented in a particular way, maybe the same ways, across these country boundaries.

I would make sure they more seldom evaluate programs, and more often investigate assumptions and ideas. And never just any ideas, but the legs that prop up enormous things. If strengthened, these legs can have enormous impact. And if a weak leg is kicked out, then the money used for the kicking probably could not have been better spent.

243 thoughts on “Why ‘what works?’ is the wrong question: Evaluating ideas not programs”

  3. Albert Hirschman’s “Development Projects Observed” was indeed 50 years ahead of its time and an excellent antidote to the “What Works” pablum, which in effect is just an embellished version of the cost/benefit analysis approach to project evaluation. Case in point: his principle of the “Hiding Hand”. This referred to the fact that (infrastructural) project costs tend to be underestimated, benefits exaggerated to please political patrons, and difficulties ignored when a project is embarked upon.
    This is seen as a blessing in disguise: the project might never have been greenlighted if the full extent of the difficulties in project implementation had been fully anticipated.
    Yet the struggle with the unexpected obstacles and difficulties is precisely what makes the project succeed, and often in ways which could not have been anticipated ex ante. Hence the ‘hiding hand’ providentially turns the private troubles of the project manager into the public blessings of development – and perhaps the private blessing of the project manager himself as well in the end.

  4. Chris – interesting session in DFID and great post – and we were happy to help you amend your title from ‘why programme evaluation is a bad idea'(!). At the end of the DFID session you said ‘good evidence and good programming are mutually complementary’. This is important – and suggests a need to keep a range of interest groups happy enough (…the politics – including the ‘parochial priorities’). Working in a research funding role, I see it like this:
    i) Programme evaluation can be methodologically over-complex and sometimes wobbly, but is important for accountability to those funding a programme (and the politicians/parliamentary committees/commissions that represent their interests).
    ii) ‘What works?’ is sometimes used as lazy shorthand, but signals an intent to do operationally relevant research/evaluation (i.e. impatient to change the world, not just describe it in greater detail).
    We endeavour to also iii) evaluate ideas/test assumptions (identify and kick the legs of your stool) for a greater public good, but servicing i) and ii) helps us to convince our programme implementation partners and funders to let iii) happen, and particularly to ‘hold steady’ – not bin the RCT when the programming going gets tough.

    One development in DFID social research commissioning is that RCTs are creeping into wider research programmes – i.e. ‘mainstream’ researchers spotting opportunities where an RCT (or cluster of RCTs) makes sense, within a framework of wider Research Questions and mixed methods (contrasting with, and complementing organisations that only do RCTs). …but here’s hoping that this is not just bargain basement botox…

  5. Thanks for a thoughtful post, Chris. I’d like to push generalizability, what we should strive for, what it really means for lessons to be transferable, & how to distribute responsibility for this. There is certainly work to be done in terms of more purposive partner and site selection, for example, in a way that might allow us to test ideas and theories in a more systematic way (taking a cue from case selection in political science, e.g., including Gerring, Lieberman). But there is still a fundamental issue of reporting so that readers elsewhere can make some assessments about generalizability (working on the assumption that there are no silver bullets and that every program would need to be tailored, in line with arguments about a mirage of external validity). Let’s call this external assessability, as distinct from external validity. I argue this is critical for generalizability and too often poo-pooed, dismissed as time-consuming or not sufficiently ‘real’ to be reported as research, relegated to blogs or beers at best. Do researchers currently provide sufficient details about site and partner selection, as well as contextual details and implementing partner capacity, for readers elsewhere to say, hey, that sounds like me and do-able here? Or is it up to readers to try to figure out whether X setting is sufficiently like them to be worth trying an idea out? Similarly, do researchers currently provide enough details about implementation processes (not just intended design, as per ISSIT by Evans et al recently, but how implementation actually went, what struggles were faced, and how a context did/not support a program) for readers elsewhere to think: OK, I have a sense of how to begin doing this here? Or, again, is it up to readers to reach out and say, ‘so, what really happened?’ My sense is that researchers should do some of this work and push to make it a standard part of research-as-public-good — but I often hear that there is limited incentive to systematically collect or report this information (but always room for a good piece of anec-data), that it’s impossible to include this sort of information in publication, etc. No doubt there are constraints and competing pressures. But as long as information about research and programmatic challenges remains tacit, only to be shared over beers or in elite seminar rooms, it is hard to understand how we are collectively going to move towards research from ‘here’ that can stimulate not only more research but also action ‘there.’ Generalizability, I think, depends critically on not just the perfect ‘here’ or more ‘here’s — key design issues — but on well-describing ‘here,’ why ‘here’ and what happened ‘here’ in an accessible way.

  6. Great recognition of the need for learning about how things work. Agree with April – the reference to program evaluation seems to be based on a narrow definition of evaluation and the current emphasis on impact rather than other OECD DAC evaluation criteria or learning etc. The issue doesn’t seem to be program evaluation but rather the wrong question was asked. Then perhaps also something about an inappropriate method being chosen.

  7. Agree with all your points Chris. I believe the problems you point out are significantly worse with regard to development assistance aiming to strengthen service delivery, in particular, assistance purporting to aim for improvements in the operation and performance of health and education organizations. In these cases, framing research as “program” “impact evaluations” has led to research that very rarely sheds light on how the world works in the domain of interest. Where is the research illuminating which factors are driving the oh-so-common extreme dysfunctionality of public clinics? Or which policies (not donor-supported programmatic actions) generate changes that bring about organization level and individual level behavior changes that improve operation? sustained beyond the collection of the usual ex post “impact evaluation” horizon?
    I suppose it wouldn’t matter that impact evaluations shed so little light on the phenomena of most relevance for service delivery dysfunction/ strengthening, if the evaluations constituted a small portion of the total research being undertaken (e.g. if other researchers were doing basic research on service delivery/ policy toward service delivery). But, for service delivery in developing countries, my sense is that the vast majority of research is impact evaluations. So, the studies keep piling up, but we get little additional illumination.

  8. I want to nitpick about the assertion about microfinance. Even the authors of the multi-country study suggested their results don’t apply to infra-marginal cases. All that they can say is that it didn’t work in the slums of Hyderabad. But that doesn’t mean it didn’t work in the rest of India. This distinction between marginal and infra-marginal is important, and serious researchers need to keep that in mind in popularizing the results.

  9. Very interesting. Sounds like you’ve come to a “sad realisation”.
    Seems odd that the first stages of the evaluation didn’t catch this. The Evaluation Assessment’s first question should have been: what is your theory of social intervention? And the ensuing logic models should have made sense, or at least been clear and measurable. Was a guiding strategy formulated? Was there Formative Evaluation during the programmes’ designs?
    Undoubtedly this was a huge effort, with multiple clients and issues, but Programme Evaluation is not usually the problem. It’s higher up the chain where the confusion, incompetence, and conflicting agendas doom the entire effort.
    Wouldn’t some experts in international development and management have helped early on? Or are they the problem?

  10. Great post. I’d argue that this is an issue for economics generally, in fact! “What can we learn from this?” is a question that many researchers fail to really appreciate.

  11. This is great, thanks for sharing – the testing of how the world works implies much greater humility on those offering assistance, and the theory-based modeling implies much stronger logic around what we currently believe/assume to be true in most sub-disciplines of development.

    Duncan Green has a blog post today on how change happens and the role of shifting norms, e.g. in gender. That work doesn’t happen so quickly, but it seems that outside actors can catalyze and frame changes in norms (certainly true here in the US, for something like gay marriage). This implies that aid actors should have some programming that works to change slow-moving norms, with results to be perceived over a longer term. How would you suggest addressing the question of theory-based programming intended to matter over the long term (resilience to climate shocks, gender norms, market development, etc.)?