Why “what works?” is the wrong question: Evaluating ideas not programs

The following are some remarks I made at the UK’s Department for International Development in June.

I want to start with a story about Liberia. It was 2008, and the UN Peacebuilding Fund had $15 million to support programs to reduce the risk of a return to war.

They asked NGOs and UN agencies to propose ideas, and set out to fund the most promising. Some were longstanding, locally-developed programs, and some were templates ready to be imported and tested on new ground.

I was in Liberia at the time, as a young post-doctoral researcher. When asked for my input, I suggested the Fund do one small thing: in their call for proposals, simply say that they would favor any proposal that created knowledge as a public good, including rigorous program evaluations.

The donor whistled a few opening notes, and applicants continued the tune. A huge amount of evidence was generated, with almost every project funded, including at least four randomized trials: a program that taught conflict resolution skills, a legal aid program, an agricultural skills training program for ex-combatants, and a program of cognitive behavior therapy for criminals and drug dealers and street toughs.

These studies laid the groundwork for more evaluation work. Innovations for Poverty Action (IPA) established a permanent office. More international researchers came to the country. Local organizations that could do research began to form. Liberian staff gradually took on more and more senior roles, especially as they got hired and promoted in non-research organizations. The government and NGOs and UN recognized the value of these skills.

Today Liberia, one of the smallest countries in the world, has more hard evidence on conflict prevention than almost any other country on the planet. In per capita terms, it is off the charts.

For all that, Liberia is not a complete success story. We did one thing badly: we thought too much in terms of “what works?” In a couple of cases, we did one important thing well: we put a fundamental idea or assumption to the test. (Though mostly by accident.) And we did not do one of the most important things at all: set up the research to see if this very local insight told us something about the world more broadly.

I want to go through these three failures and successes one by one.

A mistake: focusing too much on “what works?”

Lately I find myself cringing at the question “what works in development?” I think it’s a mistake to think that way. That is why I now try hard not to talk in terms of “program evaluation”.

“Does it work?” is how I approached at least two of the studies. One example: Would a few months of agricultural skills training coax a bunch of ex-combatants out of illegal gold mining, settle them in villages, and make it less likely they join the next mercenary movement that forms?

But instead of asking, “does the program work?”, I should have asked, “How does the world work?” What we want is a reasonably accurate model of the world: why people or communities or institutions behave the way they do, and how they will respond to an incentive, or a constraint relieved. Randomized trials, designed right, can help move us to better models.

Take the ex-combatant training program as an example. This program stood on three legs.

The first was the leg that every training program stood on: that poor people have high returns to skills, but for some reason lack access to those skills, and if you give them those skills for free then they will be able to work and earn more.

The second leg was that farmers also needed capital inputs, like seeds and tools. The program provided some capital too, though not much in relation to the training.

The third leg was that something about the outcome—a new identity, a new group of peers, disconnection from old commanders, better incomes, respect in the community—would keep the men from returning to their illegal mining, or joining an armed movement in future.

Kick out any one of these legs, and the program doesn’t work. No program like it would work with a wobbly leg.

In our study, we followed a control group of ex-combatants that did not receive the program. That meant people either got the whole package or none of it. So we couldn’t take aim at any one leg in particular, just prod all three at once. That’s not very useful beyond this one program. It’s a long and expensive audit.

Accidentally tackling the right question: “How does the world work?”

Now, if we’d started with the question, “How does the world work?” we would have proceeded differently. We might have played with the balance of training to capital, to find out where the returns were greater. We might have asked, “Are skills or capital even the real constraints holding these men back?” We might have tested other ways to reduce their interest in committing crimes or fighting someone else’s war.

As it happens, we were able to prod each leg by accident.

Some of the men only got the training, not the capital. The men who expressed interest in raising vegetables received their seeds and tools. But the men who said they were interested in raising animals waited more than 18 months for their chicks and piglets to arrive. It turned out you couldn’t source chicks and piglets in Liberia, and so they had to be flown in on unheated UN cargo planes from Guinea, with predictable results for the poor baby animals.

Not surprisingly, the vegetable men had higher farm profits than the men still waiting for their animals to arrive.

Then something even more tragic and more important happened: a war broke out in neighboring Cote d’Ivoire.

Ex-commanders around the country began to hold meetings and recruit. Some of the ex-combatants in our study, about one in ten, said they made some kind of plan to move, or an agreement to get involved. It’s unclear how serious they were, and we never found out, as a French military intervention ended the war and the recruitment.

An interesting thing about who did and didn’t say they got involved in recruitment (and an interesting thing about who did and didn’t return to the illegal mines): the vegetable farmers were a little less likely than the control group to express interest in recruitment. They had greater incomes after all. But the men waiting for their piglets and chicks were the least likely to express interest or plans for mercenary work. Apparently, a promise of $100 worth of animals was worth more than the $500 for getting onto the truck to Cote d’Ivoire. Or so they told us.

That’s not something I expected. But it makes a kind of sense: future incentives might matter more than past ones. It sounds obvious, but it’s an insight that has eluded most rehabilitation programs, whether for US criminals or African ex-combatants.

My first attempt to do better: Theory-driven programs and research design

The pitfalls of simple program evaluation didn’t catch me by complete surprise. Development economists had been talking about these pitfalls for some time. For the most part I knew “whether the package works” was not the best investment of time or money. But I underestimated how much work program evaluation could be. For that kind of money and effort and time, I wanted to do better.

With more credibility and patience, I set out to take an important idea first, and test an assumption about the world.

I met a group of former combatants turned social workers who called themselves NEPI. NEPI had developed a therapy program, one that did not teach job skills. Rather, NEPI took criminals, drug dealers, and street toughs, and over just eight weeks tried to teach them self-control, emotional regulation, and a new self-image that rejected crime and violence. What they were doing closely resembled cognitive behavior therapy, an approach used widely in the U.S. to tackle a variety of problems, including rehabilitating youth delinquents. A number of small U.S. studies showed real signs of success.

Here was a cheap program rooted in international practice with a slightly unbelievable premise: that eight weeks of group therapy could do as much or more than an employment program to reduce crime, violence, and other anti-social behaviors. So we put that to the test. And because we wanted to test an alternative approach, we also evaluated a cash transfer to the same men. So men could receive one, both, or none of the interventions.

We found that the therapy was incredibly powerful, drastically reducing crime and violence. And we found it was even more powerful (and lasting) when an economic intervention gave petty criminals even a short period of time in which to practice their new skills of self-control and their new self-image. Crime and aggression fell by half.

These two examples illustrate something that randomized control trials can do well: they can put a crucial idea or an assumption to the test.

Sometimes randomized trials do this by “proof of concept”—something outlandish that we think could be true, but don’t quite believe (or, more often, no one believes us!). Such as whether aggression and criminality can be tempered by short courses of cognitive behavioral therapy.

Sometimes randomized trials do it by testing “fundamental assumptions”—ones that prop up the intervention (especially the most common ones). One of my favorite examples is the growing number of studies showing that the returns to business and vocational skills are not that high, at least without other inputs such as capital.

Personally, I think these theory-driven cases are where randomized trials have been most useful. Accidental insights into the world still outnumber the purposeful, theory-driven ones. But the balance is shifting.

That is the first big idea I want you to take away: that we should not set out to evaluate a program and understand “does it work?” but rather “how can we use this program to test a fundamental assumption about the world—one that might change the way we do all programs in future.”

But theory-driven design is not enough; we need to design for generalizability

The second and final big idea I want you to take away: even this theory-driven, “test the big assumptions” approach to randomized trials is not enough.

One cognitive behavioral trial with one population in Liberia is not enough to give us a new model of behavior change among risky men. Nor is one ex-combatant agricultural program. We can be intrigued by these insights, but we cannot responsibly pretend we know something about the wider world.

We need many other ingredients for that, including better theories and understanding of context and cases. Another powerful ingredient, however, is bigger randomized trials: multiple trials, in multiple places, testing some of the same key ideas and assumptions. This strikes me as the only possible path from randomized trials to general knowledge.

Take the therapy example. The next logical step is cognitive behavioral therapy trials in more places. There are enough examples from the US and Liberia that cognitive behavioral therapy is ripe for piloting and testing at scale in several countries, ideally testing implementation approaches in each place: longer or shorter, with or without economic assistance, and so forth.

Another example: The right combination of skills and capital to raise incomes is also ripe for testing in many countries. In fact, we are currently trying to work with DFID and others to launch a six-country study where, in each country, we test different combinations of skills, capital, and other inputs in the same way. We aim to do this with multiple populations of importance, including high-risk men—meaning, we anticipate, some conflict-affected countries in the mix.

Without this knowledge, these countries are spending money and time they cannot afford to waste on anti-poverty programs that probably don’t work at all, or not as well as they should.

There are a couple of examples of programs that have already been tested in a semi-coordinated fashion.

  • Can dropping money on a village, and requiring democratic decision-making on its use, change how decisions get made in future? This is the premise of many “community driven development” and reconstruction programs. The answer seems to be “no”.
  • Does microfinance help even some of the poor invest and raise their incomes? The answer seems to be “very seldom or not at all.”

Too much money was spent on these programs because of assumptions we waited too long to test.

There are other mature ideas that can and should be tested in a coordinated fashion, where we don’t yet know the answer:

  • When people become wealthier and better employed, are they less willing to fight? Maybe. The evidence is still scant.
  • Are their votes harder to buy? Are they less tolerant of a corrupt regime? Maybe anti-poverty programs have a hidden impact on governance that we are not taking into account. Maybe, if these programs strengthen autocratic governments, they lessen accountability.

These are questions that are answerable in a short space of time, and that don’t require much more money than the average large country program.

There are only a few governments and development agencies in the world big enough to carry these out, however. Right now almost none are doing so. Some efforts by J-PAL and EGAP are promising. But fully coordinated trials, implemented for the sake of learning about the big questions and assumptions—I haven’t seen these yet.

One reason, I think, is that they are poorly organized to identify, fund, and carry out initiatives that answer questions important to every country.

Most spending, as far as I can tell, is driven by a country’s parochial priorities. Decisions are made in this very siloed way. And that is where the funds lie too.

In structuring themselves to be country driven, development agencies have created a collective action problem. The public good of knowledge is still getting created, but it’s slow, piecemeal, and sometimes accidental.

The term “program evaluation” and the question “what works?” is a symptom of this problem.

If I were interested in creating the public good, I would devote a small percentage of a government’s or agency’s money to answering questions important to all of us. I would centrally plan and fund projects to be implemented in a particular way, maybe in the same way, across country boundaries.

I would make sure they more seldom evaluate programs, and more often investigate assumptions and ideas. And never just any ideas, but the legs that prop up enormous things. If strengthened, these legs can have enormous impact. And if a weak leg is kicked out, then the money used for the kicking probably could not have been better spent.