The cardinal sin of matching

We interrupt our regular programming for a brief statistics rant.

I keep seeing papers that make the following argument: “We want to test whether more T leads to more Y. Unfortunately there are lots of unobserved variables that could be driving both T and Y, so a least squares regression of Y on T (plus some correlates X that we do observe) is probably going to give a biased estimate. Therefore I am going to use a matching estimator to reduce the bias.”

This mistake is made by papers in some of the best journals, especially in political science. It has even been made by a couple of causal identification methodologists who shall remain nameless. I call it the cardinal sin of matching.

Here is the short story: matching is not an identification strategy or a solution to your endogeneity problem; it is a weighting scheme. Saying that matching will reduce endogeneity bias is like saying that the best way to get thin is to weigh yourself in kilos. The statement makes no sense. It confuses technique with substance.

Your causal inference problem is pretty simple: there are things you can’t measure that could lead to more of both T and Y. Let’s call these pesky variables Z. Unless you can find a way to observe Z (or find that holy grail: an instrument for T), you could run the fanciest estimator in the world and it would make little difference.

When you run a regression, you control for the X you can observe. When you match, you are simply matching based on those same X. If X are a pretty good proxy for Z, then you’ve probably reduced your endogeneity bias. But whether you proceed via matching or regression is of little consequence.
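The point is easy to see in a toy simulation (a sketch, not anyone's actual study: the data-generating process and all variable names below are invented for illustration). An unobserved Z drives both T and Y; we control for an observed X, once by regression and once by nearest-neighbour matching, and both land on the same biased answer:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Z is the unobserved confounder; X is an observed, imperfect proxy for it.
Z = rng.normal(size=n)
X = 0.5 * Z + rng.normal(size=n)
T = (X + Z + rng.normal(size=n) > 0).astype(float)   # Z drives selection into T
Y = 1.0 * T + 2.0 * Z + X + rng.normal(size=n)       # true effect of T is 1.0

# (a) OLS of Y on T, controlling for X (Z omitted)
A = np.column_stack([np.ones(n), T, X])
ols_effect = np.linalg.lstsq(A, Y, rcond=None)[0][1]

# (b) One-to-one nearest-neighbour matching on the very same X
treated = np.where(T == 1)[0]
control = np.where(T == 0)[0]
order = np.argsort(X[control])
Xc = X[control][order]
idx = np.searchsorted(Xc, X[treated])
lo = np.clip(idx - 1, 0, len(Xc) - 1)
hi = np.clip(idx, 0, len(Xc) - 1)
nearest = np.where(np.abs(Xc[lo] - X[treated]) <= np.abs(Xc[hi] - X[treated]), lo, hi)
match_effect = np.mean(Y[treated] - Y[control][order][nearest])

print(f"OLS on X:      {ols_effect:.2f}")   # well above the true 1.0
print(f"matching on X: {match_effect:.2f}") # biased in just the same way
```

Neither estimator ever sees Z, so both inherit roughly the same upward bias; swapping one for the other changes the weighting, not the identification.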

For causal inference, the most important difference between regression and matching is what observations count the most. A regression tries to minimize the squared errors, so observations on the margins get a lot of weight. Matching puts the emphasis on observations that have similar X’s, and so those observations on the margin might get no weight at all. For the math see this excellent little book.

Matching might make sense if there are observations in your data that have no business being compared to one another; dropping those comparisons can produce a better estimate. But the 800 lb identification problem is still staring at you from the corner.

Even if you say he only weighs 363 kilos.

16 thoughts on “The cardinal sin of matching”

  1. This has annoyed me so many times: people absolutely bashing least squares for perceived endogeneity and then moving to matching without so much as a word on how this solves the identification issues… A weight has been lifted from my heart!

    From my very limited knowledge, I’ve always seen matching more as a way to safeguard against extrapolation (by putting more weight on those observations with similar X’s), which is probably useful in itself. I.e. you’re more certain that you’re comparing like with like in terms of the values of the X’s, whereas in least squares this is not necessarily the case so you might end up relying a lot on the (typically) linear functional form to construct a counterfactual for your “treated” observations. (Not sure this makes sense, but it’s my best attempt at explaining what I thought.)

  2. Chris, I agree. But I would also emphasize the kinship between matching and IV. IV can be viewed as a weighting exercise too: it gives more weight to variation in regressors that is correlated with instruments. So they descend into the mud together.

  3. Is this not what the authors of the paper on Foreign Aid Shocks as a Cause of Violent Armed Conflict (posted a month or two ago) did? They matched their sample on the possible pre-civil-war predictors that politicians might use to anticipate conflict, though I’m not sure why that would make a difference. I remember not really following the logic, although I could be wrong. I always think of matching as preventing one from building a model that relies on extra assumptions about observations dissimilar from your “treated” cases, i.e., one uses it simply to avoid comparisons based on non-overlapping and unbalanced data.

  4. Good point. Corrected. This is where it shows that I am no statistician.

    But I think this is a semantic point, correct? Neither confronts the main causal inference concern: unobservables.

  5. Well, all approaches are muddier than most papers would have us believe. But at least an instrumental variable has the potential to address the unobservables problem. It’s suited to the task at hand.

  6. oh come on – put your Canadian away and pull out the evil economist in you – we want names. If academics publish nonsense it’s fair game to point that out publicly (unless they’re on your tenure committee ;-))

  7. Yeah, this is my own pet peeve and a semantic point. I like to say that the best way to deal with unobserved variables is to observe and correct for them, and that is the essence of matching and OLS. Naturally this doesn’t deal with unobservables, which is your point. As you know, however, it’s not always the case that IV is better than other identification strategies, which is why I try to express things in a way that puts different methods on a more equal footing.

  8. Wow. I guess I’m just a naive first-year grad student, but I’m really appalled that anyone pretending to do serious empirical research has such minimal understanding of what these estimators do. Where did they get the idea that matching addresses omitted variables bias? I don’t understand how that misconception would even arise, other than a vague sense that matching is more sophisticated and thus must have more desirable properties.

    I’m not sure I quite agree with David Roodman’s take on IV vs. matching. In a world with heterogeneous treatment effects, it’s true that IV estimation produces a “weighted” estimate (LATE) that is different from the average effect we would estimate if we could write a perfect OLS specification (ATE). But that isn’t all that IV does. To see why, imagine some treatment that has zero effect for all possible values of the covariates, but is correlated with unobserved variables that greatly affect the outcome. In this instance, both OLS and matching estimates will be complete garbage: they’re so obscenely biased that they’re not estimating anything meaningful.

    If we have a good instrument for the treatment, however, then IV will be consistent for the correct treatment effect: 0. (This is because I assumed that 0 was the treatment effect everywhere.) It isn’t just estimating a reweighted average treatment effect; it’s giving us a correct answer when neither OLS nor matching works at all. Describing this as a “weighting exercise” isn’t really accurate under the normal meaning of “weighting”. IV does a lot more than that.*

    * If, of course, you have a good instrument, which is extremely rare.

  9. Good rant, I like – it kills me too.

    I guess one of the good bits of matching is that it gets you to think about common support, which is pretty much what Nicholas is saying above. Before doing matching, I hadn’t thought about this issue at all.

  10. What do you think of structural equations (simultaneous equations)? I keep thinking of the paper by Deaton…

  11. Hi, I’m one of the authors of “Foreign Aid Shocks as a Cause of Violent Armed Conflict” that was mentioned in the previous comments (by “G” up above). Since our paper came up, I wanted to clarify a few things about our approach. The bottom line is that our claims to causal identification stemmed from our attempts to measure and condition on potential confounders, not from matching per se. I hope we aren’t falling into any of the traps Chris mentions because I agree with most of his rant!

    In our paper, we estimate the causal effect of a large drop in foreign aid on the subsequent probability of civil conflict. We were worried that donors might be able to predict civil conflict several years in advance and then withdraw aid, making it appear as if aid shocks had an effect when they didn’t. Unfortunately, this potential confounder is unobservable, unless we can somehow observe aid donors’ perceptions of the risk of violent conflict in developing countries (on a year-to-year basis!), so we turned to the causal inference literature.

    There are three ways of making causal inferences about the effect of X on Y: (1) characterize a complete path from X to Y, (2) use experiments, natural or otherwise (IV, RD, etc.), and (3) condition on observables. In our paper, we found that options 1 and 2 simply weren’t viable for our question — we couldn’t find a credible instrument or specify a complete mechanism — so we moved on to conditioning on observables via both regression and matching + regression. Obviously, this weakens the claims we can make about identification.

    I agree with Chris (I think), so we weren’t naive enough to somehow assume that matching would magically solve our endogeneity problem! I’m not sure which papers Chris is referring to, but I share his extreme skepticism of the “matching pixie dust” as a ready-made solution to endogeneity. Instead, we attempted to observe the unobserved confounders and condition on them. We couldn’t measure aid donors’ perceptions of the likelihood of conflict, so instead we attempted to measure all of the systematic inputs to these perceptions. In particular, we conditioned on the same variables that the CIA uses in its quantitative predictions of civil conflict, as well as some other predictors of conflict onset. Then, by conditioning on these inputs, either by regression or matching, we argue that we’ve mitigated the most obvious selection effect we face. Like most observational studies, we can’t guarantee that there aren’t other unobserved confounders, but we’ve tried hard to measure the obvious confounders and condition on them.

    I think Chris’ points are dead-on and that too many people think that matching can do magic. We’re trying hard to get credible answers to a tough question, so I’d welcome comments and criticisms, here or offline.

    Cheers,
    –Rich Nielsen ([email protected])

  12. Chris, when you cited Mostly Harmless Econometrics, were you referring to the result on pp. 75-76?

    Angrist & Pischke show there that when Y is regressed on T and a full set of dummies for the possible values of X, OLS estimates a weighted average of covariate-specific treatment effects, putting the most weight on “cells where there are equal numbers of treated and control observations.”

    That’s quite different from saying that “observations on the margins get a lot of weight”, although there’s probably some other sense and other scenario (with a continuous covariate) in which the latter is right.

    This is kind of orthogonal to your main points, but it might be worth clarifying since the weighting result (Angrist 1998, Econometrica) is underappreciated.
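The Angrist (1998) result the comment describes can be checked numerically. Below is a sketch (cell probabilities and effects are made up for illustration): with a saturated set of X dummies, the OLS coefficient on T is, as an exact algebraic identity, the average of the within-cell treated–control gaps weighted by n_x · p_x · (1 − p_x), which is largest where treated and control counts are most balanced:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
K = 4                                     # discrete covariate with 4 cells

X = rng.integers(0, K, size=n)
p = np.array([0.1, 0.3, 0.5, 0.8])        # treatment probability varies by cell
T = (rng.random(n) < p[X]).astype(float)
effect = np.array([0.0, 1.0, 2.0, 3.0])   # heterogeneous cell-specific effects
Y = effect[X] * T + X + rng.normal(size=n)

# OLS of Y on T plus a full (saturated) set of X dummies
D = (X[:, None] == np.arange(K)).astype(float)
A = np.column_stack([T, D])
beta_T = np.linalg.lstsq(A, Y, rcond=None)[0][0]

# Angrist (1998): weights proportional to n_x * p_x * (1 - p_x)
num = den = 0.0
for x in range(K):
    m = X == x
    px = T[m].mean()
    gap = Y[m][T[m] == 1].mean() - Y[m][T[m] == 0].mean()
    w = m.sum() * px * (1 - px)
    num += w * gap
    den += w

print(beta_T, num / den)   # identical up to floating-point error
```

Note the weights peak at p_x = 0.5, matching the “equal numbers of treated and control observations” phrasing, and that cells where nearly everyone (or no one) is treated contribute almost nothing.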