Since yesterday’s pointy-headed statistics post proved unexpectedly viral, I assume you want more econometric rants. So here’s something that has been bothering me all week.

When I was in graduate school, economists discovered clustered standard errors. Or so I assume, because it became almost a running joke that the first question in any seminar was “did you cluster your standard errors?”

Lately I’ve been getting the same question from referees on my field experiments, and to the best of my knowledge, this is wrong, wrong, wrong.

So, someone please tell me if I’m mistaken. And if I’m not, a plea to my colleagues: this is not something to write in your referee reports. Please stop.

*[Read the follow-up post here]*

I guess I should explain what clustering means (though if you don’t know already there’s a good chance you don’t care and it’s not relevant to your life). Imagine people in a village who experience a change in rainfall or the national price of the crop they grow. If you want to know how employment or violence or something responds to that shock, you have to account for the fact that people in the same village are subject to the same unobserved forces of all varieties. If you don’t, your regression will tend to overstate the precision of any link between the rainfall change and employment. In Stata, this is mindlessly and easily accomplished by putting “, cluster” at the end of your regression, and we all do it.

This makes sense if you have observational data (at least sometimes). But if you have randomized a program at the individual level, you do not need to cluster at the village level, or some other higher unit of analysis. Because you randomized.

Or so I believe. But I don’t have a proof or citation to one. I have asked some of the very best experimentalists in the land this week, and all agree with me, but none have a citation or a proof. I could run a simulation to prove it, but surely someone smarter and less lazy than me has attacked this problem?
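
In lieu of that proof, here is the lazy version of the simulation (a minimal sketch in Python rather than Stata, assuming a village random effect plus individual noise, a true treatment effect of zero, and no spillovers; all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_experiment(n_villages=100, n_per=50):
    # Village-level shock shared by everyone in the village
    village = np.repeat(rng.normal(0, 1, n_villages), n_per)
    noise = rng.normal(0, 1, n_villages * n_per)
    # Treatment randomized at the INDIVIDUAL level; true effect is zero
    t = rng.permutation(np.arange(n_villages * n_per) % 2)
    y = village + noise
    est = y[t == 1].mean() - y[t == 0].mean()
    # Conventional (non-clustered) SE of the difference in means
    se = np.sqrt(y[t == 1].var(ddof=1) / (t == 1).sum()
                 + y[t == 0].var(ddof=1) / (t == 0).sum())
    return est, se

sims = np.array([one_experiment() for _ in range(500)])
empirical_sd = sims[:, 0].std(ddof=1)   # true sampling variability
mean_naive_se = sims[:, 1].mean()       # what the unclustered SE reports
print(empirical_sd, mean_naive_se)      # roughly equal
```

With individual-level assignment, the village shocks are (approximately) balanced across arms, so the conventional SE tracks the true sampling variability and the clustered correction buys you nothing.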

While I’m on the subject, my related but nearly opposite pet peeves:

- Reviewing papers that randomize at the village or higher level and do not account for this through clustering or some other method. This too is wrong, wrong, wrong, and I see it happen all the time, especially in political science and public health.
- Maybe worse are the political scientists who persist in interviewing people on either side of a border and treating some historical change as a treatment, ignoring that they basically have a sample size of two. This is not a valid method of causal inference.

**Update:** The follow-up post is here.

## 43 Responses


Peter, if I understand your question correctly, that sounds like a SUTVA violation, which is a serious issue but not solved by clustered standard errors (see also pp. 4-5 on peer effects in the Weiss et al. paper that Stuart linked to).

Seems that you might still have to cluster even with individual-level random assignment. For instance, imagine the case where treatment is randomly assigned at the individual level within villages, and there are significant spillovers between treatment and control individuals within a village but not across villages. Seems to me this would require clustering at the village level. Would love to hear if this is wrong, however, and why.

Stuart, I think Mike, J.R., and Dan’s paper is very good, but I agree with Chris that “if you have randomized a program at the individual level, you do not need to cluster at the village level, or some other higher unit of analysis”. (“You do not need” doesn’t mean “you absolutely shouldn’t”.)

As some other commenters have mentioned, if individuals were randomly assigned, then Neyman’s mode of randomization inference can justify the use of robust standard errors (not clustered at a higher level) in large samples. This result is discussed in, among other places, Imbens & Rubin’s book (as Doug mentioned), and extended to regression-adjusted estimation of average treatment effects in my paper “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique”. Neyman’s mode of randomization inference tries to construct a confidence interval for the average treatment effect on the experimental sample. It doesn’t try to generalize to a superpopulation or to what would happen if the same treatment were given by different service providers, if the villages experienced different economic shocks, etc. Thus, at best, it answers a very narrow question and doesn’t capture all the uncertainty we should have about broader policy-relevant questions. Nevertheless, I think this framework can be useful for lower bounds on our uncertainty, because it’s easier to agree on what the unit of randomization was than to agree on what’s a reasonable model of the factors that affect outcomes.

When people discuss Chris’s question in a regression model framework instead of a randomization inference framework, they’re implicitly asking a different question. Model-based frequentist inference considers hypothetical replications of the study in which each new replication brings not a new random assignment, but a new random draw of each person’s error term “epsilon”. Since epsilon has to represent everything that determines the outcome besides treatment and the covariates in the regression model, we inevitably have some dependence between individuals’ epsilons, e.g. because a group of patients share a service provider, because people in the same local labor market are subject to the same “random” economic shocks, or because students in the same classroom experience the same “random” events such as a dog barking just outside the classroom on exam day (unless we want to hold service providers, economic shocks, and dog barks fixed across our hypothetical replications of the study). One of the important contributions of Mike et al.’s paper is to show that addressing such dependence can be harder than people think and even intractable.

The desired inference depends on what we want to generalize to, and there’s no one right answer. I think it could often be a good idea to show more than one analysis. But sometimes the best we can do on the broader questions is an informal discussion. E.g., suppose we randomly assign 40 schools in two school districts. Since we’re probably interested in the broader question of what would happen in other school districts, should we conclude that we have to cluster at the district level, where we have a sample size of 2? Or should we construct confidence intervals from SEs clustered at the school level (perhaps using the method in Imbens & Kolesar’s paper “Robust Standard Errors in Small Samples: Some Practical Advice”), but acknowledge that the findings don’t necessarily generalize to all districts?

This kind of issue also comes up in nonexperimental studies. Jeff Wooldridge (“Cluster-Sample Methods in Applied Econometrics”, American Econ Review, 2003) mentions that Donald & Lang criticized Card & Krueger’s New Jersey – Pennsylvania minimum wage study for ignoring state-level clustering. Wooldridge points out that accounting for such clustering is impossible with only two states. He writes, “The criticism in the G = 2 case is indistinguishable from a common criticism of difference-in-differences (DID) analyses: How can one be sure that any observed difference in means is due entirely to the policy change?” Both here and in randomized experiments, my view is that formal statistical inference never captures all the uncertainty we care about. Confidence intervals and tests can be useful for lower bounds on our uncertainty, but sources of uncertainty that they don’t capture should be acknowledged. Whether that should be done formally or informally may depend on the situation.

See also Mosteller and Tukey’s 1977 book “Data Analysis and Regression” (sections on “Choosing an error term”, pp. 123-125, and “Supplementary uncertainty and its combination with internal uncertainty”, pp. 129-131).

What do folks think of this MDRC paper?

Estimating the Standard Error of the Impact Estimator in Individually Randomized Trials with Clustering

Michael J. Weiss, J. R. Lockwood, and Daniel F. McCaffrey (04/2014)

In many experimental evaluations in the social and medical sciences, individuals are randomly assigned to a treatment arm or a control arm of the experiment. After treatment assignment is determined, individuals within one or both experimental arms are frequently grouped together (e.g., within classrooms or schools, through shared case managers, in group therapy sessions, or through shared doctors) to receive services. Consequently, there may be within-group correlations in outcomes resulting from (1) the process that sorts individuals into groups, (2) service provider effects, and/or (3) peer effects. When estimating the standard error of the impact estimate, it may be necessary to account for within-group correlations in outcomes. This article demonstrates that correlations in outcomes arising from nonrandom sorting of individuals into groups lead to bias in the estimated standard error of the impact estimator reported by common estimation approaches.

http://www.mdrc.org/publication/estimating-standard-error-impact-estimator-individually-randomized-trials-clustering

I agree with Coady Wing and the paper by Cameron and Miller (now published in JHR) is helpful.

I have a slight suspicion that Chris may be conflating the conditions for coefficient consistency with those for standard-error consistency. Randomization removes the correlation between the error and the treatment variable, so it delivers consistency of OLS. But the standard formula for standard errors assumes homoscedastic, uncorrelated errors. If the errors are in fact correlated, the formula assumes a different information content than we actually have, so the standard-error estimate is inconsistent for the true standard error. The more subtle conditions for consistency are spelled out in the Cameron-Miller paper.

Some pedagogical simulations trying to make a similar point are available here:

http://www.depauw.edu/learn/stata/Teaching_Complex_Survey_Design_October_22_2013.pdf

It’s been a very long time since I took an econometrics class and I seem to disagree with everyone else here so I’d appreciate somebody pointing out where I’m wrong.

My memory (plus some Googling) was that the default calculations of the standard errors of OLS coefficients assume spherical errors (as does the Gauss-Markov Theorem proving that OLS is BLUE).

Even if randomization occurred at the individual level, we’d still expect individuals within the same cluster to experience the same shocks. Wouldn’t this be likely to violate the spherical errors assumption? In other words, I’d expect the variance of the error to vary by cluster (leading to heteroscedasticity) and errors within clusters to be correlated.

I think this means that you need to correct your standard errors. Robust SEs are an option but SEs that specifically account for the clusters are more efficient.

Another issue (I think) occurs if you randomized at the individual level but selected people for participation in the study at the level of the cluster. It seems like this should further reduce your effective sample size.

Tell me what I’m missing!

Let’s consider the extreme case in which all of the error is from the village-specific component. If you don’t cluster, you might claim to have 10,000 observations, but in practice, if there are only 100 villages, there’s no way that you can have more than 100 independent observations in your sample. By failing to cluster, you are inflating your t-statistics by acting like you have 10,000 independent random draws.
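
A quick sketch of that extreme case (hypothetical numbers: 100 villages of 100 people each, all of the error village-specific, and assuming, for the sake of the example, that treatment is assigned at the village level so clustering actually binds):

```python
import numpy as np

rng = np.random.default_rng(1)

def one_trial(n_villages=100, n_per=100):
    # ALL of the error is a village-level shock; treatment assigned by village
    v_shock = rng.normal(0, 1, n_villages)
    v_treat = rng.permutation(np.arange(n_villages) % 2)
    y = np.repeat(v_shock, n_per)   # 10,000 observations...
    t = np.repeat(v_treat, n_per)   # ...but only 100 independent draws
    est = y[t == 1].mean() - y[t == 0].mean()
    naive_se = np.sqrt(y[t == 1].var(ddof=1) / (t == 1).sum()
                       + y[t == 0].var(ddof=1) / (t == 0).sum())
    return est, naive_se

sims = np.array([one_trial() for _ in range(300)])
# How badly does the unclustered SE understate the true variability?
ratio = sims[:, 0].std(ddof=1) / sims[:, 1].mean()
print(ratio)  # roughly 10 = sqrt(100): naive SE about 10x too small
```

The true sampling variability reflects 100 independent village draws, while the unclustered SE acts as if there were 10,000, so t-statistics get inflated by roughly the square root of the cluster size.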

Hi Dr. Blattman,

Others have already pointed out that looking at the Moulton inflation factor shows that when assignment is at the individual level, clustering isn’t necessary. If you’re like me, this probably doesn’t help much with the intuition, though. Here’s an alternate explanation: if you assume each unit has a defined outcome under treatment and control (i.e. SUTVA), then each treatment and control unit in a randomized experiment is a random draw from one of two distributions — the distribution of potential treatment outcomes and the distribution of potential control outcomes. (This is a slight simplification. For full details see page 87 of Imbens and Rubin, 2015.) Thus, the treatment and control means are averages of independent, identically distributed variables, and the usual estimate of the variance (without clustering) is justified. Note that this explanation does not make any assumptions whatsoever about the distribution of potential outcomes in the overall population (other than the basic conditions necessary for the CLT to hold). Also note that this would not be the case if you randomized at a higher level.

On a related note, these same issues are present when testing for baseline balance in a randomized experiment. I have seen quite a number of papers where the authors randomize at a group level, then do balance tests at the unit level, and erroneously conclude that their randomization failed.

“Maybe worse are the political scientists who persist in interviewing people on either side of a border and treating some historical change as a treatment, ignoring that they basically have a sample size of two. This is not a valid method of causal inference.”

You’re stating this a little too strongly. You have to make some strong assumptions about what it means to be a treated unit, but you can do it. Melissa Dell’s mining mita paper does a decent job of outlining this.

All I gotta say is, ex post / ex ante.

“Maybe worse are the political scientists who persist in interviewing people on either side of a border and treating some historical change as a treatment, ignoring that they basically have a sample size of two. This is not a valid method of causal inference.”

Aren’t basically half of all the examples given by Robinson & Acemoglu in “Why Nations Fail” this right here?

RT @cblatts: Are you a clusterjerk or am I? (A bleg on whether I need to cluster std errors in a field experiment.) https://t.co/R21D5pkKpg

@cdsamii @FlorianFoos @BrendanNyhan @ClaytonNall @cblatts @cdsamii Yes! This paper may help: https://t.co/r6jnBQTdSl

I think you are asking the wrong question. The onus is on the other parties to show why some deviation from the simple model is needed, not on you to show why it is not needed. I can see why you would prefer a handy proof or paper to show your case; just pointing out that there is something backward about having to prove some enhancement is not needed. Fwiw, it seems the underlying logic for why s.errors are clustered would still apply. Obs within a group would still be within a group, and thus possibly correlated, regardless of how the sample was chosen. (Obviously, this depends on the model.) But will follow the comments for arguments, proofs, to the contrary.

Forgot the link:

http://cameron.econ.ucdavis.edu/research/Cameron_Miller_Cluster_Robust_October152013.pdf

This review article by Cameron and Miller is helpful. On page 21, they write:

“First, given V[b] defined in (7) and (9), whenever there is reason to believe that both the regressors and the errors might be correlated within cluster, we should think about clustering defined in a broad enough way to account for that clustering. Going the other way, if we think that either the regressors or the errors are likely to be uncorrelated within a potential group, then there is no need to cluster within that group.”

I think the second sentence is what you care about.

You can see the idea more easily in the parameterized Moulton formula. It’s equation (6) in the paper. The equation shows that the “inflation” is a product of the within-cluster correlation in the regressor (treatment) and the within-cluster correlation in the outcome. If either of those terms is equal to 0, then there is no variance inflation to worry about. In an individual-level randomized experiment, the within-cluster correlation in the treatment will be zero, and so there is no need to cluster.

The one-regressor example that they give in Section IIA applies well to experiments with person-level random assignment, and it does not use the parametric Moulton approach. It sets things up with the sort of clustered standard errors that come out of the Stata cluster option.

I had the same puzzle when using fixed effects at the level where you’d usually cluster. Seems overkill to do both.

@eduardo_leoni @BrendanNyhan @ClaytonNall @cblatts what I meant: this point doesn’t matter for consistency. It’s an efficiency issue.

@FlorianFoos @BrendanNyhan @ClaytonNall @cblatts they use the original Moulton factor, hides the key result (correlation in treatment).

@cdsamii @BrendanNyhan @ClaytonNall @cblatts Arceneaux and Nickerson 2009 PolAnalysis would be a good cite imo

@eduardo_leoni @BrendanNyhan @ClaytonNall @cblatts ? W/ design based methods, effects are always presumed to vary arbitrarily.

@BrendanNyhan @ClaytonNall @cblatts these are all implications of the generalized Moulton factor (cf Mostly Harmless).

@cdsamii @BrendanNyhan @ClaytonNall @cblatts if the effect of treatment varies across clusters you would still have to account for it.

@BrendanNyhan @ClaytonNall @cblatts this is because of the negative residual correlation of *treatment* vars within the group.

@BrendanNyhan @ClaytonNall @cblatts if you assign *within* groups clustering can also be consistent and yield *smaller* s.e.

@BrendanNyhan @ClaytonNall @cblatts Chris is correct: clustering at the level of assignment is, typically, correct.

@ClaytonNall @cblatts @cdsamii per your other post, thinking about evaluating experiment w/RI instead suggests this doesn’t make sense

It’s not a proof, but just look at the Moulton formula for the design effect of clustering on standard errors. You only need to cluster if your outcome has a nonzero covariance across observations within clusters, and the same is true of your main explanatory variable of interest. Randomizing at the individual level destroys the second of these within cluster covariances.
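
For equal-sized clusters, the simple version of that design effect is deff = 1 + (m - 1) * rho_x * rho_u, and you can just plug in numbers (a sketch; the function name and the example values are made up):

```python
def moulton_deff(m, rho_x, rho_u):
    """Moulton design effect for equal-sized clusters of size m, with
    within-cluster correlation rho_x in the regressor and rho_u in the errors."""
    return 1 + (m - 1) * rho_x * rho_u

# Village-level assignment: treatment is identical within a village (rho_x = 1),
# so even a modest error correlation badly inflates the variance.
print(moulton_deff(m=50, rho_x=1.0, rho_u=0.2))  # ~10.8

# Individual-level assignment: treatment uncorrelated within villages (rho_x = 0),
# so the inflation factor collapses to 1 regardless of rho_u.
print(moulton_deff(m=50, rho_x=0.0, rho_u=0.2))  # 1.0
```

Either correlation being zero kills the whole product, which is exactly why individual-level randomization lets you off the hook.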

@ClaytonNall @cblatts i think we need @cdsamii on the case

@BrendanNyhan @cblatts Who’s arguing this? (Note that HHs could be clusters, though, if you are not Kish sampling, etc.)

@cblatts report back on the results!