Chris Blattman

Remember the study showing most psychology studies don’t replicate? Turns out it doesn’t replicate.


Isn’t it ironic?

A recent article by the Open Science Collaboration (a group of 270 coauthors) gained considerable academic and public attention due to its sensational conclusion that the replicability of psychological science is surprisingly low. Science magazine lauded this article as one of the top 10 scientific breakthroughs of the year across all fields of science, reports of which appeared on the front pages of newspapers worldwide.

We show that OSC’s article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis.

Indeed, the evidence is consistent with the opposite conclusion — that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%. The moral of the story is that meta-science must follow the rules of science.

A new article by Gilbert, King, Pettigrew, and Wilson.

This excerpt is rather amazing:

For example, many of OSC’s replication studies drew their samples from different populations than the original studies did. An original study that measured Americans’ attitudes toward African Americans was replicated with Italians, who do not share the same stereotypes;

an original study that asked college students to imagine being called on by a professor was replicated with participants who had never been to college;

and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus was replicated with students who do not commute to school.

What’s more, many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways: An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon;

an original study that gave younger children the difficult task of locating targets on a large screen was replicated by giving older children the easier task of locating targets on a small screen;

an original study that showed how a change in the wording of a charitable appeal sent by mail to Koreans could boost response rates was replicated by sending 771,408 e-mail messages to people all over the world (which produced a response rate of essentially zero in all conditions).

While the fact that this was detected and reported quickly is (in some sense) evidence of science working, something seems broken in the review and refereeing process if the original paper made it through to publication with these errors.

Update: The original group of authors don’t like the replication of their replication. See Andrew Gelman’s excellent and skeptical discussion. I have read none of these closely enough to have an opinion.

150 Responses

  2. C’mon, the title of the post is misleading! It’s not a new article; it’s a comment on the original replication paper. And the comment does not try to replicate the study — it simply re-analyzes the results. BTW, it uses a single, inadequate metric for replication (with erroneous interpretations of a CI and its properties) and compares different studies as if they belong to the same reference class.

    They do bring some important issues to attention, like the validity of some replications that seem to drift away from the originals. But their conclusion – high reproducibility – is overreaching.

  3. This week’s Science Mag also has this new article on replicability in econ lab experiments:
    “We find a significant effect in the same direction as the original study for 11 replications (61%); on average the replicated effect size is 66% of the original. The reproducibility rate varies between 67% and 78% for four additional reproducibility indicators, including a prediction market measure of peer beliefs.”

  4. I would say design differences within a certain degree verify our inference of concepts from operationalized experiments. Psychologists deal with concepts, not procedures.

  5. On the other hand, if the effect is so narrow as to only work in very specific populations under narrow conditions, is it worth reporting on to begin with?

Comments are closed.