One of my favorite science writers, Ben Goldacre, enters the so-called Worm Wars. He’s not alone, with a flurry of new articles today. The question is simple: is a deworming pill that costs just a few cents one of the most potent anti-poverty interventions of our time?
Below is the picture from Goldacre’s post. I assume Buzzfeed editors chose it. It’s a nice illustration that nothing you will read in this debate is dispassionate. Everyone wants one thing: your clicks (and retweets, and likes, and citations). Most writers sincerely want the truth too. Sadly the two are not always compatible.
In brief: Ted Miguel and Michael Kremer are Berkeley and Harvard economists who ran the original deworming study that showed big effects of the medicine on school attendance in Kenya—one of the few studies to attempt to measure such impacts. That study ignited the impact evaluation movement in international development, especially through their students (like me). It also ignited a movement to deworm the world. This is a big claim, worth investigating. Calum Davey led the team that did the replication.
I know this study. In fact, as a first-year graduate student I spent a summer working for Miguel and Kremer designing their long-term follow-up survey. Relationships are incestuous on all sides of the deworming debate, so you can hardly call me an impartial judge. Nonetheless, bear with me as I try.
I haven’t paid much attention to the deworming world for more than a decade. So I spent last night and this morning reading as much as I could. There’s an overwhelming amount to process, but I’ve drawn a few early conclusions.
The bottom line is this: both sides exaggerate, but the errors and issues with the replication seem so great that it looks to me more like attention-seeking than dispassionate science. I was never convinced that we should deworm the world. There are clearly serious problems with the Miguel-Kremer study. But, to be quite frank, you have to throw so much crazy sh*t at Miguel-Kremer to make the result go away that I believe the result even more than when I started.
Backing up, we should remember that most scientific studies don’t stand up to scrutiny very well. Most are utterly wrong.
This was Ben Goldacre’s overarching point, and I couldn’t agree more. But reading the details, I think he may have accidentally chosen an example of the opposite problem: a love of the witch hunt. A bias among authors to have a flashy result, and a bias among journals to publish it.
If you throw enough at a study, the results will eventually get imprecise enough that you can’t draw a strong conclusion. This is why, ultimately, the body of evidence matters. In a single study, the amount of assault a result can take is a good indication of its quality.
By this metric, the Miguel-Kremer deworming result is actually impressive. Davey and company find some real errors, but most don’t change the results. They have to throw a whole lot at the results to put them just beyond the realm of statistical significance.
From what I can tell, you have to do three or four things at once:
1. You have to divide it into two smaller experiments. The medicine was phased in over time: some people received medicine in year 1 and some in year 2. If you split years 1 and 2 into two separate experiments, the precision goes down. Naturally. (See the simulation sketch after this list.) But the rationale for doing that is completely weird. I've never seen any study do this before.
2. You also have to ignore the fact that the disease could pass to other people. If you give medicine to person A, it can affect the health of person B nearby (since a treated A no longer passes on infections). That means if you compare A and B, you're biased towards underestimating the effect of the medicine. Many health studies (amazingly) make this mistake. So does the replication.
3. You have to care about school rather than pupil effects. Some schools are small (50 pupils) and some are large (1,300 pupils). As it happens, the medicine has a bigger effect in bigger schools, probably because the medicine stops the disease from spreading. If you ignore this, and take school averages rather than pupil averages, you will get a lower estimate of how well the medicine works. Again, I'm not sure why you'd do that.
4. You have to recode the groups to put people who didn't get the medicine into the group that was supposed to get the medicine. This looks to me like a mistaken understanding of the experiment by Davey and team.
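To make points 1 and 3 concrete, here is a minimal simulation sketch in Python. Every number in it is invented (the school counts, the attendance rates, the size-dependent effect); it is not the study's data or code. It shows the two mechanics at work: when the effect is larger in larger schools, school-weighted averages come in below pupil-weighted ones, and analyzing half the sample inflates the standard error by roughly the square root of two.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100                                     # hypothetical schools
    sizes = rng.integers(50, 1300, n).astype(float)
    treat = rng.permutation(n) < n // 2         # half the schools treated

    # Invented effect on attendance that grows with school size
    # (say, because treatment also cuts transmission within the school).
    effect = 0.02 + 0.06 * (sizes - sizes.min()) / np.ptp(sizes)
    attend = 0.70 + effect * treat + rng.normal(0, 0.03, n)

    def estimate(keep, weights):
        # Weighted treatment-control gap in school means, with a rough
        # (unweighted) standard error; good enough for the illustration.
        t, c = keep & treat, keep & ~treat
        gap = (np.average(attend[t], weights=weights[t])
               - np.average(attend[c], weights=weights[c]))
        se = np.sqrt(attend[t].var(ddof=1) / t.sum()
                     + attend[c].var(ddof=1) / c.sum())
        return gap, se

    everyone = np.ones(n, bool)
    half = np.arange(n) < n // 2                # stand-in for "year 1 only"

    print("pupil-weighted, pooled:   gap=%.3f, se=%.3f" % estimate(everyone, sizes))
    print("school-weighted, pooled:  gap=%.3f, se=%.3f" % estimate(everyone, np.ones(n)))
    print("pupil-weighted, one half: gap=%.3f, se=%.3f" % estimate(half, sizes))

Run it with a few different seeds: the school-weighted estimate sits below the pupil-weighted one, and the half-sample standard error is about 1.4 times the pooled one. None of this says which weighting is right for the real data; it only shows how each choice mechanically pushes the numbers around.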
Reading Miguel-Kremer's response in the same journal (none of the journalists I've read seem to have read or cited it), here's the amazing thing: Just doing two or three of these things is not enough to make the result go away. It looks to me like you have to do three or all four. In particular, if you do everything except split the experiment in two, everything holds. Most of the "debunking" rides on splitting the experiment in two.
[Read World Bank economist Berk Ozler’s similar points in greater depth.]
There was a lot to absorb, so I invite other views. But my quick read is this: Davey and team's choices 1 through 4 are useful checks on the data, but rather weird ones. A reasonable scientist might make one of them. Maybe, and probably erroneously. But all four? Something is amiss.
This is all rather technical, so it’s not surprising that the journalists writing on the debate don’t understand. But if you’re not a statistician, here’s what should make you suspicious: Every single choice ticks in the direction of making the effects of the medicine less impressive. This is either correct, coincidental, convenient, or conniving.
Since I see no argument for correct, someone should interrogate the other three. Instead most of the journalism has accepted the article at face value.
To me there’s a simple and sad explanation why: Whether it’s a sensational photo, a sensational result, or a sensational take down of a seminal paper, everyone has incentives to exaggerate. This whole episode strikes me as a sorry day for science.
[New: My subsequent post on the ten things I learned from the trenches of the Worm Wars]
28 Responses
Dividing a sample into two subsamples has one enormous benefit. Sure, the statistical precision goes down by root 2, but it allows for systematic checks on the "look-elsewhere effect".
If a measured variable looks interesting in one subset and not in the second, it is much more likely to be a statistical fluke. If a 3-sigma improvement in sample A becomes a 0-sigma effect in sample B, that is probably not a 1.5-sigma effect on the whole sample. It is probably no effect plus statistical fluctuation.
Checking whether the effect in year 1 and year 2 is the same is something we do all the time in my branch of physics (observational astrophysics).
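That look-elsewhere point is easy to demonstrate. Here is a minimal Monte Carlo sketch in Python, with everything invented: 1,000 pure-noise outcomes and a meaningless "treatment", so there is no true effect anywhere. Typically a handful of outcomes still clear 3 sigma in one half of the sample, and those same outcomes hover near zero in the other half.

    import numpy as np

    rng = np.random.default_rng(0)
    n_subjects, n_outcomes = 1000, 1000
    half_a = np.arange(n_subjects) < n_subjects // 2
    treat = rng.permutation(n_subjects) < n_subjects // 2

    # Pure noise: 'treatment' truly affects none of these outcomes.
    y = rng.normal(0, 1, (n_subjects, n_outcomes))

    def zscores(keep):
        t, c = keep & treat, keep & ~treat
        gap = y[t].mean(0) - y[c].mean(0)
        se = np.sqrt(y[t].var(0, ddof=1) / t.sum() +
                     y[c].var(0, ddof=1) / c.sum())
        return gap / se

    z_a, z_b = zscores(half_a), zscores(~half_a)
    hits = np.abs(z_a) > 3                      # 'discoveries' in sample A
    print("outcomes clearing |z| > 3 in sample A:", hits.sum())
    print("the same outcomes' z-scores in sample B:", np.round(z_b[hits], 2))

With a two-sided tail probability of about 0.0027 per outcome, roughly three of the 1,000 null outcomes should clear the 3-sigma bar in sample A by chance alone, and sample B then exposes them as flukes.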
Dr. Blattman,
Excellent post. Definitely agree that this is starting to feel like a witch hunt. Leaving aside the debate over this particular study, the one thing that everyone seems to agree on is that the nitty gritty of how data are cleaned and analyzed is often a mess and that there should be greater outside scrutiny of this process. Would be great to have a post with your thoughts on what steps researchers can take to make this process less of a mess. Here’s my list:
1. Come up with a design plan for how your code will go from raw data to finished results
Don’t just start coding. This will prevent headaches down the road and make it easier to divide up the work.
2. Make your code modular
Unless you can go from raw data to final results in 100 lines of code or less, break up the logic into separate do files. This will help with the next step.
3. Test your code
Don't just look at whether the outputs from regressions look fishy. Test that each of the modules is doing what it is supposed to do. Throw weird inputs at your functions and make sure they work (see the toy example after this list).
4. Use version control
There probably isn’t enough space here to go over this in detail but you should use something like Git so that you can always recover previous versions of your code.
5. Make your code open source
One of the more amazing things to me is that Miguel and Kremer were even able to find the exact do files that they used for their paper however long ago. Finding old files is a lot easier if you push them to GitHub and tag the version you use for a paper. As long as you don't hard code any details of the data into your code, this doesn't cause any data privacy concerns. Making your code open source and tagging versions will also push you to make it cleaner and better documented.
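To make point 3 concrete, here is a toy example in Python (the Stata equivalent would be a do file full of asserts). The function, its rules, and its inputs are entirely hypothetical, invented for illustration, and not anything from the actual deworming code:

    def clean_attendance(records):
        # Turn raw (pupil_id, days_present, days_enrolled) rows into
        # attendance rates, dropping rows that cannot be interpreted.
        rates = {}
        for pupil_id, present, enrolled in records:
            if enrolled is None or enrolled <= 0:
                continue                          # no enrollment info: drop
            if present is None or present < 0 or present > enrolled:
                continue                          # impossible values: drop
            rates[pupil_id] = present / enrolled
        return rates

    # Throw weird inputs at the function and check it does what it should.
    assert clean_attendance([]) == {}
    assert clean_attendance([("a", 5, 10)]) == {"a": 0.5}
    assert clean_attendance([("b", 11, 10)]) == {}    # present > enrolled
    assert clean_attendance([("c", None, 10)]) == {}  # missing value
    assert clean_attendance([("d", 3, 0)]) == {}      # zero enrollment
    print("all checks passed")

The point is not the specific rules (reasonable people can disagree on whether to drop or flag a bad row) but that the rules are written down, executable, and checked every time the pipeline runs.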
I updated my post on Development Impact, which can be found here: https://blogs.worldbank.org/impactevaluations/worm-wars-review-reanalysis-miguel-and-kremer-s-deworming-study
Hopefully it clarifies some of the confusion around the key issues…
Berk.
Thanks Berk, I appreciate it and I look forward to reading the updated post. Cheers, s
Important for those who care about the overall question of whether – given all the evidence – deworming is a recommended policy for improving the lives of children in areas with such worms:
The charity evaluator GiveWell, known for its independence and its close scrutiny of the data and evidence behind poverty alleviation policies, just came out with a new statement evaluating the overall evidence on this topic.
Conclusion: Deworming is still a highly effective, recommended intervention.
A must read if you care about the details: http://blog.givewell.org/2015/07/24/new-deworming-reanalyses-and-cochrane-review/
What bugs me about the Miguel & Kremer study is external validity: on page 174, they mention that anemia is “the most frequently hypothesized link between worm infections and cognitive performance”, but that treated and untreated pupils didn’t differ very much in terms of hemoglobin concentrations. This is attributed to the low overall rates of anemia in Busia, which in turn may be caused by the fact that 73 percent of children were engaged in ‘geophagy’ i.e. eating dirt (as mentioned in footnote 21).
This would actually suggest to me that the paper underestimates the effect of deworming pills in non-geophagic areas, since the anemia mechanism would be relevant there. Of course, they also mention briefly that geophagy could increase reinfection rates by increasing exposure to larvae, which could mean that they’re overestimating what the positive externalities would be in a non-geophagic area. However, I’m not an expert on this stuff, so my analysis here should be taken with a grain of salt/dirt.
Hi Berk (sorry I forgot the greeting, my first comment on a blog!).
I completely agree that in the absence of a pre-analysis plan, I would likely have asked the original authors re: start date and taken their word for it. I think the sensitivity/as-treated analyses they conducted instead are second best.
But that's a different point from the one you made in the WB blog post (and that a lot of others repeated), i.e., that nobody ever classifies as "treatment" any observation taken before said treatment was actually delivered and calls it ITT. That is not accurate. It is commonly done in public health, and for good reason. I think economists should in fact adopt the public health definition of ITT in stepped wedge trials, because that is an important safeguard. And it is a pity this got lost in the discussion.
Hi Stephane,
Thanks – that's fair enough. I think it is true that this alternative definition — the reasoning for which is, like you said, more of a safeguard as opposed to what we think of in economics (selective take-up of the treatment offer) when we think of ITT — is something that almost never comes up for us (my first time thinking about it). When it does, it's in the sense you're thinking of: I have a transfer program, I already collected baseline data, now the treatment is delayed for 6 months and the impacts for the 12-month endline are suppressed. Of course, in such cases, those six months are counted as treatment, as it is real-life program-delivery circumstances that caused those delays. It's just that it was hard to see that kind of reasoning in this particular case.
I’ll link your comment in my updated post on Development Impact as it is so nice. Cheers,
Berk.
I'm rather puzzled by the suggestion from Berk (and a lot of other economists) that no one defines ITT in that way. In a stepped wedge design in public health/epi/medicine, ITT means analyzing clusters according to their randomized cross-over time. If there's a deviation from this schedule, then it becomes an as-treated analysis. This is not "unusual, almost bizarre". It is a common interpretation of ITT in stepped wedge trials in public health, which may not be familiar to economists, sure, but that doesn't mean it doesn't have merit.
In particular, it is there to guard against manipulation of treatment: delays in implementing treatment may be due to temporary conditions that would not be favorable to the treatment and that the investigator purposefully would want to avoid. For example, imagine some teachers who work in treatment schools and typically have higher attendance rates among their students are away for training for a month or two; an investigator intent on showing an effect might then push back implementation until the teachers are back. I'm not saying this is what happened in the Kenya trial, but this is something against which a safeguard is warranted, however rare such instances may be. So in such trials in public health, "treatment" does not start when it is implemented; it starts when it is supposed to be implemented. That is ITT. This paper in the BMJ summarizes practice: http://www.bmj.com/content/350/bmj.h391
The Ebola example in Berk's post is an extreme case, and there are ways to deal with this: contact the IRB, interrupt recruitment, revise the protocol with a new start date, etc.
So the question for this deworming debate & ITT is now simply whether there was ever a planned start date, and when it was. In the absence of a protocol/pre-registration, we won't know for sure. But that doesn't disqualify the replication as bizarre or worse. It is a reflection of current practice in public health/epidemiology. You may disagree with a lot of things (as I do), but it's worth respecting the practice, learning about it and understanding where (smart) people are coming from. And who knows, there may be things of interest/importance in there for economists too? This whole deworming saga strikes me as a huge missed opportunity to learn across disciplines and improve RCT practice.
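For economists meeting this convention for the first time, the coding rule fits in a few lines of Python. The clusters and dates below are invented; the point is only the distinction: under the stepped-wedge ITT, exposure follows the randomized schedule even when delivery slips, while an as-treated analysis follows actual delivery.

    # Invented example: each cluster has a scheduled crossover period
    # (set by randomization) and an actual one (set by field realities).
    clusters = [
        ("school_1", 2, 2),   # treated on schedule in period 2
        ("school_2", 2, 4),   # randomized to period 2, delivered in period 4
        ("school_3", 4, 4),
    ]

    def exposed(period, start):
        return int(period >= start)

    for name, scheduled, actual in clusters:
        for t in range(1, 5):
            print(name, "period", t,
                  "ITT =", exposed(t, scheduled),      # code by schedule
                  "as-treated =", exposed(t, actual))  # code by delivery

For school_2 in periods 2 and 3, ITT says "treated" while as-treated says "not yet treated"; that wedge is exactly where the safeguard against strategic delays lives.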
Hi Stephane,
In a school-based trial, to be able to collect baseline and then follow-up data, you need to be there during the school year. Treatment for any group has to start after the relevant round of data collection. Given that a bunch of attendance data from registers were collected about four weeks before the pupil questionnaires, and then the pupil questionnaires themselves, there is no way the plan could have been to treat people immediately at the start of the school year. So, why would the replication authors not pick the most likely scenario for their primary analysis rather than one that is a priori likely to understate treatment effects? In other words, in the absence of a pre-analysis plan, why is treatment at the start of the year an equally valid choice, when it would have meant coming before data collection?
I am similarly a bit skeptical of Table 3 in the Davey et al statistical replication. While they don't specifically mention which method they use, my interpretation is that they re-weight by school for all of the estimates in Table 3, given that this seems to be what they're doing throughout the analysis. If this is indeed the case, this "sensitivity analysis" would be pretty limited and flawed. (It's certainly possible I'm wrong; I would love to see the code and data, so I'm hopeful they are published sooner rather than later.) Berk and Chris have mentioned this already, but this re-weighting decision seems especially indefensible.
Miguel and Kremer mention their concern with the re-weights in their response. The re-weights aren't identified in the Davey et al pre-analysis plan, and the justification for them, based on some graphs showing correlation (without accompanying tests), is rather unconventional. Given that this is a major criticism, I was rather disappointed to see that Davey et al did not respond to Miguel and Kremer's criticism of their weighting decision.
The single table that I found most convincing was Table S7 of the statistical appendix of Miguel and Kremer’s response, available here:
http://ije.oxfordjournals.org/content/early/2015/07/21/ije.dyv129/suppl/DC1 (page 61).
In it they show that the results are robust to separating by treatment year and defining treatment the way Davey et al do, but that the results fall apart when you also re-weight by school rather than by attendance observations or pupil population. It's for this reason that I suspect Table 3 in Davey's statistical replication might still employ re-weights.
Hi Berk.
Some of that, especially at the beginning, seems a bit garbled to me. Was there an error in your posting?
I read all of your post. I really do think the issue of whether the treatment time was correctly coded was the key point of your post. It was the first and the longest: 859 words out of the 1,903 words for your four points.
I’m sorry to see this possibly degenerate into anger or – worse – chaos and confusion.
I’m really just keen to hear your response to my concern about your first, main point, on treatment time. Your argument was that the reanalysis found no benefit because treatment time wasn’t correctly coded. My concern is that the re-analysis team also did the analysis coding treatment time the way you prefer, and it made little difference.
Thanks,
Ben
Ugh, it's hard to stay quiet when Dr. Goldacre wrongly accuses us of wrong-headed criticism or, worse, factionalism. Let's look at it using his own words, directly from his comment above:
BG: “I don’t look at most of the issues you discuss above, and focus largely on the errors in the original paper that are acknowledged as errors by the original researchers.”
But the issues Chris discussed above are exactly the ones I discussed, with just the order switched around. You don't believe me? Here it is:
CB: “You have to divide it into two smaller experiments.” (Point #2 in my blog linked above)
CB: “You also have to ignore the fact that the disease could pass to other people.” (Point #4 in my blog)
CB: “You have to care about school rather than pupil effects.” (point #3 in my blog)
CB: "You have to recode the groups to put people who didn't get the medicine into the group that was supposed to get the medicine." (Point #1 in my blog).
Would you like more? Here is another:
CB: “…I believe the result even more than when I started.”
BO: “In fact, if anything, I find the findings of the original study more robust than I did before.”
So, apparently, Ben Goldacre picks and chooses what he wants to read or what he’d like to emphasize as it suits him.
As for his point that the ITT objection is my central point and that it is wrong, well, let's look at the text again. This is from my post:
“In their reanalysis of the data from the original study, DAHH make some choices that are significantly different than the ones made by the original study authors. There are many departures but four of them are key: (i) definition of treatment; (ii) ignoring the longitudinal data in favor of cross-sectional analysis of treatment effects by year; (iii) weighting observations differently; and (iv) ignoring spillovers from treatment to control. I address them in order below:”
As you can see, I do not have any central thesis. (Goldacre probably thinks so because he stopped reading after point 1.) How do we know that I don't have a central thesis? Because I never conclude that this is what kills one's faith in the replication findings. Instead, here is what I said in conclusion:
“Tables 1 & 3 in HKM’s response demonstrate that a number of unconventional ways of handling the data and conducting the analysis are JOINTLY required to obtain results that are qualitatively different than the original study.” (emphasis added)
This is a point made above by Chris, as well as by the authors in their various responses, over and over again:
CB: “here’s the amazing thing: Just doing two or three of these things is not enough to make the result go away. It looks to me like you have to do three or all four. In particular, if you do everything except split the experiment in two, everything holds. Most of the “debunking” rides on splitting the experiment in two.”
Goldacre keeps referring to Table 3, which did not exist when I wrote my post 6 months ago. Back then it was Appendix Table 7, and it still suffered from a number of the other issues raised, just as it does now. I can hardly be blamed for not emphasizing an appendix table, when I did not even think the point mattered by itself.
Finally, the ITT definition criticism stands on its own – as a principle – whether or not it mattered by itself in this case. No one defines ITT that way (the authors say as much); it would make no sense to treat your control group before you measured outcomes in this design! If Goldacre really believes that the replication authors were justified in defining treatment that way for their primary analysis, he should answer my question about the hypothetical Liberia experiment: if his answer is 'yes,' we don't need to be discussing statistics any further.
I'll respond later, likely on Development Impact, to Goldacre's other rants from Twitter yesterday. I was seriously getting worried that he was becoming untethered, and hence decided to let him cool off.
Berk Ozler.
@bucksci: Yes but no.
First, yes. The statistical details can be complicated, as you say, and I would not expect journalists to uncover these or weigh in on them. The Davey et al paper looks balanced and reasonable at first reading. This was my first impression.
But in the end, no. Journalists could have done more. I read the reply by Miguel-Kremer, and didn’t understand all their details. It was overwhelming. So yesterday morning I picked up the phone and called them. The most amazing discovery: I was among the very first people to do so. I don’t know if any of the major articles in any of the major newspapers looked at their reply, let alone spoke to them. That surprised me. I don’t expect journalists to know the details. But I thought speaking to the experts on either side of a debate was more standard.
On the picture, I thought excerpting pictures from an article, like text, was within fair use (in this case I’m linking to their photo URL). But the fact that you ask, and the fact that I know nothing about fair use, suggests I am wrong. If so let me know and I’ll get rid of the photo. Maybe me taking a screenshot of the Buzzfeed page would be fair? I should know how these things work but don’t.
Just a placeholder here: I don't think Berk's argument is wrong. I'm actually not sure, and need to look into it more closely. My reading of Figures 1 and 2 in the Miguel-Kremer response is that they do a much more complete sensitivity analysis than Davey et al. This makes the Davey et al Table 3 you mention look suspiciously curated to fit their original argument. But then we should be suspicious of Miguel-Kremer's response too, and figuring out what is correct takes time. When I have a moment I will look at it. Don't worry, I admit I am wrong all the time on this blog. It happens a lot.
But speaking as someone who worked on the project and spent a lot of time in these schools: the Davey et al choice of years looks completely bizarre. If Miguel-Kremer are to be believed, this was initially an accidental error by Davey et al: they were surprised when the error was pointed out, they withheld publication by 3ie to investigate, and then decided they liked their interpretation better. Maybe out of sincere science, maybe from a desire to preserve their results. The Miguel-Kremer response suggests that the results are not even that sensitive to the choice anyway. That is, it's all about splitting the experiment in two. I don't know, and I haven't been able to get to the truth yet.
All of this is something that should have been addressed by the editor of the journal, who has done a very poor job. Since the editor appears to be in the same department as the replication authors, it's all a little suspicious. Again, something I am investigating to see if there is any truth to it.
Not sure why there is a personalized, accusatory tone in some of this conversation. I trust we are all in this conversation to learn about the best way forward both for creating good research and good policy. So it's important to be careful with the details – for example, I didn't say "if we don't use this intervention children will die".
There are two separate issues here. One is which findings of one paper are robust. (And we agree: a six-kilometer radius covers a larger area than a three-kilometer one.) Another is what the overall literature says. Both are important for understanding the right policy implication, with our shared goal in mind that we want to do what's most effective in helping children lead a healthy, prosperous life.
Thanks Dina.
1. Some of the errors were previously spotted. Some were not. Re-analysis is a good and helpful thing; that's why I wrote a piece, to celebrate re-analysis and open data, and to praise Miguel and Kremer for exposing themselves to it.
2. The overall policy question "does deworming work" is not one I would aim to answer with one trial, but rather with a systematic review of all trials, as I explain in the piece. On the issue of whether deworming is effective, sadly the Cochrane review has been clear for at least the past two updates:
http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD000371.pub6/abstract
3. Regarding spillovers, there are a lot of concerning issues (it's not clear how 3 and 6 km were chosen, breaking down subgroups risks multiple bites at the cherry, etc.), but I have no personal interest in going into further detail on this trial. Also worth noting that the area covered by a radius of 3 km is a lot smaller than that of 6 km.
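(For scale: area goes with the square of the radius, so a 6 km radius covers π × 6² ≈ 113 km², against π × 3² ≈ 28 km² for a 3 km radius, a fourfold difference.)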
4. Berk's central criticism of the trial – on whether whole years were used – is not valid. I am glad you agree on that. However, I am concerned to see, from here and from Twitter, that it seems both you and Chris Blattman knew it wasn't valid at the time you were promoting it, tweeting it, and including it in blogs. I think this discussion should be conducted with clean hands and in good faith. When I see people promoting critiques which they apparently already know to be invalid, I can't say this any other way: I feel sad.
5. I think your line of argument – if we don't use this intervention children will die – is misplaced. We need to establish whether an intervention works, and whether it is cost effective. It's okay to discuss the quality of evidence from one trial, and all trials, and evidence synthesis, and cost-effectiveness calculations. I fully agree that lives are at stake, but not from failing to use one specific intervention. Lives are at stake if we use things without good evidence, and if we create a culture where bad or invalid arguments are tolerated.
Point 4 above I think is absolutely key. This needs to be a serious discussion about the evidence. There is no point in re-analyses if people don't read them and use them. I'm genuinely, seriously disturbed to see you and Chris promoting a critique of a re-analysis when you apparently already knew that critique to be invalid. That seems incredibly strange and unhelpful to me, and it really does not sit well with your emotive invocation, in point 5, that children may die.
Great discussion! I realize there is a lot of bad info out there, so it’s easy to get confused. A couple of things to clarify on the questions Ben raised:
1) The errors in the original Kremer & Miguel paper that Ben mentions were not detected by the re-analysis paper; they were largely detected by the authors themselves and already publicly acknowledged and addressed in 2007.
2) The overall policy question – does deworming work – was not affected by this. The key conclusion stands: Deworming reduces worm infections and increases school attendance for treated children as well as for untreated children who live nearby.
3) The only difference after accounting for this coding error is about how far the spillovers on untreated children reach. The original study estimated benefits up to 6 kilometers around the treated school, while the corrected analysis finds effects only within 3 kilometers.
4) On the technical discussion Ben raises about why the World Bank post by Berk Ozler flags some issues that then turn out not to matter in practice in the analysis by Aiken et al.: the explanation is simple. When Berk wrote that post, the analysis of whether those points would matter in practice had not yet been published by Aiken et al.
5) Important to keep in mind, beyond the technical details of how far exactly spillovers go and who said what when: The overall big question is whether it is a wise policy decision to conduct school-based deworming. Real human lives are at stake. If we withhold deworming when in fact it works, it could have detrimental effects on millions of lives, as several new papers show that being dewormed as a child can significantly boost long-term levels of education, professional achievements and income.
Hi Chris,
thanks, I certainly do think econs are better than medics about sharing data (the issue of re-identifiable pseudonymised health records isn’t so common in econ work, which makes it easier) and that’s a good thing.
I don't want you to think I'm being unkind, but I'm concerned to see you link to Berk's piece. You did so knowing that the main point of it was wrong, or at least, well after I had explained to you, on Twitter, how it was wrong:
https://twitter.com/cblatts/status/624299018676490240
https://twitter.com/search?q=cblatts%20bengoldacre%20berk&src=typd
I do think for the conversation to move forwards on things like this it’s important that people listen, think, read, and possibly change, as well as interact. If you don’t think my concern about Berk’s piece is valid, can you say why? If you haven’t taken the time to read and think about it then, with respect, isn’t that a problem too?
Surely there’s no point in any of this replication work being done if influential people like you – who write about it – don’t also read through the mechanics?
I should perhaps also note that I’ve posted my critique of Berk’s central point on his WorldBank blog three times but it hasn’t appeared.
Ben
Also, do you have permission to reproduce that image?
I think you are unfairly tarnishing journalists in the above. We are taught to respect reviews, rather than individual papers, and that the Cochrane Reviews are the “gold standard” of reviews on medical topics. I am afraid few of us are qualified to make the kinds of judgments you ask us to make in your piece, and of course Ben Goldacre is much better qualified than most of the rest of us.
In this case the journalists have clearly done the right thing. If there is an internal debate to be had within medicine/economics/statistics, then fine, but you can't blame us for not knowing the details or for not asking you when you were not even working in that field.
Ben, I think you’re right that what’s more important is the broader evidence, since the policy at stake is very important.
At the same time, a literature is just a collection of papers. And papers can be done poorly or well, and critiqued poorly or well. The replication just does not measure up in my view. It’s important to get these things right, and to recognize that not all analytical choices are equal. Thinking like that, and a lack of understanding of some of the subtler statistical issues, is one of the chief failings of so much medical research.
You pick up on another failing, which is a lack of data transparency in medicine. Maybe economics just has a higher bar. We routinely make our data available. Every one of my studies has the full raw data and do files, from variable construction to analysis, online before publication. This goes for much of the experimental crowd in economics. These kinds of replications are routine, even to the point of being common classroom exercises. So I set a higher bar. Especially for so foundational a paper.
Thanks for this Chris.
These issues have already been discussed ad infinitum elsewhere as you know. I don’t have any view on deworming, and I wouldn’t rely on one trial if I did, I would look at a systematic review. I am interested in replications, open data, and withheld trials, which is the focus of the piece. I don’t look at most of the issues you discuss above, and focus largely on the errors in the original paper that are acknowledged as errors by the original researchers.
It is worth being clear that these frank errors (again, accepted, quite rightly and honourably, as errors by the original researchers) were not trivial. In fact they were exactly the kind of thing that gets studies retracted from academic journals. For example, this paper had a much more trivial error, of a similar type, with no impact on the main outcome, and it was fully retracted by the Lancet:
http://retractionwatch.com/2015/01/30/lancet-retracts-republishes-cardiology-paper-admirable-notice/
I’m not saying I think the M&K paper should be retracted. I don’t want to send a signal that exposing your data to independent re-analysis will be such a bad experience. I wrote that long BuzzFeed piece specifically because I was worried that people would be mocking the original authors of the trial as fools today, when actually – as I explain – I think they are heroes, errors are common, and more people should share data as they have done.
But I do think that the sizeable scale of their errors – and the fact that they are of the kind that leads to full formal retraction of academic papers – is very important context, if supporters of this specific treatment are claiming that the criticisms of this one trial are unfair, overstated, or unreasonable.
I would also be very cautious about the WorldBank blog criticism of the re-analysis, which you link to with approval.
His first and main criticism, point one, over a whole six paragraphs, is that the re-analysis team split the treatment time up the wrong way. But this criticism is entirely unreasonable. The re-analysis team did indeed split the treatment time up one way, which the original authors disagree with; but they also ran the analysis in two other ways, that the original authors (and that blogger) think is better. That’s fully covered in the paper, and the results are presented in Table 3, which is copied in the link below. Re-analysing it the original researchers’ preferred way made little difference. So this WorldBank blogger’s central criticism simply, surely, doesn’t stand.
https://pbs.twimg.com/media/CKnRS1UWwAAVubm.png:large
To be clear, however, informed technical discussion on these papers – and all papers – is exactly what we need to see. There may well be flaws in the re-analysis too, I would be amazed if there weren’t. I think that a serious public technical discussion on these issues is excellent, and I remain hugely impressed by the original authors giving over their data and their code. For this alone, they are already head and shoulders above almost everyone in my own profession of medical academia. So for that, and for bringing trials into policy and development work, very seriously, it cannot be said enough: bravo!