How much of the data you download is made up?

On the sister blog I report on a new paper, “Don’t Get Duped: Fraud through Duplication in Public Opinion Surveys,” by Noble Kuriakose, a researcher at SurveyMonkey, and Michael Robbins, a researcher at Princeton and the University of Michigan, who gathered data from “1,008 national surveys with more than 1.2 million observations, collected over a period of 35 years covering 154 countries, territories or subregions.”

They did some forensics, looking for duplicate or near-duplicate records as a sign of faked data, and they estimate that something like 20% of the surveys they studied had “substantial data falsification via duplication.”

These were serious surveys such as Afrobarometer, Arab Barometer, Americas Barometer, International Social Survey, Pew Global Attitudes, Pew Religion Project, Sadat Chair, and World Values Survey. To the extent these surveys are faked in many countries, we should really be questioning what we think we know about public opinion in these many countries.

That is Andrew Gelman.

At quick glance, the paper’s approach to calling data “duplicated” is a bit crude, but I’ve worked with several of the survey firms that have produced these surveys in Africa. I have no trouble imagining that the data have very serious problems.
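To make the idea of a "near-duplicate" concrete, here is a minimal sketch of one way such a check could work. It is not taken from the paper: the function names, the toy data, and the 0.85 threshold are all assumptions for illustration. For each respondent, it finds the largest share of answers that match any other respondent, then flags anyone at or above the threshold.

```python
from itertools import combinations


def max_percent_match(rows):
    """For each row, return the highest share of answers it shares with any other row."""
    best = [0.0] * len(rows)
    # Compare every pair of respondents once.
    for i, j in combinations(range(len(rows)), 2):
        same = sum(a == b for a, b in zip(rows[i], rows[j]))
        share = same / len(rows[i])
        best[i] = max(best[i], share)
        best[j] = max(best[j], share)
    return best


def flag_near_duplicates(rows, threshold=0.85):
    """Return indices of rows whose best match meets the (assumed) threshold."""
    return [i for i, share in enumerate(max_percent_match(rows)) if share >= threshold]


# Hypothetical toy data: three respondents, ten questions each.
responses = [
    [1, 3, 2, 5, 4, 1, 2, 3, 1, 2],
    [1, 3, 2, 5, 4, 1, 2, 3, 1, 4],  # differs from the first row on the last answer only
    [2, 1, 4, 3, 1, 5, 3, 2, 4, 1],
]

print(flag_near_duplicates(responses))
```

A check like this is crude in the obvious way: with few questions or few answer options, honest respondents can match by chance, so the threshold matters a lot.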

Of course, if 20% of the surveys have a 5% duplication problem, I don’t know that this makes them much worse than the other data scholars use. Researchers use national statistics or cross-national databases all the time as if the data are valid, while most are terrible. When I referee a paper, it’s obvious who knows what’s under the data and who never bothers to look and simply downloads blindly.

But back to the surveys. Duplication is terrible, but the least of my worries. For instance:

  • The questions from most of these surveys sound perfectly sensible until you sit down and ask them to someone in a village. Then the absurdities become immediately apparent. Researchers: If you have a chance, print out any one of these surveys and test it out sometime. You will never operate the same again.
  • Then there’s the poor quality of much of the legitimately-collected data from rushed, tired, poorly incentivized enumerators.
  • Finally, these survey firms are for-profit enterprises with very different incentives and constraints than the researchers. They often have limited cash flow, middling middle management, and their average customer is a private firm or development agency that pays little attention to data quality.

If I want reliable data, mostly I do not use private survey firms. I hire and train teams myself (when I can, through a local non-profit research organization or an international one like Innovations for Poverty Action). And if I must use a firm, I hire a researcher I trust to keep an eye on things full time. I recommend nothing else.

54 thoughts on “How much of the data you download is made up?”

  1. That’s funny. Data duplication is a source of error. Reaction? Comments and identical retweets: “How much of the data you download is made up?”

  2. “The questions from most of these surveys sound perfectly sensible until you sit down and ask them to someone in a village. Then the absurdities become immediately apparent. Researchers: If you have a chance…”

    There’s always a chance. Even if you can’t go to a remote village, try the questions out on someone — I used to start with my late mother (Mom was a good critic of ambiguous language).

    Pretesting questions is one of the cheapest and most efficient ways to improve the survey results.