Chris Blattman


How much of the data you download is made up?


On the sister blog I report on a new paper, “Don’t Get Duped: Fraud through Duplication in Public Opinion Surveys,” by Noble Kuriakose, a researcher at SurveyMonkey, and Michael Robbins, a researcher at Princeton and the University of Michigan, who gathered data from “1,008 national surveys with more than 1.2 million observations, collected over a period of 35 years covering 154 countries, territories or subregions.”

They did some forensics, looking for duplicate or near-duplicate records as a sign of faked data, and they estimate that something like 20% of the surveys they studied had “substantial data falsification via duplication.”
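The forensic idea is simple enough to sketch: for each respondent, find the highest share of answers they have in common with any other respondent, and flag the ones whose best match is suspiciously high. Here is a minimal illustration of that idea in Python — the 85% threshold and the toy data are my own illustrative choices, not the paper's exact procedure:

```python
from itertools import combinations

def max_percent_match(responses):
    """For each respondent, the highest share of answers shared
    with any other respondent (a rough near-duplicate score)."""
    scores = [0.0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        a, b = responses[i], responses[j]
        match = sum(x == y for x, y in zip(a, b)) / len(a)
        scores[i] = max(scores[i], match)
        scores[j] = max(scores[j], match)
    return scores

# Toy data: four respondents, ten questions each. The last two rows
# agree on 9 of 10 answers and stand out as near-duplicates.
data = [
    [1, 2, 3, 1, 2, 4, 1, 3, 2, 1],
    [2, 1, 4, 3, 1, 2, 3, 4, 1, 2],
    [3, 3, 2, 2, 4, 1, 2, 1, 3, 4],
    [3, 3, 2, 2, 4, 1, 2, 1, 3, 1],
]
scores = max_percent_match(data)
flagged = [i for i, s in enumerate(scores) if s >= 0.85]
print(flagged)  # → [2, 3]
```

In practice the hard part is not the computation but the inference: deciding how much near-duplication is consistent with honest data, which is where the paper's judgment calls come in.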

These were serious surveys such as Afrobarometer, Arab Barometer, Americas Barometer, International Social Survey, Pew Global Attitudes, Pew Religion Project, Sadat Chair, and World Values Survey. To the extent these surveys are faked in many countries, we should really be questioning what we think we know about public opinion in these many countries.

That is Andrew Gelman.

At a quick glance, the paper’s approach to calling data “duplicated” is a bit crude, but I’ve worked with several of the survey firms that have produced these surveys in Africa. I have no trouble imagining that the data have very serious problems.

Of course, if 20% of the surveys have a 5% duplication problem, I don’t know that this makes them much worse than the other data scholars use. Researchers use national statistics or cross-national databases all the time as if the data are valid, while most are terrible. When I referee a paper, it’s obvious who knows what’s under the data and who never bothered to look and simply downloaded blindly.

But back to the surveys. Duplication is terrible, but the least of my worries. For instance:

  • The questions from most of these surveys sound perfectly sensible until you sit down and ask them to someone in a village. Then the absurdities become immediately apparent. Researchers: If you have a chance, print out any one of these surveys and test them out sometime. You will never operate the same again.
  • Then there’s the poor quality of much of the legitimately-collected data from rushed, tired, poorly incentivized enumerators.
  • Finally, these survey firms are for-profit enterprises with very different incentives and constraints than the researchers. They often have limited cash flow, middling middle management, and their average customer is a private firm or development agency that pays little attention to data quality.

If I want reliable data, mostly I do not use private survey firms. I hire and train teams myself (when I can, through a local non-profit research organization or an international one like Innovations for Poverty Action). And if I must use a firm, I hire a researcher I trust to keep an eye on things full time. I recommend nothing else.

54 Responses

  1. “The questions from most of these surveys sound perfectly sensible until you sit down and ask them to someone in a village. Then the absurdities become immediately apparent. Researchers: If you have a chance…”

    There’s always a chance. Even if you can’t go to a remote village, try the questions out on someone — I used to start with my late mother (Mom was a good critic of ambiguous language).

    Pretesting questions is one of the cheapest and most efficient ways to improve the survey results.

  2. That’s funny. Data duplication is a source of error. The reaction? Comments and identical retweets of “How much of the data you download is made up?”