Scientific Datasets Are Broken. We’re Fixing Them.

Last quarter, I started extracting datasets from published scientific papers to automate systematic reviews at Revv Research. I was surprised by how many errors show up in published reviews, and those sit at the very end of the pipeline: the reviews are just copying numbers from already published papers.

The errors upstream are far worse. Science Detective built software to scan Dryad for duplicate data blocks that shouldn't exist, and found serious copy-paste errors in roughly 3% of the first 600 datasets scanned. Extrapolated across Dryad's ~24,000 datasets, that's an estimated 700+ compromised records. Fifteen cases have been posted to PubPeer, and one paper has already been retracted.

Take this example: the landmark 2016 Cell paper “Gut Microbiota Regulate Motor Deficits and Neuroinflammation in a Model of Parkinson’s Disease” argued that gut bacteria drive Parkinson’s-like symptoms in mice. Science Detective found identical sequences of data copied between different experimental groups — duplicated rows made up 50% of SPF samples and 42% of ExGF samples in the adhesive removal test.
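
To make the failure mode concrete, here's a minimal sketch of that kind of duplicate check in pandas. It's a simplification of what Science Detective does (their scanner looks for duplicated blocks; this flags single repeated rows), and the file and column names are hypothetical stand-ins, not the paper's actual schema:

```python
import pandas as pd

# Hypothetical layout: one row per trial, with a "group" label
# (e.g. "SPF", "ExGF") plus the measured values.
df = pd.read_csv("adhesive_removal.csv")

value_cols = [c for c in df.columns if c != "group"]

# Fingerprint each row's measurements as a single string key.
df["row_key"] = df[value_cols].astype(str).agg("|".join, axis=1)

# Keys that appear in more than one group are suspicious: real
# continuous measurements should almost never repeat exactly
# across experimental arms.
shared = (
    df.groupby("row_key")["group"]
      .nunique()
      .loc[lambda n: n > 1]
)

flagged = df[df["row_key"].isin(shared.index)]
print(f"{len(flagged)} of {len(df)} rows share values across groups")
print(flagged.sort_values("row_key"))
```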

With TablePage, I'm hoping to prevent this in the future: we extract datasets from papers and make them live, structured, and citable. That Parkinson's dataset is now extracted and structured on TablePage.

When data lives in a structured, queryable format, these errors become findable. When it's trapped in a PDF, they're invisible.
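
And that punchline is literal. Assuming the extracted table is a plain CSV with a group column and a latency_s measurement (hypothetical names again), one SQL query via duckdb surfaces the suspicious values:

```python
import duckdb

# Hypothetical schema: one row per trial with a "group" label and
# a latency_s measurement. Exact values recurring across groups
# are exactly what the scan above flags.
rows = duckdb.sql("""
    SELECT latency_s,
           count(*)                AS occurrences,
           count(DISTINCT "group") AS n_groups
    FROM 'adhesive_removal.csv'
    GROUP BY latency_s
    HAVING count(DISTINCT "group") > 1
""").fetchall()

for latency, occurrences, n_groups in rows:
    print(f"{latency}s appears {occurrences} times across {n_groups} groups")
```

No PDF reader can run that query.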