Khurshid S, Reeder C, Harrington LX, Singh P, Sarma G, Friedman SF, Di Achille P, Diamant N, Cunningham JW, Turner AC, Lau ES, Haimovich JS, Al-Alusi MA, Wang X, Klarqvist MDR, Ashburner JM, Diedrich C, Ghadessi M, Mielke J, Eilken HM, McElhinney A, Derix A, Atlas SJ, Ellinor PT, Philippakis AA, Anderson CD, Ho JE, Batra P, Lubitz SA. Cohort design and natural language processing to reduce bias in electronic health records research. NPJ Digit Med 2022; 5:47. [PMID: 35396454 PMCID: PMC8993873 DOI: 10.1038/s41746-022-00590-0] [Received: 05/26/2021] [Accepted: 03/09/2022] [Indexed: 01/04/2023]
Abstract
Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001 and 2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95-0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower, and risk models were better calibrated, in C3PO than in the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012-0.030 in C3PO vs. 0.028-0.046 in Convenience Samples; calibration error for atrial fibrillation: 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research.
Affiliation(s)
- Shaan Khurshid
  - Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Christopher Reeder
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Lia X Harrington
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Pulkit Singh
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Gopal Sarma
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Samuel F Friedman
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Paolo Di Achille
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Nathaniel Diamant
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Jonathan W Cunningham
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
  - Division of Cardiology, Brigham and Women's Hospital, Boston, MA, USA
- Ashby C Turner
  - Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
  - Henry and Allison McCance Center for Brain Health, Massachusetts General Hospital, Boston, MA, USA
- Emily S Lau
  - Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Julian S Haimovich
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Mostafa A Al-Alusi
  - Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Xin Wang
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Marcus D R Klarqvist
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Jeffrey M Ashburner
  - Harvard Medical School, Boston, MA, USA
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA
- Christian Diedrich
  - Bayer AG, Research and Development, Pharmaceuticals, Leverkusen, Germany
- Mercedeh Ghadessi
  - Bayer AG, Research and Development, Pharmaceuticals, Leverkusen, Germany
- Johanna Mielke
  - Bayer AG, Research and Development, Pharmaceuticals, Leverkusen, Germany
- Hanna M Eilken
  - Bayer AG, Research and Development, Pharmaceuticals, Leverkusen, Germany
- Alice McElhinney
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Andrea Derix
  - Bayer AG, Research and Development, Pharmaceuticals, Leverkusen, Germany
- Steven J Atlas
  - Harvard Medical School, Boston, MA, USA
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA
- Patrick T Ellinor
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
  - Demoulas Center for Cardiac Arrhythmias, Massachusetts General Hospital, Boston, MA, USA
- Anthony A Philippakis
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
  - Eric and Wendy Schmidt Center, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Christopher D Anderson
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
  - Henry and Allison McCance Center for Brain Health, Massachusetts General Hospital, Boston, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Department of Neurology, Brigham and Women's Hospital, Boston, MA, USA
- Jennifer E Ho
  - Division of Cardiology, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Puneet Batra
  - Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Steven A Lubitz
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
  - Demoulas Center for Cardiac Arrhythmias, Massachusetts General Hospital, Boston, MA, USA