1
Carrell DS, Floyd JS, Gruber S, Hazlehurst BL, Heagerty PJ, Nelson JC, Williamson BD, Ball R. A general framework for developing computable clinical phenotype algorithms. J Am Med Inform Assoc 2024; 31:1785-1796. [PMID: 38748991 PMCID: PMC11258420 DOI: 10.1093/jamia/ocae121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 05/07/2024] [Accepted: 05/14/2024] [Indexed: 07/20/2024]
Abstract
OBJECTIVE To present a general framework providing high-level guidance to developers of computable algorithms for identifying patients with specific clinical conditions (phenotypes) through a variety of approaches, including but not limited to machine learning and natural language processing methods to incorporate rich electronic health record data. MATERIALS AND METHODS Drawing on extensive prior phenotyping experiences and insights derived from 3 algorithm development projects conducted specifically for this purpose, our team with expertise in clinical medicine, statistics, informatics, pharmacoepidemiology, and healthcare data science methods conceptualized stages of development and corresponding sets of principles, strategies, and practical guidelines for improving the algorithm development process. RESULTS We propose 5 stages of algorithm development and corresponding principles, strategies, and guidelines: (1) assessing fitness-for-purpose, (2) creating gold standard data, (3) feature engineering, (4) model development, and (5) model evaluation. DISCUSSION AND CONCLUSION This framework is intended to provide practical guidance and serve as a basis for future elaboration and extension.
Affiliation(s)
- David S Carrell
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- James S Floyd
  - Department of Medicine, School of Medicine, University of Washington, Seattle, WA 98195, United States
  - Department of Epidemiology, School of Public Health, University of Washington, Seattle, WA 98195, United States
- Susan Gruber
  - Putnam Data Sciences, LLC, Cambridge, MA 02139, United States
- Brian L Hazlehurst
  - Center for Health Research, Kaiser Permanente Northwest, Portland, OR 97227, United States
- Patrick J Heagerty
  - Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA 98195, United States
- Jennifer C Nelson
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Brian D Williamson
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Robert Ball
  - Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States
2
Knevel R, Liao KP. From real-world electronic health record data to real-world results using artificial intelligence. Ann Rheum Dis 2023; 82:306-311. [PMID: 36150748 PMCID: PMC9933153 DOI: 10.1136/ard-2022-222626] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Received: 07/15/2022] [Accepted: 09/10/2022] [Indexed: 11/04/2022]
Abstract
With the worldwide digitalisation of medical records, electronic health records (EHRs) have become an increasingly important source of real-world data (RWD). RWD can complement traditional study designs because it captures almost the complete variety of patients, leading to more generalisable results. For rheumatology, these data are particularly interesting as our diseases are uncommon and often take years to develop. In this review, we discuss the following concepts related to the use of EHR for research and considerations for translation into clinical care: EHR data contain a broad collection of healthcare data covering the multitude of real-life patients and the healthcare processes related to their care. Machine learning (ML) is a powerful method that allows us to leverage a large amount of heterogeneous clinical data for clinical algorithms, but requires extensive training, testing, and validation. Patterns discovered in EHR data using ML are applicable to real-life settings; however, they are also prone to capturing the local EHR structure, which limits generalisability outside the EHR(s) from which they were developed. Population studies on EHR data necessitate knowledge of the factors influencing the data available in the EHR, for example, access to medical care and insurance status, to circumvent biases. In summary, EHR data represent a rapidly growing and key resource for real-world studies. However, transforming RWD EHR data for research and for real-world evidence using ML requires knowledge of EHR systems and their differences from existing observational data to ensure that studies incorporate rigorous methods that acknowledge or address factors such as access to care, noise in the data, missingness and indication bias.
Affiliation(s)
- Rachel Knevel
  - Department of Rheumatology, Leiden University Medical Center, Leiden, The Netherlands
  - Newcastle University School of Clinical Medical Sciences, Newcastle upon Tyne, UK
- Katherine P Liao
  - Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, Massachusetts, USA
  - Harvard Medical School Center for Biomedical Informatics, Boston, Massachusetts, USA
3
Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023; 30:367-381. [PMID: 36413056 PMCID: PMC9846699 DOI: 10.1093/jamia/ocac216] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Received: 07/28/2022] [Revised: 09/27/2022] [Accepted: 10/27/2022] [Indexed: 11/23/2022]
Abstract
OBJECTIVE Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. MATERIALS AND METHODS We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. RESULTS Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. DISCUSSION Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. CONCLUSION Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Ellen Stephenson
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Karen Tu
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
4
Sulieman L, Cronin RM, Carroll RJ, Natarajan K, Marginean K, Mapes B, Roden D, Harris P, Ramirez A. Comparing medical history data derived from electronic health records and survey answers in the All of Us Research Program. J Am Med Inform Assoc 2022; 29:1131-1141. [PMID: 35396991 PMCID: PMC9196700 DOI: 10.1093/jamia/ocac046] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/14/2021] [Revised: 02/18/2022] [Accepted: 03/23/2022] [Indexed: 11/13/2022]
Abstract
OBJECTIVE A participant's medical history is important in clinical research and can be captured from electronic health records (EHRs) and self-reported surveys. Both can be incomplete: EHRs due to documentation gaps or lack of interoperability, and surveys due to recall bias or limited health literacy. This analysis compares medical history collected in the All of Us Research Program through both surveys and EHRs. MATERIALS AND METHODS The All of Us medical history survey includes a self-report questionnaire that asks about diagnoses of over 150 medical conditions organized into 12 disease categories. In each category, we identified the 3 most and least frequent self-reported diagnoses and retrieved their analogues from EHRs. We calculated agreement scores and extracted participant demographic characteristics for each comparison set. RESULTS The 4th All of Us dataset release includes data from 314 994 participants, 28.3% of whom completed medical history surveys and 65.5% of whom had EHR data. The Hearing and vision category within the survey had the highest number of responses, but the second lowest positive agreement with the EHR (0.21). The Infectious disease category had the lowest positive agreement (0.12). Cancer conditions had the highest positive agreement (0.45) between the 2 data sources. DISCUSSION AND CONCLUSION Our study quantified the agreement of medical history between 2 sources: EHRs and self-reported surveys. Conditions that are usually undocumented in EHRs had low agreement scores, demonstrating that survey data can supplement EHR data. Disagreement between EHR and survey can help identify possible missing records and guide researchers to adjust for biases.
Affiliation(s)
- Lina Sulieman
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Robert M Cronin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Medicine, The Ohio State University, Columbus, Ohio, USA
- Robert J Carroll
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Karthik Natarajan
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Kayla Marginean
  - Vanderbilt Institute of Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Brandy Mapes
  - Vanderbilt Institute of Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Dan Roden
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Paul Harris
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Vanderbilt Institute of Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Andrea Ramirez
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Office of data and analytics, All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
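
The abstract for this entry reports "positive agreement" between survey and EHR without giving a formula. As a rough sketch only, assuming the score is the proportion of specific positive agreement, 2a / (2a + b + c), where a counts participants positive in both sources and b and c count those positive in only one, the calculation could look like the following. The function name and the toy data are hypothetical and are not taken from the study.

```python
def positive_agreement(survey, ehr):
    """Proportion of specific positive agreement: 2a / (2a + b + c).

    Assumed definition (not confirmed by the abstract):
    a = positive in both sources, b = survey only, c = EHR only.
    """
    if len(survey) != len(ehr):
        raise ValueError("inputs must have the same length")
    both = sum(1 for s, e in zip(survey, ehr) if s and e)
    survey_only = sum(1 for s, e in zip(survey, ehr) if s and not e)
    ehr_only = sum(1 for s, e in zip(survey, ehr) if e and not s)
    denom = 2 * both + survey_only + ehr_only
    # With no positives in either source the score is undefined.
    return 2 * both / denom if denom else float("nan")


# Toy data: 1 = condition reported, 0 = not reported (invented values).
survey = [1, 1, 1, 0, 0, 0, 1, 0]
ehr = [1, 0, 0, 0, 1, 0, 1, 0]
score = positive_agreement(survey, ehr)
print(round(score, 3))
```

Participants negative in both sources do not enter this score, which is why it can be low even when overall agreement is high for rare conditions.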
5
Zeng C, Bastarache LA, Tao R, Venner E, Hebbring S, Andujar JD, Bland ST, Crosslin DR, Pratap S, Cooley A, Pacheco JA, Christensen KD, Perez E, Zawatsky CLB, Witkowski L, Zouk H, Weng C, Leppig KA, Sleiman PMA, Hakonarson H, Williams MS, Luo Y, Jarvik GP, Green RC, Chung WK, Gharavi AG, Lennon NJ, Rehm HL, Gibbs RA, Peterson JF, Roden DM, Wiesner GL, Denny JC. Association of Pathogenic Variants in Hereditary Cancer Genes With Multiple Diseases. JAMA Oncol 2022; 8:835-844. [PMID: 35446370 PMCID: PMC9026237 DOI: 10.1001/jamaoncol.2022.0373] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 12/19/2022]
Abstract
Importance Knowledge about the spectrum of diseases associated with hereditary cancer syndromes may improve disease diagnosis and management for patients and help to identify high-risk individuals. Objective To identify phenotypes associated with hereditary cancer genes through a phenome-wide association study. Design, Setting, and Participants This phenome-wide association study used health data from participants in 3 cohorts. The Electronic Medical Records and Genomics Sequencing (eMERGEseq) data set recruited predominantly healthy individuals from 10 US medical centers from July 16, 2016, through February 18, 2018, with a mean (SD) follow-up through electronic health records (EHRs) of 12.7 (7.4) years. The UK Biobank (UKB) cohort recruited participants from March 15, 2006, through August 1, 2010, with a mean (SD) follow-up of 12.4 (1.0) years. The Hereditary Cancer Registry (HCR) recruited patients undergoing clinical genetic testing at Vanderbilt University Medical Center from May 1, 2012, through December 31, 2019, with a mean (SD) follow-up through EHRs of 8.8 (6.5) years. Exposures Germline variants in 23 hereditary cancer genes. Pathogenic and likely pathogenic variants for each gene were aggregated for association analyses. Main Outcomes and Measures Phenotypes in the eMERGEseq and HCR cohorts were derived from the linked EHRs. Phenotypes in UKB were from multiple sources of health-related data. Results A total of 214 020 participants were identified, including 23 544 in the eMERGEseq cohort (mean [SD] age, 47.8 [23.7] years; 12 611 [53.6%] women), 187 234 in the UKB cohort (mean [SD] age, 56.7 [8.1] years; 104 055 [55.6%] women), and 3242 in the HCR cohort (mean [SD] age, 52.5 [15.5] years; 2851 [87.9%] women). All 38 established gene-cancer associations were replicated, and 19 new associations were identified.
These included the following 7 associations with neoplasms: CHEK2 with leukemia (odds ratio [OR], 3.81 [95% CI, 2.64-5.48]) and plasma cell neoplasms (OR, 3.12 [95% CI, 1.84-5.28]), ATM with gastric cancer (OR, 4.27 [95% CI, 2.35-7.44]) and pancreatic cancer (OR, 4.44 [95% CI, 2.66-7.40]), MUTYH (biallelic) with kidney cancer (OR, 32.28 [95% CI, 6.40-162.73]), MSH6 with bladder cancer (OR, 5.63 [95% CI, 2.75-11.49]), and APC with benign liver/intrahepatic bile duct tumors (OR, 52.01 [95% CI, 14.29-189.29]). The remaining 12 associations with nonneoplastic diseases included BRCA1/2 with ovarian cysts (OR, 3.15 [95% CI, 2.22-4.46] and 3.12 [95% CI, 2.36-4.12], respectively), MEN1 with acute pancreatitis (OR, 33.45 [95% CI, 9.25-121.02]), APC with gastritis and duodenitis (OR, 4.66 [95% CI, 2.61-8.33]), and PTEN with chronic gastritis (OR, 15.68 [95% CI, 6.01-40.92]). Conclusions and Relevance The findings of this genetic association study analyzing the EHRs of 3 large cohorts suggest that these new phenotypes associated with hereditary cancer genes may facilitate early detection and better management of cancers. This study highlights the potential benefits of using EHR data in genomic medicine.
Affiliation(s)
- Chenjie Zeng
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
- Lisa A Bastarache
  - Center for Precision Medicine, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Ran Tao
  - Department of Biostatistics, Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee
- Eric Venner
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas
- Scott Hebbring
  - Center for Human Genetics, Marshfield Clinic Research Institute, Marshfield, Wisconsin
- Justin D Andujar
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
  - Clinical and Translational Hereditary Cancer Program, Division of Genetic Medicine, Vanderbilt-Ingram Cancer Center, Vanderbilt University, Nashville, Tennessee
- Sarah T Bland
  - Center for Precision Medicine, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- David R Crosslin
  - Department of Biomedical Informatics and Medical Education, University of Washington School of Medicine, Seattle
- Siddharth Pratap
  - School of Graduate Studies and Research, Meharry Medical College, Nashville, Tennessee
- Ayorinde Cooley
  - Department of Microbiology, Immunology and Physiology, Meharry Medical College, Nashville, Tennessee
- Jennifer A Pacheco
  - Center for Genetic Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois
- Kurt D Christensen
  - PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Pilgrim Health Care Institute, Boston, Massachusetts
  - Department of Population Medicine, Harvard Medical School, Boston, Massachusetts
- Emma Perez
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts
- Carrie L Blout Zawatsky
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts
- Leora Witkowski
  - Centre Universitaire de Santé McGill, McGill University Health Centre, Montreal, Quebec, Canada
- Hana Zouk
  - Laboratory for Molecular Medicine, Partners Healthcare Personalized Medicine, Cambridge, Massachusetts
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York
- Kathleen A Leppig
  - Genetic Services and Kaiser Permanente Washington Health Research Institute, Kaiser Permanente of Washington, Seattle
- Patrick M A Sleiman
  - Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania
  - Division of Human Genetics, Department of Pediatrics, The University of Pennsylvania Perelman School of Medicine, Philadelphia
- Hakon Hakonarson
  - Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania
  - Division of Human Genetics, Department of Pediatrics, The University of Pennsylvania Perelman School of Medicine, Philadelphia
- Marc S Williams
  - Genomic Medicine Institute, Geisinger, Danville, Pennsylvania
- Yuan Luo
  - Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois
- Gail P Jarvik
  - Department of Medicine (Medical Genetics), University of Washington, Seattle
  - Department of Genome Sciences, University of Washington, Seattle
- Robert C Green
  - Brigham and Women's Hospital, Broad Institute, Ariadne Labs and Harvard Medical School, Boston, Massachusetts
- Wendy K Chung
  - Department of Pediatrics, Columbia University, New York, New York
  - Department of Medicine, Columbia University, New York, New York
- Ali G Gharavi
  - Division of Nephrology, Department of Medicine, Columbia University Irving Medical Center, New York, New York
  - Center for Precision Medicine and Genomics, Department of Medicine, Columbia University Irving Medical Center, New York, New York
- Niall J Lennon
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts
- Heidi L Rehm
  - Medical & Population Genetics Program and Genomics Platform, Broad Institute of MIT and Harvard Cambridge, Cambridge, Massachusetts
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts
- Richard A Gibbs
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas
- Josh F Peterson
  - Center for Precision Medicine, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Dan M Roden
  - Center for Precision Medicine, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
  - Divisions of Cardiovascular Medicine and Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
  - Department of Pharmacology, Vanderbilt University, Nashville, Tennessee
- Georgia L Wiesner
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
  - Clinical and Translational Hereditary Cancer Program, Division of Genetic Medicine, Vanderbilt-Ingram Cancer Center, Vanderbilt University, Nashville, Tennessee
- Joshua C Denny
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
6
Abstract
Electronic health records (EHRs) are a rich source of data for researchers, but extracting meaningful information out of this highly complex data source is challenging. Phecodes represent one strategy for defining phenotypes for research using EHR data. They are a high-throughput phenotyping tool based on ICD (International Classification of Diseases) codes that can be used to rapidly define the case/control status of thousands of clinically meaningful diseases and conditions. Phecodes were originally developed to conduct phenome-wide association studies to scan for phenotypic associations with common genetic variants. Since then, phecodes have been used to support a wide range of EHR-based phenotyping methods, including the phenotype risk score. This review aims to comprehensively describe the development, validation, and applications of phecodes and suggest some future directions for phecodes and high-throughput phenotyping.
Affiliation(s)
- Lisa Bastarache
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA
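
The abstract for this entry describes phecodes as an ICD-based way to assign case/control status at scale. The sketch below illustrates only the general idea, under stated assumptions: the mapping table is a toy with invented codes (not real ICD or phecode values), and the rule "at least `min_count` mapped codes makes a case, none makes a control, anything in between is indeterminate" is an illustrative choice, not a rule stated in the abstract.

```python
from collections import Counter

# Hypothetical toy mapping. The codes and labels are invented for
# illustration and are not real ICD codes or real phecodes.
TOY_ICD_TO_PHECODE = {
    "A10.1": "P001",
    "A10.2": "P001",
    "B20.0": "P002",
}


def phecode_status(patient_codes, phecode, min_count=2, mapping=TOY_ICD_TO_PHECODE):
    """Return True (case), False (control), or None (indeterminate).

    patient_codes: list of billing-code occurrences for one patient.
    min_count: assumed threshold of mapped occurrences needed for a case.
    """
    # Count how often each phecode-like label appears; unmapped codes are skipped.
    counts = Counter(mapping[c] for c in patient_codes if c in mapping)
    n = counts[phecode]  # Counter returns 0 for labels never seen
    if n >= min_count:
        return True
    if n == 0:
        return False
    return None


print(phecode_status(["A10.1", "A10.2", "Z99"], "P001"))
print(phecode_status(["B20.0"], "P001"))
print(phecode_status(["A10.1"], "P001"))
```

Because many specific codes collapse onto one label, the same lookup can be repeated over every label in the table to build a patient-by-phenotype matrix for a phenome-wide scan.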
7
Wang L, Zhang X, Meng X, Koskeridis F, Georgiou A, Yu L, Campbell H, Theodoratou E, Li X. Methodology in phenome-wide association studies: a systematic review. J Med Genet 2021; 58:720-728. [PMID: 34272311 DOI: 10.1136/jmedgenet-2021-107696] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/06/2021] [Accepted: 05/27/2021] [Indexed: 11/04/2022]
Abstract
Phenome-wide association study (PheWAS) has been increasingly used to identify novel genetic associations across a wide spectrum of phenotypes. This systematic review aims to summarise the PheWAS methodology, discuss the advantages and challenges of PheWAS, and provide potential implications for future PheWAS studies. Medical Literature Analysis and Retrieval System Online (MEDLINE) and Excerpta Medica Database (EMBASE) databases were searched to identify all published PheWAS studies up until 24 April 2021. The PheWAS methodology, including how to perform a PheWAS analysis and which software or tools could be used, was summarised based on the extracted information. A total of 1035 studies were identified and 195 eligible articles were finally included. Among them, 137 (77.0%) contained 10 000 or more study participants, 164 (92.1%) defined the phenome based on electronic medical records data, 140 (78.7%) used genetic variants as predictors, and 73 (41.0%) conducted replication analysis to validate PheWAS findings, almost all of which (94.5%) yielded consistent results. The methodology applied in these PheWAS studies was dissected into several critical steps, including quality control of the phenome, selecting predictors, phenotyping, statistical analysis, interpretation and visualisation of PheWAS results. The workflow for performing a PheWAS was established with detailed instructions on each step. This study provides a comprehensive overview of PheWAS methodology to help practitioners achieve a better understanding of the PheWAS design, to detect understudied or overstudied outcomes, and to direct their research by applying the most appropriate software and online tools for their study data structure.
Affiliation(s)
- Lijuan Wang
  - School of Public Health and the Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Xiaomeng Zhang
  - Centre for Global Health, The University of Edinburgh Usher Institute of Population Health Sciences and Informatics, Edinburgh, UK
- Xiangrui Meng
  - Vanke School of Public Health, Tsinghua University, Beijing, China
- Fotios Koskeridis
  - Department of Hygiene and Epidemiology, University of Ioannina, Ioannina, Epirus, Greece
- Andrea Georgiou
  - Department of Hygiene and Epidemiology, University of Ioannina, Ioannina, Epirus, Greece
- Lili Yu
  - School of Public Health and the Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Harry Campbell
  - Centre for Global Health, The University of Edinburgh Usher Institute of Population Health Sciences and Informatics, Edinburgh, UK
- Evropi Theodoratou
  - Centre for Global Health, The University of Edinburgh Usher Institute of Population Health Sciences and Informatics, Edinburgh, UK
  - Cancer Research UK Edinburgh Centre, The University of Edinburgh MRC Institute of Genetics and Molecular Medicine, Edinburgh, UK
- Xue Li
  - School of Public Health and the Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
8
Moldwin A, Demner-Fushman D, Goodwin TR. Empirical Findings on the Role of Structured Data, Unstructured Data, and their Combination for Automatic Clinical Phenotyping. AMIA Joint Summits on Translational Science Proceedings 2021; 2021:445-454. [PMID: 34457160 PMCID: PMC8378600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
The objective of this study is to explore the role of structured and unstructured data for clinical phenotyping by determining which types of clinical phenotypes are best identified using unstructured data (e.g., clinical notes), structured data (e.g., laboratory values, vital signs), or their combination across 172 clinical phenotypes. Specifically, we used laboratory and chart measurements as well as clinical notes from the MIMIC-III critical care database and trained an LSTM using features extracted from each type of data to determine which categories of phenotypes were best identified by structured data, unstructured data, or both. We observed that textual features on their own outperformed structured features for 145 (84%) of phenotypes, and that Doc2Vec was the most effective representation of unstructured data for all phenotypes. When evaluating the impact of adding textual features to systems previously relying only on structured features, we found a statistically significant (p < 0.05) increase in phenotyping performance for 51 phenotypes (primarily involving the circulatory system, injury, and poisoning), one phenotype for which textual features degraded performance (diabetes without complications), and no statistically significant change in performance with the remaining 120 phenotypes. We provide analysis on which phenotypes are best identified by each type of data and guidance on which data sources to consider for future research on phenotype identification.
Affiliation(s)
- Asher Moldwin
  - U.S. National Library of Medicine, Bethesda, MD, USA
9
Thangaraj PM, Kummer BR, Lorberbaum T, Elkind MSV, Tatonetti NP. Comparative analysis, applications, and interpretation of electronic health record-based stroke phenotyping methods. BioData Min 2020; 13:21. [PMID: 33372632 PMCID: PMC7720570 DOI: 10.1186/s13040-020-00230-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/26/2020] [Accepted: 11/15/2020] [Indexed: 01/14/2023]
Abstract
BACKGROUND Accurate identification of acute ischemic stroke (AIS) patient cohorts is essential for a wide range of clinical investigations. Automated phenotyping methods that leverage electronic health records (EHRs) represent a fundamentally new approach to cohort identification that avoids the current laborious and ungeneralizable generation of phenotyping algorithms. We systematically compared and evaluated the ability of machine learning algorithms and case-control combinations to phenotype acute ischemic stroke patients using data from an EHR. MATERIALS AND METHODS Using structured patient data from the EHR at a tertiary-care hospital system, we built and evaluated machine learning models to identify patients with AIS based on 75 different case-control and classifier combinations. We then estimated the prevalence of AIS patients across the EHR. Finally, we externally validated the ability of the models to detect AIS patients without AIS diagnosis codes using the UK Biobank. RESULTS Across all models, we found that the mean AUROC for detecting AIS was 0.963 ± 0.0520 and the mean average precision score was 0.790 ± 0.196 with minimal feature processing. Classifiers trained with cases with AIS diagnosis codes and controls with no cerebrovascular disease codes had the best average F1 score (0.832 ± 0.0383). In the external validation, we found that the top probabilities from a model-predicted AIS cohort were significantly enriched for AIS patients without AIS diagnosis codes (60-150 fold over expected). CONCLUSIONS Our findings support machine learning algorithms as a generalizable way to accurately identify AIS patients without using process-intensive manual feature curation. When a set of AIS patients is unavailable, diagnosis codes may be used to train classifier models.
Affiliation(s)
- Phyllis M Thangaraj
  - Department of Biomedical Informatics, Columbia University, 622 W 168th St., PH-20, New York, NY, 10032, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Benjamin R Kummer
  - Department of Neurology, Icahn School of Medicine at Mt. Sinai, New York, NY, USA
- Tal Lorberbaum
  - Department of Biomedical Informatics, Columbia University, 622 W 168th St., PH-20, New York, NY, 10032, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Mitchell S V Elkind
  - Department of Neurology, Vagelos College of Physicians and Surgeons, Columbia University, New York, NY, USA
  - Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA
- Nicholas P Tatonetti
  - Department of Biomedical Informatics, Columbia University, 622 W 168th St., PH-20, New York, NY, 10032, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
| |
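The evaluation metrics named in the abstract above (AUROC, F1 score, fold enrichment) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy labels and scores, the 0.5 decision threshold, and the definition of fold enrichment as observed case fraction divided by expected prevalence. None of it is taken from the study's code or data.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """F1 score after thresholding predicted probabilities (threshold is an assumption)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fold_enrichment(hits, cohort_size, expected_prevalence):
    """Observed case fraction in a predicted cohort relative to expected prevalence."""
    return (hits / cohort_size) / expected_prevalence


# Toy data only: 3 cases and 5 controls with made-up predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

print(round(auroc(labels, scores), 3))
print(round(f1_score(labels, scores), 3))
print(round(fold_enrichment(30, 100, 0.005), 1))
```

The pairwise AUROC is quadratic in sample size and is meant only to show the definition; a rank-based implementation would be preferable for EHR-scale data.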
|
10
|
Reducing Bias Due to Outcome Misclassification for Epidemiologic Studies Using EHR-derived Probabilistic Phenotypes. Epidemiology 2020; 31:542-550. [DOI: 10.1097/ede.0000000000001193] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
|
11
|
Allergic Immune Diseases and the Risk of Mortality Among Patients Hospitalized for Acute Infection. Crit Care Med 2020; 47:1735-1742. [PMID: 31599813 DOI: 10.1097/ccm.0000000000004020] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/12/2022]
Abstract
OBJECTIVES The immune response during sepsis remains poorly understood and is likely influenced by the host's preexisting immunologic comorbidities. Although more than 20% of the U.S. population has an allergic-atopic disease, the type 2 immune response that is overactive in these diseases can also mediate beneficial pro-resolving, tissue-repair functions. Thus, the presence of allergic immunologic comorbidities may be advantageous for patients suffering from sepsis. The objective of this study was to test the hypothesis that comorbid type 2 immune diseases confer protection against morbidity and mortality due to acute infection. DESIGN Retrospective cohort study of patients hospitalized with an acute infection between November 2008 and January 2016 using electronic health record data. SETTING Single tertiary-care academic medical center. PATIENTS Admissions to the hospital through the emergency department with likely infection at the time of admission who may or may not have had a type 2 immune-mediated disease, defined as asthma, allergic rhinitis, atopic dermatitis, or food allergy, as determined by International Classification of Diseases, 9th Revision, Clinical Modification codes. INTERVENTIONS None. MEASUREMENTS AND MAIN RESULTS Of 10,789 admissions for infection, 2,578 (24%) had a type 2 disease; these patients were more likely to be female, black, and younger than patients without type 2 diseases. In unadjusted analyses, type 2 patients had decreased odds of dying during the hospitalization (odds ratio, 0.47; 95% CI, 0.38-0.59; p < 0.001), while having more than one type 2 disease conferred a dose-dependent reduction in the risk of mortality (p < 0.001). When adjusting for demographics, medications, types of infection, and illness severity, the presence of a type 2 disease remained protective (odds ratio, 0.55; 95% CI, 0.43-0.70; p < 0.001). Similar results were found using a propensity score analysis (odds ratio, 0.57; 95% CI, 0.45-0.71; p < 0.001). CONCLUSIONS Patients with type 2 diseases admitted with acute infections have reduced mortality, implying that the type 2 immune response is protective in sepsis.
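As a reading aid for the unadjusted odds ratios reported above, the following sketch computes an odds ratio and a Wald (Woolf) 95% confidence interval from a 2x2 table. The abstract does not report death counts, so the counts below are hypothetical, and the function name `odds_ratio_ci` and the choice of the Woolf interval are assumptions rather than the authors' method.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald (Woolf) confidence interval.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Woolf interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study.
or_, lo, hi = odds_ratio_ci(a=20, b=180, c=40, d=160)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

An interval that lies entirely below 1, as in this toy table, is what "decreased odds" means in the abstract; the adjusted and propensity-score estimates would require regression models that this sketch does not attempt.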
|
12
|
Sonabend W A, Cai W, Ahuja Y, Ananthakrishnan A, Xia Z, Yu S, Hong C. Automated ICD coding via unsupervised knowledge integration (UNITE). Int J Med Inform 2020; 139:104135. [PMID: 32361145 DOI: 10.1016/j.ijmedinf.2020.104135] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/19/2019] [Revised: 02/14/2020] [Accepted: 03/26/2020] [Indexed: 12/30/2022]
Abstract
OBJECTIVE Accurate coding is critical for medical billing and electronic medical record (EMR)-based research. Recent research has focused on developing supervised methods to automatically assign International Classification of Diseases (ICD) codes from clinical notes. However, supervised approaches rely on ICD code data stored in the hospital EMR system and are subject to bias arising from practice and coding behavior. Consequently, portability of trained supervised algorithms to external EMR systems may suffer. METHOD We developed an unsupervised knowledge integration (UNITE) algorithm to automatically assign ICD codes for a specific disease by analyzing clinical narrative notes via semantic relevance assessment. The algorithm was validated using coded ICD data for 6 diseases from Partners HealthCare (PHS) Biobank and Medical Information Mart for Intensive Care (MIMIC-III). We compared the performance of UNITE against penalized logistic regression (LR), topic modeling, and neural network models within each EMR system. We additionally evaluated the portability of UNITE by training at PHS Biobank and validating at MIMIC-III, and vice versa. RESULTS UNITE achieved an average AUC of 0.91 at PHS and 0.92 at MIMIC over 6 diseases, comparable to LR and the neural network model (MLP). It had substantially better performance than topic models. With regard to portability, the performance of UNITE was consistent across different EMR systems and superior to LR, topic models, and neural network models. CONCLUSION UNITE accurately assigns ICD codes in EMRs without requiring human labor and has major advantages over commonly used machine learning approaches. In addition, UNITE attained stable performance and high portability across EMRs in different institutions.
Affiliation(s)
- Aaron Sonabend W
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | | | - Yuri Ahuja
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Ashwin Ananthakrishnan
- Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, USA
| | - Zongqi Xia
- Department of Neurology and Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Sheng Yu
- Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China
| | - Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| |
|
13
|
Weng C, Shah N, Hripcsak G. Call for papers: Deep phenotyping for Precision Medicine. J Biomed Inform 2018; 87:66-67. [DOI: 10.1016/j.jbi.2018.09.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/26/2018] [Accepted: 09/28/2018] [Indexed: 11/26/2022]
|
14
|
Cai T, Zhang Y, Ho YL, Link N, Sun J, Huang J, Cai TA, Damrauer S, Ahuja Y, Honerlaw J, Huang J, Costa L, Schubert P, Hong C, Gagnon D, Sun YV, Gaziano JM, Wilson P, Cho K, Tsao P, O’Donnell CJ, Liao KP. Association of Interleukin 6 Receptor Variant With Cardiovascular Disease Effects of Interleukin 6 Receptor Blocking Therapy: A Phenome-Wide Association Study. JAMA Cardiol 2018; 3:849-857. [PMID: 30090940 PMCID: PMC6233652 DOI: 10.1001/jamacardio.2018.2287] [Citation(s) in RCA: 73] [Impact Index Per Article: 10.4] [Received: 03/30/2018] [Accepted: 06/13/2018] [Indexed: 12/30/2022]
Abstract
Importance Electronic health record (EHR) biobanks containing clinical and genomic data on large numbers of individuals have great potential to inform drug discovery. Individuals with interleukin 6 receptor (IL6R) single-nucleotide polymorphisms (SNPs) who are not receiving IL6R blocking therapy have biomarker profiles similar to those treated with IL6R blockers. This gene-drug pair provides an example to test whether associations of IL6R SNPs with a broad range of phenotypes can inform which diseases may benefit from treatment with IL6R blockade. Objective To determine whether screening for clinical associations with the IL6R SNP in a phenome-wide association study (PheWAS) using EHR biobank data can identify drug effects from IL6R clinical trials. Design, Setting, and Participants Diagnosis codes and routine laboratory measurements were extracted from the VA Million Veteran Program (MVP); diagnosis codes were mapped to phenotype groups using published PheWAS methods. A PheWAS was performed by fitting logistic regression models for testing associations of the IL6R SNPs with 1342 phenotype groups and by fitting linear regression models for testing associations of the IL6R SNP with 26 routine laboratory measurements. Significance was reported using a false discovery rate of 0.05 or less. Findings were replicated in 2 independent cohorts using UK Biobank and Vanderbilt University Biobank data. The Million Veteran Program included 332 799 US veterans; the UK Biobank, 408 455 individuals from the general population of the United Kingdom; and the Vanderbilt University Biobank, 13 835 patients from a tertiary care center. Exposures IL6R SNPs (rs2228145; rs4129267). Main Outcomes and Measures Phenotypes defined by International Classification of Diseases, Ninth Revision codes. Results Of the 332 799 veterans included in the main cohort, 305 228 (91.7%) were men, and the mean (SD) age was 66.1 (13.6) years. The IL6R SNP was most strongly associated with a reduced risk of aortic aneurysm phenotypes (odds ratio, 0.87-0.90; 95% CI, 0.84-0.93) in the MVP. We observed known off-target effects of IL6R blockade from clinical trials (eg, higher hemoglobin level). The reduced risk for aortic aneurysms among those with the IL6R SNP in the MVP was replicated in the Vanderbilt University Biobank, and the reduced risk for coronary heart disease was replicated in the UK Biobank. Conclusions and Relevance In this proof-of-concept study, we demonstrated application of the PheWAS using large EHR biobanks to inform drug effects. The findings of an association of the IL6R SNP with reduced risk for aortic aneurysms correspond with the newest indication for IL6R blockade, giant cell arteritis, of which a major complication is aortic aneurysm.
Affiliation(s)
- Tianxi Cai
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard T. H. Chan School of Public Health, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
| | - Yichi Zhang
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard T. H. Chan School of Public Health, Boston, Massachusetts
| | - Yuk-Lam Ho
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
| | - Nicholas Link
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
| | - Jiehuan Sun
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard T. H. Chan School of Public Health, Boston, Massachusetts
| | - Jie Huang
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Brigham and Women’s Hospital, Boston, Massachusetts
| | - Tianrun A. Cai
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
- Brigham and Women’s Hospital, Boston, Massachusetts
| | - Scott Damrauer
- Corporal Michael Crescenz Veterans Affairs Medical Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia
| | - Yuri Ahuja
- Harvard Medical School, Boston, Massachusetts
| | | | - Jie Huang
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
| | - Lauren Costa
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
| | - Petra Schubert
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
| | - Chuan Hong
- Harvard T. H. Chan School of Public Health, Boston, Massachusetts
| | - David Gagnon
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Boston University School of Public Health, Boston, Massachusetts
| | - Yan V. Sun
- Emory University Schools of Medicine and Public Health, Atlanta, Georgia
| | - J. Michael Gaziano
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
- Brigham and Women’s Hospital, Boston, Massachusetts
| | - Peter Wilson
- Emory University Schools of Medicine and Public Health, Atlanta, Georgia
- Atlanta Veterans Affairs Medical Center, Atlanta, Georgia
| | - Kelly Cho
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
- Brigham and Women’s Hospital, Boston, Massachusetts
| | - Philip Tsao
- Veterans Affairs Palo Alto Health Care System, Palo Alto, California
- Department of Medicine, Stanford University School of Medicine, Stanford, California
| | - Christopher J. O’Donnell
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
- Associate Editor, JAMA Cardiology
| | - Katherine P. Liao
- Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
- Brigham and Women’s Hospital, Boston, Massachusetts
| |
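The PheWAS abstract above reports significance at a false discovery rate of 0.05 or less across 1342 phenotype groups. The abstract does not say which FDR procedure was used; the sketch below assumes the Benjamini-Hochberg step-up procedure, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR alpha.

    Step-up rule: find the largest rank k (1-based, ascending p-values) with
    p_(k) <= alpha * k / m, then reject every hypothesis ranked k or lower.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per phenotype group tested against the SNP.
p_values = [0.205, 0.001, 0.041, 0.008, 0.074, 0.039, 0.06, 0.042]
flags = benjamini_hochberg(p_values, alpha=0.05)
significant = [p for p, keep in zip(p_values, flags) if keep]
print(significant)
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold, which is why the loop tracks the largest qualifying rank rather than stopping at the first failure.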
|