1
Lee S, Kelly RS, Chen Y, Waqas M, Mendez KM, Hecker J, Hahn G, Lutz SM, Celedón JC, Clish CB, Litonjua AA, Chen Q, McGeachie M, Choi Y, Weiss ST, Tanzi RE, Lange C, Prokopenko D, Lasky-Su JA. Associations of APOE variants with sphingomyelin and cholesterol metabolites across the life-course in diverse populations. Metabolomics 2025; 21:64. [PMID: 40335834 DOI: 10.1007/s11306-025-02256-w] [Received: 08/06/2024] [Accepted: 04/02/2025] [Indexed: 05/09/2025]
Abstract
INTRODUCTION Two alleles (ε2 and ε4) in the APOE gene are known to be strongly associated with lipid metabolism disorders, such as dyslipidemia, which are important risk factors for the development of cardiovascular diseases. While prior research has largely centered on adult populations, establishing APOE-lipid associations in infants, children, and adolescents, especially those from historically understudied groups, could inform earlier interventions and treatments. OBJECTIVES This study aimed to evaluate the dependence of the metabolome on the APOE variants using five diverse cohorts that span infancy through adulthood, comprising a total of over 190,000 individuals. METHODS We extracted the APOE variants (rs7412 and rs429358) from all cohorts, testing both the ε2 allele (rs7412-T and rs429358-T) and the ε4 allele (rs7412-C and rs429358-C), and evaluated their associations with the global plasma metabolome, which was measured by mass spectrometry-based (Metabolon or Broad Institute) or NMR-based (Nightingale) assays depending on the cohort, using a Bonferroni-corrected significance threshold. RESULTS Among 589 metabolites tested in our discovery population, only six, including sphingomyelins and cholesterol, were significantly associated with the rs7412/ε2 allele. Sphingomyelin (d18:1/22:0) and cholesterol were negatively associated with the ε2 allele (β-value = -0.54 [-0.76, -0.32], p-value = 1.39 × 10⁻⁶; and β-value = -0.55 [-0.77, -0.33], p-value = 1.49 × 10⁻⁶, respectively). These relationships were replicated in the four additional cohorts without heterogeneity. CONCLUSION Our findings support the need for early intervention in lipid levels regardless of age, sex, and ethnicity and further investigations of the APOE variants on risk of various diseases in later life.
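The allele definitions in the METHODS of this abstract can be sketched in a few lines of Python. The ε2 and ε4 haplotypes are taken from the abstract; the ε3 haplotype (rs429358-T with rs7412-C) is the standard definition and is not stated there. The function names, the ASCII labels `e2`/`e3`/`e4`, the family-wise alpha of 0.05, and the choice of 589 as the number of tests are illustrative assumptions, since the abstract only says that a Bonferroni-corrected threshold was used.

```python
# Haplotype (rs429358 allele, rs7412 allele) -> APOE allele label.
# e2 and e4 follow the abstract; e3 is the standard definition (assumption
# relative to the abstract, which does not mention it).
APOE_HAPLOTYPES = {
    ("T", "T"): "e2",
    ("T", "C"): "e3",
    ("C", "C"): "e4",
}


def apoe_allele(rs429358: str, rs7412: str) -> str:
    """Map one phased haplotype to an APOE allele label (hypothetical helper)."""
    key = (rs429358.upper(), rs7412.upper())
    # The remaining combination (C, T) is very rare and is left unlabeled here.
    return APOE_HAPLOTYPES.get(key, "unknown")


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed inputs: alpha = 0.05 and one test per metabolite in the discovery set.
threshold = bonferroni_threshold(0.05, 589)
print(apoe_allele("T", "T"), apoe_allele("C", "C"))
print(f"Bonferroni threshold: {threshold:.2e}")
```

Under these assumptions both reported p-values (1.39 × 10⁻⁶ and 1.49 × 10⁻⁶) fall below the per-test threshold; the study's actual threshold may have been computed differently.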
Grants
- R01HL169300 the National Heart, Lung, and Blood Institute
- R01MH129337 the National Heart, Lung, and Blood Institute
- P01 HL132825 NHLBI NIH HHS
Affiliation(s)
- Sanghun Lee
  - Department of Medical Consilience, Division of Medicine, Graduate School, Dankook University, Yongin-si, South Korea
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Rachel S Kelly
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Yulu Chen
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Mohammad Waqas
  - Genetics and Aging Research Unit, McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Kevin M Mendez
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Julian Hecker
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Georg Hahn
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Sharon M Lutz
  - Department of Population Medicine, Harvard Medical School, Harvard Pilgrim Healthcare Institute, Boston, MA, USA
- Juan C Celedón
  - Division of Pediatric Pulmonary, Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
- Clary B Clish
  - Metabolomics Platform, Broad Institute, Cambridge, MA, USA
- Augusto A Litonjua
  - Division of Pediatric Pulmonary Medicine, Golisano Children's Hospital at Strong, University of Rochester Medical Center, Rochester, NY, USA
- Qingwen Chen
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Michael McGeachie
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Younjung Choi
  - Genetics and Aging Research Unit, McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Scott T Weiss
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Rudolph E Tanzi
  - Genetics and Aging Research Unit, McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Christoph Lange
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Dmitry Prokopenko
  - Genetics and Aging Research Unit, McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Jessica A Lasky-Su
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
2
Shen Y, Wang J, Wang Z, Shi Z, Chen H, Wang Z, Jiang Y, Wang X, Cheng C, Wang X, Zhu H, Ye J. CATI: A medical context-enhanced framework for diagnosis code assignment in the UK Biobank study. Artif Intell Med 2025; 166:103136. [PMID: 40344999 DOI: 10.1016/j.artmed.2025.103136] [Received: 02/07/2024] [Revised: 03/10/2025] [Accepted: 04/15/2025] [Indexed: 05/11/2025]
Abstract
Diagnosis codes are a standard coding format for diseases or medical conditions. This study aims to assign diagnosis codes to patients in large-scale biobanks, particularly addressing the issue of missing codes for some patients. This is crucial for downstream disease-related tasks. While recent methods primarily rely on structured biobank data for code assignment, they often overlook the valuable medical context provided by textual information in the biobanks and by the hierarchical structure of the disease coding system. To address this gap, we have developed CATI, a medical context-enhanced framework for diagnosis Code Assignment by integrating Textual details derived from key features and disease hIerarchy. The study is based on the UK Biobank data and considers Phecodes and ICD-10 codes as standard disease formats. We start by representing ten informative codified features using their formal names and then integrate them into CATI as text embeddings, achieved through prompt tuning on the pre-trained language model BioBERT. Recognizing the hierarchical structure of diagnosis codes, we have developed a novel convolution layer in our method that effectively propagates logits between adjacent diagnosis codes. Evaluation results demonstrate that CATI outperforms existing state-of-the-art methods for both Phecodes and ICD-10 codes, with at least a 5.16% improvement in average AUROC for unseen disease codes and an 8.68% rise in average AUPRC for disease codes whose number of training instances lies in the range (1000, 10000]. This framework contributes to the formation of well-defined cohorts for downstream studies and offers a unique perspective for addressing complex healthcare tasks by incorporating vital medical context.
Affiliation(s)
- Yue Shen
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Jie Wang
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Zhe Wang
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Zhihao Shi
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Hanzhu Chen
  - MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Zheng Wang
  - Alibaba Cloud Computing, Hangzhou, Zhejiang, 310030, China
- Yukang Jiang
  - Department of Radiology, University of North Carolina at Chapel Hill, NC 27599, USA
- Xiaopu Wang
  - School of Management, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Chuandong Cheng
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Xueqin Wang
  - School of Management, University of Science and Technology of China, Hefei, Anhui, 230027, China
- Hongtu Zhu
  - Biomedical Research Imaging Center, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599, USA
- Jieping Ye
  - Alibaba Cloud Computing, Hangzhou, Zhejiang, 310030, China
3
Huerta-Chagoya A, Schroeder P, Mandla R, Li J, Morris L, Vora M, Alkanaq A, Nagy D, Szczerbinski L, Madsen JGS, Bonàs-Guarch S, Mollandin F, Cole JB, Porneala B, Westerman K, Li JH, Pollin TI, Florez JC, Gloyn AL, Carey DJ, Cebola I, Mirshahi UL, Manning AK, Leong A, Udler M, Mercader JM. Rare variant analyses in 51,256 type 2 diabetes cases and 370,487 controls reveal the pathogenicity spectrum of monogenic diabetes genes. Nat Genet 2024; 56:2370-2379. [PMID: 39379762 PMCID: PMC11549050 DOI: 10.1038/s41588-024-01947-9] [Received: 10/31/2023] [Accepted: 09/10/2024] [Indexed: 10/10/2024]
Abstract
Type 2 diabetes (T2D) genome-wide association studies (GWASs) often overlook rare variants as a result of previous imputation panels' limitations and scarce whole-genome sequencing (WGS) data. We used TOPMed imputation and WGS to conduct the largest T2D GWAS meta-analysis involving 51,256 cases of T2D and 370,487 controls, targeting variants with a minor allele frequency as low as 5 × 10⁻⁵. We identified 12 new variants, including a rare African/African American-enriched enhancer variant near the LEP gene (rs147287548), associated with fourfold increased T2D risk. We also identified a rare missense variant in HNF4A (p.Arg114Trp), associated with eightfold increased T2D risk, previously reported in maturity-onset diabetes of the young with reduced penetrance, but observed here in a T2D GWAS. We further leveraged these data to analyze 1,634 ClinVar variants in 22 genes related to monogenic diabetes, identifying two additional rare variants in HNF1A and GCK associated with fivefold and eightfold increased T2D risk, respectively, the effects of which were modified by the individual's polygenic risk score. For 21% of the variants with conflicting interpretations or uncertain significance in ClinVar, we provided support of being benign based on their lack of association with T2D. Our work provides a framework for using rare variant GWASs to identify large-effect variants and assess variant pathogenicity in monogenic diabetes genes.
Affiliation(s)
- Alicia Huerta-Chagoya
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Philip Schroeder
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Ravi Mandla
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
  - Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, PA, USA
- Jiang Li
  - Department of Genomic Health, Geisinger, Danville, PA, USA
- Lowri Morris
  - Section of Genetics and Genomics, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK
- Maheak Vora
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Ahmed Alkanaq
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Dorka Nagy
  - Section of Genetics and Genomics, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK
  - National Heart and Lung Institute, Faculty of Medicine, London, UK
- Lukasz Szczerbinski
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
  - Department of Endocrinology, Diabetology and Internal Medicine, Medical University of Bialystok, Bialystok, Poland
  - Clinical Research Centre, Medical University of Bialystok, Bialystok, Poland
- Jesper G S Madsen
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark
- Silvia Bonàs-Guarch
  - Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Diabetes y Enfermedades Metabólicas Asociadas, Madrid, Spain
  - Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK
- Fanny Mollandin
  - Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Diabetes y Enfermedades Metabólicas Asociadas, Madrid, Spain
- Joanne B Cole
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Division of Endocrinology, Boston Children's Hospital, Boston, MA, USA
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Bianca Porneala
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA
- Kenneth Westerman
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Josephine H Li
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Toni I Pollin
  - University of Maryland, School of Medicine, Baltimore, MD, USA
- Jose C Florez
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Endocrine Division, Massachusetts General Hospital, Boston, MA, USA
- Anna L Gloyn
  - Department of Pediatrics, Division of Endocrinology, Stanford School of Medicine, Stanford, CA, USA
- David J Carey
  - Department of Genomic Health, Geisinger, Danville, PA, USA
- Inês Cebola
  - Section of Genetics and Genomics, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK
- Alisa K Manning
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Clinical and Translational Epidemiology Unit, Massachusetts General Hospital, Boston, MA, USA
- Aaron Leong
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Endocrine Division, Massachusetts General Hospital, Boston, MA, USA
- Miriam Udler
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Josep M Mercader
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
4
Liew JW, Treu T, Park Y, Ferguson JM, Rosser MA, Ho YL, Gagnon DR, Stovall R, Monach P, Heckbert SR, Gensler LS, Liao KP, Dubreuil M. The association of TNF inhibitor use with incident cardiovascular events in radiographic axial spondyloarthritis. Semin Arthritis Rheum 2024; 68:152482. [PMID: 38865875 PMCID: PMC11381167 DOI: 10.1016/j.semarthrit.2024.152482] [Received: 01/26/2024] [Revised: 04/02/2024] [Accepted: 05/20/2024] [Indexed: 06/14/2024]
Abstract
BACKGROUND Whether tumor necrosis factor inhibitor (TNFi) use is cardioprotective among individuals with radiographic axial spondyloarthritis (r-axSpA), who have heightened cardiovascular (CV) risk, is unclear. We tested the association of TNFi use with incident CV outcomes in r-axSpA. METHODS We identified an r-axSpA cohort within a Veterans Affairs database between 2002 and 2019 using novel phenotyping methods and secondarily using ICD codes. TNFi use was assessed as a time-varying exposure using pharmacy dispensing records. The primary outcome was incident CV disease (CVD) identified using ICD codes for coronary artery disease, myocardial infarction or stroke. We fit Cox models with inverse probability weights to estimate the risk of each outcome with TNFi use versus non-use. Analyses were performed in the overall cohort, and separately in two periods (2002-2010, 2011-2019) to account for secular trends. RESULTS Using phenotyping we identified 26,928 individuals with an r-axSpA diagnosis (mean age 63.4 years, 94% male); at baseline 3633 were TNFi users and 23,295 were non-users. During a mean follow-up of 3.3 ± 4.2 years, 674 (18.6%) TNFi users had incident CVD versus 11,838 (50.8%) non-users. In adjusted analyses, TNFi use versus non-use was associated with lower risk of incident CVD (HR 0.34, 95% CI 0.29-0.40) in the cohort overall, and in the two time periods separately. CONCLUSION In this r-axSpA cohort identified using phenotyping methods, TNFi use versus non-use was associated with a lower risk of incident CVD. These findings provide reassurance regarding the CV safety of TNFi agents for r-axSpA treatment. Replication of these results in other cohorts is needed.
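The crude percentages in the RESULTS of this abstract can be reproduced from the reported counts. The sketch below is plain arithmetic with a hypothetical helper name; the crude risk ratio it prints is unadjusted and is not the hazard ratio of 0.34, which the abstract says comes from Cox models with inverse probability weights and a time-varying exposure.

```python
def crude_incidence(events: int, total: int) -> float:
    """Return the crude incidence proportion as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * events / total


# Counts as reported in the abstract.
tnfi_users = crude_incidence(674, 3633)
non_users = crude_incidence(11838, 23295)

# Unadjusted ratio of the two proportions; ignores follow-up time and confounding.
crude_ratio = tnfi_users / non_users

print(f"TNFi users: {tnfi_users:.1f}%")
print(f"Non-users:  {non_users:.1f}%")
print(f"Crude risk ratio: {crude_ratio:.2f}")
```

The two baseline groups also sum to the reported cohort size (3633 + 23,295 = 26,928).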
Affiliation(s)
- Jean W Liew
  - Section of Rheumatology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Timothy Treu
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, USA
- Yojin Park
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, USA
- Jacqueline M Ferguson
  - Center for Innovation to Implementation, Veterans Affairs Palo Alto Health Care System, Menlo Park, CA, USA
- Morgan A Rosser
  - Duke University, Department of Anesthesiology, Durham, NC, USA
- Yuk-Lam Ho
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, USA
- David R Gagnon
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, USA; Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Rachael Stovall
  - Division of Rheumatology, University of Washington, Seattle, WA, USA
- Paul Monach
  - Rheumatology Section, VA Boston Healthcare System, Boston, MA, USA
- Susan R Heckbert
  - Department of Epidemiology, University of Washington, Seattle, WA, USA
- Lianne S Gensler
  - Division of Rheumatology, Department of Medicine, University of California San Francisco, San Francisco, CA, USA
- Katherine P Liao
  - Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA, VA Boston Healthcare System, Boston, MA, USA; Section of Rheumatology, VA Boston Healthcare System; Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, Boston, MA, USA
- Maureen Dubreuil
  - Section of Rheumatology, Boston University Chobanian & Avedisian School of Medicine, VA Boston Healthcare System, Boston, MA, USA
5
Stavers-Sosa I, Cronkite DJ, Gerstley LD, Kelley A, Kiel L, Kline-Simon AH, Marafino BJ, Ramaprasan A, Carrell DS, Hirschtritt ME. Protocol for Designing a Model to Predict the Likelihood of Psychosis From Electronic Health Records Using Natural Language Processing and Machine Learning. Perm J 2024; 28:23-36. [PMID: 39219312 PMCID: PMC11404646 DOI: 10.7812/tpp/23.139] [Indexed: 09/04/2024]
Abstract
INTRODUCTION Rapid identification of individuals developing a psychotic spectrum disorder (PSD) is crucial because untreated psychosis is associated with poor outcomes and decreased treatment response. Lack of recognition of early psychotic symptoms often delays diagnosis, further worsening these outcomes. METHODS The proposed study is a cross-sectional, retrospective analysis of electronic health record data including clinician documentation and patient-clinician secure messages for patients aged 15-29 years with ≥ 1 primary care encounter between 2017 and 2019 within 2 Kaiser Permanente regions. Patients with new-onset PSD will be distinguished from those without a diagnosis if they have ≥ 1 PSD diagnosis within 12 months following the primary care encounter. The prediction model will be trained using a trisourced natural language processing feature extraction design and validated both within each region separately and in a modified combined sample. DISCUSSION This proposed model leverages the strengths of the large volume of patient-specific data from an integrated electronic health record with natural language processing to identify patients at elevated chance of developing a PSD. This project carries the potential to reduce the duration of untreated psychosis and thereby improve long-term patient outcomes.
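The outcome definition in the METHODS of this abstract (at least one PSD diagnosis within 12 months following the primary care encounter) can be sketched as a labeling rule. This is an illustration only: the function name, the 365-day reading of "12 months", and the choice to exclude diagnoses dated on the encounter day itself are assumptions not stated in the abstract.

```python
from datetime import date, timedelta
from typing import Iterable


def is_new_onset_psd(
    encounter: date,
    psd_diagnosis_dates: Iterable[date],
    window_days: int = 365,  # assumed reading of "12 months"
) -> bool:
    """Label a patient positive if any PSD diagnosis falls in the follow-up window."""
    window_end = encounter + timedelta(days=window_days)
    # Assumption: the window opens the day after the index encounter.
    return any(encounter < d <= window_end for d in psd_diagnosis_dates)


index_visit = date(2018, 3, 1)
print(is_new_onset_psd(index_visit, [date(2018, 9, 15)]))
print(is_new_onset_psd(index_visit, [date(2019, 6, 1)]))
```

A patient with no recorded PSD diagnosis is passed an empty list and is labeled negative.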
Affiliation(s)
- Icelini Stavers-Sosa
  - Department of Psychiatry, Kaiser Permanente Oakland Medical Center, Oakland, CA, USA
  - Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, USA
- David J Cronkite
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Lawrence D Gerstley
  - Division of Research, Kaiser Permanente Northern California, Oakland, CA, USA
- Ann Kelley
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Linda Kiel
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Ben J Marafino
  - Division of Research, Kaiser Permanente Northern California, Oakland, CA, USA
- Arvind Ramaprasan
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- David S Carrell
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Matthew E Hirschtritt
  - Department of Psychiatry, Kaiser Permanente Oakland Medical Center, Oakland, CA, USA
  - Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, USA
  - Division of Research, Kaiser Permanente Northern California, Oakland, CA, USA
6
Nogues IE, Wen J, Zhao Y, Bonzel CL, Castro VM, Lin Y, Xu S, Hou J, Cai T. Semi-supervised Double Deep Learning Temporal Risk Prediction (SeDDLeR) with Electronic Health Records. J Biomed Inform 2024; 157:104685. [PMID: 39004109 DOI: 10.1016/j.jbi.2024.104685] [Received: 02/28/2024] [Revised: 05/11/2024] [Accepted: 06/25/2024] [Indexed: 07/16/2024]
Abstract
BACKGROUND Risk prediction plays a crucial role in planning for prevention, monitoring, and treatment. Electronic Health Records (EHRs) offer an expansive repository of temporal medical data encompassing both risk factors and outcome indicators essential for effective risk prediction. However, challenges emerge due to the lack of readily available gold-standard outcomes and the complex effects of various risk factors. Compounding these challenges are the false positives in diagnosis codes and the formidable task of pinpointing the onset timing in annotations. OBJECTIVE We develop a Semi-supervised Double Deep Learning Temporal Risk Prediction (SeDDLeR) algorithm based on extensive unlabeled longitudinal EHR data augmented by a limited set of gold-standard labels on the binary status information indicating whether the clinical event of interest occurred during the follow-up period. METHODS The SeDDLeR algorithm calculates an individualized risk of developing future clinical events over time using each patient's baseline EHR features via the following steps: (1) construction of an initial EHR-derived surrogate as a proxy for the onset status; (2) deep learning calibration of the surrogate along gold-standard onset status; and (3) semi-supervised deep learning for risk prediction combining calibrated surrogates and gold-standard onset status. To account for missing onset time and heterogeneous follow-up, we introduce temporal kernel weighting. We devise a Gated Recurrent Units (GRUs) module to capture temporal characteristics. We subsequently assess our proposed SeDDLeR method in simulation studies and apply the method to the Massachusetts General Brigham (MGB) Biobank to predict type 2 diabetes (T2D) risk. RESULTS SeDDLeR outperforms benchmark risk prediction methods, including Semi-parametric Transformation Model (STM) and DeepHit, with consistently best accuracy across experiments. SeDDLeR achieved the best C-statistic (0.815, SE 0.023; vs STM +0.084, SE 0.030, P-value 0.004; vs DeepHit +0.055, SE 0.027, P-value 0.024) and best average time-specific AUC (0.778, SE 0.022; vs STM +0.059, SE 0.039, P-value 0.067; vs DeepHit +0.168, SE 0.032, P-value <0.001) in the MGB T2D study. CONCLUSION SeDDLeR can train robust risk prediction models in both real-world EHR and synthetic datasets with minimal requirements of labeling event times. It holds the potential to be incorporated for future clinical trial recruitment or clinical decision-making.
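The C-statistic reported in these RESULTS measures how well predicted risks rank patients by time to event. The sketch below implements Harrell's concordance index, a common estimator of that quantity; the abstract does not say which estimator the authors used, so this choice, along with the function name and the toy inputs, is an assumption for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ranked correctly by predicted risk.

    times  -- observed follow-up times
    events -- 1 if the event occurred at that time, 0 if censored
    risks  -- predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times and pairs whose earlier subject was censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the highest-risk patient has the earliest event.
print(concordance_index([1, 2, 3], [1, 1, 0], [0.9, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.815 should be read.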
Affiliation(s)
- Jun Wen
  - Department of Biomedical Informatics, Harvard Medical School, United States of America
- Yihan Zhao
  - Harvard College, Harvard University, United States of America
- Clara-Lea Bonzel
  - Department of Biomedical Informatics, Harvard Medical School, United States of America
- Victor M Castro
  - Research Information Science and Computing, Mass General Brigham Healthcare, United States of America
- Yucong Lin
  - Institute of Engineering Medicine, Beijing Institute of Technology, China
- Shike Xu
  - Department of Statistics, University of Connecticut, United States of America
- Jue Hou
  - Division of Biostatistics, School of Public Health, University of Minnesota, United States of America
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, United States of America; Department of Biomedical Informatics, Harvard Medical School, United States of America
7
Lisik D, Milani GP, Salisu M, Özuygur Ermis SS, Goksör E, Basna R, Wennergren G, Kankaanranta H, Nwaru BI. Machine learning-derived phenotypic trajectories of asthma and allergy in children and adolescents: protocol for a systematic review. BMJ Open 2024; 14:e080263. [PMID: 39214659 PMCID: PMC11367367 DOI: 10.1136/bmjopen-2023-080263] [Received: 09/26/2023] [Accepted: 08/07/2024] [Indexed: 09/04/2024] Open
Abstract
INTRODUCTION Development of asthma and allergies in childhood/adolescence commonly follows a sequential progression termed the 'atopic march'. Recent reports indicate, however, that these diseases are composed of multiple distinct phenotypes, with possibly differential trajectories. We aim to synthesise the current literature in the field of machine learning-based trajectory studies of asthma/allergies in children and adolescents, summarising the frequency, characteristics and associated risk factors and outcomes of identified trajectories and indicating potential directions for subsequent research in replicability, pathophysiology, risk stratification and personalised management. Furthermore, methodological approaches and quality will be critically appraised, highlighting trends, limitations and future perspectives. METHODS AND ANALYSES Ten databases (CAB Direct, CINAHL, Embase, Google Scholar, PsycInfo, PubMed, Scopus, Web of Science, WHO Global Index Medicus and WorldCat Dissertations and Theses) will be searched for observational studies (including conference abstracts and grey literature) from the last 10 years (2013-2023) without restriction by language. Screening, data extraction and assessment of quality and risk of bias (using a custom-developed tool) will be performed independently in pairs. The characteristics of the derived trajectories will be narratively synthesised, tabulated and visualised in figures. Risk factors and outcomes associated with the trajectories will be summarised, and pooled estimates will be produced from comparable numerical data through random-effects meta-analysis. Methodological approaches will be narratively synthesised and presented in tables and figures to visualise trends. ETHICS AND DISSEMINATION Ethical approval is not warranted as no patient-level data will be used. The findings will be published in an international peer-reviewed journal. PROSPERO REGISTRATION NUMBER CRD42023441691.
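The random-effects pooling mentioned in METHODS AND ANALYSES can be illustrated with the DerSimonian-Laird estimator. The protocol abstract does not name an estimator, so DerSimonian-Laird is an assumption here, and the function name and input numbers are made up for illustration only.

```python
def dersimonian_laird(effects, variances):
    """Pool study effects under a random-effects model (DerSimonian-Laird).

    effects   -- per-study effect estimates (e.g. log odds ratios)
    variances -- per-study within-study variances
    Returns (pooled_effect, standard_error, tau_squared).
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_w
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total_w - sum(w * w for w in weights) / total_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = (1.0 / sum(re_weights)) ** 0.5
    return pooled, se, tau2


# Illustrative inputs only, not data from any study.
pooled, se, tau2 = dersimonian_laird([0.0, 4.0], [1.0, 1.0])
print(pooled, se, tau2)
```

When the studies agree closely, the between-study variance is estimated as zero and the result coincides with the fixed-effect inverse-variance estimate.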
Affiliation(s)
- Daniil Lisik
  - Krefting Research Centre, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Gregorio Paolo Milani
  - Department of Clinical Science and Community Health, University of Milan, Milan, Italy
  - Pediatric Unit, Ospedale Maggiore Policlinico, Milano, Italy
- Michael Salisu
  - Krefting Research Centre, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Saliha Selin Özuygur Ermis
  - Krefting Research Centre, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Emma Goksör
  - Department of Pediatrics, University of Gothenburg Sahlgrenska Academy, Gothenburg, Sweden
- Rani Basna
  - Krefting Research Centre, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
  - Department of Clinical Sciences, Lund University, Lund, Sweden
- Göran Wennergren
  - Krefting Research Centre, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
  - Department of Pediatrics, University of Gothenburg Sahlgrenska Academy, Gothenburg, Sweden
- Hannu Kankaanranta
  - Krefting Research Centre, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
  - Tampere University Respiratory Research Group, Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- Bright I Nwaru
  - Krefting Research Centre, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
  - Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg, Sweden
|
8
|
Carrell DS, Floyd JS, Gruber S, Hazlehurst BL, Heagerty PJ, Nelson JC, Williamson BD, Ball R. A general framework for developing computable clinical phenotype algorithms. J Am Med Inform Assoc 2024; 31:1785-1796. [PMID: 38748991 PMCID: PMC11258420 DOI: 10.1093/jamia/ocae121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 05/07/2024] [Accepted: 05/14/2024] [Indexed: 07/20/2024]
Abstract
OBJECTIVE To present a general framework providing high-level guidance to developers of computable algorithms for identifying patients with specific clinical conditions (phenotypes) through a variety of approaches, including but not limited to machine learning and natural language processing methods to incorporate rich electronic health record data. MATERIALS AND METHODS Drawing on extensive prior phenotyping experiences and insights derived from 3 algorithm development projects conducted specifically for this purpose, our team with expertise in clinical medicine, statistics, informatics, pharmacoepidemiology, and healthcare data science methods conceptualized stages of development and corresponding sets of principles, strategies, and practical guidelines for improving the algorithm development process. RESULTS We propose 5 stages of algorithm development and corresponding principles, strategies, and guidelines: (1) assessing fitness-for-purpose, (2) creating gold standard data, (3) feature engineering, (4) model development, and (5) model evaluation. DISCUSSION AND CONCLUSION This framework is intended to provide practical guidance and serve as a basis for future elaboration and extension.
Affiliation(s)
- David S Carrell
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- James S Floyd
  - Department of Medicine, School of Medicine, University of Washington, Seattle, WA 98195, United States
  - Department of Epidemiology, School of Public Health, University of Washington, Seattle, WA 98195, United States
- Susan Gruber
  - Putnam Data Sciences, LLC, Cambridge, MA 02139, United States
- Brian L Hazlehurst
  - Center for Health Research, Kaiser Permanente Northwest, Portland, OR 97227, United States
- Patrick J Heagerty
  - Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA 98195, United States
- Jennifer C Nelson
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Brian D Williamson
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Robert Ball
  - Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States
|
9
|
Gao J, Bonzel CL, Hong C, Varghese P, Zakir K, Gronsbell J. Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms. J Am Med Inform Assoc 2024; 31:640-650. [PMID: 38128118 PMCID: PMC10873838 DOI: 10.1093/jamia/ocad226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 09/22/2023] [Accepted: 11/20/2023] [Indexed: 12/23/2023]
Abstract
OBJECTIVE High-throughput phenotyping will accelerate the use of electronic health records (EHRs) for translational research. A critical roadblock is the extensive medical supervision required for phenotyping algorithm (PA) estimation and evaluation. To address this challenge, numerous weakly-supervised learning methods have been proposed. However, there is a paucity of methods for reliably evaluating the predictive performance of PAs when a very small proportion of the data is labeled. To fill this gap, we introduce a semi-supervised approach (ssROC) for estimation of the receiver operating characteristic (ROC) parameters of PAs (eg, sensitivity, specificity). MATERIALS AND METHODS ssROC uses a small labeled dataset to nonparametrically impute missing labels. The imputations are then used for ROC parameter estimation to yield more precise estimates of PA performance relative to classical supervised ROC analysis (supROC) using only labeled data. We evaluated ssROC with synthetic, semi-synthetic, and EHR data from Mass General Brigham (MGB). RESULTS ssROC produced ROC parameter estimates with minimal bias and significantly lower variance than supROC in the simulated and semi-synthetic data. For the 5 PAs from MGB, the estimates from ssROC are 30% to 60% less variable than supROC on average. DISCUSSION ssROC enables precise evaluation of PA performance without demanding large volumes of labeled data. ssROC is also easily implementable in open-source R software. CONCLUSION When used in conjunction with weakly-supervised PAs, ssROC facilitates the reliable and streamlined phenotyping necessary for EHR-based research.
Affiliation(s)
- Jianhui Gao
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Clara-Lea Bonzel
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Chuan Hong
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States
- Paul Varghese
  - Health Informatics, Verily Life Sciences, Cambridge, MA, United States
- Karim Zakir
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
  - Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
|
10
|
Smith JC, Williamson BD, Cronkite DJ, Park D, Whitaker JM, McLemore MF, Osmanski JT, Winter R, Ramaprasan A, Kelley A, Shea M, Wittayanukorn S, Stojanovic D, Zhao Y, Toh S, Johnson KB, Aronoff DM, Carrell DS. Data-driven automated classification algorithms for acute health conditions: applying PheNorm to COVID-19 disease. J Am Med Inform Assoc 2024; 31:574-582. [PMID: 38109888 PMCID: PMC10873852 DOI: 10.1093/jamia/ocad241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 10/19/2023] [Accepted: 11/27/2023] [Indexed: 12/20/2023]
Abstract
OBJECTIVES Automated phenotyping algorithms can reduce development time and operator dependence compared to manually developed algorithms. One such approach, PheNorm, has performed well for identifying chronic health conditions, but its performance for acute conditions is largely unknown. Herein, we implement and evaluate PheNorm applied to symptomatic COVID-19 disease to investigate its potential feasibility for rapid phenotyping of acute health conditions. MATERIALS AND METHODS PheNorm is a general-purpose automated approach to creating computable phenotype algorithms based on natural language processing, machine learning, and (low cost) silver-standard training labels. We applied PheNorm to cohorts of potential COVID-19 patients from 2 institutions and used gold-standard manual chart review data to investigate the impact on performance of alternative feature engineering options and implementing externally trained models without local retraining. RESULTS Models achieved AUC, sensitivity, and positive predictive value of 0.853, 0.879, and 0.851 at one institution and 0.804, 0.976, and 0.885 at the other, at quantiles of model-predicted risk that maximize F1. We report performance metrics for all combinations of silver labels, feature engineering options, and models trained internally versus externally. DISCUSSION Phenotyping algorithms developed using PheNorm performed well at both institutions. Performance varied with different silver-standard labels and feature engineering options. Models developed locally at one site also worked well when implemented externally at the other site. CONCLUSION PheNorm models successfully identified an acute health condition, symptomatic COVID-19. The simplicity of the PheNorm approach allows it to be applied at multiple study sites with substantially reduced overhead compared to traditional approaches.
Affiliation(s)
- Joshua C Smith
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Brian D Williamson
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- David J Cronkite
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Daniel Park
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Jill M Whitaker
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Michael F McLemore
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Joshua T Osmanski
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Robert Winter
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Arvind Ramaprasan
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Ann Kelley
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Mary Shea
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Saranrat Wittayanukorn
  - Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States
- Danijela Stojanovic
  - Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States
- Yueqin Zhao
  - Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States
- Sengwee Toh
  - Harvard Pilgrim Health Care Institute, Boston, MA 02215, United States
- Kevin B Johnson
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
- David M Aronoff
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, United States
- David S Carrell
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
|
11
|
Marelli AJ, Li C, Liu A, Nguyen H, Moroz H, Brophy JM, Guo L, Buckeridge DL, Tang J, Yang AY, Li Y. Machine Learning Informed Diagnosis for Congenital Heart Disease in Large Claims Data Source. JACC Adv 2024; 3:100801. [PMID: 38939385 PMCID: PMC11198709 DOI: 10.1016/j.jacadv.2023.100801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2022] [Revised: 08/10/2023] [Accepted: 10/20/2023] [Indexed: 06/29/2024]
Abstract
Background With increasing interest in using large claims databases in medical practice and research, efficiently identifying patients with the disease of interest is a meaningful and essential step. Objectives This study aims to establish a machine learning (ML) approach to identify patients with congenital heart disease (CHD) in large claims databases. Methods We harnessed data from the Quebec claims and hospitalization databases from 1983 to 2000. The study included 19,187 patients. Of them, 3,784 were labeled as true CHD patients using a clinician-developed algorithm with manual audits, considered the gold standard. To establish an accurate ML-empowered automated CHD classification system, we evaluated ML methods including Gradient Boosting Decision Tree, Support Vector Machine, and Decision Tree, and compared them to regularized logistic regression. The Area Under the Precision Recall Curve was used as the evaluation metric. External validation was conducted with a data set updated to 2010 with different subjects. Results Among the ML methods we evaluated, Gradient Boosting Decision Tree led the performance in identifying true CHD patients with 99.3% Area Under the Precision Recall Curve, 98.0% sensitivity, and 99.7% specificity. External validation returned similar statistics on model performance. Conclusions This study shows that a tedious and time-consuming clinical inspection for CHD patient identification can be replaced by an extremely efficient ML algorithm in large claims databases. Our findings demonstrate that ML methods can be used to automate complicated algorithms to identify patients with complex diseases.
Affiliation(s)
- Ariane J. Marelli
  - McGill University Health Centre, McGill Adult Unit for Congenital Heart Disease Excellence, Montreal, Québec, Canada
- Chao Li
  - McGill University Health Centre, McGill Adult Unit for Congenital Heart Disease Excellence, Montreal, Québec, Canada
- Aihua Liu
  - McGill University Health Centre, McGill Adult Unit for Congenital Heart Disease Excellence, Montreal, Québec, Canada
- Hanh Nguyen
  - McGill University Health Centre, McGill Adult Unit for Congenital Heart Disease Excellence, Montreal, Québec, Canada
- Harry Moroz
  - McGill University Health Centre, McGill Adult Unit for Congenital Heart Disease Excellence, Montreal, Québec, Canada
- James M. Brophy
  - Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Montreal, Québec, Canada
- Liming Guo
  - McGill University Health Centre, McGill Adult Unit for Congenital Heart Disease Excellence, Montreal, Québec, Canada
- David L. Buckeridge
  - Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Montreal, Québec, Canada
- Jian Tang
  - Department of Decision Sciences HEC, Université de Montréal, Montreal, Québec, Canada
- Archer Y. Yang
  - Department of Mathematics and Statistics, McGill University, Montreal, Québec, Canada
- Yue Li
  - School of Computer Science, McGill University, Montreal, Québec, Canada
|
12
|
Abbas A, Lee M, Shanavas N, Kovatchev V. Clinical concept annotation with contextual word embedding in active transfer learning environment. Digit Health 2024; 10:20552076241308987. [PMID: 39711738 PMCID: PMC11660282 DOI: 10.1177/20552076241308987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Accepted: 12/04/2024] [Indexed: 12/24/2024]
Abstract
Objective The study aims to present an active learning approach that automatically extracts clinical concepts from unstructured data and classifies them into explicit categories such as Problem, Treatment, and Test while preserving high precision and recall, and to demonstrate the approach through experiments using i2b2 public datasets. Methods Initially, labeled data are acquired from a lexical-based approach in sufficient amounts to perform an active learning process. A contextual word embedding similarity approach is adopted using BERT base variant models such as ClinicalBERT, DistilBERT, and SCIBERT to automatically classify the unlabeled clinical concepts into explicit categories. Additionally, deep learning models and large language models (LLMs) are trained on the labeled data acquired through active learning. Results Using i2b2 datasets (426 clinical notes), the lexical-based method achieved precision, recall, and F1-scores of 76%, 70%, and 73%. SCIBERT excelled in active transfer learning, yielding precision of 70.84%, recall of 77.40%, F1-score of 73.97%, and accuracy of 69.30%, surpassing counterpart models. Among deep learning models, convolutional neural networks (CNNs) trained with embeddings (BERTBase, DistilBERT, SCIBERT, ClinicalBERT) achieved training accuracies of 92-95% and testing accuracies of 89-93%. These results were higher than those of other deep learning models. Additionally, we individually evaluated these LLMs; among them, ClinicalBERT achieved the highest performance, with a training accuracy of 98.4% and a testing accuracy of 96%. Conclusions The proposed methodology enhances clinical concept extraction by integrating active learning and models like SCIBERT and CNN. It improves annotation efficiency while maintaining high accuracy, showcasing potential for clinical applications.
Affiliation(s)
- Asim Abbas
  - School of Computer Science, University of Birmingham, Birmingham, UK
- Mark Lee
  - School of Computer Science, University of Birmingham, Birmingham, UK
- Niloofer Shanavas
  - School of Computer Science, University of Birmingham, Abu Dhabi, United Arab Emirates
- Venelin Kovatchev
  - School of Computer Science, University of Birmingham, Birmingham, UK
|
13
|
Alsentzer E, Rasmussen MJ, Fontoura R, Cull AL, Beaulieu-Jones B, Gray KJ, Bates DW, Kovacheva VP. Zero-shot interpretable phenotyping of postpartum hemorrhage using large language models. NPJ Digit Med 2023; 6:212. [PMID: 38036723 PMCID: PMC10689487 DOI: 10.1038/s41746-023-00957-x] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 06/02/2023] [Accepted: 11/01/2023] [Indexed: 12/02/2023]
Abstract
Many areas of medicine would benefit from deeper, more accurate phenotyping, but there are limited approaches for phenotyping using clinical notes without substantial annotated data. Large language models (LLMs) have demonstrated immense potential to adapt to novel tasks with no additional training by specifying task-specific instructions. Here we report the performance of a publicly available LLM, Flan-T5, in phenotyping patients with postpartum hemorrhage (PPH) using discharge notes from electronic health records (n = 271,081). The language model achieves strong performance in extracting 24 granular concepts associated with PPH. Identifying these granular concepts accurately allows the development of interpretable, complex phenotypes and subtypes. The Flan-T5 model achieves high fidelity in phenotyping PPH (positive predictive value of 0.95), identifying 47% more patients with this complication compared to the current standard of using claims codes. This LLM pipeline can be used reliably for subtyping PPH and outperforms a claims-based approach on the three most common PPH subtypes associated with uterine atony, abnormal placentation, and obstetric trauma. The advantage of this approach to subtyping is its interpretability, as each concept contributing to the subtype determination can be evaluated. Moreover, as definitions may change over time due to new guidelines, using granular concepts to create complex phenotypes enables prompt and efficient updating of the algorithm. Using this language modelling approach enables rapid phenotyping without the need for any manually annotated training data across multiple clinical use cases.
Affiliation(s)
- Emily Alsentzer
  - Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, USA
- Matthew J Rasmussen
  - Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Romy Fontoura
  - Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Alexis L Cull
  - Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Brett Beaulieu-Jones
  - Section of Biomedical Data Science, Department of Medicine, University of Chicago, Chicago, IL, USA
- Kathryn J Gray
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Division of Maternal-Fetal Medicine, Brigham and Women's Hospital, Boston, MA, USA
- David W Bates
  - Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Health Care Policy and Management, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Vesela P Kovacheva
  - Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Boston, MA, USA
|
14
|
Srinivasan S, Wu P, Mercader JM, Udler MS, Porneala BC, Bartz TM, Floyd JS, Sitlani C, Guo X, Haessler J, Kooperberg C, Liu J, Ahmad S, van Duijn C, Liu CT, Goodarzi MO, Florez JC, Meigs JB, Rotter JI, Rich SS, Dupuis J, Leong A. A Type 1 Diabetes Polygenic Score Is Not Associated With Prevalent Type 2 Diabetes in Large Population Studies. J Endocr Soc 2023; 7:bvad123. [PMID: 37841955 PMCID: PMC10576255 DOI: 10.1210/jendso/bvad123] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/30/2023] [Indexed: 10/17/2023]
Abstract
Context Both type 1 diabetes (T1D) and type 2 diabetes (T2D) have significant genetic contributions to risk, and understanding their overlap can offer clinical insight. Objective We examined whether a T1D polygenic score (PS) was associated with a diagnosis of T2D in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium. Methods We constructed a T1D PS using 79 known single nucleotide polymorphisms associated with T1D risk. We analyzed 13 792 T2D cases and 14 169 controls from CHARGE cohorts to determine the association between the T1D PS and T2D prevalence. We validated findings in an independent sample of 2256 T2D cases and 27 052 controls from the Mass General Brigham Biobank (MGB Biobank). As secondary analyses in 5228 T2D cases from CHARGE, we used multivariable regression models to assess the association of the T1D PS with clinical outcomes associated with T1D. Results The T1D PS was not associated with T2D in either CHARGE (P = .15) or the MGB Biobank (P = .87). The partitioned human leukocyte antigen-only PS was associated with T2D in CHARGE (OR 1.02 per 1 SD increase in PS, 95% CI 1.01-1.03, P = .006) but not in the MGB Biobank. The T1D PS was weakly associated with insulin use (OR 1.007, 95% CI 1.001-1.012, P = .03) in CHARGE T2D cases but not with other outcomes. Conclusion In large biobank samples, a common variant PS for T1D was not consistently associated with prevalent T2D. However, possible heterogeneity in T2D cannot be ruled out, and future studies are needed to do subphenotyping.
Affiliation(s)
- Shylaja Srinivasan
  - Division of Pediatric Endocrinology, University of California at San Francisco, San Francisco, CA 94158, USA
- Peitao Wu
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02215, USA
- Josep M Mercader
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard & Massachusetts Institute of Technology, Cambridge, MA 02142, USA
  - Center for Genomic Medicine and Diabetes Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Miriam S Udler
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard & Massachusetts Institute of Technology, Cambridge, MA 02142, USA
  - Center for Genomic Medicine and Diabetes Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- Bianca C Porneala
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Traci M Bartz
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
  - Cardiovascular Health Research Unit, University of Washington, Seattle, WA 98195, USA
- James S Floyd
  - Cardiovascular Health Research Unit, University of Washington, Seattle, WA 98195, USA
  - Department of Medicine, University of Washington, Seattle, WA 98195, USA
  - Department of Epidemiology, University of Washington, Seattle, WA 98195, USA
- Colleen Sitlani
  - Cardiovascular Health Research Unit, University of Washington, Seattle, WA 98195, USA
  - Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Xiquing Guo
  - The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Jeffrey Haessler
  - Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
- Charles Kooperberg
  - Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
- Jun Liu
  - Department of Epidemiology, Erasmus Medical Center, 3015 GD Rotterdam, The Netherlands
  - Nuffield Department of Population Health, University of Oxford, Oxford OX1 2JD, UK
- Shahzad Ahmad
  - Department of Epidemiology, Erasmus Medical Center, 3015 GD Rotterdam, The Netherlands
- Cornelia van Duijn
  - Department of Epidemiology, Erasmus Medical Center, 3015 GD Rotterdam, The Netherlands
  - Nuffield Department of Population Health, University of Oxford, Oxford OX1 2JD, UK
- Ching-Ti Liu
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard & Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Mark O Goodarzi
  - Division of Endocrinology, Diabetes and Metabolism, Department of Medicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
- Jose C Florez
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard & Massachusetts Institute of Technology, Cambridge, MA 02142, USA
  - Center for Genomic Medicine and Diabetes Unit, Massachusetts General Hospital, Boston, MA 02114, USA
- James B Meigs
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard & Massachusetts Institute of Technology, Cambridge, MA 02142, USA
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Jerome I Rotter
  - The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Stephen S Rich
  - Center for Public Health Genomics, Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22903, USA
- Josée Dupuis
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02215, USA
- Aaron Leong
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
  - Center for Genomic Medicine and Diabetes Unit, Massachusetts General Hospital, Boston, MA 02114, USA
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
|
15
|
Schroeder P, Mandla R, Huerta-Chagoya A, Alkanak A, Nagy D, Szczerbinski L, Madsen JGS, Cole JB, Porneala B, Westerman K, Li JH, Pollin TI, Florez JC, Gloyn AL, Cebola I, Manning A, Leong A, Udler M, Mercader JM. Rare variant association analysis in 51,256 type 2 diabetes cases and 370,487 controls informs the spectrum of pathogenicity of monogenic diabetes genes. medRxiv 2023:2023.09.28.23296244. [PMID: 37808701 PMCID: PMC10557807 DOI: 10.1101/2023.09.28.23296244] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/10/2023]
Abstract
We meta-analyzed array data imputed with the TOPMed reference panel and whole-genome sequence (WGS) datasets and performed the largest rare variant (minor allele frequency as low as 5×10⁻⁵) GWAS meta-analysis of type 2 diabetes (T2D), comprising 51,256 cases and 370,487 controls. We identified 52 novel variants at genome-wide significance (p < 5×10⁻⁸), including 8 novel variants that were either rare or ancestry-specific. Among them, we identified a rare missense variant in HNF4A p.Arg114Trp (OR=8.2, 95% confidence interval [CI]=4.6-14.0, p=1.08×10⁻¹³), previously reported as a variant implicated in Maturity Onset Diabetes of the Young (MODY) with incomplete penetrance. We demonstrated that the diabetes risk in carriers of this variant was modulated by a T2D common variant polygenic risk score (cvPRS) (carriers in the top PRS tertile [OR=18.3, 95% CI=7.2-46.9, p=1.2×10⁻⁹] vs carriers in the bottom PRS tertile [OR=2.6, 95% CI=0.97-7.09, p=0.06]). Association results identified eight variants of intermediate penetrance (OR>5) in monogenic diabetes (MD), which in aggregate as a rare variant PRS were associated with T2D in an independent WGS dataset (OR=4.7, 95% CI=1.86-11.77, p=0.001). Our data also provided supporting evidence for classifying 21% of the variants reported in ClinVar in these MD genes as benign, based on lack of association with T2D. Our work provides a framework for using rare variant imputation and WGS analyses in large-scale population-based association studies to identify large-effect rare variants and provide evidence for informing variant pathogenicity.
Affiliation(s)
- Philip Schroeder
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Ravi Mandla
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine and Cardiovascular Research Institute, Cardiology Division, University of California, San Francisco, CA, USA
| | - Alicia Huerta-Chagoya
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Ahmed Alkanak
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Dorka Nagy
- Section of Genetics and Genomics, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK
- National Heart and Lung Institute, Faculty of Medicine, London, UK
| | - Lukasz Szczerbinski
- Department of Endocrinology, Diabetology and Internal Medicine, Medical University of Bialystok, Bialystok, 15-276, Poland
- Clinical Research Centre, Medical University of Bialystok, Bialystok, 15-276, Poland
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Jesper G S Madsen
- Institute of Mathematics and Computer Science, University of Southern Denmark, Odense M, 5230, Denmark
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Joanne B Cole
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Division of Endocrinology, Boston Children's Hospital, Boston, MA, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Bianca Porneala
- Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Kenneth Westerman
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Josephine H Li
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Toni I Pollin
- Emory University, Atlanta, GA, USA
| | - Jose C Florez
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Endocrine Division, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
| | - Anna L Gloyn
- Department of Pediatrics, Division of Endocrinology, Stanford School of Medicine, Stanford, CA, USA
| | - Inês Cebola
- Section of Genetics and Genomics, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK
| | - Alisa Manning
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Clinical and Translational Epidemiology Unit, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
| | - Aaron Leong
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Endocrine Division, Massachusetts General Hospital, Boston, MA, USA
- Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Miriam Udler
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Josep M Mercader
- Programs in Metabolism and Medical & Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
|
16
|
Laber S, Strobel S, Mercader JM, Dashti H, dos Santos FR, Kubitz P, Jackson M, Ainbinder A, Honecker J, Agrawal S, Garborcauskas G, Stirling DR, Leong A, Figueroa K, Sinnott-Armstrong N, Kost-Alimova M, Deodato G, Harney A, Way GP, Saadat A, Harken S, Reibe-Pal S, Ebert H, Zhang Y, Calabuig-Navarro V, McGonagle E, Stefek A, Dupuis J, Cimini BA, Hauner H, Udler MS, Carpenter AE, Florez JC, Lindgren C, Jacobs SB, Claussnitzer M. Discovering cellular programs of intrinsic and extrinsic drivers of metabolic traits using LipocyteProfiler. CELL GENOMICS 2023; 3:100346. [PMID: 37492099 PMCID: PMC10363917 DOI: 10.1016/j.xgen.2023.100346] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 10/21/2021] [Revised: 08/22/2022] [Accepted: 05/26/2023] [Indexed: 07/27/2023]
Abstract
A primary obstacle in translating genetic associations with disease into therapeutic strategies is elucidating the cellular programs affected by genetic risk variants and effector genes. Here, we introduce LipocyteProfiler, a cardiometabolic-disease-oriented high-content image-based profiling tool that enables evaluation of thousands of morphological and cellular profiles that can be systematically linked to genes and genetic variants relevant to cardiometabolic disease. We show that LipocyteProfiler allows surveillance of diverse cellular programs by generating rich context- and process-specific cellular profiles across hepatocyte and adipocyte cell-state transitions. We use LipocyteProfiler to identify known and novel cellular mechanisms altered by polygenic risk of metabolic disease, including insulin resistance, fat distribution, and the polygenic contribution to lipodystrophy. LipocyteProfiler paves the way for large-scale forward and reverse deep phenotypic profiling in lipocytes and provides a framework for the unbiased identification of causal relationships between genetic variants and cellular programs relevant to human disease.
Affiliation(s)
- Samantha Laber
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7FZ, UK
- Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
| | - Sophie Strobel
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, 85354 Freising-Weihenstephan, Germany
| | - Josep M. Mercader
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Department of Medicine, Harvard Medical School, Boston, MA 02114, USA
| | - Hesam Dashti
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Medicine, Harvard Medical School, Boston, MA 02114, USA
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Felipe R.C. dos Santos
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Phil Kubitz
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Else Kröner-Fresenius-Centre for Nutritional Medicine, School of Life Sciences, Technical University of Munich, 85354 Freising-Weihenstephan, Germany
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Maya Jackson
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Alina Ainbinder
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Julius Honecker
- Else Kröner-Fresenius-Centre for Nutritional Medicine, School of Life Sciences, Technical University of Munich, 85354 Freising-Weihenstephan, Germany
| | - Saaket Agrawal
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Garrett Garborcauskas
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - David R. Stirling
- Imaging Platform, Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Aaron Leong
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Department of Medicine, Harvard Medical School, Boston, MA 02114, USA
| | - Katherine Figueroa
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Nasa Sinnott-Armstrong
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Genetics, Stanford University, San Francisco, CA, USA
| | - Maria Kost-Alimova
- Imaging Platform, Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Giacomo Deodato
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Alycen Harney
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Gregory P. Way
- Imaging Platform, Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Alham Saadat
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Sierra Harken
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Saskia Reibe-Pal
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7FZ, UK
| | - Hannah Ebert
- Institute of Nutritional Science, University Hohenheim, 70599 Stuttgart, Germany
| | - Yixin Zhang
- Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
| | - Virtu Calabuig-Navarro
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Institute of Nutritional Science, University Hohenheim, 70599 Stuttgart, Germany
| | - Elizabeth McGonagle
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Adam Stefek
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Josée Dupuis
- Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1G1, Canada
| | - Beth A. Cimini
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Hans Hauner
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, 85354 Freising-Weihenstephan, Germany
- Else Kröner-Fresenius-Centre for Nutritional Medicine, School of Life Sciences, Technical University of Munich, 85354 Freising-Weihenstephan, Germany
- German Center for Diabetes Research (DZD), 85764 Neuherberg, Germany
| | - Miriam S. Udler
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Department of Medicine, Harvard Medical School, Boston, MA 02114, USA
| | - Anne E. Carpenter
- Imaging Platform, Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Jose C. Florez
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Department of Medicine, Harvard Medical School, Boston, MA 02114, USA
| | - Cecilia Lindgren
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7FZ, UK
- Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
| | - Suzanne B.R. Jacobs
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Melina Claussnitzer
- Programs in Metabolism and Medical and Population Genetics, Type 2 Diabetes Systems Genomics Initiative, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Department of Medicine, Harvard Medical School, Boston, MA 02114, USA
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
|
17
|
Alsentzer E, Rasmussen MJ, Fontoura R, Cull AL, Beaulieu-Jones B, Gray KJ, Bates DW, Kovacheva VP. Zero-shot Interpretable Phenotyping of Postpartum Hemorrhage Using Large Language Models. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.05.31.23290753. [PMID: 37398230 PMCID: PMC10312824 DOI: 10.1101/2023.05.31.23290753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Many areas of medicine would benefit from deeper, more accurate phenotyping, but there are limited approaches for phenotyping using clinical notes without substantial annotated data. Large language models (LLMs) have demonstrated immense potential to adapt to novel tasks with no additional training by specifying task-specific instructions. We investigated the performance of a publicly available LLM, Flan-T5, in phenotyping patients with postpartum hemorrhage (PPH) using discharge notes from electronic health records (n = 271,081). The language model achieved strong performance in extracting 24 granular concepts associated with PPH. Identifying these granular concepts accurately allowed the development of interpretable, complex phenotypes and subtypes. The Flan-T5 model achieved high fidelity in phenotyping PPH (positive predictive value of 0.95), identifying 47% more patients with this complication compared to the current standard of using claims codes. This LLM pipeline can be used reliably for subtyping PPH and outperformed a claims-based approach on the three most common PPH subtypes associated with uterine atony, abnormal placentation, and obstetric trauma. The advantage of this approach to subtyping is its interpretability, as each concept contributing to the subtype determination can be evaluated. Moreover, as definitions may change over time due to new guidelines, using granular concepts to create complex phenotypes enables prompt and efficient updating of the algorithm. Using this language modelling approach enables rapid phenotyping without the need for any manually annotated training data across multiple clinical use cases.
|
18
|
Estiri H, Azhir A, Blacker DL, Ritchie CS, Patel CJ, Murphy SN. Temporal characterization of Alzheimer's Disease with sequences of clinical records. EBioMedicine 2023; 92:104629. [PMID: 37247495 DOI: 10.1016/j.ebiom.2023.104629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 05/05/2023] [Accepted: 05/10/2023] [Indexed: 05/31/2023] Open
Abstract
BACKGROUND Alzheimer's Disease (AD) is a complex clinical phenotype with unprecedented social and economic tolls on an ageing global population. Real-world data (RWD) from electronic health records (EHRs) offer opportunities to accelerate precision drug development and scale epidemiological research on AD. A precise characterization of AD cohorts is needed to address the noise abundant in RWD. METHODS We conducted a retrospective cohort study to develop and test computational models for AD cohort identification using clinical data from 8 Massachusetts healthcare systems. We mined temporal representations from EHR data using the transitive sequential pattern mining algorithm (tSPM) to train and validate our models. We then tested our models against a held-out test set from a review of medical records to adjudicate the presence of AD. We trained two classes of Machine Learning models, using Gradient Boosting Machine (GBM), to compare the utility of AD diagnosis records versus the tSPM temporal representations (comprising sequences of diagnosis and medication observations) from electronic medical records for characterizing AD cohorts. FINDINGS In a group of 4985 patients, we identified 219 tSPM temporal representations (i.e., transitive sequences) of medical records for constructing the best classification models. The models with sequential features improved AD classification by a magnitude of 3-16 percent over the use of AD diagnosis codes alone. The computed cohort included 663 patients, 35 of whom had no record of AD. Six groups of tSPM sequences were identified for characterizing the AD cohorts. INTERPRETATION We present sequential patterns of diagnosis and medication codes from electronic medical records, as digital markers of Alzheimer's Disease. Classification algorithms developed on sequential patterns can replace standard features from EHRs to enrich phenotype modelling. 
FUNDING National Institutes of Health: the National Institute on Aging (RF1AG074372) and the National Institute of Allergy and Infectious Diseases (R01AI165535).
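The abstract does not spell out how tSPM builds its temporal representations. As a rough sketch only, the code below assumes that a transitive sequence is an ordered pair of distinct codes in which the first occurrence of one code precedes the first occurrence of the other, regardless of what was recorded in between. The function name, the first-occurrence rule, and the toy records are illustrative assumptions, not the published implementation.

```python
from itertools import combinations

def transitive_sequences(records):
    """Return the set of ordered code pairs (a, b) for one patient.

    records: iterable of (date, code) tuples, where dates are comparable
    (e.g. ISO-format strings or integers).
    Assumption: only the first occurrence of each code is used.
    """
    first_seen = {}
    for date, code in sorted(records):
        # Keep the earliest date at which each code appears.
        first_seen.setdefault(code, date)
    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    # Pair every code with every later code, not only with its direct successor.
    return {
        (a, b)
        for (a, date_a), (b, date_b) in combinations(ordered, 2)
        if date_a < date_b  # codes first seen on the same day form no sequence
    }

# Hypothetical patient timeline: codes are placeholders, not real diagnoses.
patient = [(1, "dx_A"), (2, "rx_B"), (3, "dx_C"), (4, "dx_A")]
print(sorted(transitive_sequences(patient)))
```

Each resulting pair could then serve as a binary feature for a classifier such as the gradient boosting models the abstract mentions.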
Affiliation(s)
- Hossein Estiri
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA.
| | - Alaleh Azhir
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA; Harvard Medical School, Harvard-MIT Program in Health Sciences and Technology, USA
| | - Deborah L Blacker
- Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA
| | | | - Chirag J Patel
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Shawn N Murphy
- Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
|
19
|
Hou J, Zhao R, Gronsbell J, Lin Y, Bonzel CL, Zeng Q, Zhang S, Beaulieu-Jones BK, Weber GM, Jemielita T, Wan SS, Hong C, Cai T, Wen J, Ayakulangara Panickan V, Liaw KL, Liao K, Cai T. Generate Analysis-Ready Data for Real-world Evidence: Tutorial for Harnessing Electronic Health Records With Advanced Informatic Technologies. J Med Internet Res 2023; 25:e45662. [PMID: 37227772 PMCID: PMC10251230 DOI: 10.2196/45662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Revised: 03/31/2023] [Accepted: 04/05/2023] [Indexed: 05/26/2023] Open
Abstract
Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. 
In addition to the workflow proposed in our pipeline, we also develop a reporting guideline for RWE that covers the necessary information to facilitate transparent reporting and reproducibility of results. Moreover, our pipeline is highly data driven, enhancing study data with a rich variety of publicly available information and knowledge sources. We also showcase our pipeline and provide guidance on the deployment of relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer. We also draw on existing literature on EHR emulation of RCTs together with our own studies with the Mass General Brigham EHR.
Affiliation(s)
- Jue Hou
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States
| | - Rachel Zhao
- Department of Medicine, University of British Columbia, Vancouver, BC, Canada
| | - Jessica Gronsbell
- Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
| | - Yucong Lin
- Institute of Engineering Medicine, Beijing Institute of Technology, Beijing, China
| | - Clara-Lea Bonzel
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Qingyi Zeng
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States
| | - Sinian Zhang
- School of Statistics, Renmin University of China, Beijing, China
| | | | - Griffin M Weber
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | | | | | - Chuan Hong
- Department of Biostatistics & Bioinformatics, Duke University, Durham, NC, United States
| | - Tianrun Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Jun Wen
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | | | | | - Katherine Liao
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, United States
| | - Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, United States
|
20
|
Wang AL, Lahousse L, Dahlin A, Edris A, McGeachie M, Lutz SM, Sordillo JE, Brusselle G, Lasky-Su J, Weiss ST, Iribarren C, Lu MX, Tantisira KG, Wu AC. Novel genetic variants associated with inhaled corticosteroid treatment response in older adults with asthma. Thorax 2023; 78:432-441. [PMID: 35501119 PMCID: PMC9810110 DOI: 10.1136/thoraxjnl-2021-217674] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/24/2021] [Accepted: 04/01/2022] [Indexed: 01/07/2023]
Abstract
INTRODUCTION Older adults have the greatest burden of asthma and poorest outcomes. The pharmacogenetics of inhaled corticosteroid (ICS) treatment response is not well studied in older adults. METHODS A genome-wide association study of ICS response was performed in asthmatics of European ancestry in Genetic Epidemiology Research on Adult Health and Aging (GERA) by fitting Cox proportional hazards regression models, followed by validation in the Mass General Brigham (MGB) Biobank and Rotterdam Study. ICS response was measured using two definitions in asthmatics on ICS treatment: (1) absence of oral corticosteroid (OCS) bursts using prescription records and (2) absence of asthma-related exacerbations using diagnosis codes. A fixed-effect meta-analysis was performed for each outcome. The validated single-nucleotide polymorphisms (SNPs) were functionally annotated to standard databases. RESULTS In 5710 subjects in GERA, 676 subjects in MGB Biobank, and 465 subjects in the Rotterdam Study, four novel SNPs on chromosome 6 near PTCHD4 validated across all cohorts and met genome-wide significance on meta-analysis for the OCS burst outcome. In 4541 subjects in GERA and 505 subjects in MGB Biobank, 152 SNPs with p < 5 × 10⁻⁵ were validated across these two cohorts for the asthma-related exacerbation outcome. The validated SNPs included methylation and expression quantitative trait loci for CPED1, CRADD and DST for the OCS burst outcome and GM2A, SNW1, CACNA1C, DPH1, and RPS10 for the asthma-related exacerbation outcome. CONCLUSIONS Multiple novel SNPs associated with ICS response were identified in older adult asthmatics. Several SNPs annotated to genes previously associated with asthma and other airway or allergic diseases, including PTCHD4.
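The abstract states that a fixed-effect meta-analysis was run for each outcome but not which estimator was used. The sketch below assumes the most common choice, inverse-variance weighting of per-cohort effect estimates (for a Cox model, log hazard ratios). The function name and the example numbers are hypothetical and are not taken from the study.

```python
import math

def fixed_effect_meta(estimates, std_errors):
    """Pool per-cohort estimates with inverse-variance weights.

    estimates: per-cohort effect sizes on an additive scale (e.g. log hazard ratios)
    std_errors: matching standard errors
    Returns (pooled estimate, pooled standard error, z statistic, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return pooled, pooled_se, z, p

# Made-up log hazard ratios for one SNP in three cohorts (illustration only).
log_hrs = [0.30, 0.25, 0.40]
ses = [0.06, 0.15, 0.20]
pooled, pooled_se, z, p = fixed_effect_meta(log_hrs, ses)
print(f"pooled log HR={pooled:.3f}, SE={pooled_se:.3f}, z={z:.2f}, p={p:.2g}")
```

Because each weight is the inverse of the squared standard error, the largest cohort (GERA in this study) would dominate the pooled estimate under this scheme.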
Affiliation(s)
- Alberta L Wang
- Division of Allergy and Clinical Immunology, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
| | - Lies Lahousse
- Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands
- Department of Bioanalysis, Faculty of Pharmaceutical Sciences, Ghent University, Ghent, Belgium
- Amber Dahlin
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Ahmed Edris
- Department of Bioanalysis, Faculty of Pharmaceutical Sciences, Ghent University, Ghent, Belgium
- Michael McGeachie
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Sharon M Lutz
- PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
- Joanne E Sordillo
- PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
- Guy Brusselle
- Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands
- Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium
- Department of Respiratory Medicine, Erasmus University Medical Center, Rotterdam, The Netherlands
- Jessica Lasky-Su
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Scott T Weiss
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Carlos Iribarren
- Kaiser Permanente Division of Research, Kaiser Permanente, Oakland, California, USA
- Meng X Lu
- Kaiser Permanente Division of Research, Kaiser Permanente, Oakland, California, USA
- Kelan G Tantisira
- Division of Pediatric Respiratory Medicine, Rady's Children's Hospital-San Diego, University of California San Diego School of Medicine, San Diego, California, USA
- Ann C Wu
- PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts, USA
21
He T, Belouali A, Patricoski J, Lehmann H, Ball R, Anagnostou V, Kreimeyer K, Botsis T. Trends and opportunities in computable clinical phenotyping: A scoping review. J Biomed Inform 2023; 140:104335. [PMID: 36933631 DOI: 10.1016/j.jbi.2023.104335] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/09/2022] [Revised: 03/07/2023] [Accepted: 03/09/2023] [Indexed: 03/18/2023]
Abstract
Identifying patient cohorts meeting the criteria of specific phenotypes is essential in biomedicine and particularly timely in precision medicine. Many research groups deliver pipelines that automatically retrieve and analyze data elements from one or more sources to automate this task and deliver high-performing computable phenotypes. We applied a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines to conduct a thorough scoping review on computable clinical phenotyping. Five databases were searched using a query that combined the concepts of automation, clinical context, and phenotyping. Subsequently, four reviewers screened 7960 records (after removing over 4000 duplicates) and selected 139 that satisfied the inclusion criteria. This dataset was analyzed to extract information on target use cases, data-related topics, phenotyping methodologies, evaluation strategies, and portability of developed solutions. Most studies supported patient cohort selection without discussing the application to specific use cases, such as precision medicine. Electronic Health Records were the primary source in 87.1% (N = 121) of all studies, and International Classification of Diseases codes were heavily used in 55.4% (N = 77) of all studies; however, only 25.9% (N = 36) of the records described compliance with a common data model. In terms of the presented methods, traditional Machine Learning (ML) was the dominant method, often combined with natural language processing and other approaches, while external validation and portability of computable phenotypes were pursued in many cases. These findings revealed that defining target use cases precisely, moving away from sole ML strategies, and evaluating the proposed solutions in the real setting are essential opportunities for future work. There is also momentum and an emerging need for computable phenotyping to support clinical and epidemiological research and precision medicine.
Affiliation(s)
- Ting He
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
- Anas Belouali
- Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Jessica Patricoski
- Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Harold Lehmann
- Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Robert Ball
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US FDA, Silver Spring, MD, USA
- Valsamo Anagnostou
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Kory Kreimeyer
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Taxiarchis Botsis
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
22
Murphy RM, Dongelmans DA, Kom IYD, Calixto I, Abu-Hanna A, Jager KJ, de Keizer NF, Klopotowska JE. Drug-related causes attributed to acute kidney injury and their documentation in intensive care patients. J Crit Care 2023; 75:154292. [PMID: 36959015 DOI: 10.1016/j.jcrc.2023.154292] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/27/2023] [Revised: 03/14/2023] [Accepted: 03/14/2023] [Indexed: 03/25/2023]
Abstract
PURPOSE To investigate drug-related causes attributed to acute kidney injury (DAKI) and their documentation in patients admitted to the Intensive Care Unit (ICU). METHODS This study was conducted in an academic hospital in the Netherlands by reusing electronic health record (EHR) data of adult ICU admissions between November 2015 and January 2020. First, ICU admissions with acute kidney injury (AKI) stage 2 or 3 were identified. Subsequently, three modes of DAKI documentation in the EHR were examined: diagnosis codes (structured data), allergy module (semi-structured data), and clinical notes (unstructured data). RESULTS In total, 8124 ICU admissions were included, with 542 (6.7%) ICU admissions experiencing AKI stage 2 or 3. The ICU physicians deemed 102 of these AKI cases (18.8%) to be drug-related. These DAKI cases were all documented in the clinical notes (100%), one in the allergy module (1%), and none via diagnosis codes. The clinical notes required the highest time investment to analyze. CONCLUSIONS Drug-related causes comprise a substantial part of AKI in ICU patients. However, current unstructured DAKI documentation practice via clinical notes hampers our ability to gain better insights about DAKI occurrence. Therefore, both automating DAKI identification from the clinical notes and increasing structured DAKI documentation should be encouraged.
Affiliation(s)
- Rachel M Murphy
- Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Meibergdreef 9, Amsterdam, the Netherlands; Amsterdam Public Health, Digital Health, Amsterdam, the Netherlands; Amsterdam Public Health, Quality of Care, Amsterdam, the Netherlands.
- Dave A Dongelmans
- Amsterdam Public Health, Quality of Care, Amsterdam, the Netherlands; Amsterdam UMC location University of Amsterdam, Department of Intensive Care Medicine, Meibergdreef 9, Amsterdam, the Netherlands
- Izak Yasrebi-de Kom
- Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Meibergdreef 9, Amsterdam, the Netherlands; Amsterdam Public Health, Methodology, Amsterdam, the Netherlands
- Iacer Calixto
- Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Meibergdreef 9, Amsterdam, the Netherlands; Amsterdam Public Health, Methodology, Amsterdam, the Netherlands; Amsterdam Public Health, Mental Health, Amsterdam, the Netherlands
- Ameen Abu-Hanna
- Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Meibergdreef 9, Amsterdam, the Netherlands; Amsterdam Public Health, Methodology, Amsterdam, the Netherlands; Amsterdam Public Health, Aging & Later Life, Amsterdam, the Netherlands
- Kitty J Jager
- Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Meibergdreef 9, Amsterdam, the Netherlands; Amsterdam Public Health, Quality of Care, Amsterdam, the Netherlands; Amsterdam Public Health, Aging & Later Life, Amsterdam, the Netherlands; Amsterdam Cardiovascular Sciences, Pulmonary hypertension & thrombosis, Amsterdam, the Netherlands
- Nicolette F de Keizer
- Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Meibergdreef 9, Amsterdam, the Netherlands; Amsterdam Public Health, Digital Health, Amsterdam, the Netherlands; Amsterdam Public Health, Quality of Care, Amsterdam, the Netherlands
- Joanna E Klopotowska
- Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Meibergdreef 9, Amsterdam, the Netherlands; Amsterdam Public Health, Digital Health, Amsterdam, the Netherlands; Amsterdam Public Health, Quality of Care, Amsterdam, the Netherlands
23
Pacheco JA, Rasmussen LV, Wiley K, Person TN, Cronkite DJ, Sohn S, Murphy S, Gundelach JH, Gainer V, Castro VM, Liu C, Mentch F, Lingren T, Sundaresan AS, Eickelberg G, Willis V, Furmanchuk A, Patel R, Carrell DS, Deng Y, Walton N, Satterfield BA, Kullo IJ, Dikilitas O, Smith JC, Peterson JF, Shang N, Kiryluk K, Ni Y, Li Y, Nadkarni GN, Rosenthal EA, Walunas TL, Williams MS, Karlson EW, Linder JE, Luo Y, Weng C, Wei W. Evaluation of the portability of computable phenotypes with natural language processing in the eMERGE network. Sci Rep 2023; 13:1971. [PMID: 36737471 PMCID: PMC9898520 DOI: 10.1038/s41598-023-27481-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/15/2022] [Accepted: 01/03/2023] [Indexed: 02/05/2023]
Abstract
The electronic Medical Records and Genomics (eMERGE) Network assessed the feasibility of deploying portable phenotype rule-based algorithms with natural language processing (NLP) components added to improve performance of existing algorithms using electronic health records (EHRs). Based on scientific merit and predicted difficulty, eMERGE selected six existing phenotypes to enhance with NLP. We assessed performance, portability, and ease of use. We summarized lessons learned by: (1) challenges; (2) best practices to address challenges based on existing evidence and/or eMERGE experience; and (3) opportunities for future research. Adding NLP resulted in improved, or the same, precision and/or recall for all but one algorithm. Portability, phenotyping workflow/process, and technology were major themes. With NLP, development and validation took longer. Besides portability of NLP technology and algorithm replicability, factors to ensure success include privacy protection, technical infrastructure setup, intellectual property agreement, and efficient communication. Workflow improvements can improve communication and reduce implementation time. NLP performance varied mainly due to clinical document heterogeneity; therefore, we suggest using semi-structured notes, comprehensive documentation, and customization options. NLP portability is possible with improved phenotype algorithm performance, but careful planning and architecture of the algorithms is essential to support local customizations.
Affiliation(s)
- Ken Wiley
- National Human Genome Research Institute, Bethesda, USA
- David J Cronkite
- Kaiser Permanente Washington Health Research Institute, Seattle, USA
- Cong Liu
- Columbia University, New York, USA
- Frank Mentch
- Children's Hospital of Philadelphia, Philadelphia, USA
- Todd Lingren
- Cincinnati Children's Hospital Medical Center, Cincinnati, USA
- David S Carrell
- Kaiser Permanente Washington Health Research Institute, Seattle, USA
- Yu Deng
- Northwestern University, Evanston, USA
- Yizhao Ni
- Cincinnati Children's Hospital Medical Center, Cincinnati, USA
- Yikuan Li
- Northwestern University, Evanston, USA
- Yuan Luo
- Northwestern University, Evanston, USA
- WeiQi Wei
- Vanderbilt University Medical Center, Nashville, USA
24
Carrell DS, Gruber S, Floyd JS, Bann MA, Cushing-Haugen KL, Johnson RL, Graham V, Cronkite DJ, Hazlehurst BL, Felcher AH, Bejan CA, Kennedy A, Shinde MU, Karami S, Ma Y, Stojanovic D, Zhao Y, Ball R, Nelson JC. Improving Methods of Identifying Anaphylaxis for Medical Product Safety Surveillance Using Natural Language Processing and Machine Learning. Am J Epidemiol 2023; 192:283-295. [PMID: 36331289 PMCID: PMC9896464 DOI: 10.1093/aje/kwac182] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/11/2021] [Revised: 07/06/2022] [Accepted: 10/11/2022] [Indexed: 11/06/2022]
Abstract
We sought to determine whether machine learning and natural language processing (NLP) applied to electronic medical records could improve performance of automated health-care claims-based algorithms to identify anaphylaxis events using data on 516 patients with outpatient, emergency department, or inpatient anaphylaxis diagnosis codes during 2015-2019 in 2 integrated health-care institutions in the Northwest United States. We used one site's manually reviewed gold-standard outcomes data for model development and the other's for external validation based on cross-validated area under the receiver operating characteristic curve (AUC), positive predictive value (PPV), and sensitivity. In the development site 154 (64%) of 239 potential events met adjudication criteria for anaphylaxis compared with 180 (65%) of 277 in the validation site. Logistic regression models using only structured claims data achieved a cross-validated AUC of 0.58 (95% CI: 0.54, 0.63). Machine learning improved cross-validated AUC to 0.62 (0.58, 0.66); incorporating NLP-derived covariates further increased cross-validated AUCs to 0.70 (0.66, 0.75) in development and 0.67 (0.63, 0.71) in external validation data. A classification threshold with cross-validated PPV of 79% and cross-validated sensitivity of 66% in development data had cross-validated PPV of 78% and cross-validated sensitivity of 56% in external data. Machine learning and NLP-derived data improved identification of validated anaphylaxis events.
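The PPV and sensitivity figures quoted in this abstract both follow from confusion-matrix counts at a chosen classification threshold. The sketch below shows that calculation only; the labels, scores, and threshold are hypothetical illustrations and are not data from the study, and the function name is an assumption introduced here.

```python
def ppv_and_sensitivity(labels, scores, threshold):
    """Return (PPV, sensitivity) when scores >= threshold are called positive.

    labels: 1 for a chart-confirmed event, 0 otherwise (hypothetical data).
    scores: model-predicted probabilities, same order as labels.
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)  # true positives
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)  # false positives
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)   # missed events
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return ppv, sensitivity


# Hypothetical adjudicated labels and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.1, 0.3]
ppv, sens = ppv_and_sensitivity(labels, scores, threshold=0.5)
print(f"PPV={ppv:.2f}, sensitivity={sens:.2f}")
```

Raising the threshold generally trades sensitivity for PPV, which is why the abstract reports the two values as a pair at one operating point.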
Affiliation(s)
- David S Carrell
- Correspondence to Dr. David Carrell, Kaiser Permanente Washington Health Research Institute, 1730 Minor Avenue, Suite 1600, Seattle, WA 98101
25
Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023; 30:367-381. [PMID: 36413056 PMCID: PMC9846699 DOI: 10.1093/jamia/ocac216] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Received: 07/28/2022] [Revised: 09/27/2022] [Accepted: 10/27/2022] [Indexed: 11/23/2022]
Abstract
OBJECTIVE Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. MATERIALS AND METHODS We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. RESULTS Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. DISCUSSION Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. CONCLUSION Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Ellen Stephenson
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Karen Tu
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Jessica Gronsbell
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
26
Zheng HW, Ranganath VK, Perry LC, Chetrit DA, Criner KM, Pham AQ, Seto R, Vangala S, Elashoff DA, Bui AA. Evaluation of an automated phenotyping algorithm for rheumatoid arthritis. J Biomed Inform 2022; 135:104214. [DOI: 10.1016/j.jbi.2022.104214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Revised: 09/24/2022] [Accepted: 09/26/2022] [Indexed: 11/16/2022]
27
Gronsbell J, Liu M, Tian L, Cai T. Efficient Evaluation of Prediction Rules in Semi-Supervised Settings under Stratified Sampling. J R Stat Soc Series B Stat Methodol 2022; 84:1353-1391. [PMID: 36275859 PMCID: PMC9586151 DOI: 10.1111/rssb.12502] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/20/2023]
Abstract
In many contemporary applications, large amounts of unlabeled data are readily available while labeled examples are limited. There has been substantial interest in semi-supervised learning (SSL) which aims to leverage unlabeled data to improve estimation or prediction. However, current SSL literature focuses primarily on settings where labeled data is selected uniformly at random from the population of interest. Stratified sampling, while posing additional analytical challenges, is highly applicable to many real world problems. Moreover, no SSL methods currently exist for estimating the prediction performance of a fitted model when the labeled data is not selected uniformly at random. In this paper, we propose a two-step SSL procedure for evaluating a prediction rule derived from a working binary regression model based on the Brier score and overall misclassification rate under stratified sampling. In step I, we impute the missing labels via weighted regression with nonlinear basis functions to account for stratified sampling and to improve efficiency. In step II, we augment the initial imputations to ensure the consistency of the resulting estimators regardless of the specification of the prediction model or the imputation model. The final estimator is then obtained with the augmented imputations. We provide asymptotic theory and numerical studies illustrating that our proposals outperform their supervised counterparts in terms of efficiency gain. Our methods are motivated by electronic health record (EHR) research and validated with a real data analysis of an EHR-based study of diabetic neuropathy.
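For orientation, the two accuracy measures named in this abstract have simple fully supervised definitions. The sketch below computes those plain versions on a small labeled sample; it is not the semi-supervised, stratification-weighted estimator the paper proposes, and the function names, labels, probabilities, and 0.5 cutoff are assumptions made for illustration.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def misclassification_rate(labels, probs, cutoff=0.5):
    """Share of observations whose thresholded prediction disagrees with the label."""
    wrong = sum(1 for y, p in zip(labels, probs) if (1 if p >= cutoff else 0) != y)
    return wrong / len(labels)


# Hypothetical labeled sample: true outcomes and fitted probabilities.
labels = [1, 0, 1, 0]
probs = [0.8, 0.3, 0.4, 0.1]

bs = brier_score(labels, probs)
omr = misclassification_rate(labels, probs)
print(f"Brier score={bs:.3f}, misclassification rate={omr:.2f}")
```

With few labels these supervised estimates are noisy, which is the gap the abstract's two-step imputation procedure is meant to narrow by drawing on the unlabeled records.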
Affiliation(s)
- Jessica Gronsbell
- Assistant Professor, Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada
- The first two authors are equal contributors to this work
- Molei Liu
- Ph.D. student, Department of Biostatistics, Harvard University, Boston, MA 02115, USA
- The first two authors are equal contributors to this work
- Lu Tian
- Associate Professor, Department of Biomedical Data Science, Stanford University, Palo Alto, California 94305, USA
- Tianxi Cai
- Professor, Department of Biostatistics, Harvard University, Boston, MA 02115, USA
28
Ahuja Y, Zou Y, Verma A, Buckeridge D, Li Y. MixEHR-Guided: A guided multi-modal topic modeling approach for large-scale automatic phenotyping using the electronic health record. J Biomed Inform 2022; 134:104190. [PMID: 36058522 DOI: 10.1016/j.jbi.2022.104190] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/06/2022] [Revised: 08/27/2022] [Accepted: 08/28/2022] [Indexed: 01/18/2023]
Abstract
Electronic Health Records (EHRs) contain rich clinical data collected at the point of care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms wherein the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.
Affiliation(s)
- Yuri Ahuja
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA; Harvard Medical School, 25 Shattuck St, Boston, MA 02115, USA.
- Yuesong Zou
- School of Computer Science, McGill University, 3480 Rue University, Montreal, QC H3A 2A7, Canada
- Aman Verma
- School of Population and Global Health, McGill University, 2001 McGill College Avenue, Montreal, Québec H3A 1G1, Canada
- David Buckeridge
- School of Population and Global Health, McGill University, 2001 McGill College Avenue, Montreal, Québec H3A 1G1, Canada.
- Yue Li
- School of Computer Science, McGill University, 3480 Rue University, Montreal, QC H3A 2A7, Canada.
29
Ashburner JM, Chang Y, Wang X, Khurshid S, Anderson CD, Dahal K, Weisenfeld D, Cai T, Liao KP, Wagholikar KB, Murphy SN, Atlas SJ, Lubitz SA, Singer DE. Natural Language Processing to Improve Prediction of Incident Atrial Fibrillation Using Electronic Health Records. J Am Heart Assoc 2022; 11:e026014. [PMID: 35904194 PMCID: PMC9375475 DOI: 10.1161/jaha.122.026014] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/06/2022] [Accepted: 06/29/2022] [Indexed: 11/16/2022]
Abstract
Background Models predicting atrial fibrillation (AF) risk, such as Cohorts for Heart and Aging Research in Genomic Epidemiology AF (CHARGE-AF), have not performed as well in electronic health records. Natural language processing (NLP) may improve models by using narrative electronic health record text. Methods and Results From a primary care network, we included patients aged ≥65 years with visits between 2003 and 2013 in development (n=32 960) and internal validation cohorts (n=13 992). An external validation cohort from a separate network from 2015 to 2020 included 39 051 patients. Model features were defined using electronic health record codified data and narrative data with NLP. We developed 2 models to predict 5-year AF incidence using (1) codified+NLP data and (2) codified data only and evaluated model performance. The analysis included 2839 incident AF cases in the development cohort and 1057 and 2226 cases in internal and external validation cohorts, respectively. The C-statistic was greater (P<0.001) in codified+NLP model (0.744 [95% CI, 0.735-0.753]) compared with codified-only (0.730 [95% CI, 0.720-0.739]) in the development cohort. In internal validation, the C-statistic of codified+NLP was modestly higher (0.735 [95% CI, 0.720-0.749]) compared with codified-only (0.729 [95% CI, 0.715-0.744]; P=0.06) and CHARGE-AF (0.717 [95% CI, 0.703-0.731]; P=0.002). Codified+NLP and codified-only were well calibrated, whereas CHARGE-AF underestimated AF risk. In external validation, the C-statistic of codified+NLP (0.750 [95% CI, 0.740-0.760]) remained higher (P<0.001) than codified-only (0.738 [95% CI, 0.727-0.748]) and CHARGE-AF (0.735 [95% CI, 0.725-0.746]). Conclusions Estimation of 5-year risk of AF can be modestly improved using NLP to incorporate narrative electronic health record data.
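The C-statistic compared across models in this abstract measures discrimination: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The sketch below computes the binary-outcome version by counting concordant pairs. It ignores censoring and follow-up time, so it may differ from the estimator the authors used for 5-year incidence, and the function name, labels, and risks are hypothetical.

```python
def c_statistic(labels, risks):
    """Concordance for a binary outcome: P(risk of a case > risk of a non-case).

    Tied risks count as half-concordant. Equivalent to the area under the ROC curve.
    """
    cases = [r for y, r in zip(labels, risks) if y == 1]
    controls = [r for y, r in zip(labels, risks) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                concordant += 1.0
            elif case_risk == control_risk:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Hypothetical outcomes (1 = developed AF) and predicted 5-year risks.
labels = [1, 0, 1, 0, 0]
risks = [0.30, 0.10, 0.12, 0.20, 0.05]
c = c_statistic(labels, risks)
print(f"C-statistic={c:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives context for the differences of roughly 0.01 to 0.02 reported between the models.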
Affiliation(s)
- Jeffrey M. Ashburner
- Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA
- Harvard Medical School, Boston, MA
- Yuchiao Chang
- Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA
- Harvard Medical School, Boston, MA
- Xin Wang
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA
- Shaan Khurshid
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA
- Division of Cardiology, Massachusetts General Hospital, Boston, MA
- Kumar Dahal
- Department of Rheumatology, Inflammation, and Immunity, Brigham and Women’s Hospital, Boston, MA
- Dana Weisenfeld
- Department of Rheumatology, Inflammation, and Immunity, Brigham and Women’s Hospital, Boston, MA
- Tianrun Cai
- Harvard Medical School, Boston, MA
- Department of Rheumatology, Inflammation, and Immunity, Brigham and Women’s Hospital, Boston, MA
- Katherine P. Liao
- Harvard Medical School, Boston, MA
- Department of Rheumatology, Inflammation, and Immunity, Brigham and Women’s Hospital, Boston, MA
- Kavishwar B. Wagholikar
- Harvard Medical School, Boston, MA
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA
- Shawn N. Murphy
- Harvard Medical School, Boston, MA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA
- Steven J. Atlas
- Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA
- Harvard Medical School, Boston, MA
- Steven A. Lubitz
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA
- Cardiac Arrhythmia Service, Massachusetts General Hospital, Boston, MA
- Daniel E. Singer
- Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA
- Harvard Medical School, Boston, MA
30
Cai T, He Z, Hong C, Zhang Y, Ho YL, Honerlaw J, Geva A, Ayakulangara Panickan V, King A, Gagnon DR, Gaziano M, Cho K, Liao K, Cai T. Scalable relevance ranking algorithm via semantic similarity assessment improves efficiency of medical chart review. J Biomed Inform 2022; 132:104109. [PMID: 35660521 DOI: 10.1016/j.jbi.2022.104109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2022] [Revised: 04/30/2022] [Accepted: 05/29/2022] [Indexed: 01/19/2023]
Abstract
OBJECTIVE Accurately assigning phenotype information to individual patients via computational phenotyping using Electronic Health Records (EHRs) has been seen as the first step towards enabling EHRs for precision medicine research. Chart review labels annotated by clinical experts, also known as "gold standard" labels, are essential for the development and validation of computational phenotyping algorithms. However, given the complexity of EHR systems, the process of chart review is both labor intensive and time consuming. We propose a fully automated algorithm, referred to as pGUESS, to rank EHR notes according to their relevance to a given phenotype. By identifying the most relevant notes, pGUESS can greatly improve the efficiency and accuracy of chart reviews. METHOD pGUESS uses prior guided semantic similarity to measure the informativeness of a clinical note to a given phenotype. We first select candidate clinical concepts from a pool of comprehensive medical concepts using public knowledge sources and then derive the semantic embedding vector (SEV) for a reference article (SEVref) and each note (SEVnote). The algorithm scores the relevance of a note as the cosine similarity between SEVnote and SEVref. RESULTS The algorithm was validated against four sets of 200 notes that were manually annotated by clinical experts to assess their informativeness to one of three disease phenotypes. pGUESS algorithm substantially outperforms existing unsupervised approaches for classifying the relevance status with respect to both accuracy and scalability across phenotypes. Averaging over the three phenotypes, the rank correlation between the algorithm ranking and gold standard label was 0.64 for pGUESS, but only 0.47 and 0.35 for the next two best performing algorithms. pGUESS is also much more computationally scalable compared to existing algorithms. 
CONCLUSION The pGUESS algorithm can substantially reduce the burden of chart review and has the potential to improve the efficiency and accuracy of human annotation.
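The scoring step described in the METHOD paragraph, relevance as the cosine similarity between a note embedding and a reference embedding, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the embedding values are made-up toy numbers, and the names `cosine_similarity` and `rank_notes` are hypothetical. How the embeddings themselves are derived is not covered here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_notes(sev_ref, sev_notes):
    """Return note ids ordered from most to least similar to the reference."""
    scores = {note_id: cosine_similarity(vec, sev_ref)
              for note_id, vec in sev_notes.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed values, for illustration only).
sev_ref = [1.0, 0.0, 1.0]
sev_notes = {
    "note_a": [2.0, 0.0, 2.0],  # same direction as the reference
    "note_b": [0.0, 1.0, 0.0],  # orthogonal to the reference
    "note_c": [1.0, 1.0, 0.0],  # partial overlap
}
ranking = rank_notes(sev_ref, sev_notes)
```

Because cosine similarity ignores vector length, a long note and a short note that point in the same direction in embedding space receive the same score.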
Affiliation(s)
- Tianrun Cai
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, 60 Fenwood Road, Boston, USA; Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Suite 514, Boston, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, USA.
| | - Zeling He
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, 60 Fenwood Road, Boston, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA, USA
| | - Chuan Hong
- Department of Biostatistics & Bioinformatics, Duke University, Duke University Medical Center 2424 Erwin Road, Suite 1102 Hock Plaza Box 2721, Durham, NC, USA
| | - Yichi Zhang
- Department of Computer Science and Statistics, University of Rhode Island, Tyler Hall, 9 Greenhouse Road, Suite 2, Kingston, RI, USA
| | - Yuk-Lam Ho
- VA Boston Healthcare System, 150 S Huntington Ave, Boston, USA
| | | | - Alon Geva
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Suite 514, Boston, USA; Department of Anesthesiology, Boston Children's Hospital, 300 Longwood Avenue, Bader, 6th Floor, Boston, MA, USA
| | | | - Amanda King
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA, USA
| | - David R Gagnon
- VA Boston Healthcare System, 150 S Huntington Ave, Boston, USA; Department of Biostatistics, Boston University, School of Public Health, 801 Massachusetts Ave Crosstown Center, Boston, MA, USA
| | - Michael Gaziano
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, 60 Fenwood Road, Boston, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, USA
| | - Kelly Cho
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, 60 Fenwood Road, Boston, USA; Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Suite 514, Boston, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, USA
| | - Katherine Liao
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, 60 Fenwood Road, Boston, USA; Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Suite 514, Boston, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, USA
| | - Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Suite 514, Boston, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA, USA
|
31
|
Liang L, Hou J, Uno H, Cho K, Ma Y, Cai T. Semi-supervised approach to event time annotation using longitudinal electronic health records. Lifetime Data Anal 2022; 28:428-491. [PMID: 35753014 PMCID: PMC10044535 DOI: 10.1007/s10985-022-09557-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2021] [Accepted: 05/13/2022] [Indexed: 06/15/2023]
Abstract
Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models using real world data, however, faces practical and methodological challenges. Precise information on important clinical outcomes, such as time to cancer progression, is not readily available in these databases. The true clinical event times typically cannot be approximated well based on simple extracts of billing or procedure codes, whereas annotating event times manually is prohibitive in both time and resources. In this paper, we propose a two-step semi-supervised multi-modal automated time annotation (MATA) method leveraging multi-dimensional longitudinal EHR encounter records. In step I, we employ a functional principal component analysis approach to estimate the underlying intensity functions based on observed point processes from the unlabeled patients. In step II, we fit a penalized proportional odds model to the event time outcomes in the labeled data, using the features derived in step I, where the non-parametric baseline function is approximated using B-splines. Under regularity conditions, the resulting estimator of the feature effect vector is shown to be root-n consistent. We demonstrate the superiority of our approach relative to existing approaches through simulations and a real data example on annotating lung cancer recurrence in an EHR cohort of lung cancer patients from the Veterans Health Administration.
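For readers unfamiliar with the model class named in step II, a generic proportional odds formulation can be written as follows. The notation is assumed for illustration and is not taken from the paper: $T$ is the event time, $Z$ the feature vector derived in step I, $\alpha(t)$ the baseline function, $B_k$ a B-spline basis, and $p(\cdot)$ a penalty whose form the abstract does not specify.

```latex
% Proportional odds model with a B-spline approximation of the baseline
\operatorname{logit} P(T \le t \mid Z) = \alpha(t) + \beta^{\top} Z,
\qquad
\alpha(t) \approx \sum_{k=1}^{K} \gamma_k B_k(t)

% Penalized estimation on the labeled data (log-likelihood \ell_n, tuning parameter \lambda)
(\hat{\beta}, \hat{\gamma}) = \arg\min_{\beta, \gamma}
\left\{ -\ell_n(\beta, \gamma) + \lambda \, p(\beta) \right\}

% Root-n consistency of the feature effect estimator
\lVert \hat{\beta} - \beta_0 \rVert = O_p\!\left(n^{-1/2}\right)
```

Here $n$ denotes the number of labeled patients and $\beta_0$ the true feature effect vector; the last line restates the root-n consistency claim in symbols.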
Affiliation(s)
- Liang Liang
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Jue Hou
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Hajime Uno
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Kelly Cho
- Massachusetts Veterans Epidemiology Research and Information Center, US Department of Veteran Affairs, Boston, MA, USA
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Yanyuan Ma
- Department of Statistics, Penn State University, University Park, PA, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
|
32
|
Kurniansyah N, Goodman MO, Kelly TN, Elfassy T, Wiggins KL, Bis JC, Guo X, Palmas W, Taylor KD, Lin HJ, Haessler J, Gao Y, Shimbo D, Smith JA, Yu B, Feofanova EV, Smit RAJ, Wang Z, Hwang SJ, Liu S, Wassertheil-Smoller S, Manson JE, Lloyd-Jones DM, Rich SS, Loos RJF, Redline S, Correa A, Kooperberg C, Fornage M, Kaplan RC, Psaty BM, Rotter JI, Arnett DK, Morrison AC, Franceschini N, Levy D, Sofer T. A multi-ethnic polygenic risk score is associated with hypertension prevalence and progression throughout adulthood. Nat Commun 2022; 13:3549. [PMID: 35729114 PMCID: PMC9213527 DOI: 10.1038/s41467-022-31080-2] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 12/14/2021] [Accepted: 05/31/2022] [Indexed: 12/12/2022] Open
Abstract
In a multi-stage analysis of 52,436 individuals aged 17-90 across diverse cohorts and biobanks, we train, test, and evaluate a polygenic risk score (PRS) for hypertension risk and progression. The PRS is trained using genome-wide association studies (GWAS) for systolic, diastolic blood pressure, and hypertension, respectively. For each trait, PRS is selected by optimizing the coefficient of variation (CV) across estimated effect sizes from multiple potential PRS using the same GWAS, after which the 3 trait-specific PRSs are combined via an unweighted sum called "PRSsum", forming the HTN-PRS. The HTN-PRS is associated with both prevalent and incident hypertension at 4-6 years of follow up. This association is further confirmed in age-stratified analysis. In an independent biobank of 40,201 individuals, the HTN-PRS is confirmed to be predictive of increased risk for coronary artery disease, ischemic stroke, type 2 diabetes, and chronic kidney disease.
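The "PRSsum" combination described above, an unweighted sum of trait-specific scores, can be sketched as follows. This is an illustrative sketch only: the per-person score values are invented, the function names are hypothetical, and standardizing each trait-specific PRS before summing is an assumption made here so that no single score dominates through its scale; the abstract itself states only that the sum is unweighted.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Scale a list of scores to mean 0 and (population) standard deviation 1."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("cannot standardize a constant score")
    return [(s - mu) / sd for s in scores]


def prs_sum(*trait_scores):
    """Unweighted sum of standardized trait-specific scores, per individual."""
    standardized = [standardize(scores) for scores in trait_scores]
    return [sum(values) for values in zip(*standardized)]


# Toy scores for three individuals (assumed values, for illustration only).
sbp_prs = [1.0, 2.0, 3.0]   # systolic blood pressure PRS
dbp_prs = [2.0, 4.0, 6.0]   # diastolic blood pressure PRS
htn_prs = [0.0, 1.0, 2.0]   # hypertension PRS
combined = prs_sum(sbp_prs, dbp_prs, htn_prs)
```

In this toy input all three scores rank the individuals identically, so the combined score preserves that ordering and the middle individual lands at zero.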
Affiliation(s)
- Nuzulul Kurniansyah
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
| | - Matthew O Goodman
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Tanika N Kelly
- Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA
| | - Tali Elfassy
- Department of Medicine, University of Miami Miller School of Medicine, Miami, FL, USA
| | - Kerri L Wiggins
- Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA, USA
| | - Joshua C Bis
- Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA, USA
| | - Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Walter Palmas
- Department of Medicine, Columbia University Medical Center, New York, NY, USA
| | - Kent D Taylor
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Henry J Lin
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Jeffrey Haessler
- Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA
| | - Yan Gao
- The Jackson Heart Study, University of Mississippi Medical Center, Jackson, MS, USA
| | - Daichi Shimbo
- Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
| | - Jennifer A Smith
- Department of Epidemiology, University of Michigan School of Public Health, Ann Arbor, MI, USA
| | - Bing Yu
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Elena V Feofanova
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Roelof A J Smit
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Zhe Wang
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Shih-Jen Hwang
- Department of Biostatistics, Boston University, Boston, MA, USA
| | - Simin Liu
- Center for Global Cardiometabolic Health and Departments of Epidemiology, Medicine, and Surgery, Brown University, Providence, RI, USA
| | - Sylvia Wassertheil-Smoller
- Department of Epidemiology & Population Health, Department of Pediatrics, Albert Einstein College of Medicine, Bronx, NY, USA
| | - JoAnn E Manson
- Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | | | - Stephen S Rich
- Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
| | - Ruth J F Loos
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Susan Redline
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Adolfo Correa
- Departments of Medicine and Pediatrics, University of Mississippi Medical Center, Jackson, MS, USA
| | - Charles Kooperberg
- Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA
| | - Myriam Fornage
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
- Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Robert C Kaplan
- Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Bruce M Psaty
- Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Systems and Population Health, University of Washington, Seattle, WA, USA
| | - Jerome I Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Donna K Arnett
- College of Public Health, University of Kentucky, Lexington, KY, USA
| | - Alanna C Morrison
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Nora Franceschini
- Department of Epidemiology, University of North Carolina, Chapel Hill, NC, USA
| | - Daniel Levy
- The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA
- The Framingham Heart Study, Framingham, MA, USA
| | - Tamar Sofer
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
|
33
|
Binkheder S, Wu HY, Quinney SK, Zhang S, Zitu MM, Chiang CW, Wang L, Jones J, Li L. PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature. J Biomed Semantics 2022; 13:17. [PMID: 35690873 PMCID: PMC9188713 DOI: 10.1186/s13326-022-00272-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/07/2019] [Accepted: 05/18/2022] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Adverse events induced by drug-drug interactions are a major concern in the United States. Current research is moving toward using electronic health record (EHR) data, including for adverse drug event discovery. One of the first steps in EHR-based studies is to define a phenotype for establishing a cohort of patients. However, phenotype definitions are not readily available for all phenotypes. One of the first steps of developing automated text mining tools is building a corpus. Therefore, this study aimed to develop annotation guidelines and a gold standard corpus to facilitate building future automated approaches for mining phenotype definitions contained in the literature. Furthermore, our aim is to improve the understanding of how these published phenotype definitions are presented in the literature and how we annotate them for future text mining tasks. RESULTS Two annotators manually annotated the corpus at the sentence level for the presence of evidence for phenotype definitions. Three major categories (inclusion, intermediate, and exclusion) with a total of ten dimensions were proposed, characterizing major contextual patterns and cues for presenting phenotype definitions in published literature. The developed annotation guidelines were used to annotate the corpus, which contained 3971 sentences: 1923 out of 3971 (48.4%) for the inclusion category, 1851 out of 3971 (46.6%) for the intermediate category, and 2273 out of 3971 (57.2%) for the exclusion category. The highest number of annotated sentences was 1449 out of 3971 (36.5%) for the "Biomedical & Procedure" dimension. The lowest number of annotated sentences was 49 out of 3971 (1.2%) for "The use of NLP". The overall percent inter-annotator agreement was 97.8%. Percent and Kappa statistics also showed high inter-annotator agreement across all dimensions.
CONCLUSIONS The corpus and annotation guidelines can serve as a foundational informatics approach for annotating and mining phenotype definitions in literature, and can be used later for text mining applications.
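The two agreement measures mentioned in the RESULTS, percent agreement and the Kappa statistic, can be computed for a pair of annotators as sketched below. This is a generic illustration of Cohen's kappa, assuming two annotators and one label per sentence; the label sequences are invented, and the study's exact agreement procedure is not described in the abstract.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    p_expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                     for c in categories)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy sentence-level labels (1 = dimension present, 0 = absent); assumed values.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa is lower than raw percent agreement whenever some agreement would be expected by chance, which is why the two statistics are usually reported together.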
Affiliation(s)
- Samar Binkheder
- Department of Biohealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN, USA
- Medical Informatics Unit, Department of Medical Education, College of Medicine, King Saud University, Riyadh, Saudi Arabia
| | - Heng-Yi Wu
- Development Science Informatics, Genentech, South San Francisco, CA, USA
| | - Sara K Quinney
- Department of Obstetrics and Gynecology, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Shijun Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| | - Md Muntasir Zitu
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| | - Chien-Wei Chiang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| | - Lei Wang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA
| | - Josette Jones
- Department of Biohealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN, USA
| | - Lang Li
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, USA.
- 250 Lincoln Tower, 1800 Cannon Drive, Columbus, OH, 43210, USA.
|
34
|
Hao L, Kraft P, Berriz GF, Hynes ED, Koch C, Korategere V Kumar P, Parpattedar SS, Steeves M, Yu W, Antwi AA, Brunette CA, Danowski M, Gala MK, Green RC, Jones NE, Lewis ACF, Lubitz SA, Natarajan P, Vassy JL, Lebo MS. Development of a clinical polygenic risk score assay and reporting workflow. Nat Med 2022; 28:1006-1013. [PMID: 35437332 PMCID: PMC9117136 DOI: 10.1038/s41591-022-01767-6] [Citation(s) in RCA: 92] [Impact Index Per Article: 30.7] [Received: 07/22/2021] [Accepted: 03/02/2022] [Indexed: 12/31/2022]
Abstract
Implementation of polygenic risk scores (PRS) may improve disease prevention and management but poses several challenges: the construction of clinically valid assays, interpretation for individual patients, and the development of clinical workflows and resources to support their use in patient care. For the ongoing Genomic Medicine at Veterans Affairs (GenoVA) Study, we developed a clinical genotype array-based assay for six published PRS. We used data from 36,423 Mass General Brigham Biobank participants and adjustment for population structure to replicate known PRS-disease associations and published PRS thresholds for a disease odds ratio (OR) of 2 (ranging from 1.75 (95% CI: 1.57-1.95) for type 2 diabetes to 2.38 (95% CI: 2.07-2.73) for breast cancer). After confirming the high performance and robustness of the pipeline for use as a clinical assay for individual patients, we analyzed the first 227 prospective samples from the GenoVA Study and found that the frequency of PRS corresponding to published OR > 2 ranged from 13/227 (5.7%) for colorectal cancer to 23/150 (15.3%) for prostate cancer. In addition to the PRS laboratory report, we developed physician- and patient-oriented informational materials to support decision-making about PRS results. Our work illustrates the generalizable development of a clinical PRS assay for multiple conditions and the technical, reporting and clinical workflow challenges for implementing PRS information in the clinic.
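The odds ratios and 95% confidence intervals quoted above compare disease odds above and below a PRS threshold. As a rough illustration of how such an estimate is formed, the sketch below computes an unadjusted odds ratio with a Wald interval from a 2x2 table. It is a simplification under stated assumptions: the counts are invented, the function name `odds_ratio_ci` is hypothetical, and the study adjusted for population structure, which a raw 2x2 table does not do.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval (z=1.96 for 95%)."""
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Toy counts (assumed): "exposed" means PRS above the chosen threshold.
estimate, lower, upper = odds_ratio_ci(30, 70, 100, 800)
```

An interval that excludes 1 indicates that, in this toy table, the association between a high PRS and disease is unlikely to be due to chance alone.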
Affiliation(s)
- Limin Hao
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
| | - Peter Kraft
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Gabriel F Berriz
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
| | - Elizabeth D Hynes
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
| | - Christopher Koch
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
| | | | - Shruti S Parpattedar
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
| | - Marcie Steeves
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
- Medical Genetics, Massachusetts General Hospital, Boston, MA, USA
| | - Wanfeng Yu
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
| | - Ashley A Antwi
- Veterans Affairs Boston Healthcare System, Boston, MA, USA
| | | | | | - Manish K Gala
- Division of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
| | - Robert C Green
- Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Precision Population Health, Ariadne Labs, Boston, MA, USA
| | - Natalie E Jones
- Veterans Affairs Boston Healthcare System, Boston, MA, USA
- Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Anna C F Lewis
- E J Safra Center for Ethics, Harvard University, Cambridge, MA, USA
| | - Steven A Lubitz
- Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Demoulas Center for Cardiac Arrhythmias, Massachusetts General Hospital, Boston, MA, USA
| | - Pradeep Natarajan
- Harvard Medical School, Boston, MA, USA
- Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
- Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
| | - Jason L Vassy
- Veterans Affairs Boston Healthcare System, Boston, MA, USA.
- Harvard Medical School, Boston, MA, USA.
- Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA.
- Precision Population Health, Ariadne Labs, Boston, MA, USA.
| | - Matthew S Lebo
- Laboratory for Molecular Medicine, Mass General Brigham Personalized Medicine, Cambridge, MA, USA
- Harvard Medical School, Boston, MA, USA
- Department of Pathology, Brigham and Women's Hospital, Boston, MA, USA
|
35
|
Kachroo P, Stewart ID, Kelly RS, Stav M, Mendez K, Dahlin A, Soeteman DI, Chu SH, Huang M, Cote M, Knihtilä HM, Lee-Sarwar K, McGeachie M, Wang A, Wu AC, Virkud Y, Zhang P, Wareham NJ, Karlson EW, Wheelock CE, Clish C, Weiss ST, Langenberg C, Lasky-Su JA. Metabolomic profiling reveals extensive adrenal suppression due to inhaled corticosteroid therapy in asthma. Nat Med 2022; 28:814-822. [PMID: 35314841 PMCID: PMC9350737 DOI: 10.1038/s41591-022-01714-5] [Citation(s) in RCA: 58] [Impact Index Per Article: 19.3] [Received: 02/22/2021] [Accepted: 01/24/2022] [Indexed: 02/02/2023]
Abstract
The application of large-scale metabolomic profiling provides new opportunities for realizing the potential of omics-based precision medicine for asthma. By leveraging data from over 14,000 individuals in four distinct cohorts, this study identifies and independently replicates 17 steroid metabolites whose levels were significantly reduced in individuals with prevalent asthma. Although steroid levels were reduced among all asthma cases regardless of medication use, the largest reductions were associated with inhaled corticosteroid (ICS) treatment, as confirmed in a 4-year low-dose ICS clinical trial. Effects of ICS treatment on steroid levels were dose dependent; however, significant reductions also occurred with low-dose ICS treatment. Using information from electronic medical records, we found that cortisol levels were substantially reduced throughout the entire 24-hour daily period in patients with asthma who were treated with ICS compared to those who were untreated and to patients without asthma. Moreover, patients with asthma who were treated with ICS showed significant increases in fatigue and anemia as compared to those without ICS treatment. Adrenal suppression in patients with asthma treated with ICS might, therefore, represent a larger public health problem than previously recognized. Regular cortisol monitoring of patients with asthma treated with ICS is needed to provide the optimal balance between minimizing adverse effects of adrenal suppression while capitalizing on the established benefits of ICS treatment.
Affiliation(s)
- Priyadarshini Kachroo
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | | | - Rachel S Kelly
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Meryl Stav
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Kevin Mendez
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Amber Dahlin
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Djøra I Soeteman
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Center for Health Decision Science, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Su H Chu
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Mengna Huang
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Margaret Cote
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Hanna M Knihtilä
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Department of Pediatrics, Stanford University School of Medicine, Stanford, CA, USA
| | - Kathleen Lee-Sarwar
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Michael McGeachie
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Alberta Wang
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Ann Chen Wu
- Harvard Pilgrim Health Care Institute and Department of Population Medicine, Harvard Medical School, Boston, MA, USA
| | - Yamini Virkud
- Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Pei Zhang
- Gunma University Initiative for Advanced Research (GIAR), Gunma University, Maebashi, Japan
- Department of Medical Biochemistry and Biophysics, Division of Physiological Chemistry 2, Karolinska Institute, Stockholm, Sweden
| | | | - Elizabeth W Karlson
- Department of Medicine, Division of Rheumatology, Inflammation and Immunity, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Craig E Wheelock
- Gunma University Initiative for Advanced Research (GIAR), Gunma University, Maebashi, Japan
- Department of Medical Biochemistry and Biophysics, Division of Physiological Chemistry 2, Karolinska Institute, Stockholm, Sweden
- Department of Respiratory Medicine and Allergy, Karolinska University Hospital, Stockholm, Sweden
| | | | - Scott T Weiss
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Claudia Langenberg
- MRC Epidemiology Unit, University of Cambridge, Cambridge, UK
- Computational Medicine, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Jessica A Lasky-Su
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA.
|
36
|
Kothari C, Srivastava S, Kousa Y, Izem R, Gierdalski M, Kim D, Good A, Dies KA, Geisel G, Morizono H, Gallo V, Pomeroy SL, Garden GA, Guay-Woodford L, Sahin M, Avillach P. Validation of a computational phenotype for finding patients eligible for genetic testing for pathogenic PTEN variants across three centers. J Neurodev Disord 2022; 14:24. [PMID: 35321655 PMCID: PMC8943944 DOI: 10.1186/s11689-022-09434-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/10/2021] [Accepted: 03/04/2022] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Computational phenotypes are most often combinations of patient billing codes that are highly predictive of disease using electronic health records (EHR). In the case of rare diseases that can only be diagnosed by genetic testing, computational phenotypes identify patient cohorts for genetic testing and possible diagnosis. This article details the validation of a computational phenotype for PTEN hamartoma tumor syndrome (PHTS) against the EHR of patients at three collaborating clinical research centers: Boston Children's Hospital, Children's National Hospital, and the University of Washington. METHODS A combination of billing codes from the International Classification of Diseases versions 9 and 10 (ICD-9 and ICD-10) for diagnostic criteria postulated by a research team at Cleveland Clinic was used to identify patient cohorts for genetic testing from the clinical data warehouses at the three research centers. Subsequently, the EHRs of these patients (including billing codes, clinical notes, and genetic reports) were reviewed by clinical experts to identify patients with PHTS. RESULTS The PTEN genetic testing yield of the computational phenotype, the number of patients who needed to be genetically tested for incidence of pathogenic PTEN gene variants, ranged from 82 to 94% at the three centers. CONCLUSIONS Computational phenotypes have the potential to enable the timely and accurate diagnosis of rare genetic diseases such as PHTS by identifying patient cohorts for genetic sequencing and testing.
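The idea of a computational phenotype as a combination of billing codes can be illustrated with a small rule-based sketch. Everything specific here is hypothetical: the code strings are placeholders rather than real ICD codes, the criterion groups do not correspond to the Cleveland Clinic criteria, and the "at least two criteria" rule is an assumption chosen only to show the mechanism.

```python
def flag_patients(patient_codes, criteria, min_criteria=2):
    """Return ids of patients whose billing codes satisfy enough criteria.

    patient_codes: dict mapping patient id -> set of billing codes on record
    criteria: dict mapping criterion name -> set of codes that satisfy it
    """
    flagged = []
    for patient_id, codes in patient_codes.items():
        # A criterion is met if the patient has any one of its codes.
        met = sum(1 for code_group in criteria.values() if codes & code_group)
        if met >= min_criteria:
            flagged.append(patient_id)
    return sorted(flagged)


# Placeholder codes and criterion groups (hypothetical, not real ICD codes).
criteria = {
    "criterion_1": {"CODE_A", "CODE_B"},
    "criterion_2": {"CODE_C"},
    "criterion_3": {"CODE_D"},
}
patient_codes = {
    "p1": {"CODE_A", "CODE_C"},            # meets criteria 1 and 2
    "p2": {"CODE_B"},                      # meets criterion 1 only
    "p3": {"CODE_C", "CODE_D", "CODE_Z"},  # meets criteria 2 and 3
    "p4": set(),                           # no relevant codes
}
candidates = flag_patients(patient_codes, criteria)
```

In a workflow like the one described, the flagged patients would then go to expert chart review and genetic testing; the sketch covers only the code-matching step.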
Affiliation(s)
- Cartik Kothari
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA
| | - Siddharth Srivastava
- Department of Neurology, Rosamund Stone Zander Translational Neuroscience Center, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Youssef Kousa
- Division of Neurology, Children's National Hospital, Washington, DC, 20010, USA; Department of Genomics and Precision Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC, 20052, USA
| | - Rima Izem
- Division of Biostatistics and Study Methodology, Children's National Research Institute, Silver Spring, MD, 20910, USA
| | - Marcin Gierdalski
- Division of Biostatistics and Study Methodology, Children's National Hospital, Washington, DC, 20010, USA
| | - Dongkyu Kim
- Division of Biostatistics and Study Methodology, Children's National Hospital, Washington, DC, 20010, USA
| | - Amy Good
- Institute for Translational Health Sciences, University of Washington, Seattle, WA, 98195, USA
| | - Kira A Dies
- Department of Neurology, Rosamund Stone Zander Translational Neuroscience Center, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Gregory Geisel
- Department of Neurology, Rosamund Stone Zander Translational Neuroscience Center, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Hiroki Morizono
- Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, 20010, USA.,Department of Genomics and Precision Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC, 20052, USA
| | - Vittorio Gallo
- Center for Neuroscience Research, Children's National Research Institute, Children's National Hospital, Washington, DC, 20010, USA
| | - Scott L Pomeroy
- Department of Neurology, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Gwenn A Garden
- Department of Neurology and Center on Human Development and Disability, University of Washington, Seattle, WA, 98195, USA.,Department of Neurology, University of North Carolina, Chapel Hill, NC, 27599, USA
| | - Lisa Guay-Woodford
- Center for Translational Research, Children's National Hospital, Washington, DC, 20010, USA
| | - Mustafa Sahin
- Department of Neurology, Rosamund Stone Zander Translational Neuroscience Center, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Paul Avillach
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA.
| |
Collapse
|
37
|
O’Connor MJ, Schroeder P, Huerta-Chagoya A, Cortés-Sánchez P, Bonàs-Guarch S, Guindo-Martínez M, Cole JB, Kaur V, Torrents D, Veerapen K, Grarup N, Kurki M, Rundsten CF, Pedersen O, Brandslund I, Linneberg A, Hansen T, Leong A, Florez JC, Mercader JM. Recessive Genome-Wide Meta-analysis Illuminates Genetic Architecture of Type 2 Diabetes. Diabetes 2022; 71:554-565. [PMID: 34862199 PMCID: PMC8893948 DOI: 10.2337/db21-0545] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/22/2021] [Accepted: 11/28/2021] [Indexed: 11/13/2022]
Abstract
Most genome-wide association studies (GWAS) of complex traits are performed using models with additive allelic effects. Hundreds of loci associated with type 2 diabetes have been identified using this approach. Additive models, however, can miss loci with recessive effects, thereby leaving potentially important genes undiscovered. We conducted the largest GWAS meta-analysis using a recessive model for type 2 diabetes. Our discovery sample included 33,139 case subjects and 279,507 control subjects from 7 European-ancestry cohorts, including the UK Biobank. We identified 51 loci associated with type 2 diabetes, including five variants undetected by prior additive analyses. Two of the five variants had minor allele frequency of <5% and were each associated with more than a doubled risk in homozygous carriers. Using two additional cohorts, FinnGen and a Danish cohort, we replicated three of the variants, including one of the low-frequency variants, rs115018790, which had an odds ratio in homozygous carriers of 2.56 (95% CI 2.05-3.19; P = 1 × 10⁻¹⁶) and a stronger effect in men than in women (P for interaction = 7 × 10⁻⁷). The signal was associated with multiple diabetes-related traits, with homozygous carriers showing a 10% decrease in LDL cholesterol and a 20% increase in triglycerides; colocalization analysis linked this signal to reduced expression of the nearby PELO gene. These results demonstrate that recessive models, when compared with GWAS using the additive approach, can identify novel loci, including large-effect variants with pathophysiological consequences relevant to type 2 diabetes.
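The difference between additive and recessive genotype coding that this abstract turns on can be illustrated with a minimal sketch. The function names and the toy genotype list are illustrative assumptions and are not taken from the authors' analysis pipeline.

```python
def additive_coding(minor_allele_count: int) -> int:
    """Additive model: the predictor is the minor-allele count (0, 1, or 2)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1, or 2")
    return minor_allele_count


def recessive_coding(minor_allele_count: int) -> int:
    """Recessive model: only homozygous carriers (2 copies) count as exposed."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1, or 2")
    return 1 if minor_allele_count == 2 else 0


# Hypothetical genotypes for six individuals (minor-allele counts).
genotypes = [0, 1, 2, 1, 0, 2]
additive = [additive_coding(g) for g in genotypes]
recessive = [recessive_coding(g) for g in genotypes]
```

Under the recessive coding, heterozygous carriers are grouped with non-carriers, which is why an effect confined to homozygous carriers can be diluted when the same data are analyzed with the additive coding.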
Affiliation(s)
- Mark J. O’Connor
- Department of Medicine, Massachusetts General Hospital, Boston, MA
- Endocrine Division, Massachusetts General Hospital, Boston, MA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Philip Schroeder
- Diabetes Unit, Massachusetts General Hospital, Boston, MA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Alicia Huerta-Chagoya
- Consejo Nacional de Ciencia y Tecnología (CONACYT), Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico
- Joanne B. Cole
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Center for Basic and Translational Obesity Research, Boston Children’s Hospital, Boston, MA
- Varinderpal Kaur
- Diabetes Unit, Massachusetts General Hospital, Boston, MA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- David Torrents
- Barcelona Supercomputing Center (BSC), Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Kumar Veerapen
- Department of Medicine, Harvard Medical School, Boston, MA
- Stanley Center for Psychiatric Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA
- Niels Grarup
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Mitja Kurki
- Department of Medicine, Harvard Medical School, Boston, MA
- Stanley Center for Psychiatric Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA
- Carsten F. Rundsten
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Oluf Pedersen
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Ivan Brandslund
- Department of Clinical Biochemistry, Lillebaelt Hospital, Vejle, Denmark
- Institute of Regional Health Research, University of Southern Denmark, Odense, Denmark
- Allan Linneberg
- Center for Clinical Research and Prevention, Bispebjerg and Frederiksberg Hospital, Copenhagen, Denmark
- Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Torben Hansen
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Aaron Leong
- Department of Medicine, Massachusetts General Hospital, Boston, MA
- Endocrine Division, Massachusetts General Hospital, Boston, MA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA
- Jose C. Florez
- Department of Medicine, Massachusetts General Hospital, Boston, MA
- Endocrine Division, Massachusetts General Hospital, Boston, MA
- Diabetes Unit, Massachusetts General Hospital, Boston, MA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Josep M. Mercader
- Diabetes Unit, Massachusetts General Hospital, Boston, MA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA
- Department of Medicine, Harvard Medical School, Boston, MA
|
38
|
Zhang Y, Liu M, Neykov M, Cai T. Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping. Journal of Machine Learning Research 2022; 23:83. [PMID: 37974910 PMCID: PMC10653017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2023]
Abstract
Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite their potential, EHR data are currently underutilized for discovery research owing to a major limitation: the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.
Affiliation(s)
- Yichi Zhang
- Department of Computer Science and Statistics, University of Rhode Island
- Molei Liu
- Department of Biostatistics, Harvard T.H. Chan School of Public Health
- Matey Neykov
- Department of Statistics and Data Science, Carnegie Mellon University
- Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health
|
39
|
Liu X, Chubak J, Hubbard RA, Chen Y. SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies. J Am Med Inform Assoc 2021; 29:918-927. [PMID: 34962283 PMCID: PMC9714591 DOI: 10.1093/jamia/ocab267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Revised: 10/16/2021] [Accepted: 11/23/2021] [Indexed: 12/30/2022]
Abstract
OBJECTIVES Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies solely relying on potentially error-prone EHR-derived phenotypes (ie, surrogates) are subject to bias. Analyses of low prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to simultaneously address both issues by developing new sampling methods to select an optimal subsample to collect gold standard phenotypes for improving the accuracy of association estimation. MATERIALS AND METHODS We develop a surrogate-assisted two-wave (SAT) sampling method, where a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated from A-optimality criterion (OSMAC) are employed sequentially, to select a subsample for outcome validation through manual chart review subject to budget constraints. A model is then fitted based on the subsample with the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors are conducted to demonstrate the effectiveness of SAT. RESULTS We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resultant estimator of the association. CONCLUSIONS The proposed approach can handle the problem brought by the rarity of cases and misclassification of the surrogate in phenotype-absent EHR-based association studies. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.
Affiliation(s)
- Xiaokang Liu
- Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
- Jessica Chubak
- Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA; Department of Epidemiology, University of Washington, Seattle, Washington, USA
- Rebecca A Hubbard
- Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
- Yong Chen
- Corresponding Author: Yong Chen, PhD, Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, 423 Guardian Drive, Philadelphia, PA 19104, USA
|
40
|
Greer ML, Davis K, Stack BC. Machine learning can identify patients at risk of hyperparathyroidism without known calcium and intact parathyroid hormone. Head Neck 2021; 44:817-822. [PMID: 34953008 DOI: 10.1002/hed.26970] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/29/2021] [Revised: 11/01/2021] [Accepted: 12/16/2021] [Indexed: 01/16/2023]
Abstract
BACKGROUND To prove the concept of diagnosing primary hyperparathyroidism (pHPT) without calcium and parathyroid hormone (PTH) values and identifying potential risk factors for pHPT. METHODS Data were extracted from the clinical data warehouse (CDW) at the University of Arkansas for Medical Sciences (UAMS) Epic EHR (2014-2019). RESULTS 1737 patients with over 185 000 rows of clinical data were provided in a relational structure and processed/flattened to facilitate modeling. Phenotype elements were identified for pHPT without advance knowledge of calcium and PTH levels. The area under the curve (AUC) for the prediction of pHPT using our model was 0.86 with sensitivity and specificity of 0.8953 and 0.6686, respectively, using a 0.45 probability threshold. CONCLUSION Primary hyperparathyroidism was predicted from a dataset excluding calcium and PTH data with 86% accuracy. This approach needs to be validated/refined on larger samples of data and plans are in place to do this with other regional/national datasets.
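The abstract reports sensitivity and specificity at a fixed 0.45 probability threshold alongside an AUC. A minimal sketch of how the two threshold-dependent metrics are computed follows. The function name, the toy labels and probabilities, and the choice to classify as positive when the probability is greater than or equal to the threshold are assumptions for illustration; the study's own data and model are not reproduced here.

```python
def sensitivity_specificity(y_true, y_prob, threshold=0.45):
    """Return (sensitivity, specificity) after thresholding predicted probabilities.

    Assumption: a case is predicted positive when prob >= threshold.
    """
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels (1 = pHPT) and model probabilities for six patients.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.5, 0.3, 0.6, 0.2, 0.1]
sens, spec = sensitivity_specificity(y_true, y_prob, threshold=0.45)
```

Both values change as the threshold moves, whereas the AUC summarizes ranking performance across all thresholds; an AUC of 0.86 is therefore not the same quantity as classification accuracy at any single threshold.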
Affiliation(s)
- Melody L Greer
- Department of Health Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
- Kyle Davis
- Department of Otolaryngology - Head and Neck Surgery, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
- Brendan C Stack
- Department of Otolaryngology - Head and Neck Surgery, Southern Illinois University School of Medicine, Springfield, Illinois, USA
|
41
|
Kachroo P, Sordillo JE, Lutz SM, Weiss ST, Kelly RS, McGeachie MJ, Wu AC, Lasky-Su JA. Pharmaco-Metabolomics of Inhaled Corticosteroid Response in Individuals with Asthma. J Pers Med 2021; 11:jpm11111148. [PMID: 34834499 PMCID: PMC8622526 DOI: 10.3390/jpm11111148] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/17/2021] [Revised: 10/28/2021] [Accepted: 10/30/2021] [Indexed: 12/26/2022]
Abstract
Metabolomic indicators of asthma treatment responses have yet to be identified. In this study, we aimed to uncover plasma metabolomic profiles associated with asthma exacerbations while on inhaled corticosteroid (ICS) treatment. We determined whether these profiles change with age from adolescence to adulthood. We utilized data from 170 individuals with asthma on ICS from the Mass General Brigham Biobank to identify plasma metabolites associated with asthma exacerbations while on ICS and examined potential effect modification of metabolite-exacerbation associations by age. We used liquid chromatography-high-resolution mass spectrometry-based metabolomic profiling. Sex-stratified analyses were also performed for the significant associations. The age range of the participating individuals was 13-43 years with a mean age of 33.5 years. Of the 783 endogenous metabolites tested, eight demonstrated significant associations with exacerbation after correction for multiple comparisons and adjusting for potential confounders (Bonferroni p-value < 6.2 × 10⁻⁴). Potential effect modification by sex was detected for fatty acid metabolites, with males showing a greater reduction in their metabolite levels with ICS exacerbation. Thirty-eight metabolites showed suggestive interactions with age on exacerbation (nominal p-value < 0.05). Our findings demonstrate that plasma metabolomic profiles differ for individuals who experience asthma exacerbations while on ICS. The differentiating metabolites may serve as biomarkers of ICS response and may highlight metabolic pathways underlying ICS response variability.
Affiliation(s)
- Priyadarshini Kachroo
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Joanne E. Sordillo
- PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care, Boston, MA 02215, USA
- Sharon M. Lutz
- PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care, Boston, MA 02215, USA
- Scott T. Weiss
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Rachel S. Kelly
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Michael J. McGeachie
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Ann Chen Wu
- PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care, Boston, MA 02215, USA
- Jessica A. Lasky-Su
- Department of Medicine, Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Correspondence: Tel.: +1-617-875-9992
|
42
|
Estiri H, Strasser ZH, Murphy SN. High-throughput phenotyping with temporal sequences. J Am Med Inform Assoc 2021; 28:772-781. [PMID: 33313899 DOI: 10.1093/jamia/ocaa288] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 07/01/2020] [Accepted: 11/04/2020] [Indexed: 12/15/2022]
Abstract
OBJECTIVE High-throughput electronic phenotyping algorithms can accelerate translational research using data from electronic health record (EHR) systems. The temporal information buried in EHRs is often underutilized in developing computational phenotypic definitions. This study aims to develop a high-throughput phenotyping method, leveraging temporal sequential patterns from EHRs. MATERIALS AND METHODS We develop a representation mining algorithm to extract 5 classes of representations from EHR diagnosis and medication records: the aggregated vector of the records (aggregated vector representation), the standard sequential patterns (sequential pattern mining), the transitive sequential patterns (transitive sequential pattern mining), and 2 hybrid classes. Using EHR data on 10 phenotypes from the Mass General Brigham Biobank, we train and validate phenotyping algorithms. RESULTS Phenotyping with temporal sequences resulted in a superior classification performance across all 10 phenotypes compared with the standard representations in electronic phenotyping. The high-throughput algorithm's classification performance was superior or similar to the performance of previously published electronic phenotyping algorithms. We characterize and evaluate the top transitive sequences of diagnosis records paired with the records of risk factors, symptoms, complications, medications, or vaccinations. DISCUSSION The proposed high-throughput phenotyping approach enables seamless discovery of sequential record combinations that may be difficult to assume from raw EHR data. Transitive sequences offer more accurate characterization of the phenotype, compared with its individual components, and reflect the actual lived experiences of the patients with that particular disease. CONCLUSION Sequential data representations provide a precise mechanism for incorporating raw EHR records into downstream machine learning. Our approach starts with user interpretability and works backward to the technology.
Affiliation(s)
- Hossein Estiri
- Harvard Medical School, Boston, Massachusetts, USA; Massachusetts General Hospital, Boston, Massachusetts, USA; Mass General Brigham, Boston, Massachusetts, USA
- Zachary H Strasser
- Harvard Medical School, Boston, Massachusetts, USA; Massachusetts General Hospital, Boston, Massachusetts, USA; Mass General Brigham, Boston, Massachusetts, USA
- Shawn N Murphy
- Harvard Medical School, Boston, Massachusetts, USA; Massachusetts General Hospital, Boston, Massachusetts, USA; Mass General Brigham, Boston, Massachusetts, USA
|
43
|
Cai T, Liu M, Xia Y. Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data. J Am Stat Assoc 2021; 117:2105-2119. [PMID: 37975021 PMCID: PMC10653033 DOI: 10.1080/01621459.2021.1904958] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/31/2019] [Revised: 03/05/2021] [Accepted: 03/13/2021] [Indexed: 01/29/2023]
Abstract
Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as the DataSHIELD constraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through a summary-statistics-based integration procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.
Affiliation(s)
- Tianxi Cai
- Department of Biostatistics, Harvard School of Public Health, Harvard University, Boston, USA
- Molei Liu
- Department of Biostatistics, Harvard School of Public Health, Harvard University, Boston, USA
- Yin Xia
- Department of Statistics, School of Management, Fudan University, Shanghai, China
|
44
|
Wen A, Rasmussen LV, Stone D, Liu S, Kiefer R, Adekkanattu P, Brandt PS, Pacheco JA, Luo Y, Wang F, Pathak J, Liu H, Jiang G. CQL4NLP: Development and Integration of FHIR NLP Extensions in Clinical Quality Language for EHR-driven Phenotyping. AMIA Joint Summits on Translational Science Proceedings 2021; 2021:624-633. [PMID: 34457178 PMCID: PMC8378647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Lack of standardized representation of natural language processing (NLP) components in phenotyping algorithms hinders portability of the phenotyping algorithms and their execution in a high-throughput and reproducible manner. The objective of the study is to develop and evaluate a standard-driven approach, CQL4NLP, that integrates a collection of NLP extensions represented in the HL7 Fast Healthcare Interoperability Resources (FHIR) standard into the clinical quality language (CQL). A minimal NLP data model with 11 NLP-specific data elements was created, including six FHIR NLP extensions. All 11 data elements were identified from their usage in real-world phenotyping algorithms. An NLP ruleset generation mechanism was integrated into the NLP2FHIR pipeline, and the NLP rulesets enabled comparable performance for a case study with the identification of obesity comorbidities. The NLP ruleset generation mechanism created a reproducible process for defining the NLP components of a phenotyping algorithm and its execution.
Affiliation(s)
- Yuan Luo
- Northwestern University, Chicago, IL
- Fei Wang
- Weill Cornell Medicine, New York, NY
|
45
|
Lee J, Liu C, Kim JH, Butler A, Shang N, Pang C, Natarajan K, Ryan P, Ta C, Weng C. Comparative effectiveness of medical concept embedding for feature engineering in phenotyping. JAMIA Open 2021; 4:ooab028. [PMID: 34142015 PMCID: PMC8206403 DOI: 10.1093/jamiaopen/ooab028] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/17/2020] [Revised: 02/23/2021] [Accepted: 05/03/2021] [Indexed: 01/20/2023]
Abstract
OBJECTIVE Feature engineering is a major bottleneck in phenotyping. Properly learned medical concept embeddings (MCEs) capture the semantics of medical concepts and are thus useful for retrieving relevant medical features in phenotyping tasks. We compared the effectiveness of MCEs learned from knowledge graphs and electronic healthcare records (EHR) data in retrieving relevant medical features for phenotyping tasks. MATERIALS AND METHODS We implemented 5 embedding methods including node2vec, singular value decomposition (SVD), LINE, skip-gram, and GloVe with 2 data sources: (1) knowledge graphs obtained from the observational medical outcomes partnership (OMOP) common data model; and (2) patient-level data obtained from the OMOP compatible electronic health records (EHR) from Columbia University Irving Medical Center (CUIMC). We used phenotypes with their relevant concepts developed and validated by the electronic medical records and genomics (eMERGE) network to evaluate the performance of learned MCEs in retrieving phenotype-relevant concepts. Hits@k% in retrieving phenotype-relevant concepts based on a single and multiple seed concept(s) was used to evaluate MCEs. RESULTS Among all MCEs, MCEs learned by using node2vec with knowledge graphs showed the best performance. Of MCEs based on knowledge graphs and EHR data, those learned by using node2vec with knowledge graphs and those learned by using GloVe with EHR data, respectively, outperformed the other MCEs. CONCLUSION MCE enables scalable feature engineering tasks, thereby facilitating phenotyping. Based on current phenotyping practices, MCEs learned by using knowledge graphs constructed by hierarchical relationships among medical concepts outperformed MCEs learned by using EHR data.
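The abstract names Hits@k% as its evaluation metric without defining it. The sketch below shows one plausible reading, stated here as an assumption: rank all candidate concepts by cosine similarity to a single seed concept, take the top k% of that ranking, and report the fraction of phenotype-relevant concepts that were retrieved. The function names, the concept labels, and the two-dimensional toy embeddings are hypothetical and are not drawn from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hits_at_k_percent(embeddings, seed, relevant, k_percent):
    """Fraction of relevant concepts found in the top k% nearest to the seed.

    Assumed definition for illustration; the paper may define it differently.
    """
    candidates = [c for c in embeddings if c != seed]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(embeddings[seed], embeddings[c]),
        reverse=True,
    )
    top_n = max(1, math.ceil(len(ranked) * k_percent / 100))
    retrieved = set(ranked[:top_n])
    targets = set(relevant) - {seed}
    return len(retrieved & targets) / len(targets) if targets else 0.0


# Hypothetical 2-D embeddings for five concepts.
toy_embeddings = {
    "seed": (1.0, 0.0),
    "a": (0.9, 0.1),
    "b": (0.8, 0.3),
    "c": (0.0, 1.0),
    "d": (-1.0, 0.0),
}
score = hits_at_k_percent(toy_embeddings, "seed", relevant=["a", "c"], k_percent=50)
```

With multiple seed concepts, one option (again an assumption) would be to rank candidates against the mean of the seed vectors and reuse the same function.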
Affiliation(s)
- Junghwan Lee
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Cong Liu
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Jae Hyun Kim
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Alex Butler
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Ning Shang
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Chao Pang
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Karthik Natarajan
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Patrick Ryan
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Casey Ta
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
|
46
|
Ruan X, Li Y, Jin X, Deng P, Xu J, Li N, Li X, Liu Y, Hu Y, Xie J, Wu Y, Long D, He W, Yuan D, Guo Y, Li H, Huang H, Yang S, Han M, Zhuang B, Qian J, Cao Z, Zhang X, Xiao J, Xu L. Health-adjusted life expectancy (HALE) in Chongqing, China, 2017: An artificial intelligence and big data method estimating the burden of disease at city level. The Lancet Regional Health. Western Pacific 2021; 9:100110. [PMID: 34379708 PMCID: PMC8315391 DOI: 10.1016/j.lanwpc.2021.100110] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/18/2020] [Revised: 01/25/2021] [Accepted: 02/03/2021] [Indexed: 01/08/2023]
Abstract
BACKGROUND A universally applicable approach that provides standard HALE measurements for different regions has yet to be developed because of the difficulties of health information collection. In this study, we developed a natural language processing (NLP) based HALE estimation approach by using individual-level electronic medical records (EMRs), which made it possible to calculate HALE timely in different temporal or spatial granularities. METHODS We performed diagnostic concept extraction and normalisation on 13.99 million EMRs with NLP to estimate the prevalence of 254 diseases in WHO Global Burden of Disease Study (GBD). Then, we calculated HALE in Chongqing, 2017, by using the life table technique and Sullivan's method, and analysed the contribution of diseases to the expected years "lost" due to disability (DLE). FINDINGS Our method identified a life expectancy at birth (LE0) of 77.9 years and health-adjusted life expectancy at birth (HALE0) of 71.7 years for the general Chongqing population of 2017. In particular, the male LE0 and HALE0 were 76.3 years and 68.9 years, respectively, while the female LE0 and HALE0 were 80.0 years and 74.4 years, respectively. Cerebrovascular diseases, cancers, and injuries were the top three deterioration factors, which reduced HALE by 2.67, 2.15, and 1.19 years, respectively. INTERPRETATION The results demonstrated the feasibility and effectiveness of EMRs-based HALE estimation. Moreover, the method allowed for a potentially transferable framework that facilitated a more convenient comparison of cross-sectional and longitudinal studies on HALE between regions. In summary, this study provided insightful solutions to the global ageing and health problems that the world is facing. FUNDING National Key R and D Program of China (2018YFC2000400).
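The abstract above names Sullivan's method without showing the calculation. As a rough illustration only, the sketch below applies the textbook form of Sullivan's method: person-years lived in each age interval are discounted by the disability prevalence in that interval and divided by the number of survivors at the starting age. The function names, the three age intervals, and all numbers are hypothetical toy values, not figures or code from the study.

```python
from typing import Sequence


def life_expectancy(person_years: Sequence[float], survivors_at_start: float) -> float:
    """Ordinary life expectancy: total person-years lived divided by survivors at the starting age."""
    return sum(person_years) / survivors_at_start


def sullivan_hale(
    person_years: Sequence[float],
    disability_prevalence: Sequence[float],
    survivors_at_start: float,
) -> float:
    """Health-adjusted life expectancy by Sullivan's method.

    person_years[i] is the life-table person-years lived in age interval i,
    and disability_prevalence[i] is the (severity-weighted) share of that
    interval spent in less than full health, between 0 and 1.
    """
    if len(person_years) != len(disability_prevalence):
        raise ValueError("one prevalence value is needed per age interval")
    if any(not 0.0 <= d <= 1.0 for d in disability_prevalence):
        raise ValueError("prevalence values must lie in [0, 1]")
    healthy_years = sum(
        years * (1.0 - prev)
        for years, prev in zip(person_years, disability_prevalence)
    )
    return healthy_years / survivors_at_start


# Hypothetical toy life table with three broad age intervals (assumed values).
toy_person_years = [4_900_000, 2_400_000, 500_000]
toy_prevalence = [0.05, 0.10, 0.30]
toy_survivors = 100_000

le0 = life_expectancy(toy_person_years, toy_survivors)
hale0 = sullivan_hale(toy_person_years, toy_prevalence, toy_survivors)
# The gap between the two is the expected years lost due to disability (DLE).
print(round(le0, 2), round(hale0, 2), round(le0 - hale0, 2))
```

In this form the prevalence estimates are the only input that comes from the EMR processing step; the person-years and survivor counts come from an ordinary life table.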
Affiliation(s)
- Xiaowen Ruan
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Yue Li
- China Population and Development Research Center, 12 Dahuisi Road, Haidian District, Beijing 100801, China
| | - Xiaohui Jin
- Ping An Technology (Shenzhen) Co., Ltd., No. 316, Laoshan Road, Pudong New District, Shanghai 200122, China
| | - Pan Deng
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Jiaying Xu
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Na Li
- Ping An Technology (Shenzhen) Co., Ltd., Ping An International Finance Centre, No. 3, South Xinyuan Road, Chaoyang District, Beijing 100011, China
| | - Xian Li
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Yuqi Liu
- Ping An Technology (Shenzhen) Co., Ltd., Ping An International Finance Centre, No. 3, South Xinyuan Road, Chaoyang District, Beijing 100011, China
| | - Yiyi Hu
- Ping An Technology (Shenzhen) Co., Ltd., No. 316, Laoshan Road, Pudong New District, Shanghai 200122, China
| | - Jingwen Xie
- Ping An Technology (Shenzhen) Co., Ltd., No. 316, Laoshan Road, Pudong New District, Shanghai 200122, China
| | - Yingnan Wu
- Ping An Technology (Shenzhen) Co., Ltd., Ping An International Finance Centre, No. 3, South Xinyuan Road, Chaoyang District, Beijing 100011, China
| | - Dongyan Long
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Wen He
- Ping An Technology (Shenzhen) Co., Ltd., Ping An International Finance Centre, No. 3, South Xinyuan Road, Chaoyang District, Beijing 100011, China
| | - Dongsheng Yuan
- Ping An Technology (Shenzhen) Co., Ltd., No. 316, Laoshan Road, Pudong New District, Shanghai 200122, China
| | - Yifei Guo
- Ping An Technology (Shenzhen) Co., Ltd., No. 316, Laoshan Road, Pudong New District, Shanghai 200122, China
| | - Heng Li
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - He Huang
- Chongqing Municipal Health Commission, No. 232 Renmin Road, Yuzhong District, Chongqing 400015, China
| | - Shan Yang
- Chongqing Municipal Health Commission, No. 232 Renmin Road, Yuzhong District, Chongqing 400015, China
| | - Mei Han
- Ping An Technology (Shenzhen) Co., Ltd., Ping An Tech, US Research Lab, Suite 150, 3000 El Camino Real, Palo Alto, CA 94306, United States
| | - Bojin Zhuang
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Jiang Qian
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Zhenjie Cao
- Ping An Technology (Shenzhen) Co., Ltd., Ping An Tech, US Research Lab, Suite 150, 3000 El Camino Real, Palo Alto, CA 94306, United States
| | - Xuying Zhang
- China Population and Development Research Center, 12 Dahuisi Road, Haidian District, Beijing 100801, China
| | - Jing Xiao
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
| | - Liang Xu
- Ping An Technology (Shenzhen) Co., Ltd., Ping'an International Financial Center, Futian District, Shenzhen 518001, China
|
47
|
Ahuja Y, Zhou D, He Z, Sun J, Castro VM, Gainer V, Murphy SN, Hong C, Cai T. sureLDA: A multidisease automated phenotyping method for the electronic health record. J Am Med Inform Assoc 2021; 27:1235-1243. [PMID: 32548637 DOI: 10.1093/jamia/ocaa079] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 08/08/2019] [Revised: 03/12/2020] [Accepted: 04/28/2020] [Indexed: 01/20/2023]
Abstract
OBJECTIVE A major bottleneck hindering utilization of electronic health record data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though International Classification of Diseases codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes. MATERIALS AND METHODS Surrogate-guided ensemble latent Dirichlet allocation (sureLDA) is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on 2 surrogate features for each target phenotype, and then leverages these probabilities to constrain the LDA topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via clustering ensemble to yield final phenotype probabilities. RESULTS sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and relative informativeness of surrogate vs nonsurrogate features. It also exhibits powerful feature selection properties. DISCUSSION sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA's feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes. CONCLUSIONS sureLDA is well suited toward large-scale electronic health record phenotyping for highly multiphenotype applications such as phenome-wide association studies.
Affiliation(s)
- Yuri Ahuja
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA
| | - Doudou Zhou
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Statistics, University of California, Davis, Davis, California, USA
| | - Zeling He
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
| | - Jiehuan Sun
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Massachusetts Veterans Epidemiology Research and Information Center, VA Boston Healthcare System, Boston, Massachusetts, USA
| | | | - Vivian Gainer
- Partners HealthCare, Charlestown, Massachusetts, USA
| | - Shawn N Murphy
- Harvard Medical School, Boston, Massachusetts, USA; Partners HealthCare, Charlestown, Massachusetts, USA
| | - Chuan Hong
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Harvard Medical School, Boston, Massachusetts, USA; Massachusetts Veterans Epidemiology Research and Information Center, VA Boston Healthcare System, Boston, Massachusetts, USA
|
48
|
Estiri H, Vasey S, Murphy SN. Generative transfer learning for measuring plausibility of EHR diagnosis records. J Am Med Inform Assoc 2021; 28:559-568. [PMID: 33043366 PMCID: PMC7936395 DOI: 10.1093/jamia/ocaa215] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/02/2020] [Accepted: 08/18/2020] [Indexed: 12/12/2022]
Abstract
OBJECTIVE Due to a complex set of processes involved with the recording of health information in the Electronic Health Records (EHRs), the truthfulness of EHR diagnosis records is questionable. We present a computational approach to estimate the probability that a single diagnosis record in the EHR reflects the true disease. MATERIALS AND METHODS Using EHR data on 18 diseases from the Mass General Brigham (MGB) Biobank, we develop generative classifiers on a small set of disease-agnostic features from EHRs that aim to represent Patients, pRoviders, and their Interactions within the healthcare SysteM (PRISM features). RESULTS We demonstrate that PRISM features and the generative PRISM classifiers are potent for estimating disease probabilities and exhibit generalizable and transferable distributional characteristics across diseases and patient populations. The joint probabilities we learn about diseases through the PRISM features via PRISM generative models are transferable and generalizable to multiple diseases. DISCUSSION The Generative Transfer Learning (GTL) approach with PRISM classifiers enables the scalable validation of computable phenotypes in EHRs without the need for domain-specific knowledge about specific disease processes. CONCLUSION Probabilities computed from the generative PRISM classifier can enhance and accelerate applied Machine Learning research and discoveries with EHR data.
Affiliation(s)
- Hossein Estiri
- Harvard Medical School, Boston, Massachusetts, USA
- Massachusetts General Hospital, Boston, Massachusetts, USA
- Mass General Brigham, Boston, Massachusetts, USA
| | - Sebastien Vasey
- Department of Mathematics, Harvard University, Cambridge, Massachusetts, USA
| | - Shawn N Murphy
- Harvard Medical School, Boston, Massachusetts, USA
- Massachusetts General Hospital, Boston, Massachusetts, USA
- Mass General Brigham, Boston, Massachusetts, USA
|
49
|
Li R, Chen Y, Moore JH. Integration of genetic and clinical information to improve imputation of data missing from electronic health records. J Am Med Inform Assoc 2021; 26:1056-1063. [PMID: 31329892 DOI: 10.1093/jamia/ocz041] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 06/30/2018] [Revised: 03/12/2019] [Accepted: 03/18/2019] [Indexed: 01/29/2023]
Abstract
OBJECTIVE Clinical data of patients' measurements and treatment history stored in electronic health record (EHR) systems are starting to be mined for better treatment options and disease associations. A primary challenge associated with utilizing EHR data is the considerable amount of missing data. Failure to address this issue can introduce significant bias in EHR-based research. Currently, imputation methods rely on correlations among the structured phenotype variables in the EHR. However, genetic studies have shown that many EHR-based phenotypes have a heritable component, suggesting that measured genetic variants might be useful for imputing missing data. In this article, we developed a computational model that incorporates patients' genetic information to perform EHR data imputation. MATERIALS AND METHODS We used the individual single nucleotide polymorphism's association with phenotype variables in the EHR as input to construct a genetic risk score that quantifies the genetic contribution to the phenotype. Multiple approaches to constructing the genetic risk score were evaluated for optimal performance. The genetic score, along with phenotype correlation, is then used as a predictor to impute the missing values. RESULTS To demonstrate the method performance, we applied our model to impute missing cardiovascular related measurements including low-density lipoprotein, heart failure, and aortic aneurysm disease in the electronic Medical Records and Genomics data. The integration method improved imputation's area-under-the-curve for binary phenotypes and decreased root-mean-square error for continuous phenotypes. CONCLUSION Compared with standard imputation approaches, incorporating genetic information offers a novel approach that can utilize more of the EHR data for better performance in missing data imputation.
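The abstract above describes building a genetic risk score from per-variant associations and using it as a predictor for missing values. The sketch below is a deliberately simplified illustration of that idea, not the authors' model: the score is a plain effect-size-weighted sum of allele dosages, and imputation uses a one-predictor least-squares line fitted on the observed values (the paper also uses phenotype correlations, which are omitted here). All names and numbers are hypothetical.

```python
from math import isnan
from typing import List, Sequence


def genetic_risk_score(dosages: Sequence[float], effect_sizes: Sequence[float]) -> float:
    """Weighted sum of allele dosages (0, 1, or 2 copies) using per-variant effect sizes."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is needed per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


def impute_from_score(values: Sequence[float], scores: Sequence[float]) -> List[float]:
    """Fill NaN entries of `values` using a simple linear fit of value on score.

    Simplifying assumption: the score is the only predictor.
    """
    observed = [(s, v) for s, v in zip(scores, values) if not isnan(v)]
    if len(observed) < 2:
        raise ValueError("at least two observed values are needed to fit a line")
    mean_s = sum(s for s, _ in observed) / len(observed)
    mean_v = sum(v for _, v in observed) / len(observed)
    var_s = sum((s - mean_s) ** 2 for s, _ in observed)
    if var_s == 0:
        raise ValueError("scores of observed individuals must not all be equal")
    slope = sum((s - mean_s) * (v - mean_v) for s, v in observed) / var_s
    intercept = mean_v - slope * mean_s
    # Keep observed values unchanged; predict only where the value is missing.
    return [intercept + slope * s if isnan(v) else v for s, v in zip(scores, values)]


# Hypothetical toy data: four individuals, three variants, one missing measurement.
toy_effects = [0.1, 0.2, 0.3]
toy_genotypes = [[0, 1, 2], [2, 0, 1], [1, 1, 0], [2, 2, 2]]
toy_scores = [genetic_risk_score(g, toy_effects) for g in toy_genotypes]
toy_ldl = [140.0, 125.0, float("nan"), 160.0]
print(impute_from_score(toy_ldl, toy_scores))
```

For a binary phenotype the same score could feed a logistic model instead of a line; that variant is not shown here.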
Affiliation(s)
- Ruowang Li
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Yong Chen
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Center for Evidence-based Practice, The University of Pennsylvania, Philadelphia, Pennsylvania, USA; Applied Mathematics & Computational Science, Penn Arts & Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Jason H Moore
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
|
50
|
Liao KP, Sun J, Cai TA, Link N, Hong C, Huang J, Huffman JE, Gronsbell J, Zhang Y, Ho YL, Castro V, Gainer V, Murphy SN, O'Donnell CJ, Gaziano JM, Cho K, Szolovits P, Kohane IS, Yu S, Cai T. High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. J Am Med Inform Assoc 2021; 26:1255-1262. [PMID: 31613361 DOI: 10.1093/jamia/ocz066] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 12/11/2018] [Revised: 04/08/2019] [Accepted: 04/26/2019] [Indexed: 01/01/2023]
Abstract
OBJECTIVE Electronic health records linked with biorepositories are a powerful platform for translational studies. A major bottleneck exists in the ability to phenotype patients accurately and efficiently. The objective of this study was to develop an automated high-throughput phenotyping method integrating International Classification of Diseases (ICD) codes and narrative data extracted using natural language processing (NLP). MATERIALS AND METHODS We developed a mapping method for automatically identifying relevant ICD and NLP concepts for a specific phenotype leveraging the Unified Medical Language System. Along with health care utilization, aggregated ICD and NLP counts were jointly analyzed by fitting an ensemble of latent mixture models. The multimodal automated phenotyping (MAP) algorithm yields a predicted probability of phenotype for each patient and a threshold for classifying participants with phenotype yes/no. The algorithm was validated using labeled data for 16 phenotypes from a biorepository and further tested in an independent cohort phenome-wide association studies (PheWAS) for 2 single nucleotide polymorphisms with known associations. RESULTS The MAP algorithm achieved higher or similar AUC and F-scores compared to the ICD code across all 16 phenotypes. The features assembled via the automated approach had comparable accuracy to those assembled via manual curation (AUC_MAP 0.943, AUC_manual 0.941). The PheWAS results suggest that the MAP approach detected previously validated associations with higher power when compared to the standard PheWAS method based on ICD codes. CONCLUSION The MAP approach increased the accuracy of phenotype definition while maintaining scalability, thereby facilitating use in studies requiring large-scale phenotyping, such as PheWAS.
Affiliation(s)
- Katherine P Liao
- Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
| | - Jiehuan Sun
- Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Tianrun A Cai
- Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
| | - Nicholas Link
- Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
| | - Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Jie Huang
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | | | | | - Yichi Zhang
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; University of Rhode Island, Kingston, RI, USA
| | - Yuk-Lam Ho
- Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
| | | | | | - Shawn N Murphy
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Partners Healthcare Systems, Summerville, MA, USA; Massachusetts General Hospital, Boston, MA, USA
| | - Christopher J O'Donnell
- Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
| | - J Michael Gaziano
- Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
| | - Kelly Cho
- Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA
| | - Peter Szolovits
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Isaac S Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Sheng Yu
- Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China
| | - Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
|