1. Løkhammer S, Koller D, Wendt FR, Choi KW, He J, Friligkou E, Overstreet C, Gelernter J, Hellard SL, Polimanti R. Distinguishing vulnerability and resilience to posttraumatic stress disorder evaluating traumatic experiences, genetic risk and electronic health records. Psychiatry Res 2024; 337:115950. [PMID: 38744179; PMCID: PMC11156529; DOI: 10.1016/j.psychres.2024.115950] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Revised: 04/29/2024] [Accepted: 05/04/2024] [Indexed: 05/16/2024]
Abstract
What distinguishes vulnerability and resilience to posttraumatic stress disorder (PTSD) remains unclear. Leveraging traumatic experiences reporting, genetic data, and electronic health records (EHR), we investigated and predicted the clinical comorbidities (co-phenome) of PTSD vulnerability and resilience in the UK Biobank (UKB) and All of Us Research Program (AoU), respectively. In 60,354 trauma-exposed UKB participants, we defined PTSD vulnerability and resilience considering PTSD symptoms, trauma burden, and polygenic risk scores. EHR-based phenome-wide association studies (PheWAS) were conducted to dissect the co-phenomes of PTSD vulnerability and resilience. Significant diagnostic endpoints were applied as weights, yielding a phenotypic risk score (PheRS) to conduct PheWAS of PTSD vulnerability and resilience PheRS in up to 95,761 AoU participants. EHR-based PheWAS revealed three significant phenotypes positively associated with PTSD vulnerability (top association "Sleep disorders") and five outcomes inversely associated with PTSD resilience (top association "Irritable Bowel Syndrome"). In the AoU cohort, PheRS analysis showed a partial inverse relationship between vulnerability and resilience with distinct comorbid associations. While PheRS-vulnerability associations were linked to multiple phenotypes, PheRS-resilience showed inverse relationships with eye conditions. Our study unveils phenotypic differences in PTSD vulnerability and resilience, highlighting that these concepts are not simply the absence and presence of PTSD.
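The abstract describes the PheRS only as a score in which significant diagnostic endpoints are "applied as weights". A minimal sketch of one common reading, a weighted sum over the endpoints present in a participant's record, might look as follows. The function name, the endpoint labels other than "sleep_disorders", and all weight values are hypothetical and are not taken from the study; the exact weighting scheme is an assumption.

```python
from typing import Dict, Set


def phenotypic_risk_score(diagnoses: Set[str], weights: Dict[str, float]) -> float:
    """Sum the weights of all weighted endpoints found in one participant's record.

    Assumption: the score is a plain weighted sum; the abstract does not
    spell out the exact formula.
    """
    return sum(w for endpoint, w in weights.items() if endpoint in diagnoses)


# Hypothetical weights (illustrative values only, not results from the paper).
example_weights = {"sleep_disorders": 0.30, "endpoint_b": 0.20, "endpoint_c": 0.10}

# One participant's recorded diagnoses; codes without a weight contribute nothing.
participant = {"sleep_disorders", "endpoint_c", "unrelated_code"}
score = phenotypic_risk_score(participant, example_weights)
print(round(score, 2))
```

Endpoints with negative weights (inverse associations) would lower the score under the same sum, which is one way a resilience score could be built from inversely associated outcomes.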
Affiliation(s)
- Solveig Løkhammer
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
  - Department of Clinical Science, University of Bergen, Bergen, Norway
  - Dr. Einar Martens Research Group for Biological Psychiatry, Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
- Dora Koller
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
  - Department of Genetics, Microbiology, and Statistics, Faculty of Biology, University of Barcelona, Catalonia, Spain
- Frank R. Wendt
  - Department of Anthropology, University of Toronto, Mississauga, Canada
  - Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Karmel W. Choi
  - Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Jun He
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
  - Veterans Affairs Connecticut Healthcare Center, West Haven, Connecticut, USA
- Eleni Friligkou
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
  - Veterans Affairs Connecticut Healthcare Center, West Haven, Connecticut, USA
- Cassie Overstreet
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
  - Veterans Affairs Connecticut Healthcare Center, West Haven, Connecticut, USA
- Joel Gelernter
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
  - Veterans Affairs Connecticut Healthcare Center, West Haven, Connecticut, USA
  - Department of Genetics, Yale School of Medicine, New Haven, Connecticut, USA
  - Department of Neuroscience, Yale School of Medicine, New Haven, Connecticut, USA
  - Wu Tsai Institute, Yale University, New Haven, Connecticut, USA
- Stéphanie Le Hellard
  - Department of Clinical Science, University of Bergen, Bergen, Norway
  - Dr. Einar Martens Research Group for Biological Psychiatry, Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
  - Bergen Center of Brain Plasticity, Haukeland University Hospital, Bergen, Norway
- Renato Polimanti
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, USA
  - Veterans Affairs Connecticut Healthcare Center, West Haven, Connecticut, USA
  - Wu Tsai Institute, Yale University, New Haven, Connecticut, USA
  - Department of Chronic Disease Epidemiology, Yale School of Public Health, New Haven, Connecticut, USA
2. Steinfeldt J, Wild B, Buergel T, Pietzner M, Upmeier Zu Belzen J, Vauvelle A, Hegselmann S, Denaxas S, Hemingway H, Langenberg C, Landmesser U, Deanfield J, Eils R. Medical history predicts phenome-wide disease onset and enables the rapid response to emerging health threats. Nat Commun 2024; 15:4257. [PMID: 38763986; PMCID: PMC11102902; DOI: 10.1038/s41467-024-48568-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 05/03/2024] [Indexed: 05/21/2024]
Abstract
The COVID-19 pandemic exposed a global deficiency of systematic, data-driven guidance to identify high-risk individuals. Here, we illustrate the utility of routinely recorded medical history to predict the risk for 1883 diseases across clinical specialties and support the rapid response to emerging health threats such as COVID-19. We developed a neural network to learn from health records of 502,460 UK Biobank participants. Importantly, we observed discriminative improvements over basic demographic predictors for 1774 (94.3%) endpoints. After transferring the unmodified risk models to the All of Us cohort, we replicated these improvements for 1347 (89.8%) of 1500 investigated endpoints, demonstrating generalizability across healthcare systems and historically underrepresented groups. Ultimately, we showed how this approach could have been used to identify individuals vulnerable to severe COVID-19. Our study demonstrates the potential of medical history to support guidance for emerging pandemics by systematically estimating risk for thousands of diseases at once at minimal cost.
Affiliation(s)
- Jakob Steinfeldt
  - Department of Cardiology, Angiology and Intensive Care Medicine, Deutsches Herzzentrum der Charité (DHZC), Berlin, Germany
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Klinik/Centrum, Charitéplatz 1, 10117, Berlin, Germany
  - Computational Medicine, Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
  - Friede Springer Cardiovascular Prevention Center@Charite, Charite - University Medicine Berlin, Berlin, Germany
  - Institute of Cardiovascular Sciences, University College London, London, UK
- Benjamin Wild
  - Center for Digital Health, Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
- Thore Buergel
  - Institute of Cardiovascular Sciences, University College London, London, UK
  - Center for Digital Health, Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
- Maik Pietzner
  - Computational Medicine, Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
  - MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge, Cambridge, UK
  - Precision Health University Research Institute, Queen Mary University of London and Barts NHS Trust, London, UK
- Julius Upmeier Zu Belzen
  - Center for Digital Health, Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
- Andre Vauvelle
  - Institute of Health Informatics, University College London, London, UK
- Stefan Hegselmann
  - Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Massachusetts, USA
  - Pattern Recognition and Image Analysis Lab, University of Münster, Münster, Germany
- Spiros Denaxas
  - Institute of Health Informatics, University College London, London, UK
  - British Heart Foundation Data Science Centre, London, UK
  - Health Data Research UK, London, UK
  - National Institute for Health Research, Biomedical Research Centre at University College London Hospitals National Institute for Health Research, Biomedical Research Centre, London, UK
- Harry Hemingway
  - Institute of Health Informatics, University College London, London, UK
  - Health Data Research UK, London, UK
  - National Institute for Health Research, Biomedical Research Centre at University College London Hospitals National Institute for Health Research, Biomedical Research Centre, London, UK
- Claudia Langenberg
  - Computational Medicine, Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
  - MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge, Cambridge, UK
  - Precision Health University Research Institute, Queen Mary University of London and Barts NHS Trust, London, UK
- Ulf Landmesser
  - Department of Cardiology, Angiology and Intensive Care Medicine, Deutsches Herzzentrum der Charité (DHZC), Berlin, Germany
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Klinik/Centrum, Charitéplatz 1, 10117, Berlin, Germany
  - Friede Springer Cardiovascular Prevention Center@Charite, Charite - University Medicine Berlin, Berlin, Germany
  - Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
  - DZHK (German Centre for Cardiovascular Research), Partner Site Berlin, Berlin, Berlin, Germany
- John Deanfield
  - Institute of Cardiovascular Sciences, University College London, London, UK
- Roland Eils
  - Center for Digital Health, Berlin Institute of Health (BIH), Charite - University Medicine Berlin, Berlin, Germany
  - Health Data Science Unit, Heidelberg University Hospital and BioQuant, Heidelberg, Germany
3. Elfman J, Goins L, Heller T, Singh S, Wang YH, Li H. Discovery of a polymorphic gene fusion via bottom-up chimeric RNA prediction. Nucleic Acids Res 2024; 52:4409-4421. [PMID: 38587197; PMCID: PMC11077074; DOI: 10.1093/nar/gkae258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2022] [Accepted: 03/27/2024] [Indexed: 04/09/2024]
Abstract
Gene fusions and their chimeric products are commonly linked with cancer. However, recent studies have found chimeric transcripts in non-cancer tissues and cell lines. Large-scale efforts to annotate structural variations have identified gene fusions capable of generating chimeric transcripts even in normal tissues. In this study, we present a bottom-up approach targeting population-specific chimeric RNAs, identifying 58 such instances in the GTEx cohort, including notable cases such as SUZ12P1-CRLF3, TFG-ADGRG7 and TRPM4-PPFIA3, which possess distinct patterns across different ancestry groups. We provide direct evidence for an additional 29 polymorphic chimeric RNAs with associated structural variants, revealing 13 novel rare structural variants. Additionally, we utilize the All of Us dataset and a large cohort of clinical samples to characterize the association of the SUZ12P1-CRLF3-causing variant with patient phenotypes. Our study showcases SUZ12P1-CRLF3 as a representative example, illustrating the identification of elusive structural variants by focusing on those producing population-specific fusion transcripts.
Affiliation(s)
- Justin Elfman
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
- Lynette Goins
  - Department of Biological Sciences, Clemson University, Clemson, SC 29631, USA
- Tessa Heller
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
- Sandeep Singh
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
  - Computational Toxicology Facility, CSIR-Indian Institute of Toxicology Research, Lucknow, 226001, Uttar Pradesh, India
- Yuh-Hwa Wang
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
- Hui Li
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22903, USA
  - Department of Pathology, University of Virginia, Charlottesville, VA 22903, USA
4. Yan C, Ong HH, Grabowska ME, Krantz MS, Su WC, Dickson AL, Peterson JF, Feng Q, Roden DM, Stein CM, Kerchberger VE, Malin BA, Wei WQ. Large language models facilitate the generation of electronic health record phenotyping algorithms. J Am Med Inform Assoc 2024:ocae072. [PMID: 38613820; DOI: 10.1093/jamia/ocae072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 02/21/2024] [Accepted: 03/22/2024] [Indexed: 04/15/2024]
Abstract
OBJECTIVES Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts. MATERIALS AND METHODS We prompted four LLMs (GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (ie, type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network. RESULTS GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values). CONCLUSION GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience is still required to assess and further refine generated algorithms.
Affiliation(s)
- Chao Yan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Henry H Ong
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Monika E Grabowska
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Matthew S Krantz
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Wu-Chen Su
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Alyson L Dickson
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Josh F Peterson
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- QiPing Feng
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Dan M Roden
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- C Michael Stein
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- V Eric Kerchberger
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Bradley A Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
  - Department of Computer Science, Vanderbilt University, Nashville, TN 37203, United States
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
  - Department of Computer Science, Vanderbilt University, Nashville, TN 37203, United States
5. Wei WQ, Rowley R, Wood A, MacArthur J, Embi PJ, Denaxas S. Improving reporting standards for phenotyping algorithm in biomedical research: 5 fundamental dimensions. J Am Med Inform Assoc 2024; 31:1036-1041. [PMID: 38269642; PMCID: PMC10990558; DOI: 10.1093/jamia/ocae005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2023] [Revised: 12/12/2023] [Accepted: 01/08/2024] [Indexed: 01/26/2024]
Abstract
INTRODUCTION Phenotyping algorithms enable the interpretation of complex health data and definition of clinically relevant phenotypes; they have become crucial in biomedical research. However, the lack of standardization and transparency inhibits the cross-comparison of findings among different studies, limits large scale meta-analyses, confuses the research community, and prevents the reuse of algorithms, which results in duplication of efforts and the waste of valuable resources. RECOMMENDATIONS Here, we propose five independent fundamental dimensions of phenotyping algorithms (complexity, performance, efficiency, implementability, and maintenance) through which researchers can describe, measure, and deploy any algorithms efficiently and effectively. These dimensions must be considered in the context of explicit use cases and transparent methods to ensure that they do not reflect unexpected biases or exacerbate inequities.
Affiliation(s)
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Robb Rowley
  - National Human Genome Research Institute, Bethesda, MD 20892, United States
- Angela Wood
  - Department of Public Health and Primary Care, University of Cambridge, Cambridge, CB2 1TN, United Kingdom
- Jacqueline MacArthur
  - British Heart Foundation Data Science Center, Health Data Research, London, NW1 2BE, United Kingdom
- Peter J Embi
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Spiros Denaxas
  - British Heart Foundation Data Science Center, Health Data Research, London, NW1 2BE, United Kingdom
  - Institute of Health Informatics, University College London, London, WC1E 6BT, United Kingdom
6. Yan C, Ong HH, Grabowska ME, Krantz MS, Su WC, Dickson AL, Peterson JF, Feng Q, Roden DM, Stein CM, Kerchberger VE, Malin BA, Wei WQ. Large Language Models Facilitate the Generation of Electronic Health Record Phenotyping Algorithms. medRxiv 2024:2023.12.19.23300230. [PMID: 38196578; PMCID: PMC10775330; DOI: 10.1101/2023.12.19.23300230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2024]
Abstract
Objectives Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts. Materials and Methods We prompted four LLMs (GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (i.e., type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network. Results GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values). Conclusion GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience is still required to assess and further refine generated algorithms.
Affiliation(s)
- Chao Yan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- Henry H. Ong
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- Monika E. Grabowska
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- Matthew S. Krantz
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- Wu-Chen Su
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- Alyson L. Dickson
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Josh F. Peterson
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- QiPing Feng
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Dan M. Roden
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- C. Michael Stein
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- V. Eric Kerchberger
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Bradley A. Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
  - Department of Computer Science, Vanderbilt University, Nashville, TN
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
  - Department of Computer Science, Vanderbilt University, Nashville, TN
7. Smith JC, Williamson BD, Cronkite DJ, Park D, Whitaker JM, McLemore MF, Osmanski JT, Winter R, Ramaprasan A, Kelley A, Shea M, Wittayanukorn S, Stojanovic D, Zhao Y, Toh S, Johnson KB, Aronoff DM, Carrell DS. Data-driven automated classification algorithms for acute health conditions: applying PheNorm to COVID-19 disease. J Am Med Inform Assoc 2024; 31:574-582. [PMID: 38109888; PMCID: PMC10873852; DOI: 10.1093/jamia/ocad241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 10/19/2023] [Accepted: 11/27/2023] [Indexed: 12/20/2023]
Abstract
OBJECTIVES Automated phenotyping algorithms can reduce development time and operator dependence compared to manually developed algorithms. One such approach, PheNorm, has performed well for identifying chronic health conditions, but its performance for acute conditions is largely unknown. Herein, we implement and evaluate PheNorm applied to symptomatic COVID-19 disease to investigate its potential feasibility for rapid phenotyping of acute health conditions. MATERIALS AND METHODS PheNorm is a general-purpose automated approach to creating computable phenotype algorithms based on natural language processing, machine learning, and (low cost) silver-standard training labels. We applied PheNorm to cohorts of potential COVID-19 patients from 2 institutions and used gold-standard manual chart review data to investigate the impact on performance of alternative feature engineering options and implementing externally trained models without local retraining. RESULTS Models at each institution achieved AUC, sensitivity, and positive predictive value of 0.853, 0.879, 0.851 and 0.804, 0.976, and 0.885, respectively, at quantiles of model-predicted risk that maximize F1. We report performance metrics for all combinations of silver labels, feature engineering options, and models trained internally versus externally. DISCUSSION Phenotyping algorithms developed using PheNorm performed well at both institutions. Performance varied with different silver-standard labels and feature engineering options. Models developed locally at one site also worked well when implemented externally at the other site. CONCLUSION PheNorm models successfully identified an acute health condition, symptomatic COVID-19. The simplicity of the PheNorm approach allows it to be applied at multiple study sites with substantially reduced overhead compared to traditional approaches.
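The results above are reported "at quantiles of model-predicted risk that maximize F1". A minimal sketch of that kind of cut-off selection is given below, assuming each distinct predicted score is tried as a candidate threshold; the function names and the toy scores and labels are hypothetical and are not data from the study.

```python
from typing import List, Tuple


def f1_at_threshold(scores: List[float], labels: List[int], threshold: float) -> float:
    """F1 when every record scoring at or above the threshold is called a case."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, F1) for the candidate cut-off with the highest F1.

    Assumption: candidates are the distinct observed scores, i.e. the
    empirical quantiles of the predicted risk.
    """
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
    return best, f1_at_threshold(scores, labels, best)


# Toy predicted risks and gold-standard labels (illustrative only).
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
threshold, f1 = best_f1_threshold(toy_scores, toy_labels)
print(threshold, round(f1, 3))
```

Sensitivity and positive predictive value would then be read off at the chosen threshold, which is why a single cut-off yields the paired figures quoted per institution.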
Affiliation(s)
- Joshua C Smith
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Brian D Williamson
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- David J Cronkite
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Daniel Park
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Jill M Whitaker
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Michael F McLemore
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Joshua T Osmanski
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Robert Winter
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Arvind Ramaprasan
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Ann Kelley
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Mary Shea
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
- Saranrat Wittayanukorn
  - Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States
- Danijela Stojanovic
  - Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States
- Yueqin Zhao
  - Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20903, United States
- Sengwee Toh
  - Harvard Pilgrim Health Care Institute, Boston, MA 02215, United States
- Kevin B Johnson
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
- David M Aronoff
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, United States
- David S Carrell
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States
8. Wan NC, Yaqoob AA, Ong HH, Zhao J, Wei WQ. Evaluating resources composing the PheMAP knowledge base to enhance high-throughput phenotyping. J Am Med Inform Assoc 2023; 30:456-465. [PMID: 36451277; PMCID: PMC9933070; DOI: 10.1093/jamia/ocac234] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/24/2022] [Revised: 10/28/2022] [Accepted: 11/23/2022] [Indexed: 12/02/2022]
Abstract
OBJECTIVE A previous study, PheMAP, combined independent, online resources to enable high-throughput phenotyping (HTP) using electronic health records (EHRs). However, online resources offer distinct quality descriptions of diseases which may affect phenotyping performance. We aimed to evaluate the phenotyping performance of single resource-based PheMAPs and investigate an optimized strategy for HTP. MATERIALS AND METHODS We compared how each resource produced top-ranked concept unique identifiers (CUIs) by term frequency-inverse document frequency with Jaccard matrices comparing single resources and the original PheMAP. We correlated top-ranked concepts from each resource to features used in established Phenotype KnowledgeBase (PheKB) algorithms for hypothyroidism, type II diabetes mellitus (T2DM), and dementias. Using resources separately, we calculated multiple phenotype risk scores for individuals from Vanderbilt University Medical Center's BioVU DNA Biobank and compared phenotyping performance against rule-based eMERGE algorithms. Lastly, we implemented an ensemble strategy which classified patient case/control status based upon PheMAP resource agreement. RESULTS Jaccard similarity matrices indicate that the similarity of CUIs comprising single resource-based PheMAPs varies. Single resource-based PheMAPs generated from MedlinePlus and MedicineNet outperformed others but only encompass 81.6% of overall disease phenotypes. We propose the PheMAP-Ensemble which provides higher average accuracy and precision than the combined average accuracy and precision of single resource-based PheMAPs. While offering complete phenotype coverage, PheMAP-Ensemble significantly increases phenotyping recall compared to the original iteration. CONCLUSIONS Resources comprising the PheMAP produce different phenotyping performance when implemented individually. The ensemble method significantly improves the quality of PheMAP by fully utilizing dissimilar resources to capture accurate phenotyping data from EHRs.
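Two steps in this abstract lend themselves to a short illustration: comparing the top-ranked CUI sets of two resources with Jaccard similarity, and classifying a patient from the agreement of several single-resource calls. The sketch below is only an assumption about how such steps could be written; the CUI strings are placeholders, and the simple majority rule stands in for the agreement rule, which the abstract does not specify.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def majority_vote(calls: List[bool]) -> bool:
    """Label a patient a case when more than half of the resources call them a case.

    Assumption: simple majority; the published ensemble may use another rule.
    """
    return sum(calls) > len(calls) / 2


# Placeholder top-ranked CUI sets for two hypothetical resources.
resource_a = {"CUI_1", "CUI_2", "CUI_3"}
resource_b = {"CUI_2", "CUI_3", "CUI_4"}
similarity = jaccard(resource_a, resource_b)

# Case/control calls from three hypothetical single-resource phenotypers.
is_case = majority_vote([True, True, False])
print(similarity, is_case)
```

Computing `jaccard` for every pair of resources fills the similarity matrix the abstract refers to; low pairwise values are what make an agreement-based ensemble informative.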
Affiliation(s)
- Nicholas C Wan
  - Department of Biomedical Engineering, Vanderbilt University, Nashville, Tennessee, USA
- Ali A Yaqoob
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Henry H Ong
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Juan Zhao
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
9
|
Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023; 30:367-381. [PMID: 36413056 PMCID: PMC9846699 DOI: 10.1093/jamia/ocac216] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 07/28/2022] [Revised: 09/27/2022] [Accepted: 10/27/2022] [Indexed: 11/23/2022]
Abstract
OBJECTIVE Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. MATERIALS AND METHODS We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. RESULTS Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. DISCUSSION Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. CONCLUSION Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Ellen Stephenson
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Karen Tu
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
|
10
|
Meegan JE, Kerchberger VE, Fortune NL, McNeil JB, Bastarache JA, Austin ED, Ware LB, Hemnes AR, Brittain EL. Transpulmonary generation of cell-free hemoglobin contributes to vascular dysfunction in pulmonary arterial hypertension via dysregulated clearance mechanisms. Pulm Circ 2023; 13:e12185. [PMID: 36743426 PMCID: PMC9841468 DOI: 10.1002/pul2.12185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Revised: 12/12/2022] [Accepted: 01/03/2023] [Indexed: 01/07/2023]
Abstract
Circulating cell-free hemoglobin (CFH) is elevated in pulmonary arterial hypertension (PAH) and associated with poor outcomes but the mechanisms are unknown. We hypothesized that CFH is generated from the pulmonary circulation and inadequately cleared in PAH. Transpulmonary CFH (difference between wedge and pulmonary artery positions) and lung hemoglobin α were analyzed in patients with PAH and healthy controls. Haptoglobin genotype and plasma hemoglobin processing proteins were analyzed in patients with PAH, unaffected bone morphogenetic protein receptor type II mutation carriers (UMCs), and control subjects. Transpulmonary CFH was increased in patients with PAH (p = 0.04) and correlated with pulmonary vascular resistance (PVR) (r_s = 0.75, p = 0.02) and mean pulmonary arterial pressure (mPAP) (r_s = 0.78, p = 0.02). Pulmonary vascular hemoglobin α protein was increased in patients with PAH (p = 0.006), especially in occluded vessels (p = 0.04). Haptoglobin genotype did not differ between groups. Plasma haptoglobin was higher in UMCs compared with both control subjects (p = 0.03) and patients with HPAH (p < 0.0001); patients with IPAH had higher circulating haptoglobin levels than patients with HPAH (p = 0.006). Notably, circulating CFH to haptoglobin ratio was elevated in patients with HPAH compared to control subjects (p = 0.02) and UMCs (p = 0.006). Moreover, in patients with PAH, CFH:haptoglobin correlated with PVR (r_s = 0.37, p = 0.0004) and mPAP (r_s = 0.25, p = 0.02). Broad alterations in other plasma hemoglobin processing proteins (hemopexin, heme oxygenase-1, and sCD163) were observed. In conclusion, pulmonary vascular CFH is associated with increased PVR and mPAP in PAH and dysregulated CFH clearance may contribute to PAH pathology. Further study is needed to determine whether targeting CFH is a viable therapeutic for pulmonary vascular dysfunction in PAH.
Affiliation(s)
- Jamie E. Meegan
  - Department of Medicine, Division of Allergy, Pulmonary and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Vern Eric Kerchberger
  - Department of Medicine, Division of Allergy, Pulmonary and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Niki L. Fortune
  - Department of Medicine, Division of Allergy, Pulmonary and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Joel Brennan McNeil
  - Department of Medicine, Division of Allergy, Pulmonary and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Julie A. Bastarache
  - Department of Medicine, Division of Allergy, Pulmonary and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Cell and Developmental Biology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Eric D. Austin
  - Department of Pediatrics, Division of Allergy, Immunology, and Pulmonary Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Lorraine B. Ware
  - Department of Medicine, Division of Allergy, Pulmonary and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Anna R. Hemnes
  - Department of Medicine, Division of Allergy, Pulmonary and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Vanderbilt Pulmonary Circulation Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Evan L. Brittain
  - Vanderbilt Pulmonary Circulation Center, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Medicine, Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
|
11
|
Barr PB, Bigdeli TB, Meyers JL. Characterizing and Coding Psychiatric Diagnoses Using Electronic Health Record Data-Reply. JAMA Psychiatry 2022; 79:2796414. [PMID: 36103173 DOI: 10.1001/jamapsychiatry.2022.2739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
Affiliation(s)
- Peter B Barr
  - Department of Psychiatry and Behavioral Sciences, SUNY Downstate Health Sciences University, Brooklyn, New York
  - VA New York Harbor Healthcare System, Brooklyn
- Tim B Bigdeli
  - Department of Psychiatry and Behavioral Sciences, SUNY Downstate Health Sciences University, Brooklyn, New York
  - VA New York Harbor Healthcare System, Brooklyn
- Jacquelyn L Meyers
  - Department of Psychiatry and Behavioral Sciences, SUNY Downstate Health Sciences University, Brooklyn, New York
  - VA New York Harbor Healthcare System, Brooklyn
|
12
|
Brandt PS, Pacheco JA, Adekkanattu P, Sholle ET, Abedian S, Stone DJ, Knaack DM, Xu J, Xu Z, Peng Y, Benda NC, Wang F, Luo Y, Jiang G, Pathak J, Rasmussen LV. Design and validation of a FHIR-based EHR-driven phenotyping toolbox. J Am Med Inform Assoc 2022; 29:1449-1460. [PMID: 35799370 PMCID: PMC9382394 DOI: 10.1093/jamia/ocac063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/27/2021] [Revised: 04/04/2022] [Accepted: 06/17/2022] [Indexed: 12/14/2022]
Abstract
OBJECTIVES To develop and validate a standards-based phenotyping tool to author electronic health record (EHR)-based phenotype definitions and demonstrate execution of the definitions against heterogeneous clinical research data platforms. MATERIALS AND METHODS We developed an open-source, standards-compliant phenotyping tool known as the PhEMA Workbench that enables a phenotype representation using the Fast Healthcare Interoperability Resources (FHIR) and Clinical Quality Language (CQL) standards. We then demonstrated how this tool can be used to conduct EHR-based phenotyping, including phenotype authoring, execution, and validation. We validated the performance of the tool by executing a thrombotic event phenotype definition at 3 sites, Mayo Clinic (MC), Northwestern Medicine (NM), and Weill Cornell Medicine (WCM), and used manual review to determine precision and recall. RESULTS An initial version of the PhEMA Workbench has been released, which supports phenotype authoring, execution, and publishing to a shared phenotype definition repository. The resulting thrombotic event phenotype definition consisted of 11 CQL statements, and 24 value sets containing a total of 834 codes. Technical validation showed satisfactory performance (both NM and MC had 100% precision and recall and WCM had a precision of 95% and a recall of 84%). CONCLUSIONS We demonstrate that the PhEMA Workbench can facilitate EHR-driven phenotype definition, execution, and phenotype sharing in heterogeneous clinical research data environments. A phenotype definition that integrates with existing standards-compliant systems, and the use of a formal representation facilitates automation and can decrease potential for human error.
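The precision and recall figures quoted in this abstract come from comparing the phenotype definition's output against manual review. As a rough sketch of that arithmetic, the snippet below computes both metrics from confusion counts. The helper name and the counts are illustrative assumptions only; they are not the study's data.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts from a hypothetical chart review (not taken from the paper):
# 8 algorithm-flagged cases confirmed, 2 flagged but rejected, 2 true cases missed.
precision, recall = precision_recall(tp=8, fp=2, fn=2)
```

With review samples of this size, a single reclassified chart moves either metric by several percentage points, so site-level figures are best read alongside the number of charts reviewed.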
Affiliation(s)
- Pascal S Brandt
  - Corresponding Author: Pascal S. Brandt, Department of Biomedical Informatics & Medical Education, University of Washington, Box 358047, Seattle, WA 98195, USA
- Jennifer A Pacheco
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Prakash Adekkanattu
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Evan T Sholle
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Sajjad Abedian
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Daniel J Stone
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, Minnesota, USA
- David M Knaack
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, Minnesota, USA
- Jie Xu
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Zhenxing Xu
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Yifan Peng
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Natalie C Benda
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Fei Wang
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Yuan Luo
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Guoqian Jiang
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, Minnesota, USA
- Jyotishman Pathak
  - Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, New York, USA
- Luke V Rasmussen
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
|
13
|
Krantz MS, Kerchberger VE, Wei WQ. Novel Analysis Methods to Mine Immune-Mediated Phenotypes and Find Genetic Variation Within the Electronic Health Record (Roadmap for Phenotype to Genotype: Immunogenomics). J Allergy Clin Immunol Pract 2022; 10:1757-1762. [PMID: 35487368 PMCID: PMC9624141 DOI: 10.1016/j.jaip.2022.04.016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/27/2022] [Revised: 04/13/2022] [Accepted: 04/18/2022] [Indexed: 06/14/2023]
Abstract
The field of immunogenomics has the opportunity for accelerated genetic discovery aided by the maturation of electronic health records (EHRs) linked to DNA biobanks. Novel analysis methods in deep phenotyping of EHR data allow the full realization of the paired and increasingly dense genetic/phenotypic information available. This enables researchers to uncover genetic risk factors for the prevention and optimal treatment of immune-mediated diseases and immune-mediated adverse drug reactions. This article reviews the background of EHRs linked to DNA biobanks, potential applications to immunogenomic discovery, and current and emerging techniques in EHR-based deep phenotyping.
Affiliation(s)
- Matthew S Krantz
  - Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tenn
- V Eric Kerchberger
  - Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tenn
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tenn
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tenn
|
14
|
Genetics in chronic kidney disease: conclusions from a Kidney Disease: Improving Global Outcomes (KDIGO) Controversies Conference. Kidney Int 2022; 101:1126-1141. [PMID: 35460632 PMCID: PMC9922534 DOI: 10.1016/j.kint.2022.03.019] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 01/13/2022] [Revised: 03/16/2022] [Accepted: 03/29/2022] [Indexed: 01/19/2023]
Abstract
Numerous genes for monogenic kidney diseases with classical patterns of inheritance, as well as genes for complex kidney diseases that manifest in combination with environmental factors, have been discovered. Genetic findings are increasingly used to inform clinical management of nephropathies, and have led to improved diagnostics, disease surveillance, choice of therapy, and family counseling. All of these steps rely on accurate interpretation of genetic data, which can be outpaced by current rates of data collection. In March of 2021, Kidney Disease: Improving Global Outcomes (KDIGO) held a Controversies Conference on "Genetics in Chronic Kidney Disease (CKD)" to review the current state of understanding of monogenic and complex (polygenic) kidney diseases, processes for applying genetic findings in clinical medicine, and use of genomics for defining and stratifying CKD. Given the important contribution of genetic variants to CKD, practitioners with CKD patients are advised to "think genetic," which specifically involves obtaining a family history, collecting detailed information on age of CKD onset, performing clinical examination for extrarenal symptoms, and considering genetic testing. To improve the use of genetics in nephrology, meeting participants advised developing an advanced training or subspecialty track for nephrologists, crafting guidelines for testing and treatment, and educating patients, students, and practitioners. Key areas of future research, including clinical interpretation of genome variation, electronic phenotyping, global representation, kidney-specific molecular data, polygenic scores, translational epidemiology, and open data resources, were also identified.
|
15
|
Integration of Omics and Phenotypic Data for Precision Medicine. Methods Mol Biol 2022; 2486:19-35. [PMID: 35437716 DOI: 10.1007/978-1-0716-2265-0_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Over the past two decades, biomedical research has been moving toward a big-data-driven approach. The underlying causes of this transition include the ability to gather genetic or molecular profiles of humans faster, the increasing adoption of electronic health record (EHR) systems, and the growing interest in linking omics and phenotypic data for analysis. The integration of individuals' biological data (e.g., genomics, proteomics, metabolomics) and health-care data has created unprecedented opportunities for precision medicine, that is, a medical model that uses a patient's unique information, mainly genetic, to prevent, diagnose, or treat disease. This chapter reviewed the research opportunities and applications of integrating omics and phenotypic data for precision medicine, such as understanding the relationship between genotype and phenotype, disease subtyping, and diagnosis or prediction of adverse outcomes. We reviewed the recent advanced methods, particularly the machine learning and deep learning-based approaches used for harnessing and harmonizing the multiomics and phenotypic data to address these applications. We finally discussed the challenges and future directions.
|
16
|
Denaxas S, Liu G, Feng Q, Fatemifar G, Bastarache L, Kerchberger EV, Hingorani AD, Lumbers T, Peterson JF, Wei WQ, Hemingway H. Mapping the Read2/CTV3 controlled clinical terminologies to Phecodes in UK Biobank primary care electronic health records: implementation and evaluation. AMIA Annu Symp Proc 2022; 2021:362-371. [PMID: 35308936 PMCID: PMC8861677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/30/2023]
Abstract
Objective: To establish and validate mappings between primary care clinical terminologies (Read Version 2, Clinical Terms Version 3) and Phecodes. Methods: We processed 123,662,421 primary care events from 230,096 UK Biobank (UKB) participants. We assessed the validity of the primary care-derived Phecodes by conducting PheWAS analyses for seven pre-selected SNPs in the UKB and compared with estimates from BioVU. Results: We mapped 92% of Read2 (n=10,834) and 91% of CTV3 (n=21,988) to 1,449 and 1,490 Phecodes. UKB PheWAS using Phecodes from primary care EHR and hospitalizations replicated all (n=22) previously-reported genotype-phenotype associations. When limiting Phecodes to primary care EHR, replication was 81% (n=18). Conclusion: We introduced a first version of mappings from Read2/CTV3 to Phecodes. The reference list of diseases provided by Phecodes can be extended, enabling researchers to leverage primary care EHR for high-throughput discovery research.
Affiliation(s)
- Spiros Denaxas
  - University College London, London, UK
  - Health Data Research UK, London, UK
  - BHF Research Accelerator, London, UK
  - The Alan Turing Institute, London, UK
  - NIHR UCLH BRC, London, UK
- Ge Liu
  - Vanderbilt University Medical Center, Nashville, TN, USA
- Qiping Feng
  - Vanderbilt University Medical Center, Nashville, TN, USA
- Ghazaleh Fatemifar
  - Vanderbilt University Medical Center, Nashville, TN, USA
  - Health Data Research UK, London, UK
- Aroon D Hingorani
  - University College London, London, UK
  - Health Data Research UK, London, UK
  - BHF Research Accelerator, London, UK
- Tom Lumbers
  - University College London, London, UK
  - Health Data Research UK, London, UK
- Wei-Qi Wei
  - Vanderbilt University Medical Center, Nashville, TN, USA
- Harry Hemingway
  - University College London, London, UK
  - Health Data Research UK, London, UK
  - BHF Research Accelerator, London, UK
  - NIHR UCLH BRC, London, UK
|
17
|
Maturation and application of phenome-wide association studies. Trends Genet 2022; 38:353-363. [PMID: 34991903 DOI: 10.1016/j.tig.2021.12.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 09/11/2021] [Revised: 11/12/2021] [Accepted: 12/02/2021] [Indexed: 12/12/2022]
Abstract
In the past 10 years since its introduction, phenome-wide association studies (PheWAS) have uncovered novel genotype-phenotype relationships. Along the way, PheWAS have evolved in many aspects as a study design with the expanded availability of large data repositories with genome-wide data linked to detailed phenotypic data. Advancement in methods, including algorithms, software, and publicly available integrated resources, makes it feasible to more fully realize the potential of PheWAS, overcoming the previous computational and analytical limitations. We review here the most recent improvements and notable applications of PheWAS since the second half of the decade from its inception. We also note the challenges that remain embedded along the entire PheWAS analytical pipeline that necessitate further development of tools and resources to further advance the understanding of the complex genetic architecture underlying human diseases and traits.
|
18
|
Association of step counts over time with the risk of chronic disease in the All of Us Research Program. Nat Med 2022; 28:2301-2308. [PMID: 36216933 PMCID: PMC9671804 DOI: 10.1038/s41591-022-02012-w] [Citation(s) in RCA: 50] [Impact Index Per Article: 25.0] [Received: 03/28/2022] [Accepted: 08/15/2022] [Indexed: 01/14/2023]
Abstract
The association between physical activity and human disease has not been examined using commercial devices linked to electronic health records. Using the electronic health records data from the All of Us Research Program, we show that step count volumes as captured by participants' own Fitbit devices were associated with risk of chronic disease across the entire human phenome. Of the 6,042 participants included in the study, 73% were female, 84% were white and 71% had a college degree, and participants had a median age of 56.7 (interquartile range 41.5-67.6) years and body mass index of 28.1 (24.3-32.9) kg m⁻². Participants walked a median of 7,731.3 (5,866.8-9,826.8) steps per day over the median activity monitoring period of 4.0 (2.2-5.6) years with a total of 5.9 million person-days of monitoring. The relationship between steps per day and incident disease was inverse and linear for obesity (n = 368), sleep apnea (n = 348), gastroesophageal reflux disease (n = 432) and major depressive disorder (n = 467), with values above 8,200 daily steps associated with protection from incident disease. The relationships with incident diabetes (n = 156) and hypertension (n = 482) were nonlinear with no further risk reduction above 8,000-9,000 steps. Although validation in a more diverse sample is needed, these findings provide a real-world evidence-base for clinical guidance regarding activity levels that are necessary to reduce disease risk.
|
19
|
Liu X, Chubak J, Hubbard RA, Chen Y. SAT: a Surrogate-Assisted Two-wave case boosting sampling method, with application to EHR-based association studies. J Am Med Inform Assoc 2021; 29:918-927. [PMID: 34962283 PMCID: PMC9714591 DOI: 10.1093/jamia/ocab267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Revised: 10/16/2021] [Accepted: 11/23/2021] [Indexed: 12/30/2022]
Abstract
OBJECTIVES Electronic health records (EHRs) enable investigation of the association between phenotypes and risk factors. However, studies solely relying on potentially error-prone EHR-derived phenotypes (ie, surrogates) are subject to bias. Analyses of low prevalence phenotypes may also suffer from poor efficiency. Existing methods typically focus on one of these issues but seldom address both. This study aims to simultaneously address both issues by developing new sampling methods to select an optimal subsample to collect gold standard phenotypes for improving the accuracy of association estimation. MATERIALS AND METHODS We develop a surrogate-assisted two-wave (SAT) sampling method, where a surrogate-guided sampling (SGS) procedure and a modified optimal subsampling procedure motivated from A-optimality criterion (OSMAC) are employed sequentially, to select a subsample for outcome validation through manual chart review subject to budget constraints. A model is then fitted based on the subsample with the true phenotypes. Simulation studies and an application to an EHR dataset of breast cancer survivors are conducted to demonstrate the effectiveness of SAT. RESULTS We found that the subsample selected with the proposed method contains informative observations that effectively reduce the mean squared error of the resultant estimator of the association. CONCLUSIONS The proposed approach can handle the problem brought by the rarity of cases and misclassification of the surrogate in phenotype-absent EHR-based association studies. With a well-behaved surrogate, SAT successfully boosts the case prevalence in the subsample and improves the efficiency of estimation.
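The general idea behind using a surrogate to boost case prevalence in a validation subsample can be illustrated with plain stratified oversampling on the surrogate label. The sketch below is not the SAT procedure from the paper: it shows only a simple first-wave style draw, omits the OSMAC-based second wave entirely, and the function name, the `pos_fraction` parameter, and the toy data are assumptions made for the example.

```python
import random
from typing import List, Sequence


def surrogate_stratified_sample(
    surrogates: Sequence[int], n: int, pos_fraction: float, seed: int = 0
) -> List[int]:
    """Draw up to `n` record indices, oversampling surrogate-positive records.

    `pos_fraction` is the target share of surrogate-positive records in the
    subsample; each stratum is sampled without replacement.
    """
    rng = random.Random(seed)
    pos = [i for i, s in enumerate(surrogates) if s == 1]
    neg = [i for i, s in enumerate(surrogates) if s == 0]
    n_pos = min(len(pos), round(n * pos_fraction))
    n_neg = min(len(neg), n - n_pos)
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)


# Toy data: 10% of records are surrogate-positive.
surrogates = [1] * 10 + [0] * 90
chosen = surrogate_stratified_sample(surrogates, n=20, pos_fraction=0.5)
subsample_prevalence = sum(surrogates[i] for i in chosen) / len(chosen)
```

Because the subsample is deliberately unrepresentative, any model fitted to it would need to account for the sampling design, for example through inverse-probability weights.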
Affiliation(s)
- Xiaokang Liu
  - Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
- Jessica Chubak
  - Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA
  - Department of Epidemiology, University of Washington, Seattle, Washington, USA
- Rebecca A Hubbard
  - Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
- Yong Chen
  - Corresponding Author: Yong Chen, PhD, Department of Biostatistics, Epidemiology and Informatics, The University of Pennsylvania School of Medicine, 423 Guardian Drive, Philadelphia, PA 19104, USA
|
20
|
Lockshin MD, Crow MK, Barbhaiya M. When a Diagnosis Has No Name: Uncertainty and Opportunity. ACR Open Rheumatol 2021; 4:197-201. [PMID: 34806330 PMCID: PMC8916551 DOI: 10.1002/acr2.11368] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022]
Abstract
Diagnostic uncertainty, commonly encountered in rheumatology and other fields of medicine, is an opportunity: Stakeholders who understand uncertainty's causes and quantitate its effects can reduce uncertainty and can use uncertainty to improve medical practice, science, and administration. To articulate, bring attention to, and offer recommendations for diagnostic uncertainty, the Barbara Volcker Center at the Hospital for Special Surgery sponsored, in April 2021, a virtual international workshop, “When a Diagnosis Has No Name.” This paper summarizes the opinions of 72 stakeholders from the fields of medical research, industry, federal regulatory agencies, insurers, hospital management, medical philosophy, public media, health care law, clinical rheumatology, other specialty areas of medicine, and patients. Speakers addressed the effects of diagnostic uncertainty in their fields. The workshop addressed the following six questions: What is a diagnosis? What are the purposes of diagnoses? How do doctors assign diagnoses? What is uncertainty? What are its causes? How does understanding uncertainty offer opportunities to improve all fields of medicine? The workshop's conveners systematically reviewed video recordings of formal presentations, video recordings of open discussion periods, manuscripts, and slide files submitted by the speakers to develop consensus take‐home messages, which were as follows: Diagnostic uncertainty causes harm when patients lack access to laboratory tests and treatments, do not participate in research studies, are not counted in administrative and public health documents, and suffer humiliation in their interactions with others. Uncertainty offers opportunities, such as quantifying uncertainty, using statistical technologies and automated intelligence to stratify patient groups by level of uncertainty, using a common vocabulary, and considering the effects of time.
Affiliation(s)
- Michael D Lockshin
  - Hospital for Special Surgery, Weill Cornell Medicine, New York, New York
- Mary K Crow
  - Hospital for Special Surgery, Weill Cornell Medicine, New York, New York
- Medha Barbhaiya
  - Hospital for Special Surgery, Weill Cornell Medicine, New York, New York
|
21
|
Zheng NS, Kerchberger VE, Borza VA, Eken HN, Smith JC, Wei WQ. An updated, computable MEDication-Indication resource for biomedical research. Sci Rep 2021; 11:18953. [PMID: 34556781 PMCID: PMC8460636 DOI: 10.1038/s41598-021-98579-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/26/2020] [Accepted: 09/02/2021] [Indexed: 11/09/2022]
Abstract
The MEDication-Indication (MEDI) knowledgebase has been utilized in research with electronic health records (EHRs) since its publication in 2013. To account for new drugs and terminology updates, we rebuilt MEDI to overhaul the knowledgebase for modern EHRs. Indications for prescribable medications were extracted using natural language processing and ontology relationships from six publicly available resources: RxNorm, Side Effect Resource 4.1, Mayo Clinic, WebMD, MedlinePlus, and Wikipedia. We compared the estimated precision and recall between the previous MEDI (MEDI-1) and the updated version (MEDI-2) with manual review. MEDI-2 contains 3031 medications and 186,064 indications. The MEDI-2 high precision subset (HPS) includes indications found within RxNorm or at least three other resources. MEDI-2 and MEDI-2 HPS contain 13% more medications and over triple the indications compared to MEDI-1 and MEDI-1 HPS, respectively. Manual review showed MEDI-2 achieves the same precision (0.60) with better recall (0.89 vs. 0.79) compared to MEDI-1. Likewise, MEDI-2 HPS had the same precision (0.92) and improved recall (0.65 vs. 0.55) than MEDI-1 HPS. The combination of MEDI-1 and MEDI-2 achieved a recall of 0.95. In updating MEDI, we present a more comprehensive medication-indication knowledgebase that can continue to facilitate applications and research with EHRs.
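The high precision subset rule stated in this abstract (an indication qualifies if it is found in RxNorm or in at least three other resources) can be written as a small predicate. This is a sketch of the rule as worded in the abstract, not the MEDI-2 code: the function name, the `min_other` parameter, and the representation of supporting resources as a set of strings are assumptions.

```python
from typing import Set


def in_high_precision_subset(sources: Set[str], min_other: int = 3) -> bool:
    """Return True when a medication-indication pair meets the abstract's HPS rule.

    The pair qualifies if RxNorm supports it, or if at least `min_other`
    resources other than RxNorm support it.
    """
    if "RxNorm" in sources:
        return True
    return len(sources - {"RxNorm"}) >= min_other


# Supporting-resource sets for three hypothetical medication-indication pairs.
rxnorm_only = in_high_precision_subset({"RxNorm"})
two_others = in_high_precision_subset({"WebMD", "Wikipedia"})
three_others = in_high_precision_subset({"WebMD", "Wikipedia", "MedlinePlus"})
```

Requiring agreement from several independent resources is a common way to trade recall for precision, which is consistent with the HPS figures reported above (precision 0.92, recall 0.65).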
Affiliation(s)
- Neil S Zheng
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Yale School of Medicine, New Haven, CT, USA
- V Eric Kerchberger
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- H Nur Eken
- Vanderbilt School of Medicine, Nashville, TN, USA
- Joshua C Smith
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Wei-Qi Wei
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Avenue Suite 1500, Nashville, TN, 37232-6602, USA
22
Chapman M, Mumtaz S, Rasmussen LV, Karwath A, Gkoutos GV, Gao C, Thayer D, Pacheco JA, Parkinson H, Richesson RL, Jefferson E, Denaxas S, Curcin V. Desiderata for the development of next-generation electronic health record phenotype libraries. Gigascience 2021; 10:giab059. [PMID: 34508578 PMCID: PMC8434766 DOI: 10.1093/gigascience/giab059] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/24/2021] [Revised: 07/15/2021] [Accepted: 08/18/2021] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND: High-quality phenotype definitions are desirable to enable the extraction of patient cohorts from large electronic health record repositories and are characterized by properties such as portability, reproducibility, and validity. Phenotype libraries, where definitions are stored, have the potential to contribute significantly to the quality of the definitions they host. In this work, we present a set of desiderata for the design of a next-generation phenotype library that is able to ensure the quality of hosted definitions by combining the functionality currently offered by disparate tooling. METHODS: A group of researchers examined work to date on phenotype models, implementation, and validation, as well as contemporary phenotype libraries developed as a part of their own phenomics communities. Existing phenotype frameworks were also examined. This work was translated and refined by all the authors into a set of best practices. RESULTS: We present 14 library desiderata that promote high-quality phenotype definitions, in the areas of modelling, logging, validation, and sharing and warehousing. CONCLUSIONS: There are a number of choices to be made when constructing phenotype libraries. Our considerations distil the best practices in the field and include pointers towards their further development to support portable, reproducible, and clinically valid phenotype design. The provision of high-quality phenotype definitions enables electronic health record data to be more effectively used in medical domains.
Affiliation(s)
- Martin Chapman
- Department of Population Health Sciences, King's College London, London, SE1 1UL, UK
- Shahzad Mumtaz
- Health Informatics Centre (HIC), University of Dundee, Dundee, DD1 9SY, UK
- Luke V Rasmussen
- Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Andreas Karwath
- Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, UK
- Georgios V Gkoutos
- Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, UK
- Chuang Gao
- Health Informatics Centre (HIC), University of Dundee, Dundee, DD1 9SY, UK
- Dan Thayer
- SAIL Databank, Swansea University, Swansea, SA2 8PP, UK
- Jennifer A Pacheco
- Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, CB10 1SD, UK
- Rachel L Richesson
- Department of Learning Health Sciences, University of Michigan Medical School, MI 48109, USA
- Emily Jefferson
- Health Informatics Centre (HIC), University of Dundee, Dundee, DD1 9SY, UK
- Spiros Denaxas
- Institute of Health Informatics, University College London, London, NW1 2DA, UK
- Vasa Curcin
- Department of Population Health Sciences, King's College London, London, SE1 1UL, UK
23
De Freitas JK, Johnson KW, Golden E, Nadkarni GN, Dudley JT, Bottinger EP, Glicksberg BS, Miotto R. Phe2vec: Automated disease phenotyping based on unsupervised embeddings from electronic health records. Patterns (New York, N.Y.) 2021; 2:100337. [PMID: 34553174 PMCID: PMC8441576 DOI: 10.1016/j.patter.2021.100337] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/29/2021] [Revised: 06/30/2021] [Accepted: 08/05/2021] [Indexed: 11/23/2022]
Abstract
Robust phenotyping of patients from electronic health records (EHRs) at scale is a challenge in clinical informatics. Here, we introduce Phe2vec, an automated framework for disease phenotyping from EHRs based on unsupervised learning, and assess its effectiveness against standard rule-based algorithms from Phenotype KnowledgeBase (PheKB). Phe2vec is based on pre-computing embeddings of medical concepts and patients' clinical history. Disease phenotypes are then derived from a seed concept and its neighbors in the embedding space. Patients are linked to a disease if their embedded representation is close to the disease phenotype. Comparing Phe2vec and PheKB cohorts head-to-head using chart review, Phe2vec performed on par or better in nine out of ten diseases. Unlike other approaches, it can scale to any condition and was validated against widely adopted expert-based standards. Phe2vec aims to optimize clinical informatics research by augmenting current frameworks to characterize patients by condition and derive reliable disease cohorts.
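The idea of deriving a phenotype from a seed concept and its neighbors, then linking patients whose embedding lies close to it, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the two-dimensional vectors, the concept and patient names, the use of cosine similarity and a mean vector, and the threshold value are hypothetical and do not reproduce the Phe2vec implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def phenotype_vector(seed, concept_vecs, k):
    """Average the seed concept with its k nearest concepts (assumed aggregation)."""
    seed_vec = concept_vecs[seed]
    others = [c for c in concept_vecs if c != seed]
    others.sort(key=lambda c: cosine(seed_vec, concept_vecs[c]), reverse=True)
    members = [seed] + others[:k]
    dim = len(seed_vec)
    return [sum(concept_vecs[c][i] for c in members) / len(members) for i in range(dim)]


def link_patients(patient_vecs, phenotype, threshold):
    """Return sorted ids of patients whose embedding is close to the phenotype."""
    return sorted(
        pid for pid, vec in patient_vecs.items() if cosine(vec, phenotype) >= threshold
    )


# Toy, made-up embeddings; real embeddings would be learned from EHR data.
concept_vecs = {"t2dm": [1.0, 0.0], "metformin": [0.9, 0.1], "fracture": [0.0, 1.0]}
patient_vecs = {"p1": [1.0, 0.1], "p2": [0.1, 1.0]}

phenotype = phenotype_vector("t2dm", concept_vecs, k=1)
linked = link_patients(patient_vecs, phenotype, threshold=0.8)
```

With these toy values only `p1` falls within the threshold; in practice the neighborhood size and the cutoff would have to be tuned and validated, for example by chart review as the abstract describes.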
Affiliation(s)
- Jessica K. De Freitas
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Kipp W. Johnson
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Eddye Golden
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Girish N. Nadkarni
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Medicine, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Joel T. Dudley
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Erwin P. Bottinger
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Medicine, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Digital Health Center at Hasso Plattner Institute, University of Potsdam, Professor-Dr.-Helmert-Str 2–3, 14482 Potsdam, Germany
- Benjamin S. Glicksberg
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Riccardo Miotto
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
24
DeLozier S, Bland HT, McPheeters M, Wells Q, Farber-Eger E, Bejan CA, Fabbri D, Rosenbloom T, Roden D, Johnson KB, Wei WQ, Peterson J, Bastarache L. Phenotyping coronavirus disease 2019 during a global health pandemic: Lessons learned from the characterization of an early cohort. J Biomed Inform 2021; 117:103777. [PMID: 33838341 PMCID: PMC8026248 DOI: 10.1016/j.jbi.2021.103777] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/16/2020] [Revised: 02/09/2021] [Accepted: 04/03/2021] [Indexed: 01/08/2023]
Abstract
From the start of the coronavirus disease 2019 (COVID-19) pandemic, researchers have looked to electronic health record (EHR) data as a way to study possible risk factors and outcomes. To ensure the validity and accuracy of research using these data, investigators need to be confident that the phenotypes they construct are reliable and accurate, reflecting the healthcare settings from which they are ascertained. We developed a COVID-19 registry at a single academic medical center and used data from March 1 to June 5, 2020 to assess differences in population-level characteristics in pandemic and non-pandemic years respectively. Median EHR length, previously shown to impact phenotype performance in type 2 diabetes, was significantly shorter in the SARS-CoV-2 positive group relative to a 2019 influenza tested group (median 3.1 years vs 8.7; Wilcoxon rank sum P = 1.3e-52). Using three phenotyping methods of increasing complexity (billing codes alone and domain-specific algorithms provided by an EHR vendor and clinical experts), common medical comorbidities were abstracted from COVID-19 EHRs, defined by the presence of a positive laboratory test (positive predictive value 100%, recall 93%). After combining performance data across phenotyping methods, we observed significantly lower false negative rates for those records billed for a comprehensive care visit (p = 4e-11) and those with complete demographics data recorded (p = 7e-5). In an early COVID-19 cohort, we found that phenotyping performance of nine common comorbidities was influenced by median EHR length, consistent with previous studies, as well as by data density, which can be measured using portable metrics including CPT codes. Here we present those challenges and potential solutions to creating deeply phenotyped, acute COVID-19 cohorts.
Affiliation(s)
- Sarah DeLozier
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Harris T Bland
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Melissa McPheeters
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Quinn Wells
- Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Pierce Avenue, 383 Preston Research Building, Nashville, TN 37232, USA
- Eric Farber-Eger
- Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Pierce Avenue, 383 Preston Research Building, Nashville, TN 37232, USA
- Cosmin A Bejan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Daniel Fabbri
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Trent Rosenbloom
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Dan Roden
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA; Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Pierce Avenue, 383 Preston Research Building, Nashville, TN 37232, USA
- Kevin B Johnson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Wei-Qi Wei
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Josh Peterson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
- Lisa Bastarache
- Department of Biomedical Informatics, Vanderbilt University Medical Center, West End Ave, Suite 1475, Nashville, TN 37203, USA
25
Si Y, Bernstam EV, Roberts K. Generalized and transferable patient language representation for phenotyping with limited data. J Biomed Inform 2021; 116:103726. [PMID: 33711541 DOI: 10.1016/j.jbi.2021.103726] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/20/2020] [Revised: 12/14/2020] [Accepted: 02/23/2021] [Indexed: 12/19/2022]
Abstract
The paradigm of representation learning through transfer learning has the potential to greatly enhance clinical natural language processing. In this work, we propose a multi-task pre-training and fine-tuning approach for learning generalized and transferable patient representations from medical language. The model is first pre-trained with different but related high-prevalence phenotypes and further fine-tuned on downstream target tasks. Our main contribution focuses on the impact this technique can have on low-prevalence phenotypes, a challenging task due to the dearth of data. We validate the representation from pre-training, and fine-tune the multi-task pre-trained models on low-prevalence phenotypes including 38 circulatory diseases, 23 respiratory diseases, and 17 genitourinary diseases. We find multi-task pre-training increases learning efficiency and achieves consistently high performance across the majority of phenotypes. Most important, the multi-task pre-training is almost always either the best-performing model or performs tolerably close to the best-performing model, a property we refer to as robust. All these results lead us to conclude that this multi-task transfer learning architecture is a robust approach for developing generalized and transferable patient language representations for numerous phenotypes.
Affiliation(s)
- Yuqi Si
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
- Elmer V Bernstam
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA; Division of General Internal Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, TX, USA
- Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA
26
Newman-Griffis D, Fosler-Lussier E. Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health. Front Digit Health 2021; 3:620828. [PMID: 33791684 PMCID: PMC8009547 DOI: 10.3389/fdgth.2021.620828] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/23/2020] [Accepted: 02/16/2021] [Indexed: 11/13/2022] Open
Abstract
Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts, such as functional outcomes and social determinants of health, lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of medical information in under-studied domains, and demonstrate its applicability through a case study on physical mobility function. Mobility function is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is represented as one domain of human activity in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in the medical informatics literature, and neither the ICF nor commonly-used medical terminologies capture functional status terminology in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility status to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Recent advances in language modeling and word embedding were used as features for established machine learning models and a novel deep learning approach, achieving a macro-averaged F-1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated data set; (ii) expert definitions of codes of interest; and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. 
This research has implications for continued development of language technologies to analyze functional status information, and the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.
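The macro-averaged F-1 score reported in the abstract averages per-class F-1 values with equal weight for each class. A minimal sketch of that metric follows; the label names and toy predictions are hypothetical, and averaging over every label seen in either the gold or predicted set is an assumption, not a detail taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up code labels.
gold = ["code_A", "code_A", "code_B", "code_B"]
predicted = ["code_A", "code_B", "code_B", "code_B"]
score = macro_f1(gold, predicted)
```

Because each class contributes equally, a rare code that is coded poorly lowers the macro average as much as a common one would, which is why the metric suits multi-class coding tasks with uneven label frequencies.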
Affiliation(s)
- Denis Newman-Griffis
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States
- Epidemiology & Biostatistics Section, Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, MD, United States
- Eric Fosler-Lussier
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, United States