1
Huang X, Arora J, Erzurumluoglu AM, Stanhope SA, Lam D, Zhao H, Ding Z, Wang Z, de Jong J. Enhancing patient representation learning with inferred family pedigrees improves disease risk prediction. J Am Med Inform Assoc 2025; 32:435-446. [PMID: 39723811 PMCID: PMC11833479 DOI: 10.1093/jamia/ocae297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Revised: 10/29/2024] [Accepted: 11/19/2024] [Indexed: 12/28/2024]
Abstract
BACKGROUND Machine learning and deep learning are powerful tools for analyzing electronic health records (EHRs) in healthcare research. Although family health history has been recognized as a major predictor for a wide spectrum of diseases, research has so far adopted a limited view of family relations, essentially treating patients as independent samples in the analysis. METHODS To address this gap, we present ALIGATEHR, which models inferred family relations in a graph attention network augmented with an attention-based medical ontology representation, thus accounting for the complex influence of genetics, shared environmental exposures, and disease dependencies. RESULTS Taking disease risk prediction as a use case, we demonstrate that explicitly modeling family relations significantly improves predictions across the disease spectrum. We then show how ALIGATEHR's attention mechanism, which links patients' disease risk to their relatives' clinical profiles, successfully captures genetic aspects of diseases using longitudinal EHR diagnosis data. Finally, we use ALIGATEHR to successfully distinguish the 2 main inflammatory bowel disease subtypes with highly shared risk factors and symptoms (Crohn's disease and ulcerative colitis). CONCLUSION Overall, our results highlight that family relations should not be overlooked in EHR research and illustrate ALIGATEHR's great potential for enhancing patient representation learning for predictive and interpretable modeling of EHRs.
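The idea of linking a patient's risk to relatives' clinical profiles through attention can be illustrated with a minimal sketch of one graph-attention step. This is not the ALIGATEHR implementation: the function names, the toy vectors, and the omission of the learned linear projection are all assumptions made for illustration, and the scoring form (LeakyReLU over a concatenated pair, then softmax) follows the generic graph attention network recipe.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU non-linearity used by standard graph attention scoring."""
    return x if x > 0 else slope * x


def attend_over_relatives(patient_vec, relative_vecs, attn_vec):
    """One attention step for a single patient (hypothetical helper).

    patient_vec:   embedding of the patient (assumed already projected)
    relative_vecs: embeddings of the inferred relatives
    attn_vec:      attention parameters, length 2 * len(patient_vec)
    Returns (weights, aggregated_vec).
    """
    if not relative_vecs:
        raise ValueError("at least one relative is required")
    scores = []
    for rel in relative_vecs:
        concat = list(patient_vec) + list(rel)
        scores.append(leaky_relu(sum(a * x for a, x in zip(attn_vec, concat))))
    # Numerically stable softmax over the relatives.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the relatives' embeddings.
    dim = len(patient_vec)
    aggregated = [
        sum(w * rel[k] for w, rel in zip(weights, relative_vecs))
        for k in range(dim)
    ]
    return weights, aggregated


if __name__ == "__main__":
    weights, summary = attend_over_relatives(
        [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 1.0, 0.0]
    )
    print(weights, summary)
```

The attention weights sum to 1 across relatives, so they can be read as the relative contribution of each family member's profile to the patient's representation.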
Affiliation(s)
- Xiayuan Huang
  - Department of Biostatistics, Yale University School of Public Health, New Haven, CT 06510, United States
- Jatin Arora
  - Human Genetics, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany
- Abdullah Mesut Erzurumluoglu
  - Human Genetics, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany
- Stephen A Stanhope
  - Real World Data and Analytics, Global Medical Affairs, Boehringer Ingelheim, Ridgefield, CT 06877, United States
- Daniel Lam
  - CB CMDR, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany
- Hongyu Zhao
  - Department of Biostatistics, Yale University School of Public Health, New Haven, CT 06510, United States
- Zhihao Ding
  - Human Genetics, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany
- Zuoheng Wang
  - Department of Biostatistics, Yale University School of Public Health, New Haven, CT 06510, United States
  - Department of Biomedical Informatics & Data Science, Yale University School of Medicine, New Haven, CT 06510, United States
- Johann de Jong
  - Statistical Modeling, Global Computational Biology and Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß 88400, Germany
2
Mayer J, Delgoffe B, Hebbring S. Identifying Family Structures from Obituaries and Matching them to Patients in an Electronic Heath Record. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.26.625445. [PMID: 39677647 PMCID: PMC11642772 DOI: 10.1101/2024.11.26.625445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Motivation Family data is a valuable data source in bioinformatic research because family members often share common genetic and environmental exposures. Collecting family data is traditionally very labor intensive, but advances in electronic health record (EHR) data mining have proven useful for identifying pedigrees linked to longitudinal health histories. These are called e-pedigrees. Unfortunately, e-pedigrees tend to miss the oldest generations, who inherently have the longest and richest health histories. Obituaries are a good source of family data from older generations, as their formulaic nature makes them a good candidate for natural language processing (NLP) that can extract relationships to the decedent. While there have been several studies on obtaining such data from obituaries, we demonstrate for the first time approaches that tie that information to an EHR. Results NLP extraction resulted in 8,166,534 family members being abstracted from 567,279 obituaries published in the state of Wisconsin. After matching decedents and family members to patients in the EHR, we identified 109,365 unique patients who were placed in 34,158 pedigrees. The largest pedigree consisted of 21 individuals. Heritability of adult height was quantified (H² = 0.51 ± 0.04, P ≤ 1.00e-07), demonstrating this data's use in genetic research. The heritability data, coupled with overlapping data in a biobank, suggested that 80%-90% of familial relationships were accurately defined. The totality of these findings demonstrates that obituaries with the oldest generations can be highly informative for bioinformatic research. Availability and Implementation Code is available on GitHub at https://github.com/jgmayer672/ObituaryNLP.
3
Huang X, Tatonetti N, LaRow K, Delgoffee B, Mayer J, Page D, Hebbring SJ. E-Pedigrees: a large-scale automatic family pedigree prediction application. Bioinformatics 2021; 37:3966-3968. [PMID: 34086863 PMCID: PMC8570807 DOI: 10.1093/bioinformatics/btab419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2021] [Revised: 04/30/2021] [Accepted: 06/03/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION The use and functionality of Electronic Health Records (EHR) have increased rapidly in the past few decades. EHRs are becoming an important repository of patient health information and can capture family data. Pedigree analysis is a longstanding and powerful approach that can give insight into the underlying genetic and environmental factors in human health, but traditional approaches to identifying and recruiting families are low-throughput and labor-intensive. Therefore, high-throughput methods to automatically construct family pedigrees are needed. RESULTS We developed a stand-alone application, Electronic Pedigrees, or E-Pedigrees, which combines two validated family prediction algorithms into a single software package for high-throughput pedigree construction. The platform considers patients' basic demographic information and/or emergency contact data to infer high-accuracy parent-child relationships. Importantly, E-Pedigrees allows users to layer in additional pedigree data when available and provides options for applying different logical rules to improve the accuracy of inferred family relationships. This software is fast and easy to use, is compatible with different EHR data sources, and its output is a standard PED file appropriate for multiple downstream analyses. AVAILABILITY AND IMPLEMENTATION The Python 3.3+ version of the E-Pedigrees application is freely available at https://github.com/xiayuan-huang/E-pedigrees.
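To make the PED output concrete, here is a minimal sketch of how inferred parent-child pairs might be written as PED rows. It is not taken from E-Pedigrees: the function name, the input shapes, and the use of `-9` for a missing phenotype are assumptions, and the six-column layout (family ID, individual ID, father ID, mother ID, sex, phenotype) follows the common PLINK-style convention.

```python
def build_ped_lines(family_id, sexes, parent_child):
    """Build tab-separated PED rows for one family (hypothetical helper).

    sexes:        dict of individual ID -> "M", "F", or None if unknown
    parent_child: list of (parent_id, child_id) pairs
    """
    sex_code = {"M": "1", "F": "2"}  # PLINK-style: 1 = male, 2 = female
    fathers, mothers = {}, {}
    for parent, child in parent_child:
        # A parent of unknown sex cannot be placed in either column.
        if sexes.get(parent) == "M":
            fathers[child] = parent
        elif sexes.get(parent) == "F":
            mothers[child] = parent
    lines = []
    for pid in sorted(sexes):
        lines.append("\t".join([
            family_id,
            pid,
            fathers.get(pid, "0"),           # "0" marks an unknown father
            mothers.get(pid, "0"),           # "0" marks an unknown mother
            sex_code.get(sexes[pid], "0"),   # "0" marks unknown sex
            "-9",                            # phenotype left missing (assumption)
        ]))
    return lines


if __name__ == "__main__":
    sexes = {"P1": "M", "P2": "F", "P3": "F"}
    links = [("P1", "P3"), ("P2", "P3")]
    for line in build_ped_lines("FAM1", sexes, links):
        print(line)
```

Founders carry `0` in both parent columns, which is how downstream pedigree tools usually recognize the top generation.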
Affiliation(s)
- Xiayuan Huang
  - Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Nicholas Tatonetti
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Katie LaRow
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Brooke Delgoffee
  - Office of Research Computing and Analytics, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
- John Mayer
  - Office of Research Computing and Analytics, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
- David Page
  - Department of Biostatistics & Bioinformatics, Duke University, Durham, NC 27710, USA
- Scott J Hebbring
  - Center for Precision Medicine Research, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
4
Yin Z, Liu Y, McCoy AB, Malin BA, Sengstack PR. Contribution of Free-Text Comments to the Burden of Documentation: Assessment and Analysis of Vital Sign Comments in Flowsheets. J Med Internet Res 2021; 23:e22806. [PMID: 33661128 PMCID: PMC7974764 DOI: 10.2196/22806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2020] [Revised: 10/11/2020] [Accepted: 01/18/2021] [Indexed: 11/21/2022]
Abstract
Background Documentation burden is a common problem with modern electronic health record (EHR) systems. To reduce this burden, various recording methods (eg, voice recorders or motion sensors) have been proposed. However, these solutions are in an early prototype phase and are unlikely to transition into practice in the near future. A more pragmatic alternative is to directly modify the implementation of the existing functionalities of an EHR system. Objective This study aims to assess the nature of free-text comments entered into EHR flowsheets that supplement quantitative vital sign values and examine opportunities to simplify functionality and reduce documentation burden. Methods We evaluated 209,055 vital sign comments in flowsheets that were generated in the Epic EHR system at the Vanderbilt University Medical Center in 2018. We applied topic modeling, as well as the natural language processing Clinical Language Annotation, Modeling, and Processing software system, to extract generally discussed topics and detailed medical terms (expressed as probability distribution) to investigate the stories communicated in these comments. Results Our analysis showed that 63.33% (6053/9557) of the users who entered vital signs made at least one free-text comment in vital sign flowsheet entries. The user roles that were most likely to compose comments were registered nurse, technician, and licensed nurse. The most frequently identified topics were the notification of a result to health care providers (0.347), the context of a measurement (0.307), and an inability to obtain a vital sign (0.224). There were 4187 unique medical terms that were extracted from 46,029 (0.220) comments, including many symptom-related terms such as “pain,” “upset,” “dizziness,” “coughing,” “anxiety,” “distress,” and “fever” and drug-related terms such as “tylenol,” “anesthesia,” “cannula,” “oxygen,” “motrin,” “rituxan,” and “labetalol.” Conclusions Considering that flowsheet comments are generally not displayed or automatically pulled into any clinical notes, our findings suggest that the flowsheet comment functionality can be simplified (eg, via structured response fields instead of a text input dialog) to reduce health care provider effort. Moreover, rich and clinically important medical terms such as medications and symptoms should be explicitly recorded in clinical notes for better visibility.
Affiliation(s)
- Zhijun Yin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, United States
- Yongtai Liu
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, United States
- Allison B McCoy
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Bradley A Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, United States
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
5
Family member information extraction via neural sequence labeling models with different tag schemes. BMC Med Inform Decis Mak 2019; 19:257. [PMID: 31881965 PMCID: PMC6933890 DOI: 10.1186/s12911-019-0996-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
Background Family history information (FHI) described in unstructured electronic health records (EHRs) is a valuable information source for patient care and scientific research. Since FHI is usually described in free text, the entire process of FHI extraction consists of various steps, including section segmentation, family member and clinical observation extraction, and relation discovery between the extracted members and their observations. The extraction step involves the recognition of FHI concepts along with their properties, such as the family side attribute of the family member concept. Methods This study focuses on the extraction step and formulates it as a sequence labeling problem. We employed a neural sequence labeling model along with different tag schemes to distinguish family members and their observations. Corresponding to different tag schemes, the identified entities were aggregated and processed by different algorithms to determine the required properties. Results We studied the effectiveness of encoding required properties in the tag schemes by evaluating their performance on the dataset released by the BioCreative/OHNLP challenge 2018. It was observed that the proposed side scheme, along with the developed features and neural network architecture, can achieve an overall F1-score of 0.849 on the test set, which ranked second in the FHI entity recognition subtask. Conclusions The developed neural network-based models performed significantly better than conditional random field models. However, our error analysis revealed two challenging issues with the current approach. One is that some properties required cross-sentence inference. The other is that the current model is not able to distinguish between narratives describing the family members of the patient and those specifying the relatives of the patient's family members.
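As an illustration of encoding a property in the tag scheme, the sketch below decodes a BIO-tagged sentence in which the family side is folded into the tag label. The tag names (`FM_Maternal`, `OBS`), the example sentence, and the decoder are hypothetical and are not the tag set or algorithm used in the paper.

```python
def decode_bio(tokens, tags):
    """Collect (label, text) spans from BIO tags (hypothetical decoder).

    A span starts at "B-<label>" and continues over "I-<label>" tags with
    the same label. An "I-" tag that does not continue the open span is
    treated like "O" and closes the span.
    """
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


if __name__ == "__main__":
    tokens = ["Her", "maternal", "grandmother", "had", "breast", "cancer", "."]
    # The side attribute ("Maternal") is carried by the tag itself.
    tags = ["O", "B-FM_Maternal", "I-FM_Maternal", "O", "B-OBS", "I-OBS", "O"]
    print(decode_bio(tokens, tags))
```

With the side in the label, no separate classifier is needed to recover that property after decoding, at the cost of a larger tag vocabulary.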
6
Shor T, Kalka I, Geiger D, Erlich Y, Weissbrod O. Estimating variance components in population scale family trees. PLoS Genet 2019; 15:e1008124. [PMID: 31071088 PMCID: PMC6529016 DOI: 10.1371/journal.pgen.1008124] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 09/24/2018] [Revised: 05/21/2019] [Accepted: 04/03/2019] [Indexed: 12/14/2022]
Abstract
The rapid digitization of genealogical and medical records enables the assembly of extremely large pedigree records spanning millions of individuals and trillions of pairs of relatives. Such pedigrees provide the opportunity to investigate the sociological and epidemiological history of human populations in scales much larger than previously possible. Linear mixed models (LMMs) are routinely used to analyze extremely large animal and plant pedigrees for the purposes of selective breeding. However, LMMs have not been previously applied to analyze population-scale human family trees. Here, we present Sparse Cholesky factorIzation LMM (Sci-LMM), a modeling framework for studying population-scale family trees that combines techniques from the animal and plant breeding literature and from human genetics literature. The proposed framework can construct a matrix of relationships between trillions of pairs of individuals and fit the corresponding LMM in several hours. We demonstrate the capabilities of Sci-LMM via simulation studies and by estimating the heritability of longevity and of reproductive fitness (quantified via number of children) in a large pedigree spanning millions of individuals and over five centuries of human history. Sci-LMM provides a unified framework for investigating the epidemiological history of human populations via genealogical records.
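A matrix of relationships between relatives can be built from a pedigree with the classic tabular method from the breeding literature. The sketch below is a dense, toy-scale illustration and not Sci-LMM itself, which relies on sparse techniques to reach population scale; the function name and input encoding are assumptions.

```python
def additive_relationship_matrix(parents):
    """Tabular method for the additive relationship matrix (toy sketch).

    parents: list where entry i is (father_index, mother_index) for
             individual i, with None for an unknown parent. Parents must
             appear before their offspring in the list.
    Returns a list of lists A, where A[i][j] is twice the kinship
    coefficient between individuals i and j.
    """
    n = len(parents)
    A = [[0.0] * n for _ in range(n)]
    for i, (f, m) in enumerate(parents):
        for p in (f, m):
            if p is not None and not 0 <= p < i:
                raise ValueError("parents must precede offspring")
        for j in range(i):
            # Relationship to an earlier individual is the average of the
            # parents' relationships to that individual.
            value = 0.0
            if f is not None:
                value += 0.5 * A[j][f]
            if m is not None:
                value += 0.5 * A[j][m]
            A[i][j] = A[j][i] = value
        # The diagonal is 1 plus the inbreeding coefficient.
        inbreeding = 0.5 * A[f][m] if f is not None and m is not None else 0.0
        A[i][i] = 1.0 + inbreeding
    return A


if __name__ == "__main__":
    # Two unrelated founders (0, 1) and their two children (2, 3).
    pedigree = [(None, None), (None, None), (0, 1), (0, 1)]
    for row in additive_relationship_matrix(pedigree):
        print(row)
```

In a linear mixed model, this matrix scales the genetic variance component, which is what allows heritability to be estimated from pedigree structure alone.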
Affiliation(s)
- Tal Shor
  - Computer Science Department, Technion—Israel Institute of Technology, Haifa, Israel
  - MyHeritage Ltd., Or Yehuda, Israel
- Iris Kalka
  - Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
  - Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
- Dan Geiger
  - Computer Science Department, Technion—Israel Institute of Technology, Haifa, Israel
- Yaniv Erlich
  - MyHeritage Ltd., Or Yehuda, Israel
  - The New York Genome Center, New York, NY, United States of America
  - Department of Computer Science, Fu School of Engineering, Columbia University, NY, United States of America
- Omer Weissbrod
  - Computer Science Department, Technion—Israel Institute of Technology, Haifa, Israel
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, United States of America
7
Genomic and Phenomic Research in the 21st Century. Trends Genet 2018; 35:29-41. [PMID: 30342790 DOI: 10.1016/j.tig.2018.09.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 04/23/2018] [Revised: 09/24/2018] [Accepted: 09/25/2018] [Indexed: 02/06/2023]
Abstract
The field of human genomics has changed dramatically over time. Initial genomic studies were predominantly restricted to rare disorders in small families. Over the past decade, researchers changed course from family-based studies and instead focused on common diseases and traits in populations of unrelated individuals. With further advancements in biobanking, computer science, electronic health record (EHR) data, and more affordable high-throughput genomics, we are experiencing a new paradigm in human genomic research. Rapidly changing technologies and resources now make it possible to study thousands of diseases simultaneously at the genomic level. This review will focus on these advancements as scientists begin to incorporate phenome-wide strategies in human genomic research to understand the etiology of human diseases and develop new drugs to treat them.