1
Lerga-Jaso J, Terpolovsky A, Novković B, Osama A, Manson C, Bohn S, De Marino A, Kunitomi M, Yazdi PG. Optimization of multi-ancestry polygenic risk score disease prediction models. Sci Rep 2025; 15:17495. [PMID: 40394127 PMCID: PMC12092622 DOI: 10.1038/s41598-025-02903-1] [Received: 11/05/2024] [Accepted: 05/16/2025] [Indexed: 05/22/2025] Open
Abstract
Polygenic risk scores (PRS) have ushered in a new era in genetic epidemiology, offering insights into individual predispositions to a wide range of diseases. However, despite recent marked enhancements in predictive power, PRS-based models still need to overcome several hurdles before they can be broadly applied in the clinic. Chiefly, they need to achieve sufficient accuracy, easy interpretability and portability across diverse populations. Leveraging trans-ancestry genome-wide association study (GWAS) meta-analysis, we generated novel, diverse summary statistics for 30 medically-related traits and benchmarked the performance of six existing PRS algorithms using UK Biobank. We built an ensemble model using logistic regression to combine outputs of top-performing algorithms and validated it on the diverse eMERGE and PAGE MEC cohorts. It surpassed current state-of-the-art PRS models, with minimal performance drops in external cohorts, indicating good calibration. To enhance predictive accuracy for clinical application, we incorporated easily-accessible clinical characteristics such as age, gender, ancestry and risk factors, creating disease prediction models intended as prospective diagnostic tests, with easily interpretable positive or negative outcomes. After adding clinical characteristics, 12 out of 30 models surpassed 80% AUC. Further, 25 traits exceeded the diagnostic odds ratio (DOR) of five, and 19 traits exceeded DOR of 10 for all ancestry groups, indicating high predictive value. Our PRS model for coronary artery disease identified 55-80 times more true coronary events than rare pathogenic variant models, reinforcing its clinical potential. The polygenic component modulated the effect of high-risk rare variants, stressing the need to consider all genetic components in clinical settings. These findings show that newly developed PRS-based disease prediction models have sufficient accuracy and portability to warrant consideration of being used in the clinic.
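The abstract above reports diagnostic odds ratios (DOR) for its prediction models. As a reminder of how that metric follows from a binary test's confusion matrix, here is a minimal Python sketch. The function name and the example counts are illustrative assumptions, not values or code from the paper.

```python
def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Return DOR = (TP / FN) / (FP / TN) = (TP * TN) / (FP * FN).

    All counts are hypothetical inputs. A zero FP or FN makes the ratio
    undefined, so we raise instead of silently returning infinity.
    """
    if fp == 0 or fn == 0:
        raise ValueError("DOR is undefined when FP or FN is zero")
    return (tp * tn) / (fp * fn)


# Illustrative counts only (not taken from the study).
dor = diagnostic_odds_ratio(tp=80, fp=100, fn=20, tn=800)
print(dor)  # 32.0
```

A DOR of 1 means the test does not discriminate between cases and controls; larger values indicate better discrimination.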
Affiliation(s)
- Alex Osama
  - Research & Development, Omics Edge, Miami, FL, USA
- Sandra Bohn
  - Research & Development, Omics Edge, Miami, FL, USA
- Puya G Yazdi
  - Research & Development, Omics Edge, Miami, FL, USA.
2
Dutta D, Chatterjee N. Expanding scope of genetic studies in the era of biobanks. Hum Mol Genet 2025:ddaf054. [PMID: 40312842 DOI: 10.1093/hmg/ddaf054] [Received: 01/13/2025] [Revised: 03/25/2025] [Accepted: 04/08/2025] [Indexed: 05/03/2025] Open
Abstract
Biobanks have become pivotal in genetic research, particularly through genome-wide association studies (GWAS), driving transformative insights into the genetic basis of complex diseases and traits through the integration of genetic data with phenotypic, environmental, family history, and behavioral information. This review explores the distinct design and utility of different biobanks, highlighting their unique contributions to genetic research. We further discuss the utility and methodological advances in combining data from disease-specific study or consortia with that of biobanks, especially focusing on summary statistics based meta-analysis. Subsequently we review the spectrum of additional advantages offered by biobanks in genetic studies in representing population differences, calibration of polygenic scores, assessment of pleiotropy and improving post-GWAS in silico analyses. Advances in sequencing technologies, particularly whole-exome and whole-genome sequencing, have further enabled the discovery of rare variants at biobank scale. Among recent developments, the integration of large-scale multi-omics data especially proteomics and metabolomics, within biobanks provides deeper insights into disease mechanisms and regulatory pathways. Despite challenges like ascertainment strategies and phenotypic misclassification, biobanks continue to evolve, driving methodological innovation and enabling precision medicine. We highlight the contributions of biobanks to genetic research, their growing integration with multi-omics, and finally discuss their future potential for advancing healthcare and therapeutic development.
Affiliation(s)
- Diptavo Dutta
  - Integrative Tumor Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD, 20879, United States
- Nilanjan Chatterjee
  - Department of Biostatistics, Johns Hopkins University, 615 N Wolfe Street, Baltimore, MD, 21205, United States
  - Department of Oncology, Johns Hopkins University, 615 N Wolfe Street, Baltimore, MD, 21205, United States
3
Duan R, Gao C, Tubbs J, Han Y, Guo M, Li S, Ma E, Luo D, Smoller J, Lee P. Unsupervised Ensemble Learning for Efficient Integration of Pre-trained Polygenic Risk Scores. Research Square 2025:rs.3.rs-5976048. [PMID: 40235488 PMCID: PMC11998766 DOI: 10.21203/rs.3.rs-5976048/v1] [Indexed: 04/17/2025]
Abstract
The growing availability of pre-trained polygenic risk score (PRS) models has enabled their integration into real-world applications, reducing the need for extensive data labeling, training, and calibration. However, selecting the most suitable PRS model for a specific target population remains challenging, due to issues such as limited transferability, data heterogeneity, and the scarcity of observed phenotype in real-world settings. Ensemble learning offers a promising avenue to enhance the predictive accuracy of genetic risk assessments, but most existing methods often rely on observed phenotype data or additional genome-wide association studies (GWAS) from the target population to optimize ensemble weights, limiting their utility in real-time implementation. Here, we present the UNSupervised enSemble PRS (UNSemblePRS), an unsupervised ensemble learning framework, that combines pre-trained PRS models without requiring phenotype data or summaries from the target population. Unlike traditional supervised approaches, UNSemblePRS aggregates models based on prediction concordance across a curated subset of candidate PRS models. We evaluated UNSemblePRS using both continuous and binary traits in the All of Us database, demonstrating its scalability and robust performance across diverse populations. These results underscore UNSemblePRS as an accessible tool for integrating PRS models into real-world contexts, offering broad applicability as the availability of PRS models continues to expand.
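To make the idea of weighting pre-trained scores by their mutual concordance more concrete, here is a toy Python sketch. It is not the UNSemblePRS algorithm: the weighting rule (mean pairwise Pearson correlation, clipped at zero) and all function names are assumptions chosen only to illustrate that ensemble weights can be derived without any phenotype data.

```python
from statistics import mean, pstdev
from typing import List


def standardize(xs: List[float]) -> List[float]:
    """Center and scale a score vector to mean 0 and (population) SD 1."""
    mu, sd = mean(xs), pstdev(xs)
    if sd == 0:
        raise ValueError("score vector has zero variance")
    return [(x - mu) / sd for x in xs]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation computed as the mean product of z-scores."""
    zx, zy = standardize(xs), standardize(ys)
    return mean([a * b for a, b in zip(zx, zy)])


def concordance_weights(scores: List[List[float]]) -> List[float]:
    """Weight each model by its mean correlation with the other models.

    Hypothetical rule: negative mean concordance is clipped to zero and the
    weights are normalized to sum to one. No phenotype is used anywhere.
    """
    k = len(scores)
    raw = []
    for i in range(k):
        others = [pearson(scores[i], scores[j]) for j in range(k) if j != i]
        raw.append(max(mean(others), 0.0))
    total = sum(raw)
    if total == 0:
        raise ValueError("no model is positively concordant with the others")
    return [w / total for w in raw]


def ensemble_score(scores: List[List[float]]) -> List[float]:
    """Weighted sum of standardized model scores, one value per individual."""
    weights = concordance_weights(scores)
    z = [standardize(s) for s in scores]
    n = len(scores[0])
    return [sum(w * zs[i] for w, zs in zip(weights, z)) for i in range(n)]


# Made-up scores from three candidate models for four individuals.
models = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0], [4.0, 1.0, 3.0, 2.0]]
print(concordance_weights(models))
print(ensemble_score(models))
```

In this made-up example the third model disagrees with the other two, so the clipping step gives it zero weight.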
4
Smeland OB, Busch C, Andreassen OA, Manchia M. Novel Multimodal Precision Medicine Approaches and the Relevance of Developmental Trajectories in Bipolar Disorder. Biol Psychiatry 2025:S0006-3223(25)01098-4. [PMID: 40157588 DOI: 10.1016/j.biopsych.2025.03.010] [Received: 12/02/2024] [Revised: 02/14/2025] [Accepted: 03/20/2025] [Indexed: 04/01/2025]
Abstract
There is a pressing need to establish objective measures to improve diagnosis, prediction, prevention, and treatment of bipolar disorder (BD). Multimodal artificial intelligence (AI) tools could provide these means by incorporating various layers of data orthogonally related to BD, including genomics and other omics, environmental exposures, imaging measures, electronic health records, cognition, sensing devices, and clinical variables. These rapidly evolving AI models hold promise to capture the multidimensional complexity of BD and delineate clinically relevant developmental trajectories that could guide clinical care and therapeutic strategies. In this review, we describe the potential of mapping developmental trajectories that underlie BD, outline how novel multimodal models could improve the prediction of BD and related outcomes, and discuss specific clinical use cases and key ethical and practical challenges regarding the development and potential implementation of these multimodal AI solutions to advance precision medicine approaches in BD.
Affiliation(s)
- Olav B Smeland
  - Centre for Precision Psychiatry, Division of Mental Health and Addiction, Oslo University Hospital and Institute of Clinical Medicine, University of Oslo, Oslo, Norway.
- Cecilie Busch
  - Centre for Precision Psychiatry, Division of Mental Health and Addiction, Oslo University Hospital and Institute of Clinical Medicine, University of Oslo, Oslo, Norway
- Ole A Andreassen
  - Centre for Precision Psychiatry, Division of Mental Health and Addiction, Oslo University Hospital and Institute of Clinical Medicine, University of Oslo, Oslo, Norway; K.G. Jebsen Centre for Neurodevelopmental Disorders, University of Oslo, Oslo, Norway.
- Mirko Manchia
  - Unit of Psychiatry, Department of Medical Sciences and Public Health, University of Cagliari, Cagliari, Italy; Unit of Clinical Psychiatry, Department of Medicine, University Hospital Agency of Cagliari, Cagliari, Italy; Department of Pharmacology, Dalhousie University, Halifax, Nova Scotia, Canada.
5
Gao C, Tubbs JD, Han Y, Guo M, Li S, Ma E, Luo D, Smoller JW, Lee PH, Duan R. Unsupervised Ensemble Learning for Efficient Integration of Pre-trained Polygenic Risk Scores. medRxiv: The Preprint Server for Health Sciences 2025:2025.01.06.25320058. [PMID: 39830281 PMCID: PMC11741443 DOI: 10.1101/2025.01.06.25320058] [Indexed: 01/22/2025]
Abstract
The growing availability of pre-trained polygenic risk score (PRS) models has enabled their integration into real-world applications, reducing the need for extensive data labeling, training, and calibration. However, selecting the most suitable PRS model for a specific target population remains challenging, due to issues such as limited transferability, data heterogeneity, and the scarcity of observed phenotype in real-world settings. Ensemble learning offers a promising avenue to enhance the predictive accuracy of genetic risk assessments, but most existing methods often rely on observed phenotype data or additional genome-wide association studies (GWAS) from the target population to optimize ensemble weights, limiting their utility in real-time implementation. Here, we present the UNSupervised enSemble PRS (UNSemblePRS), an unsupervised ensemble learning framework, that combines pre-trained PRS models without requiring phenotype data or summaries from the target population. Unlike traditional supervised approaches, UNSemblePRS aggregates models based on prediction concordance across a curated subset of candidate PRS models. We evaluated UNSemblePRS using both continuous and binary traits in the All of Us database, demonstrating its scalability and robust performance across diverse populations. These results underscore UNSemblePRS as an accessible tool for integrating PRS models into real-world contexts, offering broad applicability as the availability of PRS models continues to expand.
6
Teng J, Zhai T, Zhang X, Zhao C, Wang W, Tang H, Ning C, Shang Y, Wang D, Zhang Q. Improving multi-trait genomic prediction by incorporating local genetic correlations. Commun Biol 2025; 8:307. [PMID: 40000888 PMCID: PMC11861333 DOI: 10.1038/s42003-025-07721-9] [Received: 06/28/2024] [Accepted: 02/11/2025] [Indexed: 02/27/2025] Open
Abstract
Genomic prediction holds significant potential for advancing precision medicine in humans, as well as accelerating genetic improvement in animals and plants. For multi-trait prediction, the conventional multi-trait models are primarily based on global genetic correlations between traits. With the development of local genetic correlation (LGC) estimation methods, it is now possible to analyze LGCs confined to specific genomic regions and it is expected that incorporating LGCs into multi-trait prediction model would enhance the prediction ability. Here, we proposed three models to address this issue and evaluated their performances using simulated data and three real datasets from human, cow, and pig populations. Our results demonstrate that LGCs are heterogeneous across the genome and incorporating LGCs in multi-trait prediction would increase the prediction accuracy by an average of 12.76% ± 2.07% compared to conventional multi-trait genomic prediction method (MTGBLUP) in the real datasets. Our findings highlight the importance of considering LGCs in improving multi-trait genomic prediction.
Affiliation(s)
- Jun Teng
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
  - Shandong Futeng Food Co. Ltd., Zaozhuang, China
- Tingting Zhai
  - National Key Laboratory of Wheat Improvement, College of Life Science, Shandong Agricultural University, Tai'an, China
- Xinyi Zhang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Changheng Zhao
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Wenwen Wang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Hui Tang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Chao Ning
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Yingli Shang
  - College of Veterinary Medicine, Shandong Agricultural University, Tai'an, China
- Dan Wang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China.
- Qin Zhang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China.
7
Kunkel D, Sørensen P, Shankar V, Morgante F. Improving polygenic prediction from summary data by learning patterns of effect sharing across multiple phenotypes. PLoS Genet 2025; 21:e1011519. [PMID: 39775068 PMCID: PMC11741642 DOI: 10.1371/journal.pgen.1011519] [Received: 05/06/2024] [Revised: 01/17/2025] [Accepted: 11/27/2024] [Indexed: 01/11/2025] Open
Abstract
Polygenic prediction of complex trait phenotypes has become important in human genetics, especially in the context of precision medicine. Recently, mr.mash, a flexible and computationally efficient method that models multiple phenotypes jointly and leverages sharing of effects across such phenotypes to improve prediction accuracy, was introduced. However, a drawback of mr.mash is that it requires individual-level data, which are often not publicly available. In this work, we introduce mr.mash-rss, an extension of the mr.mash model that requires only summary statistics from Genome-Wide Association Studies (GWAS) and linkage disequilibrium (LD) estimates from a reference panel. By using summary data, we achieve the twin goal of increasing the applicability of the mr.mash model to data sets that are not publicly available and making it scalable to biobank-size data. Through simulations, we show that mr.mash-rss is competitive with, and often outperforms, current state-of-the-art methods for single- and multi-phenotype polygenic prediction in a variety of scenarios that differ in the pattern of effect sharing across phenotypes, the number of phenotypes, the number of causal variants, and the genomic heritability. We also present a real data analysis of 16 blood cell phenotypes in the UK Biobank, showing that mr.mash-rss achieves higher prediction accuracy than competing methods for the majority of traits, especially when the data set has smaller sample size.
Affiliation(s)
- Deborah Kunkel
  - School of Mathematical and Statistical Sciences, Clemson University, Clemson, South Carolina, United States of America
- Peter Sørensen
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
- Vijay Shankar
  - Center for Human Genetics, Clemson University, Greenwood, South Carolina, United States of America
- Fabio Morgante
  - Center for Human Genetics, Clemson University, Greenwood, South Carolina, United States of America
  - Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina, United States of America
8
Zhao Z, Dorn S, Wu Y, Yang X, Jin J, Lu Q. One score to rule them all: regularized ensemble polygenic risk prediction with GWAS summary statistics. bioRxiv: The Preprint Server for Biology 2024:2024.11.27.625748. [PMID: 39677614 PMCID: PMC11642782 DOI: 10.1101/2024.11.27.625748] [Indexed: 12/17/2024]
Abstract
Ensemble learning has been increasingly popular for boosting the predictive power of polygenic risk scores (PRS), with almost every recent multi-ancestry PRS approach employing ensemble learning as a final step. Existing ensemble approaches rely on individual-level data for model training, which severely limits their real-world applications, especially in non-European populations without sufficient genomic samples. Here, we introduce a statistical framework to construct regularized ensemble PRS, which allows us to combine a large number of candidate PRS models using only summary statistics from genome-wide association studies. We demonstrate its robust and substantial improvement over many existing PRS models in both within- and cross-ancestry applications. We believe this is truly "one score to rule them all" due to its capability to continuously combine newly developed PRS models with existing models to improve prediction performance, which makes it a universal approach that should always be employed in future PRS applications.
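One way to see why summary-level quantities can be enough to fit a regularized ensemble: with standardized scores and phenotype, ridge-regression weights depend only on the correlation matrix among the candidate PRS and each score's correlation with the phenotype. The Python sketch below illustrates that generic ridge identity, w = (R + λI)⁻¹ r. It is not the estimator from the abstract; how R and r would be obtained from GWAS summary statistics is not shown, and all names and numbers are assumptions.

```python
from typing import List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_ensemble_weights(
    prs_corr: List[List[float]], prs_pheno_corr: List[float], lam: float
) -> List[float]:
    """Ridge weights w = (R + lam * I)^-1 r for standardized scores.

    prs_corr (R) and prs_pheno_corr (r) are assumed to be given; lam is a
    hypothetical tuning parameter.
    """
    n = len(prs_pheno_corr)
    penalized = [
        [prs_corr[i][j] + (lam if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(penalized, prs_pheno_corr)


# Made-up inputs: two candidate PRS correlated at 0.5 with each other,
# each correlated at 0.3 with the phenotype.
R = [[1.0, 0.5], [0.5, 1.0]]
r = [0.3, 0.3]
print(ridge_ensemble_weights(R, r, lam=0.5))
```

Larger values of `lam` shrink the weights toward zero, which is the usual way ridge regression stabilizes the fit when many candidate scores are highly correlated.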
Affiliation(s)
- Zijie Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Stephen Dorn
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Yuchang Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Xiaoyu Yang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Jin Jin
  - Department of Biostatistics, Epidemiology and Bioinformatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
- Qiongshi Lu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI
9
Jayasinghe D, Eshetie S, Beckmann K, Benyamin B, Lee SH. Advancements and limitations in polygenic risk score methods for genomic prediction: a scoping review. Hum Genet 2024; 143:1401-1431. [PMID: 39542907 DOI: 10.1007/s00439-024-02716-8] [Received: 05/22/2024] [Accepted: 10/31/2024] [Indexed: 11/17/2024]
Abstract
This scoping review aims to identify and evaluate the landscape of Polygenic Risk Score (PRS)-based methods for genomic prediction from 2013 to 2023, highlighting their advancements, key concepts, and existing gaps in knowledge, research, and technology. Over the past decade, various PRS-based methods have emerged, each employing different statistical frameworks aimed at enhancing prediction accuracy, processing speed and memory efficiency. Despite notable advancements, challenges persist, including unrealistic assumptions regarding sample sizes and the polygenicity of traits necessary for accurate predictions, as well as limitations in exploring hyper-parameter spaces and considering environmental interactions. We included studies focusing on PRS-based methods for risk prediction that underwent methodological evaluations using valid approaches and released computational tools/software. Additionally, we restricted our selection to studies involving human participants that were published in English language. This review followed the standard protocol recommended by Joanna Briggs Institute Reviewer's Manual, systematically searching Ovid MEDLINE, Ovid Embase, Scopus and Web of Science databases. Additionally, searches included grey literature sources like pre-print servers such as bioRxiv, and articles recommended by experts to ensure comprehensive and diverse coverage of relevant records. This study identified 34 studies detailing 37 genomic prediction methods, the majority of which rely on linkage disequilibrium (LD) information and necessitate hyper-parameter tuning. Nine methods integrate functional/gene annotation, while 12 are suitable for cross-ancestry genomic prediction, with only one considering gene-environment (GxE) interaction. While some methods require individual-level data, most leverage summary statistics, offering flexibility. Despite progress, challenges remain. 
These include computational complexity and the need for large sample sizes for high prediction accuracy. Furthermore, recent methods exhibit varying effectiveness across traits, with absolute accuracies often falling short of clinical utility. Transferability across ancestries varies, influenced by trait heritability and diversity of training data, while handling admixed populations remains challenging. Additionally, the absence of standard error measurements for individual PRSs, crucial in clinical settings, underscores a critical gap. Another issue is the lack of customizable graphical visualization tools among current software packages. While genomic prediction methods have advanced significantly, there is still room for improvement. Addressing current challenges and embracing future research directions will lead to the development of more universally applicable, robust, and clinically relevant genomic prediction tools.
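For readers new to the topic, the quantity all of the reviewed methods ultimately produce is a weighted sum of allele dosages. The Python sketch below shows that basic calculation. The function name, effect sizes, and dosages are made up for illustration, and the sketch deliberately ignores LD adjustment, shrinkage, and hyper-parameter tuning, which is where the reviewed methods differ.

```python
from typing import Sequence


def polygenic_score(effect_sizes: Sequence[float], dosages: Sequence[float]) -> float:
    """Return sum_j beta_j * g_j for one individual.

    effect_sizes: per-variant weights (for example, adjusted GWAS betas).
    dosages: effect-allele counts in [0, 2] for the same variants, same order.
    """
    if len(effect_sizes) != len(dosages):
        raise ValueError("effect_sizes and dosages must have the same length")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Hypothetical weights and genotype dosages for three variants.
betas = [0.5, -0.25, 1.0]
genotype = [2, 1, 0]
print(polygenic_score(betas, genotype))  # 0.75
```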
Affiliation(s)
- Dovini Jayasinghe
  - Australian Centre for Precision Health, University of South Australia, Adelaide, SA, 5000, Australia.
  - UniSA Allied Health and Human Performance, University of South Australia, Adelaide, SA, 5000, Australia.
  - South Australian Health and Medical Research Institute (SAHMRI), University of South Australia, Adelaide, SA, 5000, Australia.
- Setegn Eshetie
  - Australian Centre for Precision Health, University of South Australia, Adelaide, SA, 5000, Australia
  - UniSA Allied Health and Human Performance, University of South Australia, Adelaide, SA, 5000, Australia
  - South Australian Health and Medical Research Institute (SAHMRI), University of South Australia, Adelaide, SA, 5000, Australia
  - College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
- Kerri Beckmann
  - UniSA Allied Health and Human Performance, University of South Australia, Adelaide, SA, 5000, Australia
- Beben Benyamin
  - Australian Centre for Precision Health, University of South Australia, Adelaide, SA, 5000, Australia
  - UniSA Allied Health and Human Performance, University of South Australia, Adelaide, SA, 5000, Australia
  - South Australian Health and Medical Research Institute (SAHMRI), University of South Australia, Adelaide, SA, 5000, Australia
- S Hong Lee
  - Australian Centre for Precision Health, University of South Australia, Adelaide, SA, 5000, Australia
  - UniSA Allied Health and Human Performance, University of South Australia, Adelaide, SA, 5000, Australia
  - South Australian Health and Medical Research Institute (SAHMRI), University of South Australia, Adelaide, SA, 5000, Australia
10
Zhao Z, Gruenloh T, Yan M, Wu Y, Sun Z, Miao J, Wu Y, Song J, Lu Q. Optimizing and benchmarking polygenic risk scores with GWAS summary statistics. Genome Biol 2024; 25:260. [PMID: 39379999 PMCID: PMC11462675 DOI: 10.1186/s13059-024-03400-w] [Received: 06/11/2023] [Accepted: 09/23/2024] [Indexed: 10/10/2024] Open
Abstract
BACKGROUND: Polygenic risk score (PRS) is a major research topic in human genetics. However, a significant gap exists between PRS methodology and applications in practice due to often unavailable individual-level data for various PRS tasks including model fine-tuning, benchmarking, and ensemble learning. RESULTS: We introduce an innovative statistical framework to optimize and benchmark PRS models using summary statistics of genome-wide association studies. This framework builds upon our previous work and can fine-tune virtually all existing PRS models while accounting for linkage disequilibrium. In addition, we provide an ensemble learning strategy named PUMAS-ensemble to combine multiple PRS models into an ensemble score without requiring external data for model fitting. Through extensive simulations and analysis of many complex traits in the UK Biobank, we demonstrate that this approach closely approximates gold-standard analytical strategies based on external validation, and substantially outperforms state-of-the-art PRS methods. CONCLUSIONS: Our method is a powerful and general modeling technique that can continue to combine the best-performing PRS methods out there through ensemble learning and could become an integral component for all future PRS applications.
Affiliation(s)
- Zijie Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Tim Gruenloh
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Meiyi Yan
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA
- Yixuan Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Zhongxuan Sun
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Jiacheng Miao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Yuchang Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
  - Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, USA
- Jie Song
  - Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, USA
- Qiongshi Lu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA.
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA.
  - Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, USA.
11
Tyrer JP, Peng PC, DeVries AA, Gayther SA, Jones MR, Pharoah PD. Improving on polygenic scores across complex traits using select and shrink with summary statistics (S4) and LDpred2. BMC Genomics 2024; 25:878. [PMID: 39294559 PMCID: PMC11411995 DOI: 10.1186/s12864-024-10706-3] [Received: 12/12/2023] [Accepted: 08/13/2024] [Indexed: 09/20/2024] Open
Abstract
BACKGROUND: As precision medicine advances, polygenic scores (PGS) have become increasingly important for clinical risk assessment. Many methods have been developed to create polygenic models with increased accuracy for risk prediction. Our select and shrink with summary statistics (S4) PGS method has previously been shown to accurately predict the polygenic risk of epithelial ovarian cancer. Here, we applied S4 PGS to 12 phenotypes for UK Biobank participants, and compared it with the LDpred2 and a combined S4 + LDpred2 method. RESULTS: The S4 + LDpred2 method provided overall improved PGS accuracy across a variety of phenotypes for UK Biobank participants. Additionally, the S4 + LDpred2 method had the best estimated PGS accuracy in Finnish and Japanese populations. We also addressed the challenge of limited genotype level data by developing the PGS models using only GWAS summary statistics. CONCLUSIONS: Taken together, the S4 + LDpred2 method represents an improvement in overall PGS accuracy across multiple phenotypes and populations.
Affiliation(s)
- Jonathan P Tyrer
  - Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, CB1 8RN, UK
- Pei-Chen Peng
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, California, 90048, United States of America
- Amber A DeVries
  - Center for Bioinformatics and Functional Genomics, Cedars-Sinai Medical Center, California, 90048, United States of America
- Simon A Gayther
  - Center for Inherited Oncogenesis, Department of Medicine, UT Health San Antonio, Texas, 78229, United States of America
- Michelle R Jones
  - Center for Bioinformatics and Functional Genomics, Cedars-Sinai Medical Center, California, 90048, United States of America.
- Paul D Pharoah
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, California, 90048, United States of America
12
Sharew NT, Clark SR, Schubert KO, Amare AT. Pharmacogenomic scores in psychiatry: systematic review of current evidence. Transl Psychiatry 2024; 14:322. [PMID: 39107294 PMCID: PMC11303815 DOI: 10.1038/s41398-024-02998-6] [Received: 01/05/2024] [Revised: 06/21/2024] [Accepted: 06/27/2024] [Indexed: 08/10/2024] Open
Abstract
In the past two decades, significant progress has been made in the development of polygenic scores (PGSs). One specific application of PGSs is the development and potential use of pharmacogenomic scores (PGx-scores) to identify patients who can benefit from a specific medication or are likely to experience side effects. This systematic review comprehensively evaluates published PGx-score studies in psychiatry and provides insights into their potential clinical use and avenues for future development. A systematic literature search was conducted across PubMed, EMBASE, and Web of Science databases until 22 August 2023. This review included fifty-three primary studies, of which the majority (69.8%) were conducted using samples of European ancestry. We found that over 90% of PGx-scores in psychiatry have been developed based on psychiatric and medical diagnoses or trait variants, rather than pharmacogenomic variants. Among these PGx-scores, the polygenic score for schizophrenia (PGS_SCZ) has been most extensively studied in relation to its impact on treatment outcomes (32 publications). Twenty (62.5%) of these studies suggest that individuals with higher PGS_SCZ have negative outcomes from psychotropic treatment (poorer treatment response, higher rates of treatment resistance, more antipsychotic-induced side effects, or more psychiatric hospitalizations), while the remaining studies did not find significant associations. Although PGx-scores alone accounted for at best 5.6% of the variance in treatment outcomes (in schizophrenia treatment resistance), together with clinical variables they explained up to 13.7% (in bipolar lithium response), suggesting that clinical translation might be achieved by including PGx-scores in multivariable models. In conclusion, our literature review found that there are still very few studies developing PGx-scores using pharmacogenomic variants. Research with larger and more diverse populations is required to develop clinically relevant PGx-scores, using biology-informed and multi-phenotypic polygenic scoring approaches, as well as by integrating clinical variables with these scores to facilitate their translation to psychiatric practice.
Affiliation(s)
- Nigussie T Sharew
- Discipline of Psychiatry, Adelaide Medical School, The University of Adelaide, Adelaide, SA, Australia
- Asrat Woldeyes Health Science Campus, Debre Berhan University, Debre Berhan, Ethiopia
| | - Scott R Clark
- Discipline of Psychiatry, Adelaide Medical School, The University of Adelaide, Adelaide, SA, Australia
| | - K Oliver Schubert
- Discipline of Psychiatry, Adelaide Medical School, The University of Adelaide, Adelaide, SA, Australia
- Division of Mental Health, Northern Adelaide Local Health Network, SA Health, Adelaide, Australia
- Headspace Adelaide Early Psychosis - Sonder, Adelaide, SA, Australia
| | - Azmeraw T Amare
- Discipline of Psychiatry, Adelaide Medical School, The University of Adelaide, Adelaide, SA, Australia.
| |
|
13
|
Monti R, Eick L, Hudjashov G, Läll K, Kanoni S, Wolford BN, Wingfield B, Pain O, Wharrie S, Jermy B, McMahon A, Hartonen T, Heyne H, Mars N, Lambert S, Hveem K, Inouye M, van Heel DA, Mägi R, Marttinen P, Ripatti S, Ganna A, Lippert C. Evaluation of polygenic scoring methods in five biobanks shows larger variation between biobanks than methods and finds benefits of ensemble learning. Am J Hum Genet 2024; 111:1431-1447. [PMID: 38908374 PMCID: PMC11267524 DOI: 10.1016/j.ajhg.2024.06.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 05/31/2024] [Accepted: 06/05/2024] [Indexed: 06/24/2024] Open
Abstract
Methods of estimating polygenic scores (PGSs) from genome-wide association studies are increasingly utilized. However, independent method evaluation is lacking, and method comparisons are often limited. Here, we evaluate polygenic scores derived via seven methods in five biobank studies (totaling about 1.2 million participants) across 16 diseases and quantitative traits, building on a reference-standardized framework. We conducted meta-analyses to quantify the effects of method choice, hyperparameter tuning, method ensembling, and the target biobank on PGS performance. We found that no single method consistently outperformed all others. PGS effect sizes were more variable between biobanks than between methods within biobanks when methods were well tuned. Differences between methods were largest for the two investigated autoimmune diseases, seropositive rheumatoid arthritis and type 1 diabetes. For most methods, cross-validation was more reliable for tuning hyperparameters than automatic tuning (without the use of target data). For a given target phenotype, elastic net models combining PGS across methods (ensemble PGS) tuned in the UK Biobank provided consistent, high, and cross-biobank transferable performance, increasing PGS effect sizes (β coefficients) by a median of 5.0% relative to LDpred2 and MegaPRS (the two best-performing single methods when tuned with cross-validation). Our interactively browsable online results and open-source workflow, prspipe, provide a rich resource and reference for the analysis of polygenic scoring methods across biobanks.
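The ensemble step described in this abstract (an elastic net that combines scores from several PGS methods) can be illustrated with a small coordinate-descent solver. This is a minimal sketch under stated assumptions, not the prspipe implementation: the function names, the penalty settings `lam` and `alpha`, and the toy inputs are all hypothetical, the phenotype is treated as continuous, and both the phenotype and the per-method scores are assumed to be centered so that no intercept is needed.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 part of the elastic net)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam=0.01, alpha=0.5, n_iter=200):
    """Fit weights for combining per-method scores by coordinate descent.

    X: one row per individual, one column per PGS method (hypothetical data)
    y: centered phenotype values
    Minimizes (1/2n)*||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*||b||^2).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of column j with the partial residual
            z = 0.0    # mean squared value of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
    return beta


def ensemble_score(row, beta):
    """Weighted sum of one individual's per-method scores."""
    return sum(x * b for x, b in zip(row, beta))
```

With `lam=0` the solver reduces to ordinary least squares, and a large `lam` with `alpha=1` drives every weight to zero, which is the behaviour that lets the penalty drop uninformative methods from the ensemble.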
Affiliation(s)
- Remo Monti
- Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam, Germany; Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Berlin, Germany
| | - Lisa Eick
- Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland
| | - Georgi Hudjashov
- Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
| | - Kristi Läll
- Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
| | - Stavroula Kanoni
- William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
| | - Brooke N Wolford
- K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, Faculty of Medicine and Health, Norwegian University of Science and Technology, Trondheim, Norway
| | - Benjamin Wingfield
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Oliver Pain
- Maurice Wohl Clinical Neuroscience Institute, Department of Basic and Clinical Neuroscience; Institute of Psychiatry, Psychology and Neuroscience; King's College London, London, UK
| | - Sophie Wharrie
- Aalto University, Department of Computer Science, Espoo, Finland
| | - Bradley Jermy
- Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland
| | - Aoife McMahon
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Tuomo Hartonen
- Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland
| | - Henrike Heyne
- Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam, Germany
| | - Nina Mars
- Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA; Stanley Center for Psychiatric Research and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Samuel Lambert
- Levanger Hospital, Nord-Trøndelag Hospital Trust, Levanger, Norway; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; British Heart Foundation Cambridge Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
| | - Kristian Hveem
- K.G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, Faculty of Medicine and Health, Norwegian University of Science and Technology, Trondheim, Norway; Levanger Hospital, Nord-Trøndelag Hospital Trust, Levanger, Norway
| | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK; British Heart Foundation Cambridge Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
| | | | - Reedik Mägi
- Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
| | - Pekka Marttinen
- Aalto University, Department of Computer Science, Espoo, Finland
| | - Samuli Ripatti
- Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland; Department of Public Health, University of Helsinki, Helsinki, Finland
| | - Andrea Ganna
- Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland; Massachusetts General Hospital and Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Christoph Lippert
- Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam, Germany; Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Diagnostic, Molecular, and Interventional Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| |
|
14
|
MacCarthy G, Pazoki R. Using Machine Learning to Evaluate the Value of Genetic Liabilities in the Classification of Hypertension within the UK Biobank. J Clin Med 2024; 13:2955. [PMID: 38792496 PMCID: PMC11122671 DOI: 10.3390/jcm13102955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Revised: 05/01/2024] [Accepted: 05/07/2024] [Indexed: 05/26/2024] Open
Abstract
Background and Objective: Hypertension increases the risk of cardiovascular diseases (CVD) such as stroke, heart attack, heart failure, and kidney disease, contributing to global disease burden and premature mortality. Previous studies have utilized statistical and machine learning techniques to develop hypertension prediction models. Only a few have included genetic liabilities and evaluated their predictive values. This study aimed to develop an effective hypertension classification model and investigate the potential influence of genetic liability for multiple risk factors linked to CVD on hypertension risk using the random forest and the neural network. Materials and Methods: The study involved 244,718 European participants, who were divided into training and testing sets. Genetic liabilities were constructed using genetic variants associated with CVD risk factors obtained from genome-wide association studies (GWAS). Various combinations of machine learning models before and after feature selection were tested to develop the best classification model. The models were evaluated using area under the curve (AUC), calibration, and net reclassification improvement in the testing set. Results: The models without genetic liabilities achieved AUCs of 0.70 and 0.72 using the random forest and the neural network methods, respectively. Adding genetic liabilities improved the AUC for the random forest but not for the neural network. The best classification model was achieved when feature selection and classification were performed using random forest (AUC = 0.71, Spiegelhalter z score = 0.10, p-value = 0.92, calibration slope = 0.99). This model included genetic liabilities for total cholesterol and low-density lipoprotein (LDL). Conclusions: The study highlighted that incorporating genetic liabilities for lipids in a machine learning model may provide incremental value for hypertension classification beyond baseline characteristics.
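The AUC values quoted in this abstract can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. A minimal sketch of that rank-based (Mann-Whitney) reading follows; the function name and the example inputs are hypothetical and are not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise case-control comparison.

    scores: predicted risk per participant (hypothetical values)
    labels: 1 for cases, 0 for controls
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0   # case ranked above control
            elif c == k:
                wins += 0.5   # ties count as half
    return wins / (len(cases) * len(controls))
```

The pairwise loop is quadratic in the sample size, so for a cohort of this scale a sort-based rank sum would be the practical choice; the loop is kept here because it makes the definition explicit.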
Affiliation(s)
- Gideon MacCarthy
- Cardiovascular and Metabolic Research Group, Division of Biomedical Sciences, Department of Life Sciences, College of Health, Medicine and Life Sciences, Brunel University London, London UB8 3PH, UK
| | - Raha Pazoki
- Cardiovascular and Metabolic Research Group, Division of Biomedical Sciences, Department of Life Sciences, College of Health, Medicine and Life Sciences, Brunel University London, London UB8 3PH, UK
- MRC Centre for Environment and Health, Department of Epidemiology and Biostatistics, School of Public Health, St Mary’s Campus, Norfolk Place, Imperial College London, London W2 1PG, UK
| |
|
15
|
Jin J, Zhan J, Zhang J, Zhao R, O'Connell J, Jiang Y, Buyske S, Gignoux C, Haiman C, Kenny EE, Kooperberg C, North K, Koelsch BL, Wojcik G, Zhang H, Chatterjee N. MUSSEL: Enhanced Bayesian polygenic risk prediction leveraging information across multiple ancestry groups. CELL GENOMICS 2024; 4:100539. [PMID: 38604127 PMCID: PMC11019365 DOI: 10.1016/j.xgen.2024.100539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Revised: 09/07/2023] [Accepted: 03/14/2024] [Indexed: 04/13/2024]
Abstract
Polygenic risk scores (PRSs) are now showing promising predictive performance on a wide variety of complex traits and diseases, but there exists a substantial performance gap across populations. We propose MUSSEL, a method for ancestry-specific polygenic prediction that borrows information in summary statistics from genome-wide association studies (GWASs) across multiple ancestry groups via Bayesian hierarchical modeling and ensemble learning. In our simulation studies and data analyses across four distinct studies, totaling 5.7 million participants with a substantial ancestral diversity, MUSSEL shows promising performance compared to alternatives. For example, MUSSEL has an average gain in prediction R² across 11 continuous traits of 40.2% and 49.3% compared to PRS-CSx and CT-SLEB, respectively, in the African ancestry population. The best-performing method, however, varies by GWAS sample size, target ancestry, trait architecture, and linkage disequilibrium reference samples; thus, ultimately a combination of methods may be needed to generate the most robust PRSs across diverse populations.
Affiliation(s)
- Jin Jin
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA; Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19103, USA.
| | | | - Jingning Zhang
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
| | - Ruzhang Zhao
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
| | | | | | - Steven Buyske
- Department of Statistics, Rutgers University, New Brunswick, NJ 08854, USA
| | - Christopher Gignoux
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Christopher Haiman
- Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90032, USA
| | - Eimear E Kenny
- Icahn Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Charles Kooperberg
- Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA
| | - Kari North
- Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516, USA
| | | | - Genevieve Wojcik
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
| | - Haoyu Zhang
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
| | - Nilanjan Chatterjee
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA; Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD 21205, USA.
| |
|
16
|
Truong B, Hull LE, Ruan Y, Huang QQ, Hornsby W, Martin H, van Heel DA, Wang Y, Martin AR, Lee SH, Natarajan P. Integrative polygenic risk score improves the prediction accuracy of complex traits and diseases. CELL GENOMICS 2024; 4:100523. [PMID: 38508198 PMCID: PMC11019356 DOI: 10.1016/j.xgen.2024.100523] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 08/14/2023] [Revised: 09/15/2023] [Accepted: 02/20/2024] [Indexed: 03/22/2024]
Abstract
Polygenic risk scores (PRSs) are an emerging tool to predict the clinical phenotypes and outcomes of individuals. We propose PRSmix, a framework that leverages the PRS corpus of a target trait to improve prediction accuracy, and PRSmix+, which incorporates genetically correlated traits to better capture the human genetic architecture for 47 and 32 diseases/traits in European and South Asian ancestries, respectively. PRSmix demonstrated a mean prediction accuracy improvement of 1.20-fold (95% confidence interval [CI], [1.10; 1.3]; p = 9.17 × 10⁻⁵) and 1.19-fold (95% CI, [1.11; 1.27]; p = 1.92 × 10⁻⁶), and PRSmix+ improved the prediction accuracy by 1.72-fold (95% CI, [1.40; 2.04]; p = 7.58 × 10⁻⁶) and 1.42-fold (95% CI, [1.25; 1.59]; p = 8.01 × 10⁻⁷) in European and South Asian ancestries, respectively. Compared to previous cross-trait-combination methods with scores from pre-defined correlated traits, we demonstrated that our method improved prediction accuracy for coronary artery disease up to 3.27-fold (95% CI, [2.1; 4.44]; p value after false discovery rate (FDR) correction = 2.6 × 10⁻⁴). Our method provides a comprehensive framework to benchmark and leverage the combined power of PRS for maximal performance in a desired target population.
Affiliation(s)
- Buu Truong
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of MIT and Harvard, 415 Main St, Cambridge, MA 02142, USA; Center for Genomic Medicine and Cardiovascular Research Center, Massachusetts General Hospital, 185 Cambridge Street, Boston, MA 02114, USA
| | - Leland E Hull
- Division of General Internal Medicine, Massachusetts General Hospital, 100 Cambridge Street, Boston, MA 02114, USA; Department of Medicine, Harvard Medical School, 25 Shattuck Street, Boston, MA 02115, USA
| | - Yunfeng Ruan
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of MIT and Harvard, 415 Main St, Cambridge, MA 02142, USA; Center for Genomic Medicine and Cardiovascular Research Center, Massachusetts General Hospital, 185 Cambridge Street, Boston, MA 02114, USA
| | - Qin Qin Huang
- Department of Human Genetics, Wellcome Sanger Institute, Cambridge, UK
| | - Whitney Hornsby
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of MIT and Harvard, 415 Main St, Cambridge, MA 02142, USA; Center for Genomic Medicine and Cardiovascular Research Center, Massachusetts General Hospital, 185 Cambridge Street, Boston, MA 02114, USA
| | - Hilary Martin
- Department of Human Genetics, Wellcome Sanger Institute, Cambridge, UK
| | - David A van Heel
- Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
| | - Ying Wang
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of MIT and Harvard, 415 Main St, Cambridge, MA 02142, USA; Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Alicia R Martin
- Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - S Hong Lee
- Australian Centre for Precision Health, University of South Australia Cancer Research Institute, University of South Australia, Adelaide, SA 5000, Australia
| | - Pradeep Natarajan
- Program in Medical and Population Genetics and the Cardiovascular Disease Initiative, Broad Institute of MIT and Harvard, 415 Main St, Cambridge, MA 02142, USA; Center for Genomic Medicine and Cardiovascular Research Center, Massachusetts General Hospital, 185 Cambridge Street, Boston, MA 02114, USA; Department of Medicine, Harvard Medical School, 25 Shattuck Street, Boston, MA 02115, USA.
| |
|
17
|
Yurtseven A, Buyanova S, Agrawal AA, Bochkareva OO, Kalinina OV. Machine learning and phylogenetic analysis allow for predicting antibiotic resistance in M. tuberculosis. BMC Microbiol 2023; 23:404. [PMID: 38124060 PMCID: PMC10731705 DOI: 10.1186/s12866-023-03147-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 12/07/2023] [Indexed: 12/23/2023] Open
Abstract
BACKGROUND Antimicrobial resistance (AMR) poses a significant global health threat, and an accurate prediction of bacterial resistance patterns is critical for effective treatment and control strategies. In recent years, machine learning (ML) approaches have emerged as powerful tools for analyzing large-scale bacterial AMR data. However, ML methods often ignore evolutionary relationships among bacterial strains, which can greatly affect their performance, especially when the goal is to detect resistance-associated features. Genome-wide association study (GWAS) methods such as linear mixed models account for the evolutionary relationships in bacteria, but they uncover only highly significant variants that have already been reported in the literature. RESULTS In this work, we introduce a novel phylogeny-related parallelism score (PRPS), which measures whether a certain feature is correlated with the population structure of a set of samples. We demonstrate that PRPS can be used, in combination with SVM- and random forest-based models, to reduce the number of features in the analysis while simultaneously increasing the models' performance. We applied our pipeline to publicly available AMR data from the PATRIC database for Mycobacterium tuberculosis against six common antibiotics. CONCLUSIONS Using our pipeline, we re-discovered known resistance-associated mutations as well as new candidate mutations that may be related to resistance and have not previously been reported in the literature. We demonstrated that taking phylogenetic relationships into account not only improves model performance, but also yields more biologically relevant top-contributing resistance markers.
Affiliation(s)
- Alper Yurtseven
- Department of Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Campus E8.1, Saarbrücken, 66123, Saarland, Germany.
- Graduate School of Computer Science, Saarland University, Saarbrücken, 66123, Saarland, Germany.
| | - Sofia Buyanova
- Institute of Science and Technology Austria (ISTA), Am Campus 1, Klosterneuburg, 3400, Austria
| | - Amay Ajaykumar Agrawal
- Department of Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Campus E8.1, Saarbrücken, 66123, Saarland, Germany
- Graduate School of Computer Science, Saarland University, Saarbrücken, 66123, Saarland, Germany
| | - Olga O Bochkareva
- Institute of Science and Technology Austria (ISTA), Am Campus 1, Klosterneuburg, 3400, Austria
- Centre for Microbiology and Environmental Systems Science, Division of Computational System Biology, University of Vienna, Djerassiplatz 1 A, Wien, 1030, Austria
| | - Olga V Kalinina
- Department of Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Campus E8.1, Saarbrücken, 66123, Saarland, Germany
- Graduate School of Computer Science, Saarland University, Saarbrücken, 66123, Saarland, Germany
- Faculty of Medicine, Saarland University, Homburg, 66421, Saarland, Germany
| |
|
18
|
Chen SF, Loguercio S, Chen KY, Lee SE, Park JB, Liu S, Sadaei HJ, Torkamani A. Artificial Intelligence for Risk Assessment on Primary Prevention of Coronary Artery Disease. CURRENT CARDIOVASCULAR RISK REPORTS 2023; 17:215-231. [DOI: 10.1007/s12170-023-00731-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 10/09/2023] [Indexed: 01/04/2025]
Abstract
Purpose of Review
Coronary artery disease (CAD) is a common and etiologically complex disease worldwide. Current guidelines for primary prevention, or the prevention of a first acute event, include relatively simple risk assessment and leave substantial room for improvement both for risk ascertainment and selection of prevention strategies. Here, we review how advances in big data and predictive modeling foreshadow a promising future of improved risk assessment and precision medicine for CAD.
Recent Findings
Artificial intelligence (AI) has improved the utility of high dimensional data, providing an opportunity to better understand the interplay between numerous CAD risk factors. Beyond applications of AI in cardiac imaging, the vanguard application of AI in healthcare, recent translational research is also revealing a promising path for AI in multi-modal risk prediction using standard biomarkers, genetic and other omics technologies, a variety of biosensors, and unstructured data from electronic health records (EHRs). However, gaps remain in clinical validation of AI models, most notably in the actionability of complex risk prediction for more precise therapeutic interventions.
Summary
The recent availability of nation-scale biobank datasets has provided a tremendous opportunity to richly characterize longitudinal health trajectories using health data collected at home, at laboratories, and through clinic visits. The ever-growing availability of deep genotype-phenotype data is poised to drive a transition from simple risk prediction algorithms to complex, “data-hungry,” AI models in clinical decision-making. While AI models provide the means to incorporate essentially all risk factors into comprehensive risk prediction frameworks, there remains a need to wrap these predictions in interpretable frameworks that map to our understanding of underlying biological mechanisms and associated personalized intervention. This review explores recent advances in the role of machine learning and AI in CAD primary prevention and highlights current strengths as well as limitations mediating potential future applications.
|
19
|
Wang G, Ott J. Digenic Analysis Finds Highly Interactive Genetic Variants Underlying Polygenic Traits. MEDICAL RESEARCH ARCHIVES 2023; 11:10.18103/mra.v11i10.4604. [PMID: 38882238 PMCID: PMC11177775 DOI: 10.18103/mra.v11i10.4604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2024]
Abstract
We briefly review our recently published approach to mining digenic genotype patterns, which consist of two genotypes each originating in a different DNA variant. We do this for a genetic case-control study by evaluating all possible pairs of genotypes, distributing the workload over numerous CPUs (threads) in a high-performance computing environment, and we apply our methods to two known datasets, age-related macular degeneration (AMD) and Parkinson Disease (PD). Based on a list of (e.g., 100,000) genotype pairs with largest genotype pair frequency differences between cases and controls, we determine the number N_u of unique variants occurring in this list. For each unique variant, we find the number of genotype pairs it participates in, which identifies a set of variants "connected" with the given unique variant. Among the total of variants "connected" with all unique variants, only a subset of variants is unique. The ratio of all connected variants divided by that subset of variants is a measure for the overall density or connectedness of variants interacting with each other. We find that variants for the AMD data are much more interconnected than those for PD, at least based on the 100,000 genotype pairs with largest chi-square we investigated. Further, for each of the N_u unique variants, we use the number of variants connected with it as a test statistic, weighted by the inverse of the rank at which the unique variant first occurred in the original list of genotype patterns. This weighting scheme ties the number of connections to the genetics of the trait and allows us to obtain, for each of the N_u unique variants, an empirical significance level by permuting ranks. We find 12 and 8 significant, highly connected variants for AMD and PD, respectively, some of which have previously been identified by other machine learning methods, thus providing credence to our approach. Among the 100,000 genotype pairs investigated for each of AMD and PD, significant variants showed connections with up to 7,093 and 3,777 other variants, respectively. Our approach has been implemented in a freely available piece of software, the Digenic Network Test. Thus, our statistical genetics method can provide important information on the genetic architecture of polygenic traits.
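The counting steps in this abstract (unique variants, variants connected to each, overall density, and the rank-weighted statistic) can be sketched as follows. This is one reading of the abstract, not the Digenic Network Test software: the function name, the input layout (a list of variant pairs ordered from strongest to weakest case-control difference), and the toy variant IDs are all assumptions, and the rank-permutation step that yields empirical significance levels is omitted.

```python
from collections import defaultdict


def connectivity(ranked_pairs):
    """Summarize how variants are connected within a ranked list of pairs.

    ranked_pairs: list of (variant_a, variant_b) tuples, best-ranked first
                  (hypothetical layout; rank 1 is the first element).
    Returns (n_unique, density, weighted), where weighted maps each variant
    to its number of connected variants divided by its first-occurrence rank.
    """
    if not ranked_pairs:
        raise ValueError("ranked_pairs must not be empty")
    partners = defaultdict(set)  # variant -> set of variants paired with it
    first_rank = {}              # variant -> rank of its first appearance
    for rank, (a, b) in enumerate(ranked_pairs, start=1):
        partners[a].add(b)
        partners[b].add(a)
        first_rank.setdefault(a, rank)
        first_rank.setdefault(b, rank)
    n_unique = len(partners)
    # All connected variants, counted once per variant they connect to,
    # divided by the number of distinct variants involved.
    total_connected = sum(len(p) for p in partners.values())
    density = total_connected / n_unique
    weighted = {v: len(partners[v]) / first_rank[v] for v in partners}
    return n_unique, density, weighted
```

For example, `connectivity([("rs1", "rs2"), ("rs1", "rs3"), ("rs2", "rs3"), ("rs1", "rs4")])` finds four unique variants with a density of 2.0, and "rs1" gets the largest weighted statistic because it is connected to three variants and first appears at rank 1.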
Affiliation(s)
| | - Jurg Ott
- Rockefeller University, New York
| |
|