1
Jin J, Li B, Wang X, Yang X, Li Y, Wang R, Ye C, Shu J, Fan Z, Xue F, Ge T, Ritchie MD, Pasaniuc B, Wojcik G, Zhao B. PennPRS: a centralized cloud computing platform for efficient polygenic risk score training in precision medicine. medRxiv 2025:2025.02.07.25321875. [PMID: 39990574 PMCID: PMC11844566 DOI: 10.1101/2025.02.07.25321875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2025]
Abstract
Polygenic risk scores (PRS) are becoming increasingly vital for risk prediction and stratification in precision medicine. However, PRS model training presents significant challenges for broader adoption of PRS, including limited access to computational resources, difficulties in implementing advanced PRS methods, and availability and privacy concerns over individual-level genetic data. Cloud computing provides a promising solution with centralized computing and data resources. Here we introduce PennPRS (https://pennprs.org), a scalable cloud computing platform for online PRS model training in precision medicine. We developed novel pseudo-training algorithms for multiple PRS methods and ensemble approaches, enabling model training without requiring individual-level data. These methods were rigorously validated through extensive simulations and large-scale real data analyses involving over 6,000 phenotypes across various data sources. PennPRS supports online single- and multi-ancestry PRS training with seven methods, allowing users to upload their own data or query from more than 27,000 datasets in the GWAS Catalog, submit jobs, and download trained PRS models. Additionally, we applied our pseudo-training pipeline to train PRS models for over 8,000 phenotypes and made their PRS weights publicly accessible. In summary, PennPRS provides a novel cloud computing solution to improve the accessibility of PRS applications and reduce disparities in computational resources for the global PRS research community.
Affiliation(s)
- Jin Jin
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Penn Center for Eye-Brain Health, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Bingxuan Li
  - UCLA Samueli School of Engineering, Los Angeles, CA 90095, USA
- Xiyao Wang
  - Department of Computer Science, Columbia University, New York, NY 10027, USA
- Xiaochen Yang
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
- Yujue Li
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
- Ruofan Wang
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Chenglong Ye
  - Department of Statistics, University of Kentucky, Lexington, KY 40536, USA
- Juan Shu
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- Zirui Fan
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- Fei Xue
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
- Tian Ge
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Marylyn D. Ritchie
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Penn Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Bogdan Pasaniuc
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Genevieve Wojcik
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Bingxin Zhao
  - Penn Center for Eye-Brain Health, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Penn Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
2
Zhao Z, Dorn S, Wu Y, Yang X, Jin J, Lu Q. One score to rule them all: regularized ensemble polygenic risk prediction with GWAS summary statistics. bioRxiv 2024:2024.11.27.625748. [PMID: 39677614 PMCID: PMC11642782 DOI: 10.1101/2024.11.27.625748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Ensemble learning has been increasingly popular for boosting the predictive power of polygenic risk scores (PRS), with almost every recent multi-ancestry PRS approach employing ensemble learning as a final step. Existing ensemble approaches rely on individual-level data for model training, which severely limits their real-world applications, especially in non-European populations without sufficient genomic samples. Here, we introduce a statistical framework to construct regularized ensemble PRS, which allows us to combine a large number of candidate PRS models using only summary statistics from genome-wide association studies. We demonstrate its robust and substantial improvement over many existing PRS models in both within- and cross-ancestry applications. We believe this is truly "one score to rule them all" due to its capability to continuously combine newly developed PRS models with existing models to improve prediction performance, which makes it a universal approach that should always be employed in future PRS applications.
Affiliation(s)
- Zijie Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Stephen Dorn
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Yuchang Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Xiaoyu Yang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Jin Jin
  - Department of Biostatistics, Epidemiology and Bioinformatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
- Qiongshi Lu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI
3
Zhao Z, Gruenloh T, Yan M, Wu Y, Sun Z, Miao J, Wu Y, Song J, Lu Q. Optimizing and benchmarking polygenic risk scores with GWAS summary statistics. Genome Biol 2024; 25:260. [PMID: 39379999 PMCID: PMC11462675 DOI: 10.1186/s13059-024-03400-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/11/2023] [Accepted: 09/23/2024] [Indexed: 10/10/2024] [Open Access]
Abstract
BACKGROUND: Polygenic risk score (PRS) is a major research topic in human genetics. However, a significant gap exists between PRS methodology and applications in practice due to often unavailable individual-level data for various PRS tasks including model fine-tuning, benchmarking, and ensemble learning.
RESULTS: We introduce an innovative statistical framework to optimize and benchmark PRS models using summary statistics of genome-wide association studies. This framework builds upon our previous work and can fine-tune virtually all existing PRS models while accounting for linkage disequilibrium. In addition, we provide an ensemble learning strategy named PUMAS-ensemble to combine multiple PRS models into an ensemble score without requiring external data for model fitting. Through extensive simulations and analysis of many complex traits in the UK Biobank, we demonstrate that this approach closely approximates gold-standard analytical strategies based on external validation, and substantially outperforms state-of-the-art PRS methods.
CONCLUSIONS: Our method is a powerful and general modeling technique that can continue to combine the best-performing PRS methods out there through ensemble learning and could become an integral component for all future PRS applications.
Affiliation(s)
- Zijie Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Tim Gruenloh
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Meiyi Yan
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA
- Yixuan Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Zhongxuan Sun
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Jiacheng Miao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Yuchang Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
  - Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, USA
- Jie Song
  - Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, USA
- Qiongshi Lu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA
  - Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, USA
4
Xu L, Zhou G, Jiang W, Guan L, Zhao H. Leveraging genetic correlations and multiple populations to improve genetic risk prediction for non-European populations. Research Square 2023:rs.3.rs-3741763. [PMID: 38234764 PMCID: PMC10793485 DOI: 10.21203/rs.3.rs-3741763/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
The disparity in genetic risk prediction accuracy between European and non-European individuals highlights a critical challenge in health inequality. To bridge this gap, we introduce JointPRS, a novel method that models multiple populations jointly to improve genetic risk predictions for non-European individuals. JointPRS has three key features. First, it encompasses all diverse populations to improve prediction accuracy, rather than relying solely on the target population with a single auxiliary European group. Second, it autonomously estimates and leverages chromosome-wise cross-population genetic correlations to infer the effect sizes of genetic variants. Lastly, it provides an auto version that has comparable performance to the tuning version to accommodate the situation with no validation dataset. Through extensive simulations and real data applications to 22 quantitative traits and four binary traits in East Asian populations, nine quantitative traits and one binary trait in African populations, and four quantitative traits in South Asian populations, we demonstrate that JointPRS outperforms state-of-the-art methods, improving the prediction accuracy for both quantitative and binary traits in non-European populations.
Affiliation(s)
- Leqi Xu
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Geyu Zhou
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Wei Jiang
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Leying Guan
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA