1
Dutta D, Chatterjee N. Expanding scope of genetic studies in the era of biobanks. Hum Mol Genet 2025:ddaf054. [PMID: 40312842 DOI: 10.1093/hmg/ddaf054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2025] [Revised: 03/25/2025] [Accepted: 04/08/2025] [Indexed: 05/03/2025] Open
Abstract
Biobanks have become pivotal in genetic research, particularly through genome-wide association studies (GWAS), driving transformative insights into the genetic basis of complex diseases and traits through the integration of genetic data with phenotypic, environmental, family history, and behavioral information. This review explores the distinct design and utility of different biobanks, highlighting their unique contributions to genetic research. We further discuss the utility of, and methodological advances in, combining data from disease-specific studies or consortia with data from biobanks, focusing especially on summary-statistics-based meta-analysis. Subsequently, we review the spectrum of additional advantages offered by biobanks in genetic studies: representing population differences, calibrating polygenic scores, assessing pleiotropy, and improving post-GWAS in silico analyses. Advances in sequencing technologies, particularly whole-exome and whole-genome sequencing, have further enabled the discovery of rare variants at biobank scale. Among recent developments, the integration of large-scale multi-omics data, especially proteomics and metabolomics, within biobanks provides deeper insights into disease mechanisms and regulatory pathways. Despite challenges such as ascertainment strategies and phenotypic misclassification, biobanks continue to evolve, driving methodological innovation and enabling precision medicine. We highlight the contributions of biobanks to genetic research and their growing integration with multi-omics, and finally discuss their future potential for advancing healthcare and therapeutic development.
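As a rough illustration of the summary-statistics-based meta-analysis mentioned in this abstract, the sketch below combines per-study effect estimates for a single variant with fixed-effect inverse-variance weighting. This is a generic textbook estimator, not a method taken from the review; the function name `ivw_meta` and the numbers are hypothetical, and the sketch assumes the contributing studies share no samples.

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis for one variant.

    betas: per-study effect estimates; ses: their standard errors.
    Assumes independent studies (no sample overlap).
    """
    weights = [1.0 / se ** 2 for se in ses]        # precision of each study
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)                    # SE of the pooled estimate
    return beta, se, beta / se                     # pooled effect, SE, z-score

# Hypothetical inputs: a consortium estimate and a biobank estimate
beta, se, z = ivw_meta([0.10, 0.20], [0.05, 0.05])
print(round(beta, 4), round(se, 4), round(z, 2))
```

With equal standard errors the pooled effect is simply the average of the two estimates; in general, more precise studies receive more weight.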
Affiliation(s)
- Diptavo Dutta
  - Integrative Tumor Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD, 20879, United States
- Nilanjan Chatterjee
  - Department of Biostatistics, Johns Hopkins University, 615 N Wolfe Street, Baltimore, MD, 21205, United States
  - Department of Oncology, Johns Hopkins University, 615 N Wolfe Street, Baltimore, MD, 21205, United States
2
Shao Z, Tang W, Wu H, Kong Y, Hao X. Incorporating multiple functional annotations to improve polygenic risk prediction accuracy. Cell Genomics 2025:100850. [PMID: 40239655 DOI: 10.1016/j.xgen.2025.100850] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2024] [Revised: 01/21/2025] [Accepted: 03/18/2025] [Indexed: 04/18/2025]
Abstract
We present OmniPRS, a scalable biobank-scale framework that improves genetic risk prediction for complex traits by integrating genome-wide association study (GWAS) summary statistics and functional annotations. It employs a mixed model incorporating tissue-specific genetic variance components from annotations to re-estimate single-nucleotide polymorphism (SNP) effects, constructs tissue-specific polygenic risk scores (PRSs), and aggregates them into the final OmniPRS. Our experiments, encompassing 135 simulation scenarios and 11 representative traits, demonstrate that OmniPRS is flexible and robust, delivering efficient and accurate predictions comparable to ten leading PRS methods. For quantitative (binary) traits, OmniPRS achieved an average improvement of 52.31% (19.83%) versus the clumping and thresholding (C+T) method, 3.92% (1.31%) versus the annotation-integrated PRSs (LDpred-funct), and 8.44% (2.27%) versus the Bayesian-based PRSs (PRScs). Notably, it achieved 35× faster computation than PRScs. This rapid, precise framework enables efficient polygenic risk scoring with multi-annotation integration for large-scale genomic studies.
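To make the "construct tissue-specific scores, then aggregate" step more concrete, here is a minimal sketch of a polygenic score as a weighted sum of allele dosages, followed by a weighted combination of per-tissue scores. It is not the OmniPRS implementation: the abstract does not state how SNP effects are re-estimated or how aggregation weights are chosen, so the effect sizes, tissue names, and equal weights below are hypothetical placeholders.

```python
def polygenic_score(dosages, effects):
    """Weighted sum of allele dosages (0, 1, or 2 copies) and per-SNP effects."""
    if len(dosages) != len(effects):
        raise ValueError("dosages and effects must have the same length")
    return sum(d * b for d, b in zip(dosages, effects))

def aggregate_scores(scores, weights):
    """Combine per-tissue scores with given weights (placeholder weights here)."""
    return sum(weights[name] * score for name, score in scores.items())

# Hypothetical individual with three SNPs and two sets of tissue-specific effects
dosages = [0, 1, 2]
tissue_effects = {
    "liver": [0.10, -0.20, 0.05],
    "blood": [0.00, 0.30, 0.10],
}
scores = {t: polygenic_score(dosages, eff) for t, eff in tissue_effects.items()}

# Equal weights are an assumption made purely for illustration
final_score = aggregate_scores(scores, {"liver": 0.5, "blood": 0.5})
print(scores, final_score)
```

In practice the aggregation weights would be learned or derived from data rather than fixed by hand.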
Affiliation(s)
- Zhonghe Shao
  - Department of Epidemiology and Biostatistics, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei 430030, China
- Wangxia Tang
  - Department of Epidemiology and Biostatistics, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei 430030, China
- Hongji Wu
  - Department of Epidemiology and Biostatistics, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei 430030, China
- Yifan Kong
  - Department of Epidemiology and Biostatistics, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei 430030, China
- Xingjie Hao
  - Department of Epidemiology and Biostatistics, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei 430030, China
3
Teng J, Zhai T, Zhang X, Zhao C, Wang W, Tang H, Ning C, Shang Y, Wang D, Zhang Q. Improving multi-trait genomic prediction by incorporating local genetic correlations. Commun Biol 2025; 8:307. [PMID: 40000888 PMCID: PMC11861333 DOI: 10.1038/s42003-025-07721-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Accepted: 02/11/2025] [Indexed: 02/27/2025] Open
Abstract
Genomic prediction holds significant potential for advancing precision medicine in humans, as well as accelerating genetic improvement in animals and plants. For multi-trait prediction, conventional multi-trait models are primarily based on global genetic correlations between traits. With the development of local genetic correlation (LGC) estimation methods, it is now possible to analyze LGCs confined to specific genomic regions, and incorporating LGCs into multi-trait prediction models is expected to enhance prediction ability. Here, we propose three models to address this issue and evaluate their performance using simulated data and three real datasets from human, cow, and pig populations. Our results demonstrate that LGCs are heterogeneous across the genome and that incorporating LGCs in multi-trait prediction increases the prediction accuracy by an average of 12.76% ± 2.07% compared to the conventional multi-trait genomic prediction method (MTGBLUP) in the real datasets. Our findings highlight the importance of considering LGCs in improving multi-trait genomic prediction.
Affiliation(s)
- Jun Teng
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
  - Shandong Futeng Food Co. Ltd., Zaozhuang, China
- Tingting Zhai
  - National Key Laboratory of Wheat Improvement, College of Life Science, Shandong Agricultural University, Tai'an, China
- Xinyi Zhang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Changheng Zhao
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Wenwen Wang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Hui Tang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Chao Ning
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Yingli Shang
  - College of Veterinary Medicine, Shandong Agricultural University, Tai'an, China
- Dan Wang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
- Qin Zhang
  - Shandong Provincial Key Laboratory for Livestock Germplasm Innovation & Utilization, College of Animal Science and Technology, Shandong Agricultural University, Tai'an, China
4
Wang C, Markus H, Diwadkar AR, Khunsriraksakul C, Carrel L, Li B, Zhong X, Wang X, Zhan X, Foulke GT, Olsen NJ, Liu DJ, Jiang B. Integrating electronic health records and GWAS summary statistics to predict the progression of autoimmune diseases from preclinical stages. Nat Commun 2025; 16:180. [PMID: 39747168 PMCID: PMC11695684 DOI: 10.1038/s41467-024-55636-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Accepted: 12/18/2024] [Indexed: 01/04/2025] Open
Abstract
Autoimmune diseases often exhibit a preclinical stage before diagnosis. Electronic health record (EHR)-based biobanks contain genetic data and diagnostic information, which can identify preclinical individuals at risk for progression. Biobanks typically have small numbers of cases, which are not sufficient to construct accurate polygenic risk scores (PRS). Importantly, progression and case-control phenotypes may have a shared genetic basis, which we can exploit to improve prediction accuracy. We propose a novel method, Genetic Progression Score (GPS), that integrates biobank and case-control studies to predict the disease progression risk. Via penalized regression, GPS incorporates PRS weights for case-control studies as a prior and forces model parameters to be similar to the prior if the prior improves prediction accuracy. In simulations, GPS consistently yields better prediction accuracy than alternative strategies relying on biobank or case-control samples only and those combining biobank and case-control samples. The improvement is particularly evident when the biobank sample is smaller or the genetic correlation is lower. We derive PRS for the progression from preclinical rheumatoid arthritis and systemic lupus erythematosus in the BioVU biobank and validate them in All of Us. For both diseases, GPS achieves the highest prediction R² and the resulting PRS yields the strongest correlation with progression prevalence.
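The idea of a penalized regression that pulls coefficients toward prior weights can be sketched as follows. This is only an illustration under stated assumptions, not the GPS algorithm: the abstract does not give the penalty's form, so the sketch assumes a squared (ridge-type) penalty, minimizing 0.5·‖y − Xb‖² + 0.5·λ·‖b − prior‖² by coordinate descent. The function name `fit_with_prior`, the toy data, and the choice of λ are hypothetical; in practice λ would be tuned, for example on validation data, so that the prior is trusted only when it helps prediction.

```python
def fit_with_prior(X, y, prior, lam, sweeps=200):
    """Minimize 0.5*||y - X b||^2 + 0.5*lam*||b - prior||^2 by coordinate descent.

    X: list of rows (samples x features); prior: external weights, e.g. from a
    case-control PRS. lam = 0 gives least squares; large lam returns ~prior.
    """
    n, p = len(X), len(prior)
    b = list(prior)                                  # start at the prior weights
    resid = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            denom = sum(c * c for c in col) + lam
            if denom == 0.0:
                raise ValueError("all-zero column with lam = 0")
            # Inner product of column j with the residual that excludes feature j
            rho = sum(col[i] * (resid[i] + col[i] * b[j]) for i in range(n))
            new_bj = (rho + lam * prior[j]) / denom
            delta = new_bj - b[j]
            resid = [resid[i] - col[i] * delta for i in range(n)]
            b[j] = new_bj
    return b

# Toy data (hypothetical): orthogonal design, so the effect of lam is easy to see
X = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 4.0]
b_shrunk = fit_with_prior(X, y, prior=[0.0, 0.0], lam=1.0)   # halfway to the prior
b_prior = fit_with_prior(X, y, prior=[5.0, -1.0], lam=1e9)   # essentially the prior
print(b_shrunk, b_prior)
```

With λ = 1 and a zero prior on this orthogonal toy design, each coefficient lands halfway between the least-squares fit and the prior; as λ grows, the fit collapses onto the prior weights.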
Affiliation(s)
- Chen Wang
  - Bioinformatics and Genomics Graduate Program, College of Medicine, Penn State University, Hershey, PA, USA
  - Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA
- Havell Markus
  - Bioinformatics and Genomics Graduate Program, College of Medicine, Penn State University, Hershey, PA, USA
- Avantika R Diwadkar
  - Bioinformatics and Genomics Graduate Program, College of Medicine, Penn State University, Hershey, PA, USA
  - Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA
- Chachrit Khunsriraksakul
  - Bioinformatics and Genomics Graduate Program, College of Medicine, Penn State University, Hershey, PA, USA
- Laura Carrel
  - Department of Biochemistry and Molecular Biology, College of Medicine, Penn State University, Hershey, PA, USA
- Bingshan Li
  - Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN, USA
- Xue Zhong
  - Department of Medicine, Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Xingyan Wang
  - Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA
- Xiaowei Zhan
  - Department of Statistical Science, Southern Methodist University, Dallas, TX, USA
  - Department of Population and Data Sciences, Quantitative Biomedical Research Center, Southwestern Medical Center University of Texas, Dallas, TX, USA
  - Center for Genetics of Host Defense, Southwestern Medical Center University of Texas, Dallas, TX, USA
- Galen T Foulke
  - Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA
  - Department of Dermatology, College of Medicine, Penn State University, Hershey, PA, USA
- Nancy J Olsen
  - Department of Medicine, College of Medicine, Penn State University, Hershey, PA, USA
- Dajiang J Liu
  - Bioinformatics and Genomics Graduate Program, College of Medicine, Penn State University, Hershey, PA, USA
  - Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA
- Bibo Jiang
  - Department of Public Health Sciences, College of Medicine, Penn State University, Hershey, PA, USA
5
Kunkel D, Sørensen P, Shankar V, Morgante F. Improving polygenic prediction from summary data by learning patterns of effect sharing across multiple phenotypes. PLoS Genet 2025; 21:e1011519. [PMID: 39775068 PMCID: PMC11741642 DOI: 10.1371/journal.pgen.1011519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 01/17/2025] [Accepted: 11/27/2024] [Indexed: 01/11/2025] Open
Abstract
Polygenic prediction of complex trait phenotypes has become important in human genetics, especially in the context of precision medicine. Recently, mr.mash, a flexible and computationally efficient method that models multiple phenotypes jointly and leverages sharing of effects across such phenotypes to improve prediction accuracy, was introduced. However, a drawback of mr.mash is that it requires individual-level data, which are often not publicly available. In this work, we introduce mr.mash-rss, an extension of the mr.mash model that requires only summary statistics from Genome-Wide Association Studies (GWAS) and linkage disequilibrium (LD) estimates from a reference panel. By using summary data, we achieve the twin goal of increasing the applicability of the mr.mash model to data sets that are not publicly available and making it scalable to biobank-size data. Through simulations, we show that mr.mash-rss is competitive with, and often outperforms, current state-of-the-art methods for single- and multi-phenotype polygenic prediction in a variety of scenarios that differ in the pattern of effect sharing across phenotypes, the number of phenotypes, the number of causal variants, and the genomic heritability. We also present a real data analysis of 16 blood cell phenotypes in the UK Biobank, showing that mr.mash-rss achieves higher prediction accuracy than competing methods for the majority of traits, especially when the data set has smaller sample size.
Affiliation(s)
- Deborah Kunkel
  - School of Mathematical and Statistical Sciences, Clemson University, Clemson, South Carolina, United States of America
- Peter Sørensen
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
- Vijay Shankar
  - Center for Human Genetics, Clemson University, Greenwood, South Carolina, United States of America
- Fabio Morgante
  - Center for Human Genetics, Clemson University, Greenwood, South Carolina, United States of America
  - Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina, United States of America
6
Zhao Z, Dorn S, Wu Y, Yang X, Jin J, Lu Q. One score to rule them all: regularized ensemble polygenic risk prediction with GWAS summary statistics. bioRxiv 2024:2024.11.27.625748. [PMID: 39677614 PMCID: PMC11642782 DOI: 10.1101/2024.11.27.625748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Ensemble learning has been increasingly popular for boosting the predictive power of polygenic risk scores (PRS), with almost every recent multi-ancestry PRS approach employing ensemble learning as a final step. Existing ensemble approaches rely on individual-level data for model training, which severely limits their real-world applications, especially in non-European populations without sufficient genomic samples. Here, we introduce a statistical framework to construct regularized ensemble PRS, which allows us to combine a large number of candidate PRS models using only summary statistics from genome-wide association studies. We demonstrate its robust and substantial improvement over many existing PRS models in both within- and cross-ancestry applications. We believe this is truly "one score to rule them all" due to its capability to continuously combine newly developed PRS models with existing models to improve prediction performance, which makes it a universal approach that should always be employed in future PRS applications.
Affiliation(s)
- Zijie Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Stephen Dorn
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Yuchang Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Xiaoyu Yang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Jin Jin
  - Department of Biostatistics, Epidemiology and Bioinformatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
- Qiongshi Lu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI
7
Xu L, Zhou G, Jiang W, Zhang H, Dong Y, Guan L, Zhao H. JointPRS: A Data-Adaptive Framework for Multi-Population Genetic Risk Prediction Incorporating Genetic Correlation. bioRxiv 2024:2023.10.29.564615. [PMID: 37961111 PMCID: PMC10634936 DOI: 10.1101/2023.10.29.564615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Genetic prediction accuracy for non-European populations is hindered by the limited sample size of genome-wide association study (GWAS) data in these populations. Additionally, it is challenging to tune model parameters with a small tuning dataset for methods that require tuning data, which is often the case for non-European samples. To address these challenges, we propose JointPRS, a novel, data-adaptive framework that simultaneously models multiple populations using GWAS summary statistics. JointPRS incorporates genetic correlation structures into the prediction framework, enabling accurate performance even without individual-level tuning data. Additionally, it uniquely employs a data-adaptive approach, providing a robust solution when only a small tuning dataset is available. Through extensive simulations and real data applications to 22 quantitative traits and four binary traits in five continental populations (European (EUR), East Asian (EAS), African (AFR), South Asian (SAS), and Admixed American (AMR)) evaluated using the UK Biobank (UKBB) and All of Us (AoU), we demonstrate that JointPRS outperforms six other state-of-the-art methods across three different data scenarios (no tuning data, tuning and testing data from the same cohort, and tuning and testing data from different cohorts) for most traits in non-European populations, while maintaining model simplicity and computational efficiency.
Affiliation(s)
- Leqi Xu
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Geyu Zhou
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Wei Jiang
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Haoyu Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Yikai Dong
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Leying Guan
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
8
Mbatchou J, McPeek MS. JASPER: Fast, powerful, multitrait association testing in structured samples gives insight on pleiotropy in gene expression. Am J Hum Genet 2024; 111:1750-1769. [PMID: 39025064 PMCID: PMC11339629 DOI: 10.1016/j.ajhg.2024.06.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 06/19/2024] [Accepted: 06/20/2024] [Indexed: 07/20/2024] Open
Abstract
Joint association analysis of multiple traits with multiple genetic variants can provide insight into genetic architecture and pleiotropy, improve trait prediction, and increase power for detecting association. Furthermore, some traits are naturally high-dimensional, e.g., images, networks, or longitudinally measured traits. Assessing significance for multitrait genetic association can be challenging, especially when the sample has population sub-structure and/or related individuals. Failure to adequately adjust for sample structure can lead to power loss and inflated type 1 error, and commonly used methods for assessing significance can work poorly with a large number of traits or be computationally slow. We developed JASPER, a fast, powerful, robust method for assessing significance of multitrait association with a set of genetic variants, in samples that have population sub-structure, admixture, and/or relatedness. In simulations, JASPER has higher power, better type 1 error control, and faster computation than existing methods, with the power and speed advantage of JASPER increasing with the number of traits. JASPER is potentially applicable to a wide range of association testing applications, including for multiple disease traits, expression traits, image-derived traits, and microbiome abundances. It allows for covariates, ascertainment, and rare variants and is robust to phenotype model misspecification. We apply JASPER to analyze gene expression in the Framingham Heart Study, where, compared to alternative approaches, JASPER finds more significant associations, including several that indicate pleiotropic effects, most of which replicate previous results, while others have not previously been reported. Our results demonstrate the promise of JASPER for powerful multitrait analysis in structured samples.
Affiliation(s)
- Joelle Mbatchou
  - Regeneron Genetics Center, Tarrytown, NY 10591, USA
  - Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
- Mary Sara McPeek
  - Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
  - Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA
9
Kunkel D, Sørensen P, Shankar V, Morgante F. Improving polygenic prediction from summary data by learning patterns of effect sharing across multiple phenotypes. bioRxiv 2024:2024.05.06.592745. [PMID: 38766136 PMCID: PMC11100663 DOI: 10.1101/2024.05.06.592745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Polygenic prediction of complex trait phenotypes has become important in human genetics, especially in the context of precision medicine. Recently, Morgante et al. introduced mr.mash, a flexible and computationally efficient method that models multiple phenotypes jointly and leverages sharing of effects across such phenotypes to improve prediction accuracy. However, a drawback of mr.mash is that it requires individual-level data, which are often not publicly available. In this work, we introduce mr.mash-rss, an extension of the mr.mash model that requires only summary statistics from Genome-Wide Association Studies (GWAS) and linkage disequilibrium (LD) estimates from a reference panel. By using summary data, we achieve the twin goal of increasing the applicability of the mr.mash model to data sets that are not publicly available and making it scalable to biobank-size data. Through simulations, we show that mr.mash-rss is competitive with, and often outperforms, current state-of-the-art methods for single- and multi-phenotype polygenic prediction in a variety of scenarios that differ in the pattern of effect sharing across phenotypes, the number of phenotypes, the number of causal variants, and the genomic heritability. We also present a real data analysis of 16 blood cell phenotypes in UK Biobank, showing that mr.mash-rss achieves higher prediction accuracy than competing methods for the majority of traits, especially when the data has smaller sample size.
Affiliation(s)
- Deborah Kunkel
  - School of Mathematical and Statistical Sciences, Clemson University, Clemson, SC, United States of America
- Peter Sørensen
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
- Vijay Shankar
  - Center for Human Genetics, Clemson University, Greenwood, SC, United States of America
- Fabio Morgante
  - Center for Human Genetics, Clemson University, Greenwood, SC, United States of America
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC, United States of America
10
Xu L, Zhou G, Jiang W, Guan L, Zhao H. Leveraging genetic correlations and multiple populations to improve genetic risk prediction for non-European populations. Research Square 2023:rs.3.rs-3741763. [PMID: 38234764 PMCID: PMC10793485 DOI: 10.21203/rs.3.rs-3741763/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
The disparity in genetic risk prediction accuracy between European and non-European individuals highlights a critical challenge in health inequality. To bridge this gap, we introduce JointPRS, a novel method that models multiple populations jointly to improve genetic risk predictions for non-European individuals. JointPRS has three key features. First, it encompasses all diverse populations to improve prediction accuracy, rather than relying solely on the target population with a single auxiliary European group. Second, it automatically estimates and leverages chromosome-wise cross-population genetic correlations to infer the effect sizes of genetic variants. Lastly, it provides an auto version whose performance is comparable to that of the tuning version, to accommodate situations with no validation dataset. Through extensive simulations and real data applications to 22 quantitative traits and four binary traits in East Asian populations, nine quantitative traits and one binary trait in African populations, and four quantitative traits in South Asian populations, we demonstrate that JointPRS outperforms state-of-the-art methods, improving the prediction accuracy for both quantitative and binary traits in non-European populations.
Affiliation(s)
- Leqi Xu
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Geyu Zhou
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Wei Jiang
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Leying Guan
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
11
Mbatchou J, McPeek MS. JASPER: fast, powerful, multitrait association testing in structured samples gives insight on pleiotropy in gene expression. bioRxiv 2023:2023.12.18.571948. [PMID: 38187553 PMCID: PMC10769254 DOI: 10.1101/2023.12.18.571948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Joint association analysis of multiple traits with multiple genetic variants can provide insight into genetic architecture and pleiotropy, improve trait prediction and increase power for detecting association. Furthermore, some traits are naturally high-dimensional, e.g., images, networks or longitudinally measured traits. Assessing significance for multitrait genetic association can be challenging, especially when the sample has population sub-structure and/or related individuals. Failure to adequately adjust for sample structure can lead to power loss and inflated type 1 error, and commonly used methods for assessing significance can work poorly with a large number of traits or be computationally slow. We developed JASPER, a fast, powerful, robust method for assessing significance of multitrait association with a set of genetic variants, in samples that have population sub-structure, admixture and/or relatedness. In simulations, JASPER has higher power, better type 1 error control, and faster computation than existing methods, with the power and speed advantage of JASPER increasing with the number of traits. JASPER is potentially applicable to a wide range of association testing applications, including for multiple disease traits, expression traits, image-derived traits and microbiome abundances. It allows for covariates, ascertainment and rare variants and is robust to phenotype model misspecification. We apply JASPER to analyze gene expression in the Framingham Heart Study, where, compared to alternative approaches, JASPER finds more significant associations, including several that indicate pleiotropic effects, some of which replicate previous results, while others have not previously been reported. Our results demonstrate the promise of JASPER for powerful multitrait analysis in structured samples.
Affiliation(s)
- Joelle Mbatchou
  - Regeneron Genetics Center, Tarrytown, NY 10591, USA
  - Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
- Mary Sara McPeek
  - Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
  - Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA