1
Guardado M, Perez C, Campana S, Chavez Rojas B, Magaña J, Jackson S, Samperio E, Hernandez S, Syas K, Hernandez RD, Zavala EI, Rohlfs RV. py_ped_sim: a flexible forward pedigree and genetic simulator for complex family pedigree analysis. BMC Bioinformatics 2025; 26:122. [PMID: 40335952 PMCID: PMC12060417 DOI: 10.1186/s12859-025-06142-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2024] [Accepted: 04/14/2025] [Indexed: 05/09/2025] Open
Abstract
BACKGROUND Large-scale family pedigrees are commonly used across medical, evolutionary, and forensic genetics. These pedigrees are tools for identifying genetic disorders, tracking evolutionary patterns, and establishing familial relationships via forensic genetic identification. However, there is a lack of software to accurately simulate different pedigree structures along with genomes corresponding to those individuals in a family pedigree. This limits simulation-based evaluations of methods that use pedigrees. RESULTS We have developed a python command-line-based tool called py_ped_sim that facilitates the simulation of pedigree structures and the genomes of individuals in a pedigree. py_ped_sim represents pedigrees as directed acyclic graphs, enabling conversion between standard pedigree formats and integration with the forward population genetic simulator, SLiM. Notably, py_ped_sim allows the simulation of varying numbers of offspring for a set of parents, with the capacity to shift the distribution of sibship sizes over generations. We additionally add simulations for events of misattributed paternity, which offers a way to simulate half-sibling relationships, and simulations to extend the breadth of a family pedigree. We validated the accuracy of both our genome simulator and pedigree simulator. We show that we can simulate genomes onto family pedigrees with levels of expected kinship. CONCLUSIONS py_ped_sim is a user-friendly and open-source solution for simulating pedigree structures and conducting pedigree genome simulations. It empowers medical, forensic, and evolutionary genetics researchers to gain deeper insights into the dynamics of genetic inheritance and relatedness within families.
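As a rough illustration of what "expected kinship" means on a pedigree stored as a directed acyclic graph, the sketch below applies the standard recursive definition of the kinship coefficient to a toy pedigree. It is not py_ped_sim's code: the `PEDIGREE` mapping, the individual labels, and the function names are hypothetical and chosen only for this example.

```python
from functools import lru_cache

# Hypothetical toy pedigree: child -> (father, mother).
# Founders have unknown parents, written as (None, None).
PEDIGREE = {
    "gf": (None, None),
    "gm": (None, None),
    "spouse": (None, None),
    "a": ("gf", "gm"),
    "b": ("gf", "gm"),
    "c": ("a", "spouse"),
}


def generation(ind):
    """Longest path from any founder to `ind` in the pedigree DAG."""
    known = [p for p in PEDIGREE[ind] if p is not None]
    if not known:
        return 0
    return 1 + max(generation(p) for p in known)


@lru_cache(maxsize=None)
def kinship(i, j):
    """Expected kinship coefficient between individuals i and j."""
    if i is None or j is None:
        # An unknown parent is treated as unrelated to everyone.
        return 0.0
    if i == j:
        father, mother = PEDIGREE[i]
        return 0.5 * (1.0 + kinship(father, mother))
    # Recurse through the parents of the deeper individual, which
    # guarantees that individual is not an ancestor of the other one.
    if generation(i) < generation(j):
        i, j = j, i
    father, mother = PEDIGREE[i]
    return 0.5 * (kinship(father, j) + kinship(mother, j))


if __name__ == "__main__":
    for pair in [("a", "b"), ("a", "c"), ("b", "c"), ("gf", "c")]:
        print(pair, kinship(*pair))
```

Under these assumptions, full siblings and parent-offspring pairs come out at 0.25, and avuncular and grandparent-grandchild pairs at 0.125. These are the kind of expected values against which kinship estimated from simulated genomes could be compared.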
Affiliation(s)
- Miguel Guardado
  - Department of Mathematics, San Francisco State University, San Francisco, CA, 94132, USA
  - Biological and Medical Informatics Graduate Program, University of California San Francisco, San Francisco, CA, 94158, USA
  - Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, 94134, USA
  - Department of Data Science, University of Oregon, Eugene, OR, 97403, USA
- Cynthia Perez
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
- Sthen Campana
  - Department of Data Science, University of Oregon, Eugene, OR, 97403, USA
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
- Berenice Chavez Rojas
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
- Joaquín Magaña
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
- Shalom Jackson
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
- Emily Samperio
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
- Selena Hernandez
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
- Kaela Syas
  - Department of Mathematics, San Francisco State University, San Francisco, CA, 94132, USA
- Ryan D Hernandez
  - Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, 94134, USA
- Elena I Zavala
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
- Rori V Rohlfs
  - Department of Data Science, University of Oregon, Eugene, OR, 97403, USA
  - Department of Biology, San Francisco State University, San Francisco, CA, 94132, USA
2
Sun W. Integrative functional logistic regression model for genome-wide association studies. Comput Biol Med 2025; 187:109766. [PMID: 39919666 DOI: 10.1016/j.compbiomed.2025.109766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 01/08/2025] [Accepted: 01/28/2025] [Indexed: 02/09/2025]
Abstract
BACKGROUND Progress in rapid genomic sequencing techniques has transformed the field of disease biomarker identification by offering vast genetic information. The complexity of traits is influenced not only by single genetic loci but also by interactions among multiple genetic loci. When the dimensionality of SNP data is large, identifying a significant number of genetic variants associated with diseases becomes extremely challenging. To address these high-dimensionality issues, we employed functional data analysis techniques. METHODS Because many ordered genetic variants are spread across a small region, multiple genetic variants are treated as continuous data rather than as discrete variables in some regions. This paper introduces a novel approach for analyzing the association of multiple genes within a region by employing an integrative functional logistic regression model. RESULTS The proposed technique has shown promising results in both simulation and real data analysis, indicating its ability to generate smooth signals and accurately estimate the coefficients of the function while recognizing the null regions. CONCLUSIONS The integrative functional logistic regression method adopts functional data analysis and assumes that high-dimensional genetic data follow a continuous process. It not only naturally accommodates correlations among adjacent SNPs but also avoids the unstable estimation of a large number of parameters. This is especially desirable given the rapidly increasing dimensions of SNP data and the still limited sample sizes. In summary, the suggested approach offers a valuable new avenue for identifying disease-related genetic variants in GWAS.
Affiliation(s)
- Wenyuan Sun
  - Department of Mathematics, College of Science, Yanbian University, Yanji, 133002, Jilin, China
3
Ghosal S, Schatz MC, Venkataraman A. BEATRICE: Bayesian fine-mapping from summary data using deep variational inference. Bioinformatics 2024; 40:btae590. [PMID: 39360993 PMCID: PMC11496888 DOI: 10.1093/bioinformatics/btae590] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2024] [Revised: 08/30/2024] [Accepted: 10/01/2024] [Indexed: 10/09/2024] Open
Abstract
MOTIVATION We introduce a novel framework BEATRICE to identify putative causal variants from GWAS statistics. Identifying causal variants is challenging due to their sparsity and high correlation in the nearby regions. To account for these challenges, we rely on a hierarchical Bayesian model that imposes a binary concrete prior on the set of causal variants. We derive a variational algorithm for this fine-mapping problem by minimizing the KL divergence between an approximate density and the posterior probability distribution of the causal configurations. Correspondingly, we use a deep neural network as an inference machine to estimate the parameters of our proposal distribution. Our stochastic optimization procedure allows us to sample from the space of causal configurations, which we use to compute the posterior inclusion probabilities and determine credible sets for each causal variant. We conduct a detailed simulation study to quantify the performance of our framework against two state-of-the-art baseline methods across different numbers of causal variants and noise paradigms, as defined by the relative genetic contributions of causal and noncausal variants. RESULTS We demonstrate that BEATRICE achieves uniformly better coverage with comparable power and set sizes, and that the performance gain increases with the number of causal variants. We also show the efficacy of BEATRICE in finding causal variants from the GWAS study of Alzheimer's disease. In comparison to the baselines, only BEATRICE can successfully find the APOE ε2 allele, a commonly associated variant of Alzheimer's. AVAILABILITY AND IMPLEMENTATION BEATRICE is available for download at https://github.com/sayangsep/Beatrice-Finemapping.
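For readers unfamiliar with the binary concrete distribution named in the abstract, the sketch below draws relaxed (continuous) Bernoulli samples in its usual textbook form: logistic noise is added to a logit and the result is passed through a tempered sigmoid. This is only an illustration of the distribution, not BEATRICE's implementation; the function names, logits, and temperature are assumptions made for this example.

```python
import math
import random


def stable_sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_binary_concrete(logit, temperature, rng):
    """Draw one relaxed Bernoulli sample in [0, 1].

    logit       -- log-odds of the underlying Bernoulli (hypothetical input)
    temperature -- positive relaxation temperature; smaller values push
                   samples toward 0 or 1
    rng         -- a random.Random instance
    """
    # Clamp the uniform draw so the logarithms below stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((logit + logistic_noise) / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    # A sparse prior corresponds to strongly negative logits:
    # most relaxed samples then sit close to 0.
    draws = [sample_binary_concrete(-3.0, 0.5, rng) for _ in range(5)]
    print(draws)
```

Because each sample is a differentiable function of the logit, a relaxation of this kind can be trained with gradient-based stochastic optimization, which is the general setting the abstract describes.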
Affiliation(s)
- Sayan Ghosal
  - Chan Zuckerberg Initiative Foundation, Redwood City, CA 94065, United States
- Michael C Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States
- Archana Venkataraman
  - Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215, United States
4
Jewett EM. Simulating pedigrees ascertained on the basis of observed IBD sharing. bioRxiv: the preprint server for biology 2024:2024.05.13.594012. [PMID: 38872734 PMCID: PMC11170672 DOI: 10.1101/2024.05.13.594012] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/15/2024]
Abstract
In large genotyping datasets, individuals often have thousands of distant cousins with whom they share detectable segments of DNA identically by descent (IBD). The ability to simulate these distant relationships is important for developing and testing methods, carrying out power analyses, and performing population genetic analyses. Because distant relatives are unlikely to share detectable IBD segments by chance, many simulation replicates are needed to sample IBD between any given pair of distant relatives. Exponentially more samples are needed to simulate observable segments of IBD simultaneously among multiple pairs of distant relatives in a single pedigree. Using existing pedigree simulation methods that do not condition on the event that IBD is observed among certain pairs of relatives, the chances of sampling shared IBD patterns that reflect those observed in real data ascertained from large genotyping datasets are vanishingly small, even for pedigrees of modest size. Here, we show how to sample recombination breakpoints on a fixed pedigree while conditioning on the event that specified pairs of individuals share at least one observed segment of IBD. The resulting simulator makes it possible to sample genotypes and IBD segments on pedigrees that reflect those ascertained from biobank scale data.
5
Guardado M, Perez C, Jackson S, Magaña J, Campana S, Samperio E, Rojas BC, Hernandez S, Syas K, Hernandez R, Zavala EI, Rohlfs R. py_ped_sim - A flexible forward genetic simulator for complex family pedigree analysis. bioRxiv: the preprint server for biology 2024:2024.03.25.586501. [PMID: 38585824 PMCID: PMC10996500 DOI: 10.1101/2024.03.25.586501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Background Large-scale family pedigrees are commonly used across medical, evolutionary, and forensic genetics. These pedigrees are tools for identifying genetic disorders, tracking evolutionary patterns, and establishing familial relationships via forensic genetic identification. However, there is a lack of software to accurately simulate different pedigree structures along with genomes corresponding to those individuals in a family pedigree. This limits simulation-based evaluations of methods that use pedigrees. Results We have developed a python command-line-based tool called py_ped_sim that facilitates the simulation of pedigree structures and the genomes of individuals in a pedigree. py_ped_sim represents pedigrees as directed acyclic graphs, enabling conversion between standard pedigree formats and integration with the forward population genetic simulator, SLiM. Notably, py_ped_sim allows the simulation of varying numbers of offspring for a set of parents, with the capacity to shift the distribution of sibship sizes over generations. We additionally add simulations for events of misattributed paternity, which offers a way to simulate half-sibling relationships. We validated the accuracy of our software by simulating genomes onto diverse family pedigree structures, showing that the estimated kinship coefficients closely approximated expected values. Conclusions py_ped_sim is a user-friendly and open-source solution for simulating pedigree structures and conducting pedigree genome simulations. It empowers medical, forensic, and evolutionary genetics researchers to gain deeper insights into the dynamics of genetic inheritance and relatedness within families.
Affiliation(s)
- Miguel Guardado
  - San Francisco State University, Department of Mathematics, San Francisco CA, 94132, USA
  - University of California San Francisco, Biological and Medical Informatics Graduate Program, San Francisco CA, 94158
  - Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; San Francisco, 94134, CA, USA
  - University of Oregon; Department of Data Science; Eugene, OR, 97403, USA
- Cynthia Perez
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- Shalom Jackson
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- Joaquín Magaña
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- Sthen Campana
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- Emily Samperio
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- Selena Hernandez
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
- Kaela Syas
  - San Francisco State University, Department of Mathematics, San Francisco CA, 94132, USA
- Ryan Hernandez
  - Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA; San Francisco, 94134, CA, USA
- Elena I. Zavala
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
  - University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA, 94720, USA
- Rori Rohlfs
  - San Francisco State University, Department of Biology, San Francisco CA, 94132, USA
  - University of Oregon; Department of Data Science; Eugene, OR, 97403, USA
6
Wharrie S, Yang Z, Raj V, Monti R, Gupta R, Wang Y, Martin A, O’Connor LJ, Kaski S, Marttinen P, Palamara PF, Lippert C, Ganna A. HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes. Bioinformatics 2023; 39:btad535. [PMID: 37647640 PMCID: PMC10493177 DOI: 10.1093/bioinformatics/btad535] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/23/2022] [Revised: 08/23/2023] [Accepted: 08/29/2023] [Indexed: 09/01/2023] Open
Abstract
MOTIVATION Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. RESULTS We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven methods to generate polygenic risk scoring across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
Affiliation(s)
- Sophie Wharrie
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
- Zhiyu Yang
  - Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland
- Vishnu Raj
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
- Remo Monti
  - Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam 14469, Germany
- Rahul Gupta
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Ying Wang
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Alicia Martin
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Luke J O’Connor
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
- Samuel Kaski
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
  - Department of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom
- Pekka Marttinen
  - Department of Computer Science, Aalto University, Espoo 02150, Finland
- Christoph Lippert
  - Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam 14469, Germany
  - Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York 10065, United States
- Andrea Ganna
  - Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
7
Bocher O, Marenne G, Génin E, Perdry H. Ravages: An R package for the simulation and analysis of rare variants in multicategory phenotypes. Genet Epidemiol 2023; 47:450-460. [PMID: 37158367 DOI: 10.1002/gepi.22529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 03/27/2023] [Accepted: 04/25/2023] [Indexed: 05/10/2023]
Abstract
Current software packages for the analysis and the simulations of rare variants are only available for binary and continuous traits. Ravages provides solutions in a single R package to perform rare variant association tests for multicategory, binary and continuous phenotypes, to simulate datasets under different scenarios and to compute statistical power. Association tests can be run in the whole genome thanks to C++ implementation of most of the functions, using either RAVA-FIRST, a recently developed strategy to filter and analyse genome-wide rare variants, or user-defined candidate regions. Ravages also includes a simulation module that generates genetic data for cases who can be stratified into several subgroups and for controls. Through comparisons with existing programmes, we show that Ravages complements existing tools and will be useful to study the genetic architecture of complex diseases. Ravages is available on the CRAN at https://cran.r-project.org/web/packages/Ravages/ and maintained on Github at https://github.com/genostats/Ravages.
Affiliation(s)
- Ozvan Bocher
  - Univ Brest, Inserm, EFS, UMR 1078, GGB, Brest, France
  - Institute of Translational Genomics, Helmholtz Zentrum München, Munich, Germany
- Hervé Perdry
  - CESP Inserm, U1018, UFR Médecine, Univ Paris-Sud, Université Paris-Saclay, Villejuif, France
8
Yang Z, Wang C, Liu L, Khan A, Lee A, Vardarajan B, Mayeux R, Kiryluk K, Ionita-Laza I. CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses. Nat Genet 2023; 55:1057-1065. [PMID: 37169873 DOI: 10.1038/s41588-023-01392-0] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 04/11/2022] [Accepted: 04/11/2023] [Indexed: 05/13/2023]
Abstract
Fine-mapping is commonly used to identify putative causal variants at genome-wide significant loci. Here we propose a Bayesian model for fine-mapping that has several advantages over existing methods, including flexible specification of the prior distribution of effect sizes, joint modeling of summary statistics and functional annotations and accounting for discrepancies between summary statistics and external linkage disequilibrium in meta-analyses. Using simulations, we compare performance with commonly used fine-mapping methods and show that the proposed model has higher power and lower false discovery rate (FDR) when including functional annotations, and higher power, lower FDR and higher coverage for credible sets in meta-analyses. We further illustrate our approach by applying it to a meta-analysis of Alzheimer's disease genome-wide association studies where we prioritize putatively causal variants and genes.
Affiliation(s)
- Zikun Yang
  - Department of Biostatistics, Columbia University, New York City, NY, USA
- Chen Wang
  - Department of Biostatistics, Columbia University, New York City, NY, USA
  - Division of Nephrology Department of Medicine College of Physicians and Surgeons, Columbia University, New York City, NY, USA
- Linxi Liu
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA
- Atlas Khan
  - Division of Nephrology Department of Medicine College of Physicians and Surgeons, Columbia University, New York City, NY, USA
- Annie Lee
  - Department of Neurology College of Physicians and Surgeons, Columbia University, New York City, NY, USA
- Badri Vardarajan
  - Department of Neurology College of Physicians and Surgeons, Columbia University, New York City, NY, USA
- Richard Mayeux
  - Department of Neurology College of Physicians and Surgeons, Columbia University, New York City, NY, USA
- Krzysztof Kiryluk
  - Division of Nephrology Department of Medicine College of Physicians and Surgeons, Columbia University, New York City, NY, USA
9
Knutson KA, Pan W. MATS: a novel multi-ancestry transcriptome-wide association study to account for heterogeneity in the effects of cis-regulated gene expression on complex traits. Hum Mol Genet 2023; 32:1237-1251. [PMID: 36179104 PMCID: PMC10077507 DOI: 10.1093/hmg/ddac247] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/21/2022] [Revised: 09/16/2022] [Accepted: 09/28/2022] [Indexed: 01/16/2023] Open
Abstract
The Transcriptome-Wide Association Study (TWAS) is a widely used approach which integrates gene expression and Genome Wide Association Study (GWAS) data to study the role of cis-regulated gene expression (GEx) in complex traits. However, the genetic architecture of GEx varies across populations, and recent findings point to possible ancestral heterogeneity in the effects of GEx on complex traits, which may be amplified in TWAS by modeling GEx as a function of cis-eQTLs. Here, we present a novel extension to TWAS to account for heterogeneity in the effects of cis-regulated GEx which are correlated with ancestry. Our proposed Multi-Ancestry TwaS (MATS) framework jointly analyzes samples from multiple populations and distinguishes between shared, ancestry-specific and/or subject-specific expression-trait associations. As such, MATS amplifies power to detect shared GEx associations over ancestry-stratified TWAS through increased sample sizes, and facilitates the detection of genes with subgroup-specific associations which may be masked by standard TWAS. Our simulations highlight the improved Type-I error conservation and power of MATS compared with competing approaches. Our real data applications to Alzheimer's disease (AD) case-control genotypes from the Alzheimer's Disease Sequencing Project (ADSP) and continuous phenotypes from the UK Biobank (UKBB) identify a number of unique gene-trait associations which were not discovered through standard and/or ancestry-stratified TWAS. Ultimately, these findings promote MATS as a powerful method for detecting and estimating significant gene expression effects on complex traits within multi-ancestry cohorts and corroborate the mounting evidence for inter-population heterogeneity in gene-trait associations.
Affiliation(s)
- Wei Pan
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN, USA
10
Gu T, Lee PH, Duan R. COMMUTE: Communication-efficient transfer learning for multi-site risk prediction. J Biomed Inform 2023; 137:104243. [PMID: 36403757 PMCID: PMC9868117 DOI: 10.1016/j.jbi.2022.104243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 09/20/2022] [Accepted: 11/06/2022] [Indexed: 11/19/2022]
Abstract
OBJECTIVES We propose a communication-efficient transfer learning approach (COMMUTE) that effectively incorporates multi-site healthcare data for training a risk prediction model in a target population of interest, accounting for challenges including population heterogeneity and data sharing constraints across sites. METHODS We first train population-specific source models locally within each site. Using data from a given target population, COMMUTE learns a calibration term for each source model, which adjusts for potential data heterogeneity through flexible distance-based regularizations. In a centralized setting where multi-site data can be directly pooled, all data are combined to train the target model after calibration. When individual-level data are not shareable in some sites, COMMUTE requests only the locally trained models from these sites, with which, COMMUTE generates heterogeneity-adjusted synthetic data for training the target model. We evaluate COMMUTE via extensive simulation studies and an application to multi-site data from the electronic Medical Records and Genomics (eMERGE) Network to predict extreme obesity. RESULTS Simulation studies show that COMMUTE outperforms methods without adjusting for population heterogeneity and methods trained in a single population over a broad spectrum of settings. Using eMERGE data, COMMUTE achieves an area under the receiver operating characteristic curve (AUC) around 0.80, which outperforms other benchmark methods with AUC ranging from 0.51 to 0.70. CONCLUSION COMMUTE improves the risk prediction in a target population with limited samples and safeguards against negative transfer when some source populations are highly different from the target. In a federated setting, it is highly communication efficient as it only requires each site to share model parameter estimates once, and no iterative communication or higher-order terms are needed.
Affiliation(s)
- Tian Gu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Phil H Lee
  - Department of Psychiatry, Harvard Medical School, Boston, MA, United States; Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Rui Duan
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
11
Dias R, Evans D, Chen SF, Chen KY, Loguercio S, Chan L, Torkamani A. Rapid, Reference-Free human genotype imputation with denoising autoencoders. eLife 2022; 11:e75600. [PMID: 36148981 PMCID: PMC9555874 DOI: 10.7554/elife.75600] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/16/2021] [Accepted: 09/19/2022] [Indexed: 11/13/2022] Open
Abstract
Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here, we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least fourfold faster inference run time relative to standard imputation tools.
Affiliation(s)
- Raquel Dias
  - Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, United States
- Doug Evans
  - Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
- Shang-Fu Chen
  - Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
- Kai-Yu Chen
  - Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
- Salvatore Loguercio
  - Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
- Leslie Chan
  - Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
- Ali Torkamani
  - Scripps Research Translational Institute, Scripps Research Institute, La Jolla, United States
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, United States
12
Integrative transcriptomic, evolutionary, and causal inference framework for region-level analysis: Application to COVID-19. NPJ Genom Med 2022; 7:24. [PMID: 35318325 PMCID: PMC8940898 DOI: 10.1038/s41525-022-00296-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/16/2021] [Accepted: 02/15/2022] [Indexed: 11/09/2022] Open
Abstract
We developed an integrative transcriptomic, evolutionary, and causal inference framework for a deep region-level analysis, which integrates several published approaches and a new summary-statistics-based methodology. To illustrate the framework, we applied it to understanding the host genetics of COVID-19 severity. We identified putative causal genes, including SLC6A20, CXCR6, CCR9, and CCR5 in the locus on 3p21.31, quantifying their effect on mediating expression and on severe COVID-19. We confirmed that individuals who carry the introgressed archaic segment in the locus have a substantially higher risk of developing the severe disease phenotype, estimating its contribution to expression-mediated heritability using a new summary-statistics-based approach we developed here. Through a large-scale phenome-wide scan for the genes in the locus, several potential complications, including inflammatory, immunity, olfactory, and gustatory traits, were identified. Notably, the introgressed segment showed a much higher concentration of expression-mediated causal effect on severity (0.9–11.5 times) than the entire locus, explaining, on average, 15.7% of the causal effect. The region-level framework (implemented in publicly available software, SEGMENT-SCAN) has important implications for the elucidation of molecular mechanisms of disease and the rational design of potentially novel therapeutics.
13
Choi YH, Briollais L, He W, Kopciuk K. FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs. J Stat Softw 2021; 97:10.18637/jss.v097.i07. [PMID: 34512212 PMCID: PMC8427460 DOI: 10.18637/jss.v097.i07] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/03/2022] Open
Abstract
FamEvent is a comprehensive R package for simulating and modelling age-at-disease onset in families carrying a rare gene mutation. The package can simulate complex family data for variable time-to-event outcomes under three common family study designs (population, high-risk clinic and multi-stage) with various levels of missing genetic information among family members. Residual familial correlation can be induced through the inclusion of a frailty term or a second gene. Disease-gene carrier probabilities are evaluated assuming Mendelian transmission or empirically from the data. When genetic information on the disease gene is missing, an Expectation-Maximization algorithm is employed to calculate the carrier probabilities. Penetrance model functions with ascertainment correction adapted to the sampling design provide age-specific cumulative disease risks by sex, mutation status, and other covariates for simulated data as well as real data analysis. Robust standard errors and 95% confidence intervals are available for these estimates. Plots of pedigrees and penetrance functions based on the fitted model provide graphical displays to evaluate and summarize the models.
14
Gleason KJ, Yang F, Pierce BL, He X, Chen LS. Primo: integration of multiple GWAS and omics QTL summary statistics for elucidation of molecular mechanisms of trait-associated SNPs and detection of pleiotropy in complex traits. Genome Biol 2020; 21:236. [PMID: 32912334 PMCID: PMC7488447 DOI: 10.1186/s13059-020-02125-w] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 03/26/2019] [Accepted: 07/29/2020] [Indexed: 01/10/2023] Open
Abstract
To provide a comprehensive mechanistic interpretation of how known trait-associated SNPs affect complex traits, we propose a method, Primo, for integrative analysis of GWAS summary statistics with multiple sets of omics QTL summary statistics from different cellular conditions or studies. Primo examines association patterns of SNPs to complex and omics traits. In gene regions harboring known susceptibility loci, Primo performs conditional association analysis to account for linkage disequilibrium. Primo allows for unknown study heterogeneity and sample correlations. We show two applications using Primo to examine the molecular mechanisms of known susceptibility loci and to detect and interpret pleiotropic effects.
Affiliation(s)
- Kevin J. Gleason
  - Department of Public Health Sciences, University of Chicago, 5841 South Maryland Ave MC2000, Chicago, IL 60637, USA
- Fan Yang
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, 13001 E. 17th Place, Aurora, CO 80045, USA
- Brandon L. Pierce
  - Department of Public Health Sciences, University of Chicago, 5841 South Maryland Ave MC2000, Chicago, IL 60637, USA
  - Department of Human Genetics, University of Chicago, 920 E 58th St, Chicago, IL 60637, USA
- Xin He
  - Department of Human Genetics, University of Chicago, 920 E 58th St, Chicago, IL 60637, USA
- Lin S. Chen
  - Department of Public Health Sciences, University of Chicago, 5841 South Maryland Ave MC2000, Chicago, IL 60637, USA
15
Romanescu RG, Green J, Andrulis IL, Bull SB. Gene-based and pathway-based testing for rare-variant association in affected sib pairs. Genet Epidemiol 2020; 44:368-381. [PMID: 32237178 PMCID: PMC7318298 DOI: 10.1002/gepi.22291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/06/2019] [Revised: 02/28/2020] [Accepted: 03/06/2020] [Indexed: 12/04/2022]
Abstract
Next generation sequencing technologies have made it possible to investigate the role of rare variants (RVs) in disease etiology. Because RVs associated with disease susceptibility tend to be enriched in families with affected individuals, study designs based on affected sib pairs (ASP) can be more powerful than case-control studies. We construct tests of RV-set association in ASPs for single genomic regions as well as for multiple regions. Single-region tests can efficiently detect a gene region harboring susceptibility variants, while multiple-region extensions are meant to capture signals dispersed across a biological pathway, potentially as a result of locus heterogeneity. Within ascertained ASPs, the test statistics contrast the frequencies of duplicate rare alleles (usually appearing on a shared haplotype) against frequencies of a single rare allele copy (appearing on a nonshared haplotype); we call these allelic parity tests. Incorporation of minor allele frequency estimates from reference populations can markedly improve test efficiency. Under various genetic penetrance models, application of the tests in simulated ASP data sets demonstrates good type I error properties as well as power gains over approaches that regress ASP rare allele counts on sharing state, especially in small samples. We discuss robustness of the allelic parity methods to the presence of genetic linkage, misspecification of reference population allele frequencies, sequencing error and de novo mutations, and population stratification. As proof of principle, we apply single- and multiple-region tests in a motivating study data set consisting of whole exome sequencing of sisters ascertained with early onset breast cancer.
Affiliation(s)
- Razvan G. Romanescu
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada
  - Centre for Healthcare Innovation, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Manitoba, Canada
- Jessica Green
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada
- Irene L. Andrulis
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Shelley B. Bull
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
16
Xu J, Xu W, Briollais L. A Bayes factor approach with informative prior for rare genetic variant analysis from next generation sequencing data. Biometrics 2020; 77:316-328. [PMID: 32277476 DOI: 10.1111/biom.13278] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/24/2018] [Revised: 02/15/2020] [Accepted: 04/01/2020] [Indexed: 11/28/2022]
Abstract
The discovery of rare genetic variants through next generation sequencing is a very challenging issue in the field of human genetics. We propose a novel region-based statistical approach based on a Bayes Factor (BF) to assess evidence of association between a set of rare variants (RVs) located on the same genomic region and a disease outcome in the context of case-control design. Marginal likelihoods are computed under the null and alternative hypotheses assuming a binomial distribution for the RV count in the region and a beta or mixture of Dirac and beta prior distribution for the probability of RV. We derive the theoretical null distribution of the BF under our prior setting and show that a Bayesian control of the false discovery rate can be obtained for genome-wide inference. Informative priors are introduced using prior evidence of association from a Kolmogorov-Smirnov test statistic. We use our simulation program, sim1000G, to generate RV data similar to the 1000 Genomes sequencing project. Our simulation studies showed that the new BF statistic outperforms standard methods (SKAT, SKAT-O, Burden test) in case-control studies with moderate sample sizes and is equivalent to them under large sample size scenarios. Our real data application to a lung cancer case-control study found enrichment for RVs in known and novel cancer genes. It also suggests that using the BF with informative prior improves the overall gene discovery compared to the BF with noninformative prior.
Affiliation(s)
- Jingxiong Xu
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
- Wei Xu
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Princess Margaret Cancer Center, Toronto, Canada
- Laurent Briollais
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
17
Nieuwoudt C, Brooks-Wilson A, Graham J. SimRVSequences: an R package to simulate genetic sequence data for pedigrees. Bioinformatics 2020; 36:2295-2297. [PMID: 31764964 DOI: 10.1093/bioinformatics/btz881] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/05/2019] [Revised: 11/12/2019] [Accepted: 11/22/2019] [Indexed: 11/12/2022] Open
Abstract
SUMMARY: We present the R package SimRVSequences to simulate sequence data for pedigrees. SimRVSequences allows for simulations of large numbers of single-nucleotide variants (SNVs) and scales well with increasing numbers of pedigrees. Users provide a sample of pedigrees and SNV data from a sample of unrelated individuals. AVAILABILITY AND IMPLEMENTATION: SimRVSequences is publicly available on CRAN at https://cran.r-project.org/web/packages/SimRVSequences/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christina Nieuwoudt
  - Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6
- Angela Brooks-Wilson
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 1L3
  - Department of Biomedical Physiology and Kinesiology, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Jinko Graham
  - Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6
18
Juan L, Wang Y, Jiang J, Yang Q, Jiang Q, Wang Y. PGsim: A Comprehensive and Highly Customizable Personal Genome Simulator. Front Bioeng Biotechnol 2020; 8:28. [PMID: 32047747 PMCID: PMC6997238 DOI: 10.3389/fbioe.2020.00028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/16/2019] [Accepted: 01/13/2020] [Indexed: 11/26/2022] Open
Abstract
Although genome sequencing has become increasingly popular, the simulation of individual genomes is still important. This is because sequencing a large number of individual genomes is costly and genome data with extreme and boundary conditions, such as fatal genetic defects, are difficult to obtain. Privacy and legal barriers also prevent many applications of real data. Large sequencing projects in recent years have provided a deeper understanding of the human genome. However, there is a lack of tools that leverage known data to simulate personal genomes as realistically as possible. Here, we designed and developed PGsim, a comprehensive and highly customizable individual genome simulator that makes full use of existing knowledge, such as variant allele frequencies in global or major world populations, mutation probability differences between protein-coding and non-coding regions, transition/transversion (Ti/Tv) ratios, indel incidence, indel length distribution, structural variation sites, and pathogenic mutation sites. Users can flexibly control the proportion and quantity of known variants, common variants, novel variants in both coding and non-coding regions, and special variants through detailed parameter settings. To ensure that the simulated personal genome has sufficient randomness, PGsim makes the generated variants more realistic and reliable in terms of variant distribution, proportion, and population characteristics. PGsim is able to employ a very large database as background data to simulate personal genomes and does not require SQL database support. Users can easily change the variant databases used as needed. Because PGsim is a Perl script, there is no obstacle to running it on any version of macOS or Linux, and no libraries, packages, interpreters, compilers, or other dependencies need to be installed in advance. The PGsim tool is publicly available at https://github.com/lrjuan/PGsim.
Affiliation(s)
- Liran Juan
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yongtian Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jingyi Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qi Yang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
19
Bocher O, Marenne G, Saint Pierre A, Ludwig TE, Guey S, Tournier-Lasserve E, Perdry H, Génin E. Rare variant association testing for multicategory phenotype. Genet Epidemiol 2019; 43:646-656. [PMID: 31087445 DOI: 10.1002/gepi.22210] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/26/2018] [Revised: 04/03/2019] [Accepted: 04/17/2019] [Indexed: 01/09/2023]
Abstract
Genetic association studies have provided new insights into the genetic variability of human complex traits with a focus mainly on continuous or binary traits. Methods have been proposed to take into account disease heterogeneity between subgroups of patients when studying common variants, but none was specifically designed for rare variants. Because rare variants are expected to have stronger effects and to be more heterogeneously distributed among cases than common ones, subgroup analyses might be particularly attractive in this context. To address this issue, we propose an extension of burden tests by using a multinomial regression model, which enables association tests between rare variants and multicategory phenotypes. We evaluated the type I error and the power of two burden tests, CAST and WSS, by simulating data under different scenarios. In the case of genetic heterogeneity between case subgroups, we showed an advantage of multinomial regression over logistic regression, which considers all the cases against the controls. We replicated these results on real data from Moyamoya disease, where the burden tests performed better when cases were stratified according to age of onset. We implemented the functions for association tests in the R package "Ravages", available on GitHub.
Affiliation(s)
- Ozvan Bocher
  - Univ Brest, Inserm, EFS, UMR 1078, GGB, Brest, France
- Thomas E Ludwig
  - Univ Brest, Inserm, EFS, UMR 1078, GGB, Brest, France
  - CHU Brest, Brest, France
- Stéphanie Guey
  - Inserm UMR-S1161, Génétique et Physiopathologie des Maladies Cérébro-vasculaires, Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Elisabeth Tournier-Lasserve
  - Inserm UMR-S1161, Génétique et Physiopathologie des Maladies Cérébro-vasculaires, Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Hervé Perdry
  - CESP Inserm, U1018, UFR Médecine, Univ Paris-Sud, Université Paris-Saclay, Villejuif, France