1
Juan L, Wang Y, Jiang J, Yang Q, Jiang Q, Wang Y. PGsim: A Comprehensive and Highly Customizable Personal Genome Simulator. Front Bioeng Biotechnol 2020; 8:28. [PMID: 32047747 PMCID: PMC6997238 DOI: 10.3389/fbioe.2020.00028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/16/2019] [Accepted: 01/13/2020] [Indexed: 11/26/2022] Open
Abstract
Although genome sequencing has become increasingly popular, the simulation of individual genomes is still important. This is because sequencing a large number of individual genomes is costly, and genome data with extreme and boundary conditions, such as fatal genetic defects, are difficult to obtain. Privacy and legal barriers also prevent many applications of real data. Large sequencing projects in recent years have provided a deeper understanding of the human genome. However, there is a lack of tools that leverage known data to simulate personal genomes as realistically as possible. Here, we designed and developed PGsim, a comprehensive and highly customizable individual genome simulator that fully uses existing knowledge, such as variant allele frequencies in global or major world populations, mutation probability differences between protein-coding and non-coding regions, transition/transversion (Ti/Tv) ratios, Indel incidence, Indel length distribution, structural variation sites, and pathogenic mutation sites. Users can flexibly control the proportion and quantity of known variants, common variants, novel variants in both coding and non-coding regions, and special variants through detailed parameter settings. While ensuring that the simulated personal genome has sufficient randomness, PGsim makes the generated variants more realistic and reliable in terms of variant distribution, proportion, and population characteristics. PGsim can employ a very large database as background data to simulate personal genomes and does not require SQL database support. Users can easily change the variant databases used as needed. As a Perl script, PGsim runs on any version of macOS or Linux, and no libraries, packages, interpreters, compilers, or other dependencies need to be installed in advance. The PGsim tool is publicly available at https://github.com/lrjuan/PGsim.
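The Ti/Tv constraint named in this abstract can be illustrated with a small sketch. This is not PGsim's code (PGsim is a Perl script and its internals are not described here); the function names `is_transition`, `sample_alt`, and `titv_ratio`, and the target ratio of 2.0, are assumptions chosen for illustration only.

```python
import random

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
# Each base has exactly one transition partner (purine<->purine, pyrimidine<->pyrimidine).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def is_transition(ref, alt):
    """Return True if ref->alt is a transition (A<->G or C<->T)."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def sample_alt(ref, titv, rng):
    """Draw an alternate base so that the expected Ti/Tv ratio equals `titv`.

    Hypothetical helper: a transition is chosen with probability
    titv / (titv + 1), otherwise one of the two transversions is
    chosen uniformly.
    """
    transition = TRANSITION_PARTNER[ref]
    transversions = [b for b in "ACGT" if b not in (ref, transition)]
    if rng.random() < titv / (titv + 1.0):
        return transition
    return rng.choice(transversions)


def titv_ratio(pairs):
    """Compute Ti/Tv over (ref, alt) substitution pairs (ref != alt assumed)."""
    ti = sum(is_transition(r, a) for r, a in pairs)
    tv = len(pairs) - ti
    return ti / tv if tv else float("inf")


if __name__ == "__main__":
    rng = random.Random(0)
    simulated = [(ref, sample_alt(ref, 2.0, rng)) for ref in "ACGT" * 5000]
    print(f"observed Ti/Tv: {titv_ratio(simulated):.2f}")
```

With a target of 2.0, roughly two thirds of the simulated substitutions are transitions, so the observed ratio should land close to the target for large samples.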
Affiliation(s)
- Liran Juan
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yongtian Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jingyi Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qi Yang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

2
Uricchio LH. Evolutionary perspectives on polygenic selection, missing heritability, and GWAS. Hum Genet 2020; 139:5-21. [PMID: 31201529 PMCID: PMC8059781 DOI: 10.1007/s00439-019-02040-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 11/29/2018] [Accepted: 06/06/2019] [Indexed: 12/26/2022]
Abstract
Genome-wide association studies (GWAS) have successfully identified many trait-associated variants, but there is still much we do not know about the genetic basis of complex traits. Here, we review recent theoretical and empirical literature regarding selection on complex traits to argue that "missing heritability" is as much an evolutionary problem as it is a statistical problem. We discuss empirical findings that suggest a role for selection in shaping the effect sizes and allele frequencies of causal variation underlying complex traits, and the limitations of these studies. We then use simulations of selection, realistic genome structure, and complex human demography to illustrate the results of recent theoretical work on polygenic selection, and show that statistical inference of causal loci is sharply affected by evolutionary processes. In particular, when selection acts on causal alleles, it hampers the ability to detect causal loci and constrains the transferability of GWAS results across populations. Last, we discuss the implications of these findings for future association studies, and suggest that future statistical methods to infer causal loci for genetic traits will benefit from explicit modeling of the joint distribution of effect sizes and allele frequencies under plausible evolutionary models.
Affiliation(s)
- Lawrence H Uricchio
  - Department of Biology, Stanford University, Stanford, CA, USA.
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, CA, USA.

3
Tong DMH, Hernandez RD. Population genetic simulation study of power in association testing across genetic architectures and study designs. Genet Epidemiol 2020; 44:90-103. [PMID: 31587362 PMCID: PMC6980249 DOI: 10.1002/gepi.22264] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/20/2019] [Revised: 08/26/2019] [Accepted: 09/16/2019] [Indexed: 12/22/2022]
Abstract
While it is well established that genetics can be a major contributor to population variation of complex traits, the relative contributions of rare and common variants to phenotypic variation remain a matter of considerable debate. Here, we simulate genetic and phenotypic data across different case/control panel sampling strategies, sequencing methods, and genetic architecture models based on evolutionary forces to determine the statistical performance of rare variant association tests (RVATs) widely in use. We find that the highest statistical power of RVATs is achieved by sampling case/control individuals from the extremes of an underlying quantitative trait distribution. We also demonstrate that the use of genotyping arrays, in conjunction with imputation from a whole-genome sequenced (WGS) reference panel, recovers the vast majority (90%) of the power that could be achieved by sequencing the case/control panel using current tools. Finally, we show that for dichotomous traits, the statistical performance of RVATs decreases as rare variants become more important in the trait architecture. Our results extend previous work to show that RVATs are insufficiently powered to make generalizable conclusions about the role of rare variants in dichotomous complex traits.
Affiliation(s)
- Dominic M. H. Tong
  - University of California, Berkeley ‐ University of California, San Francisco Graduate Program in Bioengineering, San Francisco, California
- Ryan D. Hernandez
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California
  - Department of Human Genetics, McGill University, Montreal, Canada

4
Hernandez RD, Uricchio LH, Hartman K, Ye C, Dahl A, Zaitlen N. Ultrarare variants drive substantial cis heritability of human gene expression. Nat Genet 2019; 51:1349-1355. [PMID: 31477931 PMCID: PMC6730564 DOI: 10.1038/s41588-019-0487-7] [Citation(s) in RCA: 72] [Impact Index Per Article: 14.4] [Received: 12/16/2018] [Accepted: 07/08/2019] [Indexed: 11/09/2022]
Abstract
The vast majority of human mutations have minor allele frequencies under 1%, with the plurality observed only once (that is, 'singletons'). While Mendelian diseases are predominantly caused by rare alleles, their cumulative contribution to complex phenotypes is largely unknown. We develop and rigorously validate an approach to jointly estimate the contribution of all alleles, including singletons, to phenotypic variation. We apply our approach to transcriptional regulation, an intermediate between genetic variation and complex disease. Using whole-genome DNA and lymphoblastoid cell line RNA sequencing data from 360 European individuals, we conservatively estimate that singletons contribute approximately 25% of cis heritability across genes (dwarfing the contributions of other frequencies). The majority (approximately 76%) of singleton heritability derives from ultrarare variants absent from thousands of additional samples. We develop an inference procedure to demonstrate that our results are consistent with pervasive purifying selection shaping the regulatory architecture of most human genes.
Affiliation(s)
- Ryan D Hernandez
  - Bioengineering & Therapeutic Sciences, UCSF, San Francisco, CA, USA.
  - Institute for Human Genetics, UCSF, San Francisco, CA, USA.
  - Institute for Quantitative Biosciences, UCSF, San Francisco, CA, USA.
  - Institute for Computational Health Sciences, UCSF, San Francisco, CA, USA.
  - Department of Human Genetics, McGill University, Montreal, Quebec, Canada.
  - McGill University and the Genome Quebec Innovation Center, Montreal, Quebec, Canada.
- Kevin Hartman
  - Biological and Medical Informatics Graduate Program, UCSF, San Francisco, CA, USA
- Chun Ye
  - Institute for Human Genetics, UCSF, San Francisco, CA, USA
  - Epidemiology & Biostatistics, UCSF, San Francisco, CA, USA
- Andrew Dahl
  - Institute for Human Genetics, UCSF, San Francisco, CA, USA
  - Institute for Quantitative Biosciences, UCSF, San Francisco, CA, USA
- Noah Zaitlen
  - Institute for Human Genetics, UCSF, San Francisco, CA, USA.
  - Institute for Quantitative Biosciences, UCSF, San Francisco, CA, USA.
  - Department of Medicine Lung Biology Center, UCSF, San Francisco, CA, USA.

5
Exploiting selection at linked sites to infer the rate and strength of adaptation. Nat Ecol Evol 2019; 3:977-984. [PMID: 31061475 PMCID: PMC6693860 DOI: 10.1038/s41559-019-0890-6] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 09/25/2018] [Accepted: 03/28/2019] [Indexed: 12/18/2022]
Abstract
Genomic data encodes past evolutionary events and has the potential to reveal the strength, rate, and biological drivers of adaptation. However, jointly estimating adaptation rate (α) and adaptation strength remains challenging because evolutionary processes such as demography, linkage, and non-neutral polymorphism can confound inference. Here, we exploit the influence of background selection to reduce the fixation rate of weakly-beneficial alleles to jointly infer the strength and rate of adaptation. We develop an MK-based method (ABC-MK) to infer adaptation rate and strength, and estimate α = 0.135 in human protein-coding sequences, 72% of which is contributed by weakly-adaptive variants. We show that in this adaptation regime α is reduced ≈ 25% by linkage genome-wide. Moreover, we show that virus-interacting proteins (VIPs) undergo adaptation that is both stronger and nearly twice as frequent as the genome average (α = 0.224, 56% due to strongly-beneficial alleles). Our results suggest that while most adaptation in human proteins is weakly-beneficial, adaptation to viruses is often strongly-beneficial. Our method provides a robust framework for estimating adaptation rate and strength across species.
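For orientation, the quantity α in this abstract is the proportion of nonsynonymous substitutions attributed to adaptation. The sketch below computes the classical McDonald-Kreitman (MK) point estimate, α = 1 − (D_S · P_N) / (D_N · P_S), from divergence and polymorphism counts. It is not the ABC-MK method of the paper, which the abstract says additionally accounts for background selection and weakly beneficial alleles; the function name `mk_alpha` and the counts in the usage example are made up for illustration.

```python
def mk_alpha(dn, ds, pn, ps):
    """Classical MK estimate of alpha.

    dn, ds: nonsynonymous and synonymous fixed differences (divergence).
    pn, ps: nonsynonymous and synonymous polymorphisms.
    Assumes synonymous sites are neutral; raises if the ratio is undefined.
    """
    if dn <= 0 or ps <= 0:
        raise ValueError("dn and ps must be positive for alpha to be defined")
    return 1.0 - (ds * pn) / (dn * ps)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(mk_alpha(dn=10, ds=20, pn=5, ps=20))
```

A value of 0 means divergence and polymorphism ratios agree (no excess nonsynonymous fixation); positive values indicate an excess of nonsynonymous substitutions relative to the neutral expectation.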

6
Torres R, Szpiech ZA, Hernandez RD. Human demographic history has amplified the effects of background selection across the genome. PLoS Genet 2018; 14:e1007387. [PMID: 29912945 PMCID: PMC6056204 DOI: 10.1371/journal.pgen.1007387] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 08/29/2017] [Revised: 07/23/2018] [Accepted: 04/30/2018] [Indexed: 01/22/2023] Open
Abstract
Natural populations often grow, shrink, and migrate over time. Such demographic processes can affect genome-wide levels of genetic diversity. Additionally, genetic variation in functional regions of the genome can be altered by natural selection, which drives adaptive mutations to higher frequencies or purges deleterious ones. Such selective processes affect not only the sites directly under selection but also nearby neutral variation through genetic linkage via processes referred to as genetic hitchhiking in the context of positive selection and background selection (BGS) in the context of purifying selection. While there is extensive literature examining the consequences of selection at linked sites at demographic equilibrium, less is known about how non-equilibrium demographic processes influence the effects of hitchhiking and BGS. Utilizing a global sample of human whole-genome sequences from the Thousand Genomes Project and extensive simulations, we investigate how non-equilibrium demographic processes magnify and dampen the consequences of selection at linked sites across the human genome. When binning the genome by inferred strength of BGS, we observe that, compared to Africans, non-African populations have experienced larger proportional decreases in neutral genetic diversity in strong BGS regions. We replicate these findings in admixed populations by showing that non-African ancestral components of the genome have also been affected more severely in these regions. We attribute these differences to the strong, sustained/recurrent population bottlenecks that non-Africans experienced as they migrated out of Africa and throughout the globe. Furthermore, we observe a strong correlation between FST and the inferred strength of BGS, suggesting a stronger rate of genetic drift. Forward simulations of human demographic history with a model of BGS support these observations. 
Our results show that non-equilibrium demography significantly alters the consequences of selection at linked sites and support the need for more work investigating the dynamic process of multiple evolutionary forces operating in concert.
Affiliation(s)
- Raul Torres
  - Biomedical Sciences Graduate Program, University of California San Francisco, San Francisco, CA, United States of America
- Zachary A. Szpiech
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, United States of America
- Ryan D. Hernandez
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, United States of America
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA, United States of America
  - Institute for Computational Health Sciences, University of California San Francisco, San Francisco, CA, United States of America
  - Quantitative Biosciences Institute, University of California San Francisco, San Francisco, CA, United States of America

7
A Model of Compound Heterozygous, Loss-of-Function Alleles Is Broadly Consistent with Observations from Complex-Disease GWAS Datasets. PLoS Genet 2017; 13:e1006573. [PMID: 28103232 PMCID: PMC5289629 DOI: 10.1371/journal.pgen.1006573] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 04/18/2016] [Revised: 02/02/2017] [Accepted: 01/05/2017] [Indexed: 12/17/2022] Open
Abstract
The genetic component of complex disease risk in humans remains largely unexplained. A corollary is that the allelic spectrum of genetic variants contributing to complex disease risk is unknown. Theoretical models that relate population genetic processes to the maintenance of genetic variation for quantitative traits may suggest profitable avenues for future experimental design. Here we use forward simulation to model a genomic region evolving under a balance between recurrent deleterious mutation and Gaussian stabilizing selection. We consider multiple genetic and demographic models, and several different methods for identifying genomic regions harboring variants associated with complex disease risk. We demonstrate that the model of gene action, relating genotype to phenotype, has a qualitative effect on several relevant aspects of the population genetic architecture of a complex trait. In particular, the genetic model impacts genetic variance component partitioning across the allele frequency spectrum and the power of statistical tests. Models with partial recessivity closely match the minor allele frequency distribution of significant hits from empirical genome-wide association studies without requiring homozygous effect sizes to be small. We highlight a particular gene-based model of incomplete recessivity that is appealing from first principles. Under that model, deleterious mutations in a genomic region partially fail to complement one another. This model of gene-based recessivity predicts the empirically observed inconsistency between twin- and SNP-based estimates of dominance heritability. Furthermore, this model predicts considerable levels of unexplained variance associated with intralocus epistasis. Our results suggest a need for improved statistical tools for region-based genetic association and heritability estimation.
Gene action determines how mutations affect phenotype. When placed in an evolutionary context, the details of the genotype-to-phenotype model can impact the maintenance of genetic variation for complex traits. Likewise, non-equilibrium demographic history may affect patterns of genetic variation. Here, we explore the impact of genetic model and population growth on the distribution of genetic variance across the allele frequency spectrum underlying risk for a complex disease. Using forward-in-time population genetic simulations, we show that the genetic model has important impacts on the composition of variation for complex disease risk in a population. We explicitly simulate genome-wide association studies (GWAS) and perform heritability estimation on population samples. A particular model of gene-based partial recessivity, based on allelic non-complementation, aligns well with empirical results. This model is congruent with the dominance variance estimates from both SNPs and twins, and the minor allele frequency distribution of GWAS hits.

8
Uricchio LH, Zaitlen NA, Ye CJ, Witte JS, Hernandez RD. Selection and explosive growth alter genetic architecture and hamper the detection of causal rare variants. Genome Res 2016; 26:863-73. [PMID: 27197206 PMCID: PMC4937562 DOI: 10.1101/gr.202440.115] [Citation(s) in RCA: 52] [Impact Index Per Article: 6.5] [Received: 11/25/2015] [Accepted: 05/16/2016] [Indexed: 12/20/2022]
Abstract
The role of rare alleles in complex phenotypes has been hotly debated, but most rare variant association tests (RVATs) do not account for the evolutionary forces that affect genetic architecture. Here, we use simulation and numerical algorithms to show that explosive population growth, as experienced by human populations, can dramatically increase the impact of very rare alleles on trait variance. We then assess the ability of RVATs to detect causal loci using simulations and human RNA-seq data. Surprisingly, we find that statistical performance is worst for phenotypes in which genetic variance is due mainly to rare alleles, and explosive population growth decreases power. Although many studies have attempted to identify causal rare variants, few have reported novel associations. This has sometimes been interpreted to mean that rare variants make negligible contributions to complex trait heritability. Our work shows that RVATs are not robust to realistic human evolutionary forces, so general conclusions about the impact of rare variants on complex traits may be premature.
Affiliation(s)
- Lawrence H Uricchio
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California 94143, USA; Graduate Program in Bioinformatics, University of California, San Francisco, San Francisco, California 94143, USA
- Noah A Zaitlen
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California 94143, USA; Institute for Human Genetics, University of California, San Francisco, San Francisco, California 94143, USA; Institute for Quantitative Biosciences (QB3), University of California, San Francisco, San Francisco, California 94143, USA
- Chun Jimmie Ye
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California 94143, USA; Institute for Human Genetics, University of California, San Francisco, San Francisco, California 94143, USA; Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California 94143, USA
- John S Witte
  - Institute for Human Genetics, University of California, San Francisco, San Francisco, California 94143, USA; Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California 94143, USA
- Ryan D Hernandez
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California 94143, USA; Institute for Human Genetics, University of California, San Francisco, San Francisco, California 94143, USA; Institute for Quantitative Biosciences (QB3), University of California, San Francisco, San Francisco, California 94143, USA