1
Barbitoff YA, Khmelkova DN, Pomerantseva EA, Slepchenkov AV, Zubashenko NA, Mironova IV, Kaimonov VS, Polev DE, Tsay VV, Glotov AS, Aseev MV, Shcherbak SG, Glotov OS, Isaev AA, Predeus AV. Expanding the Russian allele frequency reference via cross-laboratory data integration: insights from 7452 exome samples. Natl Sci Rev 2024; 11:nwae326. [PMID: 39498263 PMCID: PMC11533896 DOI: 10.1093/nsr/nwae326] [Received: 08/12/2024] [Revised: 08/17/2024] [Accepted: 09/12/2024] [Indexed: 11/07/2024]
Abstract
Population allele frequency is crucially important for accurate interpretation of known and novel variants in medical genetics. Recently, several large allele frequency databases, such as the Genome Aggregation Database (gnomAD), have been created to serve as a global reference for such studies. However, frequencies of many rare alleles vary dramatically between populations, and population-specific allele frequency is often more informative than the global one. Many countries and regions, including Russia, remain poorly studied from the genetic perspective. Here, we report the first successful attempt to integrate genetic information between major medical genetic laboratories in Russia. We construct RUSeq, an open, large-scale reference set of genetic variants by analyzing 7452 exome samples collected in two major Russian cities, Moscow and St. Petersburg. An ∼10-fold increase in sample size compared to previous studies allowed us to characterize extensive genetic diversity within the admixed Russian population with contributions from several major ancestral groups. We highlight 51 known pathogenic variants that are overrepresented in Russia compared to other European countries. We also identify several dozen high-impact variants that are present in healthy donors despite being annotated as pathogenic in ClinVar and falling within genes associated with autosomal dominant disorders. The constructed database of genetic variant frequencies in Russia has been made available to the medical genetics community through a variant browser available at http://ruseq.ru.
Affiliation(s)
- Yury A Barbitoff
  - CerbaLab Ltd., St. Petersburg 199106, Russia
  - Bioinformatics Institute, St. Petersburg 197342, Russia
  - Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg 199034, Russia
- Darya N Khmelkova
  - Genetics and Reproductive Medicine Center “GENETICO” Ltd., Moscow 121205, Russia
- Nikita A Zubashenko
  - Genetics and Reproductive Medicine Center “GENETICO” Ltd., Moscow 121205, Russia
- Irina V Mironova
  - Genetics and Reproductive Medicine Center “GENETICO” Ltd., Moscow 121205, Russia
- Vladimir S Kaimonov
  - Genetics and Reproductive Medicine Center “GENETICO” Ltd., Moscow 121205, Russia
- Dmitrii E Polev
  - CerbaLab Ltd., St. Petersburg 199106, Russia
  - Metagenomics Research Group, St. Petersburg Pasteur Institute, St. Petersburg 197101, Russia
- Victoria V Tsay
  - CerbaLab Ltd., St. Petersburg 199106, Russia
  - FGBE “Children's Scientific and Clinical Center for Infectious Diseases of the Federal Medical and Biological Agency”, St. Petersburg 197022, Russia
- Andrey S Glotov
  - Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg 199034, Russia
- Mikhail V Aseev
  - CerbaLab Ltd., St. Petersburg 199106, Russia
  - Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg 199034, Russia
- Oleg S Glotov
  - CerbaLab Ltd., St. Petersburg 199106, Russia
  - Department of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology and Reproductology, St. Petersburg 199034, Russia
  - FGBE “Children's Scientific and Clinical Center for Infectious Diseases of the Federal Medical and Biological Agency”, St. Petersburg 197022, Russia
  - City Hospital No. 40, St. Petersburg 197706, Russia
- Arthur A Isaev
  - Genetics and Reproductive Medicine Center “GENETICO” Ltd., Moscow 121205, Russia
2
Schraiber JG, Spence JP, Edge MD. Estimation of demography and mutation rates from one million haploid genomes. bioRxiv 2024:2024.09.18.613708. [PMID: 39345369 PMCID: PMC11429810 DOI: 10.1101/2024.09.18.613708] [Indexed: 10/01/2024]
Abstract
As genetic sequencing costs have plummeted, datasets with sizes previously unthinkable have begun to appear. Such datasets present new opportunities to learn about evolutionary history, particularly via rare alleles that record the very recent past. However, beyond the computational challenges inherent in the analysis of many large-scale datasets, large population-genetic datasets present theoretical problems. In particular, the majority of population-genetic tools require the assumption that each mutant allele in the sample is the result of a single mutation (the "infinite sites" assumption), which is violated in large samples. Here, we present DR EVIL, a method for estimating mutation rates and recent demographic history from very large samples. DR EVIL avoids the infinite-sites assumption by using a diffusion approximation to a branching-process model with recurrent mutation. The branching-process approach limits the method to rare alleles, but, along with recent results, renders tractable likelihoods with recurrent mutation. We show that DR EVIL performs well in simulations and apply it to rare-variant data from a million haploid samples, identifying a signal of mutation-rate heterogeneity within commonly analyzed classes and predicting that in modern sample sizes, most rare variants at sites with high mutation rates represent the descendants of multiple mutation events.
3
Kyriazis CC, Lohmueller KE. Constraining models of dominance for nonsynonymous mutations in the human genome. PLoS Genet 2024; 20:e1011198. [PMID: 39302992 PMCID: PMC11446423 DOI: 10.1371/journal.pgen.1011198] [Received: 02/25/2024] [Revised: 10/02/2024] [Accepted: 09/04/2024] [Indexed: 09/22/2024]
Abstract
Dominance is a fundamental parameter in genetics, determining the dynamics of natural selection on deleterious and beneficial mutations, the patterns of genetic variation in natural populations, and the severity of inbreeding depression in a population. Despite this importance, dominance parameters remain poorly known, particularly in humans or other non-model organisms. A key reason for this lack of information about dominance is that it is extremely challenging to disentangle the selection coefficient (s) of a mutation from its dominance coefficient (h). Here, we explore dominance and selection parameters in humans by fitting models to the site frequency spectrum (SFS) for nonsynonymous mutations. When assuming a single dominance coefficient for all nonsynonymous mutations, we find that numerous h values can fit the data, so long as h is greater than ~0.15. Moreover, we also observe that theoretically-predicted models with a negative relationship between h and s can also fit the data well, including models with h = 0.05 for strongly deleterious mutations. Finally, we use our estimated dominance and selection parameters to inform simulations revisiting the question of whether the out-of-Africa bottleneck has led to differences in genetic load between African and non-African human populations. These simulations suggest that the relative burden of genetic load in non-African populations depends on the dominance model assumed, with slight increases for more weakly recessive models and slight decreases shown for more strongly recessive models. Moreover, these results also demonstrate that models of partially recessive nonsynonymous mutations can explain the observed severity of inbreeding depression in humans, bridging the gap between molecular population genetics and direct measures of fitness in humans. 
Our work represents a comprehensive assessment of dominance and deleterious variation in humans, with implications for parameterizing models of deleterious variation in humans and other mammalian species.
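For readers less familiar with the notation, the roles of s and h can be illustrated with one common textbook parameterization of genotype fitness. This is an assumption made here for illustration only; the paper itself may use a different scaling (for example, a factor of two on s). With A the ancestral allele and a the deleterious allele:

```latex
\[
w_{AA} = 1, \qquad w_{Aa} = 1 - hs, \qquad w_{aa} = 1 - s
\]
```

Under this convention, h = 0 is fully recessive, h = 0.5 is additive, and h = 1 is fully dominant. Rare deleterious alleles are found mostly in heterozygotes, whose fitness depends on the product hs, which is one intuitive reason why s and h are hard to separate from frequency data alone.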
Affiliation(s)
- Christopher C. Kyriazis
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California, United States of America
- Kirk E. Lohmueller
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California, United States of America
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, California, United States of America
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, United States of America
4
Zeng T, Spence JP, Mostafavi H, Pritchard JK. Bayesian estimation of gene constraint from an evolutionary model with gene features. Nat Genet 2024; 56:1632-1643. [PMID: 38977852 DOI: 10.1038/s41588-024-01820-9] [Received: 06/02/2023] [Accepted: 05/29/2024] [Indexed: 07/10/2024]
Abstract
Measures of selective constraint on genes have been used for many applications, including clinical interpretation of rare coding variants, disease gene discovery and studies of genome evolution. However, widely used metrics are severely underpowered at detecting constraints for the shortest ~25% of genes, potentially causing important pathogenic mutations to be overlooked. Here we developed a framework combining a population genetics model with machine learning on gene features to enable accurate inference of an interpretable constraint metric, s_het. Our estimates outperform existing metrics for prioritizing genes important for cell essentiality, human disease and other phenotypes, especially for short genes. Our estimates of selective constraint should have wide utility for characterizing genes relevant to human disease. Finally, our inference framework, GeneBayes, provides a flexible platform that can improve the estimation of many gene-level properties, such as rare variant burden or gene expression differences.
Affiliation(s)
- Tony Zeng
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Hakhamanesh Mostafavi
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Population Health, New York University, New York, NY, USA
- Jonathan K Pritchard
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Biology, Stanford University, Stanford, CA, USA
5
Zeng T, Spence JP, Mostafavi H, Pritchard JK. Bayesian estimation of gene constraint from an evolutionary model with gene features. bioRxiv 2024:2023.05.19.541520. [PMID: 37292653 PMCID: PMC10245655 DOI: 10.1101/2023.05.19.541520] [Indexed: 06/10/2023]
Abstract
Measures of selective constraint on genes have been used for many applications including clinical interpretation of rare coding variants, disease gene discovery, and studies of genome evolution. However, widely-used metrics are severely underpowered at detecting constraint for the shortest ∼25% of genes, potentially causing important pathogenic mutations to be overlooked. We developed a framework combining a population genetics model with machine learning on gene features to enable accurate inference of an interpretable constraint metric, s_het. Our estimates outperform existing metrics for prioritizing genes important for cell essentiality, human disease, and other phenotypes, especially for short genes. Our new estimates of selective constraint should have wide utility for characterizing genes relevant to human disease. Finally, our inference framework, GeneBayes, provides a flexible platform that can improve estimation of many gene-level properties, such as rare variant burden or gene expression differences.
Affiliation(s)
- Tony Zeng
  - Department of Genetics, Stanford University, Stanford CA
- Jonathan K. Pritchard
  - Department of Genetics, Stanford University, Stanford CA
  - Department of Biology, Stanford University, Stanford CA
6
Rodrigues MF, Kern AD, Ralph PL. Shared evolutionary processes shape landscapes of genomic variation in the great apes. Genetics 2024; 226:iyae006. [PMID: 38242701 PMCID: PMC10990428 DOI: 10.1093/genetics/iyae006] [Received: 10/26/2023] [Revised: 10/26/2023] [Accepted: 01/03/2024] [Indexed: 01/21/2024]
Abstract
For at least the past 5 decades, population genetics, as a field, has worked to describe the precise balance of forces that shape patterns of variation in genomes. The problem is challenging because modeling the interactions between evolutionary processes is difficult, and different processes can impact genetic variation in similar ways. In this paper, we describe how diversity and divergence between closely related species change with time, using correlations between landscapes of genetic variation as a tool to understand the interplay between evolutionary processes. We find strong correlations between landscapes of diversity and divergence in a well-sampled set of great ape genomes, and explore how various processes such as incomplete lineage sorting, mutation rate variation, GC-biased gene conversion and selection contribute to these correlations. Through highly realistic, chromosome-scale, forward-in-time simulations, we show that the landscapes of diversity and divergence in the great apes are too well correlated to be explained via strictly neutral processes alone. Our best fitting simulation includes both deleterious and beneficial mutations in functional portions of the genome, in which 9% of fixations within those regions is driven by positive selection. This study provides a framework for modeling genetic variation in closely related species, an approach which can shed light on the complex balance of forces that have shaped genetic variation.
Affiliation(s)
- Murillo F Rodrigues
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
  - Department of Biology, University of Oregon, Eugene, OR 97403, USA
- Andrew D Kern
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
  - Department of Biology, University of Oregon, Eugene, OR 97403, USA
- Peter L Ralph
  - Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, USA
  - Department of Biology, University of Oregon, Eugene, OR 97403, USA
  - Department of Mathematics, University of Oregon, Eugene, OR 97403, USA
7
Hinch R, Donnelly P, Hinch AG. Meiotic DNA breaks drive multifaceted mutagenesis in the human germ line. Science 2023; 382:eadh2531. [PMID: 38033082 PMCID: PMC7615360 DOI: 10.1126/science.adh2531] [Received: 02/20/2023] [Accepted: 09/29/2023] [Indexed: 12/02/2023]
Abstract
Meiotic recombination commences with hundreds of programmed DNA breaks; however, the degree to which they are accurately repaired remains poorly understood. We report that meiotic break repair is eightfold more mutagenic for single-base substitutions than was previously understood, leading to de novo mutation in one in four sperm and one in 12 eggs. Its impact on indels and structural variants is even higher, with 100- to 1300-fold increases in rates per break. We uncovered new mutational signatures and footprints relative to break sites, which implicate unexpected biochemical processes and error-prone DNA repair mechanisms, including translesion synthesis and end joining in meiotic break repair. We provide evidence that these mechanisms drive mutagenesis in human germ lines and lead to disruption of hundreds of genes genome wide.
Affiliation(s)
- Robert Hinch
  - Big Data Institute, University of Oxford; Oxford, UK
- Peter Donnelly
  - Wellcome Centre for Human Genetics, University of Oxford; Oxford, UK
  - Genomics plc; Oxford, UK
8
Seplyarskiy V, Koch EM, Lee DJ, Lichtman JS, Luan HH, Sunyaev SR. A mutation rate model at the basepair resolution identifies the mutagenic effect of polymerase III transcription. Nat Genet 2023; 55:2235-2242. [PMID: 38036792 PMCID: PMC11348951 DOI: 10.1038/s41588-023-01562-0] [Received: 08/20/2022] [Accepted: 10/06/2023] [Indexed: 12/02/2023]
Abstract
De novo mutations occur at substantially different rates depending on genomic location, sequence context and DNA strand. The success of methods to estimate selection intensity, infer demographic history and map rare disease genes, depends strongly on assumptions about the local mutation rate. Here we present Roulette, a genome-wide mutation rate model at basepair resolution that incorporates known determinants of local mutation rate. Roulette is shown to be more accurate than existing models. We use Roulette to refine the estimates of population growth within Europe by incorporating the full range of human mutation rates. The analysis of significant deviations from the model predictions revealed a tenfold increase in mutation rate in nearly all genes transcribed by polymerase III (Pol III), suggesting a new mutagenic mechanism. We also detected an elevated mutation rate within transcription factor binding sites restricted to sites actively used in testis and residing in promoters.
Affiliation(s)
- Vladimir Seplyarskiy
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Brigham and Women's Hospital, Division of Genetics, Harvard Medical School, Boston, MA, USA
- Evan M Koch
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Brigham and Women's Hospital, Division of Genetics, Harvard Medical School, Boston, MA, USA
- Daniel J Lee
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Brigham and Women's Hospital, Division of Genetics, Harvard Medical School, Boston, MA, USA
- Joshua S Lichtman
  - NGM Biopharmaceuticals Inc., South San Francisco, CA, USA
  - Soleil Labs, South San Francisco, CA, USA
- Harding H Luan
  - NGM Biopharmaceuticals Inc., South San Francisco, CA, USA
  - Soleil Labs, South San Francisco, CA, USA
- Shamil R Sunyaev
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Brigham and Women's Hospital, Division of Genetics, Harvard Medical School, Boston, MA, USA
9
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the discrete-time Wright-Fisher model to biobank-scale datasets. Genetics 2023; 225:iyad168. [PMID: 37724741 PMCID: PMC10627256 DOI: 10.1093/genetics/iyad168] [Received: 06/01/2023] [Revised: 06/01/2023] [Accepted: 09/08/2023] [Indexed: 09/21/2023]
Abstract
The discrete-time Wright-Fisher (DTWF) model and its diffusion limit are central to population genetics. These models can describe the forward-in-time evolution of allele frequencies in a population resulting from genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large samples or in the presence of strong selection. Existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here, we present a scalable algorithm that approximates the DTWF model with provably bounded error. Our approach relies on two key observations about the DTWF model. The first is that transition probabilities under the model are approximately sparse. The second is that transition distributions for similar starting allele frequencies are extremely close as distributions. Together, these observations enable approximate matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the tens of millions, paving the way for rigorous biobank-scale inference. Finally, we use our results to estimate the impact of larger samples on estimating selection coefficients for loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
Affiliation(s)
- Jeffrey P Spence
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Tony Zeng
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Jonathan K Pritchard
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Department of Biology, Stanford University, Stanford, CA 94305, USA
10
Adams CJ, Conery M, Auerbach BJ, Jensen ST, Mathieson I, Voight BF. Regularized sequence-context mutational trees capture variation in mutation rates across the human genome. PLoS Genet 2023; 19:e1010807. [PMID: 37418489 PMCID: PMC10355397 DOI: 10.1371/journal.pgen.1010807] [Received: 12/12/2022] [Revised: 07/19/2023] [Accepted: 06/01/2023] [Indexed: 07/09/2023]
Abstract
Germline mutation is the mechanism by which genetic variation in a population is created. Inferences derived from mutation rate models are fundamental to many population genetics methods. Previous models have demonstrated that nucleotides flanking polymorphic sites (the local sequence context) explain variation in the probability that a site is polymorphic. However, limitations to these models exist as the size of the local sequence context window expands. These include a lack of robustness to data sparsity at typical sample sizes, lack of regularization to generate parsimonious models and lack of quantified uncertainty in estimated rates to facilitate comparison between models. To address these limitations, we developed Baymer, a regularized Bayesian hierarchical tree model that captures the heterogeneous effect of sequence contexts on polymorphism probabilities. Baymer implements an adaptive Metropolis-within-Gibbs Markov Chain Monte Carlo sampling scheme to estimate the posterior distributions of sequence-context based probabilities that a site is polymorphic. We show that Baymer accurately infers polymorphism probabilities and well-calibrated posterior distributions, robustly handles data sparsity, appropriately regularizes to return parsimonious models, and scales computationally at least up to 9-mer context windows. We demonstrate application of Baymer in three ways: first, identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, in a sparse data setting to examine the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, comparing model concordance between different great ape species. We find a shared context-dependent mutation rate architecture underlying our models, enabling a transfer-learning inspired strategy for modeling germline mutations.
In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of the available data.
Affiliation(s)
- Christopher J. Adams
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Mitchell Conery
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Benjamin J. Auerbach
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Shane T. Jensen
  - Department of Statistics and Data Science, The Wharton School at the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Iain Mathieson
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Benjamin F. Voight
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
11
Wade EE, Kyriazis CC, Cavassim MIA, Lohmueller KE. Quantifying the fraction of new mutations that are recessive lethal. Evolution 2023; 77:1539-1549. [PMID: 37074880 PMCID: PMC10309970 DOI: 10.1093/evolut/qpad061] [Received: 09/23/2022] [Revised: 03/21/2023] [Accepted: 04/14/2023] [Indexed: 04/20/2023]
Abstract
The presence and impact of recessive lethal mutations have been widely documented in diploid outcrossing species. However, precise estimates of the proportion of new mutations that are recessive lethal remain limited. Here, we evaluate the performance of Fit∂a∂i, a commonly used method for inferring the distribution of fitness effects (DFE), in the presence of lethal mutations. Using simulations, we demonstrate that in both additive and recessive cases, inference of the deleterious nonlethal portion of the DFE is minimally affected by a small proportion (<10%) of lethal mutations. Additionally, we demonstrate that while Fit∂a∂i cannot estimate the fraction of recessive lethal mutations, Fit∂a∂i can accurately infer the fraction of additive lethal mutations. Finally, as an alternative approach to estimate the proportion of mutations that are recessive lethal, we employ models of mutation-selection-drift balance using existing genomic parameters and estimates of segregating recessive lethals for humans and Drosophila melanogaster. In both species, the segregating recessive lethal load can be explained by a very small fraction (<1%) of new nonsynonymous mutations being recessive lethal. Our results refute recent assertions of a much higher proportion of mutations being recessive lethal (4%-5%), while highlighting the need for additional information on the joint distribution of selection and dominance coefficients.
Affiliation(s)
- Emma E Wade
  - Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
  - Department of Computer Science and Engineering, Mississippi State University, Starkville, MS, United States
- Christopher C Kyriazis
  - Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
- Maria Izabel A Cavassim
  - Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
- Kirk E Lohmueller
  - Department of Ecology and Evolutionary Biology, University of California–Los Angeles, Los Angeles, CA, United States
  - Interdepartmental Program in Bioinformatics, University of California–Los Angeles, Los Angeles, CA, United States
  - Department of Human Genetics, David Geffen School of Medicine, University of California–Los Angeles, Los Angeles, CA, United States
12
Zeng T, Spence JP, Mostafavi H, Pritchard JK. Bayesian estimation of gene constraint from an evolutionary model with gene features. Research Square 2023:rs.3.rs-3012879. [PMID: 37398424 PMCID: PMC10312940 DOI: 10.21203/rs.3.rs-3012879/v1] [Indexed: 07/04/2023]
Abstract
Measures of selective constraint on genes have been used for many applications including clinical interpretation of rare coding variants, disease gene discovery, and studies of genome evolution. However, widely-used metrics are severely underpowered at detecting constraint for the shortest ~25% of genes, potentially causing important pathogenic mutations to be overlooked. We developed a framework combining a population genetics model with machine learning on gene features to enable accurate inference of an interpretable constraint metric, s_het. Our estimates outperform existing metrics for prioritizing genes important for cell essentiality, human disease, and other phenotypes, especially for short genes. Our new estimates of selective constraint should have wide utility for characterizing genes relevant to human disease. Finally, our inference framework, GeneBayes, provides a flexible platform that can improve estimation of many gene-level properties, such as rare variant burden or gene expression differences.
Affiliation(s)
- Tony Zeng
  - Department of Genetics, Stanford University, Stanford CA
- Jonathan K. Pritchard
  - Department of Genetics, Stanford University, Stanford CA
  - Department of Biology, Stanford University, Stanford CA
13
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the Discrete-time Wright Fisher model to biobank-scale datasets. bioRxiv 2023:2023.05.19.541517. [PMID: 37293115 PMCID: PMC10245735 DOI: 10.1101/2023.05.19.541517] [Indexed: 06/10/2023]
Abstract
The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
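To make the "usual quadratic" baseline concrete, the following is a minimal sketch of the naive computation that the abstract contrasts with, not the authors' algorithm. It assumes a neutral haploid population with no mutation or selection, and the names `dtwf_transition_matrix` and `step` are hypothetical, chosen here for illustration.

```python
from math import comb


def dtwf_transition_matrix(pop_size):
    """Neutral DTWF transition matrix for a haploid population (assumed setup).

    Entry [i][j] is the Binomial(pop_size, i / pop_size) probability that an
    allele present in i copies now is present in j copies next generation.
    """
    matrix = []
    for i in range(pop_size + 1):
        p = i / pop_size
        row = [
            comb(pop_size, j) * p**j * (1.0 - p) ** (pop_size - j)
            for j in range(pop_size + 1)
        ]
        matrix.append(row)
    return matrix


def step(dist, matrix):
    """Advance a distribution over allele counts by one generation.

    This is the naive vector-matrix product, quadratic in the population size.
    """
    size = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]


pop_size = 20
matrix = dtwf_transition_matrix(pop_size)
dist = [0.0] * (pop_size + 1)
dist[5] = 1.0  # start with the allele in exactly 5 copies
for _ in range(10):
    dist = step(dist, matrix)

# Under neutrality the expected allele count does not change over time.
mean_count = sum(count * prob for count, prob in enumerate(dist))
```

Each call to `step` touches every entry of the dense matrix, which is why this direct approach becomes impractical at the population sizes discussed in the abstract.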
Affiliation(s)
- Tony Zeng
- Department of Genetics, Stanford University
- Jonathan K. Pritchard
- Department of Genetics, Stanford University
- Department of Biology, Stanford University
14
Si Y, Zöllner S. Inferring CpG methylation signatures accumulated along human history from genetic variation catalogs. bioRxiv 2023:2023.03.24.534151. [PMID: 36993375 PMCID: PMC10055312 DOI: 10.1101/2023.03.24.534151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/31/2023]
Abstract
Understanding the DNA methylation patterns in the human genome is a key step to decipher gene regulatory mechanisms and model mutation rate heterogeneity in the human genome. While methylation rates can be measured e.g. with bisulfite sequencing, such measures do not capture historical patterns. Here we present a new method, Methylation Hidden Markov Model (MHMM), to estimate the accumulated germline methylation signature in human population history leveraging two properties: (1) Mutation rates of cytosine to thymine transitions at methylated CG dinucleotides are orders of magnitude higher than that in the rest of the genome. (2) Methylation levels are locally correlated, so the allele frequencies of neighboring CpGs can be used jointly to estimate methylation status. We applied MHMM to allele frequencies from the TOPMed and the gnomAD genetic variation catalogs. Our estimates are consistent with whole genome bisulfite sequencing (WGBS) measured human germ cell methylation levels at 90% of CpG sites, but we also identified ~442,000 historically methylated CpG sites that could not be captured due to sample genetic variation, and inferred methylation status for ~721,000 CpG sites that were missing from WGBS. Hypo-methylated regions identified by combining our results with experimental measures are 1.7 times more likely to recover known active genomic regions than those identified by WGBS alone. Our estimated historical methylation status can be leveraged to enhance bioinformatic analysis of germline methylation such as annotating regulatory and inactivated genomic regions and provide insights in sequence evolution including predicting mutation constraint.
Affiliation(s)
- Yichen Si
- Department of Biostatistics, University of Michigan
- Sebastian Zöllner
- Department of Biostatistics, University of Michigan
- Department of Psychiatry, University of Michigan
15
Agarwal I, Fuller ZL, Myers SR, Przeworski M. Relating pathogenic loss-of-function mutations in humans to their evolutionary fitness costs. eLife 2023; 12:e83172. [PMID: 36648429 PMCID: PMC9937649 DOI: 10.7554/elife.83172] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 09/01/2022] [Accepted: 01/16/2023] [Indexed: 01/18/2023] Open
Abstract
Causal loss-of-function (LOF) variants for Mendelian and severe complex diseases are enriched in 'mutation intolerant' genes. We show how such observations can be interpreted in light of a model of mutation-selection balance and use the model to relate the pathogenic consequences of LOF mutations at present to their evolutionary fitness effects. To this end, we first infer posterior distributions for the fitness costs of LOF mutations in 17,318 autosomal and 679 X-linked genes from exome sequences in 56,855 individuals. Estimated fitness costs for the loss of a gene copy are typically above 1%; they tend to be largest for X-linked genes, whether or not they have a Y homolog, followed by autosomal genes and genes in the pseudoautosomal region. We compare inferred fitness effects for all possible de novo LOF mutations to those of de novo mutations identified in individuals diagnosed with one of six severe, complex diseases or developmental disorders. Probands carry an excess of mutations with estimated fitness effects above 10%; as we show by simulation, when sampled in the population, such highly deleterious mutations are typically only a couple of generations old. Moreover, the proportion of highly deleterious mutations carried by probands reflects the typical age of onset of the disease. The study design also has a discernible influence: a greater proportion of highly deleterious mutations is detected in pedigree than case-control studies, and for autism, in simplex than multiplex families and in female versus male probands. Thus, anchoring observations in human genetics to a population genetic model allows us to learn about the fitness effects of mutations identified by different mapping strategies and for different traits.
Affiliation(s)
- Ipsita Agarwal
- Department of Biological Sciences, Columbia University, New York, United States
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Zachary L Fuller
- Department of Biological Sciences, Columbia University, New York, United States
- Simon R Myers
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- The Wellcome Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Molly Przeworski
- Department of Biological Sciences, Columbia University, New York, United States
- Department of Systems Biology, Columbia University, New York, United States
16
Tan DS, Cheung SL, Gao Y, Weinbuch M, Hu H, Shi L, Ti SC, Hutchins AP, Cojocaru V, Jauch R. The homeodomain of Oct4 is a dimeric binder of methylated CpG elements. Nucleic Acids Res 2023; 51:1120-1138. [PMID: 36631980 PMCID: PMC9943670 DOI: 10.1093/nar/gkac1262] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/20/2022] [Revised: 12/14/2022] [Accepted: 12/19/2022] [Indexed: 01/13/2023] Open
Abstract
Oct4 is essential to maintain pluripotency and has a pivotal role in establishing the germline. Its DNA-binding POU domain was recently found to bind motifs with methylated CpG elements normally associated with epigenetic silencing. However, the mode of binding and the consequences of this capability have remained unclear. Here, we show that Oct4 binds to a compact palindromic DNA element with a methylated CpG core (CpGpal) in alternative states of pluripotency and during cellular reprogramming towards induced pluripotent stem cells (iPSCs). During cellular reprogramming, typical Oct4 bound enhancers are uniformly demethylated, with the prominent exception of the CpGpal sites where DNA methylation is often maintained. We demonstrate that Oct4 cooperatively binds the CpGpal element as a homodimer, which contrasts with the ectoderm-expressed POU factor Brn2. Indeed, binding to CpGpal is Oct4-specific as other POU factors expressed in somatic cells avoid this element. Binding assays combined with structural analyses and molecular dynamic simulations show that dimeric Oct4-binding to CpGpal is driven by the POU-homeodomain whilst the POU-specific domain is detached from DNA. Collectively, we report that Oct4 exerts parts of its regulatory function in the context of methylated DNA through a DNA recognition mechanism that solely relies on its homeodomain.
Affiliation(s)
- Daisylyn Senna Tan
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Shun Lai Cheung
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Ya Gao
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Maike Weinbuch
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Institute for Molecular Medicine, Ulm University, Ulm, Germany
- Haoqing Hu
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Liyang Shi
- Shenzhen Key Laboratory of Gene Regulation and Systems Biology, Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Shih-Chieh Ti
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Andrew P Hutchins
- Shenzhen Key Laboratory of Gene Regulation and Systems Biology, Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Vlad Cojocaru
- STAR-UBB Institute, Babeş-Bolyai University, Cluj-Napoca, Romania
- Computational Structural Biology Group, Utrecht University, The Netherlands
- Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Ralf Jauch
- To whom correspondence should be addressed. Tel: +852 3917 9511; Fax: +852 28559730;
17
Árnason E, Koskela J, Halldórsdóttir K, Eldon B. Sweepstakes reproductive success via pervasive and recurrent selective sweeps. eLife 2023; 12:80781. [PMID: 36806325 PMCID: PMC9940914 DOI: 10.7554/elife.80781] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/03/2022] [Accepted: 12/28/2022] [Indexed: 02/22/2023] Open
Abstract
Highly fecund natural populations characterized by high early mortality abound, yet our knowledge about their recruitment dynamics is somewhat rudimentary. This knowledge gap has implications for our understanding of genetic variation, population connectivity, local adaptation, and the resilience of highly fecund populations. The concept of sweepstakes reproductive success, which posits a considerable variance and skew in individual reproductive output, is key to understanding the distribution of individual reproductive success. However, it still needs to be determined whether highly fecund organisms reproduce through sweepstakes and, if they do, the relative roles of neutral and selective sweepstakes. Here, we use coalescent-based statistical analysis of population genomic data to show that selective sweepstakes likely explain recruitment dynamics in the highly fecund Atlantic cod. We show that the Kingman coalescent (modelling no sweepstakes) and the Xi-Beta coalescent (modelling random sweepstakes), including complex demography and background selection, do not provide an adequate fit for the data. The Durrett-Schweinsberg coalescent, in which selective sweepstakes result from recurrent and pervasive selective sweeps of new mutations, offers greater explanatory power. Our results show that models of sweepstakes reproduction and multiple-merger coalescents are relevant and necessary for understanding genetic diversity in highly fecund natural populations. These findings have fundamental implications for understanding the recruitment variation of fish stocks and general evolutionary genomics of high-fecundity organisms.
Affiliation(s)
- Einar Árnason
- Institute of Life- and environmental Sciences, University of Iceland, Reykjavik, Iceland
- Department of Organismal and Evolutionary Biology, Harvard University, Cambridge, United States
- Jere Koskela
- Department of Statistics, University of Warwick, Coventry, United Kingdom
- Katrín Halldórsdóttir
- Institute of Life- and environmental Sciences, University of Iceland, Reykjavik, Iceland
- Bjarki Eldon
- Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin, Germany
18
Bolognesi G, Bacalini MG, Pirazzini C, Garagnani P, Giuliani C. Evolutionary Implications of Environmental Toxicant Exposure. Biomedicines 2022; 10:3090. [PMID: 36551846 PMCID: PMC9775150 DOI: 10.3390/biomedicines10123090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Revised: 11/25/2022] [Accepted: 11/27/2022] [Indexed: 12/03/2022] Open
Abstract
Homo sapiens have been exposed to various toxins and harmful compounds that change according to various phases of human evolution. Population genetics studies showed that such exposures lead to adaptive genetic changes; while observing present exposures to different toxicants, the first molecular mechanism that confers plasticity is epigenetic remodeling and, in particular, DNA methylation variation, a molecular mechanism proposed for medium-term adaptation. A large amount of scientific literature from clinical and medical studies revealed the high impact of such exposure on human biology; thus, in this review, we examine and infer the impact that different environmental toxicants may have in shaping human evolution. We first describe how environmental toxicants shape natural human variation in terms of genetic and epigenetic diversity, and then we describe how DNA methylation may influence mutation rate and, thus, genetic variability. We describe the impact of these substances on biological fitness in terms of reproduction and survival, and in conclusion, we focus on their effect on brain evolution and physiology.
Affiliation(s)
- Giorgia Bolognesi
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), University of Bologna, via San Giacomo 12, 40126 Bologna, Italy
- Laboratory of Molecular Anthropology, Centre for Genome Biology, Department of Biological, Geological and Environmental Sciences, University of Bologna, via Francesco Selmi 3, 40126 Bologna, Italy
- Maria Giulia Bacalini
- IRCCS Istituto Delle Scienze Neurologiche di Bologna, via Altura 3, 40139 Bologna, Italy
- Chiara Pirazzini
- IRCCS Istituto Delle Scienze Neurologiche di Bologna, via Altura 3, 40139 Bologna, Italy
- Paolo Garagnani
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), University of Bologna, via San Giacomo 12, 40126 Bologna, Italy
- Cristina Giuliani
- Laboratory of Molecular Anthropology, Centre for Genome Biology, Department of Biological, Geological and Environmental Sciences, University of Bologna, via Francesco Selmi 3, 40126 Bologna, Italy
19
Dhindsa RS, Wang Q, Vitsios D, Burren OS, Hu F, DiCarlo JE, Kruglyak L, MacArthur DG, Hurles ME, Petrovski S. A minimal role for synonymous variation in human disease. Am J Hum Genet 2022; 109:2105-2109. [PMID: 36459978 PMCID: PMC9808499 DOI: 10.1016/j.ajhg.2022.10.016] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 12/03/2022] Open
Abstract
Synonymous mutations change the DNA sequence of a gene without affecting the amino acid sequence of the encoded protein. Although some synonymous mutations can affect RNA splicing, translational efficiency, and mRNA stability, studies in human genetics, mutagenesis screens, and other experiments and evolutionary analyses have repeatedly shown that most synonymous variants are neutral or only weakly deleterious, with some notable exceptions. Based on a recent study in yeast, there have been claims that synonymous mutations could be as important as nonsynonymous mutations in causing disease, assuming the yeast findings hold up and translate to humans. Here, we argue that there is insufficient evidence to overturn the large, coherent body of knowledge establishing the predominant neutrality of synonymous variants in the human genome.
Affiliation(s)
- Ryan S. Dhindsa
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX, USA
- Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Waltham, MA, USA
- Corresponding author
- Quanli Wang
- Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Waltham, MA, USA
- Dimitrios Vitsios
- Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Oliver S. Burren
- Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Fengyuan Hu
- Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- James E. DiCarlo
- Department of Pathology, Brigham and Women’s Hospital, Boston, MA, USA
- Leonid Kruglyak
- Department of Human Genetics and Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Daniel G. MacArthur
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, NSW, Australia
- Centre for Population Genomics, Murdoch Children’s Research Institute, Melbourne, VIC, Australia
- Slavé Petrovski
- Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Department of Medicine, University of Melbourne, Austin Health, Melbourne, VIC, Australia
- Corresponding author
20
Halldorsson BV, Eggertsson HP, Moore KHS, Hauswedell H, Eiriksson O, Ulfarsson MO, Palsson G, Hardarson MT, Oddsson A, Jensson BO, Kristmundsdottir S, Sigurpalsdottir BD, Stefansson OA, Beyter D, Holley G, Tragante V, Gylfason A, Olason PI, Zink F, Asgeirsdottir M, Sverrisson ST, Sigurdsson B, Gudjonsson SA, Sigurdsson GT, Halldorsson GH, Sveinbjornsson G, Norland K, Styrkarsdottir U, Magnusdottir DN, Snorradottir S, Kristinsson K, Sobech E, Jonsson H, Geirsson AJ, Olafsson I, Jonsson P, Pedersen OB, Erikstrup C, Brunak S, Ostrowski SR, Thorleifsson G, Jonsson F, Melsted P, Jonsdottir I, Rafnar T, Holm H, Stefansson H, Saemundsdottir J, Gudbjartsson DF, Magnusson OT, Masson G, Thorsteinsdottir U, Helgason A, Jonsson H, Sulem P, Stefansson K. The sequences of 150,119 genomes in the UK Biobank. Nature 2022; 607:732-740. [PMID: 35859178 PMCID: PMC9329122 DOI: 10.1038/s41586-022-04965-x] [Citation(s) in RCA: 238] [Impact Index Per Article: 79.3] [Received: 11/05/2021] [Accepted: 06/10/2022] [Indexed: 12/25/2022]
Abstract
Detailed knowledge of how diversity in the sequence of the human genome affects phenotypic diversity depends on a comprehensive and reliable characterization of both sequences and phenotypic variation. Over the past decade, insights into this relationship have been obtained from whole-exome sequencing or whole-genome sequencing of large cohorts with rich phenotypic data [1,2]. Here we describe the analysis of whole-genome sequencing of 150,119 individuals from the UK Biobank [3]. This constitutes a set of high-quality variants, including 585,040,410 single-nucleotide polymorphisms, representing 7.0% of all possible human single-nucleotide polymorphisms, and 58,707,036 indels. This large set of variants allows us to characterize selection based on sequence variation within a population through a depletion rank score of windows along the genome. Depletion rank analysis shows that coding exons represent a small fraction of regions in the genome subject to strong sequence conservation. We define three cohorts within the UK Biobank: a large British Irish cohort, a smaller African cohort and a South Asian cohort. A haplotype reference panel is provided that allows reliable imputation of most variants carried by three or more sequenced individuals. We identified 895,055 structural variants and 2,536,688 microsatellites, groups of variants typically excluded from large-scale whole-genome sequencing studies. Using this formidable new resource, we provide several examples of trait associations for rare variants with large effects not found previously through studies based on whole-exome sequencing and/or imputation.
Affiliation(s)
- Bjarni V Halldorsson
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- School of Technology, Reykjavik University, Reykjavik, Iceland
- Magnus O Ulfarsson
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Marteinn T Hardarson
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- School of Technology, Reykjavik University, Reykjavik, Iceland
- Snaedis Kristmundsdottir
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- School of Technology, Reykjavik University, Reykjavik, Iceland
- Brynja D Sigurpalsdottir
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- School of Technology, Reykjavik University, Reykjavik, Iceland
- Helgi Jonsson
- Landspitali-University Hospital, Reykjavik, Iceland
- Faculty of Medicine, School of Health Sciences, University of Iceland, Reykjavik, Iceland
- Palmi Jonsson
- Landspitali-University Hospital, Reykjavik, Iceland
- Faculty of Medicine, School of Health Sciences, University of Iceland, Reykjavik, Iceland
- Ole Birger Pedersen
- Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark
- Christian Erikstrup
- Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Department of Clinical Immunology, Aarhus University Hospital, Aarhus, Denmark
- Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Sisse Rye Ostrowski
- Department of Clinical Immunology, Copenhagen University Hospital (Rigshospitalet), Copenhagen, Denmark
- Department of Clinical Medicine, Faculty of Health and Clinical Sciences, Copenhagen University, Copenhagen, Denmark
- Pall Melsted
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Ingileif Jonsdottir
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- Faculty of Medicine, School of Health Sciences, University of Iceland, Reykjavik, Iceland
- Hilma Holm
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- Daniel F Gudbjartsson
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
- Unnur Thorsteinsdottir
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- Faculty of Medicine, School of Health Sciences, University of Iceland, Reykjavik, Iceland
- Agnar Helgason
- deCODE genetics/Amgen Inc., Reykjavik, Iceland
- Department of Anthropology, University of Iceland, Reykjavik, Iceland
21
Young RS, Talmane L, Marion de Procé S, Taylor MS. The contribution of evolutionarily volatile promoters to molecular phenotypes and human trait variation. Genome Biol 2022; 23:89. [PMID: 35379293 PMCID: PMC8978360 DOI: 10.1186/s13059-022-02634-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/07/2021] [Accepted: 02/16/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Promoters are sites of transcription initiation that harbour a high concentration of phenotype-associated genetic variation. The evolutionary gain and loss of promoters between species (collectively, termed turnover) is pervasive across mammalian genomes and may play a prominent role in driving human phenotypic diversity. RESULTS: We classified human promoters by their evolutionary history during the divergence of mouse and human lineages from a common ancestor. This defined conserved, human-inserted and mouse-deleted promoters, and a class of functional-turnover promoters that align between species but are only active in humans. We show that promoters of all evolutionary categories are hotspots for substitution and often, insertion mutations. Loci with a history of insertion and deletion continue that mode of evolution within contemporary humans. The presence of an evolutionarily volatile promoter within a gene is associated with increased expression variance between individuals, but only in the case of human-inserted and mouse-deleted promoters does that correspond to an enrichment of promoter-proximal genetic effects. Despite the enrichment of these molecular quantitative trait loci (QTL) at evolutionarily volatile promoters, this does not translate into a corresponding enrichment of phenotypic traits mapping to these loci. CONCLUSIONS: Promoter turnover is pervasive in the human genome, and these promoters are rich in molecularly quantifiable but phenotypically inconsequential variation in gene expression. However, since evolutionarily volatile promoters show evidence of selection, coupled with high mutation rates and enrichment of QTLs, this implicates them as a source of evolutionary innovation and phenotypic variation, albeit with a high background of selectively neutral expression variation.
Affiliation(s)
- Robert S Young
- Usher Institute, University of Edinburgh, Teviot Place, Edinburgh, EH8 9AG, UK
- Zhejiang University - University of Edinburgh Institute, Zhejiang University, 718 East Haizhou Road, 314400, Haining, China
- MRC Human Genetics Unit, Institute for Genetics and Cancer, University of Edinburgh, Crewe Road, Edinburgh, EH4 2XU, UK
- Lana Talmane
- MRC Human Genetics Unit, Institute for Genetics and Cancer, University of Edinburgh, Crewe Road, Edinburgh, EH4 2XU, UK
- Sophie Marion de Procé
- Usher Institute, University of Edinburgh, Teviot Place, Edinburgh, EH8 9AG, UK
- MRC Human Genetics Unit, Institute for Genetics and Cancer, University of Edinburgh, Crewe Road, Edinburgh, EH4 2XU, UK
- Martin S Taylor
- MRC Human Genetics Unit, Institute for Genetics and Cancer, University of Edinburgh, Crewe Road, Edinburgh, EH4 2XU, UK