1. Malekpour SA, Kalirad A, Majidian S. Inferring the Selective History of CNVs Using a Maximum Likelihood Model. Genome Biol Evol 2025; 17:evaf050. [PMID: 40100752; PMCID: PMC11950529; DOI: 10.1093/gbe/evaf050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Revised: 02/27/2025] [Accepted: 03/13/2025] [Indexed: 03/20/2025] Open
Abstract
Copy number variations (CNVs), structural variations generated by deletion and/or duplication that result in a change in DNA dosage, are prevalent in nature. CNVs can drastically affect the phenotype of an organism and have been shown both to be involved in genetic disorders and to serve as raw material in adaptive evolution. In contrast to single-nucleotide variations, CNVs often have large and varied effects on phenotype, which hinders our ability to infer their selective advantage from population genetics data. Here, we present a likelihood-based approach, dubbed PoMoCNV (POlymorphism-aware phylogenetic MOdel for CNVs), that estimates evolutionary parameters, such as mutation rates among different copy numbers and relative fitness loss per copy deletion at a genomic locus, from population genetics data. As a case study, we analyze the genomic data of 40 strains of Caenorhabditis elegans, representing four different populations. We take advantage of data on chromatin accessibility to interpret the mutation rate and fitness of copy numbers, as inferred by PoMoCNV, specifically in open or closed chromatin loci. We further test the reliability of PoMoCNV by estimating the evolutionary parameters of CNVs for mutation-accumulation experiments in C. elegans with varying levels of genetic drift.
Affiliation(s)
- Seyed Amir Malekpour
- School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran 19395-5746, Iran
- Ata Kalirad
- Department for Integrative Evolutionary Biology, Max Planck Institute for Biology Tübingen, Tübingen 72076, Germany
- Sina Majidian
- SIB Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
- Department of Computational Biology, University of Lausanne, Lausanne 1015, Switzerland
2. Lin X, Yan C, Wang Y, Huang S, Yu H, Shih C, Jiang J, Xie F. The Genetic Architecture of Local Adaptation and Reproductive Character Displacement in Scutiger boulengeri Complex (Anura: Megophryidae). Mol Ecol 2025; 34:e17611. [PMID: 39681833; DOI: 10.1111/mec.17611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 11/05/2024] [Accepted: 11/26/2024] [Indexed: 12/18/2024]
Abstract
Speciation is a continuous process driven by barriers to gene flow. Based on genome-wide single nucleotide polymorphisms (SNPs) of 190 toads from 31 sampling sites of the Scutiger boulengeri complex, we found evidence for monophyly, representing a continuous speciation process of at least six lineages in S. boulengeri that radiated and exhibited varying degrees of divergence and gene flow. The SNP-based phylogenetic tree was largely discordant with the previously published multilocus mitochondrial tree (i.e., S. mammatus and S. glandulatus nested in the lineages of S. boulengeri). The Min Mountains (MM) and Qinghai-Tibet Plateau (QTP) lineages differ fundamentally in habitat (i.e., elevation) and morphology (i.e., snout-vent length, SVL), and we detected signatures of potential high-altitude and cold adaptation genes in QTP (vs. MM). We found evidence that reproductive trait disparity (i.e., SVL and nuptial pads) is key to promoting sympatric rather than allopatric species pairs. In addition, we identified selection signals for genes related to sympatric character displacement, genes linked to obesity-related traits, nuptial spine morphology and enlarged chest nuptial pads in S. mammatus (vs. QTP group of S. boulengeri). Our study provides new insight into, and a paradigm for, a varied speciation pattern ranging from local adaptation in allopatry to sympatric character displacement in the S. boulengeri complex.
Affiliation(s)
- Xiuqin Lin
- CAS Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization and Ecological Restoration Biodiversity Conservation Key Laboratory of Sichuan Province, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
- Chaochao Yan
- CAS Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization and Ecological Restoration Biodiversity Conservation Key Laboratory of Sichuan Province, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
- University of Chinese Academy of Sciences, Beijing, China
- Yuanfei Wang
- CAS Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization and Ecological Restoration Biodiversity Conservation Key Laboratory of Sichuan Province, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
- University of Chinese Academy of Sciences, Beijing, China
- Sining Huang
- CAS Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization and Ecological Restoration Biodiversity Conservation Key Laboratory of Sichuan Province, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
- University of Chinese Academy of Sciences, Beijing, China
- Haoqi Yu
- CAS Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization and Ecological Restoration Biodiversity Conservation Key Laboratory of Sichuan Province, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
- University of Chinese Academy of Sciences, Beijing, China
- Chungkun Shih
- College of Life Sciences, Capital Normal University, Beijing, China
- Department of Paleobiology, National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
- Jianping Jiang
- CAS Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization and Ecological Restoration Biodiversity Conservation Key Laboratory of Sichuan Province, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
- University of Chinese Academy of Sciences, Beijing, China
- Mangkang Ecological Station, Tibet Ecological Safety Monitor Network, Changdu, China
- Feng Xie
- CAS Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization and Ecological Restoration Biodiversity Conservation Key Laboratory of Sichuan Province, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
- University of Chinese Academy of Sciences, Beijing, China
3. Vieira AR, de Sousa F, Bilro J, Viegas MB, Svanbäck R, Gordo LS, Paulo OS. Mitochondrial genomes of the European sardine (Sardina pilchardus) reveal Pliocene diversification, extensive gene flow and pervasive purifying selection. Sci Rep 2024; 14:30977. [PMID: 39730618; DOI: 10.1038/s41598-024-82054-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Accepted: 12/02/2024] [Indexed: 12/29/2024] Open
Abstract
The development of management strategies for the promotion of sustainable fisheries relies on a deep knowledge of the ecological and evolutionary processes driving the diversification and genetic variation of marine organisms. Sustainability strategies are especially relevant for marine species such as the European sardine (Sardina pilchardus), a small pelagic fish with high ecological and socioeconomic importance, especially in Southern Europe, whose stock has declined since 2006, possibly due to environmental factors. Here, we generated sequences for 139 mitochondrial genomes from individuals from 19 different geographical locations across most of the species' distribution range, which were used to assess genetic diversity, diversification history and genomic signatures of selection. Our data supported extensive gene flow in European sardine. However, phylogenetic analyses of mitogenomes revealed diversification patterns related to climate shifts in the late Miocene and Pliocene that may indicate past divergence related to rapid demographic expansion. Tests of selection showed a significant signature of purifying selection, but positive selection was also detected in different sites and specific mitochondrial lineages. Our results showed that European sardine diversification has been strongly driven by climate shifts, and rapid changes in marine environmental conditions are likely to strongly affect the distribution and stock size of this species.
Affiliation(s)
- Ana Rita Vieira
- MARE - Marine and Environmental Sciences Centre & ARNET - Aquatic Research Network, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal.
- Departamento de Biologia Animal, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal.
- Filipe de Sousa
- Departamento de Biologia Animal, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal
- cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- João Bilro
- cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- Mariana Bray Viegas
- cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- Richard Svanbäck
- Department of Ecology and Genetics, Section of Animal Ecology, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, 75236, Uppsala, Sweden
- Leonel S Gordo
- MARE - Marine and Environmental Sciences Centre & ARNET - Aquatic Research Network, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal
- Departamento de Biologia Animal, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal
- Octávio S Paulo
- Departamento de Biologia Animal, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016, Lisboa, Portugal
- cE3c - Centre for Ecology, Evolution and Environmental Changes & CHANGE - Global Change and Sustainability Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
4. Kaj I, Mugal CF, Müller-Widmann R. A Wright-Fisher graph model and the impact of directional selection on genetic variation. Theor Popul Biol 2024; 159:13-24. [PMID: 39019334; DOI: 10.1016/j.tpb.2024.07.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Revised: 07/06/2024] [Accepted: 07/12/2024] [Indexed: 07/19/2024]
Abstract
We introduce a multi-allele Wright-Fisher model with mutation and selection such that allele frequencies at a single locus are traced by the path of a hybrid jump-diffusion process. The state space of the process is given by the vertices and edges of a topological graph, i.e. edges are unit intervals. Vertices represent monomorphic population states and positions on the edges mark the biallelic proportions of ancestral and derived alleles during polymorphic segments. In this setting, mutations can only occur at monomorphic loci. We derive the stationary distribution in mutation-selection-drift equilibrium and obtain the expected allele frequency spectrum under large population size scaling. For the extended model with multiple independent loci we derive rigorous upper bounds for a wide class of associated measures of genetic variation. Within this framework we present mathematically precise arguments to conclude that the presence of directional selection reduces the magnitude of genetic variation, as constrained by the bounds for neutral evolution.
Affiliation(s)
- Ingemar Kaj
- Department of Mathematics, Uppsala University, Uppsala, Sweden.
- Carina F Mugal
- Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden; Laboratory of Biometry and Evolutionary Biology, University of Lyon 1, UMR CNRS 5558, Villeurbanne, France
5. Braichenko S, Borges R, Kosiol C. Polymorphism-Aware Models in RevBayes: Species Trees, Disentangling Balancing Selection, and GC-Biased Gene Conversion. Mol Biol Evol 2024; 41:msae138. [PMID: 38980178; PMCID: PMC11272101; DOI: 10.1093/molbev/msae138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 04/19/2024] [Accepted: 07/06/2024] [Indexed: 07/10/2024] Open
Abstract
The role of balancing selection is a long-standing evolutionary puzzle. Balancing selection is a crucial evolutionary process that maintains genetic variation (polymorphism) over extended periods of time; however, detecting it poses a significant challenge. Building upon the Polymorphism-aware phylogenetic Models (PoMos) framework rooted in the Moran model, we introduce a PoMoBalance model. This novel approach is designed to disentangle the interplay of mutation, genetic drift, and directional selection (GC-biased gene conversion), along with the previously unexplored balancing selection pressures on ultra-long timescales comparable with species divergence times by analyzing multi-individual genomic and phylogenetic divergence data. Implemented in the open-source RevBayes Bayesian framework, PoMoBalance offers a versatile tool for inferring phylogenetic trees as well as quantifying various selective pressures. The novel aspect of our approach in studying balancing selection lies in polymorphism-aware phylogenetic models' ability to account for ancestral polymorphisms and incorporate parameters that measure frequency-dependent selection, allowing us to determine the strength of the effect and exact frequencies under selection. We implemented validation tests and assessed the model on the data simulated with SLiM and a custom Moran model simulator. Real sequence analysis of Drosophila populations reveals insights into the evolutionary dynamics of regions subject to frequency-dependent balancing selection, particularly in the context of sex-limited color dimorphism in Drosophila erecta.
Affiliation(s)
- Svitlana Braichenko
- Centre for Biological Diversity, School of Biology, University of St Andrews, Fife KY16 9TH, UK
- Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK
- Rui Borges
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien 1210, Austria
- Carolin Kosiol
- Centre for Biological Diversity, School of Biology, University of St Andrews, Fife KY16 9TH, UK
6. Schraiber JG, Edge MD, Pennell M. Unifying approaches from statistical genetics and phylogenetics for mapping phenotypes in structured populations. bioRxiv 2024:2024.02.10.579721. [PMID: 38496530; PMCID: PMC10942266; DOI: 10.1101/2024.02.10.579721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique, namely including both the genetic relatedness matrix (GRM) and its leading eigenvectors (corresponding to the principal components of the genotype matrix) in a regression model, can mitigate spurious correlations in phylogenetic analyses. As a case study of this, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including covariance matrix eigenvectors as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
7. Becker D, Champredon D, Chato C, Gugan G, Poon A. SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications. NAR Genom Bioinform 2023; 5:lqad038. [PMID: 37101658; PMCID: PMC10124968; DOI: 10.1093/nargab/lqad038] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2022] [Revised: 02/15/2023] [Accepted: 04/06/2023] [Indexed: 04/28/2023] Open
Abstract
Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they are known without error. Next generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty that naturally lead to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these re-sampled sequences will include a more complete evaluation of the error involved in such analyses. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply and the clock rate estimates for SARS-CoV-2 are much more variable than reported.
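The resampling idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the Phred conversion p = 10^(-Q/10) is the standard definition, and the even split of the error probability across the three alternative bases is an assumption made here for illustration only.

```python
import random

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def probability_matrix(seq, quals):
    """Build one row per position holding the probabilities of A, C, G, T."""
    matrix = []
    for base, q in zip(seq, quals):
        err = phred_to_error(q)
        # Assumption: the error mass is split evenly over the other three bases.
        matrix.append([1 - err if b == base else err / 3 for b in BASES])
    return matrix

def resample(matrix, rng):
    """Draw one plausible sequence from the per-position distributions."""
    return "".join(rng.choices(BASES, weights=row)[0] for row in matrix)

rng = random.Random(0)
# Hypothetical read: three confident calls (Q40) and one poor call (Q3).
matrix = probability_matrix("ACGT", [40, 40, 3, 40])
replicates = [resample(matrix, rng) for _ in range(1000)]
```

Each replicate can then be passed through the downstream analysis, so that the spread of the results reflects the uncertainty in the base calls. Ambiguity codes such as N are not handled in this sketch.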
Affiliation(s)
- Devan Becker
- To whom correspondence should be addressed. Tel: +1 519 884 1970 (Ext 2464);
- Connor Chato
- Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Gopi Gugan
- Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Art Poon
- Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
8. Hibbins MS, Breithaupt LC, Hahn MW. Phylogenomic comparative methods: Accurate evolutionary inferences in the presence of gene tree discordance. Proc Natl Acad Sci U S A 2023; 120:e2220389120. [PMID: 37216509; PMCID: PMC10235958; DOI: 10.1073/pnas.2220389120] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/30/2022] [Accepted: 04/24/2023] [Indexed: 05/24/2023] Open
Abstract
Phylogenetic comparative methods have long been a mainstay of evolutionary biology, allowing for the study of trait evolution across species while accounting for their common ancestry. These analyses typically assume a single, bifurcating phylogenetic tree describing the shared history among species. However, modern phylogenomic analyses have shown that genomes are often composed of mosaic histories that can disagree both with the species tree and with each other (so-called discordant gene trees). These gene trees describe shared histories that are not captured by the species tree, and that are therefore unaccounted for in classic comparative approaches. The application of standard comparative methods to species histories containing discordance leads to incorrect inferences about the timing, direction, and rate of evolution. Here, we develop two approaches for incorporating gene tree histories into comparative methods: one that constructs an updated phylogenetic variance-covariance matrix from gene trees, and another that applies Felsenstein's pruning algorithm over a set of gene trees to calculate trait histories and likelihoods. Using simulation, we demonstrate that our approaches generate much more accurate estimates of tree-wide rates of trait evolution than standard methods. We apply our methods to two clades of the wild tomato genus Solanum with varying rates of discordance, demonstrating the contribution of gene tree discordance to variation in a set of floral traits. Our approaches have the potential to be applied to a broad range of classic inference problems in phylogenetics, including ancestral state reconstruction and the inference of lineage-specific rate shifts.
Affiliation(s)
- Mark S Hibbins
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, ON M5S 3B2, Canada
- Department of Biology, Indiana University, Bloomington, IN 47405
- Lara C Breithaupt
- Department of Biology, Indiana University, Bloomington, IN 47405
- Department of Computer Science, Duke University, Durham, NC 27710
- Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, IN 47405
- Department of Computer Science, Indiana University, Bloomington, IN 47405
9. Stiller J, Wilson NG, Rouse GW. Range-wide population genomics of common seadragons shows secondary contact over a former barrier and insights on illegal capture. BMC Biol 2023; 21:129. [PMID: 37248474; DOI: 10.1186/s12915-023-01628-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 05/16/2023] [Indexed: 05/31/2023] Open
Abstract
BACKGROUND: Common seadragons (Phyllopteryx taeniolatus, Syngnathidae) are an emblem of the diverse endemic fauna of Australia's southern rocky reefs, the newly recognized "Great Southern Reef." A lack of assessments spanning this global biodiversity hotspot in its entirety is currently hampering an understanding of the factors that have contributed to its diversity. The common seadragon has a wide range across Australia's entire temperate south and includes a geogenetic break over a former land bridge, which has called its status as a single species into question. As a popular aquarium display that sells for high prices, common seadragons are also vulnerable to illegal capture.
RESULTS: Here, we provide range-wide nuclear sequences (986 variable Ultraconserved Elements) for 198 individuals and mitochondrial genomes for 140 individuals to assess species status, identify genetic units and their diversity, and trace the source of two poached individuals. Using published data of the other two seadragon species, we found that lineages of common seadragons have diverged relatively recently (< 0.63 Ma). Within common seadragons, we found pronounced genetic structure, falling into three major groups in the western, central, and eastern parts of the range. While populations across the Bassian Isthmus were divergent, there is also evidence for secondary contact since the passage opened. We found a strong cline of genetic diversity from the range center tapering symmetrically towards the range peripheries. Based on their genetic similarities, the poached individuals were inferred to have originated from around Albany in southwestern Australia.
CONCLUSIONS: We conclude that common seadragons constitute a single species with strong geographic structure but coherence through gene flow. The low genetic diversity on the east and west coasts is concerning given that these areas are projected to face fast climate change. Our results suggest that in addition to their life history, geological events and demographic expansions have all played a role in shaping populations in the temperate south. These insights are an important step towards understanding the historical determinants of the diversity of species endemic to the Great Southern Reef.
Affiliation(s)
- Josefin Stiller
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, 92093, USA.
- Centre for Biodiversity Genomics, University of Copenhagen, 2100, Copenhagen, Denmark.
- Nerida G Wilson
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, 92093, USA
- Research & Collections, Western Australian Museum, Perth, Western Australia, 6106, Australia
- School of Biological Sciences, University of Western Australia, Perth, Western Australia, 6009, Australia
- Greg W Rouse
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, 92093, USA.
10. Catalan A, Höhna S, Lower SE, Duchen P. Inferring the demographic history of the North American firefly Photinus pyralis. J Evol Biol 2022; 35:1488-1499. [PMID: 36168726; DOI: 10.1111/jeb.14094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Revised: 06/13/2022] [Accepted: 07/11/2022] [Indexed: 11/28/2022]
Abstract
The firefly Photinus pyralis inhabits a wide range of latitudinal and ecological niches, with populations living from temperate to tropical habitats. Despite its broad distribution, its demographic history is unknown. In this study, we modelled and inferred different demographic scenarios for North American populations of P. pyralis, which were collected from Texas to New Jersey. We used a combination of ABC techniques (for multi-population/colonization analyses) and likelihood inference (dadi, StairwayPlot2, PoMo) for single-population demographic inference, which proved useful with our RAD data. We found that the most ancestral North American population lies in Texas, and that it subsequently colonized the Central region of the US and, more recently, the North Eastern coast. Our study confidently rejects a demographic scenario where the North Eastern populations colonized more southern populations until reaching Texas. To estimate the age of divergence of P. pyralis, which provides deeper insights into the history of the entire species, we assembled a multi-locus phylogenetic data set covering the genus Photinus. We found that the phylogenetic node leading to P. pyralis lies at the end of the Miocene. Importantly, modelling the demographic history of North American P. pyralis serves as a null model of nucleotide diversity patterns in a widespread native insect species, which will serve in future studies for the detection of adaptation events in this firefly species, as well as a comparison for future studies of other North American insect taxa.
Affiliation(s)
- Ana Catalan
- Division of Evolutionary Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Sebastian Höhna
- GeoBio-Center, Ludwig-Maximilians-Universität München, Munich, Germany
- Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
- Sarah E Lower
- Department of Biology, Bucknell University, Lewisburg, PA, USA
- Pablo Duchen
- Institute for Organismal and Molecular Evolutionary Biology, Johannes Gutenberg University of Mainz, Mainz, Germany
11. Borges R, Boussau B, Höhna S, Pereira RJ, Kosiol C. Polymorphism‐aware estimation of species trees and evolutionary forces from genomic sequences with RevBayes. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Affiliation(s)
- Rui Borges
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria
- Bastien Boussau
- Université de Lyon, Université Claude Bernard Lyon 1, Villeurbanne, France
- Sebastian Höhna
- GeoBio‐Center, Ludwig‐Maximilians‐Universität München, Munich, Germany
- Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig‐Maximilians‐Universität München, Munich, Germany
- Ricardo J. Pereira
- Division of Evolutionary Biology, Department of Biology II, Ludwig‐Maximilians‐Universität München, Martinsried, Germany
- Carolin Kosiol
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria
- Centre for Biological Diversity, University of St Andrews, St Andrews, UK
12. Borges R, Boussau B, Szöllősi GJ, Kosiol C. Nucleotide Usage Biases Distort Inferences of the Species Tree. Genome Biol Evol 2022; 14:6496956. [PMID: 34983052; PMCID: PMC8829901; DOI: 10.1093/gbe/evab290] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Accepted: 12/27/2021] [Indexed: 12/15/2022] Open
Abstract
Despite the importance of natural selection in species’ evolutionary history, phylogenetic methods that take into account population-level processes typically ignore selection. The assumption of neutrality is often based on the idea that selection occurs at a minority of loci in the genome and is unlikely to compromise phylogenetic inferences significantly. However, genome-wide processes like GC-bias and some variation segregating at the coding regions are known to evolve in the nearly neutral range. As we are now using genome-wide data to estimate species trees, it is natural to ask whether weak but pervasive selection is likely to blur species tree inferences. We developed a polymorphism-aware phylogenetic model tailored for measuring signatures of nucleotide usage biases to test the impact of selection in the species tree. Our analyses indicate that although the inferred relationships among species are not significantly compromised, the genetic distances are systematically underestimated in a node-height-dependent manner: that is, the deeper nodes tend to be more underestimated than the shallow ones. Such biases have implications for molecular dating. We dated the evolutionary history of 30 worldwide fruit fly populations, and we found signatures of GC-bias considerably affecting the estimated divergence times (up to 23%) in the neutral model. Our findings call for the need to account for selection when quantifying divergence or dating species evolution.
Affiliation(s)
- Rui Borges
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria
- Bastien Boussau
- Université de Lyon, Université Claude Bernard Lyon 1, CNRS UMR 5558, LBBE, Villeurbanne, France
- Gergely J Szöllősi
- Department of Biological Physics, Eötvös University, Budapest, Hungary; MTA-ELTE "Lendület" Evolutionary Genomics Research Group, Budapest, Hungary; Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany, Hungary
- Carolin Kosiol
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria; Centre for Biological Diversity, University of St Andrews, St Andrews, United Kingdom
|
13
|
Vogl C, Mikula LC. A nearly-neutral biallelic Moran model with biased mutation and linear and quadratic selection. Theor Popul Biol 2021; 139:1-17. [PMID: 33964284 DOI: 10.1016/j.tpb.2021.03.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/18/2020] [Revised: 03/28/2021] [Accepted: 03/29/2021] [Indexed: 01/27/2023]
Abstract
In this article, a biallelic reversible mutation model with linear and quadratic selection is analysed. The approach reconnects to one proposed by Kimura (1979), who starts from a diffusion model and derives its equilibrium distribution up to a constant. We use a boundary-mutation Moran model, which approximates a general mutation model for small effective mutation rates, and derive its equilibrium distribution for polymorphic and monomorphic variants in small to moderately sized populations. Using this model, we show that biased mutation rates and linear selection alone can cause patterns of polymorphism within and substitution rates between populations that are usually ascribed to balancing or overdominant selection. We illustrate this using a data set of short introns and fourfold degenerate sites from Drosophila simulans and Drosophila melanogaster.
Affiliation(s)
- Claus Vogl
- Department of Biomedical Sciences, Vetmeduni Vienna, Veterinärplatz 1, A-1210 Wien, Austria; Vienna Graduate School of Population Genetics, A-1210 Wien, Austria.
- Lynette Caitlin Mikula
- Centre for Biological Diversity, School of Biology, University of St. Andrews, St Andrews KY16 9TH, UK.
|
14
|
De Maio N, Walker CR, Turakhia Y, Lanfear R, Corbett-Detig R, Goldman N. Mutation Rates and Selection on Synonymous Mutations in SARS-CoV-2. Genome Biol Evol 2021; 13:evab087. [PMID: 33895815 PMCID: PMC8135539 DOI: 10.1093/gbe/evab087] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Accepted: 04/19/2021] [Indexed: 12/23/2022]
Abstract
The COVID-19 pandemic has seen an unprecedented response from the sequencing community. Leveraging the sequence data from more than 140,000 SARS-CoV-2 genomes, we study mutation rates and selective pressures affecting the virus. Understanding the processes and effects of mutation and selection has profound implications for the study of viral evolution, for vaccine design, and for the tracking of viral spread. We highlight and address some common genome sequence analysis pitfalls that can lead to inaccurate inference of mutation rates and selection, such as ignoring skews in the genetic code, not accounting for recurrent mutations, and assuming evolutionary equilibrium. We find that two particular mutation rates, G→U and C→U, are similarly elevated and considerably higher than all other mutation rates, causing the majority of mutations in the SARS-CoV-2 genome, and are possibly the result of APOBEC and ROS activity. These mutations also tend to occur many times at the same genome positions along the global SARS-CoV-2 phylogeny (i.e., they are very homoplasic). We observe an effect of genomic context on mutation rates, but the effect of the context is overall limited. Although previous studies have suggested selection acting to decrease U content at synonymous sites, we bring forward evidence suggesting the opposite.
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridgeshire, United Kingdom
- Conor R Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridgeshire, United Kingdom
- Department of Genetics, University of Cambridge, United Kingdom
- Yatish Turakhia
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
- Robert Lanfear
- Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia
- Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
- Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridgeshire, United Kingdom
|
15
|
De Maio N, Walker CR, Turakhia Y, Lanfear R, Corbett-Detig R, Goldman N. Mutation rates and selection on synonymous mutations in SARS-CoV-2. bioRxiv 2021:2021.01.14.426705. [PMID: 33469589 PMCID: PMC7814826 DOI: 10.1101/2021.01.14.426705] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/25/2022]
Abstract
The COVID-19 pandemic has seen an unprecedented response from the sequencing community. Leveraging the sequence data from more than 140,000 SARS-CoV-2 genomes, we study mutation rates and selective pressures affecting the virus. Understanding the processes and effects of mutation and selection has profound implications for the study of viral evolution, for vaccine design, and for the tracking of viral spread. We highlight and address some common genome sequence analysis pitfalls that can lead to inaccurate inference of mutation rates and selection, such as ignoring skews in the genetic code, not accounting for recurrent mutations, and assuming evolutionary equilibrium. We find that two particular mutation rates, G→U and C→U, are similarly elevated and considerably higher than all other mutation rates, causing the majority of mutations in the SARS-CoV-2 genome, and are possibly the result of APOBEC and ROS activity. These mutations also tend to occur many times at the same genome positions along the global SARS-CoV-2 phylogeny (i.e., they are very homoplasic). We observe an effect of genomic context on mutation rates, but the effect of the context is overall limited. While previous studies have suggested selection acting to decrease U content at synonymous sites, we bring forward evidence suggesting the opposite.
Affiliation(s)
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Conor R Walker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Yatish Turakhia
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Robert Lanfear
- Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Russell Corbett-Detig
- Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Nick Goldman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
|
16
|
Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Mol Biol Evol 2020; 37:3292-3307. [PMID: 32886770 PMCID: PMC7751180 DOI: 10.1093/molbev/msaa139] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Indexed: 12/18/2022]
Abstract
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology, University of California San Diego, San Diego, CA
- Erin K Molloy
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
- Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA
|
17
|
Sackton TB. Studying Natural Selection in the Era of Ubiquitous Genomes. Trends Genet 2020; 36:792-803. [DOI: 10.1016/j.tig.2020.07.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/24/2020] [Revised: 07/10/2020] [Accepted: 07/13/2020] [Indexed: 01/15/2023]
|
18
|
Rabiee M, Mirarab S. INSTRAL: Discordance-Aware Phylogenetic Placement Using Quartet Scores. Syst Biol 2020; 69:384-391. [PMID: 31290974 DOI: 10.1093/sysbio/syz045] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/04/2018] [Accepted: 07/02/2019] [Indexed: 11/13/2022]
Abstract
Phylogenomic analyses have increasingly adopted species tree reconstruction using methods that account for gene tree discordance using pipelines that require both human effort and computational resources. As the number of available genomes continues to increase, a new problem is facing researchers. Once more species become available, they have to repeat the whole process from the beginning because updating species trees is currently not possible. However, the de novo inference can be prohibitively costly in human effort or machine time. In this article, we introduce INSTRAL, a method that extends ASTRAL to enable phylogenetic placement. INSTRAL is designed to place a new species on an existing species tree after sequences from the new species have already been added to gene trees; thus, INSTRAL is complementary to existing placement methods that update gene trees. [ASTRAL; ILS; phylogenetic placement; species tree reconstruction.].
Affiliation(s)
- Maryam Rabiee
- Department of Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA
- Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
|
19
|
Borges R, Kosiol C. Consistency and identifiability of the polymorphism-aware phylogenetic models. J Theor Biol 2020; 486:110074. [PMID: 31711991 DOI: 10.1016/j.jtbi.2019.110074] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/31/2019] [Accepted: 11/06/2019] [Indexed: 10/25/2022]
Abstract
Polymorphism-aware phylogenetic models (PoMo) constitute an alternative approach for species tree estimation from genome-wide data. PoMo builds on the standard substitution models of DNA evolution but expands the classic alphabet of the four nucleotide bases to include polymorphic states. By doing so, PoMo accounts for ancestral and current intra-population variation, while also accommodating population-level processes ruling the substitution process (e.g. genetic drift, mutations, allelic selection). PoMo has been shown to be a valuable tool in several phylogenetic applications but a proof of statistical consistency (and identifiability, a necessary condition for consistency) is lacking. Here, we prove that PoMo is identifiable and, using this result, we further show that the maximum a posteriori (MAP) tree estimator of PoMo is a consistent estimator of the species tree. We complement our theoretical results with a simulated data set mimicking the diversity observed in natural populations exhibiting incomplete lineage sorting. We implemented PoMo in a Bayesian framework and show that the MAP tree easily recovers the true tree for typical numbers of sites that are sampled in genome-wide analyses.
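As an illustration of the expanded alphabet described in this abstract, the sketch below enumerates a PoMo-style state space. It is not taken from the paper: it assumes the usual PoMo convention that polymorphic states are biallelic, and the function name `pomo_states` and the virtual population size `n` are hypothetical names introduced here.

```python
from itertools import combinations

NUCLEOTIDES = ("A", "C", "G", "T")


def pomo_states(n):
    """Enumerate PoMo-style states for a virtual population of size n.

    Assumption (not stated in the abstract): polymorphic states are
    biallelic, so each state is either one fixed base or a pair of
    bases with allele counts i and n - i.
    """
    if n < 2:
        raise ValueError("virtual population size must be at least 2")
    # Monomorphic states: all n individuals carry the same base.
    states = [((base, n),) for base in NUCLEOTIDES]
    # Polymorphic states: two bases segregating, with 1 <= i <= n - 1.
    for a, b in combinations(NUCLEOTIDES, 2):
        for i in range(1, n):
            states.append(((a, i), (b, n - i)))
    return states


if __name__ == "__main__":
    states = pomo_states(10)
    # 4 monomorphic + 6 base pairs * (10 - 1) frequencies = 58 states
    print(len(states))
```

Under these assumptions the alphabet grows from 4 states to 4 + 6(n − 1), which is why the choice of virtual population size directly sets the size of the rate matrix.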
Affiliation(s)
- Rui Borges
- Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, Wien 1210, Austria
- Carolin Kosiol
- Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, Wien 1210, Austria; Centre for Biological Diversity, University of St Andrews, St Andrews, Fife KY16 9TH, UK.
|
20
|
Mugal CF, Kutschera VE, Botero-Castro F, Wolf JBW, Kaj I. Polymorphism Data Assist Estimation of the Nonsynonymous over Synonymous Fixation Rate Ratio ω for Closely Related Species. Mol Biol Evol 2020; 37:260-279. [PMID: 31504782 PMCID: PMC6984366 DOI: 10.1093/molbev/msz203] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/24/2022]
Abstract
The ratio of nonsynonymous over synonymous sequence divergence, dN/dS, is a widely used estimate of the nonsynonymous over synonymous fixation rate ratio ω, which measures the extent to which natural selection modulates protein sequence evolution. Its computation is based on a phylogenetic approach and computes sequence divergence of protein-coding DNA between species, traditionally using a single representative DNA sequence per species. This approach ignores the presence of polymorphisms and relies on the indirect assumption that new mutations fix instantaneously, an assumption which is generally violated and reasonable only for distantly related species. The violation of the underlying assumption leads to a time-dependence of sequence divergence, and biased estimates of ω in particular for closely related species, where the contribution of ancestral and lineage-specific polymorphisms to sequence divergence is substantial. We here use a time-dependent Poisson random field model to derive an analytical expression of dN/dS as a function of divergence time and sample size. We then extend our framework to the estimation of the proportion of adaptive protein evolution α. This mathematical treatment enables us to show that the joint usage of polymorphism and divergence data can assist the inference of selection for closely related species. Moreover, our analytical results provide the basis for a protocol for the estimation of ω and α for closely related species. We illustrate the performance of this protocol by studying a population data set of four corvid species, which involves the estimation of ω and α at different time-scales and for several choices of sample sizes.
Affiliation(s)
- Carina F Mugal
- Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden
- Verena E Kutschera
- Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden; Science for Life Laboratory, Stockholm University, Stockholm, Sweden; Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Fidel Botero-Castro
- Division of Evolutionary Biology, Faculty of Biology, LMU Munich, Planegg-Martinsried, Germany
- Jochen B W Wolf
- Department of Ecology and Genetics, Uppsala University, Uppsala, Sweden; Division of Evolutionary Biology, Faculty of Biology, LMU Munich, Planegg-Martinsried, Germany
- Ingemar Kaj
- Department of Mathematics, Uppsala University, Uppsala, Sweden
|
21
|
Borges R, Szöllősi GJ, Kosiol C. Quantifying GC-Biased Gene Conversion in Great Ape Genomes Using Polymorphism-Aware Models. Genetics 2019; 212:1321-1336. [PMID: 31147380 PMCID: PMC6707462 DOI: 10.1534/genetics.119.302074] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 09/13/2018] [Accepted: 05/20/2019] [Indexed: 11/18/2022]
Abstract
As multi-individual population-scale data become available, more complex modeling strategies are needed to quantify genome-wide patterns of nucleotide usage and associated mechanisms of evolution. Recently, the multivariate neutral Moran model was proposed. However, it was shown insufficient to explain the distribution of alleles in great apes. Here, we propose a new model that includes allelic selection. Our theoretical results constitute the basis of a new Bayesian framework to estimate mutation rates and selection coefficients from population data. We apply the new framework to a great ape dataset, where we found patterns of allelic selection that match those of genome-wide GC-biased gene conversion (gBGC). In particular, we show that great apes have patterns of allelic selection that vary in intensity, a feature that we correlated with great apes' distinct demographies. We also demonstrate that the AT/GC toggling effect decreases the probability of a substitution, promoting more polymorphisms in the base composition of great ape genomes. We further assess the impact of GC-bias in molecular analysis, and find that mutation rates and genetic distances are estimated under bias when gBGC is not properly accounted for. Our results contribute to the discussion on the tempo and mode of gBGC evolution, while stressing the need for gBGC-aware models in population genetics and phylogenetics.
Affiliation(s)
- Rui Borges
- Institut für Populationsgenetik, Vetmeduni Vienna, 1210 Wien, Wien, Austria
- Gergely J Szöllősi
- Department of Biological Physics, MTA-ELTE "Lendület" Evolutionary Genomics Research Group, Eötvös University, Pázmány P. stny. 1A, Budapest 1117, Hungary
- Carolin Kosiol
- Institut für Populationsgenetik, Vetmeduni Vienna, 1210 Wien, Wien, Austria
- Centre for Biological Diversity, School of Biology, University of St Andrews, Fife KY16 9TH, UK
|
22
|
Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, Heled J, Jones G, Kühnert D, De Maio N, Matschiner M, Mendes FK, Müller NF, Ogilvie HA, du Plessis L, Popinga A, Rambaut A, Rasmussen D, Siveroni I, Suchard MA, Wu CH, Xie D, Zhang C, Stadler T, Drummond AJ. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Comput Biol 2019; 15:e1006650. [PMID: 30958812 PMCID: PMC6472827 DOI: 10.1371/journal.pcbi.1006650] [Citation(s) in RCA: 1984] [Impact Index Per Article: 330.7] [Received: 11/14/2018] [Revised: 04/18/2019] [Accepted: 02/04/2019] [Indexed: 11/18/2022]
Abstract
Elaboration of Bayesian phylogenetic inference methods has continued at pace in recent years with major new advances in nearly all aspects of the joint modelling of evolutionary data. It is increasingly appreciated that some evolutionary questions can only be adequately answered by combining evidence from multiple independent sources of data, including genome sequences, sampling dates, phenotypic data, radiocarbon dates, fossil occurrences, and biogeographic range information among others. Including all relevant data into a single joint model is very challenging both conceptually and computationally. Advanced computational software packages that allow robust development of compatible (sub-)models which can be composed into a full model hierarchy have played a key role in these developments. Developing such software frameworks is increasingly a major scientific activity in its own right, and comes with specific challenges, from practical software design, development and engineering challenges to statistical and conceptual modelling challenges. BEAST 2 is one such computational software platform, and was first announced over 4 years ago. Here we describe a series of major new developments in the BEAST 2 core platform and model hierarchy that have occurred since the first release of the software, culminating in the recent 2.5 release.
Affiliation(s)
- Remco Bouckaert
- Centre of Computational Evolution, University of Auckland, Auckland, New Zealand
- Max Planck Institute for the Science of Human History, Jena, Germany
- Timothy G. Vaughan
- ETH Zürich, Department of Biosystems Science and Engineering, 4058 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Joëlle Barido-Sottani
- ETH Zürich, Department of Biosystems Science and Engineering, 4058 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sebastián Duchêne
- Department of Biochemistry and Molecular Biology, University of Melbourne, Melbourne, Victoria, Australia
- Mathieu Fourment
- ithree institute, University of Technology Sydney, Sydney, Australia
- Graham Jones
- Department of Biological and Environmental Sciences, University of Gothenburg, Box 461, SE 405 30 Göteborg, Sweden
- Denise Kühnert
- Max Planck Institute for the Science of Human History, Jena, Germany
- Nicola De Maio
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridgeshire, UK
- Michael Matschiner
- Department of Environmental Sciences, University of Basel, 4051 Basel, Switzerland
- Fábio K. Mendes
- Centre of Computational Evolution, University of Auckland, Auckland, New Zealand
- Nicola F. Müller
- ETH Zürich, Department of Biosystems Science and Engineering, 4058 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Huw A. Ogilvie
- Department of Computer Science, Rice University, Houston, TX 77005-1892, USA
- Louis du Plessis
- Department of Zoology, University of Oxford, Oxford, OX1 3PS, UK
- Alex Popinga
- Centre of Computational Evolution, University of Auckland, Auckland, New Zealand
- Andrew Rambaut
- Institute of Evolutionary Biology, University of Edinburgh, Ashworth Laboratories, Edinburgh, EH9 3FL UK
- David Rasmussen
- Department of Entomology and Plant Pathology, North Carolina State University, Raleigh, NC 27695, USA
- Igor Siveroni
- Department of Infectious Disease Epidemiology, Imperial College London, Norfolk Place, W2 1PG, UK
- Marc A. Suchard
- Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Chieh-Hsi Wu
- Department of Statistics, University of Oxford, OX1 3LB, UK
- Dong Xie
- Centre of Computational Evolution, University of Auckland, Auckland, New Zealand
- Chi Zhang
- Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences, Beijing, China
- Tanja Stadler
- ETH Zürich, Department of Biosystems Science and Engineering, 4058 Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Alexei J. Drummond
- Centre of Computational Evolution, University of Auckland, Auckland, New Zealand
|
23
|
Bravo GA, Antonelli A, Bacon CD, Bartoszek K, Blom MPK, Huynh S, Jones G, Knowles LL, Lamichhaney S, Marcussen T, Morlon H, Nakhleh LK, Oxelman B, Pfeil B, Schliep A, Wahlberg N, Werneck FP, Wiedenhoeft J, Willows-Munro S, Edwards SV. Embracing heterogeneity: coalescing the Tree of Life and the future of phylogenomics. PeerJ 2019; 7:e6399. [PMID: 30783571 PMCID: PMC6378093 DOI: 10.7717/peerj.6399] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 02/05/2018] [Accepted: 01/07/2019] [Indexed: 12/23/2022]
Abstract
Building the Tree of Life (ToL) is a major challenge of modern biology, requiring advances in cyberinfrastructure, data collection, theory, and more. Here, we argue that phylogenomics stands to benefit by embracing the many heterogeneous genomic signals emerging from the first decade of large-scale phylogenetic analysis spawned by high-throughput sequencing (HTS). Such signals include those most commonly encountered in phylogenomic datasets, such as incomplete lineage sorting, but also those reticulate processes emerging with greater frequency, such as recombination and introgression. Here we focus specifically on how phylogenetic methods can accommodate the heterogeneity incurred by such population genetic processes; we do not discuss phylogenetic methods that ignore such processes, such as concatenation or supermatrix approaches or supertrees. We suggest that methods of data acquisition and the types of markers used in phylogenomics will remain restricted until a posteriori methods of marker choice are made possible with routine whole-genome sequencing of taxa of interest. We discuss limitations and potential extensions of a model supporting innovation in phylogenomics today, the multispecies coalescent model (MSC). Macroevolutionary models that use phylogenies, such as character mapping, often ignore the heterogeneity on which building phylogenies increasingly rely and suggest that assimilating such heterogeneity is an important goal moving forward. Finally, we argue that an integrative cyberinfrastructure linking all steps of the process of building the ToL, from specimen acquisition in the field to publication and tracking of phylogenomic data, as well as a culture that values contributors at each step, are essential for progress.
Affiliation(s)
- Gustavo A. Bravo
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
- Alexandre Antonelli
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
- Gothenburg Global Biodiversity Centre, Göteborg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
- Gothenburg Botanical Garden, Göteborg, Sweden
- Christine D. Bacon
- Gothenburg Global Biodiversity Centre, Göteborg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
- Krzysztof Bartoszek
- Department of Computer and Information Science, Linköping University, Linköping, Sweden
- Mozes P. K. Blom
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
- Stella Huynh
- Institut de Biologie, Université de Neuchâtel, Neuchâtel, Switzerland
- Graham Jones
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
- L. Lacey Knowles
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
- Sangeet Lamichhaney
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
- Thomas Marcussen
- Centre for Ecological and Evolutionary Synthesis, University of Oslo, Oslo, Norway
- Hélène Morlon
- Institut de Biologie, Ecole Normale Supérieure de Paris, Paris, France
- Luay K. Nakhleh
- Department of Computer Science, Rice University, Houston, TX, USA
- Bengt Oxelman
- Gothenburg Global Biodiversity Centre, Göteborg, Sweden
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
- Bernard Pfeil
- Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden
- Alexander Schliep
- Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Göteborg, Sweden
- Fernanda P. Werneck
- Coordenação de Biodiversidade, Programa de Coleções Científicas Biológicas, Instituto Nacional de Pesquisa da Amazônia, Manaus, AM, Brazil
- John Wiedenhoeft
- Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Göteborg, Sweden
- Department of Computer Science, Rutgers University, Piscataway, NJ, USA
- Sandi Willows-Munro
- School of Life Sciences, University of Kwazulu-Natal, Pietermaritzburg, South Africa
- Scott V. Edwards
- Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
- Gothenburg Centre for Advanced Studies in Science and Technology, Chalmers University of Technology and University of Gothenburg, Göteborg, Sweden
|
24
|
Abstract
Populations evolve as mutations arise in individual organisms and, through hereditary transmission, may become "fixed" (shared by all individuals) in the population. Most mutations are lethal or have negative fitness consequences for the organism. Others have essentially no effect on organismal fitness and can become fixed through the neutral stochastic process known as random drift. However, mutations may also produce a selective advantage that boosts their chances of reaching fixation. Regions of genomes where new mutations are beneficial, rather than neutral or deleterious, tend to evolve more rapidly due to positive selection. Genes involved in immunity and defense are a well-known example; rapid evolution in these genes presumably occurs because new mutations help organisms to prevail in evolutionary "arms races" with pathogens. In recent years genome-wide scans for selection have enlarged our understanding of the genome evolution of various species. In this chapter, we will focus on methods to detect selection on the genome. In particular, we will discuss probabilistic models and how they have changed with the advent of new genome-wide data now available.
Affiliation(s)
- Carolin Kosiol
- Centre of Biological Diversity, School of Biology, University of St Andrews, Fife, UK.
- Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria.
- Maria Anisimova
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
25
|
Gossmann TI, Bockwoldt M, Diringer L, Schwarz F, Schumann VF. Evidence for Strong Fixation Bias at 4-fold Degenerate Sites Across Genes in the Great Tit Genome. Front Ecol Evol 2018. [DOI: 10.3389/fevo.2018.00203] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/18/2022]
|
26
|
Corcoran P, Gossmann TI, Barton HJ, Slate J, Zeng K. Determinants of the Efficacy of Natural Selection on Coding and Noncoding Variability in Two Passerine Species. Genome Biol Evol 2018; 9:2987-3007. [PMID: 29045655 PMCID: PMC5714183 DOI: 10.1093/gbe/evx213] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Accepted: 10/16/2017] [Indexed: 02/06/2023] Open
Abstract
Population genetic theory predicts that selection should be more effective when the effective population size (Ne) is larger, and that the efficacy of selection should correlate positively with recombination rate. Here, we analyzed the genomes of ten great tits and ten zebra finches. Nucleotide diversity at 4-fold degenerate sites indicates that zebra finches have a 2.83-fold larger Ne. We obtained clear evidence that purifying selection is more effective in zebra finches. The proportion of substitutions at 0-fold degenerate sites fixed by positive selection (α) is high in both species (great tit 48%; zebra finch 64%) and is significantly higher in zebra finches. When α was estimated on GC-conservative changes (i.e., between A and T and between G and C), the estimates reduced in both species (great tit 22%; zebra finch 53%). A theoretical model presented herein suggests that failing to control for the effects of GC-biased gene conversion (gBGC) is potentially a contributor to the overestimation of α, and that this effect cannot be alleviated by first fitting a demographic model to neutral variants. We present the first estimates in birds for α in the untranslated regions, and found evidence for substantial adaptive changes. Finally, although purifying selection is stronger in high-recombination regions, we obtained mixed evidence for α increasing with recombination rate, especially after accounting for gBGC. These results highlight that it is important to consider the potential confounding effects of gBGC when quantifying selection and that our understanding of what determines the efficacy of selection is incomplete.
Affiliation(s)
- Pádraic Corcoran
- Department of Animal and Plant Sciences, University of Sheffield, South Yorkshire, United Kingdom
- Toni I Gossmann
- Department of Animal and Plant Sciences, University of Sheffield, South Yorkshire, United Kingdom
- Henry J Barton
- Department of Animal and Plant Sciences, University of Sheffield, South Yorkshire, United Kingdom
- Jon Slate
- Department of Animal and Plant Sciences, University of Sheffield, South Yorkshire, United Kingdom
- Kai Zeng
- Department of Animal and Plant Sciences, University of Sheffield, South Yorkshire, United Kingdom
|
27
|
Rabiee M, Sayyari E, Mirarab S. Multi-allele species reconstruction using ASTRAL. Mol Phylogenet Evol 2018; 130:286-296. [PMID: 30393186 DOI: 10.1016/j.ympev.2018.10.033] [Citation(s) in RCA: 97] [Impact Index Per Article: 13.9] [Received: 11/19/2017] [Revised: 10/23/2018] [Accepted: 10/24/2018] [Indexed: 11/29/2022]
Abstract
Genome-wide phylogeny reconstruction is becoming increasingly common, and one driving factor behind these phylogenomic studies is the promise that the potential discordance between gene trees and the species tree can be modeled. Incomplete lineage sorting is one cause of discordance that bridges population genetic and phylogenetic processes. ASTRAL is a species tree reconstruction method that seeks to find the tree with minimum quartet distance to an input set of inferred gene trees. However, the published ASTRAL algorithm only works with one sample per species. To account for polymorphisms in present-day species, one can sample multiple individuals per species to create multi-allele datasets. Here, we introduce how ASTRAL can handle multi-allele datasets. We show that the quartet-based optimization problem extends naturally, and we introduce heuristic methods for building the search space specifically for the case of multi-individual datasets. We study the accuracy and scalability of the multi-individual version of ASTRAL-III using extensive simulation studies and compare it to NJst, the only other scalable method that can handle these datasets. We do not find strong evidence that using multiple individuals dramatically improves accuracy. When we study the trade-off between sampling more genes versus more individuals, we find that sampling more genes is more effective than sampling more individuals, even under conditions that we study where trees are shallow (median length: ≈1Ne) and ILS is extremely high.
Affiliation(s)
- Maryam Rabiee
- Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, United States
- Erfan Sayyari
- Department of Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, United States
- Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, United States
|
28
|
De Maio N, Worby CJ, Wilson DJ, Stoesser N. Bayesian reconstruction of transmission within outbreaks using genomic variants. PLoS Comput Biol 2018; 14:e1006117. [PMID: 29668677 PMCID: PMC5927459 DOI: 10.1371/journal.pcbi.1006117] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 11/09/2017] [Revised: 04/30/2018] [Accepted: 04/03/2018] [Indexed: 01/19/2023] Open
Abstract
Pathogen genome sequencing can reveal details of transmission histories and is a powerful tool in the fight against infectious disease. In particular, within-host pathogen genomic variants identified through heterozygous nucleotide base calls are a potential source of information to identify linked cases and infer direction and time of transmission. However, using such data effectively to model disease transmission presents a number of challenges, including differentiating genuine variants from those observed due to sequencing error, as well as the specification of a realistic model for within-host pathogen population dynamics. Here we propose a new Bayesian approach to transmission inference, BadTrIP (BAyesian epiDemiological TRansmission Inference from Polymorphisms), that explicitly models evolution of pathogen populations in an outbreak, transmission (including transmission bottlenecks), and sequencing error. BadTrIP enables the inference of host-to-host transmission from pathogen sequencing data and epidemiological data. By assuming that genomic variants are unlinked, our method does not require the computationally intensive and unreliable reconstruction of individual haplotypes. Using simulations we show that BadTrIP is robust in most scenarios and can accurately infer transmission events by efficiently combining information from genetic and epidemiological sources; thanks to its realistic model of pathogen evolution and the inclusion of epidemiological data, BadTrIP is also more accurate than existing approaches. BadTrIP is distributed as an open source package (https://bitbucket.org/nicofmay/badtrip) for the phylogenetic software BEAST2. We apply our method to reconstruct transmission history at the early stages of the 2014 Ebola outbreak, showcasing the power of within-host genomic variants to reconstruct transmission events. We present a new tool to reconstruct transmission events within outbreaks. 
Our approach makes use of pathogen genetic information, notably genetic variants at low frequency within host that are usually discarded, and combines it with epidemiological information of host exposure to infection. This leads to accurate reconstruction of transmission even in cases where abundant within-host pathogen genetic variation and weak transmission bottlenecks (multiple pathogen units colonising a new host at transmission) would otherwise make inference difficult due to the transmission history differing from the pathogen evolution history inferred from pathogen isolates. Also, the use of within-host pathogen genomic variants increases the resolution of the reconstruction of the transmission tree even in scenarios with limited within-outbreak pathogen genetic diversity: within-host pathogen populations that appear identical at the level of consensus sequences can be discriminated using within-host variants. Our Bayesian approach provides a measure of the confidence in different possible transmission histories, and is published as open source software. We show with simulations and with an analysis of the beginning of the 2014 Ebola outbreak that our approach is applicable in many scenarios, improves our understanding of transmission dynamics, and will contribute to finding and limiting sources and routes of transmission, and therefore preventing the spread of infectious disease.
Affiliation(s)
- Nicola De Maio
- Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
- Colin J Worby
- Department of Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey, United States of America
- Daniel J Wilson
- Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom; Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Nicole Stoesser
- Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
|
29
|
Tataru P, Simonsen M, Bataillon T, Hobolth A. Statistical Inference in the Wright-Fisher Model Using Allele Frequency Data. Syst Biol 2018; 66:e30-e46. [PMID: 28173553 PMCID: PMC5837693 DOI: 10.1093/sysbio/syw056] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 12/04/2015] [Revised: 05/31/2016] [Accepted: 06/06/2016] [Indexed: 11/14/2022] Open
Abstract
The Wright–Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright–Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright–Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright–Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.
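As an illustration of the basic bi-allelic, drift-only case mentioned in this abstract, the following is a minimal simulation sketch, not the authors' method. It approximates the distribution of allele frequencies (DAF) by brute-force replication rather than by the diffusion-based or moment-based approximations the review compares. The function names `wright_fisher_trajectory` and `empirical_daf` and all parameter names are hypothetical.

```python
import random


def wright_fisher_trajectory(n_copies, start_count, generations, seed=0):
    """Allele counts over time under pure drift (no mutation, migration, or selection)."""
    rng = random.Random(seed)
    count = start_count
    trajectory = [count]
    for _ in range(generations):
        freq = count / n_copies
        # Binomial resampling: each gene copy in the next generation
        # carries the focal allele with probability equal to its current frequency.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        trajectory.append(count)
    return trajectory


def empirical_daf(n_copies, start_count, generations, replicates, seed=0):
    """Proportion of replicate populations ending at each allele count 0..n_copies."""
    tallies = [0] * (n_copies + 1)
    for r in range(replicates):
        final = wright_fisher_trajectory(
            n_copies, start_count, generations, seed=seed + r
        )[-1]
        tallies[final] += 1
    return [t / replicates for t in tallies]


if __name__ == "__main__":
    daf = empirical_daf(n_copies=20, start_count=10, generations=10, replicates=500)
    print(daf)
```

Counts 0 and `n_copies` are absorbing in this sketch because no mutation is modelled, so probability mass accumulates at the boundaries as the number of generations grows.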
Affiliation(s)
- Paula Tataru
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Maria Simonsen
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Thomas Bataillon
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Asger Hobolth
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
|
30
|
Platt A, Weber CC, Liberles DA. Protein evolution depends on multiple distinct population size parameters. BMC Evol Biol 2018; 18:17. [PMID: 29422024 PMCID: PMC5806465 DOI: 10.1186/s12862-017-1085-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/25/2017] [Accepted: 11/20/2017] [Indexed: 01/08/2023] Open
Abstract
That population size affects the fate of new mutations arising in genomes, modulating both how frequently they arise and how efficiently natural selection is able to filter them, is well established. It is therefore clear that these distinct roles for population size that characterize different processes should affect the evolution of proteins and need to be carefully defined. Empirical evidence is consistent with a role for demography in influencing protein evolution, supporting the idea that functional constraints alone do not determine the composition of coding sequences. Given that the relationship between population size, mutant fitness and fixation probability has been well characterized, estimating fitness from observed substitutions is well within reach with well-formulated models. Molecular evolution research has, therefore, increasingly begun to leverage concepts from population genetics to quantify the selective effects associated with different classes of mutation. However, in order for this type of analysis to provide meaningful information about the intra- and inter-specific evolution of coding sequences, a clear definition of concepts of population size, what they influence, and how they are best parameterized is essential. Here, we present an overview of the many distinct concepts that “population size” and “effective population size” may refer to, what they represent for studying proteins, and how this knowledge can be harnessed to produce better specified models of protein evolution.
Affiliation(s)
- Alexander Platt
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, 19121, USA
- Claudia C Weber
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, 19121, USA
- David A Liberles
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, 19121, USA
|
31
|
Bertl J, Ewing G, Kosiol C, Futschik A. Approximate maximum likelihood estimation for population genetic inference. Stat Appl Genet Mol Biol 2017; 16:387-405. [PMID: 29095700 DOI: 10.1515/sagmb-2017-0016] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
In many population genetic problems, parameter estimation is obstructed by an intractable likelihood function. Therefore, approximate estimation methods have been developed, and with growing computational power, sampling-based methods became popular. However, these methods such as Approximate Bayesian Computation (ABC) can be inefficient in high-dimensional problems. This led to the development of more sophisticated iterative estimation methods like particle filters. Here, we propose an alternative approach that is based on stochastic approximation. By moving along a simulated gradient or ascent direction, the algorithm produces a sequence of estimates that eventually converges to the maximum likelihood estimate, given a set of observed summary statistics. This strategy does not sample much from low-likelihood regions of the parameter space, and is fast, even when many summary statistics are involved. We put considerable efforts into providing tuning guidelines that improve the robustness and lead to good performance on problems with high-dimensional summary statistics and a low signal-to-noise ratio. We then investigate the performance of our resulting approach and study its properties in simulations. Finally, we re-estimate parameters describing the demographic history of Bornean and Sumatran orang-utans.
|
32
|
Leaché AD, Oaks JR. The Utility of Single Nucleotide Polymorphism (SNP) Data in Phylogenetics. Annual Review of Ecology, Evolution, and Systematics 2017. [DOI: 10.1146/annurev-ecolsys-110316-022645] [Citation(s) in RCA: 109] [Impact Index Per Article: 13.6] [Indexed: 11/09/2022]
Affiliation(s)
- Adam D. Leaché
- Department of Biology and Burke Museum of Natural History and Culture, University of Washington, Seattle, Washington 98195
- Jamie R. Oaks
- Department of Biological Sciences, Auburn University, Auburn, Alabama 36849
|
33
|
Zinger L, Philippe H. Coalescing molecular evolution and DNA barcoding. Mol Ecol 2017; 25:1908-10. [PMID: 27169389 DOI: 10.1111/mec.13639] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 03/16/2016] [Revised: 03/31/2016] [Accepted: 03/31/2016] [Indexed: 02/06/2023]
Abstract
The DNA barcoding concept (Woese et al. ; Hebert et al. ) has considerably boosted taxonomy research by facilitating the identification of specimens and discovery of new species. Used alone or in combination with DNA metabarcoding on environmental samples (Taberlet et al. ), the approach is becoming a standard for basic and applied research in ecology, evolution and conservation across taxa, communities and ecosystems (Scheffers et al. ; Kress et al. ). However, DNA barcoding suffers from several shortcomings that still remain overlooked, especially when it comes to species delineation (Collins & Cruickshank ). In this issue of Molecular Ecology, Barley & Thomson () demonstrate that the choice of models of sequence evolution has substantial impacts on inferred genetic distances, with a propensity of the widely used Kimura 2-parameter model to lead to underestimated species richness. While DNA barcoding has been and will continue to be a powerful tool for specimen identification and preliminary taxonomic sorting, this work calls for a systematic assessment of substitution models fit on barcoding data used for species delineation and reopens the debate on the limitation of this approach.
Affiliation(s)
- Lucie Zinger
- CNRS, ENFA, UMR 5174 EDB, Université Toulouse 3 Paul Sabatier, F-31062, Toulouse, France
- Hervé Philippe
- Centre de Théorisation et de Modélisation de la Biodiversité, UMR CNRS 5321, Station d'Ecologie Théorique et Expérimentale, Moulis, 09200, France
|
34
|
Kamm JA, Terhorst J, Song YS. Efficient computation of the joint sample frequency spectra for multiple populations. J Comput Graph Stat 2017; 26:182-194. [PMID: 28239248 DOI: 10.1080/10618600.2016.1159212] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Indexed: 10/22/2022]
Abstract
A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity.
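To make the summary statistic concrete, here is a minimal sketch of how an observed single-population SFS can be tabulated from per-site derived-allele counts. It is only an illustration of the statistic itself and does not reproduce the expected joint SFS formulas or the momi software described in the abstract. The function name `sample_frequency_spectrum` and the input layout are assumptions.

```python
def sample_frequency_spectrum(derived_counts, sample_size):
    """Tabulate how many sites carry k derived copies, for k = 0..sample_size.

    derived_counts: one integer per site, the number of sampled sequences
    carrying the derived (mutant) allele at that site.
    """
    sfs = [0] * (sample_size + 1)
    for count in derived_counts:
        if not 0 <= count <= sample_size:
            raise ValueError(f"count {count} outside 0..{sample_size}")
        sfs[count] += 1
    return sfs


if __name__ == "__main__":
    # Seven sites observed in a sample of four sequences.
    print(sample_frequency_spectrum([1, 1, 2, 0, 4, 3, 1], sample_size=4))
    # prints [1, 3, 1, 1, 1]
```

Entries 1 through `sample_size - 1` correspond to polymorphic sites; the first and last entries count sites that are monomorphic in the sample.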
Affiliation(s)
- John A Kamm
- Department of Statistics, University of California, Berkeley
- Yun S Song
- Departments of EECS, Statistics, and Integrative Biology, University of California, Berkeley
|
35
|
Mendes FK, Hahn Y, Hahn MW. Gene Tree Discordance Can Generate Patterns of Diminishing Convergence over Time. Mol Biol Evol 2016; 33:3299-3307. [PMID: 27634870 DOI: 10.1093/molbev/msw197] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Indexed: 11/13/2022] Open
Abstract
Phenotypic convergence is an exciting outcome of adaptive evolution, occurring when different species find similar solutions to the same problem. Unraveling the molecular basis of convergence provides a way to link genotype to adaptive phenotypes, but can also shed light on the extent to which molecular evolution is repeatable and predictable. Many recent genome-wide studies have uncovered a striking pattern of diminishing convergence over time, ascribing this pattern to the presence of intramolecular epistatic interactions. Here, we consider gene tree discordance as an alternative cause of changes in convergence levels over time in a primate dataset. We demonstrate that gene tree discordance can produce patterns of diminishing convergence by itself, and that controlling for discordance as a cause of apparent convergence makes the pattern disappear. We also show that synonymous substitutions, where neither selection nor epistasis should be prevalent, have the same diminishing pattern of molecular convergence in primates. Finally, we demonstrate that even in situations where biological discordance is not possible, discordance due to errors in species tree inference can drive similar patterns. Though intramolecular epistasis could in principle create a pattern of declining convergence over time, our results suggest a possible alternative explanation for this widespread pattern. These results contribute to a growing appreciation not just of the presence of gene tree discordance, but of the unpredictable effects this discordance can have on analyses of molecular evolution.
Affiliation(s)
- Fábio K Mendes
- Department of Biology, Indiana University, Bloomington, IN
- Yoonsoo Hahn
- Department of Life Science, Research Center for Biomolecules and Biosystems, Chung-Ang University, Seoul, Republic of Korea
- Matthew W Hahn
- Department of Biology, Indiana University, Bloomington, IN; School of Informatics and Computing, Indiana University, Bloomington, IN
|
36
|
Reversible polymorphism-aware phylogenetic models and their application to tree inference. J Theor Biol 2016; 407:362-370. [PMID: 27480613 DOI: 10.1016/j.jtbi.2016.07.042] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.8] [Received: 04/11/2016] [Revised: 07/25/2016] [Accepted: 07/27/2016] [Indexed: 12/13/2022]
Abstract
We present a reversible Polymorphism-Aware Phylogenetic Model (revPoMo) for species tree estimation from genome-wide data. revPoMo enables the reconstruction of large scale species trees for many within-species samples. It expands the alphabet of DNA substitution models to include polymorphic states, thereby, naturally accounting for incomplete lineage sorting. We implemented revPoMo in the maximum likelihood software IQ-TREE. A simulation study and an application to great apes data show that the runtimes of our approach and standard substitution models are comparable but that revPoMo has much better accuracy in estimating trees, divergence times and mutation rates. The advantage of revPoMo is that an increase of sample size per species improves estimations but does not increase runtime. Therefore, revPoMo is a valuable tool with several applications, from speciation dating to species tree reconstruction.
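The idea of expanding the DNA alphabet with polymorphic states can be sketched by enumerating such a state space. The sketch below assumes the usual PoMo convention, which the abstract does not spell out: a virtual population of `n` gene copies, four fixed states, and polymorphic states in which exactly two nucleotides are present. The name `pomo_states` and the tuple encoding are hypothetical and do not reflect IQ-TREE's internal representation.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def pomo_states(n):
    """Enumerate fixed and bi-allelic polymorphic states for a virtual population of size n."""
    # Fixed states: all n copies carry the same nucleotide.
    states = [((a, n),) for a in NUCLEOTIDES]
    # Polymorphic states: i copies of nucleotide a and n - i copies of nucleotide b.
    for a, b in combinations(NUCLEOTIDES, 2):
        for i in range(1, n):
            states.append(((a, i), (b, n - i)))
    return states


if __name__ == "__main__":
    states = pomo_states(10)
    print(len(states))  # prints 58, i.e. 4 + 6 * (10 - 1)
```

Under this assumed encoding the alphabet has 4 + 6(n - 1) states, so its size is set by the virtual population size rather than by the number of sampled individuals.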
|
37
|
Kenigsberg E, Yehuda Y, Marjavaara L, Keszthelyi A, Chabes A, Tanay A, Simon I. The mutation spectrum in genomic late replication domains shapes mammalian GC content. Nucleic Acids Res 2016; 44:4222-32. [PMID: 27085808 PMCID: PMC4872117 DOI: 10.1093/nar/gkw268] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 12/02/2015] [Revised: 03/10/2016] [Accepted: 03/30/2016] [Indexed: 11/14/2022] Open
Abstract
Genome sequence compositions and epigenetic organizations are correlated extensively across multiple length scales. Replication dynamics, in particular, is highly correlated with GC content. We combine genome-wide time of replication (ToR) data, topological domains maps and detailed functional epigenetic annotations to study the correlations between replication timing and GC content at multiple scales. We find that the decrease in genomic GC content at large scale late replicating regions can be explained by mutation bias favoring A/T nucleotides, without selection or biased gene conversion. Quantification of the free dNTP pool during the cell cycle is consistent with a mechanism involving replication-coupled mutation spectrum that favors AT nucleotides at late S-phase. We suggest that mammalian GC content composition is shaped by independent forces, globally modulating mutation bias and locally selecting on functional elements. Deconvoluting these forces and analyzing them on their native scales is important for proper characterization of complex genomic correlations.
Affiliation(s)
- Ephraim Kenigsberg
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
- Yishai Yehuda
- Department of Microbiology and Molecular Genetics, IMRIC, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- Lisette Marjavaara
- Department of Medical Biochemistry and Biophysics, Umeå University, Umeå, Sweden
- Andrea Keszthelyi
- Department of Medical Biochemistry and Biophysics, Umeå University, Umeå, Sweden
- Andrei Chabes
- Department of Medical Biochemistry and Biophysics, Umeå University, Umeå, Sweden
- Amos Tanay
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
- Itamar Simon
- Department of Microbiology and Molecular Genetics, IMRIC, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
|
38
|
Vogl C, Bergman J. Inference of directional selection and mutation parameters assuming equilibrium. Theor Popul Biol 2015; 106:71-82. [PMID: 26597774 DOI: 10.1016/j.tpb.2015.10.003] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 07/10/2015] [Revised: 09/30/2015] [Accepted: 10/07/2015] [Indexed: 01/15/2023]
Abstract
In a classical study, Wright (1931) proposed a model for the evolution of a biallelic locus under the influence of mutation, directional selection and drift. He derived the equilibrium distribution of the allelic proportion conditional on the scaled mutation rate, the mutation bias and the scaled strength of directional selection. The equilibrium distribution can be used for inference of these parameters with genome-wide datasets of "site frequency spectra" (SFS). Assuming that the scaled mutation rate is low, Wright's model can be approximated by a boundary-mutation model, where mutations are introduced into the population exclusively from sites fixed for the preferred or unpreferred allelic states. With the boundary-mutation model, inference can be partitioned: (i) the shape of the SFS distribution within the polymorphic region is determined by random drift and directional selection, but not by the mutation parameters, such that inference of the selection parameter relies exclusively on the polymorphic sites in the SFS; (ii) the mutation parameters can be inferred from the amount of polymorphic and monomorphic preferred and unpreferred alleles, conditional on the selection parameter. Herein, we derive maximum likelihood estimators for the mutation and selection parameters in equilibrium and apply the method to simulated SFS data as well as empirical data from a Madagascar population of Drosophila simulans.
Affiliation(s)
- Claus Vogl
- Institute of Animal Breeding and Genetics, Veterinärmedizinische Universität Wien, Veterinärplatz 1, A-1210 Vienna, Austria.
- Juraj Bergman
- Institute of Population Genetics, Veterinärmedizinische Universität Wien, Veterinärplatz 1, A-1210 Vienna, Austria; Vienna Graduate School of Population Genetics, Veterinärmedizinische Universität Wien, Veterinärplatz 1, A-1210 Vienna, Austria.
|
39
|
Hobolth A, Siren J. The multivariate Wright-Fisher process with mutation: Moment-based analysis and inference using a hierarchical Beta model. Theor Popul Biol 2015; 108:36-50. [PMID: 26612605 DOI: 10.1016/j.tpb.2015.11.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/10/2015] [Revised: 11/04/2015] [Accepted: 11/05/2015] [Indexed: 10/22/2022]
Abstract
We consider the diffusion approximation of the multivariate Wright-Fisher process with mutation. Analytically tractable formulas for the first- and second-order moments of the allele frequency distribution are derived, and the moments are subsequently used to better understand key population genetics parameters and modeling frameworks. In particular we investigate the behavior of the expected homozygosity (the probability that two randomly sampled genes are identical) in the transient and stationary phases, and how appropriate the Dirichlet distribution is for modeling the allele frequency distribution at different evolutionary time scales. We find that the Dirichlet distribution is adequate for the pure drift model (no mutations allowed), but the distribution is not sufficiently flexible for more general mutation models. We suggest a new hierarchical Beta distribution for the allele frequencies in the Wright-Fisher process with a mutation model on the nucleotide level that distinguishes between transitions and transversions.
Affiliation(s)
- Asger Hobolth
- Bioinformatics Research Center, Aarhus University, Denmark.
- Jukka Siren
- Department of Biosciences, University of Helsinki, Finland.
|
40
|
De Maio N, Schrempf D, Kosiol C. PoMo: An Allele Frequency-Based Approach for Species Tree Estimation. Syst Biol 2015; 64:1018-31. [PMID: 26209413 PMCID: PMC4604832 DOI: 10.1093/sysbio/syv048] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Received: 03/21/2014] [Accepted: 06/11/2015] [Indexed: 11/24/2022] Open
Abstract
Incomplete lineage sorting can cause incongruencies of the overall species-level phylogenetic tree with the phylogenetic trees for individual genes or genomic segments. If these incongruencies are not accounted for, it is possible to incur several biases in species tree estimation. Here, we present a simple maximum likelihood approach that accounts for ancestral variation and incomplete lineage sorting. We use a POlymorphisms-aware phylogenetic MOdel (PoMo) that we have recently shown to efficiently estimate mutation rates and fixation biases from within and between-species variation data. We extend this model to perform efficient estimation of species trees. We test the performance of PoMo in several different scenarios of incomplete lineage sorting using simulations and compare it with existing methods both in accuracy and computational speed. In contrast to other approaches, our model does not use coalescent theory but is allele frequency based. We show that PoMo is well suited for genome-wide species tree estimation and that on such data it is more accurate than previous approaches.
Affiliation(s)
- Nicola De Maio
  - Institut für Populationsgenetik, Vetmeduni Vienna, Wien 1210, Austria; Vienna Graduate School of Population Genetics, Wien, Austria; and Nuffield Department of Clinical Medicine, University of Oxford, Oxford OX3 7BN, UK
- Dominik Schrempf
  - Institut für Populationsgenetik, Vetmeduni Vienna, Wien 1210, Austria; Vienna Graduate School of Population Genetics, Wien, Austria
- Carolin Kosiol
  - Institut für Populationsgenetik, Vetmeduni Vienna, Wien 1210, Austria
|
41
|
Contingency and entrenchment in protein evolution under purifying selection. Proc Natl Acad Sci U S A 2015; 112:E3226-35. [PMID: 26056312 DOI: 10.1073/pnas.1412933112] [Citation(s) in RCA: 120] [Impact Index Per Article: 12.0] [Indexed: 01/06/2023] Open
Abstract
The phenotypic effect of an allele at one genetic site may depend on alleles at other sites, a phenomenon known as epistasis. Epistasis can profoundly influence the process of evolution in populations and shape the patterns of protein divergence across species. Whereas epistasis between adaptive substitutions has been studied extensively, relatively little is known about epistasis under purifying selection. Here we use computational models of thermodynamic stability in a ligand-binding protein to explore the structure of epistasis in simulations of protein sequence evolution. Even though the predicted effects on stability of random mutations are almost completely additive, the mutations that fix under purifying selection are enriched for epistasis. In particular, the mutations that fix are contingent on previous substitutions: Although nearly neutral at their time of fixation, these mutations would be deleterious in the absence of preceding substitutions. Conversely, substitutions under purifying selection are subsequently entrenched by epistasis with later substitutions: They become increasingly deleterious to revert over time. Our results imply that, even under purifying selection, protein sequence evolution is often contingent on history and so it cannot be predicted by the phenotypic effects of mutations assayed in the ancestral background.
|
42
|
Glémin S, Arndt PF, Messer PW, Petrov D, Galtier N, Duret L. Quantification of GC-biased gene conversion in the human genome. Genome Res 2015; 25:1215-28. [PMID: 25995268 PMCID: PMC4510005 DOI: 10.1101/gr.185488.114] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Received: 10/08/2014] [Accepted: 05/18/2015] [Indexed: 11/25/2022]
Abstract
Much evidence indicates that GC-biased gene conversion (gBGC) has a major impact on the evolution of mammalian genomes. However, a detailed quantification of the process is still lacking. The strength of gBGC can be measured from the analysis of derived allele frequency spectra (DAF), but this approach is sensitive to a number of confounding factors. In particular, we show by simulations that the inference is pervasively affected by polymorphism polarization errors and by spatial heterogeneity in gBGC strength. We propose a new general method to quantify gBGC from DAF spectra, incorporating polarization errors, taking spatial heterogeneity into account, and jointly estimating mutation bias. Applying it to human polymorphism data from the 1000 Genomes Project, we show that the strength of gBGC does not differ between hypermutable CpG sites and non-CpG sites, suggesting that in humans gBGC is not caused by the base-excision repair machinery. Genome-wide, the intensity of gBGC is in the nearly neutral area. However, given that recombination occurs primarily within recombination hotspots, 1%–2% of the human genome is subject to strong gBGC. On average, gBGC is stronger in African than in non-African populations, reflecting differences in effective population sizes. However, due to more heterogeneous recombination landscapes, the fraction of the genome affected by strong gBGC is larger in non-African than in African populations. Given that the location of recombination hotspots evolves very rapidly, our analysis predicts that, in the long term, a large fraction of the genome is affected by short episodes of strong gBGC.
Affiliation(s)
- Sylvain Glémin
  - Institut des Sciences de l'Evolution (ISEM - UMR 5554 Université de Montpellier-CNRS-IRD-EPHE), 34095 Montpellier, France; Department of Ecology and Genetics, Evolutionary Biology Centre, Uppsala University, SE-752 36 Uppsala, Sweden
- Peter F Arndt
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Philipp W Messer
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA
- Dmitri Petrov
  - Department of Biology, Stanford University, Stanford, California 94305-5020, USA
- Nicolas Galtier
  - Institut des Sciences de l'Evolution (ISEM - UMR 5554 Université de Montpellier-CNRS-IRD-EPHE), 34095 Montpellier, France
- Laurent Duret
  - Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Lyon 1, 69622 Villeurbanne, France
|
43
|
Vieira FG, Lassalle F, Korneliussen TS, Fumagalli M. Improving the estimation of genetic distances from Next-Generation Sequencing data. Biol J Linn Soc Lond 2015. [DOI: 10.1111/bij.12511] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.9] [Indexed: 12/14/2022]
Affiliation(s)
- Filipe G. Vieira
  - Centre for GeoGenetics and Evogenomics Section; Natural History Museum of Denmark; University of Copenhagen; DK-2100 Copenhagen Denmark
- Florent Lassalle
  - Department of Genetics, Evolution and Environment; UCL Genetics Institute; University College London; Gower Street London WC1E 6BT UK
- Thorfinn S. Korneliussen
  - Centre for GeoGenetics and Evogenomics Section; Natural History Museum of Denmark; University of Copenhagen; DK-2100 Copenhagen Denmark
- Matteo Fumagalli
  - Department of Genetics, Evolution and Environment; UCL Genetics Institute; University College London; Gower Street London WC1E 6BT UK
|
44
|
Mirarab S, Bayzid MS, Boussau B, Warnow T. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science 2014; 346:1250463. [PMID: 25504728 DOI: 10.1126/science.1250463] [Citation(s) in RCA: 164] [Impact Index Per Article: 14.9] [Indexed: 12/12/2022]
Abstract
Gene tree incongruence arising from incomplete lineage sorting (ILS) can reduce the accuracy of concatenation-based estimations of species trees. Although coalescent-based species tree estimation methods can have good accuracy in the presence of ILS, they are sensitive to gene tree estimation error. We propose a pipeline that uses bootstrapping to evaluate whether two genes are likely to have the same tree, then it groups genes into sets using a graph-theoretic optimization and estimates a tree on each subset using concatenation, and finally produces an estimated species tree from these trees using the preferred coalescent-based method. Statistical binning improves the accuracy of MP-EST, a popular coalescent-based method, and we use it to produce the first genome-scale coalescent-based avian tree of life.
Affiliation(s)
- Siavash Mirarab
  - Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
- Md Shamsuzzoha Bayzid
  - Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA
- Bastien Boussau
  - Laboratoire de Biométrie et Biologie Evolutive, CNRS, UMR5558, Université Lyon 1, 69622, Villeurbanne, France
- Tandy Warnow
  - Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA. Department of Bioengineering and Computer Science, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA.
|
45
|
Scala G, Affinito O, Miele G, Monticelli A, Cocozza S. Evidence for evolutionary and nonevolutionary forces shaping the distribution of human genetic variants near transcription start sites. PLoS One 2014; 9:e114432. [PMID: 25474578 PMCID: PMC4256220 DOI: 10.1371/journal.pone.0114432] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/07/2014] [Accepted: 11/09/2014] [Indexed: 11/19/2022] Open
Abstract
The regions surrounding transcription start sites (TSSs) of genes play a critical role in the regulation of gene expression. At the same time, current evidence indicates that these regions are particularly stressed by transcription-related mutagenic phenomena. In this work we performed a genome-wide analysis of the distribution of single nucleotide polymorphisms (SNPs) inside the 10 kb region flanking human TSSs by dividing SNPs into four classes according to their frequency (rare, two intermediate classes, and common). We found that, in this 10 kb region, the distribution of variants depends on their frequency and on their localization relative to the TSS. We found that the distribution of variants is generally different for TSSs located inside or outside of CpG islands. We found a significant relationship between the distribution of rare variants and nucleosome occupancy scores. Furthermore, our analysis suggests that evolutionary (purifying selection) and nonevolutionary (biased gene conversion) forces both play a role in determining the relative SNP frequency around TSSs. Finally, we analyzed the potential pathogenicity of each class of variant using the Combined Annotation Dependent Depletion score. In conclusion, this study provides a novel and detailed view of the distribution of genomic variants around TSSs, providing insight into the forces that instigate and maintain variability in such critical regions.
Affiliation(s)
- Giovanni Scala
  - Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Dipartimento di Fisica, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Istituto Nazionale di Fisica Nucleare, Sezione di Napoli, Naples, Italy
- Ornella Affinito
  - Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Dipartimento di Medicina Molecolare e Biotecnologie Mediche, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Istituto di Endocrinologia ed Oncologia Sperimentale (IEOS), CNR, Naples, Italy
- Gennaro Miele
  - Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Dipartimento di Fisica, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Istituto Nazionale di Fisica Nucleare, Sezione di Napoli, Naples, Italy
- Antonella Monticelli
  - Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Istituto di Endocrinologia ed Oncologia Sperimentale (IEOS), CNR, Naples, Italy
- Sergio Cocozza
  - Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università degli Studi di Napoli “Federico II”, Naples, Italy
  - Dipartimento di Medicina Molecolare e Biotecnologie Mediche, Università degli Studi di Napoli “Federico II”, Naples, Italy
|
46
|
McCandlish DM, Stoltzfus A. Modeling evolution using the probability of fixation: history and implications. Q Rev Biol 2014; 89:225-52. [PMID: 25195318 DOI: 10.1086/677571] [Citation(s) in RCA: 123] [Impact Index Per Article: 11.2] [Indexed: 11/03/2022]
Abstract
Many models of evolution calculate the rate of evolution by multiplying the rate at which new mutations originate within a population by a probability of fixation. Here we review the historical origins, contemporary applications, and evolutionary implications of these "origin-fixation" models, which are widely used in evolutionary genetics, molecular evolution, and phylogenetics. Origin-fixation models were first introduced in 1969, in association with an emerging view of "molecular" evolution. Early origin-fixation models were used to calculate an instantaneous rate of evolution across a large number of independently evolving loci; in the 1980s and 1990s, a second wave of origin-fixation models emerged to address a sequence of fixation events at a single locus. Although origin-fixation models have been applied to a broad array of problems in contemporary evolutionary research, their rise in popularity has not been accompanied by an increased appreciation of their restrictive assumptions or their distinctive implications. We argue that origin-fixation models constitute a coherent theory of mutation-limited evolution that contrasts sharply with theories of evolution that rely on the presence of standing genetic variation. A major unsolved question in evolutionary biology is the degree to which these models provide an accurate approximation of evolution in natural populations.
|
47
|
Lachance J, Tishkoff SA. Biased gene conversion skews allele frequencies in human populations, increasing the disease burden of recessive alleles. Am J Hum Genet 2014; 95:408-20. [PMID: 25279983 PMCID: PMC4185123 DOI: 10.1016/j.ajhg.2014.09.008] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Received: 01/24/2014] [Revised: 08/21/2014] [Accepted: 09/10/2014] [Indexed: 10/25/2022] Open
Abstract
Gene conversion results in the nonreciprocal transfer of genetic information between two recombining sequences, and there is evidence that this process is biased toward G and C alleles. However, the strength of GC-biased gene conversion (gBGC) in human populations and its effects on hereditary disease have yet to be assessed on a genomic scale. Using high-coverage whole-genome sequences of African hunter-gatherers, agricultural populations, and primate outgroups, we quantified the effects of GC-biased gene conversion on population genomic data sets. We find that genetic distances (FST and population branch statistics) are modified by gBGC. In addition, the site frequency spectrum is left-shifted when ancestral alleles are favored by gBGC and right-shifted when derived alleles are favored by gBGC. Allele frequency shifts due to gBGC mimic the effects of natural selection. As expected, these effects are strongest in high-recombination regions of the human genome. By comparing the relative rates of fixation of unbiased and biased sites, the strength of gene conversion was estimated to be on the order of Nb ≈ 0.05 to 0.09. We also find that derived alleles favored by gBGC are much more likely to be homozygous than derived alleles at unbiased SNPs (+42.2% to 62.8%). This results in a curse of the converted, whereby gBGC causes substantial increases in hereditary disease risks. Taken together, our findings reveal that GC-biased gene conversion has important population genetic and public health implications.
MESH Headings
- Bias
- Evolution, Molecular
- Gene Conversion
- Gene Frequency
- Genes, Recessive/genetics
- Genetic Diseases, Inborn/genetics
- Genetics, Population
- Genome, Human/genetics
- Humans
- Models, Genetic
- Models, Theoretical
- Polymorphism, Single Nucleotide/genetics
- Recombination, Genetic
- Selection, Genetic/genetics
Affiliation(s)
- Joseph Lachance
  - Departments of Biology and Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.
- Sarah A Tishkoff
  - Departments of Biology and Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
48
|
Abstract
This article reviews the various models that have been used to describe the relationships between gene trees and species trees. Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can coexist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a more reliable basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.
Affiliation(s)
- Gergely J Szöllősi
  - ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France
- Eric Tannier
  - ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France
- Vincent Daubin
  - ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France
- Bastien Boussau
  - ELTE-MTA "Lendület" Biophysics Research Group, Pázmány P. stny. 1A., 1117 Budapest, Hungary; Laboratoire de Biométrie et Biologie Evolutive, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5558, Université Lyon 1, F-69622 Villeurbanne, France; Université de Lyon, F-69000 Lyon, France; and Institut National de Recherche en Informatique et en Automatique Rhône-Alpes, F-38334 Montbonnot, France
|
49
|
Evans BJ, Zeng K, Esselstyn JA, Charlesworth B, Melnick DJ. Reduced representation genome sequencing suggests low diversity on the sex chromosomes of tonkean macaque monkeys. Mol Biol Evol 2014; 31:2425-40. [PMID: 24987106 DOI: 10.1093/molbev/msu197] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/15/2022] Open
Abstract
In species with separate sexes, social systems can differ in the relative variances of male versus female reproductive success. Papionin monkeys (macaques, mangabeys, mandrills, drills, baboons, and geladas) exhibit hallmarks of a high variance in male reproductive success, including a female-biased adult sex ratio and prominent sexual dimorphism. To explore the potential genomic consequences of such sex differences, we used a reduced representation genome sequencing approach to quantify polymorphism at sites on autosomes and sex chromosomes of the tonkean macaque (Macaca tonkeana), a species endemic to the Indonesian island of Sulawesi. The ratio of nucleotide diversity of the X chromosome to that of the autosomes was less than the value (0.75) expected with a 1:1 sex ratio and no sex differences in the variance in reproductive success. However, the significance of this difference was dependent on which outgroup was used to standardize diversity levels. Using a new model that includes the effects of varying population size, sex differences in mutation rate between the autosomes and X chromosome, and GC-biased gene conversion (gBGC) or selection on GC content, we found that the maximum-likelihood estimate of the ratio of effective population size of the X chromosome to that of the autosomes was 0.68, which did not differ significantly from 0.75. We also found evidence for 1) a higher level of purifying selection on genic than nongenic regions, 2) gBGC or natural selection favoring increased GC content, 3) a dynamic demography characterized by population growth and contraction, 4) a higher mutation rate in males than females, and 5) a very low polymorphism level on the Y chromosome. These findings shed light on the population genomic consequences of sex differences in the variance in reproductive success, which appear to be modest in the tonkean macaque; they also suggest the occurrence of hitchhiking on the Y chromosome.
Affiliation(s)
- Ben J Evans
  - Biology Department, McMaster University, Hamilton, ON, Canada
- Kai Zeng
  - Department of Animal and Plant Sciences, Alfred Denny Building, University of Sheffield, Sheffield, United Kingdom
- Jacob A Esselstyn
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University
- Brian Charlesworth
  - Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom
- Don J Melnick
  - Department of Ecology, Evolution, and Environmental Biology, Columbia University
|
50
|
Lawrie DS, Petrov DA. Comparative population genomics: power and principles for the inference of functionality. Trends Genet 2014; 30:133-9. [PMID: 24656563 DOI: 10.1016/j.tig.2014.02.002] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 09/10/2013] [Revised: 01/31/2014] [Accepted: 02/06/2014] [Indexed: 11/19/2022]
Abstract
The availability of sequenced genomes from multiple related organisms allows the detection and localization of functional genomic elements based on the idea that such elements evolve more slowly than neutral sequences. Although such comparative genomics methods have proven useful in discovering functional elements and ascertaining levels of functional constraint in the genome as a whole, here we outline limitations intrinsic to this approach that cannot be overcome by sequencing more species. We argue that it is essential to supplement comparative genomics with ultra-deep sampling of populations from closely related species to enable substantially more powerful genomic scans for functional elements. The convergence of sequencing technology and population genetics theory has made such projects feasible and has exciting implications for functional genomics.
Affiliation(s)
- David S Lawrie
  - Department of Genetics, Stanford University, Stanford, CA, USA; Department of Biology, Stanford University, Stanford, CA, USA.
- Dmitri A Petrov
  - Department of Biology, Stanford University, Stanford, CA, USA
|