1
McArthur RN, Zehmakan AN, Charleston MA, Lin Y, Huttley G. Spectral cluster supertree: fast and statistically robust merging of rooted phylogenetic trees. Front Mol Biosci 2024; 11:1432495. [PMID: 39544404 PMCID: PMC11561713 DOI: 10.3389/fmolb.2024.1432495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Accepted: 09/24/2024] [Indexed: 11/17/2024]
Abstract
The algorithms for phylogenetic reconstruction are central to computational molecular evolution. The relentless pace of data acquisition has exposed their poor scalability and led to the conclusion that the conventional application of these methods is impractical and not justifiable from an energy usage perspective. Furthermore, the drive to improve the statistical performance of phylogenetic methods produces increasingly parameter-rich models of sequence evolution, which worsens the computational performance. Established theoretical and algorithmic results identify supertree methods as critical to divide-and-conquer strategies for improving scalability of phylogenetic reconstruction. Of particular importance is the ability to explicitly accommodate rooted topologies. These can arise from the more biologically plausible non-stationary models of sequence evolution. We make a contribution to addressing this challenge with Spectral Cluster Supertree, a novel supertree method for merging a set of overlapping rooted phylogenetic trees. It offers significant improvements over Min-Cut supertree and previous state-of-the-art methods in terms of both time complexity and overall topological accuracy, particularly for problems of large size. We perform comparisons against Min-Cut supertree and Bad Clade Deletion. Leveraging two tree topology distance metrics, we demonstrate that while Bad Clade Deletion generates more correct clades in its resulting supertree, Spectral Cluster Supertree's generated tree is generally more topologically close to the true model tree. Over large datasets containing 10,000 taxa and ~500 source trees, where Bad Clade Deletion usually takes ~2 h to run, our method generates a supertree in 20 s on average. Spectral Cluster Supertree is released under an open source license and is available on the Python Package Index as sc-supertree.
Affiliation(s)
- Robert N. McArthur
  - Research School of Biology, The Australian National University, Canberra, ACT, Australia
- Ahad N. Zehmakan
  - School of Computing, The Australian National University, Canberra, ACT, Australia
- Yu Lin
  - School of Computing, The Australian National University, Canberra, ACT, Australia
- Gavin Huttley
  - Research School of Biology, The Australian National University, Canberra, ACT, Australia
2
McShea H, Weibel C, Wehbi S, Goodman P, James JE, Wheeler AL, Masel J. The effectiveness of selection in a species affects the direction of amino acid frequency evolution. bioRxiv 2024:2023.02.01.526552. [PMID: 38948853 PMCID: PMC11212923 DOI: 10.1101/2023.02.01.526552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
Nearly neutral theory predicts that species with higher effective population size (Ne) are better able to purge slightly deleterious mutations. We compare evolution in high-Ne vs. low-Ne vertebrates to reveal which amino acid frequencies are subject to subtle selective preferences. We take three complementary approaches, two measuring flux and one measuring outcomes. First, we fit non-stationary substitution models of amino acid flux using maximum likelihood, comparing the high-Ne clade of rodents and lagomorphs to its low-Ne sister clade of primates and colugos. Second, we compare evolutionary outcomes across a wider range of vertebrates, via correlations between amino acid frequencies and Ne. Third, we dissect the details of flux in human, chimpanzee, mouse, and rat, as scored by parsimony; this also enables comparison to a historical paper. All three methods agree on which amino acids are preferred under more effective selection. Preferred amino acids tend to be smaller, less costly to synthesize, and to promote intrinsic structural disorder. Parsimony-induced bias in the historical study produces an apparent reduction in structural disorder, perhaps driven by slightly deleterious substitutions. Within highly exchangeable pairs of amino acids, arginine is strongly preferred over lysine, and valine over isoleucine, consistent with more effective selection preferring a marginally larger free energy of folding. These two preferences match differences between thermophiles and mesophilic relatives. These results reveal the biophysical consequences of mutation-selection-drift balance, and demonstrate the utility of nearly neutral theory for understanding protein evolution.
Affiliation(s)
- Hanon McShea
  - Department of Earth System Science, Stanford University
- Catherine Weibel
  - Department of Ecology & Evolutionary Biology, University of Arizona
  - Department of Applied Physics, Stanford University
- Sawsan Wehbi
  - Graduate Interdisciplinary Program in Genetics, University of Arizona
- Jennifer E James
  - Department of Ecology & Evolutionary Biology, University of Arizona
  - Department of Ecology and Genetics, Uppsala University
- Andrew L Wheeler
  - Graduate Interdisciplinary Program in Genetics, University of Arizona
- Joanna Masel
  - Department of Ecology & Evolutionary Biology, University of Arizona
3
Casanellas M, Fernández-Sánchez J, Garrote-López M, Sabaté-Vidales M. Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages. Bull Math Biol 2023; 85:68. [PMID: 37310552 DOI: 10.1007/s11538-023-01167-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/02/2022] [Accepted: 05/15/2023] [Indexed: 06/14/2023]
Abstract
Homogeneity across lineages is a general assumption in phylogenetics according to which nucleotide substitution rates are common to all lineages. Many phylogenetic methods relax this hypothesis but keep a simple enough model to make the process of sequence evolution more tractable. On the other hand, dealing successfully with the general case (heterogeneity of rates across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, thus especially indicated to deal with data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch lengths estimated with the paralinear distance. ASAQ is statistically consistent when applied to data generated under the general Markov model, considers rate and base composition heterogeneity among lineages and does not assume stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely QFM, wQFM, quartet puzzling, weight optimization and Willson's method) in combination with several systems of weights, including ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support weight optimization with ASAQ weights as a reliable and successful reconstruction method that improves upon the accuracy of global methods (such as neighbor-joining or maximum likelihood) in the presence of long branches or on mixtures of distributions on trees.
Affiliation(s)
- Marta Casanellas
  - Institut de Matematiques de la UPC-BarcelonaTech (IMTech), Universitat Politècnica de Catalunya and Centre de Recerca Matemàtica, Av. Diagonal 647, 08028, Barcelona, Spain
- Jesús Fernández-Sánchez
  - Institut de Matematiques de la UPC-BarcelonaTech (IMTech), Universitat Politècnica de Catalunya and Centre de Recerca Matemàtica, Av. Diagonal 647, 08028, Barcelona, Spain
4
Zioutis C, Seki D, Bauchinger F, Herbold C, Berger A, Wisgrill L, Berry D. Ecological Processes Shaping Microbiomes of Extremely Low Birthweight Infants. Front Microbiol 2022; 13:812136. [PMID: 35295290 PMCID: PMC8919028 DOI: 10.3389/fmicb.2022.812136] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/09/2021] [Accepted: 02/07/2022] [Indexed: 11/23/2022]
Abstract
The human microbiome has been implicated in affecting health outcomes in premature infants, but the ecological processes governing early life microbiome assembly remain poorly understood. Here, we investigated microbial community assembly and dynamics in extremely low birth weight infants (ELBWI) over the first 2 weeks of life. We profiled the gut, oral cavity and skin microbiomes over time using 16S rRNA gene amplicon sequencing and evaluated the ecological forces shaping these microbiomes. Though microbiomes at all three body sites were characterized by compositional instability over time and had low body-site specificity (PERMANOVA, r² = 0.09, p = 0.001), they could nonetheless be clustered into four discrete community states. Despite the volatility of these communities, deterministic assembly processes were detectable in this period of initial microbial colonization. To further explore these deterministic dynamics, we developed a probabilistic approach in which we modeled microbiome state transitions in each ELBWI as a Markov process, or a "memoryless" shift, from one community state to another. This analysis revealed that microbiomes from different body sites had distinctive dynamics as well as characteristic equilibrium frequencies. Time-resolved microbiome sampling of premature infants may help to refine and inform clinical practices. Additionally, this work provides an analysis framework for microbial community dynamics based on Markov modeling that can facilitate new insights, not only into neonatal microbiomes but also other human-associated or environmental microbiomes.
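As a rough illustration of the Markov modeling idea described in this abstract, the sketch below estimates a transition matrix from sequences of community states and then finds its equilibrium frequencies by power iteration. It is not the authors' code: the function names, the toy state sequences, and the choice to treat a never-exited state as absorbing are all assumptions made for this example.

```python
def estimate_transition_matrix(sequences, n_states):
    """Row-normalised transition counts from observed state sequences."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1.0
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            # No observed exits from this state: treat it as absorbing (assumption).
            matrix.append([1.0 if j == i else 0.0 for j in range(n_states)])
        else:
            matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Equilibrium frequencies pi satisfying pi = pi P, found by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical time series of four community states (0..3), one list per infant.
sequences = [
    [0, 0, 1, 2, 2, 3, 0, 1],
    [1, 1, 2, 3, 3, 0, 0, 2],
    [2, 2, 2, 1, 0, 3, 3, 1],
]
P = estimate_transition_matrix(sequences, n_states=4)
pi = stationary_distribution(P)
print([round(x, 3) for x in pi])
```

Power iteration converges here because the toy chain is irreducible and has self-transitions; a real analysis would need to check those conditions, or solve the linear system directly.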
Affiliation(s)
- Christos Zioutis
  - Division of Microbial Ecology, Department of Microbiology and Ecosystem Science, Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
- David Seki
  - Division of Microbial Ecology, Department of Microbiology and Ecosystem Science, Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
  - Division of Neonatology, Department of Pediatrics and Adolescent Medicine, Pediatric Intensive Care and Neuropediatrics, Comprehensive Center for Pediatrics, Medical University of Vienna, Vienna, Austria
- Franziska Bauchinger
  - Division of Microbial Ecology, Department of Microbiology and Ecosystem Science, Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
- Craig Herbold
  - Division of Microbial Ecology, Department of Microbiology and Ecosystem Science, Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
- Angelika Berger
  - Division of Neonatology, Department of Pediatrics and Adolescent Medicine, Pediatric Intensive Care and Neuropediatrics, Comprehensive Center for Pediatrics, Medical University of Vienna, Vienna, Austria
- Lukas Wisgrill
  - Division of Neonatology, Department of Pediatrics and Adolescent Medicine, Pediatric Intensive Care and Neuropediatrics, Comprehensive Center for Pediatrics, Medical University of Vienna, Vienna, Austria
- David Berry
  - Division of Microbial Ecology, Department of Microbiology and Ecosystem Science, Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
  - Joint Microbiome Facility of the Medical University of Vienna, University of Vienna, Vienna, Austria
5
Borges R, Machado JP, Gomes C, Rocha AP, Antunes A. Measuring phylogenetic signal between categorical traits and phylogenies. Bioinformatics 2020; 35:1862-1869. [PMID: 30358816 DOI: 10.1093/bioinformatics/bty800] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 10/19/2017] [Revised: 08/18/2018] [Accepted: 10/24/2018] [Indexed: 12/21/2022]
Abstract
MOTIVATION: Determining whether a trait and phylogeny share some degree of phylogenetic signal is a flagship goal in evolutionary biology. Signatures of phylogenetic signal can assist the resolution of a broad range of evolutionary questions regarding the tempo and mode of phenotypic evolution. However, despite the considerable number of strategies to measure it, few and limited approaches exist for categorical traits. Here, we used the concept of Shannon entropy and propose the δ statistic for evaluating the degree of phylogenetic signal between a phylogeny and categorical traits. RESULTS: We validated δ as a measure of phylogenetic signal: the higher the δ-value the higher the degree of phylogenetic signal between a given tree and a trait. Based on simulated data we proposed a threshold-based classification test to pinpoint cases of phylogenetic signal. The assessment of the test's specificity and sensitivity suggested that the δ approach should only be applied to 20 or more species. We have further tested the performance of δ in scenarios of branch length and topology uncertainty, unbiased and biased trait evolution and trait saturation. Our results showed that δ may be applied in a wide range of phylogenetic contexts. Finally, we investigated our method in 14,360 mammalian gene trees and found that olfactory receptor genes are significantly associated with the mammalian activity patterns, a result that is congruent with expectations and experiments from the literature. Our application shows that δ can successfully detect molecular signatures of phenotypic evolution. We conclude that δ represents a useful measure of phylogenetic signal since many phenotypes can only be measured in categories. AVAILABILITY AND IMPLEMENTATION: https://github.com/mrborges23/delta_statistic. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
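The abstract builds δ on Shannon entropy but does not give the formula, so the sketch below shows only the underlying entropy calculation for a categorical distribution, for example a vector of state probabilities at one node of a tree. It is not the δ statistic itself, and the function name and example vectors are assumptions made for illustration.

```python
import math


def shannon_entropy(weights, base=2.0):
    """Shannon entropy of a categorical distribution given non-negative weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must contain at least one positive value")
    entropy = 0.0
    for w in weights:
        if w > 0:  # by convention 0 * log(0) contributes nothing
            p = w / total
            entropy -= p * math.log(p, base)
    return entropy


# Hypothetical state probabilities for a three-state trait at two nodes.
confident_node = [0.98, 0.01, 0.01]   # one state clearly favoured: low entropy
uncertain_node = [1 / 3, 1 / 3, 1 / 3]  # no state favoured: maximal entropy
print(shannon_entropy(confident_node), shannon_entropy(uncertain_node))
```

Low entropy means the states are well resolved and high entropy means they are not, which is the sense in which entropy can quantify how informative a tree is about a trait.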
Affiliation(s)
- Rui Borges
  - CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, Terminal de Cruzeiros do Porto de Leixões, Matosinhos, Portugal
  - Department of Biology, Faculty of Sciences of the University of Porto, FCUP, Porto, Portugal
  - CMUP, Centre of Mathematics of the University of Porto, Porto, Portugal
- João Paulo Machado
  - CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, Terminal de Cruzeiros do Porto de Leixões, Matosinhos, Portugal
- Cidália Gomes
  - CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, Terminal de Cruzeiros do Porto de Leixões, Matosinhos, Portugal
- Ana Paula Rocha
  - Department of Biology, Faculty of Sciences of the University of Porto, FCUP, Porto, Portugal
  - CMUP, Centre of Mathematics of the University of Porto, Porto, Portugal
- Agostinho Antunes
  - CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, Terminal de Cruzeiros do Porto de Leixões, Matosinhos, Portugal
  - Department of Biology, Faculty of Sciences of the University of Porto, FCUP, Porto, Portugal
6
Goremykin V. A Novel Test for Absolute Fit of Evolutionary Models Provides a Means to Correctly Identify the Substitution Model and the Model Tree. Genome Biol Evol 2020; 11:2403-2419. [PMID: 31368483 PMCID: PMC6736042 DOI: 10.1093/gbe/evz167] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Accepted: 07/29/2019] [Indexed: 02/07/2023]
Abstract
A novel test is described that visualizes the absolute model-data fit of the substitution and tree components of an evolutionary model. The test utilizes statistics based on counts of character state matches and mismatches in alignments of observed and simulated sequences. This comparison is used to assess model-data fit. In simulations conducted to evaluate the performance of the test, the test estimator was able to identify both the correct tree topology and substitution model under conditions where the Goldman-Cox test (which tests the fit of a substitution model to sequence data and is also based on comparing simulated replicates with observed data) showed high error rates. The novel test was found to identify the correct tree topology within a wide range of DNA substitution model misspecifications, indicating the high discriminatory power of the test. Use of this test provides a practical approach for assessing absolute model-data fit when testing phylogenetic hypotheses.
Affiliation(s)
- Vadim Goremykin
  - Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Trentino, Italy
7
Casanellas M, Fernández-Sánchez J, Roca-Lacostena J. Embeddability and rate identifiability of Kimura 2-parameter matrices. J Math Biol 2019; 80:995-1019. [PMID: 31705189 DOI: 10.1007/s00285-019-01446-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/01/2019] [Revised: 10/27/2019] [Indexed: 10/25/2022]
Abstract
Deciding whether a substitution matrix is embeddable (i.e. the corresponding Markov process has a continuous-time realization) is an open problem even for [Formula: see text] matrices. We study the embedding problem and rate identifiability for the K80 model of nucleotide substitution. For these [Formula: see text] matrices, we fully characterize the set of embeddable K80 Markov matrices and the set of embeddable matrices for which rates are identifiable. In particular, we describe an open subset of embeddable matrices with non-identifiable rates. This set contains matrices with positive eigenvalues and also diagonal largest in column matrices, which might lead to consequences in parameter estimation in phylogenetics. Finally, we compute the relative volumes of embeddable K80 matrices and of embeddable matrices with identifiable rates. This study concludes the embedding problem for the more general model K81 and its submodels, which had been initiated by the last two authors in a separate work.
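To make "embeddable" concrete, the sketch below builds a K80 Markov matrix directly from a continuous-time process, using the standard closed-form K80 transition probabilities; any matrix produced this way is embeddable by construction. This is not the paper's characterization. The parameter names (`alpha` for the transition rate, `beta` for the transversion rate), the nucleotide order A, G, C, T, and the helper `matmul` are assumptions made for this example.

```python
import math


def k80_transition_matrix(alpha, beta, t):
    """K80 substitution probabilities after time t (rows/columns ordered A, G, C, T)."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    same = 0.25 + 0.25 * e1 + 0.5 * e2   # no net change
    ts = 0.25 + 0.25 * e1 - 0.5 * e2     # transition (A<->G or C<->T)
    tv = 0.25 - 0.25 * e1                # each of the two transversions
    return [
        [same, ts, tv, tv],
        [ts, same, tv, tv],
        [tv, tv, same, ts],
        [tv, tv, ts, same],
    ]


def matmul(A, B):
    """Plain 4x4 (or any compatible size) matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Continuous-time realizations compose: P(t1) P(t2) equals P(t1 + t2).
P_short = k80_transition_matrix(2.0, 0.5, 0.1)
P_long = k80_transition_matrix(2.0, 0.5, 0.3)
P_sum = k80_transition_matrix(2.0, 0.5, 0.4)
composed = matmul(P_short, P_long)
print(max(abs(composed[i][j] - P_sum[i][j]) for i in range(4) for j in range(4)))
```

The harder question addressed in the abstract runs the other way: given a K80 Markov matrix, decide whether some rate matrix generates it and whether those rates are unique.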
Affiliation(s)
- Marta Casanellas
  - Dpt. Matemàtiques, Universitat Politècnica de Catalunya and BGSMath, Diagonal 647, 08028, Barcelona, Spain
- Jesús Fernández-Sánchez
  - Dpt. Matemàtiques, Universitat Politècnica de Catalunya and BGSMath, Diagonal 647, 08028, Barcelona, Spain
- Jordi Roca-Lacostena
  - Dpt. Matemàtiques, Universitat Politècnica de Catalunya and BGSMath, Diagonal 647, 08028, Barcelona, Spain
8
Ying H, Cooke I, Sprungala S, Wang W, Hayward DC, Tang Y, Huttley G, Ball EE, Forêt S, Miller DJ. Comparative genomics reveals the distinct evolutionary trajectories of the robust and complex coral lineages. Genome Biol 2018; 19:175. [PMID: 30384840 PMCID: PMC6214176 DOI: 10.1186/s13059-018-1552-8] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 06/23/2018] [Accepted: 09/28/2018] [Indexed: 01/01/2023]
Abstract
BACKGROUND: Despite the biological and economic significance of scleractinian reef-building corals, the lack of large molecular datasets for a representative range of species limits understanding of many aspects of their biology. Within the Scleractinia, based on molecular evidence, it is generally recognised that there are two major clades, Complexa and Robusta, but the genomic bases of significant differences between them remain unclear. RESULTS: Draft genome assemblies and annotations were generated for three coral species: Galaxea fascicularis (Complexa), Fungia sp., and Goniastrea aspera (Robusta). Whilst phylogenetic analyses strongly support a deep split between Complexa and Robusta, synteny analyses reveal a high level of gene order conservation between all corals, but not between corals and sea anemones or between sea anemones. HOX-related gene clusters are, however, well preserved across all of these combinations. Differences between species are apparent in the distribution and numbers of protein domains and an apparent correlation between number of HSP20 proteins and stress tolerance. Uniquely amongst animals, a complete histidine biosynthesis pathway is present in robust corals but not in complex corals or sea anemones. This pathway appears to be ancestral, and its retention in the robust coral lineage has important implications for coral nutrition and symbiosis. CONCLUSIONS: The availability of three new coral genomes enabled recognition of a de novo histidine biosynthesis pathway in robust corals which is only the second identified biosynthetic difference between corals. These datasets provide a platform for understanding many aspects of coral biology, particularly the interactions of corals with their endosymbionts.
Affiliation(s)
- Hua Ying
  - Division of Ecology and Evolution, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
- Ira Cooke
  - Comparative Genomics Centre, Department of Molecular and Cell Biology, James Cook University, Townsville, QLD 4811 Australia
- Susanne Sprungala
  - Comparative Genomics Centre, Department of Molecular and Cell Biology, James Cook University, Townsville, QLD 4811 Australia
- Weiwen Wang
  - Division of Ecology and Evolution, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
- David C. Hayward
  - Division of Ecology and Evolution, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
- Yurong Tang
  - Division of Ecology and Evolution, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
  - Computational Biology and Bioinformatics Unit, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
- Gavin Huttley
  - Division of Ecology and Evolution, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
  - Computational Biology and Bioinformatics Unit, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
- Eldon E. Ball
  - Division of Ecology and Evolution, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
  - ARC Centre of Excellence for Coral Reef Studies, James Cook University, Townsville, QLD 4811 Australia
- Sylvain Forêt
  - Division of Ecology and Evolution, Research School of Biology, Australian National University, Acton, ACT 2601 Australia
  - ARC Centre of Excellence for Coral Reef Studies, James Cook University, Townsville, QLD 4811 Australia
- David J. Miller
  - Comparative Genomics Centre, Department of Molecular and Cell Biology, James Cook University, Townsville, QLD 4811 Australia
  - ARC Centre of Excellence for Coral Reef Studies, James Cook University, Townsville, QLD 4811 Australia
9
Using the Mutation-Selection Framework to Characterize Selection on Protein Sequences. Genes (Basel) 2018; 9:genes9080409. [PMID: 30104502 PMCID: PMC6115872 DOI: 10.3390/genes9080409] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 06/19/2018] [Revised: 08/02/2018] [Accepted: 08/09/2018] [Indexed: 12/13/2022]
Abstract
When mutational pressure is weak, the generative process of protein evolution involves explicit probabilities of mutations of different types coupled to their conditional probabilities of fixation dependent on selection. Establishing this mechanistic modeling framework for the detection of selection has been a goal in the field of molecular evolution. Building on a mathematical framework proposed more than a decade ago, numerous methods have been introduced in an attempt to detect and measure selection on protein sequences. In this review, we discuss the structure of the original model, subsequent advances, and the series of assumptions that these models operate under.
10
Kaehler BD, Yap VB, Huttley GA. Standard Codon Substitution Models Overestimate Purifying Selection for Nonstationary Data. Genome Biol Evol 2018; 9:134-149. [PMID: 28175284 PMCID: PMC5381540 DOI: 10.1093/gbe/evw308] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Accepted: 01/02/2017] [Indexed: 01/28/2023]
Abstract
Estimation of natural selection on protein-coding sequences is a key comparative genomics approach for de novo prediction of lineage-specific adaptations. Selective pressure is measured on a per-gene basis by comparing the rate of nonsynonymous substitutions to the rate of synonymous substitutions. All published codon substitution models have been time-reversible and thus assume that sequence composition does not change over time. We previously demonstrated that if time-reversible DNA substitution models are applied in the presence of changing sequence composition, the number of substitutions is systematically biased towards overestimation. We extend these findings to the case of codon substitution models and further demonstrate that the ratio of nonsynonymous to synonymous rates of substitution tends to be underestimated over three data sets of mammals, vertebrates, and insects. Our basis for comparison is a nonstationary codon substitution model that allows sequence composition to change. Goodness-of-fit results demonstrate that our new model tends to fit the data better. Direct measurement of nonstationarity shows that bias in estimates of natural selection and genetic distance increases with the degree of violation of the stationarity assumption. Additionally, inferences drawn under time-reversible models are systematically affected by compositional divergence. As genomic sequences accumulate at an accelerating rate, the importance of accurate de novo estimation of natural selection increases. Our results establish that our new model provides a more robust perspective on this fundamental quantity.
Affiliation(s)
- Benjamin D Kaehler
  - Research School of Biology, College of Medicine, Biology, and Environment, Australian National University, Canberra, ACT, Australia
- Von Bing Yap
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
- Gavin A Huttley
  - Research School of Biology, College of Medicine, Biology, and Environment, Australian National University, Canberra, ACT, Australia
11
Kaehler BD. Full reconstruction of non-stationary strand-symmetric models on rooted phylogenies. J Theor Biol 2017; 420:144-151. [PMID: 28286217 DOI: 10.1016/j.jtbi.2017.03.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/14/2016] [Revised: 03/06/2017] [Accepted: 03/08/2017] [Indexed: 10/20/2022]
Abstract
Understanding the evolutionary relationship among species is of fundamental importance to the biological sciences. The location of the root in any phylogenetic tree is critical as it gives an order to evolutionary events. None of the popular models of nucleotide evolution currently used in likelihood or Bayesian methods are able to infer the location of the root without exogenous information. It is known that the most general Markov models of nucleotide substitution also cannot identify the location of the root or be fitted to multiple sequence alignments with fewer than three sequences. We prove that the location of the root and the full model can be identified and statistically consistently estimated for a non-stationary, strand-symmetric substitution model given a multiple sequence alignment with two or more sequences. We also generalise earlier work to provide a practical means of overcoming the computationally intractable problem of labelling hidden states in a phylogenetic model.
Affiliation(s)
- Benjamin D Kaehler
  - Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
12
Barley AJ, Thomson RC. Assessing the performance of DNA barcoding using posterior predictive simulations. Mol Ecol 2016; 25:1944-57. [PMID: 26915049 DOI: 10.1111/mec.13590] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 09/30/2015] [Revised: 01/05/2016] [Accepted: 01/18/2016] [Indexed: 02/05/2023]
Abstract
Accurate estimates of biodiversity are required for research in a broad array of biological subdisciplines including ecology, evolution, systematics, conservation and biodiversity science. The use of statistical models and genetic data, particularly DNA barcoding, has been suggested as an important tool for remedying the large gaps in our current understanding of biodiversity. However, the reliability of biodiversity estimates obtained using these approaches depends on how well the statistical models that are used describe the evolutionary process underlying the genetic data. In this study, we utilize data from the Barcode of Life Database and posterior predictive simulations to assess the performance of DNA barcoding under commonly used substitution models. We demonstrate that the success of DNA barcoding varies widely across DNA substitution models and that model choice has a substantial impact on the number of operational taxonomic units identified (changing results by ~4-31%). Additionally, we demonstrate that the widely followed practice of a priori assuming the Kimura 2-parameter model for DNA barcoding is statistically unjustified and should be avoided. Using both data-based and inference-based test statistics, we detect variation in model performance across taxonomic groups, clustering algorithms, genetic divergence thresholds and substitution models. Taken together, these results illustrate the importance of considering both model selection and model adequacy in studies quantifying biodiversity.
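For readers unfamiliar with the Kimura 2-parameter correction that the abstract cautions against assuming a priori, the sketch below computes the standard K2P distance between two aligned sequences from the observed proportions of transitions (P) and transversions (Q), using d = -½ ln(1 - 2P - Q) - ¼ ln(1 - 2Q). The function name, the decision to skip sites with gaps or ambiguity codes, and the toy sequences are assumptions for illustration, not part of the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes (an assumption of this sketch)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# Toy pair: 20 sites with two transitions (A->G) and one transversion (A->C).
reference = "AAAAAAAAAAAAAAAAAAAA"
query = "GGCAAAAAAAAAAAAAAAAA"
print(k2p_distance(reference, query))
```

Swapping this correction for one derived from a different substitution model changes the pairwise distances, and therefore which sequences fall under a given divergence threshold when clustering into operational taxonomic units.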
Affiliation(s)
- Anthony J Barley
  - Department of Biology, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA
- Robert C Thomson
  - Department of Biology, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA
13
Arenas M. Trends in substitution models of molecular evolution. Front Genet 2015; 6:319. [PMID: 26579193 PMCID: PMC4620419 DOI: 10.3389/fgene.2015.00319] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Received: 07/27/2015] [Accepted: 10/09/2015] [Indexed: 11/13/2022]
Abstract
Substitution models of evolution describe the process of genetic variation through fixed mutations and constitute the basis of evolutionary analysis at the molecular level. Almost 40 years after the development of the first substitution models, highly sophisticated and data-specific substitution models continue to emerge with the aim of better mimicking real evolutionary processes. Here I describe current trends in substitution models of DNA, codon and amino acid sequence evolution, including advantages and pitfalls of the most popular models. The perspective concludes that despite the large number of currently available substitution models, further research is required for more realistic modeling, especially for DNA coding and amino acid data. Additionally, the development of more accurate complex models should be coupled with new implementations and improvements of methods and frameworks for substitution model selection and downstream evolutionary analysis.
Affiliation(s)
- Miguel Arenas
  - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal