1
Mello B, Schrago CG. Modeling Substitution Rate Evolution across Lineages and Relaxing the Molecular Clock. Genome Biol Evol 2024; 16:evae199. [PMID: 39332907 PMCID: PMC11430275 DOI: 10.1093/gbe/evae199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/08/2024] [Indexed: 09/29/2024] Open
Abstract
Relaxing the molecular clock using models of how substitution rates change across lineages has become essential for addressing evolutionary problems. The diversity of rate evolution models and their implementations is substantial, and studies have demonstrated that their impact on divergence time estimates can be as significant as that of calibration information. In this review, we trace the development of rate evolution models from the proposal of the molecular clock concept to the development of sophisticated Bayesian and non-Bayesian methods that handle rate variation in phylogenies. We discuss the various approaches to modeling rate evolution, provide a comprehensive list of available software, and examine the challenges and advancements of the prevalent Bayesian framework, contrasting them with faster non-Bayesian methods. Lastly, we offer insights into potential advancements in the field in the era of big data.
Affiliation(s)
- Beatriz Mello
  - Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617, Brazil
- Carlos G Schrago
  - Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617, Brazil
2
Seidel S, Stadler T. TiDeTree: a Bayesian phylogenetic framework to estimate single-cell trees and population dynamic parameters from genetic lineage tracing data. Proc Biol Sci 2022; 289:20221844. [PMID: 36350216 PMCID: PMC9653226 DOI: 10.1098/rspb.2022.1844] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 11/11/2022] Open
Abstract
The development of organisms and tissues is dictated by an elaborate balance between cell division, apoptosis and differentiation: the cell population dynamics. To quantify these dynamics, we propose a phylodynamic inference approach based on single-cell lineage recorder data. We developed a Bayesian phylogenetic framework, time-scaled developmental trees (TiDeTree), that uses lineage recorder data to estimate time-scaled single-cell trees. By implementing TiDeTree within BEAST 2, we enable joint inference of the time-scaled trees and the cell population dynamics. We validated TiDeTree using simulations and showed that performance further improves when including multiple independent sources of information in the inference, such as frequencies of editing outcomes or experimental replicates. We benchmarked TiDeTree against state-of-the-art methods and showed comparable performance in terms of tree topology, plus direct assessment of uncertainty and co-estimation of additional parameters. To demonstrate TiDeTree's use in practice, we analysed a public dataset containing lineage data from approximately 100 stem cell colonies. We estimated a time-scaled phylogeny for each colony, as well as the cell division and apoptosis rates underlying the growth dynamics of all colonies. We envision that TiDeTree will find broad application in the analysis of single-cell lineage tracing data, which will improve our understanding of cellular processes during development.
Affiliation(s)
- Sophie Seidel
  - Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Tanja Stadler
  - Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
3
Carstens BC, Smith ML, Duckett DJ, Fonseca EM, Thomé MTC. Assessing model adequacy leads to more robust phylogeographic inference. Trends Ecol Evol 2022; 37:402-410. [PMID: 35027224 DOI: 10.1016/j.tree.2021.12.007] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/28/2021] [Revised: 12/06/2021] [Accepted: 12/14/2021] [Indexed: 11/29/2022]
Abstract
Phylogeographic studies base inferences on large data sets and complex demographic models, but these models are applied in ways that could mislead researchers and compromise their inference. Researchers face three challenges associated with the use of models: (i) 'model selection', or the identification of an appropriate model for analysis; (ii) 'evaluation of analytical results', or the interpretation of the biological significance of the resulting parameter estimates, delimitations, and topologies; and (iii) 'model evaluation', or the use of statistical approaches to assess the fit of the model to the data. The field collectively invests most of its energy in point (ii) without considering the other points; we argue that attention to points (i) and (iii) is essential to phylogeographic inference.
Affiliation(s)
- Bryan C Carstens
  - Department of Evolution, Ecology, and Organismal Biology at The Ohio State University, Columbus, OH, USA
- Megan L Smith
  - Department of Biology, Indiana University, Bloomington, IN, USA
- Drew J Duckett
  - Department of Evolution, Ecology, and Organismal Biology at The Ohio State University, Columbus, OH, USA
- Emanuel M Fonseca
  - Department of Evolution, Ecology, and Organismal Biology at The Ohio State University, Columbus, OH, USA
- M Tereza C Thomé
  - Department of Evolution, Ecology, and Organismal Biology at The Ohio State University, Columbus, OH, USA
4
Spielman SJ. Relative Model Fit Does Not Predict Topological Accuracy in Single-Gene Protein Phylogenetics. Mol Biol Evol 2021; 37:2110-2123. [PMID: 32191313 PMCID: PMC7306691 DOI: 10.1093/molbev/msaa075] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 12/12/2022] Open
Abstract
It is regarded as best practice in phylogenetic reconstruction to perform relative model selection to determine an appropriate evolutionary model for the data. This procedure ranks a set of candidate models according to their goodness of fit to the data, commonly using an information theoretic criterion. Users then specify the best-ranking model for inference. Although it is often assumed that better-fitting models translate to increased accuracy, recent studies have shown that the specific model employed may not substantially affect inferences. We examine whether there is a systematic relationship between relative model fit and topological inference accuracy in protein phylogenetics, using simulations and real sequences. Simulations employed site-heterogeneous mechanistic codon models that are distinct from protein-level phylogenetic inference models, allowing us to investigate how protein models perform when they are misspecified to the data, as will be the case for any real sequence analysis. We broadly find that phylogenies inferred across models with vastly different fits to the data produce highly consistent topologies. We additionally find that all models infer similar proportions of false-positive splits, raising the possibility that all available models of protein evolution are similarly misspecified. Moreover, we find that the parameter-rich GTR (general time reversible) model, whose amino acid exchangeabilities are free parameters, performs similarly to models with fixed exchangeabilities, although the inference precision associated with GTR models was not examined. We conclude that, although relative model selection may not hinder phylogenetic analysis on protein data, it may not offer specific predictable improvements and is not a reliable proxy for accuracy.
5
Rice A, Mayrose I. Model adequacy tests for probabilistic models of chromosome-number evolution. New Phytol 2021; 229:3602-3613. [PMID: 33226654 DOI: 10.1111/nph.17106] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/02/2020] [Accepted: 11/18/2020] [Indexed: 05/29/2023]
Abstract
Chromosome number is a central feature of eukaryote genomes. Deciphering patterns of chromosome-number change along a phylogeny is central to the inference of whole genome duplications and ancestral chromosome numbers. ChromEvol is a probabilistic inference tool that allows the evaluation of several models of chromosome-number evolution and their fit to the data. However, fitting a model does not necessarily mean that the model describes the empirical data adequately. This vulnerability may lead to incorrect conclusions when model assumptions are not met by real data. Here, we present a model adequacy test for likelihood models of chromosome-number evolution. The procedure allows us to determine whether the model can generate data with similar characteristics as those found in the observed ones. We demonstrate that using inadequate models can lead to inflated errors in several inference tasks. Applying the developed method to 200 angiosperm genera, we find that in many of these, the best-fitting model provides poor fit to the data. The inadequacy rate increases in large clades or in those in which hybridizations are present. The developed model adequacy test can help researchers to identify phylogenies whose underlying evolutionary patterns deviate substantially from current modelling assumptions and should guide future methods development.
Affiliation(s)
- Anna Rice
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, 69978, Israel
- Itay Mayrose
  - School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, 69978, Israel
6
Bilderbeek RJC, Laudanno G, Etienne RS. Quantifying the impact of an inference model in Bayesian phylogenetics. Methods Ecol Evol 2020. [DOI: 10.1111/2041-210x.13514] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/20/2023]
Affiliation(s)
- Richèl J. C. Bilderbeek
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, The Netherlands
- Giovanni Laudanno
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, The Netherlands
- Rampal S. Etienne
  - Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, The Netherlands
7
Mello B, Tao Q, Barba-Montoya J, Kumar S. Molecular dating for phylogenies containing a mix of populations and species by using Bayesian and RelTime approaches. Mol Ecol Resour 2020; 21:122-136. [PMID: 32881388 DOI: 10.1111/1755-0998.13249] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 02/01/2019] [Revised: 08/14/2020] [Accepted: 08/19/2020] [Indexed: 12/11/2022]
Abstract
Simultaneous molecular dating of population and species divergences is essential in many biological investigations, including phylogeography, phylodynamics and species delimitation studies. In these investigations, multiple sequence alignments consist of both intra- and interspecies samples (mixed samples). As a result, the phylogenetic trees contain interspecies, interpopulation and within-population divergences. Bayesian relaxed clock methods are often employed in these analyses, but they assume the same tree prior for both inter- and intraspecies branching processes and require specification of a clock model for branch rates (independent vs. autocorrelated rates models). We evaluated the impact of a single tree prior on Bayesian divergence time estimates by analysing computer-simulated data sets. We also examined the effect of the assumption of independence of evolutionary rate variation among branches when the branch rates are autocorrelated. The Bayesian approach with coalescent tree priors generally produced excellent molecular dates and highest posterior densities with high coverage probabilities. We also evaluated the performance of a non-Bayesian method, RelTime, which does not require the specification of a tree prior or a clock model. RelTime's performance was similar to that of the Bayesian approach, suggesting that it is also suitable for analysing data sets containing both population and species variation when its computational efficiency is needed.
Affiliation(s)
- Beatriz Mello
  - Department of Genetics, Federal University of Rio de Janeiro, Brazil
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
- Qiqing Tao
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
  - Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
- Jose Barba-Montoya
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
  - Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
- Sudhir Kumar
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
  - Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
8
Chen W, Kenney T, Bielawski J, Gu H. Testing adequacy for DNA substitution models. BMC Bioinformatics 2019; 20:349. [PMID: 31221105 PMCID: PMC6585133 DOI: 10.1186/s12859-019-2905-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/25/2018] [Accepted: 05/17/2019] [Indexed: 12/22/2022] Open
Abstract
Background: Testing model adequacy is important before a DNA substitution model is chosen for phylogenetic inference. Using a mis-specified model can negatively impact phylogenetic inference; for example, the maximum likelihood method can be inconsistent when the DNA sequences are generated under a tree topology which is in the Felsenstein Zone and analyzed with a mis-specified or inadequate model. However, model adequacy testing in phylogenetics is underdeveloped. Results: Here we develop a simple, general, powerful and robust model test based on Pearson's goodness-of-fit test and binning of site patterns. We demonstrate through simulation that this test is robust in its high power to reject the inadequate models for a large range of different ways of binning site patterns while the Type I error is controlled well. In the real data analysis we discovered many cases where models chosen by another method can be rejected by this new test; in particular, our proposed test rejects the most complex DNA model (GTR+I+Γ) while the Goldman-Cox test fails to reject the commonly used simple models. Conclusions: Model adequacy testing and bootstrap should be used together to assess reliability of conclusions after model selection and model fitting have already been applied to choose the model and fit it. The new goodness-of-fit test proposed in this paper is a simple and powerful model adequacy testing method serving such a regular model checking purpose. We caution against deriving strong conclusions from analyses based on inadequate models. At a minimum, those results derived from inadequate models can now be readily flagged using the new test, and reported as such.
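As a rough illustration of the idea named in this abstract (Pearson's goodness-of-fit statistic applied to binned site patterns), the sketch below counts alignment columns, groups them into two bins, and computes X² = Σ (O − E)² / E. It is not the authors' procedure: the helper names, the constant-versus-variable binning rule, and the placeholder bin probabilities are all assumptions made for illustration, and a real test would take the expected probabilities from the fitted substitution model.

```python
from collections import Counter


def site_patterns(alignment):
    """Count column patterns in a list of equal-length aligned sequences."""
    return Counter("".join(column) for column in zip(*alignment))


def pearson_statistic(observed, expected_probs):
    """Pearson X^2 = sum over bins of (O - E)^2 / E, with E = n * p."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))


# Toy alignment: three sequences, four sites (illustrative data only).
alignment = ["AACG", "AACT", "AAGT"]
counts = site_patterns(alignment)

# Hypothetical binning rule: constant columns versus variable columns.
constant = sum(c for pattern, c in counts.items() if len(set(pattern)) == 1)
variable = sum(counts.values()) - constant
observed = [constant, variable]

# Placeholder probabilities; in practice these would come from the fitted model.
expected_probs = [0.5, 0.5]
x2 = pearson_statistic(observed, expected_probs)
```

The statistic would then be compared against a chi-square reference distribution with degrees of freedom that depend on the number of bins and estimated parameters; that step is left out here because it depends on the model being tested.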
Affiliation(s)
- Wei Chen
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
- Toby Kenney
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
- Joseph Bielawski
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
  - Department of Biology, Dalhousie University, Halifax, Canada
- Hong Gu
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
9
Duchene S, Bouckaert R, Duchene DA, Stadler T, Drummond AJ. Phylodynamic Model Adequacy Using Posterior Predictive Simulations. Syst Biol 2019; 68:358-364. [PMID: 29945220 PMCID: PMC6368481 DOI: 10.1093/sysbio/syy048] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 02/25/2018] [Accepted: 06/15/2018] [Indexed: 11/18/2022] Open
Abstract
Rapidly evolving pathogens, such as viruses and bacteria, accumulate genetic change on a timescale similar to that of their epidemiological processes, such that it is possible to make inferences about their infectious spread using phylogenetic time-trees. For this purpose it is necessary to choose a phylodynamic model. However, the resulting inferences are contingent on whether the model adequately describes key features of the data. Model adequacy methods allow formal rejection of a model if it cannot generate the main features of the data. We present TreeModelAdequacy, a package for the popular BEAST2 software that allows assessing the adequacy of phylodynamic models. We illustrate its utility by analyzing phylogenetic trees from two viral outbreaks of Ebola and H1N1 influenza. The main features of the Ebola data were adequately described by the coalescent exponential-growth model, whereas the H1N1 influenza data were best described by the birth–death susceptible-infected-recovered model.
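The general logic of a posterior predictive adequacy check can be sketched in a few lines: a summary statistic computed on the empirical tree is compared with the same statistic computed on trees simulated under the fitted model. The code below is a minimal sketch and not the TreeModelAdequacy interface; the function names, the example values, and the two-sided 0.025/0.975 cut-offs are assumptions chosen for illustration.

```python
def posterior_predictive_p(observed_stat, simulated_stats):
    """Fraction of simulated statistics at least as large as the observed one."""
    if not simulated_stats:
        raise ValueError("need at least one simulated statistic")
    hits = sum(1 for s in simulated_stats if s >= observed_stat)
    return hits / len(simulated_stats)


def is_adequate(p_value, lower=0.025, upper=0.975):
    """Assumed two-sided rule: the model is not rejected if p lies inside the bounds."""
    return lower <= p_value <= upper


# Illustrative values of a tree summary statistic (for example, tree height)
# from trees simulated under the fitted model, and the empirical value.
simulated = [10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 9.5, 10.4]
observed = 10.6

p = posterior_predictive_p(observed, simulated)
adequate = is_adequate(p)
```

An observed statistic far outside the simulated range gives a p-value near 0 or 1, which under this assumed rule would flag the model as unable to reproduce that feature of the data.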
Affiliation(s)
- Sebastian Duchene
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, University of Melbourne, Melbourne, Australia
- Remco Bouckaert
  - Centre for Computational Evolution, University of Auckland, Auckland, New Zealand
  - Max Planck Institute for the Science of Human History, Jena, Germany
- David A Duchene
  - School of Life and Environmental Sciences, University of Sydney, Sydney, Australia
- Tanja Stadler
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Alexei J Drummond
  - Centre for Computational Evolution, University of Auckland, Auckland, New Zealand
10
Hilton SK, Bloom JD. Modeling site-specific amino-acid preferences deepens phylogenetic estimates of viral sequence divergence. Virus Evol 2018; 4:vey033. [PMID: 30425841 PMCID: PMC6220371 DOI: 10.1093/ve/vey033] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 02/06/2023] Open
Abstract
Molecular phylogenetics is often used to estimate the time since the divergence of modern gene sequences. For highly diverged sequences, such phylogenetic techniques sometimes estimate surprisingly recent divergence times. In the case of viruses, independent evidence indicates that the estimates of deep divergence times from molecular phylogenetics are sometimes too recent. This discrepancy is caused in part by inadequate models of purifying selection leading to branch-length underestimation. Here we examine the effect on branch-length estimation of using models that incorporate experimental measurements of purifying selection. We find that models informed by experimentally measured site-specific amino-acid preferences estimate longer deep branches on phylogenies of influenza virus hemagglutinin. This lengthening of branches is due to more realistic stationary states of the models, and is mostly independent of the branch-length extension from modeling site-to-site variation in amino-acid substitution rate. The branch-length extension from experimentally informed site-specific models is similar to that achieved by other approaches that allow the stationary state to vary across sites. However, the improvements from all of these site-specific but time homogeneous and site independent models are limited by the fact that a protein’s amino-acid preferences gradually shift as it evolves. Overall, our work underscores the importance of modeling site-specific amino-acid preferences when estimating deep divergence times—but also shows the inherent limitations of approaches that fail to account for how these preferences shift over time.
Affiliation(s)
- Sarah K Hilton
  - Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center
  - Department of Genome Sciences, University of Washington, USA
- Jesse D Bloom
  - Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center
  - Department of Genome Sciences, University of Washington, USA
  - Howard Hughes Medical Institute, Seattle, WA, USA
11
Brown JM, Thomson RC. Evaluating Model Performance in Evolutionary Biology. Annu Rev Ecol Evol Syst 2018. [DOI: 10.1146/annurev-ecolsys-110617-062249] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Indexed: 11/09/2022]
Abstract
Many fields of evolutionary biology now depend on stochastic mathematical models. These models are valuable for their ability to formalize predictions in the face of uncertainty and provide a quantitative framework for testing hypotheses. However, no mathematical model will fully capture biological complexity. Instead, these models attempt to capture the important features of biological systems using relatively simple mathematical principles. These simplifications can allow us to focus on differences that are meaningful, while ignoring those that are not. However, simplification also requires assumptions, and to the extent that these are wrong, so is our ability to predict or compare. Here, we discuss approaches for evaluating the performance of evolutionary models in light of their assumptions by comparing them against reality. We highlight general approaches, how they are applied, and remaining opportunities. Absolute tests of fit, even when not explicitly framed as such, are fundamental to progress in understanding evolution.
Affiliation(s)
- Jeremy M. Brown
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana 70803, USA
- Robert C. Thomson
  - Department of Biology, University of Hawai'i, Honolulu, Hawai'i 96822, USA
12
Duchêne DA, Duchêne S, Ho SYW. Differences in Performance among Test Statistics for Assessing Phylogenomic Model Adequacy. Genome Biol Evol 2018; 10:1375-1388. [PMID: 29788113 PMCID: PMC6007652 DOI: 10.1093/gbe/evy094] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Accepted: 05/11/2018] [Indexed: 11/12/2022] Open
Abstract
Statistical phylogenetic analyses of genomic data depend on models of nucleotide or amino acid substitution. The adequacy of these substitution models can be assessed using a number of test statistics, allowing the model to be rejected when it is found to provide a poor description of the evolutionary process. A potentially valuable use of model-adequacy test statistics is to identify when data sets are likely to produce unreliable phylogenetic estimates, but their differences in performance are rarely explored. We performed a comprehensive simulation study to identify test statistics that are sensitive to some of the most commonly cited sources of phylogenetic estimation error. Our results show that, for many test statistics, traditional thresholds for assessing model adequacy can fail to reject the model when the phylogenetic inferences are inaccurate and imprecise. This is particularly problematic when analysing loci that have few informative sites. We propose new thresholds for assessing substitution model adequacy and demonstrate their effectiveness in analyses of three phylogenomic data sets. These thresholds lead to frequent rejection of the model for loci that yield topological inferences that are imprecise and are likely to be inaccurate. We also propose the use of a summary statistic that provides a practical assessment of overall model adequacy. Our approach offers a promising means of enhancing model choice in genome-scale data sets, potentially leading to improvements in the reliability of phylogenomic inference.
Affiliation(s)
- David A Duchêne
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
- Sebastian Duchêne
  - Bio21 Molecular Science and Biotechnology Institute, University of Melbourne, Melbourne, VIC, Australia
- Simon Y W Ho
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
13
Richards EJ, Brown JM, Barley AJ, Chong RA, Thomson RC. Variation Across Mitochondrial Gene Trees Provides Evidence for Systematic Error: How Much Gene Tree Variation Is Biological? Syst Biol 2018; 67:847-860. [PMID: 29471536 DOI: 10.1093/sysbio/syy013] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 08/01/2017] [Accepted: 02/15/2018] [Indexed: 12/28/2022] Open
Abstract
The use of large genomic data sets in phylogenetics has highlighted extensive topological variation across genes. Much of this discordance is assumed to result from biological processes. However, variation among gene trees can also be a consequence of systematic error driven by poor model fit, and the relative importance of biological vs. methodological factors in explaining gene tree variation is a major unresolved question. Using mitochondrial genomes to control for biological causes of gene tree variation, we estimate the extent of gene tree discordance driven by systematic error and employ posterior prediction to highlight the role of model fit in producing this discordance. We find that the amount of discordance among mitochondrial gene trees is similar to the amount of discordance found in other studies that assume only biological causes of variation. This similarity suggests that the role of systematic error in generating gene tree variation is underappreciated and critical evaluation of fit between assumed models and the data used for inference is important for the resolution of unresolved phylogenetic questions.
Affiliation(s)
- Emilie J Richards
  - Department of Biology, University of Hawai'i, 2538 McCarthy Mall, Edmondson Hall 2016, Honolulu, HI 96822, USA
  - Department of Biology, University of North Carolina, 120 South Road, Coker Hall CB 3280 Chapel Hill, NC 27599, USA
- Jeremy M Brown
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University, 202 Life Sciences Building, Baton Rouge, LA 70803, USA
- Anthony J Barley
  - Department of Biology, University of Hawai'i, 2538 McCarthy Mall, Edmondson Hall 2016, Honolulu, HI 96822, USA
- Rebecca A Chong
  - Department of Biology, University of Hawai'i, 2538 McCarthy Mall, Edmondson Hall 2016, Honolulu, HI 96822, USA
- Robert C Thomson
  - Department of Biology, University of Hawai'i, 2538 McCarthy Mall, Edmondson Hall 2016, Honolulu, HI 96822, USA
14
The molecular clock and evolutionary timescales. Biochem Soc Trans 2018; 46:1183-1190. [PMID: 30154097 DOI: 10.1042/bst20180186] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 04/18/2018] [Revised: 07/17/2018] [Accepted: 07/24/2018] [Indexed: 11/17/2022]
Abstract
The molecular clock provides a valuable means of estimating evolutionary timescales from genetic and biochemical data. Proposed in the early 1960s, it was first applied to amino acid sequences and immunological measures of genetic distances between species. The molecular clock has undergone considerable development over the years, and it retains profound relevance in the genomic era. In this mini-review, we describe the history of the molecular clock, its impact on evolutionary theory, the challenges brought by evidence of evolutionary rate variation among species, and the statistical models that have been developed to account for these heterogeneous rates of genetic change. We explain how the molecular clock can be used to infer rates and timescales of evolution, and we list some of the key findings that have been obtained when molecular clocks have been applied to genomic data. Despite the numerous challenges that it has faced over the decades, the molecular clock continues to offer the most effective method of resolving the details of the evolutionary timescale of the Tree of Life.
15
Foster CSP, Ho SYW. Strategies for Partitioning Clock Models in Phylogenomic Dating: Application to the Angiosperm Evolutionary Timescale. Genome Biol Evol 2018; 9:2752-2763. [PMID: 29036288 PMCID: PMC5647803 DOI: 10.1093/gbe/evx198] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Accepted: 09/25/2017] [Indexed: 12/14/2022] Open
Abstract
Evolutionary timescales can be inferred from molecular sequence data using a Bayesian phylogenetic approach. In these methods, the molecular clock is often calibrated using fossil data. The uncertainty in these fossil calibrations is important because it determines the limiting posterior distribution for divergence-time estimates as the sequence length tends to infinity. Here, we investigate how the accuracy and precision of Bayesian divergence-time estimates improve with the increased clock-partitioning of genome-scale data into clock-subsets. We focus on a data set comprising plastome-scale sequences of 52 angiosperm taxa. There was little difference among the Bayesian date estimates whether we chose clock-subsets based on patterns of among-lineage rate heterogeneity or relative rates across genes, or by random assignment. Increasing the degree of clock-partitioning usually led to an improvement in the precision of divergence-time estimates, but this increase was asymptotic to a limit presumably imposed by fossil calibrations. Our clock-partitioning approaches yielded highly precise age estimates for several key nodes in the angiosperm phylogeny. For example, when partitioning the data into 20 clock-subsets based on patterns of among-lineage rate heterogeneity, we inferred crown angiosperms to have arisen 198–178 Ma. This demonstrates that judicious clock-partitioning can improve the precision of molecular dating based on phylogenomic data, but the meaning of this increased precision should be considered critically.
Affiliation(s)
- Charles S P Foster
  - School of Life and Environmental Sciences, University of Sydney, Sydney, New South Wales 2006, Australia
- Simon Y W Ho
  - School of Life and Environmental Sciences, University of Sydney, Sydney, New South Wales 2006, Australia
16
Abstract
Genetic sequencing data of pathogens allow one to quantify the evolutionary rate together with epidemiological dynamics using Bayesian phylodynamic methods. Such tools are particularly useful for obtaining a timely understanding of newly emerging epidemic outbreaks. During the West African Ebola virus disease epidemic, an unusually high evolutionary rate was initially estimated, promoting discussions regarding the potential danger of the strain quickly evolving into an even more dangerous virus. We show here that such high evolutionary rates are not necessarily real but can stem from methodological biases in the analyses. While most analyses of epidemic outbreak data are performed such that these biases may be present, we suggest a solution to overcome these biases in the future.
Bayesian phylogenetics aims at estimating phylogenetic trees together with evolutionary and population dynamic parameters based on genetic sequences. It has been noted that the clock rate, one of the evolutionary parameters, decreases with an increase in the sampling period of sequences. In particular, clock rates of epidemic outbreaks are often estimated to be higher compared with the long-term clock rate. Purifying selection has been suggested as a biological factor that contributes to this phenomenon, since it purges slightly deleterious mutations from a population over time. However, other factors such as methodological biases may also play a role and make a biological interpretation of results difficult. In this paper, we identify methodological biases originating from the choice of tree prior, that is, the model specifying epidemiological dynamics. With a simulation study we demonstrate that a misspecification of the tree prior can upwardly bias the inferred clock rate and that the interplay of the different models involved in the inference can be complex and nonintuitive. We also show that the choice of tree prior can influence the inference of clock rate on real-world Ebola virus (EBOV) datasets. While commonly used tree priors result in very high clock-rate estimates for sequences from the initial phase of the epidemic in Sierra Leone, tree priors allowing for population structure lead to estimates agreeing with the long-term rate for EBOV.
|
17
|
Brown JW, Smith SA. The Past Sure is Tense: On Interpreting Phylogenetic Divergence Time Estimates. Syst Biol 2018; 67:340-353. [PMID: 28945912 DOI: 10.1093/sysbio/syx074] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 03/07/2017] [Accepted: 09/04/2017] [Indexed: 11/12/2022] Open
Abstract
Divergence time estimation (the calibration of a phylogeny to geological time) is an integral first step in modeling the tempo of biological evolution (traits and lineages). However, despite increasingly sophisticated methods to infer divergence times from molecular genetic sequences, the estimated ages of many nodes across the tree of life contrast significantly and consistently with timeframes conveyed by the fossil record. This is perhaps best exemplified by crown angiosperms, where molecular clock (Triassic) estimates predate the oldest (Early Cretaceous) undisputed angiosperm fossils by tens of millions of years or more. While the incompleteness of the fossil record is a common concern, issues of data limitation and model inadequacy are viable (if underexplored) alternative explanations. In this vein, Beaulieu et al. (2015) convincingly demonstrated how methods of divergence time inference can be misled by both (i) extreme state-dependent molecular substitution rate heterogeneity and (ii) biased sampling of representative major lineages. These results demonstrate the impact of (potentially common) model violations. Here, we suggest another potential challenge: that the configuration of the statistical inference problem (i.e., the parameters, their relationships, and associated priors) alone may preclude the reconstruction of the paleontological timeframe for the crown age of angiosperms. We demonstrate, through sampling from the joint prior (formed by combining the tree (diversification) prior with the calibration densities specified for fossil-calibrated nodes), that with no data present at all, an Early Cretaceous age for crown angiosperms is rejected (i.e., has essentially zero probability). More worrisome, however, is that for the 24 nodes calibrated by fossils, almost all have indistinguishable marginal prior and posterior age distributions when employing routine lognormal fossil calibration priors. These results indicate that there is inadequate information in the data to over-rule the joint prior. Given that these calibrated nodes are strategically placed in disparate regions of the tree, they act to anchor the tree scaffold, and so the posterior inference for the tree as a whole is largely determined by the pseudodata present in the (often arbitrary) calibration densities. We recommend, as for any Bayesian analysis, that marginal prior and posterior distributions be carefully compared to determine whether signal is coming from the data or prior belief, especially for parameters of direct interest. This recommendation is not novel. However, given how rarely such checks are carried out in evolutionary biology, it bears repeating. Our results demonstrate the fundamental importance of prior/posterior comparisons in any Bayesian analysis, and we hope that they further encourage both researchers and journals to consistently adopt this crucial step as standard practice. Finally, we note that the results presented here do not refute the biological modeling concerns identified by Beaulieu et al. (2015). Both sets of issues remain apposite to the goals of accurate divergence time estimation, and only by considering them in tandem can we move forward more confidently.
Affiliation(s)
- Joseph W Brown
  - Department of Ecology & Evolutionary Biology, University of Michigan, 830 North University Avenue, Ann Arbor, MI 48109, USA
- Stephen A Smith
  - Department of Ecology & Evolutionary Biology, University of Michigan, 830 North University Avenue, Ann Arbor, MI 48109, USA
|
18
|
Barley AJ, Brown JM, Thomson RC. Impact of Model Violations on the Inference of Species Boundaries Under the Multispecies Coalescent. Syst Biol 2018; 67:269-284. [PMID: 28945903 DOI: 10.1093/sysbio/syx073] [Citation(s) in RCA: 64] [Impact Index Per Article: 9.1] [Received: 03/15/2017] [Accepted: 08/31/2017] [Indexed: 11/14/2022] Open
Abstract
The use of genetic data for identifying species-level lineages across the tree of life has received increasing attention in the field of systematics over the past decade. The multispecies coalescent model provides a framework for understanding the process of lineage divergence and has become widely adopted for delimiting species. However, because these studies lack an explicit assessment of model fit, in many cases, the accuracy of the inferred species boundaries is unknown. This is concerning given the large amount of empirical data and theory that highlight the complexity of the speciation process. Here, we seek to fill this gap by using simulation to characterize the sensitivity of inference under the multispecies coalescent (MSC) to several violations of model assumptions thought to be common in empirical data. We also assess the fit of the MSC model to empirical data in the context of species delimitation. Our results show substantial variation in model fit across data sets. Posterior predictive tests find the poorest model performance in data sets that were hypothesized to be impacted by model violations. We also show that while the inferences assuming the MSC are robust to minor model violations, such inferences can be biased under some biologically plausible scenarios. Taken together, these results suggest that researchers can identify individual data sets in which species delimitation under the MSC is likely to be problematic, thereby highlighting the cases where additional lines of evidence to identify species boundaries are particularly important to collect. Our study supports a growing body of work highlighting the importance of model checking in phylogenetics, and the usefulness of tailoring tests of model fit to assess the reliability of particular inferences. [Population structure, gene flow, demographic changes, posterior prediction, simulation, genetics.]
Affiliation(s)
- Anthony J Barley
  - Department of Biology, University of Hawai'i, 2538 McCarthy Mall, Edmondson Hall 216, Honolulu, HI 96822, USA
- Jeremy M Brown
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University, 202 Life Sciences Building, Baton Rouge, LA 70803, USA
- Robert C Thomson
  - Department of Biology, University of Hawai'i, 2538 McCarthy Mall, Edmondson Hall 216, Honolulu, HI 96822, USA
|
19
|
Duchêne DA, Duchêne S, Ho SYW. PhyloMAd: efficient assessment of phylogenomic model adequacy. Bioinformatics 2018; 34:2300-2301. [DOI: 10.1093/bioinformatics/bty103] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 09/07/2017] [Accepted: 02/20/2018] [Indexed: 11/12/2022] Open
Affiliation(s)
- David A Duchêne
  - School of Life and Environmental Sciences, University of Sydney, Sydney, Australia
- Sebastian Duchêne
  - Bio21 Molecular Science and Biotechnology Institute, University of Melbourne, Melbourne, VIC, Australia
- Simon Y W Ho
  - School of Life and Environmental Sciences, University of Sydney, Sydney, Australia
|
20
|
Tongo M, Harkins GW, Dorfman JR, Billings E, Tovanabutra S, de Oliveira T, Martin DP. Unravelling the complicated evolutionary and dissemination history of HIV-1M subtype A lineages. Virus Evol 2018; 4:vey003. [PMID: 29484203 PMCID: PMC5819727 DOI: 10.1093/ve/vey003] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/12/2022] Open
Abstract
Subtype A is one of the rare HIV-1 group M (HIV-1M) lineages that is both widely distributed throughout the world and persists at high frequencies in the Congo Basin (CB), the site where HIV-1M likely originated. This, together with its high degree of diversity, suggests that subtype A is amongst the fittest HIV-1M lineages. Here we use a comprehensive set of published near full-length subtype A sequences and A-derived genome fragments from both circulating and unique recombinant forms (CRFs/URFs) to obtain some insights into how frequently these lineages have independently seeded HIV-1M sub-epidemics in different parts of the world. We do this by inferring when and where the major subtype A lineages and subtype A-derived CRFs originated. Following its origin in the CB during the 1940s, we track the diversification and recombination history of subtype A sequences before and during its dissemination throughout much of the world between the 1950s and 1970s. Collectively, the timings and numbers of detectable subtype A recombination and dissemination events, the present broad global distribution of the sub-epidemics that were seeded by these events, and the high prevalence of subtype A sequences within the regions where these sub-epidemics occurred, suggest that ancestral subtype A viruses (and particularly sub-subtype A1 ancestral viruses) may have been genetically predisposed to become major components of the present epidemic.
Affiliation(s)
- Marcel Tongo
  - KwaZulu-Natal Research Innovation and Sequencing Platform (Krisp), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, Nelson R Mandela School of Medicine, University of KwaZulu-Natal, Durban 4041, South Africa
  - Division of Computational Biology, Department of Integrative Biomedical Sciences and Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
  - Center of Research for Emerging and Re-Emerging Diseases (CREMER), Institute of Medical Research and Study of Medicinal Plants (IMPM), Yaoundé, Cameroon
- Gordon W Harkins
  - South African MRC Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville 7535, South Africa
- Jeffrey R Dorfman
  - Division of Immunology, Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
  - Division of Immunology, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg 2193, South Africa
- Erik Billings
  - U.S. Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910–7500, USA
  - Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD 20910–7500, USA
- Sodsai Tovanabutra
  - U.S. Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910–7500, USA
  - Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD 20910–7500, USA
- Tulio de Oliveira
  - KwaZulu-Natal Research Innovation and Sequencing Platform (Krisp), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, Nelson R Mandela School of Medicine, University of KwaZulu-Natal, Durban 4041, South Africa
- Darren P Martin
  - Division of Computational Biology, Department of Integrative Biomedical Sciences and Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Cape Town 7925, South Africa
|
21
|
Bromham L, Duchêne S, Hua X, Ritchie AM, Duchêne DA, Ho SYW. Bayesian molecular dating: opening up the black box. Biol Rev Camb Philos Soc 2017; 93:1165-1191. [DOI: 10.1111/brv.12390] [Citation(s) in RCA: 104] [Impact Index Per Article: 13.0] [Received: 01/09/2017] [Revised: 11/13/2017] [Accepted: 11/17/2017] [Indexed: 12/27/2022]
Affiliation(s)
- Lindell Bromham
  - Macroevolution & Macroecology, Division of Ecology & Evolution, Research School of Biology; Australian National University; Canberra ACT 2601 Australia
- Sebastián Duchêne
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute; The University of Melbourne; Melbourne VIC 3010 Australia
  - School of Life and Environmental Sciences; University of Sydney; Sydney NSW 2006 Australia
- Xia Hua
  - Macroevolution & Macroecology, Division of Ecology & Evolution, Research School of Biology; Australian National University; Canberra ACT 2601 Australia
- Andrew M. Ritchie
  - School of Life and Environmental Sciences; University of Sydney; Sydney NSW 2006 Australia
- David A. Duchêne
  - Macroevolution & Macroecology, Division of Ecology & Evolution, Research School of Biology; Australian National University; Canberra ACT 2601 Australia
  - School of Life and Environmental Sciences; University of Sydney; Sydney NSW 2006 Australia
- Simon Y. W. Ho
  - School of Life and Environmental Sciences; University of Sydney; Sydney NSW 2006 Australia
|
22
|
Duchêne DA, Duchêne S, Ho SYW. New Statistical Criteria Detect Phylogenetic Bias Caused by Compositional Heterogeneity. Mol Biol Evol 2017; 34:1529-1534. [PMID: 28333201 DOI: 10.1093/molbev/msx092] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 12/29/2022] Open
Abstract
In statistical phylogenetic analyses of DNA sequences, models of evolutionary change commonly assume that base composition is stationary through time and across lineages. This assumption is violated by many data sets, but it is unclear whether the magnitude of these violations is sufficient to mislead phylogenetic inference. We investigated the impacts of compositional heterogeneity on phylogenetic estimates using a method for assessing model adequacy. Based on a detailed simulation study, we found that common frequentist criteria are highly conservative, such that the model is often rejected when the phylogenetic estimates do not show clear signs of bias. We propose new criteria and provide guidelines for their usage. We apply these criteria to genome-scale data from 40 birds and find that loci with severely non-homogeneous base composition are uncommon. Our results show the importance of using well-informed diagnostic statistics when testing model adequacy for phylogenomic analyses.
Affiliation(s)
- David A Duchêne
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
- Sebastian Duchêne
  - Centre for Systems Genomics, University of Melbourne, Melbourne, VIC, Australia
- Simon Y W Ho
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
|
23
|
Duchêne DA, Hua X, Bromham L. Phylogenetic estimates of diversification rate are affected by molecular rate variation. J Evol Biol 2017; 30:1884-1897. [PMID: 28758282 DOI: 10.1111/jeb.13148] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 04/12/2017] [Revised: 07/16/2017] [Accepted: 07/18/2017] [Indexed: 01/14/2023]
Abstract
Molecular phylogenies are increasingly being used to investigate the patterns and mechanisms of macroevolution. In particular, node heights in a phylogeny can be used to detect changes in rates of diversification over time. Such analyses rest on the assumption that node heights in a phylogeny represent the timing of diversification events, which in turn rests on the assumption that evolutionary time can be accurately predicted from DNA sequence divergence. But there are many influences on the rate of molecular evolution, which might also influence node heights in molecular phylogenies, and thus affect estimates of diversification rate. In particular, a growing number of studies have revealed an association between the net diversification rate estimated from phylogenies and the rate of molecular evolution. Such an association might, by influencing the relative position of node heights, systematically bias estimates of diversification time. We simulated the evolution of DNA sequences under several scenarios where rates of diversification and molecular evolution vary through time, including models where diversification and molecular evolutionary rates are linked. We show that commonly used methods, including metric-based, likelihood and Bayesian approaches, can have a low power to identify changes in diversification rate when molecular substitution rates vary. Furthermore, the association between the rates of speciation and molecular evolution rate can cause the signature of a slowdown or speedup in speciation rates to be lost or misidentified. These results suggest that the multiple sources of variation in molecular evolutionary rates need to be considered when inferring macroevolutionary processes from phylogenies.
Affiliation(s)
- D A Duchêne
  - Macroevolution & Macroecology, Research School of Biology, Australian National University, Canberra, ACT, Australia
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
- X Hua
  - Macroevolution & Macroecology, Research School of Biology, Australian National University, Canberra, ACT, Australia
- L Bromham
  - Macroevolution & Macroecology, Research School of Biology, Australian National University, Canberra, ACT, Australia
|
24
|
Duchêne S, Duchêne DA, Di Giallonardo F, Eden JS, Geoghegan JL, Holt KE, Ho SYW, Holmes EC. Cross-validation to select Bayesian hierarchical models in phylogenetics. BMC Evol Biol 2016; 16:115. [PMID: 27230264 PMCID: PMC4880944 DOI: 10.1186/s12862-016-0688-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 02/29/2016] [Accepted: 05/19/2016] [Indexed: 01/12/2023] Open
Abstract
Background: Recent developments in Bayesian phylogenetic models have increased the range of inferences that can be drawn from molecular sequence data. Accordingly, model selection has become an important component of phylogenetic analysis. Methods of model selection generally consider the likelihood of the data under the model in question. In the context of Bayesian phylogenetics, the most common approach involves estimating the marginal likelihood, which is typically done by integrating the likelihood across model parameters, weighted by the prior. Although this method is accurate, it is sensitive to the presence of improper priors. We explored an alternative approach based on cross-validation that is widely used in evolutionary analysis. This involves comparing models according to their predictive performance.
Results: We analysed simulated data and a range of viral and bacterial data sets using a cross-validation approach to compare a variety of molecular clock and demographic models. Our results show that cross-validation can be effective in distinguishing between strict- and relaxed-clock models and in identifying demographic models that allow growth in population size over time. In most of our empirical data analyses, the model selected using cross-validation was able to match that selected using marginal-likelihood estimation. The accuracy of cross-validation appears to improve with longer sequence data, particularly when distinguishing between relaxed-clock models.
Conclusions: Cross-validation is a useful method for Bayesian phylogenetic model selection. This method can be readily implemented even when considering complex models where selecting an appropriate prior for all parameters may be difficult.
Affiliation(s)
- Sebastián Duchêne
  - Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- David A Duchêne
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Francesca Di Giallonardo
  - Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- John-Sebastian Eden
  - Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Jemma L Geoghegan
  - Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Kathryn E Holt
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC, 3010, Australia
  - Centre for Systems Genomics, The University of Melbourne, Melbourne, VIC, 3010, Australia
- Simon Y W Ho
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Edward C Holmes
  - Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia
  - School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
|
25
|
Abstract
Molecular dating has become central to placing a temporal dimension on the tree of life. Methods for estimating divergence times have been developed for over 50 years, beginning with the proposal of the molecular clock in 1962. We categorize the chronological development of these methods into four generations based on the timing of their origin. In the first generation approaches (1960s-1980s), a strict molecular clock was assumed to date divergences. In the second generation approaches (1990s), the equality of evolutionary rates between species was first tested and then a strict molecular clock applied to estimate divergence times. The third generation approaches (since ∼2000) account for differences in evolutionary rates across the tree by using a statistical model, obviating the need to assume a clock or to test the equality of evolutionary rates among species. Bayesian methods in the third generation require a specific or uniform prior on the speciation process and enable the inclusion of uncertainty in clock calibrations. The fourth generation approaches (since 2012) allow rates to vary from branch to branch, but do not need prior selection of a statistical model to describe the rate variation or the specification of a speciation model. With high accuracy, comparable to Bayesian approaches, and speeds that are orders of magnitude faster, fourth generation methods are able to produce reliable timetrees of thousands of species using genome scale data. We found that early time estimates from second generation studies are similar to those of third and fourth generation studies, indicating that methodological advances have not fundamentally altered the timetree of life, but rather have facilitated time estimation by enabling the inclusion of more species. Nonetheless, we feel an urgent need for testing the accuracy and precision of third and fourth generation methods, including their robustness to misspecification of priors in the analysis of large phylogenies and data sets.
Affiliation(s)
- Sudhir Kumar
  - Institute for Genomics and Evolutionary Medicine, Temple University
  - Center for Biodiversity, Temple University
  - Department of Biology, Temple University
- S Blair Hedges
  - Institute for Genomics and Evolutionary Medicine, Temple University
  - Center for Biodiversity, Temple University
  - Department of Biology, Temple University
|
26
|
Duchêne S, Di Giallonardo F, Holmes EC. Substitution Model Adequacy and Assessing the Reliability of Estimates of Virus Evolutionary Rates and Time Scales. Mol Biol Evol 2015; 33:255-67. [PMID: 26416981 DOI: 10.1093/molbev/msv207] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022] Open
Abstract
Determining the time scale of virus evolution is central to understanding their origins and emergence. The phylogenetic methods commonly used for this purpose can be misleading if the substitution model makes incorrect assumptions about the data. Empirical studies consider a pool of models and select that with the highest statistical fit. However, this does not allow the rejection of all models, even if they poorly describe the data. An alternative is to use model adequacy methods that evaluate the ability of a model to predict hypothetical future observations. This can be done by comparing the empirical data with data generated under the model in question. We conducted simulations to evaluate the sensitivity of such methods with nucleotide, amino acid, and codon data. These effectively detected underparameterized models, but failed to detect mutational saturation and some instances of nonstationary base composition, which can lead to biases in estimates of tree topology and length. To test the applicability of these methods with real data, we analyzed nucleotide and amino acid data sets from the genus Flavivirus of RNA viruses. In most cases these models were inadequate, with the exception of a data set of relatively closely related sequences of Dengue virus, for which the GTR+Γ nucleotide and LG+Γ amino acid substitution models were adequate. Our results partly explain the lack of consensus over estimates of the long-term evolutionary time scale of these viruses, and indicate that assessing the adequacy of substitution models should be routinely used to determine whether estimates are reliable.
Affiliation(s)
- Sebastián Duchêne
  - Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney, NSW, Australia
- Francesca Di Giallonardo
  - Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney, NSW, Australia
- Edward C Holmes
  - Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney, NSW, Australia
|