1
Li X, Sunday Okoh O, Sequeira Trovão N. The impact of software and criteria on the selection of best-fit nucleotide substitution models for molecular evolutionary genetic analysis. PLoS One 2025; 20:e0319774. [PMID: 40138374 PMCID: PMC11940733 DOI: 10.1371/journal.pone.0319774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2024] [Accepted: 02/08/2025] [Indexed: 03/29/2025] Open
Abstract
The statistical selection of best-fit models of nucleotide substitution for multiple sequence alignments (MSAs) is routine in phylogenetics. Our analysis of model selection across three widely used phylogenetic programs (jModelTest2, ModelTest-NG, and IQ-TREE) demonstrated that the choice of program did not significantly affect the ability to accurately identify the true nucleotide substitution model. This finding indicates that researchers can confidently rely on any of these programs for model selection, as they offer comparable accuracy without substantial differences. However, our results underscore the critical impact of the information criterion chosen for model selection. BIC consistently outperformed both AIC and AICc in accurately identifying the true model, regardless of the program used. This observation highlights the importance of carefully selecting the information criterion, with a preference for BIC, when determining the best-fit model for phylogenetic analyses. This study provides an assessment of popular model selection programs while contributing to the advancement of more robust statistical methods and tools for accurately identifying the most suitable nucleotide substitution models.
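The three criteria compared in this abstract differ only in how they penalize model complexity. A minimal Python sketch of that difference follows. It assumes the sample size n is the alignment length in sites and counts only substitution-model parameters (real programs typically also count branch lengths). The log-likelihoods, parameter counts, and the helper names `information_criteria` and `best_model` are made up for illustration and are not taken from the paper or from jModelTest2, ModelTest-NG, or IQ-TREE.

```python
import math


def information_criteria(log_likelihood, k, n):
    """Return AIC, AICc and BIC for one fitted model.

    log_likelihood: maximized log-likelihood of the model
    k: number of free parameters (assumption: substitution parameters only)
    n: sample size (assumption: number of alignment sites)
    """
    aic = 2 * k - 2 * log_likelihood
    # AICc adds a small-sample correction that vanishes as n grows.
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    # BIC penalizes each parameter by ln(n) instead of 2.
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


def best_model(fits, n, criterion):
    """Pick the model with the lowest score under the given criterion."""
    scores = {
        name: information_criteria(lnl, k, n)[criterion]
        for name, (lnl, k) in fits.items()
    }
    return min(scores, key=scores.get)


# Hypothetical fits: model name -> (log-likelihood, free parameters).
fits = {
    "JC": (-5230.0, 0),
    "HKY": (-5100.0, 4),
    "GTR+G": (-5092.0, 9),
}

for criterion in ("AIC", "AICc", "BIC"):
    print(criterion, best_model(fits, n=1000, criterion=criterion))
```

With these invented numbers, AIC and AICc prefer the parameter-rich GTR+G while BIC prefers HKY, which illustrates why the choice of criterion can change the selected model even when the program is held fixed.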
Affiliation(s)
- Xingguang Li
  - Ningbo No. 2 Hospital, Ningbo, China
  - Guoke Ningbo Life Science and Health Industry Research Institute, Ningbo, China
- Nídia Sequeira Trovão
  - Division of International Epidemiology and Population Studies, Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America
2
Pyke AT, Wilson DJ, Michie A, Mackenzie JS, Imrie A, Cameron J, Doggett SL, Haniotis J, Herrero LJ, Caly L, Lynch SE, Mee PT, Madzokere ET, Ramirez AL, Paramitha D, Hobson-Peters J, Smith DW, Weir R, Sullivan M, Druce J, Melville L, Robson J, Gibb R, van den Hurk AF, Duchene S. Independent repeated mutations within the alphaviruses Ross River virus and Barmah Forest virus indicates convergent evolution and past positive selection in ancestral populations despite ongoing purifying selection. Virus Evol 2024; 10:veae080. [PMID: 39411152 PMCID: PMC11477980 DOI: 10.1093/ve/veae080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Revised: 07/25/2024] [Accepted: 09/12/2024] [Indexed: 10/19/2024] Open
Abstract
Ross River virus (RRV) and Barmah Forest virus (BFV) are arthritogenic arthropod-borne viruses (arboviruses) that exhibit generalist host associations and share distributions in Australia and Papua New Guinea (PNG). Using stochastic mapping and discrete-trait phylogenetic analyses, we profiled the independent evolution of RRV and BFV signature mutations. Analysis of 186 RRV and 88 BFV genomes demonstrated their viral evolution trajectories have involved repeated selection of mutations, particularly in the nonstructural protein 1 (nsP1) and envelope 3 (E3) genes suggesting convergent evolution. Convergent mutations in the nsP1 genes of RRV (residues 248 and 441) and BFV (residues 297 and 447) may be involved with catalytic enzyme mechanisms and host membrane interactions during viral RNA replication and capping. Convergent E3 mutations (RRV site 59 and BFV site 57) may be associated with enzymatic furin activity and cleavage of E3 from protein precursors assisting viral maturation and infectivity. Given their requirement to replicate in disparate insect and vertebrate hosts, convergent evolution in RRV and BFV may represent a dynamic link between their requirement to selectively 'fine-tune' intracellular host interactions and viral replicative enzymatic processes. Despite evidence of evolutionary convergence, selection pressure analyses did not reveal any RRV or BFV amino acid sites under strong positive selection and only weak positive selection for nonstructural protein sites. These findings may indicate that their alphavirus ancestors were subject to positive selection events which predisposed ongoing pervasive convergent evolution, and this largely supports continued purifying selection in RRV and BFV populations during their replication in mosquito and vertebrate hosts.
Affiliation(s)
- Alyssa T Pyke
  - Public Health Virology Laboratory, Public and Environmental Health Reference Laboratories, Department of Health, Queensland Government, P.O. Box 594, Archerfield, Coopers Plains, Queensland, Australia
- Daniel J Wilson
  - Big Data Institute, Oxford Population Health, University of Oxford, Li Ka Shing Centre for Health Information and Discovery, Old Road Campus, Oxford OX3 7LF, United Kingdom
  - Department for Continuing Education, University of Oxford, 1 Wellington Square, Oxford OX1 2JA, United Kingdom
- Alice Michie
  - School of Biomedical Sciences, University of Western Australia, 35 Stirling Highway, Perth, Western Australia 6009, Australia
- John S Mackenzie
  - Faculty of Health Sciences, Curtin University, G.P.O. Box U1987, Bentley, Western Australia 6845, Australia
- Allison Imrie
  - School of Biomedical Sciences, University of Western Australia, 35 Stirling Highway, Perth, Western Australia 6009, Australia
- Jane Cameron
  - Public Health Virology Laboratory, Public and Environmental Health Reference Laboratories, Department of Health, Queensland Government, P.O. Box 594, Archerfield, Coopers Plains, Queensland, Australia
- Stephen L Doggett
  - NSW Health Pathology, Westmead Hospital, 166-174 Hawkesbury Road Westmead, Sydney, New South Wales 2145, Australia
- John Haniotis
  - NSW Health Pathology, Westmead Hospital, 166-174 Hawkesbury Road Westmead, Sydney, New South Wales 2145, Australia
- Lara J Herrero
  - Gold Coast Campus, Institute for Glycomics, Griffith University, 1 Parklands Drive, Southport, Queensland 4215, Australia
- Leon Caly
  - Victorian Infectious Diseases Reference Laboratory, Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
- Stacey E Lynch
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, 5 Ring Road, Bundoora, Victoria 3083, Australia
- Peter T Mee
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, 5 Ring Road, Bundoora, Victoria 3083, Australia
- Eugene T Madzokere
  - Gold Coast Campus, Institute for Glycomics, Griffith University, 1 Parklands Drive, Southport, Queensland 4215, Australia
- Ana L Ramirez
  - College of Public Health, Medical and Veterinary Sciences, James Cook University, P.O. Box 6811, Cairns, Queensland 4870, Australia
  - Australian Institute of Tropical Health and Medicine, James Cook University, P.O. Box 6811, Cairns, Queensland 4870, Australia
  - The Jackson Laboratory, 10 Discovery Drive Connecticut, Farmington, CT 06032, United States of America
- Devina Paramitha
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Bdg 68 Cooper Road, St. Lucia, Queensland 4072, Australia
- Jody Hobson-Peters
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Bdg 68 Cooper Road, St. Lucia, Queensland 4072, Australia
- David W Smith
  - NSW Health Pathology, Westmead Hospital, 166-174 Hawkesbury Road Westmead, Sydney, New South Wales 2145, Australia
  - School of Medicine, University of Western Australia, 35 Stirling Highway, Perth, Western Australia 6009, Australia
- Richard Weir
  - Department of Primary Industries and Fisheries, Berrimah Veterinary Laboratory, P.O. Box 3000, Darwin, Northern Territory 0801, Australia
- Mitchell Sullivan
  - Public and Environmental Health Reference Laboratories, Department of Health, Queensland Government, P.O Box 594 Archerfield, Coopers Plains, Queensland 4108, Australia
- Julian Druce
  - Victorian Infectious Diseases Reference Laboratory, Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
- Lorna Melville
  - Department of Primary Industries and Fisheries, Berrimah Veterinary Laboratory, P.O. Box 3000, Darwin, Northern Territory 0801, Australia
- Jennifer Robson
  - Department of Microbiology and Molecular Pathology, Sullivan Nicolaides Pathology, P.O. Box 2014 Fortitude Valley, Brisbane, Queensland 4006, Australia
- Robert Gibb
  - Serology, Pathology Queensland Central Laboratory, Royal Brisbane and Women’s Hospital, 40 Butterfield Street Herston, Brisbane, Queensland 4029, Australia
- Andrew F van den Hurk
  - Public Health Virology Laboratory, Public and Environmental Health Reference Laboratories, Department of Health, Queensland Government, P.O. Box 594, Archerfield, Coopers Plains, Queensland, Australia
- Sebastian Duchene
  - Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
  - Evolutionary Dynamics of Infectious Diseases, Department of Computational Biology, Institut Pasteur, 28 Rue du Dr Roux, Paris 75015, France
3
Fabreti LG, Coghill LM, Thomson RC, Höhna S, Brown JM. The Expected Behaviors of Posterior Predictive Tests and Their Unexpected Interpretation. Mol Biol Evol 2024; 41:msae051. [PMID: 38437512 PMCID: PMC10946647 DOI: 10.1093/molbev/msae051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Accepted: 01/09/2024] [Indexed: 03/06/2024] Open
Abstract
Poor fit between models of sequence or trait evolution and empirical data is known to cause biases and lead to spurious conclusions about evolutionary patterns and processes. Bayesian posterior prediction is a flexible and intuitive approach for detecting such cases of poor fit. However, the expected behavior of posterior predictive tests has never been characterized for evolutionary models, which is critical for their proper interpretation. Here, we show that the expected distribution of posterior predictive P-values is generally not uniform, in contrast to frequentist P-values used for hypothesis testing, and extreme posterior predictive P-values often provide more evidence of poor fit than typically appreciated. Posterior prediction assesses model adequacy under highly favorable circumstances, because the model is fitted to the data, which leads to expected distributions that are often concentrated around intermediate values. Nonuniform expected distributions of P-values do not pose a problem for the application of these tests, however, and posterior predictive P-values can be interpreted as the posterior probability that the fitted model would predict a dataset with a test statistic value as extreme as the value calculated from the observed data.
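The interpretation in the last sentence of this abstract can be made concrete with a toy calculation. The Python sketch below is not the authors' phylogenetic procedure. It assumes a deliberately simple model (normal data with known unit variance and a flat prior on the mean, so the posterior of the mean is normal) and an upper-tail test. The function name `posterior_predictive_p` and the example data are hypothetical.

```python
import random
import statistics


def posterior_predictive_p(observed, statistic, n_draws=2000, seed=1):
    """Upper-tail posterior predictive P-value under a toy normal model.

    Assumed model: y_i ~ Normal(mu, 1) with a flat prior on mu, so the
    posterior is mu ~ Normal(mean(y), 1/n).
    """
    rng = random.Random(seed)
    n = len(observed)
    mean_obs = statistics.fmean(observed)
    t_obs = statistic(observed)
    extreme = 0
    for _ in range(n_draws):
        # Draw a parameter value from the posterior...
        mu = rng.gauss(mean_obs, (1.0 / n) ** 0.5)
        # ...then simulate a replicate dataset from the fitted model.
        replicate = [rng.gauss(mu, 1.0) for _ in range(n)]
        # Count replicates at least as extreme as the observed data.
        if statistic(replicate) >= t_obs:
            extreme += 1
    return extreme / n_draws


# Hypothetical data that are far more dispersed than the model allows.
overdispersed = [-5, 5, -4, 4, -3, 3, -2, 2, -1, 1]
p_value = posterior_predictive_p(overdispersed, statistics.variance)
print(p_value)
```

The returned fraction estimates the posterior probability that the fitted model predicts a test statistic at least as large as the observed one. Because the sample variance of these made-up data is far above what a unit-variance model produces, the estimate comes out very close to zero and signals poor fit.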
Affiliation(s)
- Luiza Guimarães Fabreti
  - GeoBio-Center, Ludwig-Maximilians-Universität München, Richard-Wagner-Str. 10, Munich 80333, Germany
  - Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, Richard-Wagner-Str. 10, Munich 80333, Germany
- Lyndon M Coghill
  - Center for Computation & Technology, Louisiana State University, Baton Rouge, LA 70803, USA
  - Present address: Division of Research, Innovation, and Impact & Department of Veterinary Pathobiology, University of Missouri, Columbia, MO 65211, USA
- Robert C Thomson
  - School of Life Sciences, University of Hawai‘i at Mānoa, Honolulu, HI 96822, USA
- Sebastian Höhna
  - GeoBio-Center, Ludwig-Maximilians-Universität München, Richard-Wagner-Str. 10, Munich 80333, Germany
  - Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, Richard-Wagner-Str. 10, Munich 80333, Germany
- Jeremy M Brown
  - Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA
4
Ferreiro D, Khalil R, Sousa SF, Arenas M. Substitution Models of Protein Evolution with Selection on Enzymatic Activity. Mol Biol Evol 2024; 41:msae026. [PMID: 38314876 PMCID: PMC10873502 DOI: 10.1093/molbev/msae026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 01/25/2024] [Accepted: 01/31/2024] [Indexed: 02/07/2024] Open
Abstract
Substitution models of evolution are necessary for diverse evolutionary analyses including phylogenetic tree and ancestral sequence reconstructions. At the protein level, empirical substitution models are traditionally used due to their simplicity, but they ignore the variability of substitution patterns among protein sites. Next, in order to improve the realism of the modeling of protein evolution, a series of structurally constrained substitution models were presented, but still they usually ignore constraints on the protein activity. Here, we present a substitution model of protein evolution with selection on both protein structure and enzymatic activity, and that can be applied to phylogenetics. In particular, the model considers the binding affinity of the enzyme-substrate complex as well as structural constraints that include the flexibility of structural flaps, hydrogen bonds, amino acids backbone radius of gyration, and solvent-accessible surface area that are quantified through molecular dynamics simulations. We applied the model to the HIV-1 protease and evaluated it by phylogenetic likelihood in comparison with the best-fitting empirical substitution model and a structurally constrained substitution model that ignores the enzymatic activity. We found that accounting for selection on the protein activity improves the fitting of the modeled functional regions with the real observations, especially in data with high molecular identity, which recommends considering constraints on the protein activity in the development of substitution models of evolution.
Affiliation(s)
- David Ferreiro
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
- Ruqaiya Khalil
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
- Sergio F Sousa
  - UCIBIO/REQUIMTE, BioSIM, Departamento de Biomedicina, Faculdade de Medicina da Universidade do Porto, 4200-319 Porto, Portugal
- Miguel Arenas
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
5
Brazão JM, Foster PG, Cox CJ. Data-specific substitution models improve protein-based phylogenetics. PeerJ 2023; 11:e15716. [PMID: 37576497 PMCID: PMC10416777 DOI: 10.7717/peerj.15716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Accepted: 06/16/2023] [Indexed: 08/15/2023] Open
Abstract
Calculating amino-acid substitution models that are specific for individual protein data sets is often difficult due to the computational burden of estimating large numbers of rate parameters. In this study, we tested the computational efficiency and accuracy of five methods used to estimate substitution models, namely Codeml, FastMG, IQ-TREE, P4 (maximum likelihood), and P4 (Bayesian inference). Data-specific substitution models were estimated from simulated alignments (with different lengths) that were generated from a known simulation model and simulation tree. Each of the resulting data-specific substitution models was used to calculate the maximum likelihood score of the simulation tree and simulated data that was used to calculate the model, and compared with the maximum likelihood scores of the known simulation model and simulation tree on the same simulated data. Additionally, the commonly-used empirical models, cpREV and WAG, were assessed similarly. Data-specific models performed better than the empirical models, which under-fitted the simulated alignments, had the highest difference to the simulation model maximum-likelihood score, clustered further from the simulation model in principal component analysis ordination, and inferred less accurate trees. Data-specific models and the simulation model shared statistically indistinguishable maximum-likelihood scores, indicating that the five methods were reasonably accurate at estimating substitution models by this measure. Nevertheless, tree statistics showed differences between optimal maximum likelihood trees. Unlike other model estimating methods, trees inferred using data-specific models generated with IQ-TREE and P4 (maximum likelihood) were not significantly different from the trees derived from the simulation model in each analysis, indicating that these two methods alone were the most accurate at estimating data-specific models. 
To show the benefits of using data-specific protein models several published data sets were reanalysed using IQ-TREE-estimated models. These newly estimated models were a better fit to the data than the empirical models that were used by the original authors, often inferred longer trees, and resulted in different tree topologies in more than half of the re-analysed data sets. The results of this study show that software availability and high computation burden are not limitations to generating better-fitting data-specific amino-acid substitution models for phylogenetic analyses.
Affiliation(s)
- João M. Brazão
  - Centro de Ciências do Mar, Universidade do Algarve, Faro, Algarve, Portugal
- Peter G. Foster
  - Department of Life Sciences, Natural History Museum, London, United Kingdom
- Cymon J. Cox
  - Centro de Ciências do Mar, Universidade do Algarve, Faro, Algarve, Portugal
6
Xu J, Wahaab A, Khan S, Nawaz M, Anwar MN, Liu K, Wei J, Hameed M, Ma Z. Recent Population Dynamics of Japanese Encephalitis Virus. Viruses 2023; 15:1312. [PMID: 37376612 DOI: 10.3390/v15061312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 05/31/2023] [Accepted: 05/31/2023] [Indexed: 06/29/2023] Open
Abstract
Japanese encephalitis virus (JEV) causes acute viral encephalitis in humans and reproductive disorders in pigs. JEV emerged during the 1870s in Japan, and since that time, JEV has been transmitted exclusively throughout Asia, according to known reporting and sequencing records. A recent JEV outbreak occurred in Australia, affecting commercial piggeries across different temperate southern Australian states and causing confirmed infections in humans. A total of 47 human cases and 7 deaths were reported. The evolving situation of JEV needs to be reported because of its continuous circulation in endemic regions and its spread to non-endemic areas. Here, we reconstructed the phylogeny and population dynamics of JEV using recent JEV isolates to inform future understanding of disease spread. Phylogenetic analysis shows that the most recent common ancestor occurred about 2993 years ago (YA) (95% highest posterior density (HPD), 2433 to 3569). The Bayesian skyline plot (BSP) shows that JEV demography has lacked fluctuations over the last two decades but that JEV genetic diversity has increased during the last ten years. This indicates potential JEV replication in the reservoir host, which helps the virus maintain its genetic diversity and continue its dispersal into non-endemic areas. The continuous spread in Asia and the recent detection in Australia further support these findings. Therefore, an enhanced surveillance system is needed, along with precautionary measures such as regular vaccination and mosquito control, to avoid future JEV outbreaks.
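The 95% HPD reported above is the narrowest interval containing 95% of the posterior mass. As a hedged illustration of how such an interval is commonly estimated from posterior samples, the Python sketch below scans all windows of sorted samples and keeps the shortest. The function name `hpd_interval` and the sample values are hypothetical; this is not the software or data used in the study.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples.

    Assumes a unimodal posterior; `samples` would typically be MCMC draws
    of a single parameter such as a node age.
    """
    ordered = sorted(samples)
    n = len(ordered)
    # Number of samples each candidate interval must cover
    # (the small tolerance guards against floating-point round-up).
    window = min(n, max(1, math.ceil(mass * n - 1e-9)))
    # Slide the window over the sorted samples and keep the narrowest one.
    start = min(
        range(n - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[start], ordered[start + window - 1]


# Hypothetical right-skewed draws: the single large value is excluded.
draws = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
low, high = hpd_interval(draws, mass=0.9)
print(low, high)
```

Unlike an equal-tailed interval, the HPD interval shifts toward the denser side of a skewed posterior, which is why the two can differ for divergence-time estimates.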
Affiliation(s)
- Jinpeng Xu
  - School of Life Sciences and Food Engineering, Hebei University of Engineering, Handan 056038, China
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
- Abdul Wahaab
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
- Sawar Khan
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
  - Institute of Molecular Biology and Biotechnology, The University of Lahore, Lahore 54000, Pakistan
- Mohsin Nawaz
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
  - Faculty of Veterinary and Animal sciences, University of Poonch, Rawalakot 12350, Pakistan
- Ke Liu
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
- Jianchao Wei
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
- Muddassar Hameed
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
  - Center for Zoonotic and Arthropod-borne Pathogens, Virginia Polytechnic Institute and State University, Blacksburg, VA 24060, USA
- Zhiyong Ma
  - Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Science, Shanghai 200241, China
7
Del Amparo R, Arenas M. Influence of substitution model selection on protein phylogenetic tree reconstruction. Gene 2023; 865:147336. [PMID: 36871672 DOI: 10.1016/j.gene.2023.147336] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/04/2023] [Revised: 02/22/2023] [Accepted: 02/28/2023] [Indexed: 03/06/2023]
Abstract
Probabilistic phylogenetic tree reconstruction is traditionally performed under a best-fitting substitution model of molecular evolution previously selected according to diverse statistical criteria. Interestingly, some recent studies proposed that this procedure is unnecessary for phylogenetic tree reconstruction leading to a debate in the field. In contrast to DNA sequences, phylogenetic tree reconstruction from protein sequences is traditionally based on empirical exchangeability matrices that can differ among taxonomic groups and protein families. Considering this aspect, here we investigated the influence of selecting a substitution model of protein evolution on phylogenetic tree reconstruction by the analyses of real and simulated data. We found that phylogenetic tree reconstructions based on a selected best-fitting substitution model of protein evolution are the most accurate, in terms of topology and branch lengths, compared with those derived from substitution models with amino acid replacement matrices far from the selected best-fitting model, especially when the data has large genetic diversity. Indeed, we found that substitution models with similar amino acid replacement matrices produce similar reconstructed phylogenetic trees, suggesting the use of substitution models as similar as possible to a selected best-fitting model when the latter cannot be used. Therefore, we recommend the use of the traditional protocol of selection among substitution models of evolution for protein phylogenetic tree reconstruction.
Affiliation(s)
- Roberto Del Amparo
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
- Miguel Arenas
  - CINBIO, Universidade de Vigo, 36310 Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain
  - Galicia Sur Health Research Institute (IIS Galicia Sur), 36310 Vigo, Spain
8
Del Amparo R, Arenas M. Consequences of Substitution Model Selection on Protein Ancestral Sequence Reconstruction. Mol Biol Evol 2022; 39:6628884. [PMID: 35789388 PMCID: PMC9254009 DOI: 10.1093/molbev/msac144] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 01/04/2023] Open
Abstract
The selection of the best-fitting substitution model of molecular evolution is a traditional step for phylogenetic inferences, including ancestral sequence reconstruction (ASR). However, a few recent studies suggested that applying this procedure does not affect the accuracy of phylogenetic tree reconstruction. Here, we revisited this debate topic by analyzing the influence of selection among substitution models of protein evolution, with focus on exchangeability matrices, on the accuracy of ASR using simulated and real data. We found that the selected best-fitting substitution model produces the most accurate ancestral sequences, especially if the data present large genetic diversity. Indeed, ancestral sequences reconstructed under substitution models with similar exchangeability matrices were similar, suggesting that if the selected best-fitting model cannot be used for the reconstruction, applying a model similar to the selected one is preferred. We conclude that selecting among substitution models of protein evolution is recommended for reconstructing accurate ancestral sequences.
Affiliation(s)
- Roberto Del Amparo
  - CINBIO, Universidade de Vigo, Vigo, Spain
  - Departamento de Bioquímica, Xenética e Immunoloxía, Universidade de Vigo, Vigo, Spain
- Miguel Arenas
  - CINBIO, Universidade de Vigo, Vigo, Spain
  - Departamento de Bioquímica, Xenética e Immunoloxía, Universidade de Vigo, Vigo, Spain
  - Galicia Sur Health Research Institute (IIS Galicia Sur), Vigo, Spain
9
Abstract
The reconstruction of genetic material of ancestral organisms constitutes a powerful application of evolutionary biology. A fundamental step in this inference is the ancestral sequence reconstruction (ASR), which can be performed with diverse methodologies implemented in computer frameworks. However, most of these methodologies ignore evolutionary properties frequently observed in microbes, such as genetic recombination and complex selection processes, that can bias the traditional ASR. From a practical perspective, here I review methodologies for the reconstruction of ancestral DNA and protein sequences, with particular focus on microbes, and including biases, recommendations, and software implementations. I conclude that microbial ASR is a complex analysis that should be carefully performed and that there is a need for methods to infer more realistic ancestral microbial sequences.
Affiliation(s)
- Miguel Arenas
  - Biomedical Research Center (CINBIO), University of Vigo, Vigo, Spain
  - Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain
  - Galicia Sur Health Research Institute (IIS Galicia Sur), Vigo, Spain
|
10
Spielman SJ. Relative Model Fit Does Not Predict Topological Accuracy in Single-Gene Protein Phylogenetics. Mol Biol Evol 2021; 37:2110-2123. [PMID: 32191313 PMCID: PMC7306691 DOI: 10.1093/molbev/msaa075] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 12/12/2022] Open
Abstract
It is regarded as best practice in phylogenetic reconstruction to perform relative model selection to determine an appropriate evolutionary model for the data. This procedure ranks a set of candidate models according to their goodness of fit to the data, commonly using an information theoretic criterion. Users then specify the best-ranking model for inference. Although it is often assumed that better-fitting models translate to increased accuracy, recent studies have shown that the specific model employed may not substantially affect inferences. We examine whether there is a systematic relationship between relative model fit and topological inference accuracy in protein phylogenetics, using simulations and real sequences. Simulations employed site-heterogeneous mechanistic codon models that are distinct from protein-level phylogenetic inference models, allowing us to investigate how protein models perform when they are misspecified to the data, as will be the case for any real sequence analysis. We broadly find that phylogenies inferred across models with vastly different fits to the data produce highly consistent topologies. We additionally find that all models infer similar proportions of false-positive splits, raising the possibility that all available models of protein evolution are similarly misspecified. Moreover, we find that the parameter-rich GTR (general time reversible) model, whose amino acid exchangeabilities are free parameters, performs similarly to models with fixed exchangeabilities, although the inference precision associated with GTR models was not examined. We conclude that, although relative model selection may not hinder phylogenetic analysis on protein data, it may not offer specific predictable improvements and is not a reliable proxy for accuracy.
11
Goremykin V. A Novel Test for Absolute Fit of Evolutionary Models Provides a Means to Correctly Identify the Substitution Model and the Model Tree. Genome Biol Evol 2020; 11:2403-2419. [PMID: 31368483 PMCID: PMC6736042 DOI: 10.1093/gbe/evz167] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Accepted: 07/29/2019] [Indexed: 02/07/2023] Open
Abstract
A novel test is described that visualizes the absolute model-data fit of the substitution and tree components of an evolutionary model. The test utilizes statistics based on counts of character state matches and mismatches in alignments of observed and simulated sequences. This comparison is used to assess model-data fit. In simulations conducted to evaluate the performance of the test, the test estimator was able to identify both the correct tree topology and substitution model under conditions where the Goldman-Cox test-which tests the fit of a substitution model to sequence data and is also based on comparing simulated replicates with observed data-showed high error rates. The novel test was found to identify the correct tree topology within a wide range of DNA substitution model misspecifications, indicating the high discriminatory power of the test. Use of this test provides a practical approach for assessing absolute model-data fit when testing phylogenetic hypotheses.
Affiliation(s)
- Vadim Goremykin
  - Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Trentino, Italy
12
Du Y, Wu S, Edwards SV, Liu L. The effect of alignment uncertainty, substitution models and priors in building and dating the mammal tree of life. BMC Evol Biol 2019; 19:203. [PMID: 31694538 PMCID: PMC6833305 DOI: 10.1186/s12862-019-1534-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/07/2019] [Accepted: 10/21/2019] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND The flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment. It is widely appreciated that automated alignment protocols can yield inaccuracies, but the relative impact of various sources of error on phylogenomic analysis is not yet known. This study employs an updated mammal data set of 5162 coding loci sampled from 90 species to evaluate the effects of alignment uncertainty, substitution models, and fossil priors on gene tree, species tree, and divergence time estimation. Additionally, a novel coalescent likelihood ratio test is introduced for comparing competing species trees against a given set of gene trees. RESULTS The aligned DNA sequences of 5162 loci from 90 species were trimmed and filtered using trimAl and two filtering protocols. The final dataset contains 4 sets of alignments - before trimming, after trimming, filtered by a recently proposed pipeline, and further filtered by comparing ML gene trees for each locus with the concatenation tree. Our analyses suggest that the average discordance among the coalescent trees is significantly smaller than that among the concatenation trees estimated from the 4 sets of alignments or with different substitution models. There is no significant difference among the divergence times estimated with different substitution models. However, the divergence dates estimated from the alignments after trimming are more recent than those estimated from the alignments before trimming. CONCLUSIONS Our results highlight that alignment uncertainty of the updated mammal data set and the choice of substitution models have little impact on tree topologies yielded by coalescent methods for species tree estimation, whereas they are more influential on the trees made by concatenation.
Given the choice of calibration scheme and clock models, divergence time estimates are robust to the choice of substitution models, but removing alignments deemed problematic by trimming algorithms can lead to more recent dates. Although the fossil prior is important in divergence time estimation, Bayesian estimates of divergence times in this data set are driven primarily by the sequence data.
Affiliation(s)
- Yan Du
- Department of Statistics, University of Georgia, 310 Herty Drive, Athens, GA 30606 USA
| | - Shaoyuan Wu
- Jiangsu Key Laboratory of Phylogenomics & Comparative Genomics, School of Life Sciences, Jiangsu Normal University, Xuzhou, Jiangsu 221116 People’s Republic of China
| | - Scott V. Edwards
- Department of Organismic & Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138 USA
| | - Liang Liu
- Department of Statistics and Institute of Bioinformatics, University of Georgia, 310 Herty Drive, Athens, GA 30606 USA
|
13
|
Parry R, Asgari S. Discovery of Novel Crustacean and Cephalopod Flaviviruses: Insights into the Evolution and Circulation of Flaviviruses between Marine Invertebrate and Vertebrate Hosts. J Virol 2019; 93:e00432-19. [PMID: 31068424 PMCID: PMC6600200 DOI: 10.1128/jvi.00432-19] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 03/12/2019] [Accepted: 04/23/2019] [Indexed: 12/21/2022] Open
Abstract
Most described flaviviruses (family Flaviviridae) are disease-causing pathogens of vertebrates maintained in zoonotic cycles between mosquitoes or ticks and vertebrate hosts. Poor sampling of flaviviruses outside vector-borne flaviviruses such as Zika virus and dengue virus has presented a narrow understanding of flavivirus diversity and evolution. In this study, we discovered three crustacean flaviviruses (Gammarus chevreuxi flavivirus, Gammarus pulex flavivirus, and Crangon crangon flavivirus) and two cephalopod flaviviruses (Southern Pygmy squid flavivirus and Firefly squid flavivirus). Bayesian and maximum likelihood phylogenetic methods demonstrate that crustacean flaviviruses form a well-supported clade and share a more closely related ancestor with terrestrial vector-borne flaviviruses than with classical insect-specific flaviviruses. In addition, we identify variants of Wenzhou shark flavivirus in multiple gazami crab (Portunus trituberculatus) populations, with active replication supported by evidence of an active RNA interference response. This suggests that Wenzhou shark flavivirus moves horizontally between sharks and gazami crabs in ocean ecosystems. Analyses of the mono- and dinucleotide composition of marine flaviviruses compared to that of flaviviruses with known host status suggest that some marine flaviviruses share a nucleotide bias similar to that of vector-borne flaviviruses. Furthermore, we identify crustacean flavivirus endogenous viral elements that are closely related to elements of terrestrial vector-borne flaviviruses. 
Taken together, these data provide evidence of flaviviruses circulating between marine vertebrates and invertebrates, expand our understanding of flavivirus host range, and offer potential insights into the evolution and emergence of terrestrial vector-borne flaviviruses. IMPORTANCE Some flaviviruses are known to cause disease in vertebrates and are typically transmitted by blood-feeding arthropods such as ticks and mosquitoes. While an ever-increasing number of insect-specific flaviviruses have been described, we have a narrow understanding of flavivirus incidence and evolution. To expand this understanding, we discovered a number of novel flaviviruses that infect a range of crustaceans and cephalopod hosts. Phylogenetic analyses of these novel marine flaviviruses suggest that crustacean flaviviruses share a close ancestor to all terrestrial vector-borne flaviviruses, and squid flaviviruses are the most divergent of all known flaviviruses to date. Additionally, our results indicate horizontal transmission of a marine flavivirus between crabs and sharks. Taken together, these data suggest that flaviviruses move horizontally between invertebrates and vertebrates in ocean ecosystems. This study demonstrates that flavivirus invertebrate-vertebrate host associations have arisen in flaviviruses at least twice and may potentially provide insights into the emergence or origin of terrestrial vector-borne flaviviruses.
Affiliation(s)
- Rhys Parry
- Australian Infectious Disease Research Centre, School of Biological Sciences, The University of Queensland, Brisbane, Queensland, Australia
| | - Sassan Asgari
- Australian Infectious Disease Research Centre, School of Biological Sciences, The University of Queensland, Brisbane, Queensland, Australia
|
14
|
Chen W, Kenney T, Bielawski J, Gu H. Testing adequacy for DNA substitution models. BMC Bioinformatics 2019; 20:349. [PMID: 31221105 PMCID: PMC6585133 DOI: 10.1186/s12859-019-2905-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/25/2018] [Accepted: 05/17/2019] [Indexed: 12/22/2022] Open
Abstract
Background Testing model adequacy is important before a DNA substitution model is chosen for phylogenetic inference. Using a mis-specified model can negatively impact phylogenetic inference; for example, the maximum likelihood method can be inconsistent when the DNA sequences are generated under a tree topology which is in the Felsenstein zone and analyzed with a mis-specified or inadequate model. However, model adequacy testing in phylogenetics is underdeveloped. Results Here we develop a simple, general, powerful and robust model test based on Pearson’s goodness-of-fit test and binning of site patterns. We demonstrate through simulation that this test is robust in its high power to reject the inadequate models for a large range of different ways of binning site patterns while the Type I error is controlled well. In the real data analysis we discovered many cases where models chosen by another method can be rejected by this new test; in particular, our proposed test rejects the most complex DNA model (GTR+I+Γ) while the Goldman-Cox test fails to reject the commonly used simple models. Conclusions Model adequacy testing and bootstrap should be used together to assess reliability of conclusions after model selection and model fitting have already been applied to choose the model and fit it. The new goodness-of-fit test proposed in this paper is a simple and powerful model adequacy testing method serving such a regular model checking purpose. We caution against deriving strong conclusions from analyses based on inadequate models. At a minimum, those results derived from inadequate models can now be readily flagged using the new test, and reported as such.
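To illustrate the general idea behind a Pearson goodness-of-fit statistic computed on binned site patterns, here is a minimal sketch. It is not the authors' implementation: the function names, the two-bin rule (constant versus variable columns), and the expected bin probabilities are all hypothetical choices made for the example. In a real analysis the expected probabilities would come from the fitted substitution model and tree.

```python
from collections import Counter


def site_pattern_counts(alignment):
    """Count how often each distinct alignment column (site pattern) occurs."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    return Counter(zip(*alignment))


def bin_constant_vs_variable(pattern_counts):
    """Hypothetical binning rule: bin 0 = constant columns, bin 1 = variable columns."""
    constant = sum(c for pat, c in pattern_counts.items() if len(set(pat)) == 1)
    variable = sum(pattern_counts.values()) - constant
    return [constant, variable]


def pearson_statistic(observed, expected_probs):
    """Pearson X^2 = sum over bins of (O - E)^2 / E, with E = n * p."""
    if len(observed) != len(expected_probs):
        raise ValueError("observed and expected_probs must have the same length")
    n = sum(observed)
    stat = 0.0
    for o, p in zip(observed, expected_probs):
        if p <= 0:
            raise ValueError("expected probabilities must be positive")
        e = n * p
        stat += (o - e) ** 2 / e
    return stat


# Toy alignment of three sequences and six sites (made-up data).
alignment = ["ACGTAC", "ACGTAA", "ACGAAC"]
observed = bin_constant_vs_variable(site_pattern_counts(alignment))

# Assumed model-based bin probabilities (purely illustrative).
expected_probs = [0.5, 0.5]
statistic = pearson_statistic(observed, expected_probs)
print(observed, round(statistic, 4))
```

A large statistic relative to its reference distribution would suggest that the model does not reproduce the observed pattern frequencies. How that reference distribution and its degrees of freedom are derived depends on the binning and on the number of estimated parameters, which this sketch does not address.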
Affiliation(s)
- Wei Chen
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
| | - Toby Kenney
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
| | - Joseph Bielawski
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada; Department of Biology, Dalhousie University, Halifax, Canada
| | - Hong Gu
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada.
|
15
|
|
16
|
Hilton SK, Bloom JD. Modeling site-specific amino-acid preferences deepens phylogenetic estimates of viral sequence divergence. Virus Evol 2018; 4:vey033. [PMID: 30425841 PMCID: PMC6220371 DOI: 10.1093/ve/vey033] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 02/06/2023] Open
Abstract
Molecular phylogenetics is often used to estimate the time since the divergence of modern gene sequences. For highly diverged sequences, such phylogenetic techniques sometimes estimate surprisingly recent divergence times. In the case of viruses, independent evidence indicates that the estimates of deep divergence times from molecular phylogenetics are sometimes too recent. This discrepancy is caused in part by inadequate models of purifying selection leading to branch-length underestimation. Here we examine the effect on branch-length estimation of using models that incorporate experimental measurements of purifying selection. We find that models informed by experimentally measured site-specific amino-acid preferences estimate longer deep branches on phylogenies of influenza virus hemagglutinin. This lengthening of branches is due to more realistic stationary states of the models, and is mostly independent of the branch-length extension from modeling site-to-site variation in amino-acid substitution rate. The branch-length extension from experimentally informed site-specific models is similar to that achieved by other approaches that allow the stationary state to vary across sites. However, the improvements from all of these site-specific but time homogeneous and site independent models are limited by the fact that a protein’s amino-acid preferences gradually shift as it evolves. Overall, our work underscores the importance of modeling site-specific amino-acid preferences when estimating deep divergence times—but also shows the inherent limitations of approaches that fail to account for how these preferences shift over time.
Affiliation(s)
- Sarah K Hilton
- Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center; Department of Genome Sciences, University of Washington, USA
| | - Jesse D Bloom
- Basic Sciences and Computational Biology Program, Fred Hutchinson Cancer Research Center; Department of Genome Sciences, University of Washington, USA; Howard Hughes Medical Institute, Seattle, WA, USA
|
17
|
Duchêne DA, Duchêne S, Ho SYW. Differences in Performance among Test Statistics for Assessing Phylogenomic Model Adequacy. Genome Biol Evol 2018; 10:1375-1388. [PMID: 29788113 PMCID: PMC6007652 DOI: 10.1093/gbe/evy094] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Accepted: 05/11/2018] [Indexed: 11/12/2022] Open
Abstract
Statistical phylogenetic analyses of genomic data depend on models of nucleotide or amino acid substitution. The adequacy of these substitution models can be assessed using a number of test statistics, allowing the model to be rejected when it is found to provide a poor description of the evolutionary process. A potentially valuable use of model-adequacy test statistics is to identify when data sets are likely to produce unreliable phylogenetic estimates, but their differences in performance are rarely explored. We performed a comprehensive simulation study to identify test statistics that are sensitive to some of the most commonly cited sources of phylogenetic estimation error. Our results show that, for many test statistics, traditional thresholds for assessing model adequacy can fail to reject the model when the phylogenetic inferences are inaccurate and imprecise. This is particularly problematic when analysing loci that have few informative sites. We propose new thresholds for assessing substitution model adequacy and demonstrate their effectiveness in analyses of three phylogenomic data sets. These thresholds lead to frequent rejection of the model for loci that yield topological inferences that are imprecise and are likely to be inaccurate. We also propose the use of a summary statistic that provides a practical assessment of overall model adequacy. Our approach offers a promising means of enhancing model choice in genome-scale data sets, potentially leading to improvements in the reliability of phylogenomic inference.
Affiliation(s)
- David A Duchêne
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
| | - Sebastian Duchêne
- Bio21 Molecular Science and Biotechnology Institute, University of Melbourne, Melbourne, VIC, Australia
| | - Simon Y W Ho
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
|
18
|
Zhao K, Henderson E, Bullard K, Oberste MS, Burns CC, Jorba J. PoSE: visualization of patterns of sequence evolution using PAML and MATLAB. BMC Bioinformatics 2018; 19:364. [PMID: 30343671 PMCID: PMC6196406 DOI: 10.1186/s12859-018-2335-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/03/2022] Open
Abstract
Background Determining patterns of nucleotide and amino acid substitution is the first step during sequence evolution analysis. However, it is not easy to visualize the different phylogenetic signatures imprinted in aligned nucleotide and amino acid sequences. Results Here we present PoSE (Pattern of Sequence Evolution), a reliable resource for unveiling the evolutionary history of sequence alignments and for graphically displaying their contents. Substitutions are displayed by category (transitions and transversions), codon position, and phenotypic effect (synonymous and nonsynonymous). Visualization is accomplished using MATLAB scripts wrapped around PAML (Phylogenetic Analysis by Maximum Likelihood), implemented in an easy-to-use graphical user interface. The application displays inferred substitutions estimated by baseml or codeml, two programs included in the PAML software package. PoSE organizes patterns of substitution in eleven plots, including estimated non-synonymous/synonymous ratios (dN/dS) along the sequence alignment. In addition, PoSE provides visualization and annotation of patterns of amino acid substitutions along groups of related sequences that can be graphically inspected in a phylogenetic tree window. Conclusions PoSE is a useful tool to help determine major patterns during sequence evolution of protein-coding sequences, hypervariable regions, or changes in dN/dS ratios. PoSE is publicly available at https://github.com/CDCgov/PoSE
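The transition/transversion categories mentioned in this abstract can be illustrated with a short sketch. This is not PoSE's code, which according to the abstract relies on substitutions inferred by PAML. The hypothetical functions below only classify observed differences between two aligned sequences, using the standard definition that a transition exchanges purine for purine (A/G) or pyrimidine for pyrimidine (C/T), while a transversion exchanges a purine for a pyrimidine or the reverse.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None if there is no scorable change."""
    a, b = a.upper(), b.upper()
    if a == b or a not in NUCLEOTIDES or b not in NUCLEOTIDES:
        return None  # identical bases, gaps, and ambiguity codes are skipped
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind is not None:
            counts[kind] += 1
    return counts


# Made-up example: site 1 is A/G (transition), site 3 is G/T (transversion).
example_counts = count_ts_tv("ACGT", "GCTT")
print(example_counts)
```

Counting raw pairwise differences like this ignores multiple substitutions at the same site, which is one reason tree-based inference is preferred for divergent sequences.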
|
19
|
Spielman SJ, Kosakovsky Pond SL. Relative Evolutionary Rates in Proteins Are Largely Insensitive to the Substitution Model. Mol Biol Evol 2018; 35:2307-2317. [PMID: 29924340 PMCID: PMC6107055 DOI: 10.1093/molbev/msy127] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 12/15/2022] Open
Abstract
The relative evolutionary rates at individual sites in proteins are informative measures of conservation or adaptation. Often used as evolutionarily aware conservation scores, relative rates reveal key functional or strongly selected residues. Estimating rates in a phylogenetic context requires specifying a protein substitution model, which is typically a phenomenological model trained on a large empirical data set. A strong emphasis has traditionally been placed on selecting the "best-fit" model, with the implicit understanding that suboptimal or otherwise ill-fitting models might bias inferences. However, the pervasiveness and degree of such bias has not been systematically examined. We investigated how model choice impacts site-wise relative rates in a large set of empirical protein alignments. We compared models designed for use on any general protein, models designed for specific domains of life, and the simple equal-rates Jukes Cantor-style model (JC). As expected, information theoretic measures showed overwhelming evidence that some models fit the data decidedly better than others. By contrast, estimates of site-specific evolutionary rates were impressively insensitive to the substitution model used, revealing an unexpected degree of robustness to potential model misspecification. A deeper examination of the fewer than 5% of sites for which model inferences differed in a meaningful way showed that the JC model could uniquely identify rapidly evolving sites that models with empirically derived exchangeabilities failed to detect. We conclude that relative protein rates appear robust to the applied substitution model, and any sensible model of protein evolution, regardless of its fit to the data, should produce broadly consistent evolutionary rates.
Affiliation(s)
- Stephanie J Spielman
- Department of Biology, Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA
| | - Sergei L Kosakovsky Pond
- Department of Biology, Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA
|
20
|
Gillung JP, Winterton SL, Bayless KM, Khouri Z, Borowiec ML, Yeates D, Kimsey LS, Misof B, Shin S, Zhou X, Mayer C, Petersen M, Wiegmann BM. Anchored phylogenomics unravels the evolution of spider flies (Diptera, Acroceridae) and reveals discordance between nucleotides and amino acids. Mol Phylogenet Evol 2018; 128:233-245. [PMID: 30110663 DOI: 10.1016/j.ympev.2018.08.007] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 04/05/2018] [Revised: 08/03/2018] [Accepted: 08/07/2018] [Indexed: 11/17/2022]
Abstract
The onset of phylogenomics has contributed to the resolution of numerous challenging evolutionary questions while offering new perspectives regarding biodiversity. However, in some instances, analyses of large genomic datasets can also result in conflicting estimates of phylogeny. Here, we present the first phylogenomic scale study of a dipteran parasitoid family, built upon anchored hybrid enrichment and transcriptomic data of 240 loci of 43 ingroup acrocerid taxa. A new hypothesis for the timing of spider fly evolution is proposed, wielding recent advances in divergence time dating, including the fossilized birth-death process to show that the origin of Acroceridae is younger than previously proposed. To test the robustness of our phylogenetic inferences, we analyzed our datasets using different phylogenetic estimation criteria, including supermatrix and coalescent-based approaches, maximum-likelihood and Bayesian methods, combined with other approaches such as permutations of the data, homogeneous versus heterogeneous models, and alternative data and taxon sets. Resulting topologies based on amino acids and nucleotides are both strongly supported but critically discordant, primarily in terms of the monophyly of Panopinae. Conflict was not resolved by controlling for compositional heterogeneity and saturation in third codon positions, which highlights the need for a better understanding of how different biases affect different data sources. In our study, results based on nucleotides were both more robust to alterations of the data and different analytical methods and more compatible with our current understanding of acrocerid morphology and patterns of host usage.
Affiliation(s)
- Jessica P Gillung
- Bohart Museum of Entomology, University of California, One Shields Ave, Davis, CA 95616, USA; California State Collection of Arthropods, 3294 Meadowview Rd, Sacramento, CA 95832, USA.
| | - Shaun L Winterton
- California State Collection of Arthropods, 3294 Meadowview Rd, Sacramento, CA 95832, USA
| | - Keith M Bayless
- California Academy of Sciences, 55 Music Concourse Drive, San Francisco, CA 94118, USA
| | - Ziad Khouri
- Bohart Museum of Entomology, University of California, One Shields Ave, Davis, CA 95616, USA
| | - Marek L Borowiec
- School of Life Sciences, Social Insect Research Group, Arizona State University, Tempe, AZ, 85287, USA
| | - David Yeates
- National Research Collections Australia, Clunies Ross Street, Acton, ACT 2601, GPO Box 1700, Canberra, ACT 2601, Australia
| | - Lynn S Kimsey
- Bohart Museum of Entomology, University of California, One Shields Ave, Davis, CA 95616, USA
| | - Bernhard Misof
- Center for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, 53113 Bonn, Germany
| | - Seunggwan Shin
- Department of Biological Sciences, University of Memphis, 3700 Walker Avenue, Memphis, TN 38152, USA
| | - Xin Zhou
- Department of Entomology, China Agricultural University, Beijing 100193, China
| | - Christoph Mayer
- Center for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, 53113 Bonn, Germany
| | - Malte Petersen
- Center for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, 53113 Bonn, Germany
| | - Brian M Wiegmann
- Department of Entomology & Plant Pathology, North Carolina State University, 3114 Gardner Hall, Raleigh, NC 27695-7613, USA
|
21
|
Abstract
Genetic sequencing data of pathogens allow one to quantify the evolutionary rate together with epidemiological dynamics using Bayesian phylodynamic methods. Such tools are particularly useful for obtaining a timely understanding of newly emerging epidemic outbreaks. During the West African Ebola virus disease epidemic, an unusually high evolutionary rate was initially estimated, promoting discussions regarding the potential danger of the strain quickly evolving into an even more dangerous virus. We show here that such high evolutionary rates are not necessarily real but can stem from methodological biases in the analyses. While most analyses of epidemic outbreak data are performed such that these biases may be present, we suggest a solution to overcome these biases in the future. Bayesian phylogenetics aims at estimating phylogenetic trees together with evolutionary and population dynamic parameters based on genetic sequences. It has been noted that the clock rate, one of the evolutionary parameters, decreases with an increase in the sampling period of sequences. In particular, clock rates of epidemic outbreaks are often estimated to be higher compared with the long-term clock rate. Purifying selection has been suggested as a biological factor that contributes to this phenomenon, since it purges slightly deleterious mutations from a population over time. However, other factors such as methodological biases may also play a role and make a biological interpretation of results difficult. In this paper, we identify methodological biases originating from the choice of tree prior, that is, the model specifying epidemiological dynamics. With a simulation study we demonstrate that a misspecification of the tree prior can upwardly bias the inferred clock rate and that the interplay of the different models involved in the inference can be complex and nonintuitive. 
We also show that the choice of tree prior can influence the inference of clock rate on real-world Ebola virus (EBOV) datasets. While commonly used tree priors result in very high clock-rate estimates for sequences from the initial phase of the epidemic in Sierra Leone, tree priors allowing for population structure lead to estimates agreeing with the long-term rate for EBOV.
|
22
|
Duchêne DA, Duchêne S, Ho SYW. New Statistical Criteria Detect Phylogenetic Bias Caused by Compositional Heterogeneity. Mol Biol Evol 2017; 34:1529-1534. [PMID: 28333201 DOI: 10.1093/molbev/msx092] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 12/29/2022] Open
Abstract
In statistical phylogenetic analyses of DNA sequences, models of evolutionary change commonly assume that base composition is stationary through time and across lineages. This assumption is violated by many data sets, but it is unclear whether the magnitude of these violations is sufficient to mislead phylogenetic inference. We investigated the impacts of compositional heterogeneity on phylogenetic estimates using a method for assessing model adequacy. Based on a detailed simulation study, we found that common frequentist criteria are highly conservative, such that the model is often rejected when the phylogenetic estimates do not show clear signs of bias. We propose new criteria and provide guidelines for their usage. We apply these criteria to genome-scale data from 40 birds and find that loci with severely non-homogeneous base composition are uncommon. Our results show the importance of using well-informed diagnostic statistics when testing model adequacy for phylogenomic analyses.
Affiliation(s)
- David A Duchêne
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
| | - Sebastian Duchêne
- Centre for Systems Genomics, University of Melbourne, Melbourne, VIC, Australia
| | - Simon Y W Ho
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
|
23
|
Arenas M, Araujo NM, Branco C, Castelhano N, Castro-Nallar E, Pérez-Losada M. Mutation and recombination in pathogen evolution: Relevance, methods and controversies. INFECTION GENETICS AND EVOLUTION 2017; 63:295-306. [PMID: 28951202 DOI: 10.1016/j.meegid.2017.09.029] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 07/26/2017] [Revised: 09/20/2017] [Accepted: 09/21/2017] [Indexed: 02/06/2023]
Abstract
Mutation and recombination drive the evolution of most pathogens by generating the genetic variants upon which selection operates. Those variants can, for example, confer resistance to host immune systems and drug therapies or lead to epidemic outbreaks. Given their importance, diverse evolutionary studies have investigated the abundance and consequences of mutation and recombination in pathogen populations. However, some controversies persist regarding the contribution of each evolutionary force to the development of particular phenotypic observations (e.g., drug resistance). In this study, we revise the importance of mutation and recombination in the evolution of pathogens at both intra-host and inter-host levels. We also describe state-of-the-art analytical methodologies to detect and quantify these two evolutionary forces, including biases that are often ignored in evolutionary studies. Finally, we present some of our former studies involving pathogenic taxa where mutation and recombination played crucial roles in the recovery of pathogenic fitness, the generation of interspecific genetic diversity, or the design of centralized vaccines. This review also illustrates several common controversies and pitfalls in the analysis and in the evaluation and interpretation of mutation and recombination outcomes.
Affiliation(s)
- Miguel Arenas
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain; Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, Porto, Portugal; Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto, Portugal.
| | - Natalia M Araujo
- Laboratory of Molecular Virology, Oswaldo Cruz Institute, FIOCRUZ, Rio de Janeiro, Brazil.
| | - Catarina Branco
- Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, Porto, Portugal; Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto, Portugal.
| | - Nadine Castelhano
- Instituto de Investigação e Inovação em Saúde (i3S), University of Porto, Porto, Portugal; Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Porto, Portugal.
| | - Eduardo Castro-Nallar
- Universidad Andrés Bello, Center for Bioinformatics and Integrative Biology, Facultad de Ciencias Biológicas, Santiago, Chile.
| | - Marcos Pérez-Losada
- Computational Biology Institute, Milken Institute School of Public Health, George Washington University, Ashburn, VA 20147, Washington, DC, United States; CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, Vairão 4485-661, Portugal.
|
24
|
Geoghegan JL, Duchêne S, Holmes EC. Comparative analysis estimates the relative frequencies of co-divergence and cross-species transmission within viral families. PLoS Pathog 2017; 13:e1006215. [PMID: 28178344 PMCID: PMC5319820 DOI: 10.1371/journal.ppat.1006215] [Citation(s) in RCA: 171] [Impact Index Per Article: 21.4] [Received: 11/01/2016] [Revised: 02/21/2017] [Accepted: 02/02/2017] [Indexed: 01/20/2023] Open
Abstract
The cross-species transmission of viruses from one host species to another is responsible for the majority of emerging infections. However, it is unclear whether some virus families have a greater propensity to jump host species than others. If related viruses have an evolutionary history of co-divergence with their hosts there should be evidence of topological similarities between the virus and host phylogenetic trees, whereas host jumping generates incongruent tree topologies. By analyzing co-phylogenetic processes in 19 virus families and their eukaryotic hosts we provide a quantitative and comparative estimate of the relative frequency of virus-host co-divergence versus cross-species transmission among virus families. Notably, our analysis reveals that cross-species transmission is a near universal feature of the viruses analyzed here, with virus-host co-divergence occurring less frequently and always on a subset of viruses. Despite the overall high topological incongruence among virus and host phylogenies, the Hepadnaviridae, Polyomaviridae, Poxviridae, Papillomaviridae and Adenoviridae, all of which possess double-stranded DNA genomes, exhibited more frequent co-divergence than the other virus families studied here. At the other extreme, the virus and host trees for all the RNA viruses studied here, particularly the Rhabdoviridae and the Picornaviridae, displayed high levels of topological incongruence, indicative of frequent host switching. Overall, we show that cross-species transmission plays a major role in virus evolution, with all the virus families studied here having the potential to jump host species, and that increased sampling will likely reveal more instances of host jumping.
Affiliation(s)
- Jemma L. Geoghegan
- Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Life and Environmental Sciences and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia
- Sebastián Duchêne
- Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Life and Environmental Sciences and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia
- Centre for Systems Genomics, The University of Melbourne, Melbourne, Victoria, Australia
- Edward C. Holmes
- Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Life and Environmental Sciences and Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia
|
25
|
Maddison DR. The rapidly changing landscape of insect phylogenetics. Current Opinion in Insect Science 2016; 18:77-82. [PMID: 27939714 DOI: 10.1016/j.cois.2016.09.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/01/2016] [Accepted: 09/23/2016] [Indexed: 06/06/2023]
Abstract
Insect phylogenetics is being profoundly changed by many innovations. Although rapid developments in genomics have center stage, key progress has been made in phenomics, field and museum science, digital databases and pipelines, analytical tools, and the culture of science. The importance of these methodological and cultural changes to the pace of inference of the hexapod Tree of Life is discussed. The innovations have the potential, when synthesized and mobilized in ways as yet unforeseen, to shine light on the million or more clades in insects, and infer their composition with confidence. There are many challenges to overcome before insects can enter the 'phylocognisant age', but because of the promise of genomics, phenomics, and informatics, that is now an imaginable future.
Affiliation(s)
- David R Maddison
- Department of Integrative Biology, 3029 Cordley Hall, Oregon State University, Corvallis, OR 97331, USA.
|
26
|
Duchêne S, Duchêne DA, Di Giallonardo F, Eden JS, Geoghegan JL, Holt KE, Ho SYW, Holmes EC. Cross-validation to select Bayesian hierarchical models in phylogenetics. BMC Evol Biol 2016; 16:115. [PMID: 27230264 PMCID: PMC4880944 DOI: 10.1186/s12862-016-0688-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 02/29/2016] [Accepted: 05/19/2016] [Indexed: 01/12/2023] Open
Abstract
Background: Recent developments in Bayesian phylogenetic models have increased the range of inferences that can be drawn from molecular sequence data. Accordingly, model selection has become an important component of phylogenetic analysis. Methods of model selection generally consider the likelihood of the data under the model in question. In the context of Bayesian phylogenetics, the most common approach involves estimating the marginal likelihood, which is typically done by integrating the likelihood across model parameters, weighted by the prior. Although this method is accurate, it is sensitive to the presence of improper priors. We explored an alternative approach based on cross-validation that is widely used in evolutionary analysis. This involves comparing models according to their predictive performance. Results: We analysed simulated data and a range of viral and bacterial data sets using a cross-validation approach to compare a variety of molecular clock and demographic models. Our results show that cross-validation can be effective in distinguishing between strict- and relaxed-clock models and in identifying demographic models that allow growth in population size over time. In most of our empirical data analyses, the model selected using cross-validation was able to match that selected using marginal-likelihood estimation. The accuracy of cross-validation appears to improve with longer sequence data, particularly when distinguishing between relaxed-clock models. Conclusions: Cross-validation is a useful method for Bayesian phylogenetic model selection. This method can be readily implemented even when considering complex models where selecting an appropriate prior for all parameters may be difficult.
Affiliation(s)
- Sebastián Duchêne
- Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia; School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- David A Duchêne
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Francesca Di Giallonardo
- Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia; School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- John-Sebastian Eden
- Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia; School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Jemma L Geoghegan
- Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia; School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Kathryn E Holt
- Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC, 3010, Australia; Centre for Systems Genomics, The University of Melbourne, Melbourne, VIC, 3010, Australia
- Simon Y W Ho
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
- Edward C Holmes
- Marie Bashir Institute of Infectious Diseases and Biosecurity, Charles Perkins Centre, Sydney Medical School, University of Sydney, Sydney, NSW, 2006, Australia; School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, 2006, Australia
|