1
Magee AF, Holbrook AJ, Pekar JE, Caviedes-Solis IW, Matsen FA IV, Baele G, Wertheim JO, Ji X, Lemey P, Suchard MA. Random-Effects Substitution Models for Phylogenetics via Scalable Gradient Approximations. Syst Biol 2024; 73:562-578. [PMID: 38712512 DOI: 10.1093/sysbio/syae019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 02/26/2024] [Accepted: 05/02/2024] [Indexed: 05/08/2024]
Abstract
Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.
Affiliation(s)
- Andrew F Magee
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
- Andrew J Holbrook
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
- Jonathan E Pekar
  - Bioinformatics and Systems Biology Graduate Program, University of California - San Diego, La Jolla, CA, USA
  - Department of Biomedical Informatics, University of California - San Diego, La Jolla, CA, USA
- Frederick A Matsen IV
  - Howard Hughes Medical Institute, Seattle, Washington, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
  - Department of Statistics, University of Washington, Seattle, Washington, USA
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Joel O Wertheim
  - Department of Medicine, University of California - San Diego, La Jolla, CA, USA
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, LA, USA
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A Suchard
  - Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
  - Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA
2
Yu DL, van Lieshout LP, Stevens BAY, Near KJ(J, Stodola JK, Stinson KJ, Slavic D, Wootton SK. AAV Vectors Pseudotyped with Capsids from Porcine and Bovine Species Mediate In Vitro and In Vivo Gene Delivery. Viruses 2023; 16:57. [PMID: 38257756 PMCID: PMC10820940 DOI: 10.3390/v16010057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Revised: 12/19/2023] [Accepted: 12/27/2023] [Indexed: 01/24/2024]
Abstract
Adeno-associated virus (AAV) vectors are among the most widely used delivery vehicles for in vivo gene therapy as they mediate robust and sustained transgene expression with limited toxicity. However, a significant impediment to the broad clinical success of AAV-based therapies is the widespread presence of pre-existing humoral immunity to AAVs in the human population. This immunity arises from the circulation of non-pathogenic endemic human AAV serotypes. One possible solution is to use non-human AAV capsids to pseudotype transgene-containing AAV vector genomes of interest. Due to the low probability of human exposure to animal AAVs, pre-existing immunity to animal-derived AAV capsids should be low. Here, we characterize two novel AAV capsid sequences: one derived from porcine colon tissue and the other from a caprine adenovirus stock. Both AAV capsids proved to be effective transducers of HeLa and HEK293T cells in vitro. In vivo, both capsids were able to transduce the murine nose, lung, and liver after either intranasal or intraperitoneal administration. In addition, we demonstrate that the porcine AAV capsid likely arose from multiple recombination events involving human- and animal-derived AAV sequences. We hypothesize that recurrent recombination events with similar and distantly related AAV sequences represent an effective mechanism for enhancing the fitness of wildtype AAV populations.
Affiliation(s)
- Darrick L. Yu
  - Department of Pathobiology, University of Guelph, Guelph, ON N1G 2W1, Canada
- Jenny K. Stodola
  - Department of Pathobiology, University of Guelph, Guelph, ON N1G 2W1, Canada
- Kevin J. Stinson
  - Department of Pathobiology, University of Guelph, Guelph, ON N1G 2W1, Canada
- Durda Slavic
  - Animal Health Laboratory, Laboratory Services Division, University of Guelph, Guelph, ON N1G 2W1, Canada
- Sarah K. Wootton
  - Department of Pathobiology, University of Guelph, Guelph, ON N1G 2W1, Canada
3
Jun SH, Nasif H, Jennings-Shaffer C, Rich DH, Kooperberg A, Fourment M, Zhang C, Suchard MA, Matsen FA. A topology-marginal composite likelihood via a generalized phylogenetic pruning algorithm. Algorithms Mol Biol 2023; 18:10. [PMID: 37525243 PMCID: PMC10391877 DOI: 10.1186/s13015-023-00235-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Accepted: 07/03/2023] [Indexed: 08/02/2023]
Abstract
Bayesian phylogenetics is a computationally challenging inferential problem. Classical methods are based on random-walk Markov chain Monte Carlo (MCMC), where random proposals are made on the tree parameter and the continuous parameters simultaneously. Variational phylogenetics is a promising alternative to MCMC, in which one fits an approximating distribution to the unnormalized phylogenetic posterior. Previous work fit this variational approximation using stochastic gradient descent, which is the canonical way of fitting general variational approximations. However, phylogenetic trees are special structures, giving opportunities for efficient computation. In this paper we describe a new algorithm that directly generalizes the Felsenstein pruning algorithm (a.k.a. sum-product algorithm) to compute a composite-like likelihood by marginalizing out ancestral states and subtrees simultaneously. We show the utility of this algorithm by rapidly making point estimates for branch lengths of a multi-tree phylogenetic model. These estimates accord with a long MCMC run and with estimates obtained using a variational method, but are much faster to obtain. Thus, although generalized pruning does not lead to a variational algorithm as such, we believe that it will form a useful starting point for variational inference.
Affiliation(s)
- Seong-Hwan Jun
  - Department of Biostatistics and Computational Biology, University of Rochester, Rochester, USA
- Hassan Nasif
  - Department of Statistics, University of Washington, Seattle, USA
- Chris Jennings-Shaffer
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
- David H Rich
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
- Anna Kooperberg
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
- Mathieu Fourment
  - Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo, NSW Australia
- Cheng Zhang
  - School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China
- Marc A Suchard
  - Department of Human Genetics, University of California, Los Angeles, USA
  - Department of Computational Medicine, University of California, Los Angeles, USA
  - Department of Biostatistics, University of California, Los Angeles, USA
- Frederick A Matsen
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA USA
  - Department of Genome Sciences, University of Washington, Seattle, USA
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Mail stop: S2-140, Seattle, WA 98109-1024 USA
4
Jacob Machado D, Scott R, Guirales S, Janies DA. Fundamental evolution of all Orthocoronavirinae including three deadly lineages descendent from Chiroptera-hosted coronaviruses: SARS-CoV, MERS-CoV and SARS-CoV-2. Cladistics 2021; 37:461-488. [PMID: 34570933 PMCID: PMC8239696 DOI: 10.1111/cla.12454] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Accepted: 02/24/2021] [Indexed: 12/14/2022]
Abstract
The severe acute respiratory syndrome coronavirus (SARS-CoV) emerged in humans in 2002. Despite reports showing Chiroptera as the original animal reservoir of SARS-CoV, many argue that Carnivora-hosted viruses are the most likely origin. The emergence of the Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012 also involves Chiroptera-hosted lineages. However, factors such as the lack of comprehensive phylogenies hamper our understanding of host shifts once MERS-CoV emerged in humans and Artiodactyla. Since 2019, the origin of SARS-CoV-2, causative agent of coronavirus disease 2019 (COVID-19), added to this episodic history of zoonotic transmission events. Here we introduce a phylogenetic analysis of 2006 unique and complete genomes of different lineages of Orthocoronavirinae. We used gene annotations to align orthologous sequences for total evidence analysis under the parsimony optimality criterion. Deltacoronavirus and Gammacoronavirus were set as outgroups to understand spillovers of Alphacoronavirus and Betacoronavirus among ten orders of animals. We corroborated that Chiroptera-hosted viruses are the sister group of SARS-CoV, SARS-CoV-2 and MERS-related viruses. Other zoonotic events were qualified and quantified to provide a comprehensive picture of the risk of coronavirus emergence among humans. Finally, we used a 250 SARS-CoV-2 genomes dataset to elucidate the phylogenetic relationship between SARS-CoV-2 and Chiroptera-hosted coronaviruses.
Affiliation(s)
- Denis Jacob Machado
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9331 Robert D. Snyder Rd, Charlotte, NC 28223, USA
- Rachel Scott
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9331 Robert D. Snyder Rd, Charlotte, NC 28223, USA
- Sayal Guirales
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9331 Robert D. Snyder Rd, Charlotte, NC 28223, USA
- Daniel A. Janies
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9331 Robert D. Snyder Rd, Charlotte, NC 28223, USA
5
Oaks JR, Cobb KA, Minin VN, Leaché AD. Marginal Likelihoods in Phylogenetics: A Review of Methods and Applications. Syst Biol 2019; 68:681-697. [PMID: 30668834 PMCID: PMC6701458 DOI: 10.1093/sysbio/syz003] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/27/2018] [Revised: 01/14/2019] [Accepted: 01/15/2019] [Indexed: 11/29/2022]
Abstract
By providing a framework of accounting for the shared ancestry inherent to all life, phylogenetics is becoming the statistical foundation of biology. The importance of model choice continues to grow as phylogenetic models continue to increase in complexity to better capture micro- and macroevolutionary processes. In a Bayesian framework, the marginal likelihood is how data update our prior beliefs about models, which gives us an intuitive measure of comparing model fit that is grounded in probability theory. Given the rapid increase in the number and complexity of phylogenetic models, methods for approximating marginal likelihoods are increasingly important. Here, we try to provide an intuitive description of marginal likelihoods and why they are important in Bayesian model testing. We also categorize and review methods for estimating marginal likelihoods of phylogenetic models, highlighting several recent methods that provide well-behaved estimates. Furthermore, we review some empirical studies that demonstrate how marginal likelihoods can be used to learn about models of evolution from biological data. We discuss promising alternatives that can complement marginal likelihoods for Bayesian model choice, including posterior-predictive methods. Using simulations, we find one alternative method based on approximate-Bayesian computation to be biased. We conclude by discussing the challenges of Bayesian model choice and future directions that promise to improve the approximation of marginal likelihoods and Bayesian phylogenetics as a whole.
Affiliation(s)
- Jamie R Oaks
  - Department of Biological Sciences and Museum of Natural History, Auburn University, Auburn, AL 36849, USA
  - Correspondence to be sent to: Department of Biological Sciences and Museum of Natural History, Auburn University, Auburn, AL 36849, USA; E-mail:
- Kerry A Cobb
  - Department of Biological Sciences and Museum of Natural History, Auburn University, Auburn, AL 36849, USA
- Vladimir N Minin
  - Department of Statistics, University of California, Irvine, CA 92697, USA
- Adam D Leaché
  - Department of Biology and Burke Museum of Natural History and Culture, University of Washington, Seattle, WA 98195, USA
6
Francis A, Moulton V. Identifiability of tree-child phylogenetic networks under a probabilistic recombination-mutation model of evolution. J Theor Biol 2018; 446:160-167. [PMID: 29548737 DOI: 10.1016/j.jtbi.2018.03.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/08/2017] [Accepted: 03/09/2018] [Indexed: 12/27/2022]
Abstract
Phylogenetic networks are an extension of phylogenetic trees which are used to represent evolutionary histories in which reticulation events (such as recombination and hybridization) have occurred. A central question for such networks is that of identifiability, which essentially asks under what circumstances can we reliably identify the phylogenetic network that gave rise to the observed data? Recently, identifiability results have appeared for networks relative to a model of sequence evolution that generalizes the standard Markov models used for phylogenetic trees. However, these results are quite limited in terms of the complexity of the networks that are considered. In this paper, by introducing an alternative probabilistic model for evolution along a network that is based on some ground-breaking work by Thatte for pedigrees, we are able to obtain an identifiability result for a much larger class of phylogenetic networks (essentially the class of so-called tree-child networks). To prove our main theorem, we derive some new results for identifying tree-child networks combinatorially, and then adapt some techniques developed by Thatte for pedigrees to show that our combinatorial results imply identifiability in the probabilistic setting. We hope that the introduction of our new model for networks could lead to new approaches to reliably construct phylogenetic networks.
Affiliation(s)
- Andrew Francis
  - Centre for Research in Mathematics, Western Sydney University, Sydney, Australia.
- Vincent Moulton
  - School of Computing Sciences, University of East Anglia, Norwich, UK.
7
Gurtler V, Grando D, Kumar BK, Maiti B, Karunasagar I, Karunasagar I. The Use of Recombined Ribosomal RNA Operon (rrn) Type-Specific Flanking Genes to Investigate rrn Differences Between Vibrio parahaemolyticus Environmental and Clinical Strains. Gene Reports 2016. [DOI: 10.1016/j.genrep.2016.02.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/22/2022]
8
Chi PB, Chattopadhyay S, Lemey P, Sokurenko EV, Minin VN. Synonymous and nonsynonymous distances help untangle convergent evolution and recombination. Stat Appl Genet Mol Biol 2016; 14:375-89. [PMID: 26061623 DOI: 10.1515/sagmb-2014-0078] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/29/2023]
Abstract
When estimating a phylogeny from a multiple sequence alignment, researchers often assume the absence of recombination. However, if recombination is present, then tree estimation and all downstream analyses will be impacted, because different segments of the sequence alignment support different phylogenies. Similarly, convergent selective pressures at the molecular level can also lead to phylogenetic tree incongruence across the sequence alignment. Current methods for detection of phylogenetic incongruence are not equipped to distinguish between these two different mechanisms and assume that the incongruence is a result of recombination or other horizontal transfer of genetic information. We propose a new recombination detection method that can make this distinction, based on synonymous codon substitution distances. Although some power is lost by discarding the information contained in the nonsynonymous substitutions, our new method has lower false positive probabilities than the comparable recombination detection method when the phylogenetic incongruence signal is due to convergent evolution. We apply our method to three empirical examples, where we analyze: (1) sequences from a transmission network of the human immunodeficiency virus, (2) tlpB gene sequences from a geographically diverse set of 38 Helicobacter pylori strains, and (3) hepatitis C virus sequences sampled longitudinally from one patient.
9
Franco J, Ferreira RC, Ienne S, Zingales B. ABCG-like transporter of Trypanosoma cruzi involved in benznidazole resistance: gene polymorphisms disclose inter-strain intragenic recombination in hybrid isolates. Infection, Genetics and Evolution 2015; 31:198-208. [PMID: 25660041 DOI: 10.1016/j.meegid.2015.01.030] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 12/01/2014] [Revised: 01/21/2015] [Accepted: 01/29/2015] [Indexed: 02/09/2023]
Abstract
Benznidazole (BZ) is one of the two drugs for Chagas disease treatment. In a previous study we showed that the Trypanosoma cruzi ABCG-like transporter gene, named TcABCG1, is over-expressed in parasite strains naturally resistant to BZ and that the gene of TcI BZ-resistant strains exhibited several single nucleotide polymorphisms (SNPs) as compared to the gene of CL Brener BZ-susceptible strain. Here we report the sequence of TcABCG1 gene of fourteen T. cruzi strains, with diverse degrees of BZ sensitivity and belonging to different discrete typing units (DTUs) and Tcbat group. Although DTU-specific SNPs and amino acid changes were identified, no direct correlation with BZ-resistance phenotype was found. Thus, it is plausible that the transporter abundance is a determinant factor for drug resistance, as pointed out above. Sequence data were used for Bayesian phylogenies and network genealogy analysis. The network showed a high degree of reticulation suggesting genetic exchange between the parasites. TcI and TcII clades were clearly separated. Tcbat sequences were close to TcI. A fourth clade clustered TcABCG1 haplotypes of TcV, TcVI and TcIII strains, with closer proximity to TcI. Analysis of the recombination patterns indicated that hybrid strains contain haplotypes that are mosaics most likely derived by intragenic recombination of parental sequences. The data confirm TcII and TcIII as the parentals of TcV and TcVI DTUs. Since genetic fingerprint of TcI was found in TcIII, we sustain the previously proposed "Two Hybridization model" for the origin of hybrid strains. Among the twenty best BLASTP hits in databases, orthologues of TcABCG1 transporter were found in Leishmania spp. and African trypanosomes, though their function remains undescribed.
Affiliation(s)
- Jaques Franco
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Avenida Professor Lineu Prestes 748, 05508-000 São Paulo, SP, Brazil
- Renata C Ferreira
  - Laboratório de Genômica Evolutiva e Biocomplexidade, Universidade Federal de São Paulo, Rua Pedro de Toledo 669, 04039-032 São Paulo, SP, Brazil
- Susan Ienne
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Avenida Professor Lineu Prestes 748, 05508-000 São Paulo, SP, Brazil
- Bianca Zingales
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Avenida Professor Lineu Prestes 748, 05508-000 São Paulo, SP, Brazil.
10
Poczai P, Varga I, Hyvönen J. Internal transcribed spacer (ITS) evolution in populations of the hyperparasitic European mistletoe pathogen fungus, Sphaeropsis visci (Botryosphaeriaceae): The utility of ITS2 secondary structures. Gene 2014; 558:54-64. [PMID: 25536165 DOI: 10.1016/j.gene.2014.12.042] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 08/18/2014] [Revised: 11/29/2014] [Accepted: 12/19/2014] [Indexed: 01/18/2023]
Abstract
We investigated patterns of nucleotide polymorphism in the internal transcribed spacer (ITS) region for Sphaeropsis visci, a hyperparasitic fungus that causes the leaf spot disease of the hemiparasite European mistletoe (Viscum album). Samples of S. visci were obtained from Hungary covering all major infected forest areas. For obtaining PCR products we used a fast and efficient direct PCR approach based on a high fidelity DNA polymerase. A total of 140 ITS sequences were subjected to an array of complementary sequence analyses, which included analyses of secondary structure stability, nucleotide polymorphism patterns, GC content, and presence of conserved motifs. Analysed sequences exhibited features of functional rRNAs. Overall, polymorphism was observed within less conserved motifs, such as loops and bulges, or, alternatively, as non-canonical G-U pairs within conserved regions of double stranded helices. The secondary structure of ITS2 provides new opportunities for obtaining further valuable information, which could be used in phylogenetic analyses, or at population level as demonstrated in our study. This is due to additional information provided by secondary structures and their models. The combined score matrix was used with the methods implemented in the programme 4SALE. Besides the pseudoprotein coding method of 4SALE, the molecular morphometric character coding also has potential for gaining further information for phylogenetic analyses based on the geometric features of the sub-structural elements of the ITS2 RNA transcript.
Affiliation(s)
- Péter Poczai
  - Botany Unit, Finnish Museum of Natural History, University of Helsinki, PO Box 7, Helsinki FI-00014, Finland.
- Ildikó Varga
  - Plant Biology, Department of Biosciences, PO Box 65, FI-00014, University of Helsinki, Finland.
- Jaakko Hyvönen
  - Plant Biology, Department of Biosciences, PO Box 65, FI-00014, University of Helsinki, Finland.
11
Persing A, Jasra A, Beskos A, Balding D, De Iorio M. A simulation approach for change-points on phylogenetic trees. J Comput Biol 2014; 22:10-24. [PMID: 25506749 DOI: 10.1089/cmb.2014.0218] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
We observe n sequences at each of m sites and assume that they have evolved from an ancestral sequence that forms the root of a binary tree of known topology and branch lengths, but the sequence states at internal nodes are unknown. The topology of the tree and branch lengths are the same for all sites, but the parameters of the evolutionary model can vary over sites. We assume a piecewise constant model for these parameters, with an unknown number of change-points and hence a transdimensional parameter space over which we seek to perform Bayesian inference. We propose two novel ideas to deal with the computational challenges of such inference. Firstly, we approximate the model based on the time machine principle: the top nodes of the binary tree (near the root) are replaced by an approximation of the true distribution; as more nodes are removed from the top of the tree, the cost of computing the likelihood is reduced linearly in n. The approach introduces a bias, which we investigate empirically. Secondly, we develop a particle marginal Metropolis-Hastings (PMMH) algorithm, that employs a sequential Monte Carlo (SMC) sampler and can use the first idea. Our time-machine PMMH algorithm copes well with one of the bottle-necks of standard computational algorithms: the transdimensional nature of the posterior distribution. The algorithm is implemented on simulated and real data examples, and we empirically demonstrate its potential to outperform competing methods based on approximate Bayesian computation (ABC) techniques.
Affiliation(s)
- Adam Persing
  - Department of Statistical Science, University College London, London, United Kingdom
12
Bielejec F, Lemey P, Baele G, Rambaut A, Suchard MA. Inferring heterogeneous evolutionary processes through time: from sequence substitution to phylogeography. Syst Biol 2014; 63:493-504. [PMID: 24627184 PMCID: PMC4055869 DOI: 10.1093/sysbio/syu015] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Indexed: 12/29/2022]
Abstract
Molecular phylogenetic and phylogeographic reconstructions generally assume time-homogeneous substitution processes. Motivated by computational convenience, this assumption sacrifices biological realism and offers little opportunity to uncover the temporal dynamics in evolutionary histories. Here, we propose an evolutionary approach that explicitly relaxes the time-homogeneity assumption by allowing the specification of different infinitesimal substitution rate matrices across different time intervals, called epochs, along the evolutionary history. We focus on an epoch model implementation in a Bayesian inference framework that offers great modeling flexibility in drawing inference about any discrete data type characterized as a continuous-time Markov chain, including phylogeographic traits. To alleviate the computational burden that the additional temporal heterogeneity imposes, we adopt a massively parallel approach that achieves both fine- and coarse-grain parallelization of the computations across branches that accommodate epoch transitions, making extensive use of graphics processing units. Through synthetic examples, we assess model performance in recovering evolutionary parameters from data generated according to different evolutionary scenarios that comprise different numbers of epochs for both nucleotide and codon substitution processes. We illustrate the usefulness of our inference framework in two different applications to empirical data sets: the selection dynamics on within-host HIV populations throughout infection and the seasonality of global influenza circulation. In both cases, our epoch model captures key features of temporal heterogeneity that remained difficult to test using ad hoc procedures. [Bayesian inference; BEAGLE; BEAST; Epoch Model; phylogeography; Phylogenetics.]
Affiliation(s)
- Filip Bielejec
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
- Philippe Lemey
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
- Guy Baele
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium
- Andrew Rambaut
  - Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
  - Fogarty International Center, National Institutes of Health, Bethesda, MD, USA
- Marc A Suchard
  - Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA, 90095, USA
  - Department of Biostatistics, UCLA Fielding School of Public Health, University of California, Los Angeles, CA, 90095, USA
13
Sarker S, Patterson EI, Peters A, Baker GB, Forwood JK, Ghorashi SA, Holdsworth M, Baker R, Murray N, Raidal SR. Mutability dynamics of an emergent single stranded DNA virus in a naïve host. PLoS One 2014; 9:e85370. [PMID: 24416396 PMCID: PMC3885698 DOI: 10.1371/journal.pone.0085370] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 09/20/2013] [Accepted: 11/26/2013] [Indexed: 01/21/2023]
Abstract
Quasispecies variants and recombination were studied longitudinally in an emergent outbreak of beak and feather disease virus (BFDV) infection in the orange-bellied parrot (Neophema chrysogaster). Detailed health monitoring and the small population size (<300 individuals) of this critically endangered bird provided an opportunity to longitudinally track viral replication and mutation events occurring in a circular, single-stranded DNA virus over a period of four years within a novel bottleneck population. Optimized PCR was used with different combinations of primers, primer walking, direct amplicon sequencing and sequencing of cloned amplicons to analyze BFDV genome variants. Analysis of complete viral genomes (n = 16) and Rep gene sequences (n = 35) revealed that the outbreak was associated with mutations in functionally important regions of the normally conserved Rep gene and immunogenic capsid (Cap) gene with a high evolutionary rate (3.41 × 10⁻³ subs/site/year) approaching that for RNA viruses; simultaneously we observed significant evidence of recombination hotspots between two distinct progenitor genotypes within orange-bellied parrots indicating early cross-transmission of BFDV in the population. Multiple quasispecies variants were also demonstrated with at least 13 genotypic variants identified in four different individual birds, with one containing up to seven genetic variants. Preferential PCR amplification of variants was also detected. Our findings suggest that the high degree of genetic variation within the BFDV species as a whole is reflected in evolutionary dynamics within individually infected birds as quasispecies variation, particularly when BFDV jumps from one host species to another.
Affiliation(s)
- Subir Sarker
  - School of Animal and Veterinary Sciences, Charles Sturt University, Wagga Wagga, New South Wales, Australia
  - Graham Centre for Agricultural Innovation (NSW Department of Primary Industries and Charles Sturt University), Wagga Wagga, New South Wales, Australia
- Edward I. Patterson
  - School of Animal and Veterinary Sciences, Charles Sturt University, Wagga Wagga, New South Wales, Australia
  - Graham Centre for Agricultural Innovation (NSW Department of Primary Industries and Charles Sturt University), Wagga Wagga, New South Wales, Australia
- Andrew Peters
  - School of Animal and Veterinary Sciences, Charles Sturt University, Wagga Wagga, New South Wales, Australia
  - Graham Centre for Agricultural Innovation (NSW Department of Primary Industries and Charles Sturt University), Wagga Wagga, New South Wales, Australia
- G. Barry Baker
  - Institute of Marine and Antarctic Studies, University of Tasmania, Hobart, Tasmania, Australia
- Jade K. Forwood
  - School of Biomedical Sciences, Charles Sturt University, Wagga Wagga, New South Wales, Australia
  - Graham Centre for Agricultural Innovation (NSW Department of Primary Industries and Charles Sturt University), Wagga Wagga, New South Wales, Australia
- Seyed A. Ghorashi
  - School of Animal and Veterinary Sciences, Charles Sturt University, Wagga Wagga, New South Wales, Australia
  - Graham Centre for Agricultural Innovation (NSW Department of Primary Industries and Charles Sturt University), Wagga Wagga, New South Wales, Australia
- Mark Holdsworth
  - Biodiversity Conservation Branch, Department of Primary Industries, Parks, Water and Environment, Hobart, Tasmania, Australia
- Rupert Baker
  - Healesville Sanctuary, Zoos Victoria, Healesville, Victoria, Australia
- Neil Murray
  - Department of Genetics, La Trobe University, Bundoora, Victoria, Australia
- Shane R. Raidal
  - School of Animal and Veterinary Sciences, Charles Sturt University, Wagga Wagga, New South Wales, Australia
  - Graham Centre for Agricultural Innovation (NSW Department of Primary Industries and Charles Sturt University), Wagga Wagga, New South Wales, Australia
|
14
|
Catching speciation in the act: Metschnikowia bowlesiae sp. nov., a yeast species found in nitidulid beetles of Hawaii and Belize. Antonie van Leeuwenhoek 2013; 105:541-50. [DOI: 10.1007/s10482-013-0106-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 11/06/2013] [Accepted: 12/20/2013] [Indexed: 11/27/2022]
|
15
|
Highly divergent type 2 and 3 vaccine-derived polioviruses isolated from sewage in Tallinn, Estonia. J Virol 2013; 87:13076-80. [PMID: 24049178 DOI: 10.1128/jvi.01174-13] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 02/04/2023]
Abstract
Highly divergent vaccine-derived polioviruses (VDPVs) have been isolated from sewage in Tallinn, Estonia, since 2002. Sequence analysis of VDPVs of serotypes 2 and 3 showed that they shared common noncapsid region recombination sites, indicating origination from a single trivalent oral polio vaccine dose, estimated to have been given between 1986 and 1998. The sewage isolates closely resemble VDPVs chronically excreted by persons with common variable immunodeficiency, but no chronic excretors have yet been identified in Estonia.
|
16
|
Plastid trnF pseudogenes are present in Jaltomata, the sister genus of Solanum (Solanaceae): molecular evolution of tandemly repeated structural mutations. Gene 2013; 530:143-50. [PMID: 23962687 DOI: 10.1016/j.gene.2013.08.013] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/11/2013] [Revised: 08/05/2013] [Accepted: 08/06/2013] [Indexed: 11/24/2022]
Abstract
Extensive gene duplication arranged in a tandem array is rare in the plastome of embryophytes. Interestingly, we found pseudogene copies of the trnF gene in the genus Jaltomata, the sister genus of Solanum where such gene duplication has been previously reported. In each Jaltomata sequence available we found two pseudogene copies in close 5'-proximity to the original functional gene. The size of each pseudogene copy ranged between 17 and 48 bp and the anticodon domain was identified as the most conserved element. A common ATT(G)n motif is particularly interesting and its modifications were found to border the 3' end of the duplicated regions. Other motifs were partial residues, or entire parts of the T- and D-domains, and both domains proved to be variable in length among the pseudogenes identified. The residues of the 3' and 5' acceptor stem were not found among the copies. We further compared the newly discovered copies of Jaltomata with those previously described from Solanum and inferred phylogenetic relationships of the aligned copies. The evolution of Solanum copies, in contrast to Jaltomata, is hard to explain as resulting only from parsimonious changes, since reticulate evolutionary patterns were detected among the copies. The dynamic evolutionary patterns of Solanum might be explained by possible inter- or intrachromosomal recombination.
|
17
|
Irvahn J, Chattopadhyay S, Sokurenko EV, Minin VN. rbrothers: R Package for Bayesian Multiple Change-Point Recombination Detection. Evol Bioinform Online 2013; 9:235-8. [PMID: 23818749 PMCID: PMC3694826 DOI: 10.4137/ebo.s11945] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022]
Abstract
Phylogenetic recombination detection is a fundamental task in bioinformatics and evolutionary biology. Most of the computational tools developed to attack this important problem are not integrated into the growing suite of R packages for statistical analysis of molecular sequences. Here, we present an R package, rbrothers, that makes a Bayesian multiple change-point model, one of the most sophisticated model-based phylogenetic recombination tools, available to R users. Moreover, we equip the Bayesian change-point model with a set of pre- and post- processing routines that will broaden the application domain of this recombination detection framework. Specifically, we implement an algorithm that forms the set of input trees required by multiple change-point models. We also provide functionality for checking Markov chain Monte Carlo convergence and creating estimation result summaries and graphics. Using rbrothers, we perform a comparative analysis of two Salmonella enterica genes, fimA and fimH, that encode major and adhesive subunits of the type 1 fimbriae, respectively. We believe that rbrothers, available at R-Forge: http://evolmod.r-forge.r-project.org/, will allow researchers to incorporate recombination detection into phylogenetic workflows already implemented in R.
Affiliation(s)
- Jan Irvahn
  - Department of Statistics, University of Washington, Seattle, WA, 98195, USA
|
18
|
Quinlivan M, Cook F, Kenna R, Callinan JJ, Cullinane A. Genetic characterization by composite sequence analysis of a new pathogenic field strain of equine infectious anemia virus from the 2006 outbreak in Ireland. J Gen Virol 2013; 94:612-622. [DOI: 10.1099/vir.0.047191-0] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 12/16/2022]
Abstract
Equine infectious anemia virus (EIAV), the causative agent of equine infectious anaemia (EIA), possesses the least-complex genomic organization of any known extant lentivirus. Despite this relative genetic simplicity, all of the complete genomic sequences published to date are derived from just two viruses, namely the North American EIAV(WYOMING) (EIAV(WY)) and Chinese EIAV(LIAONING) (EIAV(LIA)) strains. In 2006, an outbreak of EIA occurred in Ireland, apparently as a result of the importation of contaminated horse plasma from Italy and subsequent iatrogenic transmission to foals. This EIA outbreak was characterized by cases of severe, sometimes fatal, disease. To begin to understand the molecular mechanisms underlying this pathogenic phenotype, complete proviral genomic sequences in the form of 12 overlapping PCR-generated fragments were obtained from four of the EIAV-infected animals, including two of the index cases. Sequence analysis of multiple molecular clones produced from each fragment demonstrated the extent of diversity within individual viral genes and permitted construction of consensus whole-genome sequences for each of the four viral isolates. In addition, complete env gene sequences were obtained from 11 animals with differing clinical profiles, despite exposure to a common EIAV source. Although the overall genomic organization of the Irish EIAV isolates was typical of that seen in all other strains, the European viruses possessed ≤80% nucleotide sequence identity with either EIAV(WY) or EIAV(LIA). Furthermore, phylogenetic analysis suggested that the Irish EIAV isolates developed independently of the North American and Chinese viruses and that they constitute a separate monophyletic group.
Affiliation(s)
- Michelle Quinlivan
  - Virology Unit, Irish Equine Centre, Johnstown, Naas, Co. Kildare, Ireland
- Frank Cook
  - Gluck Equine Research Centre, Department of Veterinary Science, University of Kentucky, Lexington, KY 40545, USA
- Rachel Kenna
  - Virology Unit, Irish Equine Centre, Johnstown, Naas, Co. Kildare, Ireland
- John J. Callinan
  - Veterinary Science Centre, University College Dublin, Belfield, Dublin 4, Ireland
- Ann Cullinane
  - Virology Unit, Irish Equine Centre, Johnstown, Naas, Co. Kildare, Ireland
|
19
|
Kemal KS, Ramirez CM, Burger H, Foley B, Mayers D, Klimkait T, Hamy F, Anastos K, Petrovic K, Minin VN, Suchard MA, Weiser B. Recombination between variants from genital tract and plasma: evolution of multidrug-resistant HIV type 1. AIDS Res Hum Retroviruses 2012; 28:1766-74. [PMID: 22364185 DOI: 10.1089/aid.2011.0383] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/23/2022]
Abstract
Multidrug-resistant (MDR) HIV-1 presents a challenge to the efficacy of antiretroviral therapy (ART). To examine mechanisms leading to MDR variants in infected individuals, we studied recombination between single viral genomes from the genital tract and plasma of a woman initiating ART. We determined HIV-1 RNA sequences and drug resistance profiles of 159 unique viral variants obtained before ART and semiannually for 4 years thereafter. Soon after initiating zidovudine, lamivudine, and nevirapine, resistant variants and intrapatient HIV-1 recombinants were detected in both compartments; the recombinants had inherited genetic material from both genital and plasma-derived viruses. Twenty-three unique recombinants were documented during 4 years of therapy, comprising ~22% of variants. Most recombinant genomes displayed similar breakpoints and clustered phylogenetically, suggesting evolution from common ancestors. Longitudinal analysis demonstrated that MDR recombinants were common and persistent, demonstrating that recombination, in addition to point mutation, can contribute to the evolution of MDR HIV-1 in viremic individuals.
Affiliation(s)
- Kimdar S. Kemal
  - Wadsworth Center, New York State Department of Health, Albany, New York
- Harold Burger
  - Wadsworth Center, New York State Department of Health, Albany, New York
  - Department of Medicine, Albany Medical College, Albany, New York
- Brian Foley
  - Los Alamos National Laboratory, Los Alamos, New Mexico
- Thomas Klimkait
  - Institute of Medical Microbiology, Basel, Switzerland
  - InPheno AG, Basel, Switzerland
- François Hamy
  - Institute of Medical Microbiology, Basel, Switzerland
  - InPheno AG, Basel, Switzerland
- Vladimir N. Minin
  - Department of Statistics, University of Washington, Seattle, Washington
- Marc A. Suchard
  - Department of Biostatistics, University of California, Los Angeles, California
  - Department of Biomathematics, University of California, Los Angeles, California
  - Department of Human Genetics, University of California, Los Angeles, California
- Barbara Weiser
  - Wadsworth Center, New York State Department of Health, Albany, New York
  - Department of Medicine, Albany Medical College, Albany, New York
|
20
|
The diversity of the pathogenic Oomycete (Aphanomyces astaci) chitinase genes within the genotypes indicate adaptation to its hosts. Fungal Genet Biol 2012; 49:635-42. [DOI: 10.1016/j.fgb.2012.05.014] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 03/15/2012] [Revised: 05/25/2012] [Accepted: 05/27/2012] [Indexed: 11/18/2022]
|
21
|
Guindon S. From trajectories to averages: an improved description of the heterogeneity of substitution rates along lineages. Syst Biol 2012; 62:22-34. [PMID: 22798331 DOI: 10.1093/sysbio/sys063] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Indexed: 11/14/2022]
Abstract
The accuracy and precision of species divergence date estimation from molecular data strongly depend on the models describing the variation of substitution rates along a phylogeny. These models generally assume that rates randomly fluctuate along branches from one node to the next. However, for mathematical convenience, the stochasticity of such a process is ignored when translating these rate trajectories into branch lengths. This study addresses this shortcoming. A new approach is described that explicitly considers the average substitution rates along branches as random quantities, resulting in a more realistic description of the variations of evolutionary rates along lineages. The proposed method provides more precise estimates of the rate autocorrelation parameter as well as divergence times. Also, simulation results indicate that ignoring the stochastic variation of rates along edges can lead to significant overestimation of specific node ages. Altogether, the new approach introduced in this study is a step forward to designing biologically relevant models of rate evolution that are well suited to data sets with dense taxon sampling which are likely to present rate autocorrelation. The computer programme PhyTime, part of the PhyML package and implementing the new approach, is available from http://code.google.com/p/phyml (last accessed 1 August 2012).
Affiliation(s)
- Stéphane Guindon
  - Department of Statistics, University of Auckland, Auckland, 1010, New Zealand.
|
22
|
Phylogenetic evidence based on Trypanosoma cruzi nuclear gene sequences and information entropy suggest that inter-strain intragenic recombination is a basic mechanism underlying the allele diversity of hybrid strains. Infect Genet Evol 2012; 12:1064-71. [DOI: 10.1016/j.meegid.2012.03.010] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 10/11/2011] [Revised: 03/12/2012] [Accepted: 03/13/2012] [Indexed: 11/27/2022]
|
23
|
Marttinen P, Hanage WP, Croucher NJ, Connor TR, Harris SR, Bentley SD, Corander J. Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Res 2011; 40:e6. [PMID: 22064866 PMCID: PMC3245952 DOI: 10.1093/nar/gkr928] [Citation(s) in RCA: 171] [Impact Index Per Article: 13.2] [Indexed: 11/14/2022]
Abstract
Analysis of important human pathogen populations is currently under transition toward whole-genome sequencing of growing numbers of samples collected on a global scale. Since recombination in bacteria is often an important factor shaping their evolution by enabling resistance elements and virulence traits to rapidly transfer from one evolutionary lineage to another, it is highly beneficial to have access to tools that can detect recombination events. Multiple advanced statistical methods exist for such purposes; however, they are typically limited either to only a few samples or to data from relatively short regions of a total genome. By harnessing the power of recent advances in Bayesian modeling techniques, we introduce here a method for detecting homologous recombination events from whole-genome sequence data for bacterial population samples on a large scale. Our statistical approach can efficiently handle hundreds of whole genome sequenced population samples and identify separate origins of the recombinant sequence, offering an enhanced insight into the diversification of bacterial clones at the level of the whole genome. A data set of 241 whole genome sequences from an important pandemic lineage of Streptococcus pneumoniae is used together with multiple simulated data sets to demonstrate the potential of our approach.
Affiliation(s)
- Pekka Marttinen
  - Department of Biomedical Engineering and Computational Science, Aalto University, PO Box 12200, FI-00076 AALTO, Finland.
|
24
|
Huelsenbeck JP, Alfaro ME, Suchard MA. Biologically inspired phylogenetic models strongly outperform the no common mechanism model. Syst Biol 2011; 60:225-32. [PMID: 21252385 PMCID: PMC3038349 DOI: 10.1093/sysbio/syq089] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 04/15/2009] [Revised: 06/29/2009] [Accepted: 09/22/2010] [Indexed: 11/13/2022]
Abstract
But Tuffley and Steel (1997) introduced a model called No Common Mechanism (NCM), in which characters may, but are not required to, vary their relative rates independently, both within and between branches. Because the independent variation is taken only as a possibility, not as a requirement, NCM would apply to almost any situation, and so may be accepted as realistic. This is useful because Tuffley and Steel also showed that maximum likelihood under NCM selects the same trees as does parsimony. With the realistic NCM in the background, then, most parsimonious trees have greatest power to explain available observations. (Farris, 2008)
Affiliation(s)
- John P Huelsenbeck
  - Department of Integrative Biology, University of California, Berkeley, CA 94720-3140, USA.
|
25
|
Ané C. Detecting phylogenetic breakpoints and discordance from genome-wide alignments for species tree reconstruction. Genome Biol Evol 2011; 3:246-58. [PMID: 21362638 PMCID: PMC3070431 DOI: 10.1093/gbe/evr013] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
With the easy acquisition of sequence data, it is now possible to obtain and align whole genomes across multiple related species or populations. In this work, I assess the performance of a statistical method to reconstruct the whole distribution of phylogenetic trees along the genome, estimate the proportion of the genome for which a given clade is true, and infer a concordance tree that summarizes the dominant vertical inheritance pattern. There are two main issues when dealing with whole-genome alignments, as opposed to multiple genes: the size of the data and the detection of recombination breakpoints. These breakpoints partition the genomic alignment into phylogenetically homogeneous loci, where sites within a given locus all share the same phylogenetic tree topology. To delimitate these loci, I describe here a method based on the minimum description length (MDL) principle, implemented with dynamic programming for computational efficiency. Simulations show that combining MDL partitioning with Bayesian concordance analysis provides an efficient and robust way to estimate both the vertical inheritance signal and the horizontal phylogenetic signal. The method performed well both in the presence of incomplete lineage sorting and in the presence of horizontal gene transfer. A high level of systematic bias was found here, highlighting the need for good individual tree building methods, which form the basis for more elaborate gene tree/species tree reconciliation methods.
Affiliation(s)
- Cécile Ané
  - Departments of Statistics and Botany, University of Wisconsin-Madison, USA.
|
26
|
Song G, Hsu CH, Riemer C, Miller W. Evaluation of methods for detecting conversion events in gene clusters. BMC Bioinformatics 2011; 12 Suppl 1:S45. [PMID: 21342577 PMCID: PMC3044302 DOI: 10.1186/1471-2105-12-s1-s45] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
Abstract
BACKGROUND: Gene clusters are genetically important, but their analysis poses significant computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of the cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies due to a lack of accurate evaluation methods. RESULTS: We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used this simulation data to evaluate several different programs for detecting gene conversion events. CONCLUSIONS: Our evaluation identifies strengths and weaknesses of several methods for detecting gene conversion, which can contribute to more accurate analysis of gene cluster evolution.
Affiliation(s)
- Giltae Song
  - Center for Comparative Genomics and Bioinformatics, 506 Wartik Lab, Pennsylvania State University, University Park, PA 16802, USA
- Chih-Hao Hsu
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health (NIH), Bethesda, MD, USA
- Cathy Riemer
  - Center for Comparative Genomics and Bioinformatics, 506 Wartik Lab, Pennsylvania State University, University Park, PA 16802, USA
- Webb Miller
  - Center for Comparative Genomics and Bioinformatics, 506 Wartik Lab, Pennsylvania State University, University Park, PA 16802, USA
|
27
|
Lèbre S, Becq J, Devaux F, Stumpf MPH, Lelandais G. Statistical inference of the time-varying structure of gene-regulation networks. BMC Syst Biol 2010; 4:130. [PMID: 20860793 PMCID: PMC2955603 DOI: 10.1186/1752-0509-4-130] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Received: 03/22/2010] [Accepted: 09/22/2010] [Indexed: 01/08/2023]
Abstract
Background: Biological networks are highly dynamic in response to environmental and physiological cues. This variability contrasts with conventional analyses of biological networks, which have overwhelmingly employed static graph models that stay constant over time to describe biological systems and their underlying molecular interactions. Methods: To overcome these limitations, we propose here a new statistical modelling framework, the ARTIVA formalism (Auto Regressive TIme VArying models), and an associated inferential procedure that allows us to learn temporally varying gene-regulation networks from biological time-course expression data. ARTIVA simultaneously infers the topology of a regulatory network and how it changes over time. It allows us to recover the chronology of regulatory associations for individual genes involved in a specific biological process (development, stress response, etc.). Results: We demonstrate that the ARTIVA approach generates detailed insights into the function and dynamics of complex biological systems and efficiently exploits time-course data in systems biology. In particular, two biological scenarios are analyzed: the developmental stages of Drosophila melanogaster and the response of Saccharomyces cerevisiae to benomyl poisoning. Conclusions: ARTIVA recovers essential temporal dependencies in biological systems from transcriptional data and provides a natural starting point to learn and investigate their dynamics in greater detail.
Affiliation(s)
- Sophie Lèbre
  - Center for Bioinformatics, Imperial College London, London, UK
|
28
|
Modelling nonstationary gene regulatory processes. Adv Bioinformatics 2010. [PMID: 20721277 PMCID: PMC2913537 DOI: 10.1155/2010/749848] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/18/2009] [Accepted: 04/29/2010] [Indexed: 02/05/2023]
Abstract
An important objective in systems biology is to infer gene regulatory networks from postgenomic data, and dynamic Bayesian networks have been widely applied as a popular tool to this end. The standard approach for nondiscretised data is restricted to a linear model and a homogeneous Markov chain. Recently, various generalisations based on changepoint processes and free allocation mixture models have been proposed. The former aim to relax the homogeneity assumption, whereas the latter are more flexible and, in principle, more adequate for modelling nonlinear processes. In our paper, we compare both paradigms and discuss theoretical shortcomings of the latter approach. We show that a model based on the changepoint process yields systematically better results than the free allocation model when inferring nonstationary gene regulatory processes from simulated gene expression time series. We further cross-compare the performance of both models on three biological systems: macrophages challenged with viral infection, circadian regulation in Arabidopsis thaliana, and morphogenesis in Drosophila melanogaster.
|
29
|
Rebernig CA, Schneeweiss GM, Bardy KE, Schönswetter P, Villaseñor JL, Obermayer R, Stuessy TF, Weiss-Schneeweiss H. Multiple Pleistocene refugia and Holocene range expansion of an abundant southwestern American desert plant species (Melampodium leucanthum, Asteraceae). Mol Ecol 2010; 19:3421-43. [DOI: 10.1111/j.1365-294x.2010.04754.x] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Indexed: 01/28/2023]
|
30
|
Matsen FA. constNJ: An Algorithm to Reconstruct Sets of Phylogenetic Trees Satisfying Pairwise Topological Constraints. J Comput Biol 2010; 17:799-818. [DOI: 10.1089/cmb.2009.0201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Affiliation(s)
- Frederick A. Matsen
  - Program in Computational Biology, Fred Hutchinson Cancer Research Center 1100, Seattle, Washington, USA
|
31
|
Rebernig CA, Weiss-Schneeweiss H, Schneeweiss GM, Schönswetter P, Obermayer R, Villaseñor JL, Stuessy TF. Quaternary range dynamics and polyploid evolution in an arid brushland plant species (Melampodium cinereum, Asteraceae). Mol Phylogenet Evol 2010; 54:594-606. [DOI: 10.1016/j.ympev.2009.10.010] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 05/26/2009] [Revised: 10/02/2009] [Accepted: 10/06/2009] [Indexed: 12/19/2022]
|
32
|
Bloomquist EW, Suchard MA. Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Syst Biol 2009; 59:27-41. [PMID: 20525618 DOI: 10.1093/sysbio/syp076] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Indexed: 11/13/2022]
Abstract
Evolutionary biologists have introduced numerous statistical approaches to explore nonvertical evolution, such as horizontal gene transfer, recombination, and genomic reassortment, through collections of Markov-dependent gene trees. These tree collections allow for inference of nonvertical evolution, but only indirectly, making findings difficult to interpret and models difficult to generalize. An alternative approach to explore nonvertical evolution relies on phylogenetic networks. These networks provide a framework to model nonvertical evolution but leave unanswered questions such as the statistical significance of specific nonvertical events. In this paper, we begin to correct the shortcomings of both approaches by introducing the "stochastic model for reassortment and transfer events" (SMARTIE) drawing upon ancestral recombination graphs (ARGs). ARGs are directed graphs that allow for formal probabilistic inference on vertical speciation events and nonvertical evolutionary events. We apply SMARTIE to phylogenetic data. Because of this, we can typically infer a single most probable ARG, avoiding coarse population dynamic summary statistics. In addition, a focus on phylogenetic data suggests novel probability distributions on ARGs. To make inference with our model, we develop a reversible jump Markov chain Monte Carlo sampler to approximate the posterior distribution of SMARTIE. Using the BEAST phylogenetic software as a foundation, the sampler employs a parallel computing approach that allows for inference on large-scale data sets. To demonstrate SMARTIE, we explore 2 separate phylogenetic applications, one involving pathogenic Leptospirochete and the other Saccharomyces.
Affiliation(s)
- Erik W Bloomquist
  - Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095, USA
|
33
|
Chan CX, Beiko RG, Darling AE, Ragan MA. Lateral transfer of genes and gene fragments in prokaryotes. Genome Biol Evol 2009; 1:429-38. [PMID: 20333212 PMCID: PMC2817436 DOI: 10.1093/gbe/evp044] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Accepted: 10/31/2009] [Indexed: 01/24/2023]
Abstract
Lateral genetic transfer (LGT) involves the movement of genetic material from one lineage into another and its subsequent incorporation into the new host genome via genetic recombination. Studies in individual taxa have indicated lateral origins for stretches of DNA of greatly varying length, from a few nucleotides to chromosome size. Here we analyze 1,462 sets of single-copy, putatively orthologous genes from 144 fully sequenced prokaryote genomes, asking to what extent complete genes and fragments of genes have been transferred and recombined in LGT. Using a rigorous phylogenetic approach, we find evidence for LGT in at least 476 (32.6%) of these 1,462 gene sets: 286 (19.6%) clearly show one or more "observable recombination breakpoints" within the boundaries of the open reading frame, while a further 190 (13.0%) yield trees that are topologically incongruent with the reference tree but do not contain a recombination breakpoint within the open reading frame. We refer to these gene sets as observable recombination breakpoint positive (ORB(+)) and negative (ORB(-)) respectively. The latter are prima facie instances of lateral transfer of an entire gene or beyond. We observe little functional bias between ORB(+) and ORB(-) gene sets, but find that incorporation of entire genes is potentially more frequent in pathogens than in nonpathogens. As ORB(+) gene sets are about 50% more common than ORB(-) sets in our data, the transfer of gene fragments has been relatively frequent, and the frequency of LGT may have been systematically underestimated in phylogenetic studies.
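The counts and percentages quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustrative check and is not part of the cited study; the variable names and the `percent` helper are hypothetical, and only the three counts (1,462, 286, 190) come from the abstract.

```python
# Counts quoted in the abstract; everything else here is an illustrative helper.
total_gene_sets = 1462
orb_positive = 286   # breakpoint observed inside the open reading frame
orb_negative = 190   # incongruent tree, no breakpoint inside the open reading frame

# Gene sets with any evidence of LGT are the union of the two classes.
lgt_total = orb_positive + orb_negative


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


summary = {
    "lgt_total": lgt_total,
    "lgt_percent": percent(lgt_total, total_gene_sets),
    "orb_positive_percent": percent(orb_positive, total_gene_sets),
    "orb_negative_percent": percent(orb_negative, total_gene_sets),
    # Ratio behind the "about 50% more common" statement.
    "orb_ratio": round(orb_positive / orb_negative, 2),
}

for key, value in summary.items():
    print(f"{key}: {value}")
```

The ratio of ORB(+) to ORB(-) gene sets is what underlies the statement that ORB(+) sets are about 50% more common.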
Affiliation(s)
- Cheong Xin Chan
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane, Queensland, Australia
|
34
|
Distribution of distances between topologies and its effect on detection of phylogenetic recombination. Ann Inst Stat Math 2009. [DOI: 10.1007/s10463-009-0259-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
35
|
Tai YC, Kvale MN, Witte JS. Segmentation and estimation for SNP microarrays: a Bayesian multiple change-point approach. Biometrics 2009; 66:675-83. [PMID: 19764955 DOI: 10.1111/j.1541-0420.2009.01328.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
High-density single-nucleotide polymorphism (SNP) microarrays provide a useful tool for the detection of copy number variants (CNVs). The analysis of such large amounts of data is complicated, especially with regard to determining where copy numbers change and their corresponding values. In this article, we propose a Bayesian multiple change-point model (BMCP) for segmentation and estimation of SNP microarray data. Segmentation concerns separating a chromosome into regions of equal copy number differences between the sample of interest and some reference, and involves the detection of locations of copy number difference changes. Estimation concerns determining true copy number for each segment. Our approach not only gives posterior estimates for the parameters of interest, namely locations for copy number difference changes and true copy number estimates, but also useful confidence measures. In addition, our algorithm can segment multiple samples simultaneously, and infer both common and rare CNVs across individuals. Finally, for studies of CNVs in tumors, we incorporate an adjustment factor for signal attenuation due to tumor heterogeneity or normal contamination that can improve copy number estimates.
Affiliation(s)
- Yu Chuan Tai
- Institute for Human Genetics, Department of Epidemiology and Biostatistics, University of California, San Francisco, California 94143-0794, USA.
|
36
|
Tang J, Hanage WP, Fraser C, Corander J. Identifying currents in the gene pool for bacterial populations using an integrative approach. PLoS Comput Biol 2009; 5:e1000455. [PMID: 19662158 PMCID: PMC2713424 DOI: 10.1371/journal.pcbi.1000455] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.0] [Received: 08/26/2008] [Accepted: 07/01/2009] [Indexed: 11/18/2022] Open
Abstract
The evolution of bacterial populations has recently become considerably better understood due to large-scale sequencing of population samples. It has become clear that DNA sequences from a multitude of genes, as well as a broad sample coverage of a target population, are needed to obtain a relatively unbiased view of its genetic structure and the patterns of ancestry connected to the strains. However, the traditional statistical methods for evolutionary inference, such as phylogenetic analysis, are associated with several difficulties under such an extensive sampling scenario, in particular when a considerable amount of recombination is anticipated to have taken place. To meet the needs of large-scale analyses of population structure for bacteria, we introduce here several statistical tools for the detection and representation of recombination between populations. Also, we introduce a model-based description of the shape of a population in sequence space, in terms of its molecular variability and affinity towards other populations. Extensive real data from the genus Neisseria are utilized to demonstrate the potential of an approach where these population genetic tools are combined with a phylogenetic analysis. The statistical tools introduced here are freely available in BAPS 5.2 software, which can be downloaded from http://web.abo.fi/fak/mnf/mate/jc/software/baps.html.
The study of bacterial population biology is complicated by the fact that, although bacteria are largely asexual, they can also exchange genetic material through homologous recombination. Unlike in eukaryotes, recombination in bacteria is not an obligatory process. Furthermore, the recombination mechanisms are subject to many biological and ecological factors that can vary even within different populations of the same species. Although increasing evidence for homologous recombination has been found in many bacterial species, determining the frequency of recombination and understanding the influence that it exerts upon the evolution of bacterial populations remain challenging. In this article, we provide a dynamic picture of recombination within and between closely related bacterial species. Through an integration of several Bayesian statistical models, our method highlights the importance of a quantitative estimation of recombination. Our analyses of a challenging multi-locus sequence typing (MLST) database demonstrate that combined analyses using traditional phylogenetic methods, explorative MLST tools, and Bayesian population genetic models can together yield interesting biological insights that cannot easily be reached by any of the approaches alone.
Affiliation(s)
- Jing Tang
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.
|
37
|
Lehrach WP, Husmeier D. Segmenting bacterial and viral DNA sequence alignments with a trans-dimensional phylogenetic factorial hidden Markov model. J R Stat Soc Ser C Appl Stat 2009. [DOI: 10.1111/j.1467-9876.2008.00648.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/19/2023]
|
38
|
Webb A, Hancock JM, Holmes CC. Phylogenetic inference under recombination using Bayesian stochastic topology selection. Bioinformatics 2008; 25:197-203. [PMID: 19028720 PMCID: PMC2639012 DOI: 10.1093/bioinformatics/btn607] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption ignores recombination, a fundamental process for generating diversity, and can therefore lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenetic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths. RESULTS: We demonstrate our approach on simulated data and also apply it to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.
Affiliation(s)
- Alex Webb
- Department of Statistics, Oxford, UK
|
39
|
McBride AJA, Cerqueira GM, Suchard MA, Moreira AN, Zuerner RL, Reis MG, Haake DA, Ko AI, Dellagostin OA. Genetic diversity of the Leptospiral immunoglobulin-like (Lig) genes in pathogenic Leptospira spp. Infect Genet Evol 2008; 9:196-205. [PMID: 19028604 DOI: 10.1016/j.meegid.2008.10.012] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.8] [Received: 08/27/2008] [Revised: 10/23/2008] [Accepted: 10/28/2008] [Indexed: 11/17/2022]
Abstract
Recent serologic, immunoprotection, and pathogenesis studies identified the Lig proteins as key virulence determinants in interactions of leptospiral pathogens with the mammalian host. We examined the sequence variation and recombination patterns of ligA, ligB, and ligC among 10 pathogenic strains from five Leptospira species. All strains had intact ligB genes, with genetic drift accounting for most of the ligB genetic diversity observed. The ligA gene was found exclusively in L. interrogans and L. kirschneri strains, and was created from ligB by a two-step partial gene duplication process. The amino-terminal domain of LigB and the LigA paralog were essentially identical (98.5 ± 0.8% mean identity) in strains with both genes. Like ligB, ligC gene variation also followed phylogenetic patterns, suggesting an early gene duplication event. However, ligC is a pseudogene in several strains, suggesting that LigC is not essential for virulence. Two ligB genes and one ligC gene had mosaic compositions, and evidence for recombination events between related Leptospira species was also found for some ligA genes. In conclusion, the results presented here indicate that Lig diversity has important ramifications for the selection of Lig polypeptides for use in diagnosis and as vaccine candidates. This sequence information will aid the identification of highly conserved regions within the Lig proteins and improve upon the performance characteristics of the Lig proteins in diagnostic assays and in subunit vaccine formulations with the potential to confer heterologous protection.
Affiliation(s)
- Alan J A McBride
- Centro de Pesquisa Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, BA, Brazil
|
40
|
Marttinen P, Baldwin A, Hanage WP, Dowson C, Mahenthiralingam E, Corander J. Bayesian modeling of recombination events in bacterial populations. BMC Bioinformatics 2008; 9:421. [PMID: 18840286 PMCID: PMC2579306 DOI: 10.1186/1471-2105-9-421] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Received: 01/30/2008] [Accepted: 10/07/2008] [Indexed: 11/10/2022] Open
Abstract
Background: We consider the discovery of recombinant segments jointly with their origins within multilocus DNA sequences from bacteria representing heterogeneous populations of fairly closely related species. The currently available methods for recombination detection capable of probabilistic characterization of uncertainty have limited applicability in practice as the number of strains in a data set increases. Results: We introduce a Bayesian spatial structural model representing the continuum of origins over sites within the observed sequences, including a probabilistic characterization of uncertainty related to the origin of any particular site. To enable a statistically accurate and practically feasible approach to the analysis of large-scale data sets representing a single genus, we have developed a novel software tool (BRAT, Bayesian Recombination Tracker) implementing the model and the corresponding learning algorithm, which is capable of identifying the posterior optimal structure and estimating the marginal posterior probabilities of putative origins over the sites. Conclusion: A multitude of challenging simulation scenarios and an analysis of real data from seven housekeeping genes of 120 strains of genus Burkholderia are used to illustrate the possibilities offered by our approach. The software is freely available for download at URL .
Affiliation(s)
- Pekka Marttinen
- Department of Mathematics and statistics, University of Helsinki, FIN-00014, Finland.
|
41
|
Huelsenbeck JP, Ané C, Larget B, Ronquist F. A Bayesian perspective on a non-parsimonious parsimony model. Syst Biol 2008; 57:406-19. [PMID: 18570035 DOI: 10.1080/10635150802166046] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 10/21/2022] Open
Abstract
Several stochastic models of character change, when implemented in a maximum likelihood framework, are known to give a correspondence between the maximum parsimony method and the method of maximum likelihood. One such model has an independently estimated branch-length parameter for each site and each branch of the phylogenetic tree. This model--the no-common-mechanism model--has many parameters, and, in fact, the number of parameters increases as fast as the alignment is extended. We take a Bayesian approach to the no-common-mechanism model and place independent gamma prior probability distributions on the branch-length parameters. We are able to analytically integrate over the branch lengths, and this allowed us to implement an efficient Markov chain Monte Carlo method for exploring the space of phylogenetic trees. We were able to reliably estimate the posterior probabilities of clades for phylogenetic trees of up to 500 sequences. However, the Bayesian approach to the problem, at least as implemented here with an independent prior on the length of each branch, does not tame the behavior of the branch-length parameters. The integrated likelihood appears to be a simple rescaling of the parsimony score for a tree, and the marginal posterior probability distribution of the length of a branch is dependent upon how the maximum parsimony method reconstructs the characters at the interior nodes of the tree. The method we describe, however, is of potential importance in the analysis of morphological character data and also for improving the behavior of Markov chain Monte Carlo methods implemented for models in which sites share a common branch-length parameter.
Affiliation(s)
- John P Huelsenbeck
- Department of Integrative Biology, University of California, Berkeley, CA 94720-3140, USA.
|
42
|
Martins LDO, Leal E, Kishino H. Phylogenetic detection of recombination with a Bayesian prior on the distance between trees. PLoS One 2008; 3:e2651. [PMID: 18612422 PMCID: PMC2440540 DOI: 10.1371/journal.pone.0002651] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Received: 04/11/2008] [Accepted: 06/07/2008] [Indexed: 11/18/2022] Open
Abstract
Genomic regions participating in recombination events may support distinct topologies, and phylogenetic analyses should incorporate this heterogeneity. Existing phylogenetic methods for recombination detection are challenged by the enormous number of possible topologies, even for a moderate number of taxa. If, however, the detection analysis is conducted independently between each putative recombinant sequence and a set of reference parentals, potential recombinations between the recombinants are neglected. In this context, a recombination hotspot can be inferred in phylogenetic analyses if we observe several consecutive breakpoints. We developed a distance measure between unrooted topologies that closely resembles the number of recombinations. By introducing a prior distribution on these recombination distances, a Bayesian hierarchical model was devised to detect phylogenetic inconsistencies occurring due to recombinations. This model relaxes the assumption of known parental sequences, still common in HIV analysis, allowing the entire dataset to be analyzed at once. On simulated datasets with up to 16 taxa, our method correctly detected recombination breakpoints and the number of recombination events for each breakpoint. The procedure is robust to rate and transition∶transversion heterogeneities for simulations with and without recombination. This recombination distance is related to recombination hotspots. Applying this procedure to a genomic HIV-1 dataset, we found evidence for hotspots and de novo recombination.
|
43
|
Bloomquist EW, Dorman KS, Suchard MA. StepBrothers: inferring partially shared ancestries among recombinant viral sequences. Biostatistics 2008; 10:106-20. [PMID: 18562348 DOI: 10.1093/biostatistics/kxn019] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 01/28/2023] Open
Abstract
Phylogeneticists have developed several statistical methods to infer recombination among molecular sequences that are evolutionarily related. Of these methods, Markov change-point models currently provide the most coherent framework. Yet, the Markov assumption is faulty in that the inferred relatedness of homologous sequences across regions divided by recombinant events is not independent, particularly for nonrecombinant sequences as they share the same history. To correct this limitation, we introduce a novel random tips (RT) model. The model springs from the idea that a recombinant sequence inherits its characters from an unknown number of ancestral full-length sequences, of which one only observes the incomplete portions. The RT model decomposes recombinant sequences into their ancestral portions and then augments each portion onto the data set as unique partially observed sequences. This data augmentation generates a random number of sequences related to each other through a single inferable tree with the same random number of tips. While intuitively pleasing, this single tree corrects the independence assumptions plaguing previous methods while permitting the detection of recombination. The single tree also allows for inference of the relative times of recombination events and generalizes to incorporate multiple recombinant sequences. This generalization answers important questions with which previous models struggle. For example, we demonstrate that a group of human immunodeficiency type 1 recombinant viruses from Argentina, previously thought to have the same recombinant history, actually consist of 2 groups: one, a clonal expansion of a reference sequence and another that predates the formation of the reference sequence. In another example, we demonstrate that 2 hepatitis B virus recombinant strains share similar splicing locations, suggesting a common descent of the 2 viruses. 
We implement and run both examples in a software package called StepBrothers, freely available to interested parties.
Affiliation(s)
- Erik W Bloomquist
- Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095, USA
|
44
|
Minin VN, Suchard MA. Counting labeled transitions in continuous-time Markov models of evolution. J Math Biol 2007; 56:391-412. [PMID: 17874105 DOI: 10.1007/s00285-007-0120-8] [Citation(s) in RCA: 205] [Impact Index Per Article: 12.1] [Received: 02/05/2007] [Indexed: 10/22/2022]
Abstract
Counting processes that keep track of labeled changes to discrete evolutionary traits play critical roles in evolutionary hypothesis testing. If we assume that trait evolution can be described by a continuous-time Markov chain, then it suffices to study the process that counts labeled transitions of the chain. For a binary trait, we demonstrate that it is possible to obtain closed-form analytic solutions for the probability mass and probability generating functions of this evolutionary counting process. In the general, multi-state case we show how to compute moments of the counting process using an eigen decomposition of the infinitesimal generator, provided the latter is a diagonalizable matrix. We conclude with two examples that demonstrate the utility of our results.
Affiliation(s)
- Vladimir N Minin
- Department of Biomathematics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA.
|
45
|
Didelot X, Achtman M, Parkhill J, Thomson NR, Falush D. A bimodal pattern of relatedness between the Salmonella Paratyphi A and Typhi genomes: convergence or divergence by homologous recombination? Genome Res 2006; 17:61-8. [PMID: 17090663 PMCID: PMC1716267 DOI: 10.1101/gr.5512906] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.2] [Indexed: 12/30/2022]
Abstract
All Salmonella can cause disease but severe systemic infections are primarily caused by a few lineages. Paratyphi A and Typhi are the deadliest human restricted serovars, responsible for approximately 600,000 deaths per annum. We developed a Bayesian changepoint model that uses variation in the degree of nucleotide divergence along two genomes to detect homologous recombination between these strains, and with other lineages of Salmonella enterica. Paratyphi A and Typhi showed an atypical and surprising pattern. For three quarters of their genomes, they appear to be distantly related members of the species S. enterica, both in their gene content and nucleotide divergence. However, the remaining quarter is much more similar in both aspects, with average nucleotide divergence of 0.18% instead of 1.2%. We describe two different scenarios that could have led to this pattern, convergence and divergence, and conclude that the former is more likely based on a variety of criteria. The convergence scenario implies that, although Paratyphi A and Typhi were not especially close relatives within S. enterica, they have gone through a burst of recombination involving more than 100 recombination events. Several of the recombination events transferred novel genes in addition to homologous sequences, resulting in similar gene content in the two lineages. We propose that recombination between Typhi and Paratyphi A has allowed the exchange of gene variants that are important for their adaptation to their common ecological niche, the human host.
Affiliation(s)
- Xavier Didelot
- Department of Statistics, University of Oxford, Oxford OX1 3SY, United Kingdom
- Mark Achtman
- Department of Molecular Biology, Max Planck Institute for Infection Biology, Berlin, Germany 10117
- Julian Parkhill
- The Wellcome Trust Sanger Institute, Cambridge CB10 1SA, United Kingdom
- Daniel Falush
- Department of Statistics, University of Oxford, Oxford OX1 3SY, United Kingdom
- Corresponding author. E-mail ; fax +44-1865-272595
|
46
|
Chan CX, Beiko RG, Ragan MA. Detecting recombination in evolving nucleotide sequences. BMC Bioinformatics 2006; 7:412. [PMID: 16978423 PMCID: PMC1592127 DOI: 10.1186/1471-2105-7-412] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Received: 05/03/2006] [Accepted: 09/18/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Genetic recombination can produce heterogeneous phylogenetic histories within a set of homologous genes. These recombination events can be obscured by subsequent residue substitutions, which consequently complicate their detection. While there are many algorithms for the identification of recombination events, little is known about the effects of subsequent substitutions on the accuracy of available recombination-detection approaches. RESULTS: We assessed the effect of subsequent substitutions on the detection of simulated recombination events within sets of four nucleotide sequences under a homogeneous evolutionary model. The amount of subsequent substitutions per site, prior evolutionary history of the sequences, and reciprocality or non-reciprocality of the recombination event all affected the accuracy of the recombination-detecting programs examined. Bayesian phylogenetic-based approaches showed high accuracy in detecting evidence of recombination events and in identifying recombination breakpoints. These approaches were less sensitive to parameter settings than other methods we tested, making them easier to apply to various data sets in a consistent manner. CONCLUSION: Post-recombination substitutions tend to diminish the predictive accuracy of recombination-detecting programs. The best method for detecting recombined regions is not necessarily the most accurate in identifying recombination breakpoints. For difficult detection problems involving highly divergent sequences or large data sets, different types of approach can be run in succession to increase efficiency, and can potentially yield better predictive accuracy than any single method used in isolation.
Affiliation(s)
- Cheong Xin Chan
- ARC Centre in Bioinformatics and Institute for Molecular Bioscience, the University of Queensland, Brisbane, QLD 4072, Australia
- Robert G Beiko
- ARC Centre in Bioinformatics and Institute for Molecular Bioscience, the University of Queensland, Brisbane, QLD 4072, Australia
- Mark A Ragan
- ARC Centre in Bioinformatics and Institute for Molecular Bioscience, the University of Queensland, Brisbane, QLD 4072, Australia
|
47
|
Minin VN, Dorman KS, Fang F, Suchard MA. Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics 2005; 21:3034-42. [PMID: 15914546 DOI: 10.1093/bioinformatics/bti459] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.4] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: We introduce a dual multiple change-point (MCP) model for recombination detection among aligned nucleotide sequences. The dual MCP model is an extension of the model introduced previously by Suchard and co-workers. In the original single MCP model, one change-point process is used to model spatial phylogenetic variation. Here, we show that using two change-point processes, one for spatial variation of tree topologies and the other for spatial variation of substitution process parameters, increases recombination detection accuracy. Statistical analysis is done in a Bayesian framework using reversible jump Markov chain Monte Carlo sampling to approximate the joint posterior distribution of all model parameters. RESULTS: We use primate mitochondrial DNA data with simulated recombination break-points at specific locations to compare the two models. We also analyze two real HIV sequences to identify recombination break-points using the dual MCP model.
Affiliation(s)
- Vladimir N Minin
- Department of Biomathematics, David Geffen School of Medicine, University of California Los Angeles, CA 90095-1766, USA
|
48
|
Kitchen CMR, Philpott S, Burger H, Weiser B, Anastos K, Suchard MA. Evolution of human immunodeficiency virus type 1 coreceptor usage during antiretroviral therapy: a Bayesian approach. J Virol 2004; 78:11296-302. [PMID: 15452249 PMCID: PMC521818 DOI: 10.1128/jvi.78.20.11296-11302.2004] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/20/2022] Open
Abstract
There is substantial evidence for ongoing replication and evolution of human immunodeficiency virus type 1 (HIV-1), even in individuals receiving highly active antiretroviral therapy. Viral evolution in the presence of antiviral therapy needs to be considered when developing new therapeutic strategies. Phylogenetic analyses of HIV-1 sequences can be used for this purpose but may give rise to misleading results if rates of intrapatient evolution differ significantly. To improve analyses of HIV-1 evolution relevant to studies of pathogenesis and treatment, we developed a Bayesian hierarchical model that incorporates all available sequence data while simultaneously allowing the phylogenetic parameters of each patient to vary. We used this method to examine evolutionary changes in HIV-1 coreceptor usage in response to treatment. We examined patients whose viral populations exhibited a shift in coreceptor utilization in response to therapy. CXCR4 (X4) strains emerged in each patient but were suppressed following initiation of new antiretroviral regimens, so that CCR5-utilizing (R5) strains predominated. By phylogenetically reconstructing the evolutionary relationship of HIV-1 obtained longitudinally from each patient, it was possible to examine the origin of the reemergent R5 virus. Using our Bayesian hierarchical approach, we found that the reemergent R5 virus detectable after therapy was more closely related to the predecessor R5 virus than to the X4 strains. The Bayesian hierarchical approach, unlike more traditional methods, makes it possible to evaluate competing hypotheses across patients. This model is not limited to analyses of HIV-1 but can be used to elucidate evolutionary processes for other organisms as well.
Affiliation(s)
- Christina M R Kitchen
- Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095-1772.
|
49
|
Haake DA, Suchard MA, Kelley MM, Dundoo M, Alt DP, Zuerner RL. Molecular evolution and mosaicism of leptospiral outer membrane proteins involves horizontal DNA transfer. J Bacteriol 2004; 186:2818-28. [PMID: 15090524 PMCID: PMC387810 DOI: 10.1128/jb.186.9.2818-2828.2004] [Citation(s) in RCA: 98] [Impact Index Per Article: 4.9] [Indexed: 11/20/2022] Open
Abstract
Leptospires belong to a genus of parasitic bacterial spirochetes that have adapted to a broad range of mammalian hosts. Mechanisms of leptospiral molecular evolution were explored by sequence analysis of four genes shared by 38 strains belonging to the core group of pathogenic Leptospira species: L. interrogans, L. kirschneri, L. noguchii, L. borgpetersenii, L. santarosai, and L. weilii. The 16S rRNA and lipL32 genes were highly conserved, and the lipL41 and ompL1 genes were significantly more variable. Synonymous substitutions are distributed throughout the ompL1 gene, whereas nonsynonymous substitutions are clustered in four variable regions encoding surface loops. While phylogenetic trees for the 16S, lipL32, and lipL41 genes were relatively stable, 8 of 38 (20%) ompL1 sequences had mosaic compositions consistent with horizontal transfer of DNA between related bacterial species. A novel Bayesian multiple change point model was used to identify the most likely sites of recombination and to determine the phylogenetic relatedness of the segments of the mosaic ompL1 genes. Segments of the mosaic ompL1 genes encoding two of the surface-exposed loops were likely acquired by horizontal transfer from a peregrine allele of unknown ancestry. Identification of the most likely sites of recombination with the Bayesian multiple change point model, an approach which has not previously been applied to prokaryotic gene sequence analysis, serves as a model for future studies of recombination in molecular evolution of genes.
Affiliation(s)
- David A Haake
- Division of Infectious Diseases, Veterans Affairs Greater Los Angeles Healthcare System, Los Angeles, CA 90073, USA.
|