1
|
Schmidt H, Raphael BJ. The tree labeling polytope: a unified approach to ancestral reconstruction problems. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.14.638328. [PMID: 40027631 PMCID: PMC11870558 DOI: 10.1101/2025.02.14.638328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Motivation Reconstructing unobserved ancestral states of a phylogenetic tree provides insight into the history of evolving systems and is one of the fundamental problems in phylogenetics. For a fixed phylogenetic tree, the most parsimonious ancestral reconstruction - a solution to the small parsimony problem - can be efficiently found using the dynamic programming algorithms of Fitch-Hartigan and Sankoff. Ancestral reconstruction is important in many applications including inferring the routes of metastases in cancer, deriving the transmission history of viruses, determining the direction of cellular differentiation in organismal development, and detecting recombination and horizontal gene transfer in phylogenetic networks. However, most of these applications impose additional global constraints on the reconstructed ancestral states, which break the local structure required in the recurrences of Fitch-Hartigan and Sankoff. Results We introduce an alternative, polyhedral approach to ancestral reconstruction problems using the tree labeling polytope , a geometric object whose vertices represent the feasible ancestral labelings of a tree. This framework yields a polynomial-time linear programming algorithm for the small parsimony problem . More importantly, the tree labeling polytope facilitates the incorporation of additional constraints that arise in modern ancestral reconstruction problems. We demonstrate the utility of our approach by deriving mixed-integer programming algorithms with a small number of integer variables and strong linear relaxations for three such problems: the parsimonious migration history problem, the softwired small parsimony problem on phylogenetic networks, and the convex recoloring problem on trees. Our algorithms outperform existing state-of-the-art methods on both simulated and real datasets. For instance, our algorithm scales to trace routes of cancer metastases in trees with thousands of leaves, enabling the analysis of large trees generated by recent single-cell sequencing technologies. On a mouse model of metastatic lung adenocarcinoma, the tree labeling polytope allows us to infer simpler migration histories compared to previous results. Availability Python implementations of the algorithms provided in this work are available at: github.com/raphael-group/tree-labeling-polytope .
Collapse
|
2
|
Zhang Y, Wang J, Yu J. PSA: an effective method for predicting horizontal gene transfers through parsimonious phylogenetic networks. Cladistics 2024; 40:443-455. [PMID: 38717786 DOI: 10.1111/cla.12578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2023] [Revised: 03/08/2024] [Accepted: 03/20/2024] [Indexed: 07/15/2024] Open
Abstract
Horizontal gene transfer (HGT) from one organism to another, according to some researchers, can be abundant in the evolution of species. A phylogenetic network is a network structure that describes the HGTs among species. Several studies have proposed methods to construct phylogenetic networks to predict HGTs based on parsimony values. Existing definitions of parsimony values for a phylogenetic network are based on the assumption that each gene site or segment evolves independently along different trees in the network. However, in the current study, we define a novel parsimony value, denoted the p definition, for phylogenetic networks, considering that a gene as a whole typically evolves along a tree. Using Simulated Annealing, a new method called the Phylogeny with Simulated Annealing (PSA) algorithm is proposed to search for an optimal network based on the p definition. The PSA method is tested on the simulated data. The results reveal that the parsimonious networks constructed using PSA can better represent the evolutionary relationships of species involving HGTs. Additionally, the HGTs predicted using PSA are more accurate than those predicted using other methods. The PSA algorithm is publicly accessible at http://github.com/imustu/sap.
Collapse
Affiliation(s)
- Yuan Zhang
- School of Computer Science, Inner Mongolia University, Hohhot, 010021, China
| | - Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, 010021, China
| | - Jing Yu
- College of Education, Inner Mongolia Normal University, Hohhot, 010022, China
| |
Collapse
|
3
|
Gross E, van Iersel L, Janssen R, Jones M, Long C, Murakami Y. Distinguishing level-1 phylogenetic networks on the basis of data generated by Markov processes. J Math Biol 2021; 83:32. [PMID: 34482446 PMCID: PMC8418599 DOI: 10.1007/s00285-021-01653-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2020] [Revised: 07/07/2021] [Accepted: 08/16/2021] [Indexed: 11/24/2022]
Abstract
Phylogenetic networks can represent evolutionary events that cannot be described by phylogenetic trees. These networks are able to incorporate reticulate evolutionary events such as hybridization, introgression, and lateral gene transfer. Recently, network-based Markov models of DNA sequence evolution have been introduced along with model-based methods for reconstructing phylogenetic networks. For these methods to be consistent, the network parameter needs to be identifiable from data generated under the model. Here, we show that the semi-directed network parameter of a triangle-free, level-1 network model with any fixed number of reticulation vertices is generically identifiable under the Jukes–Cantor, Kimura 2-parameter, or Kimura 3-parameter constraints.
Collapse
Affiliation(s)
- Elizabeth Gross
- Department of Mathematics, University of Hawai'i at Mānoa, 2565 McCarthy Mall, Honolulu, HI, 96822, USA
| | - Leo van Iersel
- Delft Institute of Applied Mathematics, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
| | - Remie Janssen
- Delft Institute of Applied Mathematics, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
| | - Mark Jones
- Delft Institute of Applied Mathematics, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
| | - Colby Long
- The College of Wooster, 1189 Beall Avenue, Wooster, OH, 44691, USA
| | - Yukihiro Murakami
- Delft Institute of Applied Mathematics, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands.
| |
Collapse
|
4
|
Advances in Computational Methods for Phylogenetic Networks in the Presence of Hybridization. BIOINFORMATICS AND PHYLOGENETICS 2019. [DOI: 10.1007/978-3-030-10837-3_13] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
5
|
Jamil HM. Optimizing Phylogenetic Queries for Performance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1692-1705. [PMID: 28858810 DOI: 10.1109/tcbb.2017.2743706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
The vast majority of phylogenetic databases do not support declarative querying using which their contents can be flexibly and conveniently accessed and the template based query interfaces they support do not allow arbitrary speculative queries. They therefore also do not support query optimization leveraging unique phylogeny properties. While a small number of graph query languages such as XQuery, Cypher, and GraphQL exist for computer savvy users, most are too general and complex to be useful for biologists, and too inefficient for large phylogeny querying. In this paper, we discuss a recently introduced visual query language, called PhyQL, that leverages phylogeny specific properties to support essential and powerful constructs for a large class of phylogentic queries. We develop a range of pruning aids, and propose a substantial set of query optimization strategies using these aids suitable for large phylogeny querying. A hybrid optimization technique that exploits a set of indices and "graphlet" partitioning is discussed. A "fail soonest" strategy is used to avoid hopeless processing and is shown to produce dividends. Possible novel optimization techniques yet to be explored are also discussed.
Collapse
|
6
|
Finding a most parsimonious or likely tree in a network with respect to an alignment. J Math Biol 2018; 78:527-547. [PMID: 30121824 PMCID: PMC6437133 DOI: 10.1007/s00285-018-1282-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2017] [Revised: 05/16/2018] [Indexed: 11/22/2022]
Abstract
Phylogenetic networks are often constructed by merging multiple conflicting phylogenetic signals into a directed acyclic graph. It is interesting to explore whether a network constructed in this way induces biologically-relevant phylogenetic signals that were not present in the input. Here we show that, given a multiple alignment A for a set of taxa X and a rooted phylogenetic network N whose leaves are labelled by X, it is NP-hard to locate a most parsimonious phylogenetic tree displayed by N (with respect to A) even when the level of N—the maximum number of reticulation nodes within a biconnected component—is 1 and A contains only 2 distinct states. (If, additionally, gaps are allowed the problem becomes APX-hard.) We also show that under the same conditions, and assuming a simple binary symmetric model of character evolution, finding a most likely tree displayed by the network is NP-hard. These negative results contrast with earlier work on parsimony in which it is shown that if A consists of a single column the problem is fixed parameter tractable in the level. We conclude with a discussion of why, despite the NP-hardness, both the parsimony and likelihood problem can likely be well-solved in practice.
Collapse
|
7
|
Van Iersel L, Jones M, Scornavacca C. Improved Maximum Parsimony Models for Phylogenetic Networks. Syst Biol 2018; 67:518-542. [PMID: 29272537 DOI: 10.1093/sysbio/syx094] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2017] [Accepted: 12/11/2017] [Indexed: 11/13/2022] Open
Abstract
Phylogenetic networks are well suited to represent evolutionary histories comprising reticulate evolution. Several methods aiming at reconstructing explicit phylogenetic networks have been developed in the last two decades. In this article, we propose a new definition of maximum parsimony for phylogenetic networks that permits to model biological scenarios that cannot be modeled by the definitions currently present in the literature (namely, the "hardwired" and "softwired" parsimony). Building on this new definition, we provide several algorithmic results that lay the foundations for new parsimony-based methods for phylogenetic network reconstruction.
Collapse
Affiliation(s)
- Leo Van Iersel
- Delft Institute of Applied Mathematics, Delft University of Technology, P.O. Box 5, 2600 AA Delft, the Netherlands
| | - Mark Jones
- Delft Institute of Applied Mathematics, Delft University of Technology, P.O. Box 5, 2600 AA Delft, the Netherlands
| | - Celine Scornavacca
- Institut des Sciences de l'Évolution Université de Montpellier, CNRS, IRD, EPHE CC 064, Place Eugène Bataillon 34095 Montpellier Cedex 05, France.,Institut de Biologie Computationnelle (IBC), Montpellier, France
| |
Collapse
|
8
|
Bryant C, Fischer M, Linz S, Semple C. On the quirks of maximum parsimony and likelihood on phylogenetic networks. J Theor Biol 2017; 417:100-108. [PMID: 28087420 DOI: 10.1016/j.jtbi.2017.01.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2016] [Revised: 10/28/2016] [Accepted: 01/06/2017] [Indexed: 11/16/2022]
Abstract
Maximum parsimony is one of the most frequently-discussed tree reconstruction methods in phylogenetic estimation. However, in recent years it has become more and more apparent that phylogenetic trees are often not sufficient to describe evolution accurately. For instance, processes like hybridization or lateral gene transfer that are commonplace in many groups of organisms and result in mosaic patterns of relationships cannot be represented by a single phylogenetic tree. This is why phylogenetic networks, which can display such events, are becoming of more and more interest in phylogenetic research. It is therefore necessary to extend concepts like maximum parsimony from phylogenetic trees to networks. Several suggestions for possible extensions can be found in recent literature, for instance the softwired and the hardwired parsimony concepts. In this paper, we analyze the so-called big parsimony problem under these two concepts, i.e. we investigate maximum parsimonious networks and analyze their properties. In particular, we show that finding a softwired maximum parsimony network is possible in polynomial time. We also show that the set of maximum parsimony networks for the hardwired definition always contains at least one phylogenetic tree. Lastly, we investigate some parallels of parsimony to different likelihood concepts on phylogenetic networks.
Collapse
Affiliation(s)
| | - Mareike Fischer
- Department for Mathematics and Computer Science, Ernst Moritz Arndt University, Greifswald, Germany.
| | - Simone Linz
- Department of Computer Science, University of Auckland, New Zealand.
| | - Charles Semple
- School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand.
| |
Collapse
|
9
|
Abstract
BACKGROUND Phylogenetic networks model reticulate evolutionary histories. The last two decades have seen an increased interest in establishing mathematical results and developing computational methods for inferring and analyzing these networks. A salient concept underlying a great majority of these developments has been the notion that a network displays a set of trees and those trees can be used to infer, analyze, and study the network. RESULTS In this paper, we show that in the presence of coalescence effects, the set of displayed trees is not sufficient to capture the network. We formally define the set of parental trees of a network and make three contributions based on this definition. First, we extend the notion of anomaly zone to phylogenetic networks and report on anomaly results for different networks. Second, we demonstrate how coalescence events could negatively affect the ability to infer a species tree that could be augmented into the correct network. Third, we demonstrate how a phylogenetic network can be viewed as a mixture model that lends itself to a novel inference approach via gene tree clustering. CONCLUSIONS Our results demonstrate the limitations of focusing on the set of trees displayed by a network when analyzing and inferring the network. Our findings can form the basis for achieving higher accuracy when inferring phylogenetic networks and open up new venues for research in this area, including new problem formulations based on the notion of a network's parental trees.
Collapse
Affiliation(s)
- Jiafan Zhu
- Department of Computer Science, Rice University, Houston, 77005 Texas USA
| | - Yun Yu
- Department of Computer Science, Rice University, Houston, 77005 Texas USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University, Houston, 77005 Texas USA
- Department of BioSciences, Rice University, Houston, 77005 Texas USA
| |
Collapse
|
10
|
Adato O, Ninyo N, Gophna U, Snir S. Detecting Horizontal Gene Transfer between Closely Related Taxa. PLoS Comput Biol 2015; 11:e1004408. [PMID: 26439115 PMCID: PMC4595140 DOI: 10.1371/journal.pcbi.1004408] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2014] [Accepted: 06/20/2015] [Indexed: 01/12/2023] Open
Abstract
Horizontal gene transfer (HGT), the transfer of genetic material between organisms, is crucial for genetic innovation and the evolution of genome architecture. Existing HGT detection algorithms rely on a strong phylogenetic signal distinguishing the transferred sequence from ancestral (vertically derived) genes in its recipient genome. Detecting HGT between closely related species or strains is challenging, as the phylogenetic signal is usually weak and the nucleotide composition is normally nearly identical. Nevertheless, there is a great importance in detecting HGT between congeneric species or strains, especially in clinical microbiology, where understanding the emergence of new virulent and drug-resistant strains is crucial, and often time-sensitive. We developed a novel, self-contained technique named Near HGT, based on the synteny index, to measure the divergence of a gene from its native genomic environment and used it to identify candidate HGT events between closely related strains. The method confirms candidate transferred genes based on the constant relative mutability (CRM). Using CRM, the algorithm assigns a confidence score based on “unusual” sequence divergence. A gene exhibiting exceptional deviations according to both synteny and mutability criteria, is considered a validated HGT product. We first employed the technique to a set of three E. coli strains and detected several highly probable horizontally acquired genes. We then compared the method to existing HGT detection tools using a larger strain data set. When combined with additional approaches our new algorithm provides richer picture and brings us closer to the goal of detecting all newly acquired genes in a particular strain. The transfer of genetic material between organisms, usually denoted as horizontal (or lateral) gene transfer (HGT or LGT), is a prime mechanism in microbial evolution and responsible for genetic innovation and the evolution of genome architecture. Detecting HGT between closely related species or strains is imperative as drug-resistant pathogenic strains most often acquire their virulence from closely related bacteria. The proposed method combines two evolutionary signals that were not employed in the past for this task. One is the synteny index (SI), measuring the loss of synteny in an organism, and the other is a novel concept—constant relative mutability (CRM), maintaining that genes preserve their relative evolution rate along linages (although the latter ones may each change). We show both in simulation and real biological data that the method is sound and, in the cases examined, provides stronger sensitivity than existing methods. We therefore believe this novel approach represents a significant advance, for the first time enabling the detection of previously ignored HGT events that will bring us closer to the goal of detecting all newly acquired genes in a particular strain. Availability: The method is publicly available at http://research.haifa.ac.il/~ssagi/software/nearHGT.zip
Collapse
Affiliation(s)
- Orit Adato
- Department of Evolutionary Biology, University of Haifa, Haifa, Israel
| | - Noga Ninyo
- Department of Evolutionary Biology, University of Haifa, Haifa, Israel
| | - Uri Gophna
- Department of Molecular Microbiology and Biotechnology Tel Aviv University, Tel-Aviv, Israel
| | - Sagi Snir
- Department of Evolutionary Biology, University of Haifa, Haifa, Israel
- * E-mail:
| |
Collapse
|
11
|
The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete. ScientificWorldJournal 2014; 2014:254279. [PMID: 24982934 PMCID: PMC3996867 DOI: 10.1155/2014/254279] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2013] [Accepted: 01/21/2014] [Indexed: 11/17/2022] Open
Abstract
Several polynomial time computable metrics on the class of semibinary tree-sibling time consistent phylogenetic networks are available in the literature; in particular, the problem of deciding if two networks of this kind are isomorphic is in P. In this paper, we show that if we remove the semibinarity condition, then the problem becomes much harder. More precisely, we prove that the isomorphism problem for generic tree-sibling time consistent phylogenetic networks is polynomially equivalent to the graph isomorphism problem. Since the latter is believed not to belong to P, the chances are that it is impossible to define a metric on the class of all tree-sibling time consistent phylogenetic networks that can be computed in polynomial time.
Collapse
|
12
|
Yang J, Grünewald S, Xu Y, Wan XF. Quartet-based methods to reconstruct phylogenetic networks. BMC SYSTEMS BIOLOGY 2014; 8:21. [PMID: 24555518 PMCID: PMC3941989 DOI: 10.1186/1752-0509-8-21] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/16/2013] [Accepted: 02/12/2014] [Indexed: 11/11/2022]
Abstract
BACKGROUND Phylogenetic networks are employed to visualize evolutionary relationships among a group of nucleotide sequences, genes or species when reticulate events like hybridization, recombination, reassortant and horizontal gene transfer are believed to be involved. In comparison to traditional distance-based methods, quartet-based methods consider more information in the reconstruction process and thus have the potential to be more accurate. RESULTS We introduce QuartetSuite, which includes a set of new quartet-based methods, namely QuartetS, QuartetA, and QuartetM, to reconstruct phylogenetic networks from nucleotide sequences. We tested their performances and compared them with other popular methods on two simulated nucleotide sequence data sets: one generated from a tree topology and the other from a complicated evolutionary history containing three reticulate events. We further validated these methods to two real data sets: a bacterial data set consisting of seven concatenated genes of 36 bacterial species and an influenza data set related to recently emerging H7N9 low pathogenic avian influenza viruses in China. CONCLUSION QuartetS, QuartetA, and QuartetM have the potential to accurately reconstruct evolutionary scenarios from simple branching trees to complicated networks containing many reticulate events. These methods could provide insights into the understanding of complicated biological evolutionary processes such as bacterial taxonomy and reassortant of influenza viruses.
Collapse
Affiliation(s)
- Jialiang Yang
- Department of Basic Sciences, College of Veterinary Medicine, Mississippi State
University, Mississippi State, MS 39762, USA
- Current Address: Department of Genetics and Genomic Sciences, Icahn School of
Medicine at Mount Sinai, New York, NY 10029, USA
| | - Stefan Grünewald
- CAS-MPG Partner Institute for Computational Biology, Key Laboratory of
Computational Biology, Shanghai Institutes for Biological Sciences, Chinese
Academy of Sciences, Shanghai 200031, China
| | - Yifei Xu
- Department of Basic Sciences, College of Veterinary Medicine, Mississippi State
University, Mississippi State, MS 39762, USA
| | - Xiu-Feng Wan
- Department of Basic Sciences, College of Veterinary Medicine, Mississippi State
University, Mississippi State, MS 39762, USA
| |
Collapse
|
13
|
Yang J, Grünewald S, Wan XF. Quartet-net: a quartet-based method to reconstruct phylogenetic networks. Mol Biol Evol 2013; 30:1206-17. [PMID: 23493256 DOI: 10.1093/molbev/mst040] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Phylogenetic networks can model reticulate evolutionary events such as hybridization, recombination, and horizontal gene transfer. However, reconstructing such networks is not trivial. Popular character-based methods are computationally inefficient, whereas distance-based methods cannot guarantee reconstruction accuracy because pairwise genetic distances only reflect partial information about a reticulate phylogeny. To balance accuracy and computational efficiency, here we introduce a quartet-based method to construct a phylogenetic network from a multiple sequence alignment. Unlike distances that only reflect the relationship between a pair of taxa, quartets contain information on the relationships among four taxa; these quartets provide adequate capacity to infer a more accurate phylogenetic network. In applications to simulated and biological data sets, we demonstrate that this novel method is robust and effective in reconstructing reticulate evolutionary events and it has the potential to infer more accurate phylogenetic distances than other conventional phylogenetic network construction methods such as Neighbor-Joining, Neighbor-Net, and Split Decomposition. This method can be used in constructing phylogenetic networks from simple evolutionary events involving a few reticulate events to complex evolutionary histories involving a large number of reticulate events. A software called "Quartet-Net" is implemented and available at http://sysbio.cvm.msstate.edu/QuartetNet/.
Collapse
Affiliation(s)
- Jialiang Yang
- Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, USA
| | | | | |
Collapse
|
14
|
Asano T, Jansson J, Sadakane K, Uehara R, Valiente G. Faster computation of the Robinson–Foulds distance between phylogenetic networks. Inf Sci (N Y) 2012. [DOI: 10.1016/j.ins.2012.01.038] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
15
|
Kannan L, Wheeler WC. Maximum Parsimony on Phylogenetic networks. Algorithms Mol Biol 2012; 7:9. [PMID: 22551229 PMCID: PMC3377548 DOI: 10.1186/1748-7188-7-9] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2011] [Accepted: 05/02/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Phylogenetic networks are generalizations of phylogenetic trees, that are used to model evolutionary events in various contexts. Several different methods and criteria have been introduced for reconstructing phylogenetic trees. Maximum Parsimony is a character-based approach that infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data assigned on the leaves. Exact solutions for optimizing parsimony scores on phylogenetic trees have been introduced in the past. RESULTS In this paper, we define the parsimony score on networks as the sum of the substitution costs along all the edges of the network; and show that certain well-known algorithms that calculate the optimum parsimony score on trees, such as Sankoff and Fitch algorithms extend naturally for networks, barring conflicting assignments at the reticulate vertices. We provide heuristics for finding the optimum parsimony scores on networks. Our algorithms can be applied for any cost matrix that may contain unequal substitution costs of transforming between different characters along different edges of the network. We analyzed this for experimental data on 10 leaves or fewer with at most 2 reticulations and found that for almost all networks, the bounds returned by the heuristics matched with the exhaustively determined optimum parsimony scores. CONCLUSION The parsimony score we define here does not directly reflect the cost of the best tree in the network that displays the evolution of the character. However, when searching for the most parsimonious network that describes a collection of characters, it becomes necessary to add additional cost considerations to prefer simpler structures, such as trees over networks. The parsimony score on a network that we describe here takes into account the substitution costs along the additional edges incident on each reticulate vertex, in addition to the substitution costs along the other edges which are common to all the branching patterns introduced by the reticulate vertices. Thus the score contains an in-built cost for the number of reticulate vertices in the network, and would provide a criterion that is comparable among all networks. Although the problem of finding the parsimony score on the network is believed to be computationally hard to solve, heuristics such as the ones described here would be beneficial in our efforts to find a most parsimonious network.
Collapse
|
16
|
Boc A, Makarenkov V. Towards an accurate identification of mosaic genes and partial horizontal gene transfers. Nucleic Acids Res 2011; 39:e144. [PMID: 21917854 PMCID: PMC3241670 DOI: 10.1093/nar/gkr735] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Many bacteria and viruses adapt to varying environmental conditions through the acquisition of mosaic genes. A mosaic gene is composed of alternating sequence polymorphisms either belonging to the host original allele or derived from the integrated donor DNA. Often, the integrated sequence contains a selectable genetic marker (e.g. marker allowing for antibiotic resistance). An effective identification of mosaic genes and detection of corresponding partial horizontal gene transfers (HGTs) are among the most important challenges posed by evolutionary biology. We developed a method for detecting partial HGT events and related intragenic recombination giving rise to the formation of mosaic genes. A bootstrap procedure incorporated in our method is used to assess the support of each predicted partial gene transfer. The proposed method can be also applied to confirm or discard complete (i.e. traditional) horizontal gene transfers detected by any HGT inferring method. While working on a full-genome scale, the new method can be used to assess the level of mosaicism in the considered genomes as well as the rates of complete and partial HGT underlying their evolution.
Collapse
Affiliation(s)
- Alix Boc
- Département d'Informatique, Université du Québec à Montréal, CP 8888, Succursale Centre Ville, Montreal, QC, Canada H3C 3P8
| | | |
Collapse
|
17
|
Abstract
BACKGROUND Genome sequencing has revolutionized our view of the relationships among genomes, particularly in revealing the confounding effects of lateral genetic transfer (LGT). Phylogenomic techniques have been used to construct purported trees of microbial life. Although such trees are easily interpreted and allow the use of a subset of genomes as "proxies" for the full set, LGT and other phenomena impact the positioning of different groups in genome trees, confounding and potentially invalidating attempts to construct a phylogeny-based taxonomy of microorganisms. Network and graph approaches can reveal complex sets of relationships, but applying these techniques to large data sets is a significant challenge. Notwithstanding the question of what exactly it might represent, generating and interpreting a Tree or Network of All Genomes will only be feasible if current algorithms can be improved upon. RESULTS Complex relationships among even the most-similar genomes demonstrate that proxy-based approaches to simplifying large sets of genomes are not alone sufficient to solve the analysis problem. A phylogenomic analysis of 1173 sequenced bacterial and archaeal genomes generated phylogenetic trees for 159,905 distinct homologous gene sets. The relationships inferred from this set can be heavily dependent on the inclusion of other taxa: for example, phyla such as Spirochaetes, Proteobacteria and Firmicutes are recovered as cohesive groups or split depending on the presence of other specific lineages. Furthermore, named groups such as Acidithiobacillus, Coprothermobacter and Brachyspira show a multitude of affiliations that are more consistent with their ecology than with small subunit ribosomal DNA-based taxonomy. Network and graph representations can illustrate the multitude of conflicting affinities, but all methods impose constraints on the input data and create challenges of construction and interpretation. CONCLUSIONS These complex relationships highlight the need for an inclusive approach to genomic data, and current methods with minor alterations will likely scale to allow the analysis of data sets with 10,000 or more genomes. The main challenges lie in the visualization and interpretation of genomic relationships, and the redefinition of microbial taxonomy when subsets of genomic data are so evidently in conflict with one another, and with the "canonical" molecular taxonomy.
Collapse
Affiliation(s)
- Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 1W5 Canada.
| |
Collapse
|
18
|
Park HJ, Jin G, Nakhleh L. Bootstrap-based support of HGT inferred by maximum parsimony. BMC Evol Biol 2010; 10:131. [PMID: 20444286 PMCID: PMC2874802 DOI: 10.1186/1471-2148-10-131] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2009] [Accepted: 05/05/2010] [Indexed: 11/10/2022] Open
Abstract
Background Maximum parsimony is one of the most commonly used criteria for reconstructing phylogenetic trees. Recently, Nakhleh and co-workers extended this criterion to enable reconstruction of phylogenetic networks, and demonstrated its application to detecting reticulate evolutionary relationships. However, one of the major problems with this extension has been that it favors more complex evolutionary relationships over simpler ones, thus having the potential for overestimating the amount of reticulation in the data. An ad hoc solution to this problem that has been used entails inspecting the improvement in the parsimony length as more reticulation events are added to the model, and stopping when the improvement is below a certain threshold. Results In this paper, we address this problem in a more systematic way, by proposing a nonparametric bootstrap-based measure of support of inferred reticulation events, and using it to determine the number of those events, as well as their placements. A number of samples is generated from the given sequence alignment, and reticulation events are inferred based on each sample. Finally, the support of each reticulation event is quantified based on the inferences made over all samples. Conclusions We have implemented our method in the NEPAL software tool (available publicly at http://bioinfo.cs.rice.edu/), and studied its performance on both biological and simulated data sets. While our studies show very promising results, they also highlight issues that are inherently challenging when applying the maximum parsimony criterion to detect reticulate evolution.
Collapse
Affiliation(s)
- Hyun Jung Park
- Department of Computer Science, Rice University, 6100 Main Street, MS 132, Houston, Texas 77005, USA
| | | | | |
Collapse
|
19
|
Boc A, Philippe H, Makarenkov V. Inferring and validating horizontal gene transfer events using bipartition dissimilarity. Syst Biol 2010; 59:195-211. [PMID: 20525630 DOI: 10.1093/sysbio/syp103] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Horizontal gene transfer (HGT) is one of the main mechanisms driving the evolution of microorganisms. Its accurate identification is one of the major challenges posed by reticulate evolution. In this article, we describe a new polynomial-time algorithm for inferring HGT events and compare 3 existing and 1 new tree comparison indices in the context of HGT identification. The proposed algorithm can rely on different optimization criteria, including least squares (LS), Robinson and Foulds (RF) distance, quartet distance (QD), and bipartition dissimilarity (BD), when searching for an optimal scenario of subtree prune and regraft (SPR) moves needed to transform the given species tree into the given gene tree. As the simulation results suggest, the algorithmic strategy based on BD, introduced in this article, generally provides better results than those based on LS, RF, and QD. The BD-based algorithm also proved to be more accurate and faster than a well-known polynomial time heuristic RIATA-HGT. Moreover, the HGT recovery results yielded by BD were generally equivalent to those provided by the exponential-time algorithm LatTrans, but a clear gain in running time was obtained using the new algorithm. Finally, a statistical framework for assessing the reliability of obtained HGTs by bootstrap analysis is also presented.
Collapse
Affiliation(s)
- Alix Boc
- Département d'informatique, Université du Québec à Montréal, C.P. 8888, Succ. Centre-ville, Montréal, Québec, Canada.
| | | | | |
Collapse
|
20
|
Kirsch JAW, Gauthier O, Campeau-Péloquin A, Eldridge MDB, Lapointe FJ. Phylogeny of the rock wallabies, Petrogale (Marsupialia: Macropodidae). Part II: Detection of hybridisation among macropodines. AUSTRALIAN MAMMALOGY 2010. [DOI: 10.1071/am09017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
Abstract
Phylogenetic relationships among rock-wallabies, Petrogale (Marsupialia: Macropodidae), have proven difficult to resolve. Given the documented interspecific hybridisation in the wild and the ease with which hybrids can be bred in captivity, introgression and hybrid speciation are likely explanations for these difficulties. In this paper, an attempt is made at using a phylogenetic approach to identify Petrogale hybrids of known origin. The Hybrid Detection Criterion (HDC) test is applied to DNA–DNA hybridisation data for 15 full species, two natural yard-bred hybrids, and two artificial hybrids from the same pairs of parental species. While the yard-bred hybrids elude detection with this technique, the artificial hybrids, consisting of equimolar mixture of parental extracts, are easily identified. Moreover, splitsgraphs constructed from five pairs of natural and artificial hybrids, including those evaluated with HDC, and their parents show that, in all cases but one, these two kinds of hybrids do not group together. Because the HDC assumes an intermediate phylogenetic position of the hybrid between its postulated parents, it is likely that unequal crossing-over, or another recombination event, affects the results of the test. These conclusions cast some doubt on the possibility of accurately detecting Petrogale hybrids with a phylogenetic approach.
Collapse
|
21
|
Linz S, Semple C, Stadler T. Analyzing and reconstructing reticulation networks under timing constraints. J Math Biol 2009; 61:715-37. [DOI: 10.1007/s00285-009-0319-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2009] [Revised: 11/19/2009] [Indexed: 10/20/2022]
|
22
|
Cardona G, Rosselló F, Valiente G. Comparison of tree-child phylogenetic networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2009; 6:552-569. [PMID: 19875855 DOI: 10.1109/tcbb.2007.70270] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Phylogenetic networks are a generalization of phylogenetic trees that allow for the representation of nontreelike evolutionary events, like recombination, hybridization, or lateral gene transfer. While much progress has been made to find practical algorithms for reconstructing a phylogenetic network from a set of sequences, all attempts to endorse a class of phylogenetic networks (strictly extending the class of phylogenetic trees) with a well-founded distance measure have, to the best of our knowledge and with the only exception of the bipartition distance on regular networks, failed so far. In this paper, we present and study a new meaningful class of phylogenetic networks, called tree-child phylogenetic networks, and we provide an injective representation of these networks as multisets of vectors of natural numbers, their path multiplicity vectors. We then use this representation to define a distance on this class that extends the well-known Robinson-Foulds distance for phylogenetic trees and to give an alignment method for pairs of networks in this class. Simple polynomial algorithms for reconstructing a tree-child phylogenetic network from its path multiplicity vectors, for computing the distance between two tree-child phylogenetic networks and for aligning a pair of tree-child phylogenetic networks, are provided. They have been implemented as a Perl package and a Java applet, which can be found at http://bioinfo.uib.es/~recerca/phylonetworks/mudistance/.
Collapse
Affiliation(s)
- Gabriel Cardona
- Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain.
| | | | | |
Collapse
|
23
|
Willson SJ. Properties of Normal Phylogenetic Networks. Bull Math Biol 2009; 72:340-58. [DOI: 10.1007/s11538-009-9449-z] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2008] [Accepted: 08/14/2009] [Indexed: 11/30/2022]
|
24
|
Abstract
Galled trees, evolutionary networks with isolated reticulation cycles, have appeared under several slightly different definitions in the literature. In this paper, we establish the actual relationships between the main four such alternative definitions: namely, the original galled trees, level-1 networks, nested networks with nesting depth 1, and evolutionary networks with arc-disjoint reticulation cycles.
Collapse
|
25
|
Jin G, Nakhleh L, Snir S, Tuller T. Parsimony score of phylogenetic networks: hardness results and a linear-time heuristic. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2009; 6:495-505. [PMID: 19644176 DOI: 10.1109/tcbb.2008.119] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Phylogenies-the evolutionary histories of groups of organisms-play a major role in representing the interrelationships among biological entities. Many methods for reconstructing and studying such phylogenies have been proposed, almost all of which assume that the underlying history of a given set of species can be represented by a binary tree. Although many biological processes can be effectively modeled and summarized in this fashion, others cannot: recombination, hybrid speciation, and horizontal gene transfer result in networks of relationships rather than trees of relationships. In previous works, we formulated a maximum parsimony (MP) criterion for reconstructing and evaluating phylogenetic networks, and demonstrated its quality on biological as well as synthetic data sets. In this paper, we provide further theoretical results as well as a very fast heuristic algorithm for the MP criterion of phylogenetic networks. In particular, we provide a novel combinatorial definition of phylogenetic networks in terms of "forbidden cycles," and provide detailed hardness and hardness of approximation proofs for the "small" MP problem. We demonstrate the performance of our heuristic in terms of time and accuracy on both biological and synthetic data sets. Finally, we explain the difference between our model and a similar one formulated by Nguyen et al., and describe the implications of this difference on the hardness and approximation results.
Collapse
Affiliation(s)
- Guohua Jin
- Department of Computer Science, Rice University, Houston, TX 77005, USA.
| | | | | | | |
Collapse
|
26
|
A human genome-wide library of local phylogeny predictions for whole-genome inference problems. BMC Genomics 2008; 9:389. [PMID: 18710563 PMCID: PMC2556685 DOI: 10.1186/1471-2164-9-389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2008] [Accepted: 08/18/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many common inference problems in computational genetics depend on inferring aspects of the evolutionary history of a data set given a set of observed modern sequences. Detailed predictions of the full phylogenies are therefore of value in improving our ability to make further inferences about population history and sources of genetic variation. Making phylogenetic predictions on the scale needed for whole-genome analysis is, however, extremely computationally demanding. RESULTS In order to facilitate phylogeny-based predictions on a genomic scale, we develop a library of maximum parsimony phylogenies within local regions spanning all autosomal human chromosomes based on Haplotype Map variation data. We demonstrate the utility of this library for population genetic inferences by examining a tree statistic we call 'imperfection,' which measures the reuse of variant sites within a phylogeny. This statistic is significantly predictive of recombination rate, shows additional regional and population-specific conservation, and allows us to identify outlier genes likely to have experienced unusual amounts of variation in recent human history. CONCLUSION Recent theoretical advances in algorithms for phylogenetic tree reconstruction have made it possible to perform large-scale inferences of local maximum parsimony phylogenies from single nucleotide polymorphism (SNP) data. As results from the imperfection statistic demonstrate, phylogeny predictions encode substantial information useful for detecting genomic features and population history. This data set should serve as a platform for many kinds of inferences one may wish to make about human population history and genetic variation.
Collapse
|
27
|
Cardona G, Llabrés M, Rosselló F, Valiente G. A distance metric for a class of tree-sibling phylogenetic networks. ACTA ACUST UNITED AC 2008; 24:1481-8. [PMID: 18477576 PMCID: PMC2718672 DOI: 10.1093/bioinformatics/btn231] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022]
Abstract
Motivation: The presence of reticulate evolutionary events in phylogenies turn phylogenetic trees into phylogenetic networks. These events imply in particular that there may exist multiple evolutionary paths from a non-extant species to an extant one, and this multiplicity makes the comparison of phylogenetic networks much more difficult than the comparison of phylogenetic trees. In fact, all attempts to define a sound distance measure on the class of all phylogenetic networks have failed so far. Thus, the only practical solutions have been either the use of rough estimates of similarity (based on comparison of the trees embedded in the networks), or narrowing the class of phylogenetic networks to a certain class where such a distance is known and can be efficiently computed. The first approach has the problem that one may identify two networks as equivalent, when they are not; the second one has the drawback that there may not exist algorithms to reconstruct such networks from biological sequences. Results: We present in this article a distance measure on the class of semi-binary tree-sibling time consistent phylogenetic networks, which generalize tree-child time consistent phylogenetic networks, and thus also galled-trees. The practical interest of this distance measure is 2-fold: it can be computed in polynomial time by means of simple algorithms, and there also exist polynomial-time algorithms for reconstructing networks of this class from DNA sequence data. Availability: The Perl package Bio::PhyloNetwork, included in the BioPerl bundle, implements many algorithms on phylogenetic networks, including the computation of the distance presented in this article. Contact:gabriel.cardona@uib.es Supplementary information: Some counterexamples, proofs of the results not included in this article, and some computational experiments are available at Bioinformatics online.
Collapse
Affiliation(s)
- Gabriel Cardona
- Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain.
| | | | | | | |
Collapse
|
28
|
Yang K, Zhang L. Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Res 2008; 36:e33. [PMID: 18296485 PMCID: PMC2275138 DOI: 10.1093/nar/gkn075] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Phylogenetic tree reconstruction requires construction of a multiple sequence alignment (MSA) from sequences. Computationally, it is difficult to achieve an optimal MSA for many sequences. Moreover, even if an optimal MSA is obtained, it may not be the true MSA that reflects the evolutionary history of the underlying sequences. Therefore, errors can be introduced during MSA construction which in turn affects the subsequent phylogenetic tree construction. In order to circumvent this issue, we extend the application of the k-tuple distance to phylogenetic tree reconstruction. The k-tuple distance between two sequences is the sum of the differences in frequency, over all possible tuples of length k, between the sequences and can be estimated without MSAs. It has been traditionally used to build a fast ‘guide tree’ to assist the construction of MSAs. Using the 1470 simulated sets of sequences generated under different evolutionary scenarios, the neighbor-joining trees and BioNJ trees, we compared the performance of the k-tuple distance with four commonly used distance estimators including Jukes–Cantor, Kimura, F84 and Tamura–Nei. These four distance estimators fall into the category of model-based distance estimators, as each of them takes account of a specific substitution model in order to compute the distance between a pair of already aligned sequences. Results show that trees constructed from the k-tuple distance are more accurate than those from other distances most time; when the divergence between underlying sequences is high, the tree accuracy could be twice or higher using the k-tuple distance than other estimators. Furthermore, as the k-tuple distance voids the need for constructing an MSA, it can save tremendous amount of time for phylogenetic tree reconstructions when the data include a large number of sequences.
Collapse
Affiliation(s)
- Kuan Yang
- Virginia Bioinformatics Institute, Virginia, USA
| | | |
Collapse
|
29
|
Tripartitions do not always discriminate phylogenetic networks. Math Biosci 2008; 211:356-70. [DOI: 10.1016/j.mbs.2007.11.003] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2007] [Revised: 11/09/2007] [Accepted: 11/16/2007] [Indexed: 11/23/2022]
|
30
|
Integrating Sequence and Topology for Efficient and Accurate Detection of Horizontal Gene Transfer. COMPARATIVE GENOMICS 2008. [DOI: 10.1007/978-3-540-87989-3_9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/09/2023]
|
31
|
Abstract
Exponentially accumulating genetic molecular data were supposed to bring us closer to resolving one of the most fundamental issues in biology—the reconstruction of the tree of life. This tree should encompass the evolutionary history of all living creatures on earth and trace back a few billions of years to
the most ancient microbial ancestor.
Ironically, this abundance of data only blurs our traditional beliefs and seems to make this goal harder to achieve than initially thought. This is largelydue to lateral gene transfer, the passage of genetic material between organisms not through lineal descent. Evolution in light of lateral transfer tangles the traditional universal tree of life, turning it into a network of relationships. Lateral
transfer is a significant factor in microbial evolution and is the mechanism of antibiotic resistance spread in bacteria species.
In this paper we survey current methods designed to cope with lateral transfer in conjunction with vertical inheritance. We distinguish between phylogenetic-based methods and sequence-based methods and illuminate the advantages and disadvantages of each. Finally, we sketch a new statistically rigorous approach aimed at identifying lateral transfer between two genomes.
Collapse
Affiliation(s)
- Sagi Snir
- Institute of Evolution, University of Haifa, 31905 Haifa, Israel and Department of Computer Science, Netanya Academic College
| |
Collapse
|