1
|
Fleischauer M, Böcker S. BCD Beam Search: considering suboptimal partial solutions in Bad Clade Deletion supertrees. PeerJ 2018; 6:e4987. [PMID: 29900080 PMCID: PMC5995099 DOI: 10.7717/peerj.4987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2018] [Accepted: 05/26/2018] [Indexed: 11/20/2022] Open
Abstract
Supertree methods enable the reconstruction of large phylogenies. The supertree problem can be formalized in different ways in order to cope with contradictory information in the input. Some supertree methods are based on encoding the input trees in a matrix; other methods try to find minimum cuts in some graph. Recently, we introduced Bad Clade Deletion (BCD) supertrees which combines the graph-based computation of minimum cuts with optimizing a global objective function on the matrix representation of the input trees. The BCD supertree method has guaranteed polynomial running time and is very swift in practice. The quality of reconstructed supertrees was superior to matrix representation with parsimony (MRP) and usually on par with SuperFine for simulated data; but particularly for biological data, quality of BCD supertrees could not keep up with SuperFine supertrees. Here, we present a beam search extension for the BCD algorithm that keeps alive a constant number of partial solutions in each top-down iteration phase. The guaranteed worst-case running time of the new algorithm is still polynomial in the size of the input. We present an exact and a randomized subroutine to generate suboptimal partial solutions. Both beam search approaches consistently improve supertree quality on all evaluated datasets when keeping 25 suboptimal solutions alive. Supertree quality of the BCD Beam Search algorithm is on par with MRP and SuperFine even for biological data. This is the best performance of a polynomial-time supertree algorithm reported so far.
Collapse
Affiliation(s)
| | - Sebastian Böcker
- Chair for Bioinformatics, Friedrich-Schiller-University, Jena, Germany
| |
Collapse
|
2
|
Chen D, Eulenstein O, Fernández-Baca D, Burleigh JG. Improved Heuristics for Minimum-Flip Supertree Construction. Evol Bioinform Online 2017. [DOI: 10.1177/117693430600200003] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
The utility of the matrix representation with flipping (MRF) supertree method has been limited by the speed of its heuristic algorithms. We describe a new heuristic algorithm for MRF supertree construction that improves upon the speed of the previous heuristic by a factor of n (the number of taxa in the supertree). This new heuristic makes MRF tractable for large-scale supertree analyses and allows the first comparisons of MRF with other supertree methods using large empirical data sets. Analyses of three published supertree data sets with between 267 to 571 taxa indicate that MRF supertrees are equally or more similar to the input trees on average than matrix representation with parsimony (MRP) and modified mincut supertrees. The results also show that large differences may exist between MRF and MRP supertrees and demonstrate that the MRF supertree method is a practical and potentially more accurate alternative to the nearly ubiquitous MRP super-tree method.
Collapse
Affiliation(s)
- Duhong Chen
- Department of Computer Science, Iowa State University, Ames, IA 50011, U.S.A
| | - Oliver Eulenstein
- Department of Computer Science, Iowa State University, Ames, IA 50011, U.S.A
| | | | - J. Gordon Burleigh
- Section of Evolution and Ecology, University of California, Davis, CA 95616, U.S.A.; NESCent, Durham, NC 27705, U.S.A
| |
Collapse
|
3
|
Sigwart JD, Lindberg DR. Consensus and confusion in molluscan trees: evaluating morphological and molecular phylogenies. Syst Biol 2015; 64:384-95. [PMID: 25472575 PMCID: PMC4395843 DOI: 10.1093/sysbio/syu105] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2013] [Accepted: 11/21/2014] [Indexed: 11/18/2022] Open
Abstract
Mollusks are the most morphologically disparate living animal phylum, they have diversified into all habitats, and have a deep fossil record. Monophyly and identity of their eight living classes is undisputed, but relationships between these groups and patterns of their early radiation have remained elusive. Arguments about traditional morphological phylogeny focus on a small number of topological concepts but often without regard to proximity of the individual classes. In contrast, molecular studies have proposed a number of radically different, inherently contradictory, and controversial sister relationships. Here, we assembled a data set of 42 unique published trees describing molluscan interrelationships. We used these data to ask several questions about the state of resolution of molluscan phylogeny compared with a null model of the variation possible in random trees constructed from a monophyletic assemblage of eight terminals. Although 27 different unique trees have been proposed from morphological inference, the majority of these are not statistically different from each other. Within the available molecular topologies, only four studies to date have included the deep sea class Monoplacophora; but 36.4% of all trees are not significantly different. We also present supertrees derived from two data partitions and three methods, including all available molecular molluscan phylogenies, which will form the basis for future hypothesis testing. The supertrees presented here were not constructed to provide yet another hypothesis of molluscan relationships, but rather to algorithmically evaluate the relationships present in the disparate published topologies. Based on the totality of available evidence, certain patterns of relatedness among constituent taxa become clear. The internodal distance is consistently short between a few taxon pairs, particularly supporting the relatedness of Monoplacophora and the chitons, Polyplacophora. Other taxon pairs are rarely or never found in close proximity, such as the vermiform Caudofoveata and Bivalvia. Our results have specific utility for guiding constructive research planning to better test relationships in Mollusca as well as other problematic groups. Taxa with consistently proximate relationships should be the focus of a combined approach in a concerted assessment of potential genetic and anatomical homology, whereas unequivocally distant taxa will make the most constructive choices for exemplar selection in higher level phylogenomic analyses.
Collapse
Affiliation(s)
- Julia D Sigwart
- Marine Laboratory, Queen's University Belfast, BT22 1PF, Northern Ireland, UK; and Department of Integrative Biology, Museum of Paleontology and Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
| | - David R Lindberg
- Marine Laboratory, Queen's University Belfast, BT22 1PF, Northern Ireland, UK; and Department of Integrative Biology, Museum of Paleontology and Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
| |
Collapse
|
4
|
GRISWOLD CHARLESE, WOOD HANNAHMARIE, CARMICHAEL ANTHEAD. The lace web spiders (Araneae, Phyxelididae) of Madagascar: phylogeny, biogeography and taxonomy. Zool J Linn Soc 2012. [DOI: 10.1111/j.1096-3642.2011.00779.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
5
|
Nguyen N, Mirarab S, Warnow T. MRL and SuperFine+MRL: new supertree methods. Algorithms Mol Biol 2012; 7:3. [PMID: 22280525 PMCID: PMC3308190 DOI: 10.1186/1748-7188-7-3] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2011] [Accepted: 01/26/2012] [Indexed: 11/29/2022] Open
Abstract
Background Supertree methods combine trees on subsets of the full taxon set together to produce a tree on the entire set of taxa. Of the many supertree methods, the most popular is MRP (Matrix Representation with Parsimony), a method that operates by first encoding the input set of source trees by a large matrix (the "MRP matrix") over {0,1, ?}, and then running maximum parsimony heuristics on the MRP matrix. Experimental studies evaluating MRP in comparison to other supertree methods have established that for large datasets, MRP generally produces trees of equal or greater accuracy than other methods, and can run on larger datasets. A recent development in supertree methods is SuperFine+MRP, a method that combines MRP with a divide-and-conquer approach, and produces more accurate trees in less time than MRP. In this paper we consider a new approach for supertree estimation, called MRL (Matrix Representation with Likelihood). MRL begins with the same MRP matrix, but then analyzes the MRP matrix using heuristics (such as RAxML) for 2-state Maximum Likelihood. Results We compared MRP and SuperFine+MRP with MRL and SuperFine+MRL on simulated and biological datasets. We examined the MRP and MRL scores of each method on a wide range of datasets, as well as the resulting topological accuracy of the trees. Our experimental results show that MRL, coupled with a very good ML heuristic such as RAxML, produced more accurate trees than MRP, and MRL scores were more strongly correlated with topological accuracy than MRP scores. Conclusions SuperFine+MRP, when based upon a good MP heuristic, such as TNT, produces among the best scores for both MRP and MRL, and is generally faster and more topologically accurate than other supertree methods we tested.
Collapse
|
6
|
Swenson MS, Suri R, Linder CR, Warnow T. SuperFine: Fast and Accurate Supertree Estimation. Syst Biol 2011; 61:214-27. [DOI: 10.1093/sysbio/syr092] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- M. Shel Swenson
- Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
| | - Rahul Suri
- Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
| | - C. Randal Linder
- Section of Integrative Biology, School of Biological Sciences, The University of Texas at Austin, Austin, TX, USA
| | - Tandy Warnow
- Department of Computer Science, The University of Texas at Austin, Austin, TX, USA
| |
Collapse
|
7
|
Swenson MS, Suri R, Linder CR, Warnow T. An experimental study of Quartets MaxCut and other supertree methods. Algorithms Mol Biol 2011; 6:7. [PMID: 21504600 PMCID: PMC3101644 DOI: 10.1186/1748-7188-6-7] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2010] [Accepted: 04/19/2011] [Indexed: 12/01/2022] Open
Abstract
BACKGROUND Supertree methods represent one of the major ways by which the Tree of Life can be estimated, but despite many recent algorithmic innovations, matrix representation with parsimony (MRP) remains the main algorithmic supertree method. RESULTS We evaluated the performance of several supertree methods based upon the Quartets MaxCut (QMC) method of Snir and Rao and showed that two of these methods usually outperform MRP and five other supertree methods that we studied, under many realistic model conditions. However, the QMC-based methods have scalability issues that may limit their utility on large datasets. We also observed that taxon sampling impacted supertree accuracy, with poor results obtained when all of the source trees were only sparsely sampled. Finally, we showed that the popular optimality criterion of minimizing the total topological distance of the supertree to the source trees is only weakly correlated with supertree topological accuracy. Therefore evaluating supertree methods on biological datasets is problematic. CONCLUSIONS Our results show that supertree methods that improve upon MRP are possible, and that an effort should be made to produce scalable and robust implementations of the most accurate supertree methods. Also, because topological accuracy depends upon taxon sampling strategies, attempts to construct very large phylogenetic trees using supertree methods should consider the selection of source tree datasets, as well as supertree methods. Finally, since supertree topological error is only weakly correlated with the supertree's topological distance to its source trees, development and testing of supertree methods presents methodological challenges.
Collapse
Affiliation(s)
- M Shel Swenson
- Department of Computer Science, The University of Texas at Austin, Austin TX, USA
| | - Rahul Suri
- Department of Computer Science, The University of Texas at Austin, Austin TX, USA
| | - C Randal Linder
- Section of Integrative Biology, The University of Texas at Austin, Austin TX, USA
| | - Tandy Warnow
- Department of Computer Science, The University of Texas at Austin, Austin TX, USA
| |
Collapse
|
8
|
Kupczok A, Schmidt HA, von Haeseler A. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms Mol Biol 2010; 5:37. [PMID: 21134245 PMCID: PMC3022592 DOI: 10.1186/1748-7188-5-37] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2010] [Accepted: 12/06/2010] [Indexed: 11/17/2022] Open
Abstract
Background The availability of many gene alignments with overlapping taxon sets raises the question of which strategy is the best to infer species phylogenies from multiple gene information. Methods and programs abound that use the gene alignment in different ways to reconstruct the species tree. In particular, different methods combine the original data at different points along the way from the underlying sequences to the final tree. Accordingly, they are classified into superalignment, supertree and medium-level approaches. Here, we present a simulation study to compare different methods from each of these three approaches. Results We observe that superalignment methods usually outperform the other approaches over a wide range of parameters including sparse data and gene-specific evolutionary parameters. In the presence of high incongruency among gene trees, however, other combination methods show better performance than the superalignment approach. Surprisingly, some supertree and medium-level methods exhibit, on average, worse results than a single gene phylogeny with complete taxon information. Conclusions For some methods, using the reconstructed gene tree as an estimation of the species tree is superior to the combination of incomplete information. Superalignment usually performs best since it is less susceptible to stochastic error. Supertree methods can outperform superalignment in the presence of gene-tree conflict.
Collapse
|
9
|
Swenson MS, Barbançon F, Warnow T, Linder CR. A simulation study comparing supertree and combined analysis methods using SMIDGen. Algorithms Mol Biol 2010; 5:8. [PMID: 20047664 PMCID: PMC2837663 DOI: 10.1186/1748-7188-5-8] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2009] [Accepted: 01/04/2010] [Indexed: 11/16/2022] Open
Abstract
Background Supertree methods comprise one approach to reconstructing large molecular phylogenies given multi-marker datasets: trees are estimated on each marker and then combined into a tree (the "supertree") on the entire set of taxa. Supertrees can be constructed using various algorithmic techniques, with the most common being matrix representation with parsimony (MRP). When the data allow, the competing approach is a combined analysis (also known as a "supermatrix" or "total evidence" approach) whereby the different sequence data matrices for each of the different subsets of taxa are concatenated into a single supermatrix, and a tree is estimated on that supermatrix. Results In this paper, we describe an extensive simulation study we performed comparing two supertree methods, MRP and weighted MRP, to combined analysis methods on large model trees. A key contribution of this study is our novel simulation methodology (Super-Method Input Data Generator, or SMIDGen) that better reflects biological processes and the practices of systematists than earlier simulations. We show that combined analysis based upon maximum likelihood outperforms MRP and weighted MRP, giving especially big improvements when the largest subtree does not contain most of the taxa. Conclusions This study demonstrates that MRP and weighted MRP produce distinctly less accurate trees than combined analyses for a given base method (maximum parsimony or maximum likelihood). Since there are situations in which combined analyses are not feasible, there is a clear need for better supertree methods. The source tree and combined datasets used in this study can be used to test other supertree and combined analysis methods.
Collapse
|
10
|
Swenson MS, Suri R, Linder CR, Warnow T. An Experimental Study of Quartets MaxCut and Other Supertree Methods. LECTURE NOTES IN COMPUTER SCIENCE 2010. [DOI: 10.1007/978-3-642-15294-8_24] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
|
11
|
Abstract
BACKGROUND There is much interest in developing fast and accurate supertree methods to infer the tree of life. Supertree methods combine smaller input trees with overlapping sets of taxa to make a comprehensive phylogenetic tree that contains all of the taxa in the input trees. The intrinsically hard triplet supertree problem takes a collection of input species trees and seeks a species tree (supertree) that maximizes the number of triplet subtrees that it shares with the input trees. However, the utility of this supertree problem has been limited by a lack of efficient and effective heuristics. RESULTS We introduce fast hill-climbing heuristics for the triplet supertree problem that perform a step-wise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. To realize time efficient heuristics we designed the first nontrivial algorithms for two standard search problems, which greatly improve on the time complexity to the best known (naïve) solutions by a factor of n and n2 (the number of taxa in the supertree). These algorithms enable large-scale supertree analyses based on the triplet supertree problem that were previously not possible. We implemented hill-climbing heuristics that are based on our new algorithms, and in analyses of two published supertree data sets, we demonstrate that our new heuristics outperform other standard supertree methods in maximizing the number of triplets shared with the input trees. CONCLUSION With our new heuristics, the triplet supertree problem is now computationally more tractable for large-scale supertree analyses, and it provides a potentially more accurate alternative to existing supertree methods.
Collapse
Affiliation(s)
- Harris T Lin
- Department of Computer Science, Iowa State University, Ames, IA, USA.
| | | | | |
Collapse
|
12
|
Abstract
The subject of this chapter is to describe the methodology for assessing the power of phylogenetic HGT detection methods. Detection power is defined in the framework of hypothesis testing. Rates of false positives and false negatives can be estimated by testing HGT detection methods on HGT-free orthologous sets, and on the same sets with in silico simulated HGT events. The whole process can be divided into three steps: obtaining HGT-free orthologous sets, in silico simulation of HGT events in the same set, and submitting both sets for evaluation by any of the tested methods.Phylogenetic methods of HGT detection can be roughly divided into three types: likelihood-based tests of topologies (Kishino-Hasegawa (KH), Shimodaira-Hasegawa (SH), and Approximately Unbiased (AU) tests), tree distance methods (symmetrical difference of Robinson and Foulds (RF), and Subtree Pruning and Regrafting (SPR) distances), and genome spectral approaches (bipartition and quartet decomposition analysis). Restrictions that are inherent to phylogenetic methods of HGT detection in general and the power and precision of each method are discussed and comparative analyses of different approaches are provided, as well as some examples of assessing the power of phylogenetic HGT detection methods from a case study of orthologous sets from gamma-proteobacteria (Poptsova and Gogarten, BMC Evol Biol 7, 45, 2007) and cyanobacteria (Zhaxybayeva et al., Genome Res 16, 1099-108, 2006).
Collapse
Affiliation(s)
- Maria Poptsova
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
| |
Collapse
|
13
|
Moore BR, Smith SA, Donoghue MJ. Increasing data transparency and estimating phylogenetic uncertainty in supertrees: Approaches using nonparametric bootstrapping. Syst Biol 2006; 55:662-76. [PMID: 16969942 DOI: 10.1080/10635150600920693] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022] Open
Abstract
The estimation of ever larger phylogenies requires consideration of alternative inference strategies, including divide-and-conquer approaches that decompose the global inference problem to a set of smaller, more manageable component problems. A prominent locus of research in this area is the development of supertree methods, which estimate a composite tree by combining a set of partially overlapping component topologies. Although promising, the use of component tree topologies as the primary data dissociates supertrees from complexities within the underling character data and complicates the evaluation of phylogenetic uncertainty. We address these issues by exploring three approaches that variously incorporate nonparametric bootstrapping into a common supertree estimation algorithm (matrix representation with parsimony, although any algorithm might be used), including bootstrap-weighting, source-tree bootstrapping, and hierarchical bootstrapping. We illustrate these procedures by means of hypothetical and empirical examples. Our preliminary experiments suggest that these methods have the potential to improve the correspondence of supertree estimates to those derived from simultaneous analysis of the combined data and to allow uncertainty in supertree topologies to be quantified. The ability to increase the transparency of supertrees to the underlying character data has several practical implications and sheds new light on an old debate. These methods have been implemented in the freely available program, tREeBOOT.
Collapse
Affiliation(s)
- Brian R Moore
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut 06520, USA.
| | | | | |
Collapse
|
14
|
Abstract
Supertrees result from combining many smaller, overlapping phylogenetic trees into a single, more comprehensive tree. As such, supertree construction is probably as old as the field of systematics itself, and remains our only way of visualizing the Tree of Life as a whole. Over the past decade, supertree construction has gained a more formal, objective footing, and has become an area of active theoretical and practical research. Here, I review the history of the supertree approach, focusing mainly on its current implementation. The supertrees of today represent some of the largest, complete phylogenies available for many groups, but are not without their critics. I conclude by arguing that the ever-growing molecular revolution will result in supertree construction taking on a new role and implementation in the future for analyzing large DNA sequence matrices as part of a divide-and-conquer phylogenetic approach.
Collapse
Affiliation(s)
- Olaf R P Bininda-Emonds
- Lehrstuhl für Tierzucht, Technical University of Munich, D-85354 Freising-Weihenstephan, Germany.
| |
Collapse
|
15
|
|
16
|
|
17
|
Ross HA, Rodrigo AG. An Assessment of Matrix Representation with Compatibility in Supertree Construction. COMPUTATIONAL BIOLOGY 2004. [DOI: 10.1007/978-1-4020-2330-9_3] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/20/2023]
|
18
|
Gatesy J, Springer MS. A Critique of Matrix Representation with Parsimony Supertrees. COMPUTATIONAL BIOLOGY 2004. [DOI: 10.1007/978-1-4020-2330-9_18] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
|
19
|
|
20
|
|
21
|
|