Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Roch S, Nute M, Warnow T. Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods. Syst Biol 2019;68:281-297. [PMID: 30247732 DOI: 10.1093/sysbio/syy061] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2018] [Accepted: 09/12/2018] [Indexed: 11/13/2022] Open

For:	Roch S, Nute M, Warnow T. Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods. Syst Biol 2019;68:281-297. [PMID: 30247732 DOI: 10.1093/sysbio/syy061] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2018] [Accepted: 09/12/2018] [Indexed: 11/13/2022] Open

Number

Cited by Other Article(s)

Lin YE, Wu CS, Wu YW, Chaw SM. Phylogenomic Inference Suggests Differential Deep Time Phylogenetic Signals from Nuclear and Organellar Genomes in Gymnosperms. PLANTS (BASEL, SWITZERLAND) 2025;14:1335. [PMID: 40364364 PMCID: PMC12073265 DOI: 10.3390/plants14091335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/16/2025] [Revised: 04/17/2025] [Accepted: 04/17/2025] [Indexed: 05/15/2025]

Kong S, Swofford DL, Kubatko LS. Inference of Phylogenetic Networks From Sequence Data Using Composite Likelihood. Syst Biol 2025;74:53-69. [PMID: 39387633 DOI: 10.1093/sysbio/syae054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2022] [Revised: 09/13/2024] [Accepted: 10/08/2024] [Indexed: 10/12/2024] Open

Abstract

While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between 2 species leads to formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogenetic tree by allowing 2 branches to merge into 1, resulting in reticulation. However, existing methods for estimating networks become computationally expensive as the dataset size and/or topological complexity increase. The lack of methods for scalable inference hampers phylogenetic networks from being widely used in practice, despite accumulating evidence that hybridization occurs frequently in nature. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates binary, level-1 phylogenetic networks with a fixed, user-specified number of reticulations directly from sequence data. By using the composite likelihood as the basis for inference, PhyNEST is able to use the full genomic data in a computationally tractable manner, eliminating the need to summarize the data as a set of gene trees prior to network estimation. To search network space, PhyNEST implements both hill climbing and simulated annealing algorithms. PhyNEST assumes that the data are composed of coalescent independent sites that evolve according to the Jukes-Cantor substitution model and that the network has a constant effective population size. Simulation studies demonstrate that PhyNEST is often more accurate than 2 existing composite likelihood summary methods (SNaQand PhyloNet) and that it is robust to at least one form of model misspecification (assuming a less complex nucleotide substitution model than the true generating model). We applied PhyNEST to reconstruct the evolutionary relationships among Heliconius butterflies and Papionini primates, characterized by hybrid speciation and widespread introgression, respectively. PhyNEST is implemented in an open-source Julia package and is publicly available at https://github.com/sungsik-kong/PhyNEST.jl.

Collapse

Kovacs TGL, Walker J, Hellemans S, Bourguignon T, Tatarnic NJ, McRae JM, Ho SYW, Lo N. Dating in the Dark: Elevated Substitution Rates in Cave Cockroaches (Blattodea: Nocticolidae) Have Negative Impacts on Molecular Date Estimates. Syst Biol 2024;73:532-545. [PMID: 38320290 PMCID: PMC11377191 DOI: 10.1093/sysbio/syae002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2023] [Revised: 01/14/2024] [Accepted: 01/18/2024] [Indexed: 02/08/2024] Open

Baños H, Susko E, Roger AJ. Is Over-parameterization a Problem for Profile Mixture Models? Syst Biol 2024;73:53-75. [PMID: 37843172 PMCID: PMC11129589 DOI: 10.1093/sysbio/syad063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2022] [Revised: 09/12/2023] [Accepted: 10/13/2023] [Indexed: 10/17/2023] Open

Abstract

Biochemical constraints on the admissible amino acids at specific sites in proteins lead to heterogeneity of the amino acid substitution process over sites in alignments. It is well known that phylogenetic models of protein sequence evolution that do not account for site heterogeneity are prone to long-branch attraction (LBA) artifacts. Profile mixture models were developed to model heterogeneity of preferred amino acids at sites via a finite distribution of site classes each with a distinct set of equilibrium amino acid frequencies. However, it is unknown whether the large number of parameters in such models associated with the many amino acid frequency vectors can adversely affect tree topology estimates because of over-parameterization. Here, we demonstrate theoretically that for long sequences, over-parameterization does not create problems for estimation with profile mixture models. Under mild conditions, tree, amino acid frequencies, and other model parameters converge to true values as sequence length increases, even when there are large numbers of components in the frequency profile distributions. Because large sample theory does not necessarily imply good behavior for shorter alignments we explore the performance of these models with short alignments simulated with tree topologies that are prone to LBA artifacts. We find that over-parameterization is not a problem for complex profile mixture models even when there are many amino acid frequency vectors. In fact, simple models with few site classes behave poorly. Interestingly, we also found that misspecification of the amino acid frequency vectors does not lead to increased LBA artifacts as long as the estimated cumulative distribution function of the amino acid frequencies at sites adequately approximates the true one. In contrast, misspecification of the amino acid exchangeability rates can severely negatively affect parameter estimation. Finally, we explore the effects of including in the profile mixture model an additional "F-class" representing the overall frequencies of amino acids in the data set. Surprisingly, the F-class does not help parameter estimation significantly and can decrease the probability of correct tree estimation, depending on the scenario, even though it tends to improve likelihood scores.

Collapse

Tabatabaee Y, Roch S, Warnow T. QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent. J Comput Biol 2023;30:1146-1181. [PMID: 37902986 DOI: 10.1089/cmb.2023.0185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/01/2023] Open

Liu B, Warnow T. Weighted ASTRID: fast and accurate species trees from weighted internode distances. Algorithms Mol Biol 2023;18:6. [PMID: 37468904 PMCID: PMC10355063 DOI: 10.1186/s13015-023-00230-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2023] [Accepted: 06/10/2023] [Indexed: 07/21/2023] Open

Abstract

BACKGROUND

Species tree estimation is a basic step in many biological research projects, but is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., "gene tree heterogeneity"). One approach to estimating species trees in the presence of gene tree heterogeneity resulting from ILS operates by computing trees on each genomic region (i.e., computing "gene trees") and then using these gene trees to define a matrix of average internode distances, where the internode distance in a tree T between two species x and y is the number of nodes in T between the leaves corresponding to x and y. Given such a matrix, a tree can then be computed using methods such as neighbor joining. Methods such as ASTRID and NJst (which use this basic approach) are provably statistically consistent, very fast (low degree polynomial time) and have had high accuracy under many conditions that makes them competitive with other popular species tree estimation methods. In this study, inspired by the very recent work of weighted ASTRAL, we present weighted ASTRID, a variant of ASTRID that takes the branch uncertainty on the gene trees into account in the internode distance.

RESULTS

Our experimental study evaluating weighted ASTRID typically shows improvements in accuracy compared to the original (unweighted) ASTRID, and shows competitive accuracy against weighted ASTRAL, the state of the art. Our re-implementation of ASTRID also improves the runtime, with marked improvements on large datasets.

CONCLUSIONS

Weighted ASTRID is a new and very fast method for species tree estimation that typically improves upon ASTRID and has comparable accuracy to weighted ASTRAL, while remaining much faster. Weighted ASTRID is available at https://github.com/RuneBlaze/internode .

Collapse

Liu L, Yu L, Wu S, Arnold J, Whalen C, Davis C, Edwards S. Short branch attraction in phylogenomic inference under the multispecies coalescent. Front Ecol Evol 2023;11:1134764. [PMID: 39233780 PMCID: PMC11372852 DOI: 10.3389/fevo.2023.1134764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/06/2024] Open

Abstract

Accurate reconstruction of species trees often relies on the quality of input gene trees estimated from molecular sequences. Previous studies suggested that if the sequence length is fixed, the maximum likelihood may produce biased gene trees which subsequently mislead inference of species trees. Two key questions need to be answered in this context: what are the scenarios that may result in consistently biased gene trees? and for those scenarios, are there any remedies that may remove or at least reduce the misleading effects of consistently biased gene trees? In this article, we establish a theoretical framework to address these questions. Considering a scenario where the true gene tree is a 4-taxon star treeT * = S 1 , S 2 , S 3 , S 4 with two short branches leading to the speciesS 1 andS 2 , we demonstrate that maximum likelihood significantly favors the wrong bifurcating treeS 1 , S 2 , S 3 , S 4 grouping the two speciesS 1 andS 2 with short branches. We name this inconsistent behavior short branch attraction, which may occur in real-world data involving a 4-taxon bifurcating gene tree with a short internal branch. If no mutation occurs along the internal branch, which is likely if the internal branch is short, the 4-taxon bifurcating tree is equivalent to the 4-taxon star tree and thus will suffer the same misleading effect of short branch attraction. Theoretical and simulation results further demonstrate that short branch attraction may occur in gene trees and species trees of arbitrary size. Moreover, short branch attraction is primarily caused by a lack of phylogenetic information in sequence data, suggesting that converting short internal branches to polytomies in the estimated gene trees can significantly reduce artifacts induced by short branch attraction.

Collapse

Wilson D, Rogers JD. Evaluating Compression-Based Phylogeny Estimation in the Presence of Incomplete Lineage Sorting. J Comput Biol 2023;30:250-260. [PMID: 36848254 DOI: 10.1089/cmb.2022.0197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/01/2023] Open

Brožová V, Proćków J, Záveská Drábková L. Toward finally unraveling the phylogenetic relationships of Juncaceae with respect to another cyperid family, Cyperaceae. Mol Phylogenet Evol 2022;177:107588. [PMID: 35907594 DOI: 10.1016/j.ympev.2022.107588] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2021] [Revised: 08/06/2021] [Accepted: 07/11/2022] [Indexed: 11/29/2022]

Hill M, Legried B, Roch S. Species tree estimation under joint modeling of coalescence and duplication: Sample complexity of quartet methods. ANN APPL PROBAB 2022. [DOI: 10.1214/22-aap1799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]

Peng J, Swofford DL, Kubatko L. Estimation of speciation times under the multispecies coalescent. Bioinformatics 2022;38:5182-5190. [PMID: 36227122 DOI: 10.1093/bioinformatics/btac679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2020] [Revised: 06/02/2022] [Accepted: 10/10/2022] [Indexed: 12/24/2022] Open

Susko E. Complex statistical modelling for phylogenetic inference. CAN J STAT 2022. [DOI: 10.1002/cjs.11741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]

Zhang C, Mirarab S. Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Mol Biol Evol 2022;39:6750035. [PMID: 36201617 PMCID: PMC9750496 DOI: 10.1093/molbev/msac215] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 09/20/2022] [Accepted: 10/03/2022] [Indexed: 01/07/2023] Open

Chan YB, Li Q, Scornavacca C. The large-sample asymptotic behaviour of quartet-based summary methods for species tree inference. J Math Biol 2022;85:22. [PMID: 35976512 PMCID: PMC9385842 DOI: 10.1007/s00285-022-01786-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2022] [Revised: 06/08/2022] [Accepted: 07/14/2022] [Indexed: 12/03/2022]

Flouri T, Huang J, Jiao X, Kapli P, Rannala B, Yang Z. Bayesian phylogenetic inference using relaxed-clocks and the multispecies coalescent. Mol Biol Evol 2022;39:6652437. [PMID: 35907248 PMCID: PMC9366188 DOI: 10.1093/molbev/msac161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open

Gatesy J, Springer MS. Phylogenomic Coalescent Analyses of Avian Retroelements Infer Zero-Length Branches at the Base of Neoaves, Emergent Support for Controversial Clades, and Ancient Introgressive Hybridization in Afroaves. Genes (Basel) 2022;13:1167. [PMID: 35885951 PMCID: PMC9324441 DOI: 10.3390/genes13071167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2022] [Revised: 06/20/2022] [Accepted: 06/21/2022] [Indexed: 01/25/2023] Open

Xiong H, Wang D, Shao C, Yang X, Yang J, Ma T, Davis CC, Liu L, Xi Z. Species Tree Estimation and the Impact of Gene Loss Following Whole-Genome Duplication. Syst Biol 2022;71:1348-1361. [PMID: 35689633 PMCID: PMC9558847 DOI: 10.1093/sysbio/syac040] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2021] [Revised: 06/03/2022] [Accepted: 06/07/2022] [Indexed: 12/02/2022] Open

Abstract

Whole-genome duplication (WGD) occurs broadly and repeatedly across the history of eukaryotes and is recognized as a prominent evolutionary force, especially in plants. Immediately following WGD, most genes are present in two copies as paralogs. Due to this redundancy, one copy of a paralog pair commonly undergoes pseudogenization and is eventually lost. When speciation occurs shortly after WGD; however, differential loss of paralogs may lead to spurious phylogenetic inference resulting from the inclusion of pseudoorthologs–paralogous genes mistakenly identified as orthologs because they are present in single copies within each sampled species. The influence and impact of including pseudoorthologs versus true orthologs as a result of gene extinction (or incomplete laboratory sampling) are only recently gaining empirical attention in the phylogenomics community. Moreover, few studies have yet to investigate this phenomenon in an explicit coalescent framework. Here, using mathematical models, numerous simulated data sets, and two newly assembled empirical data sets, we assess the effect of pseudoorthologs on species tree estimation under varying degrees of incomplete lineage sorting (ILS) and differential gene loss scenarios following WGD. When gene loss occurs along the terminal branches of the species tree, alignment-based (BPP) and gene-tree-based (ASTRAL, MP-EST, and STAR) coalescent methods are adversely affected as the degree of ILS increases. This can be greatly improved by sampling a sufficiently large number of genes. Under the same circumstances, however, concatenation methods consistently estimate incorrect species trees as the number of genes increases. Additionally, pseudoorthologs can greatly mislead species tree inference when gene loss occurs along the internal branches of the species tree. Here, both coalescent and concatenation methods yield inconsistent results. These results underscore the importance of understanding the influence of pseudoorthologs in the phylogenomics era. [Coalescent method; concatenation method; incomplete lineage sorting; pseudoorthologs; single-copy gene; whole-genome duplication.]

Collapse

Zhelezov G, Degnan JH. Trying Out a Million Genes to Find the Perfect Pair with RTIST. Bioinformatics 2022;38:3565-3573. [PMID: 35641003 DOI: 10.1093/bioinformatics/btac349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2021] [Revised: 05/07/2022] [Accepted: 05/17/2022] [Indexed: 11/13/2022] Open

Abstract

MOTIVATION

Consensus methods can be used for reconstructing a species tree from several gene trees which exhibit incompatible topologies due to incomplete lineage sorting. Motivated by the fact that there are no anomalous rooted gene trees with three taxa and no anomalous unrooted gene trees with four taxa in the multispecies coalescent model, several contemporary methods form the gene tree consensus by finding the median tree with respect to the triplet or quartet distance-i.e., estimate the species tree as the tree which minimizes the sum of triplet or quartet distances to the input gene trees. These methods reformulate the solution to the consensus problem as the solution to a recursively-solved dynamic programming problem. We present an iterative, easily-parallelizable approach to finding the exact median triplet tree, and implement it as an open source software package which can also find suboptimal consensus trees within a specified triplet distance to the gene trees. The most time-consuming step for methods of this type is the creation of a weights array for all possible subtree bipartitions. By grouping the relevant calculations and array update operations of different bipartitions of the same subtree together, this implementation finds the exact median tree of many gene trees faster than comparable methods, has better scaling properties with respect to the number of gene trees, and has a smaller memory footprint.

RESULTS

RTIST (Rooted Triple Inference of Species Trees) finds the exact median triplet tree of a set of gene trees. Its runtime and memory footprints scale better than existing algorithms. RTIST can resolve all the non-unique median trees, as well as sub-optimal consensus trees within a user-specified triplet distance to the median. Although it is limited in the number of taxa (≤ 20), its runtime changes little when the number of gene trees is changed by several orders of magnitude.

AVAILABILITY

RTIST is written in C and Python. It is freely available at https://github.com/glebzhelezov/rtist.

Collapse

Bloesch Z, Nauheimer L, Elias Almeida T, Crayn D, Raymond Field A. HybPhaser identifies hybrid evolution in Australian Thelypteridaceae. Mol Phylogenet Evol 2022;173:107526. [DOI: 10.1016/j.ympev.2022.107526] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2021] [Revised: 03/23/2022] [Accepted: 04/05/2022] [Indexed: 10/18/2022]

Dasarathy G, Mossel E, Nowak R, Roch S. A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements. J Math Biol 2022;84:36. [PMID: 35394192 PMCID: PMC9258723 DOI: 10.1007/s00285-022-01731-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2020] [Revised: 02/15/2022] [Accepted: 02/17/2022] [Indexed: 10/18/2022]

Abstract

Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene-the gene trees-often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard-but unsatisfactory-assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error-or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.

Collapse

Liu B, Warnow T. Scalable Species Tree Inference with External Constraints. J Comput Biol 2022;29:664-678. [PMID: 35196115 DOI: 10.1089/cmb.2021.0543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Rabiee M, Mirarab S. OUP accepted manuscript. Bioinformatics 2022;38:i413-i421. [PMID: 35758818 PMCID: PMC9235488 DOI: 10.1093/bioinformatics/btac265] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Mirarab S, Nakhleh L, Warnow T. Multispecies Coalescent: Theory and Applications in Phylogenetics. ANNUAL REVIEW OF ECOLOGY, EVOLUTION, AND SYSTEMATICS 2021. [DOI: 10.1146/annurev-ecolsys-012121-095340] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Molloy EK, Gatesy J, Springer MS. Theoretical and practical considerations when using retroelement insertions to estimate species trees in the anomaly zone. Syst Biol 2021;71:721-740. [PMID: 34677617 DOI: 10.1093/sysbio/syab086] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2020] [Accepted: 10/11/2021] [Indexed: 11/13/2022] Open

Abstract

A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation. Here, we show that two recently proposed quartet-based methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the unrooted species tree topology under the coalescent when RIs follow a neutral infinite-sites model of mutation and the expected number of new RIs per generation is constant across the species tree. The accuracy of these (and other) methods for inferring species trees from RIs has yet to be assessed on simulated data sets, where the true species tree topology is known. Therefore, we evaluated eight methods given RIs simulated from four model species trees, all of which have short branches and at least three of which are in the anomaly zone. In our simulation study, ASTRAL_BP and SDPquartets always recovered the correct species tree topology when given a sufficiently large number of RIs, as predicted. A distance-based method (ASTRID_BP) and Dollo parsimony also performed well in recovering the species tree topology. In contrast, unordered, polymorphism, and Camin-Sokal parsimony typically fail to recover the correct species tree topology in anomaly zone situations with more than four ingroup taxa. Of the methods studied, only ASTRAL_BP automatically estimates internal branch lengths (in coalescent units) and support values (i.e. local posterior probabilities). We examined the accuracy of branch length estimation, finding that estimated lengths were accurate for short branches but upwardly biased otherwise. This led us to derive the maximum likelihood (branch length) estimate for when RIs are given as input instead of binary gene trees; this corrected formula produced accurate estimates of branch lengths in our simulation study, provided that a sufficiently large number of RIs were given as input. Lastly, we evaluated the impact of data quantity on species tree estimation by repeating the above experiments with input sizes varying from 100 to 100 000 parsimony-informative RIs. We found that, when given just 1 000 parsimony-informative RIs as input, ASTRAL_BP successfully reconstructed major clades (i.e clades separated by branches > 0.3 CUs) with high support and identified rapid radiations (i.e. shorter connected branches), although not their precise branching order. The local posterior probability was effective for controlling false positive branches in these scenarios.

Collapse

Culshaw V, Villaverde T, Mairal M, Olsson S, Sanmartín I. Rare and widespread: integrating Bayesian MCMC approaches, Sanger sequencing and Hyb-Seq phylogenomics to reconstruct the origin of the enigmatic Rand Flora genus Camptoloma. AMERICAN JOURNAL OF BOTANY 2021;108:1673-1691. [PMID: 34550605 DOI: 10.1002/ajb2.1727] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/24/2020] [Revised: 04/14/2021] [Accepted: 04/16/2021] [Indexed: 06/13/2023]

Abstract

PREMISE

Genera that are widespread, with geographically discontinuous distributions and represented by few species, are intriguing. Is their achieved disjunct distribution recent or ancient in origin? Why are they species-poor? The Rand Flora is a continental-scale pattern in which closely related species appear codistributed in isolated regions over the continental margins of Africa. Genus Camptoloma (Scrophulariaceae) is the most notable example, comprising three species isolated from each other on the northwest, eastern, and southwest Africa.

METHODS

We employed Sanger sequencing of nuclear and plastid markers, together with genomic target sequencing of 2190 low-copy nuclear genes, to infer interspecies relationships and the position of Camptoloma within Scrophulariaceae by using supermatrix and multispecies-coalescent approaches. Lineage divergence times and ancestral ranges were inferred with Bayesian Markov chain Monte Carlo (MCMC) approaches. The population history was estimated with phylogeographic coalescent methods.

RESULTS

Camptoloma rotundifolium, restricted to Southern Africa, was shown to be a sister species to the disjunct clade formed by C. canariense, endemic to the Canary Islands, and C. lyperiiflorum, distributed in the Horn of Africa-Southern Arabia. Camptoloma was inferred to be sister to the mostly South African tribes Teedieae and Buddlejeae. Stem divergence was dated in the Late Miocene, while the origin of the extant disjunction was inferred as Early Pliocene.

CONCLUSIONS

The current disjunct distribution of Camptoloma across Africa was likely the result of fragmentation and extinction and/or population bottlenecking events associated with historical aridification cycles during the Neogene; the pattern of species divergence, from south to north, is consistent with the "climatic refugia" Rand Flora hypothesis.

Collapse

Adams RH, Castoe TA, DeGiorgio M. PhyloWGA: chromosome-aware phylogenetic interrogation of whole genome alignments. Bioinformatics 2021;37:1923-1925. [PMID: 33051672 DOI: 10.1093/bioinformatics/btaa884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Revised: 09/16/2020] [Accepted: 09/29/2020] [Indexed: 11/13/2022] Open

Evans CL, Greenhill SJ, Watts J, List JM, Botero CA, Gray RD, Kirby KR. The uses and abuses of tree thinking in cultural evolution. Philos Trans R Soc Lond B Biol Sci 2021;376:20200056. [PMID: 33993767 PMCID: PMC8126464 DOI: 10.1098/rstb.2020.0056] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/11/2021] [Indexed: 11/13/2022] Open

Yu X, Le T, Christensen SA, Molloy EK, Warnow T. Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation. Algorithms Mol Biol 2021;16:12. [PMID: 34183037 PMCID: PMC8240396 DOI: 10.1186/s13015-021-00189-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2021] [Accepted: 06/05/2021] [Indexed: 12/01/2022] Open

Susko E, Roger AJ. Long Branch Attraction Biases in Phylogenetics. Syst Biol 2021;70:838-843. [PMID: 33528562 DOI: 10.1093/sysbio/syab001] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2020] [Revised: 12/22/2020] [Accepted: 12/30/2020] [Indexed: 12/22/2022] Open

Le T, Sy A, Molloy EK, Zhang Q, Rao S, Warnow T. Using Constrained-INC for Large-Scale Gene Tree and Species Tree Estimation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:2-15. [PMID: 32750844 DOI: 10.1109/tcbb.2020.2990867] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]

Legried B, Molloy EK, Warnow T, Roch S. Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. J Comput Biol 2020;28:452-468. [DOI: 10.1089/cmb.2020.0424] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open

Blair C, Ané C. Phylogenetic Trees and Networks Can Serve as Powerful and Complementary Approaches for Analysis of Genomic Data. Syst Biol 2020;69:593-601. [PMID: 31432090 DOI: 10.1093/sysbio/syz056] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2019] [Accepted: 08/15/2019] [Indexed: 11/14/2022] Open

Susko E, Roger AJ. On the Use of Information Criteria for Model Selection in Phylogenetics. Mol Biol Evol 2020;37:549-562. [PMID: 31688943 DOI: 10.1093/molbev/msz228] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open

Gardner EM, Johnson MG, Pereira JT, Puad ASA, Arifiani D, Sahromi , Wickett NJ, Zerega NJC. Paralogs and off-target sequences improve phylogenetic resolution in a densely-sampled study of the breadfruit genus (Artocarpus, Moraceae). Syst Biol 2020;70:syaa073. [PMID: 32970819 PMCID: PMC8048387 DOI: 10.1093/sysbio/syaa073] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Revised: 08/31/2020] [Accepted: 09/08/2020] [Indexed: 12/21/2022] Open

Chan KO, Hutter CR, Wood PL, Grismer LL, Das I, Brown RM. Gene flow creates a mirage of cryptic species in a Southeast Asian spotted stream frog complex. Mol Ecol 2020;29:3970-3987. [PMID: 32808335 DOI: 10.1111/mec.15603] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2019] [Revised: 07/29/2020] [Accepted: 08/13/2020] [Indexed: 02/06/2023]

Abstract

Most new cryptic species are described using conventional tree- and distance-based species delimitation methods (SDMs), which rely on phylogenetic arrangements and measures of genetic divergence. However, although numerous factors such as population structure and gene flow are known to confound phylogenetic inference and species delimitation, the influence of these processes is not frequently evaluated. Using large numbers of exons, introns, and ultraconserved elements obtained using the FrogCap sequence-capture protocol, we compared conventional SDMs with more robust genomic analyses that assess population structure and gene flow to characterize species boundaries in a Southeast Asian frog complex (Pulchrana picturata). Our results showed that gene flow and introgression can produce phylogenetic patterns and levels of divergence that resemble distinct species (up to 10% divergence in mitochondrial DNA). Hybrid populations were inferred as independent (singleton) clades that were highly divergent from adjacent populations (7%-10%) and unusually similar (<3%) to allopatric populations. Such anomalous patterns are not uncommon in Southeast Asian amphibians, which brings into question whether the high levels of cryptic diversity observed in other amphibian groups reflect distinct cryptic species-or, instead, highly admixed and structured metapopulation lineages. Our results also provide an alternative explanation to the conundrum of divergent (sometimes nonsister) sympatric lineages-a pattern that has been celebrated as indicative of true cryptic speciation. Based on these findings, we recommend that species delimitation of continuously distributed "cryptic" groups should not rely solely on conventional SDMs, but should necessarily examine population structure and gene flow to avoid taxonomic inflation.

Collapse

Portik DM, Wiens JJ. Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses? Syst Biol 2020;70:440-462. [PMID: 32797207 DOI: 10.1093/sysbio/syaa064] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2019] [Revised: 08/02/2020] [Accepted: 08/03/2020] [Indexed: 11/14/2022] Open

Abstract

Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data ($\sim $5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several "best practices" for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses. [Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming].

Collapse

Phylogenetic tree building in the genomic age. Nat Rev Genet 2020;21:428-444. [PMID: 32424311 DOI: 10.1038/s41576-020-0233-0] [Citation(s) in RCA: 217] [Impact Index Per Article: 43.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 03/20/2020] [Indexed: 01/22/2023]

Wascher M, Kubatko L. Consistency of SVDQuartets and Maximum Likelihood for Coalescent-Based Species Tree Estimation. Syst Biol 2020;70:33-48. [PMID: 32415974 DOI: 10.1093/sysbio/syaa039] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2018] [Revised: 05/06/2020] [Accepted: 05/07/2020] [Indexed: 11/14/2022] Open

Smirnov V, Warnow T. Unblended disjoint tree merging using GTM improves species tree estimation. BMC Genomics 2020;21:235. [PMID: 32299343 PMCID: PMC7161100 DOI: 10.1186/s12864-020-6605-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open

Adams RH, Castoe TA. Probabilistic Species Tree Distances: Implementing the Multispecies Coalescent to Compare Species Trees Within the Same Model-Based Framework Used to Estimate Them. Syst Biol 2020;69:194-207. [PMID: 31086978 DOI: 10.1093/sysbio/syz031] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2018] [Accepted: 05/02/2019] [Indexed: 11/14/2022] Open

Abstract

Despite the ubiquitous use of statistical models for phylogenomic and population genomic inferences, this model-based rigor is rarely applied to post hoc comparison of trees. In a recent study, Garba et al. derived new methods for measuring the distance between two gene trees computed as the difference in their site pattern probability distributions. Unlike traditional metrics that compare trees solely in terms of geometry, these measures consider gene trees and associated parameters as probabilistic models that can be compared using standard information theoretic approaches. Consequently, probabilistic measures of phylogenetic tree distance can be far more informative than simply comparisons of topology and/or branch lengths alone. However, in their current form, these distance measures are not suitable for the comparison of species tree models in the presence of gene tree heterogeneity. Here, we demonstrate an approach for how the theory of Garba et al. (2018), which is based on gene tree distances, can be extended naturally to the comparison of species tree models. Multispecies coalescent (MSC) models parameterize the discrete probability distribution of gene trees conditioned upon a species tree with a particular topology and set of divergence times (in coalescent units), and thus provide a framework for measuring distances between species tree models in terms of their corresponding gene tree topology probabilities. We describe the computation of probabilistic species tree distances in the context of standard MSC models, which assume complete genetic isolation postspeciation, as well as recent theoretical extensions to the MSC in the form of network-based MSC models that relax this assumption and permit hybridization among taxa. We demonstrate these metrics using simulations and empirical species tree estimates and discuss both the benefits and limitations of these approaches. We make our species tree distance approach available as an R package called pSTDistanceR, for open use by the community.

Collapse

Wang HC, Susko E, Roger AJ. The Relative Importance of Modeling Site Pattern Heterogeneity Versus Partition-Wise Heterotachy in Phylogenomic Inference. Syst Biol 2020;68:1003-1019. [PMID: 31140564 DOI: 10.1093/sysbio/syz021] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2018] [Revised: 02/04/2019] [Accepted: 04/09/2019] [Indexed: 12/18/2022] Open

Abstract

Large taxa-rich genome-scale data sets are often necessary for resolving ancient phylogenetic relationships. But accurate phylogenetic inference requires that they are analyzed with realistic models that account for the heterogeneity in substitution patterns amongst the sites, genes and lineages. Two kinds of adjustments are frequently used: models that account for heterogeneity in amino acid frequencies at sites in proteins, and partitioned models that accommodate the heterogeneity in rates (branch lengths) among different proteins in different lineages (protein-wise heterotachy). Although partitioned and site-heterogeneous models are both widely used in isolation, their relative importance to the inference of correct phylogenies has not been carefully evaluated. We conducted several empirical analyses and a large set of simulations to compare the relative performances of partitioned models, site-heterogeneous models, and combined partitioned site heterogeneous models. In general, site-homogeneous models (partitioned or not) performed worse than site heterogeneous, except in simulations with extreme protein-wise heterotachy. Furthermore, simulations using empirically-derived realistic parameter settings showed a marked long-branch attraction (LBA) problem for analyses employing protein-wise partitioning even when the generating model included partitioning. This LBA problem results from a small sample bias compounded over many single protein alignments. In some cases, this problem was ameliorated by clustering similarly-evolving proteins together into larger partitions using the PartitionFinder method. Similar results were obtained under simulations with larger numbers of taxa or heterogeneity in simulating topologies over genes. For an empirical Microsporidia test data set, all but one tested site-heterogeneous models (with or without partitioning) obtain the correct Microsporidia+Fungi grouping, whereas site-homogenous models (with or without partitioning) did not. The single exception was the fully partitioned site-heterogeneous analysis that succumbed to the compounded small sample LBA bias. In general unless protein-wise heterotachy effects are extreme, it is more important to model site-heterogeneity than protein-wise heterotachy in phylogenomic analyses. Complete protein-wise partitioning should be avoided as it can lead to a serious LBA bias. In cases of extreme protein-wise heterotachy, approaches that cluster similarly-evolving proteins together and coupled with site-heterogeneous models work well for phylogenetic estimation.

Collapse

Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. LECTURE NOTES IN COMPUTER SCIENCE 2020. [DOI: 10.1007/978-3-030-45257-5_8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]

Williams TA, Cox CJ, Foster PG, Szöllősi GJ, Embley TM. Phylogenomics provides robust support for a two-domains tree of life. Nat Ecol Evol 2020;4:138-147. [PMID: 31819234 PMCID: PMC6942926 DOI: 10.1038/s41559-019-1040-x] [Citation(s) in RCA: 143] [Impact Index Per Article: 28.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2019] [Accepted: 10/15/2019] [Indexed: 11/09/2022]

Molloy EK, Warnow T. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol Biol 2019;14:14. [PMID: 31360216 PMCID: PMC6642500 DOI: 10.1186/s13015-019-0151-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2019] [Accepted: 06/13/2019] [Indexed: 12/26/2022] Open

Abstract

BACKGROUND

Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches.

RESULTS

In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and "concatenation" using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases.

CONCLUSIONS

Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge).

Collapse

Comparative Phylogenomics, a Stepping Stone for Bird Biodiversity Studies. DIVERSITY-BASEL 2019. [DOI: 10.3390/d11070115] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/02/2023]

Molloy EK, Warnow T. TreeMerge: a new method for improving the scalability of species tree estimation methods. Bioinformatics 2019;35:i417-i426. [PMID: 31510668 PMCID: PMC6612878 DOI: 10.1093/bioinformatics/btz344] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open

Spriggs EL, Schlutius C, Eaton DA, Park B, Sweeney PW, Edwards EJ, Donoghue MJ. Differences in flowering time maintain species boundaries in a continental radiation of Viburnum. AMERICAN JOURNAL OF BOTANY 2019;106:833-849. [PMID: 31124135 DOI: 10.1002/ajb2.1292] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/26/2018] [Accepted: 03/19/2019] [Indexed: 05/21/2023]

Statistical binning leads to profound model violation due to gene tree error incurred by trying to avoid gene tree error. Mol Phylogenet Evol 2019;134:164-171. [DOI: 10.1016/j.ympev.2019.02.012] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2018] [Revised: 11/30/2018] [Accepted: 02/14/2019] [Indexed: 11/19/2022]

Zhang Q, Rao S, Warnow T. Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy. Algorithms Mol Biol 2019;14:2. [PMID: 30839943 PMCID: PMC6364484 DOI: 10.1186/s13015-019-0136-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2018] [Accepted: 01/18/2019] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of number of leaves in the tree (once the shortest and longest branch weights are fixed). While there has been a large literature on AFC methods, the best in terms of empirical performance was D C M NJ , published in SODA 2001. The main empirical advantage of DCM NJ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, DCM NJ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy.

RESULTS

In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of DCM NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve scalability for phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).

Collapse

Siniscalchi CM, Loeuille B, Funk VA, Mandel JR, Pirani JR. Phylogenomics Yields New Insight Into Relationships Within Vernonieae (Asteraceae). FRONTIERS IN PLANT SCIENCE 2019;10:1224. [PMID: 31749813 PMCID: PMC6843069 DOI: 10.3389/fpls.2019.01224] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/04/2019] [Accepted: 09/04/2019] [Indexed: 05/10/2023]

Abstract

Asteraceae, or the sunflower family, is the largest family of flowering plants and is usually considered difficult to work with, not only due to its size, but also because of the abundant cases of polyploidy and ancient whole-genome duplications. Traditional molecular systematics studies were often impaired by the low levels of variation found in chloroplast markers and the high paralogy of traditional nuclear markers like ITS. Next-generation sequencing and novel phylogenomics methods, such as target capture and Hyb-Seq, have provided new ways of studying the phylogeny of the family with great success. While the resolution of the backbone of the family is in progress with some results already published, smaller studies focusing on internal clades of the phylogeny are important to increase sampling and allow morphological, biogeography, and diversification analyses, as well as serving as basis to test the current infrafamilial classification. Vernonieae is one of the largest tribes in the family, accounting for approximately 1,500 species. From the 1970s to the 1990s, the tribe went through several reappraisals, mainly due to the splitting of the mega genus Vernonia into several smaller segregates. Only three phylogenetic studies focusing on the Vernonieae have been published to date, both using a few molecular markers, overall presenting low resolution and support in deepest nodes, and presenting conflicting topologies when compared. In this study, we present the first attempt at studying the phylogeny of Vernonieae using phylogenomics. Even though our sampling includes only around 4% of the diversity of the tribe, we achieved complete resolution of the phylogeny with high support recovering approximately 700 nuclear markers obtained through target capture. We also analyzed the effect of missing data using two different matrices with different number of markers and the difference between concatenated and gene tree analysis.

Collapse