Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Mallo D, De Oliveira Martins L, Posada D. SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees. Syst Biol 2015;65:334-44. [PMID: 26526427 PMCID: PMC4748750 DOI: 10.1093/sysbio/syv082] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2015] [Accepted: 10/20/2015] [Indexed: 11/14/2022] Open

For:	Mallo D, De Oliveira Martins L, Posada D. SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees. Syst Biol 2015;65:334-44. [PMID: 26526427 PMCID: PMC4748750 DOI: 10.1093/sysbio/syv082] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2015] [Accepted: 10/20/2015] [Indexed: 11/14/2022] Open

Number

Cited by Other Article(s)

Cortada Garcia M, Diéguez Moscardó A, Casanellas M. Generating Heterogeneous Data on Gene Trees. J Comput Biol 2025. [PMID: 40340442 DOI: 10.1089/cmb.2024.0843] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/10/2025] Open

Arasti S, Tabaghi P, Tabatabaee Y, Mirarab S. Branch Length Transforms using Optimal Tree Metric Matching. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2023.11.13.566962. [PMID: 38746464 PMCID: PMC11092445 DOI: 10.1101/2023.11.13.566962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2024]

Abstract

The abundant discordance between evolutionary relationships across the genome has rekindled interest in methods for comparing and averaging trees on a shared leaf set. However, most attempts at comparing and matching trees have focused on tree topology. Comparing branch lengths has been more elusive due to several challenges. Species tree branch lengths can be measured in various units, often different from gene trees. Moreover, rates of evolution change across the genome, the species tree, and specific branches of gene trees. These factors compound the stochasticity of coalescence times and estimation noise, making branch lengths highly heterogeneous across the genome. For many downstream applications in phylogenomic analyses, branch lengths are as important as the topology, and yet, existing tools to compare and combine weighted trees are limited. In this paper, we address the question of matching one tree to another, accounting for their branch lengths. We define a series of computational problems called Topology-Constrained Metric Matching (TCMM) that seek to transform the branch lengths of a query tree based on a reference tree. We show that TCMM problems can be solved in quadratic time and memory using a linear algebraic formulation coupled with dynamic programming preprocessing. While many applications can be imagined for this framework, we explore two applications in this paper: embedding leaves of gene trees in Euclidean space to find outliers potentially indicative of errors and summarizing gene tree branch lengths onto the species tree. In these applications, our method, when paired with existing methods, increases their accuracy at limited computational expense.

Code and Data availability

The software is available at https://github.com/shayesteh99/TCMM.git . Data is available on Github https://github.com/shayesteh99/TCMM-Data.git .

Collapse

Zhang C, Nielsen R, Mirarab S. CASTER: Direct species tree inference from whole-genome alignments. Science 2025;387:eadk9688. [PMID: 39847611 PMCID: PMC12038793 DOI: 10.1126/science.adk9688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2023] [Revised: 08/05/2024] [Accepted: 12/04/2024] [Indexed: 01/25/2025]

Tabatabaee Y, Zhang C, Arasti S, Mirarab S. Species tree branch length estimation despite incomplete lineage sorting, duplication, and loss. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.20.639320. [PMID: 40027742 PMCID: PMC11870528 DOI: 10.1101/2025.02.20.639320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]

Abstract

Phylogenetic branch lengths are essential for many analyses, such as estimating divergence times, analyzing rate changes, and studying adaptation. However, true gene tree heterogeneity due to incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT) can complicate the estimation of species tree branch lengths. While several tools exist for estimating the topology of a species tree addressing various causes of gene tree discordance, much less attention has been paid to branch length estimation on multi-locus datasets. For single-copy gene trees, some methods are available that summarize gene tree branch lengths onto a species tree, including coalescent-based methods that account for heterogeneity due to ILS. However, no such branch length estimation method exists for multi-copy gene family trees that have evolved with gene duplication and loss. To address this gap, we introduce the CASTLES-Pro algorithm for estimating species tree branch lengths while accounting for both GDL and ILS. CASTLES-Pro improves on the existing coalescent-based branch length estimation method CASTLES by increasing its accuracy for single-copy gene trees and extends it to handle multi-copy ones. Our simulation studies show that CASTLES-Pro is generally more accurate than alternatives, eliminating the systematic bias toward overestimating terminal branch lengths often observed when using concatenation. Moreover, while not theoretically designed for HGT, we show that CASTLES-Pro maintains relatively high accuracy under high rates of random HGT.

Code availability

CASTLES-Pro is implemented inside the software package ASTER, available at https://github.com/chaoszhang/ASTER .

Data availability

The datasets and scripts used in this study are available at https://github.com/ytabatabaee/CASTLES-Pro-paper .

Collapse

Habib M, Roy K, Hasan S, Rahman AH, Bayzid MS. Terraces in species tree inference from gene trees. BMC Ecol Evol 2024;24:135. [PMID: 39497030 PMCID: PMC11533290 DOI: 10.1186/s12862-024-02309-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2023] [Accepted: 09/16/2024] [Indexed: 11/06/2024] Open

Leal JL, Milesi P, Hodková E, Zhou Q, James J, Eklund DM, Pyhäjärvi T, Salojärvi J, Lascoux M. Complex Polyploids: Origins, Genomic Composition, and Role of Introgressed Alleles. Syst Biol 2024;73:392-418. [PMID: 38613229 PMCID: PMC11282369 DOI: 10.1093/sysbio/syae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 12/18/2023] [Accepted: 03/28/2024] [Indexed: 04/14/2024] Open

Abstract

Introgression allows polyploid species to acquire new genomic content from diploid progenitors or from other unrelated diploid or polyploid lineages, contributing to genetic diversity and facilitating adaptive allele discovery. In some cases, high levels of introgression elicit the replacement of large numbers of alleles inherited from the polyploid's ancestral species, profoundly reshaping the polyploid's genomic composition. In such complex polyploids, it is often difficult to determine which taxa were the progenitor species and which taxa provided additional introgressive blocks through subsequent hybridization. Here, we use population-level genomic data to reconstruct the phylogenetic history of Betula pubescens (downy birch), a tetraploid species often assumed to be of allopolyploid origin and which is known to hybridize with at least four other birch species. This was achieved by modeling polyploidization and introgression events under the multispecies coalescent and then using an approximate Bayesian computation rejection algorithm to evaluate and compare competing polyploidization models. We provide evidence that B. pubescens is the outcome of an autoploid genome doubling event in the common ancestor of B. pendula and its extant sister species, B. platyphylla, that took place approximately 178,000-188,000 generations ago. Extensive hybridization with B. pendula, B. nana, and B. humilis followed in the aftermath of autopolyploidization, with the relative contribution of each of these species to the B. pubescens genome varying markedly across the species' range. Functional analysis of B. pubescens loci containing alleles introgressed from B. nana identified multiple genes involved in climate adaptation, while loci containing alleles derived from B. humilis revealed several genes involved in the regulation of meiotic stability and pollen viability in plant species.

Collapse

Affiliation(s)

J Luis Leal Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
Pascal Milesi Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden
Eva Hodková Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Kamýcká 129, 16521 Prague, Czech Republic
Qiujie Zhou Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
Jennifer James Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
D Magnus Eklund Physiology and Environmental Toxicology, Department of Organismal Biology, Uppsala University, Norbyvägen 18A, 75236 Uppsala, Sweden
Tanja Pyhäjärvi Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, and Viikki Plant Science Centre, University of Helsinki, P.O. Box 65 (Viikinkaari 1), 00014 Helsinki, Finland Department of Forest Sciences, University of Helsinki, 00014 Helsinki, Finland
Jarkko Salojärvi School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, and Viikki Plant Science Centre, University of Helsinki, P.O. Box 65 (Viikinkaari 1), 00014 Helsinki, Finland
Martin Lascoux Plant Ecology and Evolution, Department of Ecology and Genetics, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden Science for Life Laboratory (SciLifeLab), Uppsala University, 75237 Uppsala, Sweden

Collapse

Li Q, Chan YB, Galtier N, Scornavacca C. The Effect of Copy Number Hemiplasy on Gene Family Evolution. Syst Biol 2024;73:355-374. [PMID: 38330161 DOI: 10.1093/sysbio/syae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2022] [Revised: 01/24/2024] [Accepted: 02/03/2024] [Indexed: 02/10/2024] Open

Jiang Y, McDonald D, Perry D, Knight R, Mirarab S. Scaling DEPP phylogenetic placement to ultra-large reference trees: a tree-aware ensemble approach. Bioinformatics 2024;40:btae361. [PMID: 38870525 PMCID: PMC11193062 DOI: 10.1093/bioinformatics/btae361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2023] [Revised: 04/09/2024] [Accepted: 06/12/2024] [Indexed: 06/15/2024] Open

Rick JA, Brock CD, Lewanski AL, Golcher-Benavides J, Wagner CE. Reference Genome Choice and Filtering Thresholds Jointly Influence Phylogenomic Analyses. Syst Biol 2024;73:76-101. [PMID: 37881861 DOI: 10.1093/sysbio/syad065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2022] [Revised: 09/20/2023] [Accepted: 10/20/2023] [Indexed: 10/27/2023] Open

Abstract

Molecular phylogenies are a cornerstone of modern comparative biology and are commonly employed to investigate a range of biological phenomena, such as diversification rates, patterns in trait evolution, biogeography, and community assembly. Recent work has demonstrated that significant biases may be introduced into downstream phylogenetic analyses from processing genomic data; however, it remains unclear whether there are interactions among bioinformatic parameters or biases introduced through the choice of reference genome for sequence alignment and variant calling. We address these knowledge gaps by employing a combination of simulated and empirical data sets to investigate the extent to which the choice of reference genome in upstream bioinformatic processing of genomic data influences phylogenetic inference, as well as the way that reference genome choice interacts with bioinformatic filtering choices and phylogenetic inference method. We demonstrate that more stringent minor allele filters bias inferred trees away from the true species tree topology, and that these biased trees tend to be more imbalanced and have a higher center of gravity than the true trees. We find the greatest topological accuracy when filtering sites for minor allele count (MAC) >3-4 in our 51-taxa data sets, while tree center of gravity was closest to the true value when filtering for sites with MAC >1-2. In contrast, filtering for missing data increased accuracy in the inferred topologies; however, this effect was small in comparison to the effect of minor allele filters and may be undesirable due to a subsequent mutation spectrum distortion. The bias introduced by these filters differs based on the reference genome used in short read alignment, providing further support that choosing a reference genome for alignment is an important bioinformatic decision with implications for downstream analyses. These results demonstrate that attributes of the study system and dataset (and their interaction) add important nuance for how best to assemble and filter short-read genomic data for phylogenetic inference.

Collapse

Balaban M, Jiang Y, Zhu Q, McDonald D, Knight R, Mirarab S. Generation of accurate, expandable phylogenomic trees with uDance. Nat Biotechnol 2024;42:768-777. [PMID: 37500914 PMCID: PMC10818028 DOI: 10.1038/s41587-023-01868-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2022] [Accepted: 06/20/2023] [Indexed: 07/29/2023]

Arasti S, Mirarab S. Median quartet tree search algorithms using optimal subtree prune and regraft. Algorithms Mol Biol 2024;19:12. [PMID: 38481327 PMCID: PMC10938725 DOI: 10.1186/s13015-024-00257-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2023] [Accepted: 02/13/2024] [Indexed: 03/17/2024] Open

Górecki P, Rutecka N, Mykowiecka A, Paszek J. Unifying duplication episode clustering and gene-species mapping inference. Algorithms Mol Biol 2024;19:7. [PMID: 38355611 PMCID: PMC10865717 DOI: 10.1186/s13015-024-00252-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 01/04/2024] [Indexed: 02/16/2024] Open

Dai J, Rubel T, Han Y, Molloy EK. Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem. Algorithms Mol Biol 2024;19:2. [PMID: 38191515 PMCID: PMC10775561 DOI: 10.1186/s13015-023-00249-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Accepted: 12/10/2023] [Indexed: 01/10/2024] Open

Abstract

The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence / absence patterns for bats, birds, toothed whales as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.

Collapse

Comte A, Tricou T, Tannier E, Joseph J, Siberchicot A, Penel S, Allio R, Delsuc F, Dray S, de Vienne DM. PhylteR: Efficient Identification of Outlier Sequences in Phylogenomic Datasets. Mol Biol Evol 2023;40:msad234. [PMID: 37879113 PMCID: PMC10655845 DOI: 10.1093/molbev/msad234] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Revised: 09/29/2023] [Accepted: 10/18/2023] [Indexed: 10/27/2023] Open

Tabatabaee Y, Roch S, Warnow T. QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees Under the Coalescent. J Comput Biol 2023;30:1146-1181. [PMID: 37902986 DOI: 10.1089/cmb.2023.0185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/01/2023] Open

Fogg J, Allman ES, Ané C. PhyloCoalSimulations: A Simulator for Network Multispecies Coalescent Models, Including a New Extension for the Inheritance of Gene Flow. Syst Biol 2023;72:1171-1179. [PMID: 37254872 DOI: 10.1093/sysbio/syad030] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2023] [Revised: 05/03/2023] [Accepted: 05/15/2023] [Indexed: 06/01/2023] Open

Tabatabaee Y, Zhang C, Warnow T, Mirarab S. Phylogenomic branch length estimation using quartets. Bioinformatics 2023;39:i185-i193. [PMID: 37387151 PMCID: PMC10311336 DOI: 10.1093/bioinformatics/btad221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open

Leal JL, Milesi P, Salojärvi J, Lascoux M. Phylogenetic Analysis of Allotetraploid Species Using Polarized Genomic Sequences. Syst Biol 2023;72:372-390. [PMID: 36932679 PMCID: PMC10275558 DOI: 10.1093/sysbio/syad009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2022] [Revised: 10/14/2022] [Accepted: 03/10/2023] [Indexed: 03/19/2023] Open

Abstract

Phylogenetic analysis of polyploid hybrid species has long posed a formidable challenge as it requires the ability to distinguish between alleles of different ancestral origins in order to disentangle their individual evolutionary history. This problem has been previously addressed by conceiving phylogenies as reticulate networks, using a two-step phasing strategy that first identifies and segregates homoeologous loci and then, during a second phasing step, assigns each gene copy to one of the subgenomes of an allopolyploid species. Here, we propose an alternative approach, one that preserves the core idea behind phasing-to produce separate nucleotide sequences that capture the reticulate evolutionary history of a polyploid-while vastly simplifying its implementation by reducing a complex multistage procedure to a single phasing step. While most current methods used for phylogenetic reconstruction of polyploid species require sequencing reads to be pre-phased using experimental or computational methods-usually an expensive, complex, and/or time-consuming endeavor-phasing executed using our algorithm is performed directly on the multiple-sequence alignment (MSA), a key change that allows for the simultaneous segregation and sorting of gene copies. We introduce the concept of genomic polarization that, when applied to an allopolyploid species, produces nucleotide sequences that capture the fraction of a polyploid genome that deviates from that of a reference sequence, usually one of the other species present in the MSA. We show that if the reference sequence is one of the parental species, the polarized polyploid sequence has a close resemblance (high pairwise sequence identity) to the second parental species. This knowledge is harnessed to build a new heuristic algorithm where, by replacing the allopolyploid genomic sequence in the MSA by its polarized version, it is possible to identify the phylogenetic position of the polyploid's ancestral parents in an iterative process. The proposed methodology can be used with long-read and short-read high-throughput sequencing data and requires only one representative individual for each species to be included in the phylogenetic analysis. In its current form, it can be used in the analysis of phylogenies containing tetraploid and diploid species. We test the newly developed method extensively using simulated data in order to evaluate its accuracy. We show empirically that the use of polarized genomic sequences allows for the correct identification of both parental species of an allotetraploid with up to 97% certainty in phylogenies with moderate levels of incomplete lineage sorting (ILS) and 87% in phylogenies containing high levels of ILS. We then apply the polarization protocol to reconstruct the reticulate histories of Arabidopsis kamchatica and Arabidopsis suecica, two allopolyploids whose ancestry has been well documented. [Allopolyploidy; Arabidopsis; genomic polarization; homoeologs; incomplete lineage sorting; phasing; polyploid phylogenetics; reticulate evolution.].

Collapse

Allman ES, Banos H, Rhodes JA. Testing Multispecies Coalescent Simulators Using Summary Statistics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023;20:1613-1618. [PMID: 35617176 PMCID: PMC10183998 DOI: 10.1109/tcbb.2022.3177956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]

Willson J, Tabatabaee Y, Liu B, Warnow T. DISCO+QR: rooting species trees in the presence of GDL and ILS. BIOINFORMATICS ADVANCES 2023;3:vbad015. [PMID: 36789293 PMCID: PMC9923442 DOI: 10.1093/bioadv/vbad015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/12/2022] [Revised: 01/21/2023] [Accepted: 02/06/2023] [Indexed: 02/10/2023]

Morel B, Williams TA, Stamatakis A. Asteroid: a new algorithm to infer species trees from gene trees under high proportions of missing data. Bioinformatics 2022;39:6964379. [PMID: 36576010 PMCID: PMC9838317 DOI: 10.1093/bioinformatics/btac832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2022] [Revised: 12/12/2022] [Accepted: 12/26/2022] [Indexed: 12/29/2022] Open

Super-Mitobarcoding in Plant Species Identification? It Can Work! The Case of Leafy Liverworts Belonging to the Genus Calypogeia. Int J Mol Sci 2022;23:ijms232415570. [PMID: 36555212 PMCID: PMC9779425 DOI: 10.3390/ijms232415570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Revised: 12/04/2022] [Accepted: 12/05/2022] [Indexed: 12/14/2022] Open

Zhang C, Mirarab S. Weighting by Gene Tree Uncertainty Improves Accuracy of Quartet-based Species Trees. Mol Biol Evol 2022;39:6750035. [PMID: 36201617 PMCID: PMC9750496 DOI: 10.1093/molbev/msac215] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 09/20/2022] [Accepted: 10/03/2022] [Indexed: 01/07/2023] Open

Balaban M, Bristy NA, Faisal A, Bayzid MS, Mirarab S. Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model. BIOINFORMATICS ADVANCES 2022;2:vbac055. [PMID: 35992043 PMCID: PMC9383262 DOI: 10.1093/bioadv/vbac055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/21/2022] [Accepted: 08/09/2022] [Indexed: 01/27/2023]

Key Words Collapse

MESH Headings Collapse

Grants Collapse

Affiliation(s)
Metin Balaban
Nishat Anjum Bristy
Ahnaf Faisal
Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
Md Shamsuzzoha Bayzid
Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
Siavash Mirarab
To whom correspondence should be addressed.
Collapse

Collaborators Collapse

Cui X, Xue Y, McCormack C, Garces A, Rachman TW, Yi Y, Stolzer M, Durand D. Simulating domain architecture evolution. Bioinformatics 2022;38:i134-i142. [PMID: 35758772 PMCID: PMC9236583 DOI: 10.1093/bioinformatics/btac242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open

Abstract

Motivation

Simulation is an essential technique for generating biomolecular data with a ‘known’ history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty years of molecular evolution research have produced a rich repertoire of simulation methods. However, current models do not capture the stringent constraints acting on the domain insertions, duplications, and deletions by which multidomain architectures evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations are observed in nature. Modeling these stringent constraints on domain order and co-occurrence is a fundamental challenge in domain architecture simulation that does not arise with sequence and gene family simulation.

Results

Here, we introduce a stochastic model of domain architecture evolution to simulate evolutionary trajectories that reflect the constraints on domain order and co-occurrence observed in nature. This framework is implemented in a novel domain architecture simulator, DomArchov, using the Metropolis–Hastings algorithm with data-driven transition probabilities. The use of a data-driven event module enables quick and easy redeployment of the simulator for use in different taxonomic and protein function contexts. Using empirical evaluation with metazoan datasets, we demonstrate that domain architectures simulated by DomArchov recapitulate properties of genuine domain architectures that reflect the constraints on domain order and adjacency seen in nature. This work expands the realm of evolutionary processes that are amenable to simulation.

Availability and implementation

DomArchov is written in Python 3 and is available at http://www.cs.cmu.edu/~durand/DomArchov. The data underlying this article are available via the same link.

Supplementary information

Supplementary data are available at Bioinformatics online.

Collapse

Carson J, Ledda A, Ferretti L, Keeling M, Didelot X. The bounded coalescent model: Conditioning a genealogy on a minimum root date. J Theor Biol 2022;548:111186. [PMID: 35697144 DOI: 10.1016/j.jtbi.2022.111186] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2022] [Revised: 05/05/2022] [Accepted: 06/02/2022] [Indexed: 01/27/2023]

Wawerka M, Dąbkowski D, Rutecka N, Mykowiecka A, Górecki P. Embedding gene trees into phylogenetic networks by conflict resolution algorithms. Algorithms Mol Biol 2022;17:11. [PMID: 35590416 PMCID: PMC9119282 DOI: 10.1186/s13015-022-00218-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Accepted: 03/22/2022] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Phylogenetic networks are mathematical models of evolutionary processes involving reticulate events such as hybridization, recombination, or horizontal gene transfer. One of the crucial notions in phylogenetic network modelling is displayed tree, which is obtained from a network by removing a set of reticulation edges. Displayed trees may represent an evolutionary history of a gene family if the evolution is shaped by reticulation events.

RESULTS

We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence and duplication costs. We propose an O(mn)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where m and n are the sizes of G and N, respectively. In addition, our algorithm can verify whether the solution is exact. Moreover, it provides a set of reticulation edges corresponding to the obtained cost. If the cost is exact, the set induces an optimal displayed tree. Otherwise, the set contains pairs of conflicting edges, i.e., edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires [Formula: see text] invocations of DP in the worst case, where r is the number of reticulations. We propose a similar [Formula: see text]-time algorithm for level-k tree-child networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also extend the algorithms to a broader class of phylogenetic networks. Based on simulated data, the average runtime is [Formula: see text] under the deep-coalescence cost and [Formula: see text] under the duplication cost.

CONCLUSIONS

Despite exponential complexity in the worst case, our algorithms perform significantly well on empirical and simulated datasets, due to the strategy of resolving internal dissimilarities between gene trees and networks. Therefore, the algorithms are efficient alternatives to enumeration strategies commonly proposed in the literature and enable analyses of complex networks with dozens of reticulations.

Collapse

Jiang Y, Balaban M, Zhu Q, Mirarab S. DEPP: Deep Learning Enables Extending Species Trees using Single Genes. Syst Biol 2022;72:17-34. [PMID: 35485976 DOI: 10.1093/sysbio/syac031] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2021] [Revised: 04/13/2022] [Accepted: 04/22/2022] [Indexed: 11/13/2022] Open

Willson J, Roddur MS, Liu B, Zaharias P, Warnow T. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition. Syst Biol 2022;71:610-629. [PMID: 34450658 PMCID: PMC9016570 DOI: 10.1093/sysbio/syab070] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2021] [Revised: 08/18/2021] [Accepted: 08/23/2021] [Indexed: 11/21/2022] Open

Yan Z, Smith ML, Du P, Hahn MW, Nakhleh L. Species Tree Inference Methods Intended to Deal with Incomplete Lineage Sorting Are Robust to the Presence of Paralogs. Syst Biol 2022;71:367-381. [PMID: 34245291 PMCID: PMC8978208 DOI: 10.1093/sysbio/syab056] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2020] [Revised: 06/23/2021] [Accepted: 06/30/2021] [Indexed: 11/24/2022] Open

Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: A Tool for Maximum Likelihood Species Tree Inference from Gene Family Trees under Duplication, Transfer, and Loss. Mol Biol Evol 2022;39:msab365. [PMID: 35021210 PMCID: PMC8826479 DOI: 10.1093/molbev/msab365] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open

Rabiee M, Mirarab S. OUP accepted manuscript. Bioinformatics 2022;38:i413-i421. [PMID: 35758818 PMCID: PMC9235488 DOI: 10.1093/bioinformatics/btac265] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Höhler D, Pfeiffer W, Ioannidis V, Stockinger H, Stamatakis A. RAxML Grove: an empirical phylogenetic tree database. Bioinformatics 2021;38:1741-1742. [PMID: 34962976 PMCID: PMC8896636 DOI: 10.1093/bioinformatics/btab863] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2021] [Revised: 12/16/2021] [Accepted: 12/22/2021] [Indexed: 02/03/2023] Open

Smith ML, Hahn MW. The Frequency and Topology of Pseudoorthologs. Syst Biol 2021;71:649-659. [PMID: 34951639 DOI: 10.1093/sysbio/syab097] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2021] [Revised: 12/15/2021] [Accepted: 12/17/2021] [Indexed: 11/12/2022] Open

Walker JF, Smith SA, Hodel RGJ, Moyroud E. Concordance-based approaches for the inference of relationships and molecular rates with phylogenomic datasets. Syst Biol 2021;71:943-958. [PMID: 34240209 DOI: 10.1093/sysbio/syab052] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2020] [Revised: 06/23/2021] [Accepted: 07/01/2021] [Indexed: 11/12/2022] Open

Abstract

Gene tree conflict is common and finding methods to analyze and alleviate the negative effects that conflict has on species tree analysis is a crucial part of phylogenomics. This study aims to expand the discussion of inferring species trees and molecular branch lengths when conflict is present. Conflict is typically examined in two ways: inferring its prevalence, and inferring the influence of the individual genes (how strongly one gene supports any given topology compared to an alternative topology). Here, we examine a procedure for incorporating both conflict and the influence of genes in order to infer evolutionary relationships. All supported relationships in the gene trees are analyzed and the likelihood of the genes constrained to these relationships is summed to provide a likelihood for the relationship. Consensus tree assembly is conducted based on the sum of likelihoods for a given relationship and choosing relationships based on the most likely relationship assuming it does not conflict with a relationship that has a higher likelihood score. If it is not possible for all most likely relationships to be combined into a single bifurcating tree then multiple trees are produced and a consensus tree with a polytomy is created. This procedure allows for more influential genes to have greater influence on an inferred relationship, does not assume conflict has arisen from any one source, and does not force the dataset to produce a single bifurcating tree. Using this approach on three empirical datasets, we examine and discuss the relationship between influence and prevalence of gene tree conflict. We find that in one of the datasets, assembling a bifurcating consensus tree solely composed of the most likely relationships is impossible. To account for conflict in molecular rate analysis we also introduce a concordance-based approach to the summary and estimation of branch lengths suitable for downstream comparative analyses. We demonstrate through simulation that even under high levels of stochastic conflict, the mean and median of the concordant rates recapitulate the true molecular rate better than using a supermatrix approach. Using a large phylogenomic dataset, we examine rate heterogeneity across concordant genes with a focus on the branch subtending crown angiosperms. Notably, we find highly variable rates of evolution along the branch subtending crown angiosperms. The approaches outlined here have several limitations, but they also represent some alternative methods for harnessing the complexity of phylogenomic datasets and enrich our inferences of both species' relationships and evolutionary processes.

Collapse

Paszek J, Markin A, Górecki P, Eulenstein O. Taming the Duplication-Loss-Coalescence Model with Integer Linear Programming. J Comput Biol 2021;28:758-773. [PMID: 34125600 DOI: 10.1089/cmb.2021.0011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Dismukes W, Heath TA. treeducken: An R package for simulating cophylogenetic systems. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

DNA Barcoding of Marine Mollusks Associated with Corallina officinalis Turfs in Southern Istria (Adriatic Sea). DIVERSITY 2021. [DOI: 10.3390/d13050196] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]

Morel B, Kozlov AM, Stamatakis A, Szöllősi GJ. GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss. Mol Biol Evol 2021;37:2763-2774. [PMID: 32502238 PMCID: PMC8312565 DOI: 10.1093/molbev/msaa141] [Citation(s) in RCA: 80] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open

Abstract

Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).

Collapse

Shen XX, Steenwyk JL, Rokas A. Dissecting incongruence between concatenation- and quartet-based approaches in phylogenomic data. Syst Biol 2021;70:997-1014. [PMID: 33616672 DOI: 10.1093/sysbio/syab011] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2020] [Revised: 02/10/2021] [Accepted: 02/17/2021] [Indexed: 12/12/2022] Open

Abstract

Topological conflict or incongruence is widespread in phylogenomic data. Concatenation- and coalescent-based approaches often result in incongruent topologies, but the causes of this conflict can be difficult to characterize. We examined incongruence stemming from conflict between likelihood-based signal (quantified by the difference in gene-wise log likelihood score or ΔGLS) and quartet-based topological signal (quantified by the difference in gene-wise quartet score or ΔGQS) for every gene in three phylogenomic studies in animals, fungi, and plants, which were chosen because their concatenation-based IQ-TREE (T1) and quartet-based ASTRAL (T2) phylogenies are known to produce eight conflicting internal branches (bipartitions). By comparing the types of phylogenetic signal for all genes in these three data matrices, we found that 30% - 36% of genes in each data matrix are inconsistent, that is, each of these genes has higher log likelihood score for T1 versus T2 (i.e., ΔGLS >0) whereas its T1 topology has lower quartet score than its T2 topology (i.e., ΔGQS <0) or vice versa. Comparison of inconsistent and consistent genes using a variety of metrics (e.g., evolutionary rate, gene tree topology, distribution of branch lengths, hidden paralogy, and gene tree discordance) showed that inconsistent genes are more likely to recover neither T1 nor T2 and have higher levels of gene tree discordance than consistent genes. Simulation analyses demonstrate that removal of inconsistent genes from datasets with low levels of incomplete lineage sorting (ILS) and low and medium levels of gene tree estimation error (GTEE) reduced incongruence and increased accuracy. In contrast, removal of inconsistent genes from datasets with medium and high ILS levels and high GTEE levels eliminated or extensively reduced incongruence, but the resulting congruent species phylogenies were not always topologically identical to the true species trees.

Collapse

Allman ES, Mitchell JD, Rhodes JA. Gene tree discord, simplex plots, and statistical tests under the coalescent. Syst Biol 2021;71:929-942. [PMID: 33560348 DOI: 10.1093/sysbio/syab008] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2020] [Revised: 01/31/2021] [Accepted: 02/03/2021] [Indexed: 02/06/2023] Open

Rabiee M, Mirarab S. SODA: Multi-locus species delimitation using quartet frequencies. Bioinformatics 2021;36:5623-5631. [PMID: 33555318 DOI: 10.1093/bioinformatics/btaa1010] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2020] [Revised: 10/19/2020] [Accepted: 11/21/2020] [Indexed: 11/14/2022] Open

Tabaszewski P, Gorecki P, Markin A, Anderson T, Eulenstein O. Consensus of All Solutions for Intractable Phylogenetic Tree Inference. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:149-161. [PMID: 31613775 DOI: 10.1109/tcbb.2019.2947051] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]

Legried B, Molloy EK, Warnow T, Roch S. Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss. J Comput Biol 2020;28:452-468. [DOI: 10.1089/cmb.2020.0424] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open

Li Q, Scornavacca C, Galtier N, Chan YB. The Multilocus Multispecies Coalescent: A Flexible New Model of Gene Family Evolution. Syst Biol 2020;70:822-837. [PMID: 33169795 DOI: 10.1093/sysbio/syaa084] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Revised: 05/07/2020] [Accepted: 10/19/2020] [Indexed: 02/06/2023] Open

Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Mol Biol Evol 2020;37:3292-3307. [PMID: 32886770 PMCID: PMC7751180 DOI: 10.1093/molbev/msaa139] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open

Davín AA, Tricou T, Tannier E, de Vienne DM, Szöllősi GJ. Zombi: a phylogenetic simulator of trees, genomes and sequences that accounts for dead linages. Bioinformatics 2020;36:1286-1288. [PMID: 31566657 PMCID: PMC7031779 DOI: 10.1093/bioinformatics/btz710] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2019] [Revised: 09/09/2019] [Accepted: 09/26/2019] [Indexed: 11/14/2022] Open

Van Dam MH, Henderson JB, Esposito L, Trautwein M. Genomic Characterization and Curation of UCEs Improves Species Tree Reconstruction. Syst Biol 2020;70:307-321. [PMID: 32750133 PMCID: PMC7875437 DOI: 10.1093/sysbio/syaa063] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2019] [Revised: 07/26/2020] [Accepted: 07/29/2020] [Indexed: 12/12/2022] Open

Abstract

Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.]

Collapse

Bobay LM. CoreSimul: a forward-in-time simulator of genome evolution for prokaryotes modeling homologous recombination. BMC Bioinformatics 2020;21:264. [PMID: 32580695 PMCID: PMC7315543 DOI: 10.1186/s12859-020-03619-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2020] [Accepted: 06/19/2020] [Indexed: 12/26/2022] Open

Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genom Bioinform 2020;2:lqaa041. [PMID: 33575594 PMCID: PMC7671319 DOI: 10.1093/nargab/lqaa041] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2020] [Revised: 05/18/2020] [Accepted: 06/04/2020] [Indexed: 12/15/2022] Open