1
Pyron RA, O'Connell KA, Myers EA, Beamer DA, Baños H. Complex Hybridization in a Clade of Polytypic Salamanders (Plethodontidae: Desmognathus) Uncovered by Estimating Higher-Level Phylogenetic Networks. Syst Biol 2025; 74:124-140. [PMID: 39468736 DOI: 10.1093/sysbio/syae060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 07/22/2024] [Accepted: 10/24/2024] [Indexed: 10/30/2024] Open
Abstract
Reticulation between radiating lineages is a common feature of diversification. We examine these phenomena in the Pisgah clade of Desmognathus salamanders from the southern Appalachian Mountains of the eastern United States. The group contains 4-7 species exhibiting 2 discrete phenotypes, aquatic "shovel-nosed" and semi-aquatic "black-bellied" forms. These ecomorphologies are ancient and have apparently been transmitted repeatedly between lineages through introgression. Geographically proximate populations of both phenotypes exhibit admixture, and at least 2 black-bellied lineages have been produced via reticulations between shovel-nosed parentals, suggesting potential hybrid speciation dynamics. However, computational constraints currently limit our ability to reconstruct network radiations from gene-tree data. Available methods are limited to level-1 networks wherein reticulations do not share edges, and higher-level networks may be non-identifiable in many cases. We present a heuristic approach to recover information from higher-level networks across a range of potentially identifiable empirical scenarios, supported by theory and simulation. When extrinsic information indicates the location and direction of reticulations, our method can successfully estimate a reduced possible set of nonlevel-1 networks. Phylogenomic data support a single backbone topology with up to 5 overlapping hybrid edges in the Pisgah clade. These results suggest an unusual mechanism of ecomorphological hybrid speciation, wherein a binary threshold trait causes some hybrid populations to shift between microhabitat niches, promoting ecological divergence between sympatric hybrids and parentals. This contrasts with other well-known systems in which hybrids exhibit intermediate, novel, or transgressive phenotypes. The genetic basis of these phenotypes is unclear and further data are needed to clarify the evolutionary basis of morphological changes with ecological consequences.
Affiliation(s)
- R Alexander Pyron
  - Department of Biological Sciences, The George Washington University, 2029 G St. NW, Washington, DC 20052, USA
  - Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, 10th St. & Constitution Ave. NW, Washington, DC 20560-0162, USA
- Kyle A O'Connell
  - Department of Biological Sciences, The George Washington University, 2029 G St. NW, Washington, DC 20052, USA
  - Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, 10th St. & Constitution Ave. NW, Washington, DC 20560-0162, USA
  - Deloitte Consulting LLP, Health Data and AI, 1919 North Lynn St., Arlington, VA 22209, USA
- Edward A Myers
  - Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, 10th St. & Constitution Ave. NW, Washington, DC 20560-0162, USA
  - Department of Herpetology, California Academy of Sciences, 55 Music Concourse Dr., San Francisco, CA 94118, USA
- David A Beamer
  - Office of Research, Economic Development and Engagement, East Carolina University, 209 East 5th St., Greenville, NC 27858, USA
- Hector Baños
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, 5850 College St., Halifax, NS B3H 4R2, Canada
  - Department of Mathematics and Statistics, Faculty of Science, Dalhousie University, 6297 Castine Way, Halifax, NS B3H 4R2, Canada
  - Department of Mathematics, California State University San Bernardino, 5500 University Pkwy, San Bernardino, CA, USA
2
Allman ES, Baños H, Mitchell JD, Rhodes JA. TINNiK: inference of the tree of blobs of a species network under the coalescent model. Algorithms Mol Biol 2024; 19:23. [PMID: 39501362 PMCID: PMC11539473 DOI: 10.1186/s13015-024-00266-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2024] [Accepted: 08/22/2024] [Indexed: 11/08/2024] Open
Abstract
The tree of blobs of a species network shows only the tree-like aspects of relationships of taxa on a network, omitting information on network substructures where hybridization or other types of lateral transfer of genetic information occur. By isolating such regions of a network, inference of the tree of blobs can serve as a starting point for a more detailed investigation, or indicate the limit of what may be inferrable without additional assumptions. Building on our theoretical work on the identifiability of the tree of blobs from gene quartet distributions under the Network Multispecies Coalescent model, we develop an algorithm, TINNiK, for statistically consistent tree of blobs inference. We provide examples of its application to both simulated and empirical datasets, utilizing an implementation in the MSCquartets 2.0 R package.
Affiliation(s)
- Elizabeth S Allman
  - Department of Mathematics and Statistics, University of Alaska, Fairbanks, AK, USA
- Hector Baños
  - Department of Mathematics, California State University San Bernardino, San Bernardino, CA, USA
- Jonathan D Mitchell
  - School of Natural Sciences (Mathematics), University of Tasmania, Hobart, TAS, Australia
  - ARC Centre of Excellence for Plant Success in Nature and Agriculture, University of Tasmania, Hobart, TAS, Australia
- John A Rhodes
  - Department of Mathematics and Statistics, University of Alaska, Fairbanks, AK, USA
3
Allman ES, Baños H, Mitchell JD, Rhodes JA. TINNiK: Inference of the Tree of Blobs of a Species Network Under the Coalescent. bioRxiv 2024:2024.04.20.590418. [PMID: 38712257 PMCID: PMC11071406 DOI: 10.1101/2024.04.20.590418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
The tree of blobs of a species network shows only the tree-like aspects of relationships of taxa on a network, omitting information on network substructures where hybridization or other types of lateral transfer of genetic information occur. By isolating such regions of a network, inference of the tree of blobs can serve as a starting point for a more detailed investigation, or indicate the limit of what may be inferrable without additional assumptions. Building on our theoretical work on the identifiability of the tree of blobs from gene quartet distributions under the Network Multispecies Coalescent model, we develop an algorithm, TINNiK, for statistically consistent tree of blobs inference. We provide examples of its application to both simulated and empirical datasets, utilizing an implementation in the MSCquartets 2.0 R package.
Affiliation(s)
- Elizabeth S. Allman
  - Department of Mathematics and Statistics, University of Alaska, Fairbanks, AK, USA
- Hector Baños
  - Department of Mathematics, California State University San Bernardino, San Bernardino, CA, USA
- Jonathan D. Mitchell
  - School of Natural Sciences (Mathematics), University of Tasmania, Hobart, TAS, Australia
  - ARC Centre of Excellence for Plant Success in Nature and Agriculture, University of Tasmania, Hobart, TAS, Australia
- John A. Rhodes
  - Department of Mathematics and Statistics, University of Alaska, Fairbanks, AK, USA
4
Allman ES, Baños H, Rhodes JA. Testing Multispecies Coalescent Simulators Using Summary Statistics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1613-1618. [PMID: 35617176 PMCID: PMC10183998 DOI: 10.1109/tcbb.2022.3177956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/17/2023]
Abstract
As genomic-scale datasets motivate research on species tree inference, simulators of the multispecies coalescent (MSC) process have become essential for the testing and evaluation of new inference methods. However, the simulators themselves must be tested to ensure that they give valid samples. This work develops methods for checking whether a collection of gene trees is in accord with the MSC model on a given species tree. When applied to well-known simulators, we find that several give flawed samples. The tests presented are capable of validating both topological and metric properties of gene tree samples, and are implemented in a freely available R package MSCsimtester so that developers and users may easily apply them.
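As an illustration of the kind of topological check described in this abstract, the sketch below compares observed gene-quartet counts with their expectation under the MSC. It is not the MSCsimtester implementation. It assumes the standard result that, for a four-taxon species tree with internal branch length x in coalescent units, the gene quartet matching the species tree has probability 1 - (2/3)e^(-x) and each of the two discordant quartets has probability (1/3)e^(-x). The function names and the example counts are hypothetical.

```python
import math

def expected_quartet_probs(x):
    """Expected quartet probabilities under the MSC on a four-taxon species tree.

    x is the internal branch length in coalescent units. Returns a tuple
    (concordant, discordant_1, discordant_2).
    """
    if x < 0:
        raise ValueError("branch length must be non-negative")
    minor = math.exp(-x) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)

def pearson_chi_square(counts, probs):
    """Pearson chi-square statistic of observed counts against expected probabilities."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

# Hypothetical counts of the three quartet topologies in 1000 simulated gene trees.
observed = (700, 160, 140)
probs = expected_quartet_probs(1.0)
statistic = pearson_chi_square(observed, probs)

# With 3 categories and no fitted parameters there are 2 degrees of freedom;
# the 5% critical value of the chi-square distribution is about 5.991.
print("chi-square statistic:", round(statistic, 3))
print("reject MSC fit at the 5% level:", statistic > 5.991)
```

A statistic above the critical value suggests that the sample's quartet frequencies do not match the model for the assumed branch length.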
5
Allman ES, Baños H, Mitchell JD, Rhodes JA. The tree of blobs of a species network: identifiability under the coalescent. J Math Biol 2022; 86:10. [PMID: 36472708 PMCID: PMC10062380 DOI: 10.1007/s00285-022-01838-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/06/2022] [Revised: 08/31/2022] [Accepted: 11/17/2022] [Indexed: 12/12/2022]
Abstract
Inference of species networks from genomic data under the Network Multispecies Coalescent Model is currently severely limited by heavy computational demands. It also remains unclear how complicated networks can be for consistent inference to be possible. As a step toward inferring a general species network, this work considers its tree of blobs, in which non-cut edges are contracted to nodes, so only tree-like relationships between the taxa are shown. An identifiability theorem, that most features of the unrooted tree of blobs can be determined from the distribution of gene quartet topologies, is established. This depends upon an analysis of gene quartet concordance factors under the model, together with a new combinatorial inference rule. The arguments for this theoretical result suggest a practical algorithm for tree of blobs inference, to be fully developed in a subsequent work.
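The construction named in this abstract, contracting non-cut edges to nodes, can be sketched on a known network as follows. This is only a graph-theoretic illustration, not the inference procedure of the paper: it assumes the network is given as a connected, undirected, simple graph (no parallel edges), ignores edge directions, and does not suppress any degree-2 nodes that contraction may leave behind. All function names and the example network are hypothetical.

```python
from collections import defaultdict

def find_cut_edges(adj):
    """Return the cut edges (bridges) of a connected undirected simple graph."""
    order, low, bridges = {}, {}, set()

    def dfs(node, parent):
        order[node] = low[node] = len(order)
        for nxt in adj[node]:
            if nxt == parent:
                continue
            if nxt in order:
                # Back edge: node lies on a cycle reaching an earlier vertex.
                low[node] = min(low[node], order[nxt])
            else:
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > order[node]:
                    # Nothing below nxt reaches node or above: a cut edge.
                    bridges.add(frozenset((node, nxt)))

    dfs(next(iter(adj)), None)
    return bridges

def tree_of_blobs(edges):
    """Contract every non-cut edge; return (blob label per node, tree edges)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    bridges = find_cut_edges(adj)

    # Nodes joined by non-cut edges collapse into the same blob.
    blob = {}
    for start in adj:
        if start in blob:
            continue
        blob[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in blob and frozenset((node, nxt)) not in bridges:
                    blob[nxt] = start
                    stack.append(nxt)

    tree_edges = set()
    for edge in bridges:
        u, v = tuple(edge)
        tree_edges.add(frozenset((blob[u], blob[v])))
    return blob, tree_edges

# Hypothetical network: taxa e, f, b, c, d; the cycle u-v-h-w forms one blob.
network = [("e", "x"), ("f", "x"), ("x", "u"), ("u", "v"), ("u", "w"),
           ("v", "h"), ("w", "h"), ("h", "b"), ("v", "c"), ("w", "d")]
blob, tree_edges = tree_of_blobs(network)
print(len(set(blob.values())), "nodes and", len(tree_edges), "edges in the tree of blobs")
```

The recursive depth-first search is adequate for small examples; a large network would call for an iterative version.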
Affiliation(s)
- Elizabeth S Allman
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK, 99775, USA
- Hector Baños
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
  - Department of Mathematics and Statistics, Faculty of Science, Dalhousie University, Halifax, NS, Canada
- Jonathan D Mitchell
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK, 99775, USA
  - School of Natural Sciences (Mathematics), University of Tasmania, Hobart, TAS, 7001, Australia
  - ARC Centre of Excellence for Plant Success in Nature and Agriculture, University of Tasmania, Hobart, TAS, 7001, Australia
- John A Rhodes
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK, 99775, USA
6
Zhelezov G, Degnan JH. Trying Out a Million Genes to Find the Perfect Pair with RTIST. Bioinformatics 2022; 38:3565-3573. [PMID: 35641003 DOI: 10.1093/bioinformatics/btac349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Revised: 05/07/2022] [Accepted: 05/17/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Consensus methods can be used for reconstructing a species tree from several gene trees which exhibit incompatible topologies due to incomplete lineage sorting. Motivated by the fact that there are no anomalous rooted gene trees with three taxa and no anomalous unrooted gene trees with four taxa in the multispecies coalescent model, several contemporary methods form the gene tree consensus by finding the median tree with respect to the triplet or quartet distance, i.e., they estimate the species tree as the tree that minimizes the sum of triplet or quartet distances to the input gene trees. These methods reformulate the solution to the consensus problem as the solution to a recursively solved dynamic programming problem. We present an iterative, easily parallelizable approach to finding the exact median triplet tree, and implement it as an open source software package which can also find suboptimal consensus trees within a specified triplet distance to the gene trees. The most time-consuming step for methods of this type is the creation of a weights array for all possible subtree bipartitions. By grouping the relevant calculations and array update operations of different bipartitions of the same subtree together, this implementation finds the exact median tree of many gene trees faster than comparable methods, has better scaling properties with respect to the number of gene trees, and has a smaller memory footprint. RESULTS RTIST (Rooted Triple Inference of Species Trees) finds the exact median triplet tree of a set of gene trees. Its runtime and memory footprints scale better than existing algorithms. RTIST can resolve all the non-unique median trees, as well as sub-optimal consensus trees within a user-specified triplet distance to the median. Although it is limited in the number of taxa (≤ 20), its runtime changes little when the number of gene trees is changed by several orders of magnitude. AVAILABILITY RTIST is written in C and Python. It is freely available at https://github.com/glebzhelezov/rtist.
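The median-tree criterion stated in this abstract can be illustrated by brute force on a handful of candidate trees. The sketch below is not RTIST's dynamic programming algorithm and does not use its interface; it assumes rooted trees written as nested Python tuples with string leaf labels, all on the same taxon set, and the function names are hypothetical.

```python
from itertools import combinations

def clusters(tree):
    """Return (leaf set, list of clusters) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)
    return leaves, found

def rooted_triple(cluster_list, a, b, c):
    """Return the two taxa grouped together in the rooted triple, or None if unresolved."""
    trio = {a, b, c}
    for cluster in cluster_list:
        inside = trio & cluster
        if len(inside) == 2:
            return frozenset(inside)
    return None

def triplet_distance(tree1, tree2):
    """Number of three-taxon subsets on which the two trees display different triples."""
    leaves, c1 = clusters(tree1)
    _, c2 = clusters(tree2)
    return sum(rooted_triple(c1, *trio) != rooted_triple(c2, *trio)
               for trio in combinations(sorted(leaves), 3))

def median_triplet_tree(candidates, gene_trees):
    """Candidate minimizing the summed triplet distance to the gene trees."""
    return min(candidates,
               key=lambda t: sum(triplet_distance(t, g) for g in gene_trees))

g1 = (("A", "B"), ("C", "D"))
g2 = ((("A", "B"), "C"), "D")
g3 = ((("A", "C"), "B"), "D")
best = median_triplet_tree([g1, g2, g3], [g1, g2, g3])
print(best)
```

Enumerating candidates this way is feasible only for very small taxon sets, which is why exact methods rely on dynamic programming instead.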
Affiliation(s)
- Gleb Zhelezov
  - Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM, 87131, USA
- James H Degnan
  - Department of Mathematics and Statistics, University of New Mexico, Albuquerque, NM, 87131, USA
7
Dasarathy G, Mossel E, Nowak R, Roch S. A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements. J Math Biol 2022; 84:36. [PMID: 35394192 PMCID: PMC9258723 DOI: 10.1007/s00285-022-01731-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/04/2020] [Revised: 02/15/2022] [Accepted: 02/17/2022] [Indexed: 10/18/2022]
Abstract
Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated with each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard but unsatisfactory assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
Affiliation(s)
- Gautam Dasarathy
  - School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, USA
- Elchanan Mossel
  - Department of Mathematics and IDSS, Massachusetts Institute of Technology, Cambridge, USA
- Robert Nowak
  - Department of Electrical and Computer Engineering, University of Wisconsin, Madison, USA
- Sebastien Roch
  - Department of Mathematics, University of Wisconsin, Madison, USA
8
Identifiability of species network topologies from genomic sequences using the logDet distance. J Math Biol 2022; 84:35. [PMID: 35385988 DOI: 10.1007/s00285-022-01734-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/03/2021] [Revised: 01/12/2022] [Accepted: 03/02/2022] [Indexed: 10/18/2022]
Abstract
Inference of network-like evolutionary relationships between species from genomic data must address the interwoven signals from both gene flow and incomplete lineage sorting. The heavy computational demands of standard approaches to this problem severely limit the size of datasets that may be analyzed, in both the number of species and the number of genetic loci. Here we provide a theoretical pointer to more efficient methods, by showing that logDet distances computed from genomic-scale sequences retain sufficient information to recover network relationships in the level-1 ultrametric case. This result is obtained under the Network Multispecies Coalescent model combined with a mixture of General Time-Reversible sequence evolution models across individual gene trees. It applies to both unlinked site data, such as for SNPs, and to sequence data in which many contiguous sites may have evolved on a common tree, such as concatenated gene sequences. Thus under standard stochastic models statistically justifiable inference of network relationships from sequences can be accomplished without consideration of individual genes or gene trees.
9
Rhodes JA, Baños H, Mitchell JD, Allman ES. MSCquartets 1.0: quartet methods for species trees and networks under the multispecies coalescent model in R. Bioinformatics 2021; 37:1766-1768. [PMID: 33031510 DOI: 10.1093/bioinformatics/btaa868] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 05/01/2020] [Revised: 09/17/2020] [Accepted: 09/23/2020] [Indexed: 12/29/2022] Open
Abstract
SUMMARY MSCquartets is an R package for species tree hypothesis testing, inference of species trees and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting and its network analog. Input for these analyses are collections of metric or topological locus trees which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package implements the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model. AVAILABILITY AND IMPLEMENTATION MSCquartets is available through the Comprehensive R Archive Network: https://CRAN.R-project.org/package=MSCquartets.
Affiliation(s)
- John A Rhodes
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99775-6660, USA
- Hector Baños
  - School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332-0160, USA
- Jonathan D Mitchell
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99775-6660, USA
  - Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France
- Elizabeth S Allman
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99775-6660, USA
10
Rhodes JA, Baños H, Mitchell JD, Allman ES. MSCquartets 1.0: quartet methods for species trees and networks under the multispecies coalescent model in R. Bioinformatics (Oxford, England) 2021; 37:1766-1768. [PMID: 33031510 DOI: 10.1101/2020.05.01.073361] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/01/2020] [Revised: 09/17/2020] [Accepted: 09/23/2020] [Indexed: 05/26/2023]
Abstract
SUMMARY MSCquartets is an R package for species tree hypothesis testing, inference of species trees and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting and its network analog. Input for these analyses are collections of metric or topological locus trees which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package implements the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model. AVAILABILITY AND IMPLEMENTATION MSCquartets is available through the Comprehensive R Archive Network: https://CRAN.R-project.org/package=MSCquartets.
Affiliation(s)
- John A Rhodes
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99775-6660, USA
- Hector Baños
  - School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332-0160, USA
- Jonathan D Mitchell
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99775-6660, USA
  - Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France
- Elizabeth S Allman
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99775-6660, USA
11
Markin A, Eulenstein O. Quartet-Based Inference is Statistically Consistent Under the Unified Duplication-Loss-Coalescence Model. Bioinformatics 2021; 37:4064-4074. [PMID: 34048529 PMCID: PMC9113308 DOI: 10.1093/bioinformatics/btab414] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/09/2020] [Revised: 05/19/2021] [Accepted: 05/27/2021] [Indexed: 12/19/2022] Open
Abstract
Motivation The classic multispecies coalescent (MSC) model provides the means for theoretical justification of incomplete lineage sorting-aware species tree inference methods. This has motivated an extensive body of work on phylogenetic methods that are statistically consistent under MSC. One such particularly popular method is ASTRAL, a quartet-based species tree inference method. Novel studies suggest that ASTRAL also performs well when given multi-locus gene trees in simulation studies. Further, Legried et al. recently demonstrated that ASTRAL is statistically consistent under the gene duplication and loss model (GDL). GDL is prevalent in evolutionary histories and is the first core process in the powerful duplication-loss-coalescence evolutionary model (DLCoal) by Rasmussen and Kellis. Results In this work, we prove that ASTRAL is statistically consistent under the general DLCoal model. Therefore, our result supports the empirical evidence from the simulation-based studies. More broadly, we prove that the quartet-based inference approach is statistically consistent under DLCoal. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alexey Markin
  - Virus and Prion Research Unit, National Animal Disease Center, USDA-ARS, Ames, IA, 50010, USA
- Oliver Eulenstein
  - Department of Computer Science, Iowa State University, Ames, IA, 50011, USA
12
Allman ES, Mitchell JD, Rhodes JA. Gene tree discord, simplex plots, and statistical tests under the coalescent. Syst Biol 2021; 71:929-942. [PMID: 33560348 DOI: 10.1093/sysbio/syab008] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/12/2020] [Revised: 01/31/2021] [Accepted: 02/03/2021] [Indexed: 02/06/2023] Open
Abstract
A simple graphical device, the simplex plot of quartet concordance factors, is introduced to aid in the exploration of a collection of gene trees on a common set of taxa. A single plot summarizes all gene tree discord, and allows for visual comparison to the expected discord from the multispecies coalescent model (MSC) of incomplete lineage sorting on a species tree. A formal statistical procedure is described that can quantify the deviation from expectation for each subset of four taxa, suggesting when the data is not in accord with the MSC, and thus that either gene tree inference error is substantial or a more complex model such as that on a network may be required. If the collection of gene trees is in accord with the MSC, the plots reveal when substantial incomplete lineage sorting is present. Applications to both simulated and empirical multilocus data sets illustrate the insights provided.
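The quantity plotted in such a simplex plot, the empirical quartet concordance factor, can be sketched as follows for a single set of four taxa. This is a minimal illustration and not the MSCquartets implementation: it assumes gene trees are written as nested Python tuples with string leaf labels, that every tree contains the four taxa of interest, and that a gene tree leaving the quartet unresolved is simply skipped. The function names are hypothetical.

```python
from collections import Counter

def leaf_clusters(tree):
    """Return (leaf set, list of clusters) for a tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = leaf_clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)
    return leaves, found

def quartet_topology(cluster_list, four):
    """Return the displayed quartet as a set of two pairs, or None if unresolved."""
    four = frozenset(four)
    for cluster in cluster_list:
        inside = four & cluster
        if len(inside) == 2:
            # The cluster separates one pair of the four taxa from the other pair.
            return frozenset([inside, four - inside])
    return None

def quartet_concordance_factors(gene_trees, four):
    """Relative frequency of each resolved quartet topology among the gene trees."""
    counts = Counter()
    for tree in gene_trees:
        _, cluster_list = leaf_clusters(tree)
        topology = quartet_topology(cluster_list, four)
        if topology is not None:
            counts[topology] += 1
    total = sum(counts.values())
    return {topology: n / total for topology, n in counts.items()}

gene_trees = [
    ((("a", "b"), "e"), ("c", "d")),
    (("a", "b"), (("c", "d"), "e")),
    ((("a", "b"), "c"), ("d", "e")),
    ((("a", "c"), "b"), ("d", "e")),
]
factors = quartet_concordance_factors(gene_trees, ("a", "b", "c", "d"))
for topology, value in factors.items():
    print(sorted(sorted(pair) for pair in topology), value)
```

The three frequencies for a quartet sum to 1 when every gene tree resolves it, which is what allows them to be drawn as one point in a simplex.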
Affiliation(s)
- Elizabeth S Allman
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
- Jonathan D Mitchell
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
  - Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France
- John A Rhodes
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99709, USA
13
Jaffe A, Amsel N, Aizenbud Y, Nadler B, Chang JT, Kluger Y. Spectral neighbor joining for reconstruction of latent tree models. SIAM Journal on Mathematics of Data Science 2021; 3:113-141. [PMID: 34124606 PMCID: PMC8194222 DOI: 10.1137/20m1365715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
A common assumption in multiple scientific applications is that the distribution of observed data can be modeled by a latent tree graphical model. An important example is phylogenetics, where the tree models the evolutionary lineages of a set of observed organisms. Given a set of independent realizations of the random variables at the leaves of the tree, a key challenge is to infer the underlying tree topology. In this work we develop Spectral Neighbor Joining (SNJ), a novel method to recover the structure of latent tree graphical models. Given a matrix that contains a measure of similarity between all pairs of observed variables, SNJ computes a spectral measure of cohesion between groups of observed variables. We prove that SNJ is consistent, and derive a sufficient condition for correct tree recovery from an estimated similarity matrix. Combining this condition with a concentration of measure result on the similarity matrix, we bound the number of samples required to recover the tree with high probability. We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.
Affiliation(s)
- Ariel Jaffe
  - Program in Applied Mathematics, Yale University, New Haven, CT 06511
- Noah Amsel
  - Program in Applied Mathematics, Yale University, New Haven, CT 06511
- Yariv Aizenbud
  - Program in Applied Mathematics, Yale University, New Haven, CT 06511
- Boaz Nadler
  - Department of Computer Science, Weizmann Institute of Science, Rehovot, 76100, Israel
- Joseph T Chang
  - Department of Statistics, Yale University, New Haven, CT 06520, USA
- Yuval Kluger
  - Program in Applied Mathematics, Yale University, New Haven, CT 06511
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511
  - Department of Pathology, Yale University, New Haven, CT 06511
14
Yourdkhani S, Rhodes JA. Inferring Metric Trees from Weighted Quartets via an Intertaxon Distance. Bull Math Biol 2020; 82:97. [PMID: 32676801 DOI: 10.1007/s11538-020-00773-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/11/2020] [Accepted: 07/02/2020] [Indexed: 11/24/2022]
Abstract
A metric phylogenetic tree relating a collection of taxa induces weighted rooted triples and weighted quartets for all subsets of three and four taxa, respectively. New intertaxon distances are defined that can be calculated from these weights, and shown to exactly fit the same tree topology, but with edge weights rescaled by certain factors dependent on the associated split size. These distances are analogs for metric trees of similar ones recently introduced for topological trees that are based on induced unweighted rooted triples and quartets. The distances introduced here lead to new statistically consistent methods of inferring a metric species tree from a collection of topological gene trees generated under the multispecies coalescent model of incomplete lineage sorting. Simulations provide insight into their potential.
Affiliation(s)
- Samaneh Yourdkhani
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, 99775, USA
- John A Rhodes
  - Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, 99775, USA
15
Allman ES, Baños H, Rhodes JA. NANUQ: a method for inferring species networks from gene trees under the coalescent model. Algorithms Mol Biol 2019; 14:24. [PMID: 31827592 PMCID: PMC6896299 DOI: 10.1186/s13015-019-0159-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 06/04/2019] [Accepted: 11/07/2019] [Indexed: 01/07/2023] Open
Abstract
Species networks generalize the notion of species trees to allow for hybridization or other lateral gene transfer. Under the network multispecies coalescent model, individual gene trees arising from a network can have any topology, but arise with frequencies dependent on the network structure and numerical parameters. We propose a new algorithm for statistical inference of a level-1 species network under this model, from data consisting of gene tree topologies, and provide the theoretical justification for it. The algorithm is based on an analysis of quartets displayed on gene trees, combining several statistical hypothesis tests with combinatorial ideas such as a quartet-based intertaxon distance appropriate to networks, the NeighborNet algorithm for circular split systems, and the Circular Network algorithm for constructing a splits graph.