1
Choi KP, Kaur G, Thompson A, Wu T. Distributions of 4-subtree patterns for uniform random unrooted phylogenetic trees. J Theor Biol 2024; 584:111794. [PMID: 38499267 DOI: 10.1016/j.jtbi.2024.111794] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2023] [Revised: 03/10/2024] [Accepted: 03/13/2024] [Indexed: 03/20/2024]
Abstract
Tree shape statistics based on peripheral structures have been utilized to study evolutionary mechanisms and inference methods. Partially motivated by a recent study by Pouryahya and Sankoff on modeling the accumulation of subgenomes in the evolution of polyploids, we present the distribution of subtree patterns with four or fewer leaves for the unrooted Proportional to Distinguishable Arrangements (PDA) model. We derive a recursive formula for computing the joint distributions, as well as a Strong Law of Large Numbers and a Central Limit Theorem for the joint distributions. This enables us to confirm several conjectures proposed by Pouryahya and Sankoff, as well as provide some theoretical insights into their observations. Based on their empirical datasets, we demonstrate that the statistical test based on the joint distribution could be more sensitive than those based on one individual subtree pattern to detect the existence of evolutionary forces such as whole genome duplication.
Affiliation(s)
- Kwok Pui Choi
  - Department of Statistics and Data Sciences, National University of Singapore, Singapore 117546, Singapore.
- Gursharn Kaur
  - Biocomplexity Institute, University of Virginia, Charlottesville, 22911, USA.
- Ariadne Thompson
  - School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
- Taoyang Wu
  - School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
2
Wakeley J, Fan WT(L), Koch E, Sunyaev S. Recurrent mutation in the ancestry of a rare variant. Genetics 2023; 224:iyad049. [PMID: 36967220 PMCID: PMC10324944 DOI: 10.1093/genetics/iyad049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 01/30/2023] [Accepted: 03/08/2023] [Indexed: 03/28/2023] Open
Abstract
Recurrent mutation produces multiple copies of the same allele which may be co-segregating in a population. Yet, most analyses of allele-frequency or site-frequency spectra assume that all observed copies of an allele trace back to a single mutation. We develop a sampling theory for the number of latent mutations in the ancestry of a rare variant, specifically a variant observed in relatively small count in a large sample. Our results follow from the statistical independence of low-count mutations, which we show to hold for the standard neutral coalescent or diffusion model of population genetics as well as for more general coalescent trees. For populations of constant size, these counts are distributed like the number of alleles in the Ewens sampling formula. We develop a Poisson sampling model for populations of varying size and illustrate it using new results for site-frequency spectra in an exponentially growing population. We apply our model to a large data set of human SNPs and use it to explain dramatic differences in site-frequency spectra across the range of mutation rates in the human genome.
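For the constant-size case, the abstract states that the counts are distributed like the number of alleles in the Ewens sampling formula. A minimal sketch of that distribution follows. It is an illustration only and is not taken from the paper: the function names and the value of `theta` used in any call are hypothetical choices, and the recursion is the standard sequential construction in which the (m + 1)-th sampled copy is a new allele with probability theta / (theta + m).

```python
from fractions import Fraction


def ewens_allele_count_distribution(n, theta):
    """Return [P(K = 0), ..., P(K = n)] for the number of alleles K
    in a sample of size n under the Ewens sampling formula."""
    theta = Fraction(theta)
    # A sample of size one always contains exactly one allele.
    probs = [Fraction(0), Fraction(1)]
    for m in range(1, n):
        # The (m + 1)-th copy is a new allele with probability theta / (theta + m).
        p_new = theta / (theta + m)
        nxt = [Fraction(0)] * (m + 2)
        for k in range(1, m + 1):
            nxt[k] += probs[k] * (1 - p_new)
            nxt[k + 1] += probs[k] * p_new
        probs = nxt
    return probs


def expected_allele_count(n, theta):
    """Closed-form mean: sum over i = 0..n-1 of theta / (theta + i)."""
    theta = Fraction(theta)
    return sum(theta / (theta + i) for i in range(n))
```

Exact rational arithmetic is used so that the recursion can be checked against the closed-form mean without rounding error. For small `theta`, most of the probability mass sits on K = 1.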
Affiliation(s)
- John Wakeley
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
- Wai-Tong (Louis) Fan
  - Department of Mathematics, Indiana University, Bloomington, IN 47405, USA
  - Center of Mathematical Sciences and Applications, Harvard University, Cambridge, MA 02138, USA
- Evan Koch
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
  - Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Shamil Sunyaev
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
  - Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
3
Distributions of cherries and pitchforks for the Ford model. Theor Popul Biol 2023; 149:27-38. [PMID: 36566944 DOI: 10.1016/j.tpb.2022.12.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/02/2022] [Revised: 12/11/2022] [Accepted: 12/13/2022] [Indexed: 12/24/2022]
Abstract
Distributional properties of tree shape statistics under random phylogenetic tree models play an important role in investigating the evolutionary forces underlying the observed phylogenies. In this paper, we study two subtree counting statistics, the number of cherries and that of pitchforks for the Ford model, the alpha model introduced by Daniel Ford. It is a one-parameter family of random phylogenetic tree models which includes the proportional to distinguishable arrangement (PDA) and the Yule models, two tree models commonly used in phylogenetics. Based on a non-uniform version of the extended Pólya urn models in which negative entries are permitted for their replacement matrices, we obtain the strong law of large numbers and the central limit theorem for the joint distribution of these two statistics for the Ford model. Furthermore, we derive a recursive formula for computing the exact joint distribution of these two statistics. This leads to exact formulas for their means and higher order asymptotic expansions of their second moments, which allows us to identify a critical parameter value for the correlation between these two statistics. That is, when the number of tree leaves is sufficiently large, they are negatively correlated for 0≤α≤1/2 and positively correlated for 1/2<α<1.
4
Lappo E, Rosenberg NA. Approximations to the expectations and variances of ratios of tree properties under the coalescent. G3 (Bethesda, Md.) 2022; 12:jkac205. [PMID: 35951748 PMCID: PMC9526068 DOI: 10.1093/g3journal/jkac205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Accepted: 08/01/2022] [Indexed: 11/14/2022]
Abstract
Properties of gene genealogies such as tree height (H), total branch length (L), total lengths of external (E) and internal (I) branches, mean length of basal branches (B), and the underlying coalescence times (T) can be used to study population-genetic processes and to develop statistical tests of population-genetic models. Uses of tree features in statistical tests often rely on predictions that depend on pairwise relationships among such features. For genealogies under the coalescent, we provide exact expressions for Taylor approximations to expected values and variances of ratios $X_n/Y_n$, for all 15 pairs among the variables $\{H_n, L_n, E_n, I_n, B_n, T_k\}$, considering $n$ leaves and $2 \le k \le n$. For expected values of the ratios, the approximations match closely with empirical simulation-based values. The approximations to the variances are not as accurate, but they generally match simulations in their trends as $n$ increases. Although $E_n$ has expectation 2 and $H_n$ has expectation 2 in the limit as $n \to \infty$, the approximation to the limiting expectation for $E_n/H_n$ is not 1, instead equaling $\pi^2/3 - 2 \approx 1.28987$. The new approximations augment fundamental results in coalescent theory on the shapes of genealogical trees.
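The limiting value quoted in this abstract can be reproduced with a second-order Taylor (delta-method) expansion of the mean of a ratio. The sketch below is one way to do so and is not taken from the paper. `ratio_mean_taylor` is a hypothetical helper. The limiting variance of the tree height assumes independent exponential coalescence times with rate k(k-1)/2, and the limiting covariance between external length and height is taken to be 0, which is an assumption made here for illustration.

```python
import math


def ratio_mean_taylor(mean_x, mean_y, var_y, cov_xy):
    """Second-order Taylor (delta-method) approximation to E[X / Y]."""
    return mean_x / mean_y - cov_xy / mean_y**2 + var_y * mean_x / mean_y**3


# Limiting moments as n -> infinity, in coalescent units where E[H_n] -> 2.
mean_e = 2.0                            # E[E_n] = 2 (stated in the abstract)
mean_h = 2.0                            # limit of E[H_n] (stated in the abstract)
var_h = 4.0 * (math.pi**2 / 3.0 - 3.0)  # limit of sum over k >= 2 of 4 / (k (k - 1))^2
cov_eh = 0.0                            # assumption: limiting covariance taken as 0

approx = ratio_mean_taylor(mean_e, mean_h, var_h, cov_eh)
```

Under these assumptions the variance term contributes $\pi^2/3 - 3$ on top of the naive ratio of means, 1, which explains why the approximation does not equal 1 even though both expectations tend to 2.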
Affiliation(s)
- Egor Lappo
  - Department of Biology, Stanford University, Stanford, CA 94305, USA
- Noah A Rosenberg
  - Department of Biology, Stanford University, Stanford, CA 94305, USA
5
Ledda A, Cummins M, Shaw LP, Jauneikaite E, Cole K, Lasalle F, Barry D, Turton J, Rosmarin C, Anaraki S, Wareham D, Stoesser N, Paul J, Manuel R, Cherian BP, Didelot X. Hospital outbreak of carbapenem-resistant Enterobacterales associated with a blaOXA-48 plasmid carried mostly by Escherichia coli ST399. Microb Genom 2022; 8:000675. [PMID: 35442183 PMCID: PMC9453065 DOI: 10.1099/mgen.0.000675] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/21/2023] Open
Abstract
A hospital outbreak of carbapenem-resistant Enterobacterales was detected by routine surveillance. Whole genome sequencing and subsequent analysis revealed a conserved promiscuous blaOXA-48 carrying plasmid as the defining factor within this outbreak. Four different species of Enterobacterales were involved in the outbreak. Escherichia coli ST399 accounted for 35 of all the 55 isolates. Comparative genomics analysis using publicly available E. coli ST399 genomes showed that the outbreak E. coli ST399 isolates formed a unique clade. We developed a mathematical model of pOXA-48-like plasmid transmission between host lineages and used it to estimate its conjugation rate, giving a lower bound of 0.23 conjugation events per lineage per year. Our analysis suggests that co-evolution between the pOXA-48-like plasmid and E. coli ST399 could have played a role in the outbreak. This is the first study to report carbapenem-resistant E. coli ST399 carrying blaOXA-48 as the main cause of a plasmid-borne outbreak within a hospital setting. Our findings suggest complementary roles for both plasmid conjugation and clonal expansion in the emergence of this outbreak.
Affiliation(s)
- Alice Ledda
  - Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, UK
  - Healthcare Associated Infections and Antimicrobial Resistance Division, National Infection Service, Public Health England, London, UK
  - *Correspondence: Alice Ledda,
- Martina Cummins
  - Department of Microbiology and Infection Control, Barts Health NHS Trust, London, UK
- Liam P. Shaw
  - Department of Zoology, University of Oxford, Oxford, UK
- Elita Jauneikaite
  - Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, UK
  - NIHR Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, Department of Infectious Disease, Imperial College London, Hammersmith Campus, London, UK
- Florent Lasalle
  - Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, UK
  - Microbes and Pathogens Programme, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Deborah Barry
  - Department of Microbiology and Infection Control, Barts Health NHS Trust, London, UK
- Jane Turton
  - Healthcare Associated Infections and Antimicrobial Resistance Division, National Infection Service, Public Health England, London, UK
- Caryn Rosmarin
  - Department of Microbiology and Infection Control, Barts Health NHS Trust, London, UK
- Sudy Anaraki
  - North East and North Central London Health Protection Team, Public Health England, London, UK
- David Wareham
  - Department of Microbiology and Infection Control, Barts Health NHS Trust, London, UK
- Nicole Stoesser
  - Modernising Medical Microbiology, Nuffield Department of Clinical Medicine, University of Oxford, John Radcliffe Hospital, Oxford, UK
- John Paul
  - Brighton and Sussex Medical School, Department of Global Health and Infection, University of Sussex, Falmer, Brighton, UK
- Rohini Manuel
  - Public Health Laboratory London, National Infection Service, Public Health England, London, UK
- Benny P. Cherian
  - Department of Microbiology and Infection Control, Barts Health NHS Trust, London, UK
- Xavier Didelot
  - School of Life Sciences and Department of Statistics, University of Warwick, Coventry, UK
6
A compendium of covariances and correlation coefficients of coalescent tree properties. Theor Popul Biol 2022; 143:1-13. [PMID: 34757022 PMCID: PMC9731325 DOI: 10.1016/j.tpb.2021.09.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/02/2021] [Revised: 09/21/2021] [Accepted: 09/28/2021] [Indexed: 02/03/2023]
Abstract
Gene genealogies are frequently studied by measuring properties such as their height (H), length (L), sum of external branches (E), sum of internal branches (I), and mean of their two basal branches (B), and the coalescence times that contribute to the other genealogical features (T). These tree properties and their relationships can provide insight into the effects of population-genetic processes on genealogies and genetic sequences. Here, under the coalescent model, we study the 15 correlations among pairs of features of genealogical trees: $H_n$, $L_n$, $E_n$, $I_n$, $B_n$, and $T_k$ for a sample of size $n$, with $2 \le k \le n$. We report high correlations among $H_n$, $L_n$, $I_n$, and $B_n$, with all pairwise correlations of these quantities having values greater than or equal to $\sqrt{6}\,[6\zeta(3)+6-\pi^2]/(\pi\sqrt{18+9\pi^2-\pi^4}) \approx 0.84930$ in the limit as $n \to \infty$, where $\zeta$ is the Riemann zeta function. Although $E_n$ has expectation 2 for all $n$ and $H_n$ has expectation 2 in the $n \to \infty$ limit, their limiting correlation is 0. The results contribute toward understanding features of the shapes of coalescent trees.
7
Dilber E, Terhorst J. Robust detection of natural selection using a probabilistic model of tree imbalance. Genetics 2022; 220:6511494. [PMID: 35100408 PMCID: PMC8893258 DOI: 10.1093/genetics/iyac009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2021] [Accepted: 12/16/2021] [Indexed: 01/21/2023] Open
Abstract
Neutrality tests such as Tajima's D and Fay and Wu's H are standard implements in the population genetics toolbox. One of their most common uses is to scan the genome for signals of natural selection. However, it is well understood that D and H are confounded by other evolutionary forces, in particular population expansion, that may be unrelated to selection. Because they are not model-based, it is not clear how to deconfound these tests in a principled way. In this article, we derive new likelihood-based methods for detecting natural selection, which are robust to fluctuations in effective population size. At the core of our method is a novel probabilistic model of tree imbalance, which generalizes Kingman's coalescent to allow certain aberrant tree topologies to arise more frequently than is expected under neutrality. We derive a frequency spectrum-based estimator that can be used in place of D, and also extend to the case where genealogies are first estimated. We benchmark our methods on real and simulated data, and provide an open source software implementation.
Affiliation(s)
- Enes Dilber
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Jonathan Terhorst
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA (corresponding author)
8
Multivariate phase-type theory for the site frequency spectrum. J Math Biol 2021; 83:63. [PMID: 34783900 DOI: 10.1007/s00285-021-01689-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/01/2021] [Revised: 08/09/2021] [Accepted: 10/13/2021] [Indexed: 10/19/2022]
Abstract
Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package PhaseTypeR, and R code for the reproduction of our results is available as an accompanying vignette.
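To make the phrase "linear functions of the SFS" concrete, the sketch below computes Tajima's D from an unfolded site frequency spectrum. It is a plain illustration using the classical formulas from Tajima (1989) and has nothing to do with the phase-type machinery or the PhaseTypeR package described in the abstract. The function name `tajimas_d` and the input layout are hypothetical choices.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS.

    sfs[i - 1] is the number of sites at which the derived allele
    appears in i of the n sampled sequences, for i = 1..n-1.
    """
    n = len(sfs) + 1
    seg_sites = sum(sfs)
    if n < 4 or seg_sites == 0:
        # The variance constants vanish for n < 4, and D is undefined without variation.
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    # Both estimators of theta are linear functions of the SFS.
    pairs = n * (n - 1) / 2.0
    theta_pi = sum(count * i * (n - i) for i, count in enumerate(sfs, start=1)) / pairs
    theta_w = seg_sites / a1
    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (theta_pi - theta_w) / math.sqrt(variance)
```

The numerator is linear in the SFS, but the normalization depends on the number of segregating sites in a nonlinear way, so D itself is a ratio rather than a linear function. A spectrum proportional to 1/i, such as `[12, 6, 4, 3]` for five sequences, gives a numerator of zero, while an excess of singletons pushes D negative.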
9
Choi KP, Kaur G, Wu T. On asymptotic joint distributions of cherries and pitchforks for random phylogenetic trees. J Math Biol 2021; 83:40. [PMID: 34554333 PMCID: PMC8460594 DOI: 10.1007/s00285-021-01667-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/18/2021] [Revised: 07/29/2021] [Accepted: 09/08/2021] [Indexed: 11/24/2022]
Abstract
Tree shape statistics provide valuable quantitative insights into evolutionary mechanisms underpinning phylogenetic trees, a commonly used graph representation of evolutionary relationships among taxonomic units ranging from viruses to species. We study two subtree counting statistics, the number of cherries and the number of pitchforks, for random phylogenetic trees generated by two widely used null tree models: the proportional to distinguishable arrangements (PDA) and the Yule-Harding-Kingman (YHK) models. By developing limit theorems for a version of extended Pólya urn models in which negative entries are permitted for their replacement matrices, we deduce the strong laws of large numbers and the central limit theorems for the joint distributions of these two counting statistics for the PDA and the YHK models. Our results indicate that the limiting behaviour of these two statistics, when appropriately scaled using the number of leaves in the underlying trees, is independent of the initial tree used in the tree generating process.
Affiliation(s)
- Kwok Pui Choi
  - Department of Statistics and Data Science, and the Department of Mathematics, National University of Singapore, Singapore, 117546, Republic of Singapore
- Gursharn Kaur
  - Department of Statistics and Data Science, and the Department of Mathematics, National University of Singapore, Singapore, 117546, Republic of Singapore
- Taoyang Wu
  - School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK.
10
Cappello L, Palacios JA. Sequential importance sampling for multiresolution Kingman-Tajima coalescent counting. Ann Appl Stat 2021; 14:727-751. [PMID: 33995755 DOI: 10.1214/19-aoas1313] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/26/2022]
Abstract
Statistical inference of evolutionary parameters from molecular sequence data relies on coalescent models to account for the shared genealogical ancestry of the samples. However, inferential algorithms do not scale to available data sets. A strategy to improve computational efficiency is to rely on simpler coalescent and mutation models, resulting in smaller hidden state spaces. An estimate of the cardinality of the state space of genealogical trees at different resolutions is essential to decide the best modeling strategy for a given dataset. To our knowledge, there is neither an exact nor approximate method to determine these cardinalities. We propose a sequential importance sampling algorithm to estimate the cardinality of the sample space of genealogical trees under different coalescent resolutions. Our sampling scheme proceeds sequentially across the set of combinatorial constraints imposed by the data, which in this work are completely linked sequences of DNA at a non-recombining segment. We analyze the cardinality of different genealogical tree spaces on simulations to study the settings that favor coarser resolutions. We apply our method to estimate the cardinality of genealogical tree spaces from mtDNA data from the 1000 Genomes Project and a sample from a Melanesian population at the β-globin locus.
11
In Silico Molecular Docking Analysis of α-Pinene: An Antioxidant and Anticancer Drug Obtained from Myrtus communis. International Journal of Cancer Management 2021. [DOI: 10.5812/ijcm.89116] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Background: Testis-specific protein on Y chromosome (TSPY) is the product of a tandem gene cluster. TSPY expression has been observed in gonadoblastoma and numerous distinct kinds of germ cell tumors, such as carcinoma in situ/intratubular germ cell neoplasia, seminoma, and extragonadal intracranial germ cell tumors (GCT). Myrtus communis extract rich in α-pinene showed high antioxidant and anticancer activity against TSPY. Methods: The molecular weight and theoretical isoelectric point of the TSPY proteins were calculated using the ExPASy ProtParam tool. Software such as MEGA 6, BioEdit, NEBcutter (New England Biolabs), and CAP3 was used to analyze clustering and to find restriction enzyme sites on the TSPY sequence. To evaluate the nucleotide diversity of all sequences, the number of diverse situations and Tajima's and Watterson's estimators of theta were assessed. Nucleotide polymorphism can be measured by several parameters, such as haplotype diversity, nucleotide diversity, and theta, using DnaSP software. To find protein-protein interaction networks, the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) was used, and SWISS-MODEL was used to predict the 3D structure; for protein-peptide docking based on interaction, SwissDock, GalaxyWEB, and CABS-dock were employed. Results: We report a high (0.91) dN/dS index, positive Tajima's D and Fu and Li's tests, and a non-significant D test, suggesting the occurrence of old modifications or a decrease of newborn mutations in the TSPY gene family. Interestingly, several hub proteins produced a strong chain or an operative module within their protein groups, such as nucleosome assembly protein (1NAP1L), RBMXL2, TBL1Y, and AMELY, which are all associated with the same cellular appliance elements and/or genetic uses. Docking of the TSPY target with α-pinene revealed that the computationally predicted lowest-energy networks of TSPY are established by intermolecular hydrogen bonds and stacking interactions. Conclusions: The results of this study demonstrated that α-pinene interacts with the TSPY protein target and could be developed as a promising candidate for a new anticancer agent.
12
Abstract
Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.
Affiliation(s)
- Jaehee Kim
  - Department of Biology, Stanford University, Stanford, CA 94305
- Julia A Palacios
  - Department of Statistics, Stanford University, Stanford, CA 94305
  - Department of Biomedical Data Science, Stanford School of Medicine, Stanford, CA 94305
13
Ralph P, Thornton K, Kelleher J. Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Genetics 2020; 215:779-797. [PMID: 32357960 PMCID: PMC7337078 DOI: 10.1534/genetics.120.303253] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 09/23/2019] [Accepted: 04/28/2020] [Indexed: 12/11/2022] Open
Abstract
As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics' relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
Affiliation(s)
- Peter Ralph
  - Institute of Evolution and Ecology, Departments of Mathematics and Biology, University of Oregon, Eugene, Oregon 97405
- Kevin Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, United Kingdom OX3 7LF
14
On cherry and pitchfork distributions of random rooted and unrooted phylogenetic trees. Theor Popul Biol 2020; 132:92-104. [PMID: 32135170 DOI: 10.1016/j.tpb.2020.02.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 09/05/2019] [Revised: 01/20/2020] [Accepted: 02/25/2020] [Indexed: 01/08/2023]
Abstract
Tree shape statistics are important for investigating evolutionary mechanisms mediating phylogenetic trees. As a step towards bridging shape statistics between rooted and unrooted trees, we present a comparison study on two subtree statistics known as numbers of cherries and pitchforks for the proportional to distinguishable arrangements (PDA) and the Yule-Harding-Kingman (YHK) models. Based on recursive formulas on the joint distribution of the number of cherries and that of pitchforks, it is shown that cherry distributions are log-concave for both rooted and unrooted trees under these two models. Furthermore, the mean number of cherries and that of pitchforks for unrooted trees converge respectively to those for rooted trees under the YHK model while there exists a limiting gap of 1∕4 for the PDA model. Finally, the total variation distances between the cherry distributions of rooted and those of unrooted trees converge for both models. Our results indicate that caution is required for conducting statistical analysis for tree shapes involving both rooted and unrooted trees.
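As a simplified illustration of the kind of recursion mentioned in this abstract, the sketch below computes the exact distribution of the number of cherries alone, for rooted trees only, under the YHK and PDA models. It is not the joint cherry and pitchfork recursion from the paper. It assumes the standard leaf-insertion constructions: under YHK a new leaf attaches to a uniformly chosen pendant edge, and under PDA to a uniformly chosen edge, counting an edge above the root. The function names are hypothetical.

```python
from fractions import Fraction


def cherry_distribution(n, model="YHK"):
    """Exact distribution of the number of cherries in a random rooted
    binary tree with n >= 2 leaves, as a dict {cherries: probability}."""
    if model not in ("YHK", "PDA"):
        raise ValueError("model must be 'YHK' or 'PDA'")
    dist = {1: Fraction(1)}  # the only tree on two leaves is a single cherry
    for m in range(2, n):
        # Edges available to the new leaf in a tree with m leaves:
        # YHK uses the m pendant edges; PDA uses all 2m - 1 edges,
        # counting the edge above the root.
        edges = m if model == "YHK" else 2 * m - 1
        nxt = {}
        for c, p in dist.items():
            # A new cherry appears only when the leaf attaches to one of
            # the m - 2c pendant edges that are not already in a cherry.
            p_up = Fraction(m - 2 * c, edges)
            nxt[c] = nxt.get(c, 0) + p * (1 - p_up)
            if p_up:
                nxt[c + 1] = nxt.get(c + 1, 0) + p * p_up
        dist = nxt
    return dist


def mean_cherries(n, model="YHK"):
    """Mean number of cherries computed from the exact distribution."""
    return sum(c * p for c, p in cherry_distribution(n, model).items())
```

With exact fractions, the means produced by this recursion can be compared directly with the closed forms n/3 for YHK (n at least 3) and n(n-1)/(2(2n-3)) for rooted PDA trees.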
15
Satta Y, Zheng W, Nishiyama KV, Iwasaki RL, Hayakawa T, Fujito NT, Takahata N. Two-dimensional site frequency spectrum for detecting, classifying and dating incomplete selective sweeps. Genes Genet Syst 2019; 94:283-300. [DOI: 10.1266/ggs.19-00012] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022] Open
Affiliation(s)
- Yoko Satta
  - School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
- Wanjing Zheng
  - School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
- Kumiko V. Nishiyama
  - School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
- Risa L. Iwasaki
  - School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
- Toshiyuki Hayakawa
  - Graduate School of Systems Life Sciences and Faculty of Arts and Science, Kyushu University
- Naoko T. Fujito
  - Institute for Human Genetics and Department of Epidemiology and Biostatistics, University of California
- Naoyuki Takahata
  - School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
16
The Evolving Moran Genealogy. Theor Popul Biol 2019; 130:94-105. [PMID: 31330138 DOI: 10.1016/j.tpb.2019.07.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/04/2018] [Revised: 06/24/2019] [Accepted: 07/05/2019] [Indexed: 11/21/2022]
Abstract
We study the evolution of the population genealogy in the classic neutral Moran Model of finite size n∈N and in discrete time. The stochastic transformations that shape a Moran population can be realized directly on its genealogy and give rise to a process on a state space consisting of n-sized binary increasing trees. We derive a number of properties of this process, and show that they are in agreement with existing results on the infinite-population limit of the Moran Model. Most importantly, this process admits time reversal, which makes it possible to simplify the mechanisms determining state changes, and allows for a thorough investigation of the Most Recent Common Ancestor process.
17
Varshney RK, Thudi M, Roorkiwal M, He W, Upadhyaya HD, Yang W, Bajaj P, Cubry P, Rathore A, Jian J, Doddamani D, Khan AW, Garg V, Chitikineni A, Xu D, Gaur PM, Singh NP, Chaturvedi SK, Nadigatla GVPR, Krishnamurthy L, Dixit GP, Fikre A, Kimurto PK, Sreeman SM, Bharadwaj C, Tripathi S, Wang J, Lee SH, Edwards D, Polavarapu KKB, Penmetsa RV, Crossa J, Nguyen HT, Siddique KHM, Colmer TD, Sutton T, von Wettberg E, Vigouroux Y, Xu X, Liu X. Resequencing of 429 chickpea accessions from 45 countries provides insights into genome diversity, domestication and agronomic traits. Nat Genet 2019; 51:857-864. [DOI: 10.1038/s41588-019-0401-3] [Citation(s) in RCA: 147] [Impact Index Per Article: 24.5] [Received: 02/07/2018] [Accepted: 03/21/2019] [Indexed: 11/09/2022]
18
|
Fujito NT, Satta Y, Hayakawa T, Takahata N. A new inference method for detecting an ongoing selective sweep. Genes Genet Syst 2018; 93:149-161. [DOI: 10.1266/ggs.18-00008] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/23/2022] Open
Affiliation(s)
- Naoko T. Fujito
- School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
| | - Yoko Satta
- School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
| | - Toshiyuki Hayakawa
- Graduate School of Systems Life Sciences, Kyushu University
- Faculty of Arts and Science, Kyushu University
| | - Naoyuki Takahata
- School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies)
| |
|
19
|
Arbisser IM, Jewett EM, Rosenberg NA. On the joint distribution of tree height and tree length under the coalescent. Theor Popul Biol 2018; 122:46-56. [PMID: 29132923 PMCID: PMC5945353 DOI: 10.1016/j.tpb.2017.10.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 04/03/2017] [Revised: 10/30/2017] [Accepted: 10/31/2017] [Indexed: 10/18/2022]
Abstract
Many statistics that examine genetic variation depend on the underlying shapes of genealogical trees. Under the coalescent model, we investigate the joint distribution of two quantities that describe genealogical tree shape: tree height and tree length. We derive a recursive formula for their exact joint distribution under a demographic model of a constant-sized population. We obtain approximations for the mean and variance of the ratio of tree height to tree length, using them to show that this ratio converges in probability to 0 as the sample size increases. We find that as the sample size increases, the correlation coefficient for tree height and length approaches (π²−6)/[π√(2π²−18)] ≈ 0.9340. Using simulations, we examine the joint distribution of height and length under demographic models with population growth and population subdivision. We interpret the joint distribution in relation to problems of interest in data analysis, including inference of the time to the most recent common ancestor. The results assist in understanding the influences of demographic histories on two fundamental features of tree shape.
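The limiting correlation quoted in this abstract can be checked numerically. The sketch below is not the paper's recursive formula; it assumes only the standard constant-size coalescent, in which the time T_k spent with k lineages is exponential with rate k(k−1)/2 and the T_k are independent, so that height H = Σ T_k and length L = Σ k·T_k have variances and covariance given by simple sums. The function name is chosen for this example.

```python
import math

def height_length_correlation(n):
    """Exact correlation of tree height H and tree length L for sample size n
    under the standard constant-size coalescent."""
    var_h = var_l = cov = 0.0
    for k in range(2, n + 1):
        # T_k is exponential with rate k(k-1)/2, so
        # Var(T_k) = 4 / (k^2 (k-1)^2); the T_k are independent.
        var_t = 4.0 / (k * k * (k - 1) ** 2)
        var_h += var_t            # H = sum over k of T_k
        var_l += k * k * var_t    # L = sum over k of k * T_k
        cov += k * var_t          # Cov(H, L) = sum over k of k * Var(T_k)
    return cov / math.sqrt(var_h * var_l)

# Limiting value stated in the abstract.
limit = (math.pi**2 - 6) / (math.pi * math.sqrt(2 * math.pi**2 - 18))
print(limit)
for n in (2, 10, 100, 10_000):
    print(n, height_length_correlation(n))
```

For n = 2 the correlation is exactly 1, since L = 2H; as n grows the values decrease toward the limit.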
Affiliation(s)
- Ilana M Arbisser
- Department of Biology, Stanford University, Stanford, CA 94305, USA.
| | - Ethan M Jewett
- Departments of Electrical Engineering & Computer Science and Statistics, University of California, Berkeley, CA 94720, USA
| | - Noah A Rosenberg
- Department of Biology, Stanford University, Stanford, CA 94305, USA
| |
|
20
|
Reppell M, Zöllner S. An efficient algorithm for generating the internal branches of a Kingman coalescent. Theor Popul Biol 2018; 122:57-66. [PMID: 28709926 PMCID: PMC5764821 DOI: 10.1016/j.tpb.2017.05.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/22/2016] [Revised: 05/19/2017] [Accepted: 05/26/2017] [Indexed: 01/16/2023]
Abstract
Coalescent simulations are a widely used approach for simulating sample genealogies, but can become computationally burdensome in large samples. Methods exist to analytically calculate a sample's expected frequency spectrum without simulating full genealogies. However, statistics that rely on the distribution of the length of internal coalescent branches, such as the probability that two mutations of equal size arose on the same genealogical branch, have previously required full coalescent simulations to estimate. Here, we present a sampling method capable of efficiently generating limited portions of sample genealogies using a series of analytic equations that give probabilities for the number, start, and end of internal branches conditional on the number of final samples they subtend. These equations are independent of the coalescent waiting times and need only be calculated a single time, lending themselves to efficient computation. We compare our method with full coalescent simulations to show the resulting distribution of branch lengths and summary statistics are equivalent, but that for many conditions our method is at least 10 times faster.
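For context, the full-genealogy baseline that this abstract compares against can be sketched in a few lines of Python. This is a plain Kingman coalescent simulation, not the authors' sampling method; the function name and the (samples subtended, start, end) tuple layout are assumptions made for this example.

```python
import random

def simulate_branches(n, rng):
    """Simulate one Kingman coalescent genealogy for n samples and return
    every branch as (samples_subtended, start_time, end_time)."""
    # Each active lineage is (number of samples below it, time it began).
    active = [(1, 0.0) for _ in range(n)]
    branches = []
    t = 0.0
    while len(active) > 1:
        k = len(active)
        # Waiting time with k lineages: exponential with rate k(k-1)/2.
        t += rng.expovariate(k * (k - 1) / 2)
        # Merge two lineages chosen uniformly at random.
        i, j = rng.sample(range(k), 2)
        a, b = active[i], active[j]
        branches.append((a[0], a[1], t))
        branches.append((b[0], b[1], t))
        active = [x for idx, x in enumerate(active) if idx not in (i, j)]
        active.append((a[0] + b[0], t))
    return branches

rng = random.Random(0)  # fixed seed so the run is repeatable
branches = simulate_branches(8, rng)
internal = [b for b in branches if b[0] > 1]  # branches above >1 sample
print(len(branches), len(internal))
```

A genealogy of n samples always has 2n − 2 branches, of which n − 2 are internal, so even this small example spends most of its work on external branches that an internal-branch statistic does not need.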
Affiliation(s)
- M Reppell
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
| | - S Zöllner
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA; Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
| |
|
21
|
Ferretti L, Klassmann A, Raineri E, Ramos-Onsins SE, Wiehe T, Achaz G. The neutral frequency spectrum of linked sites. Theor Popul Biol 2018; 123:70-79. [PMID: 29964061 DOI: 10.1016/j.tpb.2018.06.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/29/2017] [Revised: 06/01/2018] [Accepted: 06/11/2018] [Indexed: 11/28/2022]
Abstract
We introduce the conditional Site Frequency Spectrum (SFS) for a genomic region linked to a focal mutation of known frequency. An exact expression for its expected value is provided for the neutral model without recombination. Its relation with the expected SFS for two sites, 2-SFS, is discussed. These spectra derive from the coalescent approach of Fu (1995) for finite samples, which is reviewed. Remarkably simple expressions are obtained for the linked SFS of a large population, which are also solutions of the multi-allelic Kolmogorov equations. These formulae are the immediate extensions of the well known single site θ∕f neutral SFS. Besides the general interest in these spectra, they relate to relevant biological cases, such as structural variants and introgressions. As an application, a recipe to adapt Tajima's D and other SFS-based neutrality tests to a non-recombining region containing a neutral marker is presented.
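The single-site θ∕f spectrum that this abstract takes as its starting point can be written down directly. The Python sketch below uses the classical finite-sample form E[ξ_i] = θ/i for i = 1, …, n−1 and shows that both Watterson's estimator and the mean pairwise diversity recover θ from it, which is why Tajima's D is centered on zero under neutrality. It does not implement the conditional or two-site spectra of the paper; the function names are chosen for this example.

```python
def expected_sfs(theta, n):
    """Expected neutral single-site SFS: E[xi_i] = theta / i, i = 1..n-1."""
    return [theta / i for i in range(1, n)]

def watterson_theta(sfs):
    """Number of segregating sites divided by the harmonic number a_n."""
    n = len(sfs) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n

def pairwise_diversity(sfs):
    """Mean number of pairwise differences computed from the SFS."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2
    # A variant carried by i of n samples differs in i * (n - i) pairs.
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs

sfs = expected_sfs(theta=5.0, n=20)
print(watterson_theta(sfs), pairwise_diversity(sfs))  # both estimate theta
```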
Affiliation(s)
- Luca Ferretti
- The Pirbright Institute, Woking, United Kingdom; Institut de Systématique, Evolution, Biodiversité, UMR 7205, MNHN and Centre Interdisciplinaire de Recherche en Biologie, UMR 7241, Collège de France, Paris, France.
| | | | - Emanuele Raineri
- CNAG-CRG, Centre for Genomic Regulation (CRG) and UPF, Barcelona, Spain
| | | | - Thomas Wiehe
- Institut für Genetik, Universität zu Köln, Köln, Germany
| | - Guillaume Achaz
- Institut de Systématique, Evolution, Biodiversité, UMR 7205, MNHN and Centre Interdisciplinaire de Recherche en Biologie, UMR 7241, Collège de France, Paris, France
| |
|
22
|
Satta Y, Fujito NT, Takahata N. Nonequilibrium Neutral Theory for Hitchhikers. Mol Biol Evol 2018; 35:1362-1365. [PMID: 29722819 DOI: 10.1093/molbev/msy093] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Selective sweep is a phenomenon of reduced variation at presumably neutrally evolving sites (hitchhikers) in the genome that is caused by the spread of a selected allele at a linked focal site, and is widely used to test for action of positive selection. Nonetheless, selective sweep may also provide an unprecedented opportunity for studying nonequilibrium properties of the neutral variation itself. We have demonstrated this possibility in relation to ancient selective sweep for modern human-specific changes and ongoing selective sweep for local population-specific changes.
Affiliation(s)
- Yoko Satta
- Department of Evolutionary Studies of Biosystems, School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies), Hayama, Kanagawa, Japan
| | - Naoko T Fujito
- Department of Evolutionary Studies of Biosystems, School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies), Hayama, Kanagawa, Japan
| | - Naoyuki Takahata
- Department of Evolutionary Studies of Biosystems, School of Advanced Sciences, SOKENDAI (The Graduate University for Advanced Studies), Hayama, Kanagawa, Japan
| |
|
23
|
Detecting Recent Positive Selection with a Single Locus Test Bipartitioning the Coalescent Tree. Genetics 2017; 208:791-805. [PMID: 29217523 DOI: 10.1534/genetics.117.300401] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 10/14/2017] [Accepted: 12/01/2017] [Indexed: 01/09/2023] Open
Abstract
Many population genomic studies have been conducted in the past to search for traces of recent events of positive selection. These traces, however, can be obscured by temporal variation of population size or other demographic factors. To reduce the confounding impact of demography, the coalescent tree topology has been used as an additional source of information for detecting recent positive selection in a population or a species. Based on the branching pattern at the root, we partition the hypothetical coalescent tree, inferred from a sequence sample, into two subtrees. The reasoning is that positive selection could impose a strong impact on branch length in one of the two subtrees while demography has the same effect on average on both subtrees. Thus, positive selection should be detectable by comparing statistics calculated for the two subtrees. Simulations demonstrate that the proposed test based on these principles has high power to detect recent positive selection even when DNA polymorphism data from only one locus is available, and that it is robust to the confounding effect of demography. One feature is that all components in the summary statistics ([Formula: see text]) can be computed analytically. Moreover, misinference of derived and ancestral alleles is seen to have only a limited effect on the test, and it therefore avoids a notorious problem when searching for traces of recent positive selection.
|