1
Celentano M, DeWitt WS, Prillo S, Song YS. Exact and efficient phylodynamic simulation from arbitrarily large populations. Proc Natl Acad Sci U S A 2025; 122:e2412978122. [PMID: 40366686 DOI: 10.1073/pnas.2412978122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Accepted: 04/15/2025] [Indexed: 05/15/2025] Open
Abstract
Many biological studies involve inferring the evolutionary history of a sample of individuals from a large population and interpreting the reconstructed tree. Such an ascertained tree typically represents only a small part of a comprehensive population tree and is distorted by survivorship and sampling biases. Inferring evolutionary parameters from ascertained trees requires modeling both the underlying population dynamics and the ascertainment process. A crucial component of this phylodynamic modeling involves tree simulation, which is used to benchmark probabilistic inference methods. To simulate an ascertained tree, one must first simulate the full population tree and then prune unobserved lineages. Consequently, the computational cost is determined not by the size of the final simulated tree, but by the size of the population tree in which it is embedded. In most biological scenarios, simulations of the entire population are prohibitively expensive due to computational demands placed on lineages without sampled descendants. Here, we address this challenge by proving that, for any partially ascertained process from a general multitype birth-death-mutation-sampling model, there exists an equivalent process with complete sampling and no death, a property which we leverage to develop a highly efficient algorithm for simulating trees. Our algorithm scales linearly with the size of the final simulated tree and is independent of the population size, enabling simulations from extremely large populations beyond the reach of current methods but essential for various biological applications. We anticipate that this massive speedup will significantly advance the development of novel inference methods that require extensive training data.
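The cost argument in this abstract can be illustrated with a minimal sketch of the naive baseline it describes: simulating every lineage in the population forward in time, whether or not it is ever sampled. This is not the authors' algorithm. The function name `simulate_population_counts`, its parameters, the single-type constant-rate model, and the choice to remove a lineage when it is sampled are all assumptions made for illustration. The sketch only counts events and does not build or prune a tree.

```python
import random


def simulate_population_counts(birth, death, sampling, t_max, seed=0,
                               max_events=1_000_000):
    """Naive forward simulation of a single-type birth-death-sampling process.

    Hypothetical helper for illustration only. Starts from one lineage and
    returns (events, sampled, alive): the number of simulated events, the
    number of sampled lineages, and the lineages still alive when the
    simulation stops.
    """
    per_lineage_rate = birth + death + sampling
    if per_lineage_rate <= 0:
        raise ValueError("at least one rate must be positive")

    rng = random.Random(seed)
    t = 0.0
    alive = 1
    events = 0
    sampled = 0

    while alive > 0 and events < max_events:
        # Waiting time to the next event anywhere in the population.
        t += rng.expovariate(alive * per_lineage_rate)
        if t > t_max:
            break
        events += 1
        u = rng.random() * per_lineage_rate
        if u < birth:
            alive += 1          # one lineage splits in two
        elif u < birth + death:
            alive -= 1          # one lineage dies unobserved
        else:
            sampled += 1        # one lineage is sampled...
            alive -= 1          # ...and removed (an assumption of this sketch)

    return events, sampled, alive


if __name__ == "__main__":
    events, sampled, alive = simulate_population_counts(
        birth=2.0, death=1.0, sampling=0.1, t_max=3.0, seed=42)
    print(f"events={events} sampled={sampled} alive={alive}")
```

In this sketch the work done is proportional to `events`, which grows with the whole population, while an ascertained tree would contain only the `sampled` lineages and their ancestors. That gap is the cost the abstract says the equivalent complete-sampling, no-death process avoids.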
Affiliation(s)
- Michael Celentano: Department of Statistics, University of California, Berkeley, CA 94720
- William S DeWitt: Computer Science Division, University of California, Berkeley, CA 94720
- Sebastian Prillo: Computer Science Division, University of California, Berkeley, CA 94720
- Yun S Song: Department of Statistics, University of California, Berkeley, CA 94720; Computer Science Division, University of California, Berkeley, CA 94720
2
Översti S, Weber A, Baran V, Kieninger B, Dilthey A, Houwaart T, Walker A, Schneider-Brachert W, Kühnert D. Evolutionary and epidemic dynamics of COVID-19 in Germany exemplified by three Bayesian phylodynamic case studies. Bioinform Biol Insights 2025; 19:11779322251321065. [PMID: 40078196 PMCID: PMC11898094 DOI: 10.1177/11779322251321065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Accepted: 01/29/2025] [Indexed: 03/14/2025] Open
Abstract
The importance of genomic surveillance strategies for pathogens has been particularly evident during the coronavirus disease 2019 (COVID-19) pandemic, as genomic data from the causative agent, severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2), have guided public health decisions worldwide. Bayesian phylodynamic inference, integrating epidemiology and evolutionary biology, has become an essential tool in genomic epidemiological surveillance. It enables the estimation of epidemiological parameters, such as the reproductive number, from pathogen sequence data alone. Despite the phylodynamic approach being widely adopted, the abundance of phylodynamic models often makes it challenging to select the appropriate model for specific research questions. This article illustrates the application of phylodynamic birth-death-sampling models in public health using genomic data, with a focus on SARS-CoV-2. Targeting researchers less familiar with phylodynamics, it introduces a comprehensive workflow, including the conceptualisation of a research study and detailed steps for data preprocessing and postprocessing. In addition, we demonstrate the versatility of birth-death-sampling models through three case studies from Germany, utilising the BEAST2 software and its model implementations. Each case study addresses a distinct research question relevant not only to SARS-CoV-2 but also to other pathogens: Case study 1 finds traces of a superspreading event at the start of an early outbreak, exemplifying how simple models for genomic data can provide information that would otherwise only be accessible through extensive contact tracing. Case study 2 compares transmission dynamics in a nosocomial outbreak to community transmission, highlighting distinct dynamics through integrative analysis. 
Case study 3 investigates whether local transmission patterns align with national trends, demonstrating how phylodynamic models can disentangle complex population substructure with little additional information. For each case study, we emphasise critical points where model assumptions and data properties may misalign and outline appropriate validation assessments. Overall, we aim to provide researchers with examples on using birth-death-sampling models in genomic epidemiology, balancing theoretical and practical aspects.
Affiliation(s)
- Sanni Översti: Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany; Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Ariane Weber: Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany; Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Viktor Baran: Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Bärbel Kieninger: Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Alexander Dilthey: Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany; Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Torsten Houwaart: Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Andreas Walker: Institute of Virology, University Hospital Düsseldorf, Düsseldorf, Germany
- Wulf Schneider-Brachert: Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Denise Kühnert: Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany; Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany; Phylogenomics Unit, Centre for Artificial Intelligence in Public Health Research, Robert Koch Institute, Wildau, Germany
3
Darlim G, Höhna S. The effects of cryptic diversity on diversification dynamics analyses in Crocodylia. Proc Biol Sci 2025; 292:20250091. [PMID: 40101764 PMCID: PMC11919527 DOI: 10.1098/rspb.2025.0091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2025] [Revised: 02/19/2025] [Accepted: 02/19/2025] [Indexed: 03/20/2025] Open
Abstract
Incomplete taxon sampling due to underestimation of present-day biodiversity biases diversification analysis by favouring slowdowns in speciation rates towards the recent time. For instance, in diversification dynamics studies in Crocodylia, long-term low net-diversification rates and slowdowns in speciation rates have been suggested to characterize crocodylian evolution. However, crocodylian cryptic diversity has never been considered. Here, we explore the effects of incorporating cryptic diversity into a diversification dynamics analysis of extant crocodylians. We inferred a time-calibrated cryptic-species-level phylogeny using cytochrome b sequences of 45 lineages compared with the formally recognized 26 crocodylian species. Diversification rate estimates using the cryptic-species-level phylogeny show increasing speciation and net-diversification rates towards the present time, which contrasts with previous findings. Cryptic diversity should be considered in future macroevolutionary analyses; however, the representation of cryptic extinct taxa represents a major challenge. Additionally, further investigation of crocodylian diversification dynamics under different underlying genomic data is encouraged upon advances in population genetics. Our case study adds to the diversification dynamics knowledge of extant taxa and demonstrates that cryptic species and robust taxonomic assessment are essential to study recent biodiversity dynamics with broad implications for evolutionary biology and ecology.
Affiliation(s)
- Gustavo Darlim: GeoBio-Center LMU, Ludwig-Maximilians-Universität München, Munich, Germany; Department of Earth and Environmental Sciences, Palaeontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
- Sebastian Höhna: GeoBio-Center LMU, Ludwig-Maximilians-Universität München, Munich, Germany; Department of Earth and Environmental Sciences, Palaeontology and Geobiology, Ludwig-Maximilians-Universität München, Munich, Germany
4
Rannala B, Yang Z. Reading tree leaves: inferring speciation and extinction processes using phylogenies. Philos Trans R Soc Lond B Biol Sci 2025; 380:20230309. [PMID: 39976406 PMCID: PMC11867106 DOI: 10.1098/rstb.2023.0309] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/18/2024] [Revised: 09/21/2024] [Accepted: 10/14/2024] [Indexed: 02/21/2025] Open
Abstract
The birth-death process (BDP) is widely used in evolutionary biology as a model for generating phylogenetic trees of species. The generalized birth-death process (GBDP) allows rate variation over time, with speciation and extinction rates to be arbitrary functions of time. Here we review the probability theory underpinning the GBDP as a model of cladogenesis and recent findings concerning its identifiability. The GBDP with arbitrary continuous rate functions has been shown to be non-identifiable from lineage-through-time data: even with species phylogenies of infinite size the parameters cannot be estimated. However, a restricted class of BDPs with piecewise-constant rates has been shown to be identifiable. We review and illustrate these results using simple examples and discuss their implications for biologists interested in inferring the past tempo and mode of evolution using reconstructed phylogenetic trees. This article is part of the theme issue '"A mathematical theory of evolution": phylogenetic models dating back 100 years'.
Affiliation(s)
- Bruce Rannala: Department of Evolution and Ecology, University of California, Davis, CA 95616, USA
- Ziheng Yang: Department of Genetics, Evolution, and Environment, University College London, London WC1E 6BT, UK
5
Truman K, Vaughan TG, Gavryushkin A, Gavryushkina A. The Fossilized Birth-Death Model Is Identifiable. Syst Biol 2025; 74:112-123. [PMID: 39436077 PMCID: PMC11997801 DOI: 10.1093/sysbio/syae058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2024] [Accepted: 10/13/2024] [Indexed: 10/23/2024] Open
Abstract
Time-dependent birth-death sampling models have been used in numerous studies to infer past evolutionary dynamics in different biological contexts, for example, speciation and extinction rates in macroevolutionary studies, or effective reproductive number in epidemiological studies. These models are branching processes where lineages can bifurcate, die, or be sampled with time-dependent birth, death, and sampling rates, generating phylogenetic trees. It has been shown that in some subclasses of such models, different sets of rates can result in the same distributions of reconstructed phylogenetic trees, and therefore, the rates become unidentifiable from the trees regardless of their size. Here, we show that widely used time-dependent fossilized birth-death (FBD) models are identifiable. This subclass of models makes more realistic assumptions about the fossilization process and certain infectious disease transmission processes than the unidentifiable birth-death sampling models. Namely, FBD models assume that sampled lineages stay in the process rather than being immediately removed upon sampling. The identifiability of the time-dependent FBD model justifies using statistical methods that implement this model to infer the underlying temporal diversification or epidemiological dynamics from phylogenetic trees or directly from molecular or other comparative data. We further show that the time-dependent FBD model with an extra parameter, the removal after sampling probability, is unidentifiable. This implies that in scenarios where we do not know how sampling affects lineages, we are unable to infer this extra parameter together with birth, death, and sampling rates solely from trees.
Affiliation(s)
- Kate Truman: Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand; Biomathematics Research Centre, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand
- Timothy G Vaughan: Department of Biosystems Science and Engineering, ETH Zurich, Schanzenstrasse 44, Postfach 4009, Basel 9, Switzerland; Swiss Institute of Bioinformatics, Quartier Sorge - Bâtiment Amphipôle, Lausanne 1015, Switzerland
- Alex Gavryushkin: Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand; Biomathematics Research Centre, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand
- Alexandra “Sasha” Gavryushkina: Biological Data Science Laboratory, School of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand; Biomathematics Research Centre, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand
6
Soewongsono AC, Landis MJ. A Diffusion-Based Approach for Simulating Forward-in-Time State-Dependent Speciation and Extinction Dynamics. Bull Math Biol 2024; 86:101. [PMID: 38970749 DOI: 10.1007/s11538-024-01337-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 06/27/2024] [Indexed: 07/08/2024]
Abstract
We establish a general framework using a diffusion approximation to simulate forward-in-time state counts or frequencies for cladogenetic state-dependent speciation-extinction (ClaSSE) models. We apply the framework to various two- and three-region geographic-state speciation-extinction (GeoSSE) models. We show that the species range state dynamics simulated under tree-based and diffusion-based processes are comparable. We derive a method to infer rate parameters that are compatible with given observed stationary state frequencies and obtain an analytical result to compute stationary state frequencies for a given set of rate parameters. We also describe a procedure to find the time to reach the stationary frequencies of a ClaSSE model using our diffusion-based approach, which we demonstrate using a worked example for a two-region GeoSSE model. Finally, we discuss how the diffusion framework can be applied to formalize relationships between evolutionary patterns and processes under state-dependent diversification scenarios.
Affiliation(s)
- Albert C Soewongsono: Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO, 63130, USA
- Michael J Landis: Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO, 63130, USA
7
Zhang C, Ronquist F, Stadler T. Skyline Fossilized Birth-Death Model is Robust to Violations of Sampling Assumptions in Total-Evidence Dating. Syst Biol 2023; 72:1316-1336. [PMID: 37605524 DOI: 10.1093/sysbio/syad054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Revised: 08/07/2023] [Accepted: 08/15/2023] [Indexed: 08/23/2023] Open
Abstract
Several total-evidence dating studies under the fossilized birth-death (FBD) model have produced very old age estimates, which are not supported by the fossil record. This phenomenon has been termed "deep root attraction (DRA)." For two specific data sets, involving divergence time estimation for the early radiations of ants, bees, and wasps (Hymenoptera) and of placental mammals (Eutheria), it has been shown that the DRA effect can be greatly reduced by accommodating the fact that extant species in these trees have been sampled to maximize diversity, so-called diversified sampling. Unfortunately, current methods to accommodate diversified sampling only consider the extreme case where it is possible to identify a cut-off time such that all splits occurring before this time are represented in the sampled tree but none of the younger splits. In reality, the sampling bias is rarely this extreme and may be difficult to model properly. Similar modeling challenges apply to the sampling of the fossil record. This raises the question of whether it is possible to find dating methods that are more robust to sampling biases. Here, we show that the skyline FBD (SFBD) process, where the diversification and fossil-sampling rates can vary over time in a piecewise fashion, provides age estimates that are more robust to inadequacies in the modeling of the sampling process and less sensitive to DRA effects. In the SFBD model we consider, rates in different time intervals are either considered to be independent and identically distributed or assumed to be autocorrelated following an Ornstein-Uhlenbeck (OU) process. Through simulations and reanalyses of Hymenoptera and Eutheria data, we show that both variants of the SFBD model unify age estimates under random and diversified sampling assumptions. The SFBD model can resolve DRA by absorbing the deviations from the sampling assumptions into the inferred dynamics of the diversification process over time. 
Although this means that the inferred diversification dynamics must be interpreted with caution, taking sampling biases into account, we conclude that the SFBD model represents the most robust approach currently available for addressing DRA in total-evidence dating.
Affiliation(s)
- Chi Zhang: Key Laboratory of Vertebrate Evolution and Human Origins, Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences, Beijing 100044, China
- Fredrik Ronquist: Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405 Stockholm, Sweden
- Tanja Stadler: Department of Biosystems Science and Engineering, Eidgenössische Technische Hochschule Zürich, 4058 Basel, Switzerland; Swiss Institute of Bioinformatics (SIB), 1015 Lausanne, Switzerland