1
Översti S, Weber A, Baran V, Kieninger B, Dilthey A, Houwaart T, Walker A, Schneider-Brachert W, Kühnert D. Evolutionary and epidemic dynamics of COVID-19 in Germany exemplified by three Bayesian phylodynamic case studies. Bioinform Biol Insights 2025; 19:11779322251321065. [PMID: 40078196 PMCID: PMC11898094 DOI: 10.1177/11779322251321065] [Received: 07/08/2024] [Accepted: 01/29/2025] [Indexed: 03/14/2025] Open
Abstract
The importance of genomic surveillance strategies for pathogens has been particularly evident during the coronavirus disease 2019 (COVID-19) pandemic, as genomic data from the causative agent, severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2), have guided public health decisions worldwide. Bayesian phylodynamic inference, integrating epidemiology and evolutionary biology, has become an essential tool in genomic epidemiological surveillance. It enables the estimation of epidemiological parameters, such as the reproductive number, from pathogen sequence data alone. Despite the phylodynamic approach being widely adopted, the abundance of phylodynamic models often makes it challenging to select the appropriate model for specific research questions. This article illustrates the application of phylodynamic birth-death-sampling models in public health using genomic data, with a focus on SARS-CoV-2. Targeting researchers less familiar with phylodynamics, it introduces a comprehensive workflow, including the conceptualisation of a research study and detailed steps for data preprocessing and postprocessing. In addition, we demonstrate the versatility of birth-death-sampling models through three case studies from Germany, utilising the BEAST2 software and its model implementations. Each case study addresses a distinct research question relevant not only to SARS-CoV-2 but also to other pathogens: Case study 1 finds traces of a superspreading event at the start of an early outbreak, exemplifying how simple models for genomic data can provide information that would otherwise only be accessible through extensive contact tracing. Case study 2 compares transmission dynamics in a nosocomial outbreak to community transmission, highlighting distinct dynamics through integrative analysis. Case study 3 investigates whether local transmission patterns align with national trends, demonstrating how phylodynamic models can disentangle complex population substructure with little additional information. For each case study, we emphasise critical points where model assumptions and data properties may misalign and outline appropriate validation assessments. Overall, we aim to provide researchers with examples on using birth-death-sampling models in genomic epidemiology, balancing theoretical and practical aspects.
Affiliation(s)
- Sanni Översti
  - Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany
  - Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Ariane Weber
  - Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany
  - Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Viktor Baran
  - Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Bärbel Kieninger
  - Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Alexander Dilthey
  - Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Torsten Houwaart
  - Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Andreas Walker
  - Institute of Virology, University Hospital Düsseldorf, Düsseldorf, Germany
- Wulf Schneider-Brachert
  - Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Denise Kühnert
  - Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany
  - Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
  - Phylogenomics Unit, Centre for Artificial Intelligence in Public Health Research, Robert Koch Institute, Wildau, Germany
2
Collienne L, Barker M, Suchard MA, Matsen FA. Phylogenetic Tree Instability After Taxon Addition: Empirical Frequency, Predictability, and Consequences For Online Inference. Syst Biol 2025; 74:101-111. [PMID: 39453463 PMCID: PMC11809580 DOI: 10.1093/sysbio/syae059] [Received: 06/04/2024] [Revised: 09/30/2024] [Accepted: 10/22/2024] [Indexed: 10/26/2024] Open
Abstract
Online phylogenetic inference methods add sequentially arriving sequences to an inferred phylogeny without the need to recompute the entire tree from scratch. Some online method implementations exist already, but there remains concern that additional sequences may change the topological relationship among the original set of taxa. We call such a change in tree topology a lack of stability for the inferred tree. In this article, we analyze the stability of single taxon addition in a Maximum Likelihood framework across 1000 empirical datasets. We find that instability occurs in almost 90% of our examples, although observed topological differences do not always reach significance under the approximately unbiased (AU) test. Changes in tree topology after addition of a taxon rarely occur close to its attachment location, and are more frequently observed in more distant tree locations carrying low bootstrap support. To investigate whether instability is predictable, we hypothesize sources of instability and design summary statistics addressing these hypotheses. Using these summary statistics as input features for machine learning under random forests, we are able to predict instability and can identify the most influential features. In summary, it does not appear that a strict insertion-only online inference method will deliver globally optimal trees, although relaxing insertion strictness by allowing for a small number of final tree rearrangements or accepting slightly suboptimal solutions appears feasible.
Affiliation(s)
- Lena Collienne
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
- Mary Barker
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
- Marc A Suchard
  - Department of Human Genetics, University of California, 885 Tiverton Drive, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, University of California, 885 Tiverton Drive, Los Angeles, CA 90095, USA
  - Department of Biostatistics, University of California, 650 Charles E. Young Dr. South, Los Angeles, CA 90095, USA
- Frederick A Matsen
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
  - Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
  - Department of Statistics, University of Washington, Padelford Hall, Northeast Stevens Way, Seattle, WA 98195, USA
  - Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, WA 98195, USA
3
Schrago CG, Mello B. Challenges in Assembling the Dated Tree of Life. Genome Biol Evol 2024; 16:evae229. [PMID: 39475308 PMCID: PMC11523137 DOI: 10.1093/gbe/evae229] [Accepted: 10/15/2024] [Indexed: 11/02/2024] Open
Abstract
The assembly of a comprehensive and dated Tree of Life (ToL) remains one of the most formidable challenges in evolutionary biology. The complexity of life's history, involving both vertical and horizontal transmission of genetic information, defies its representation by a simple bifurcating phylogeny. With the advent of genome and metagenome sequencing, vast amounts of data have become available. However, employing this information for phylogeny and divergence time inference has introduced significant theoretical and computational hurdles. This perspective addresses some key methodological challenges in assembling the dated ToL, namely, the identification and classification of homologous genes, accounting for gene tree-species tree mismatch due to population-level processes along with duplication, loss, and horizontal gene transfer, and the accurate dating of evolutionary events. Ultimately, the success of this endeavor requires new approaches that integrate knowledge databases with optimized phylogenetic algorithms capable of managing complex evolutionary models.
Affiliation(s)
- Carlos G Schrago
  - Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
- Beatriz Mello
  - Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
4
Mello B, Schrago CG. Modeling Substitution Rate Evolution across Lineages and Relaxing the Molecular Clock. Genome Biol Evol 2024; 16:evae199. [PMID: 39332907 PMCID: PMC11430275 DOI: 10.1093/gbe/evae199] [Accepted: 09/08/2024] [Indexed: 09/29/2024] Open
Abstract
Relaxing the molecular clock using models of how substitution rates change across lineages has become essential for addressing evolutionary problems. The diversity of rate evolution models and their implementations are substantial, and studies have demonstrated their impact on divergence time estimates can be as significant as that of calibration information. In this review, we trace the development of rate evolution models from the proposal of the molecular clock concept to the development of sophisticated Bayesian and non-Bayesian methods that handle rate variation in phylogenies. We discuss the various approaches to modeling rate evolution, provide a comprehensive list of available software, and examine the challenges and advancements of the prevalent Bayesian framework, contrasting them to faster non-Bayesian methods. Lastly, we offer insights into potential advancements in the field in the era of big data.
Affiliation(s)
- Beatriz Mello
  - Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617, Brazil
- Carlos G Schrago
  - Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617, Brazil
5
Iglhaut C, Pečerska J, Gil M, Anisimova M. Please Mind the Gap: Indel-Aware Parsimony for Fast and Accurate Ancestral Sequence Reconstruction and Multiple Sequence Alignment Including Long Indels. Mol Biol Evol 2024; 41:msae109. [PMID: 38842253 PMCID: PMC11221656 DOI: 10.1093/molbev/msae109] [Received: 03/25/2024] [Revised: 05/30/2024] [Accepted: 06/03/2024] [Indexed: 06/07/2024] Open
Abstract
Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, including them into all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its remarkable speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.
Affiliation(s)
- Clara Iglhaut
  - Institute of Computational Life Science, Zurich University of Applied Science, Wädenswil, Switzerland
  - Faculty of Mathematics and Science, University of Zurich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Jūlija Pečerska
  - Institute of Computational Life Science, Zurich University of Applied Science, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Manuel Gil
  - Institute of Computational Life Science, Zurich University of Applied Science, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Maria Anisimova
  - Institute of Computational Life Science, Zurich University of Applied Science, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
6
de Oliveira Martins L, Mather AE, Page AJ. Scalable neighbour search and alignment with uvaia. PeerJ 2024; 12:e16890. [PMID: 38464752 PMCID: PMC10924453 DOI: 10.7717/peerj.16890] [Received: 03/07/2023] [Accepted: 01/15/2024] [Indexed: 03/12/2024] Open
Abstract
Despite millions of SARS-CoV-2 genomes being sequenced and shared globally, manipulating such data sets is still challenging, especially selecting sequences for focused phylogenetic analysis. We present a novel method, uvaia, which is based on partial and exact sequence similarity for quickly extracting database sequences similar to query sequences of interest. Many SARS-CoV-2 phylogenetic analyses rely on very low numbers of ambiguous sites as a measure of quality since ambiguous sites do not contribute to single nucleotide polymorphism (SNP) differences. Uvaia overcomes this limitation by using measures of sequence similarity which consider partially ambiguous sites, allowing for more ambiguous sequences to be included in the analysis if needed. Such fine-grained definition of similarity allows not only for better phylogenetic analyses, but could also lead to improved classification and biogeographical inferences. Uvaia works natively with compressed files, can use multiple cores and efficiently utilises memory, being able to analyse large data sets on a standard desktop.
Affiliation(s)
- Alison E. Mather
  - Quadram Institute Bioscience, Norwich, United Kingdom
  - University of East Anglia, Norwich, United Kingdom
7
de Bernardi Schneider A, Su M, Hinrichs AS, Wang J, Amin H, Bell J, Wadford DA, O’Toole Á, Scher E, Perry MD, Turakhia Y, De Maio N, Hughes S, Corbett-Detig R. SARS-CoV-2 lineage assignments using phylogenetic placement/UShER are superior to pangoLEARN machine-learning method. Virus Evol 2024; 10:vead085. [PMID: 38361813 PMCID: PMC10868549 DOI: 10.1093/ve/vead085] [Received: 06/05/2023] [Revised: 12/13/2023] [Accepted: 01/05/2024] [Indexed: 02/17/2024] Open
Abstract
With the rapid spread and evolution of SARS-CoV-2, the ability to monitor its transmission and distinguish among viral lineages is critical for pandemic response efforts. The most commonly used software for the lineage assignment of newly isolated SARS-CoV-2 genomes is pangolin, which offers two methods of assignment, pangoLEARN and pUShER. PangoLEARN rapidly assigns lineages using a machine-learning algorithm, while pUShER performs a phylogenetic placement to identify the lineage corresponding to a newly sequenced genome. In a preliminary study, we observed that pangoLEARN (decision tree model), while substantially faster than pUShER, offered less consistency across different versions of pangolin v3. Here, we expand upon this analysis to include v3 and v4 of pangolin, which moved the default algorithm for lineage assignment from pangoLEARN in v3 to pUShER in v4, and perform a thorough analysis confirming that pUShER is not only more stable across versions but also more accurate. Our findings suggest that future lineage assignment algorithms for various pathogens should consider the value of phylogenetic placement.
Affiliation(s)
- Adriano de Bernardi Schneider
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Michelle Su
  - Department of Health and Mental Hygiene, New York City Public Health Laboratory, New York, NY 10016, USA
- Angie S Hinrichs
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Jade Wang
  - Department of Health and Mental Hygiene, New York City Public Health Laboratory, New York, NY 10016, USA
- Helly Amin
  - Department of Health and Mental Hygiene, New York City Public Health Laboratory, New York, NY 10016, USA
- John Bell
  - California Department of Public Health (CDPH), VRDL/COVIDNet, Richmond, CA 94804, USA
- Debra A Wadford
  - California Department of Public Health (CDPH), VRDL/COVIDNet, Richmond, CA 94804, USA
- Áine O’Toole
  - Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3FL, UK
- Emily Scher
  - Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3FL, UK
- Marc D Perry
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Yatish Turakhia
  - Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92093, USA
- Nicola De Maio
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton CB10 1SD, UK
- Scott Hughes
  - Department of Health and Mental Hygiene, New York City Public Health Laboratory, New York, NY 10016, USA
- Russ Corbett-Detig
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
8
Truszkowski J, Perrigo A, Broman D, Ronquist F, Antonelli A. Online tree expansion could help solve the problem of scalability in Bayesian phylogenetics. Syst Biol 2023; 72:1199-1206. [PMID: 37498209 PMCID: PMC10627553 DOI: 10.1093/sysbio/syad045] [Received: 10/20/2022] [Revised: 06/22/2023] [Accepted: 07/11/2023] [Indexed: 07/28/2023] Open
Abstract
Bayesian phylogenetics is now facing a critical point. Over the last 20 years, Bayesian methods have reshaped phylogenetic inference and gained widespread popularity due to their high accuracy, the ability to quantify the uncertainty of inferences and the possibility of accommodating multiple aspects of evolutionary processes in the models that are used. Unfortunately, Bayesian methods are computationally expensive, and typical applications involve at most a few hundred sequences. This is problematic in the age of rapidly expanding genomic data and increasing scope of evolutionary analyses, forcing researchers to resort to less accurate but faster methods, such as maximum parsimony and maximum likelihood. Does this spell doom for Bayesian methods? Not necessarily. Here, we discuss some recently proposed approaches that could help scale up Bayesian analyses of evolutionary problems considerably. We focus on two particular aspects: online phylogenetics, where new data sequences are added to existing analyses, and alternatives to Markov chain Monte Carlo (MCMC) for scalable Bayesian inference. We identify 5 specific challenges and discuss how they might be overcome. We believe that online phylogenetic approaches and Sequential Monte Carlo hold great promise and could potentially speed up tree inference by orders of magnitude. We call for collaborative efforts to speed up the development of methods for real-time tree expansion through online phylogenetics.
Affiliation(s)
- Jakub Truszkowski
  - Department of Biological and Environmental Sciences, University of Gothenburg, P. O. Box 461, SE.405 30 Gothenburg, Sweden
  - Gothenburg Global Biodiversity Centre, Box 461, 405 30 Gothenburg, Sweden
- Allison Perrigo
  - Department of Biological and Environmental Sciences, University of Gothenburg, P. O. Box 461, SE.405 30 Gothenburg, Sweden
  - Gothenburg Global Biodiversity Centre, Box 461, 405 30 Gothenburg, Sweden
- David Broman
  - Department of Computer Science and Digital Futures, KTH Royal Institute of Technology, SE.100 44 Stockholm, Sweden
- Fredrik Ronquist
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, P. O. Box 50007, SE.104 05 Stockholm, Sweden
- Alexandre Antonelli
  - Department of Biological and Environmental Sciences, University of Gothenburg, P. O. Box 461, SE.405 30 Gothenburg, Sweden
  - Gothenburg Global Biodiversity Centre, Box 461, 405 30 Gothenburg, Sweden
  - Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AE, UK
  - Department of Plant Sciences, University of Oxford, South Parks Road, Oxford OX1 3RB, UK