1
Zhu Y, Li Y, Li C, Shen XX, Zhou X. A critical evaluation of deep-learning based phylogenetic inference programs using simulated datasets. J Genet Genomics 2025; 52:714-717. [PMID: 39824436 DOI: 10.1016/j.jgg.2025.01.006] [Received: 11/02/2024] [Revised: 01/08/2025] [Accepted: 01/09/2025] [Indexed: 01/20/2025]
Affiliation(s)
- Yixiao Zhu
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China
- Yonglin Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
- Chuhao Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China
- Xing-Xing Shen
- College of Agriculture and Biotechnology and Centre for Evolutionary & Organismal Biology, Zhejiang University, Hangzhou, Zhejiang 310058, China.
- Xiaofan Zhou
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, Guangdong 510642, China.
2
Moreno MA, Rodriguez-Papa S, Dolson E. Ecology, Spatial Structure, and Selection Pressure Induce Strong Signatures in Phylogenetic Structure. Artificial Life 2025; 31:129-152. [PMID: 40298478 DOI: 10.1162/artl_a_00470] [Indexed: 04/30/2025]
Abstract
Evolutionary dynamics are shaped by a variety of fundamental, generic drivers, including spatial structure, ecology, and selection pressure. These drivers impact the trajectory of evolution and have been hypothesized to influence phylogenetic structure. For instance, they can help explain natural history, steer behavior of contemporary evolving populations, and influence the efficacy of application-oriented evolutionary optimization. Likewise, in inquiry-oriented Artificial Life systems, these drivers constitute key building blocks for open-ended evolution. Here we set out to assess (a) if spatial structure, ecology, and selection pressure leave detectable signatures in phylogenetic structure; (b) the extent, in particular, to which ecology can be detected and discerned in the presence of spatial structure; and (c) the extent to which these phylogenetic signatures generalize across evolutionary systems. To this end, we analyze phylogenies generated by manipulating spatial structure, ecology, and selection pressure within three computational models of varied scope and sophistication. We find that selection pressure, spatial structure, and ecology have characteristic effects on phylogenetic metrics, although these effects are complex and not always intuitive. Signatures have some consistency across systems when using equivalent taxonomic unit definitions (e.g., individual, genotype, species). Furthermore, we find that sufficiently strong ecology can be detected in the presence of spatial structure. We also find that, while low-resolution phylogenetic reconstructions can bias some phylogenetic metrics, high-resolution reconstructions recapitulate them faithfully. 
Although our results suggest a potential for evolutionary inference of spatial structure, ecology, and selection pressure through phylogenetic analysis, further methods development is needed to distinguish these drivers' phylometric signatures from each other and to appropriately normalize phylogenetic metrics. With such work, phylogenetic analysis could provide a versatile tool kit with which to study large-scale, evolving populations.
Affiliation(s)
- Matthew Andres Moreno
- University of Michigan, Department of Ecology and Evolutionary Biology, Center for the Study of Complex Systems, Michigan Institute for Data and AI in Society.
- Emily Dolson
- Michigan State University, Department of Computer Science and Engineering, Program in Ecology, Evolution, and Behavior
3
Penn MJ, Scheidwasser N, Khurana MP, Duchêne DA, Donnelly CA, Bhatt S. Phylo2Vec: A Vector Representation for Binary Trees. Syst Biol 2025; 74:250-266. [PMID: 38935520 PMCID: PMC11958935 DOI: 10.1093/sysbio/syae030] [Received: 05/08/2023] [Revised: 06/16/2024] [Accepted: 06/26/2024] [Indexed: 06/29/2024] Open
Abstract
Binary phylogenetic trees inferred from biological data are central to understanding the shared history among evolutionary units. However, inferring the placement of latent nodes in a tree is computationally expensive. State-of-the-art methods rely on carefully designed heuristics for tree search, using different data structures for easy manipulation (e.g., classes in object-oriented programming languages) and readable representation of trees (e.g., Newick-format strings). Here, we present Phylo2Vec, a parsimonious encoding for phylogenetic trees that serves as a unified approach for both manipulating and representing phylogenetic trees. Phylo2Vec maps any binary tree with n leaves to a unique integer vector of length n-1. The advantages of Phylo2Vec are 4-fold: (i) fast tree sampling, (ii) compressed tree representation compared to a Newick string, (iii) quick and unambiguous verification if 2 binary trees are identical topologically, and (iv) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for ML inference on 5 real-world datasets and show that a simple hill-climbing-based optimization scheme can efficiently traverse the vastness of tree space from a random to an optimal tree.
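As a rough sanity check on the claim that an integer vector of length n-1 can index every binary tree, the sketch below compares two counts. The first is the number of rooted, leaf-labelled binary trees on n leaves, which is the double factorial (2n-3)!!. The second is the number of integer vectors of length n-1 whose i-th entry (0-based) lies in 0..2i. That per-entry range is an assumption made here for illustration and is not stated in the abstract; the function names are hypothetical, and this is not the Phylo2Vec implementation.

```python
from itertools import product
from math import prod


def count_rooted_binary_trees(n_leaves: int) -> int:
    """Number of rooted, leaf-labelled binary trees: (2n-3)!! for n >= 2."""
    # Product of the odd numbers 1, 3, ..., 2n-3.
    return prod(range(1, 2 * n_leaves - 2, 2))


def count_constrained_vectors(n_leaves: int) -> int:
    """Count vectors of length n-1 with entry i in 0..2i (assumed constraint)."""
    # Enumerate explicitly rather than multiplying, so the count is independent
    # of the closed-form formula above.
    ranges = [range(2 * i + 1) for i in range(n_leaves - 1)]
    return sum(1 for _ in product(*ranges))


if __name__ == "__main__":
    for n in range(2, 8):
        print(n, count_rooted_binary_trees(n), count_constrained_vectors(n))
```

For n = 4 both counts are 15, which is consistent with a one-to-one encoding under the assumed constraint.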
Affiliation(s)
- Matthew J Penn
- Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, UK
- Neil Scheidwasser
- Department of Public Health, Section of Epidemiology, University of Copenhagen, Øster Farimagsgade 5, build. 24 Q, 1st floor, 1353 København K, Denmark
- Mark P Khurana
- Department of Public Health, Section of Epidemiology, University of Copenhagen, Øster Farimagsgade 5, build. 24 Q, 1st floor, 1353 København K, Denmark
- David A Duchêne
- Department of Public Health, Section of Epidemiology, University of Copenhagen, Øster Farimagsgade 5, build. 24 Q, 1st floor, 1353 København K, Denmark
- Christl A Donnelly
- Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, UK
- Nuffield Department of Medicine, Pandemic Sciences Institute, University of Oxford, Old Road Campus Research Building, Old Road Campus, Roosevelt Drive, Oxford OX3 7DQ, UK
- Samir Bhatt
- Department of Public Health, Section of Epidemiology, University of Copenhagen, Øster Farimagsgade 5, build. 24 Q, 1st floor, 1353 København K, Denmark
- Department of Infectious Disease Epidemiology, School of Public Health, Faculty of Medicine, MRC Centre for Global Infectious Disease Analysis, Imperial College London, Level 2, Faculty Building, South Kensington Campus, London SW7 2AZ, UK
4
Översti S, Weber A, Baran V, Kieninger B, Dilthey A, Houwaart T, Walker A, Schneider-Brachert W, Kühnert D. Evolutionary and epidemic dynamics of COVID-19 in Germany exemplified by three Bayesian phylodynamic case studies. Bioinform Biol Insights 2025; 19:11779322251321065. [PMID: 40078196 PMCID: PMC11898094 DOI: 10.1177/11779322251321065] [Received: 07/08/2024] [Accepted: 01/29/2025] [Indexed: 03/14/2025] Open
Abstract
The importance of genomic surveillance strategies for pathogens has been particularly evident during the coronavirus disease 2019 (COVID-19) pandemic, as genomic data from the causative agent, severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2), have guided public health decisions worldwide. Bayesian phylodynamic inference, integrating epidemiology and evolutionary biology, has become an essential tool in genomic epidemiological surveillance. It enables the estimation of epidemiological parameters, such as the reproductive number, from pathogen sequence data alone. Despite the phylodynamic approach being widely adopted, the abundance of phylodynamic models often makes it challenging to select the appropriate model for specific research questions. This article illustrates the application of phylodynamic birth-death-sampling models in public health using genomic data, with a focus on SARS-CoV-2. Targeting researchers less familiar with phylodynamics, it introduces a comprehensive workflow, including the conceptualisation of a research study and detailed steps for data preprocessing and postprocessing. In addition, we demonstrate the versatility of birth-death-sampling models through three case studies from Germany, utilising the BEAST2 software and its model implementations. Each case study addresses a distinct research question relevant not only to SARS-CoV-2 but also to other pathogens: Case study 1 finds traces of a superspreading event at the start of an early outbreak, exemplifying how simple models for genomic data can provide information that would otherwise only be accessible through extensive contact tracing. Case study 2 compares transmission dynamics in a nosocomial outbreak to community transmission, highlighting distinct dynamics through integrative analysis. 
Case study 3 investigates whether local transmission patterns align with national trends, demonstrating how phylodynamic models can disentangle complex population substructure with little additional information. For each case study, we emphasise critical points where model assumptions and data properties may misalign and outline appropriate validation assessments. Overall, we aim to provide researchers with examples on using birth-death-sampling models in genomic epidemiology, balancing theoretical and practical aspects.
Affiliation(s)
- Sanni Översti
- Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany
- Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Ariane Weber
- Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany
- Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Viktor Baran
- Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Bärbel Kieninger
- Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Alexander Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Torsten Houwaart
- Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Andreas Walker
- Institute of Virology, University Hospital Düsseldorf, Düsseldorf, Germany
- Wulf Schneider-Brachert
- Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- Denise Kühnert
- Transmission, Infection, Diversification & Evolution Group (tide), Max Planck Institute of Geoanthropology, Jena, Germany
- Department of Archaeogenetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Phylogenomics Unit, Centre for Artificial Intelligence in Public Health Research, Robert Koch Institute, Wildau, Germany
5
Loo EPI, Szurek B, Arra Y, Stiebner M, Buchholzer M, Devanna BN, Vera Cruz CM, Frommer WB. Closing the Information Gap Between the Field and Scientific Literature for Improved Disease Management, with a Focus on Rice and Bacterial Blight. Molecular Plant-Microbe Interactions (MPMI) 2025; 38:134-141. [PMID: 39186001 DOI: 10.1094/mpmi-07-24-0075-fi] [Indexed: 08/27/2024]
Abstract
A path to sustainably reduce world hunger, food insecurity, and malnutrition is to close the crop yield gap and, particularly, lower losses due to pathogens. Breeding resistant crops is key to achieving this goal, which is an effort requiring collaboration among stakeholders, scientists, breeders, farmers, and policymakers. During a disease outbreak, epidemiologists survey the occurrence of a disease after which pathologists investigate mechanisms to stop an infection. Policymakers then implement strategies with farmers and breeders to overcome the outbreak. Information flow from the field to the lab and back to the field involves several processing hubs that require different information inputs. Failure to communicate the necessary information results in the transfer of meaningless data. Here, we discuss gaps in information acquisition and transfer between the field and laboratory. Using rice bacterial blight disease as an example, we discuss pathogen biology and disease resistance to point out the importance of reporting pathogen strains that caused an outbreak to optimize the deployment of resistant crop varieties. We examine differences between infection in the field and assays performed in the laboratory to draw awareness of possible misinformation concerning plant resistance or susceptibility. We discuss key data considered useful for reporting disease outbreaks, sampling bias, and suggestions for improving data quality. We also touch on the knowledge gap in the state-of-the-art literature regarding disease dispersal and transmission. We use a recent case study to exemplify the gaps mentioned. We conclude by highlighting potential actions that may contribute to food security and to closing the yield gap. Copyright © 2025 The Author(s). This is an open access article distributed under the CC BY 4.0 International license.
Affiliation(s)
- Eliza P I Loo
- Faculty of Mathematics and Natural Sciences, Institute for Molecular Physiology, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, 40225 Düsseldorf, Germany
- Boris Szurek
- Plant Health Institute of Montpellier (PHIM), Université Montpellier, IRD, CIRAD, INRAE, Institut Agro, Montpellier, France
- Yugander Arra
- Faculty of Mathematics and Natural Sciences, Institute for Molecular Physiology, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, 40225 Düsseldorf, Germany
- Melissa Stiebner
- Faculty of Mathematics and Natural Sciences, Institute for Molecular Physiology, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, 40225 Düsseldorf, Germany
- Marcel Buchholzer
- Faculty of Mathematics and Natural Sciences, Institute for Molecular Physiology, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, 40225 Düsseldorf, Germany
- B N Devanna
- Faculty of Mathematics and Natural Sciences, Institute for Molecular Physiology, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, 40225 Düsseldorf, Germany
- ICAR-National Rice Research Institute, Cuttack, Odisha, India
- Wolf B Frommer
- Faculty of Mathematics and Natural Sciences, Institute for Molecular Physiology, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, 40225 Düsseldorf, Germany
- Institute for Transformative Biomolecules, ITbM, Nagoya University, Nagoya, Japan
6
Landis MJ, Thompson A. phyddle: software for exploring phylogenetic models with deep learning. bioRxiv 2025:2024.08.06.606717. [PMID: 39149349 PMCID: PMC11326143 DOI: 10.1101/2024.08.06.606717] [Indexed: 08/17/2024]
Abstract
Phylogenies contain a wealth of information about the evolutionary history and process that gave rise to the diversity of life. This information can be extracted by fitting phylogenetic models to trees. However, many realistic phylogenetic models lack tractable likelihood functions, prohibiting their use with standard inference methods. We present phyddle, pipeline-based software for performing phylogenetic modeling tasks on trees using likelihood-free deep learning approaches. phyddle has a flexible command-line interface, making it easy to integrate deep learning approaches for phylogenetics into research workflows. phyddle coordinates modeling tasks through five pipeline analysis steps (Simulate, Format, Train, Estimate, and Plot) that transform raw phylogenetic datasets as input into numerical and visual model-based output. We conduct three experiments to compare the accuracy of likelihood-based inferences against deep learning-based inferences obtained through phyddle. Benchmarks show that phyddle accurately performs the inference tasks for which it was designed, such as estimating macroevolutionary parameters, selecting among continuous trait evolution models, and passing coverage tests for epidemiological models, even for models that lack tractable likelihoods. Learn more about phyddle at https://phyddle.org.
Affiliation(s)
- Michael J. Landis
- Department of Biology, Washington University, St. Louis, MO, 63110, USA
- Ammon Thompson
- Participant in an education program sponsored by U.S. Department of Defense (DOD)
7
Chauve C, Colijn C, Zhang L. A vector representation for phylogenetic trees. Philos Trans R Soc Lond B Biol Sci 2025; 380:20240226. [PMID: 39976399 PMCID: PMC11867187 DOI: 10.1098/rstb.2024.0226] [Received: 05/13/2024] [Revised: 08/02/2024] [Accepted: 09/05/2024] [Indexed: 02/21/2025] Open
Abstract
Good representations for phylogenetic trees and networks are important for enhancing storage efficiency and scalability for the inference and analysis of evolutionary trees for genes, genomes and species. We propose a new representation for rooted phylogenetic trees that encodes a tree on [Formula: see text] ordered taxa as a vector of length [Formula: see text] in which each taxon appears exactly twice. Using this new tree representation, we introduce a novel tree rearrangement operator, termed an HOP, that results in a tree space of linear diameter and quadratic neighbourhood size. We also introduce a novel metric, the HOP distance, which is the minimum number of HOPs to transform a tree into another tree. The HOP distance can be computed in near-linear time, a rare instance of tree rearrangement distance that is tractable. Our experiments show that the HOP distance is better correlated to the Subtree-Prune-and-Regraft distance than the widely used Robinson-Foulds distance. We also describe how the proposed tree representation can be further generalized to tree-child networks, showcasing its versatility and potential applications in broader evolutionary analyses. This article is part of the theme issue '"A mathematical theory of evolution": phylogenetic models dating back 100 years'.
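For context on the Robinson-Foulds comparison mentioned in this abstract, here is a minimal sketch of the clade-based Robinson-Foulds distance for rooted trees, with trees written as nested Python tuples. It is not the HOP distance and not the authors' code; the tree encoding and the function names are illustrative assumptions.

```python
def clades(tree):
    """Return the non-trivial clades of a nested-tuple tree as frozensets of leaves."""
    found = set()

    def leaves(node):
        # A non-tuple node is a leaf label.
        if not isinstance(node, tuple):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    # The root clade (all taxa) is shared by every tree on the same taxa.
    found.discard(root)
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


if __name__ == "__main__":
    balanced = (("A", "B"), ("C", "D"))
    caterpillar = ((("A", "B"), "C"), "D")
    print(rf_distance(balanced, caterpillar))
```

The two example trees share the clade {A, B} and differ in one clade each, so the printed distance is 2.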
Affiliation(s)
- Cedric Chauve
- Department of Mathematics, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
- Caroline Colijn
- Department of Mathematics, National University of Singapore, Singapore 119076, Singapore
- Louxin Zhang
- Department of Mathematics, National University of Singapore, Singapore 119076, Singapore
8
Kraemer MUG, Tsui JLH, Chang SY, Lytras S, Khurana MP, Vanderslott S, Bajaj S, Scheidwasser N, Curran-Sebastian JL, Semenova E, Zhang M, Unwin HJT, Watson OJ, Mills C, Dasgupta A, Ferretti L, Scarpino SV, Koua E, Morgan O, Tegally H, Paquet U, Moutsianas L, Fraser C, Ferguson NM, Topol EJ, Duchêne DA, Stadler T, Kingori P, Parker MJ, Dominici F, Shadbolt N, Suchard MA, Ratmann O, Flaxman S, Holmes EC, Gomez-Rodriguez M, Schölkopf B, Donnelly CA, Pybus OG, Cauchemez S, Bhatt S. Artificial intelligence for modelling infectious disease epidemics. Nature 2025; 638:623-635. [PMID: 39972226 PMCID: PMC11987553 DOI: 10.1038/s41586-024-08564-w] [Received: 07/08/2024] [Accepted: 12/20/2024] [Indexed: 02/21/2025]
Abstract
Infectious disease threats to individual and public health are numerous, varied and frequently unexpected. Artificial intelligence (AI) and related technologies, which are already supporting human decision making in economics, medicine and social science, have the potential to transform the scope and power of infectious disease epidemiology. Here we consider the application to infectious disease modelling of AI systems that combine machine learning, computational statistics, information retrieval and data science. We first outline how recent advances in AI can accelerate breakthroughs in answering key epidemiological questions and we discuss specific AI methods that can be applied to routinely collected infectious disease surveillance data. Second, we elaborate on the social context of AI for infectious disease epidemiology, including issues such as explainability, safety, accountability and ethics. Finally, we summarize some limitations of AI applications in this field and provide recommendations for how infectious disease epidemiology can harness most effectively current and future developments in AI.
Affiliation(s)
- Moritz U G Kraemer
- Pandemic Sciences Institute, University of Oxford, Oxford, UK.
- Department of Biology, University of Oxford, Oxford, UK.
- Joseph L-H Tsui
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- Department of Biology, University of Oxford, Oxford, UK
- Serina Y Chang
- Department of Electrical Engineering and Computer Science, University of California Berkeley, Berkeley, CA, USA
- UCSF UC Berkeley Joint Program in Computational Precision Health, Berkeley, CA, USA
- Spyros Lytras
- Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Mark P Khurana
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
- Samantha Vanderslott
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- Oxford Vaccine Group, University of Oxford and NIHR Oxford Biomedical Research Centre, Oxford, UK
- Sumali Bajaj
- Department of Biology, University of Oxford, Oxford, UK
- Neil Scheidwasser
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
- Elizaveta Semenova
- Department of Epidemiology and Biostatistics, Imperial College London, London, UK
- Mengyan Zhang
- Department of Computer Science, University of Oxford, Oxford, UK
- Oliver J Watson
- MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, UK
- Cathal Mills
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- Department of Statistics, University of Oxford, Oxford, UK
- Abhishek Dasgupta
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- Doctoral Training Centre, University of Oxford, Oxford, UK
- Luca Ferretti
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- Samuel V Scarpino
- Institute for Experiential AI, Northeastern University, Boston, MA, USA
- Santa Fe Institute, Santa Fe, NM, USA
- Etien Koua
- World Health Organization Regional Office for Africa, Brazzaville, Congo
- Oliver Morgan
- WHO Hub for Pandemic and Epidemic Intelligence, Health Emergencies Programme, World Health Organization, Berlin, Germany
- Houriiyah Tegally
- Centre for Epidemic Response and Innovation (CERI), School for Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Ulrich Paquet
- African Institute for Mathematical Sciences (AIMS) South Africa, Muizenberg, Cape Town, South Africa
- Neil M Ferguson
- MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, UK
- David A Duchêne
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
- Tanja Stadler
- Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Patricia Kingori
- The Ethox Centre, Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Michael J Parker
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- The Ethox Centre, Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Francesca Dominici
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Nigel Shadbolt
- Department of Computer Science, University of Oxford, Oxford, UK
- The Open Data Institute, London, UK
- Marc A Suchard
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, USA
- Oliver Ratmann
- Department of Mathematics, Imperial College London, London, UK
- Imperial-X, Imperial College, London, UK
- Seth Flaxman
- Department of Computer Science, University of Oxford, Oxford, UK
- Edward C Holmes
- School of Medical Sciences, The University of Sydney, Sydney, New South Wales, Australia
- Bernhard Schölkopf
- Max Planck Institute for Intelligent Systems and ELLIS Institute Tübingen, Tübingen, Germany
- Christl A Donnelly
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- Department of Statistics, University of Oxford, Oxford, UK
- Oliver G Pybus
- Pandemic Sciences Institute, University of Oxford, Oxford, UK
- Department of Biology, University of Oxford, Oxford, UK
- Department of Pathobiology and Population Sciences, The Royal Veterinary College, London, UK
- Simon Cauchemez
- Mathematical Modelling of Infectious Diseases Unit, Institut Pasteur, Université Paris Cité, U1332 INSERM, UMR2000 CNRS, Paris, France
- Samir Bhatt
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark.
- MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, UK.
- Pioneer Centre for Artificial Intelligence University of Copenhagen, Copenhagen, Denmark.
9
Roa Lozano J, Duncan M, McKenna DD, Castoe TA, DeGiorgio M, Adams R. TraitTrainR: accelerating large-scale simulation under models of continuous trait evolution. Bioinformatics Advances 2024; 5:vbae196. [PMID: 39758830 PMCID: PMC11696700 DOI: 10.1093/bioadv/vbae196] [Received: 06/21/2024] [Revised: 11/08/2024] [Accepted: 12/05/2024] [Indexed: 01/07/2025]
Abstract
Motivation: The scale and scope of comparative trait data are expanding at unprecedented rates, and recent advances in evolutionary modeling and simulation sometimes struggle to match this pace. Well-organized and flexible applications for conducting large-scale simulations of evolution hold promise in this context for understanding models and more so our ability to confidently estimate them with real trait data sampled from nature. Results: We introduce TraitTrainR, an R package designed to facilitate efficient, large-scale simulations under complex models of continuous trait evolution. TraitTrainR employs several output formats, supports popular trait data transformations, accommodates multi-trait evolution, and exhibits flexibility in defining input parameter space and model stacking. Moreover, TraitTrainR permits measurement error, allowing for investigation of its potential impacts on evolutionary inference. We envision a wealth of applications of TraitTrainR, and we demonstrate one such example by examining the problem of evolutionary model selection in three empirical phylogenetic case studies. Collectively, these demonstrations of applying TraitTrainR to explore problems in model selection underscores its utility and broader promise for addressing key questions, including those related to experimental design and statistical power, in comparative biology. Availability and implementation: TraitTrainR is developed in R 4.4.0 and is freely available at https://github.com/radamsRHA/TraitTrainR/, which includes detailed documentation, quick-start guides, and a step-by-step tutorial.
Affiliation(s)
- Jenniffer Roa Lozano
- Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR 72701, United States
- Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR 72701, United States
- Mataya Duncan
- Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR 72701, United States
- Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR 72701, United States
- Duane D McKenna
- Department of Biological Sciences, University of Memphis, Memphis, TN 38152, United States
- Center for Biodiversity Research, University of Memphis, Memphis, TN 38152, United States
- Todd A Castoe
- Department of Biology, University of Texas at Arlington, Arlington, TX 76010, United States
- Michael DeGiorgio
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, United States
- Richard Adams
- Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR 72701, United States
- Department of Entomology and Plant Pathology, University of Arkansas, Fayetteville, AR 72701, United States
10
He J, Zhong S, Qin C, Nong A, Lin Z, Liang H, Zhang F, Jiang J, Pan P, Wei W, Liu J, Liu D, Ye L, Liang H, Liang B. The trend, prevalence and potential risk factors of secondary HIV transmission among HIV/AIDS individuals receiving ART in Guangxi, China: a longitudinal cross-sectional study. Emerg Microbes Infect 2024; 13:2429622. [PMID: 39552513 PMCID: PMC11587721 DOI: 10.1080/22221751.2024.2429622] [Received: 06/11/2024] [Revised: 09/29/2024] [Accepted: 11/10/2024] [Indexed: 11/19/2024]
Abstract
Identifying the prevalence and risk factors of secondary human immunodeficiency virus (HIV) transmission from people living with HIV (PLWH) to other people is crucial for ending the HIV epidemic. However, data on patients receiving antiretroviral therapy (ART) are limited. This study aims to assess the prevalence and risk factors of secondary HIV transmission among PLWH receiving ART using longitudinal molecular networks in China. The prevalence of secondary HIV transmission was 10.8%. The R0 was greater than 1 from 2017 to 2021 and peaked in 2019. Male sex, older age, condomless sex, higher viral load during ART follow-up, ART medical omissions, infection with a non-CRF01_AE subtype, and self-reported sexually transmitted infections (STIs) at HIV diagnosis were associated with an increased risk of secondary HIV transmission, whereas participants with higher education were less likely to be involved in secondary HIV transmission. Age at diagnosis was nonlinearly associated with the risk of secondary HIV transmission, with a cutoff value of 43.13 years indicating a higher risk for patients diagnosed at or above this age. This study revealed substantial secondary HIV transmission and persistent HIV expansion among local PLWH, highlighting the necessity of enhancing viral load monitoring, promoting adherence to ART, and promoting safe sex practices, particularly among older adults with HIV, to mitigate secondary HIV transmission.
Affiliation(s)
- Jinfeng He
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Shanmei Zhong
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Cai Qin
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Aidan Nong
- Chongzuo Center for Disease Control and Prevention, Chongzuo, People’s Republic of China
| | - Zhaosen Lin
- Qinzhou Center for Disease Control and Prevention, Qinzhou, People’s Republic of China
| | - Huayue Liang
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Fei Zhang
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Jiaxiao Jiang
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Peijiang Pan
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Wudi Wei
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Jie Liu
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Deping Liu
- Qinzhou Center for Disease Control and Prevention, Qinzhou, People’s Republic of China
| | - Li Ye
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Hao Liang
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
| | - Bingyu Liang
- Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Colleges and Universities Key Laboratory of Prevention and Control of Highly Prevalent Diseases, School of Public Health, Guangxi Medical University, Nanning, People’s Republic of China
- Guangxi Engineering Center for Organoids and Organ-on-chips of Highly Pathogenic Microbial Infections & Biosafety III laboratory, Life Science Institute, Guangxi Medical University, Nanning, People’s Republic of China
|
11
|
Lara-Ramírez EE, Rivera G, Oliva-Hernández AA, Bocanegra-Garcia V, López JA, Guo X. Unsupervised learning analysis on the proteomes of Zika virus. PeerJ Comput Sci 2024; 10:e2443. [PMID: 39650519 PMCID: PMC11623125 DOI: 10.7717/peerj-cs.2443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2024] [Accepted: 10/01/2024] [Indexed: 12/11/2024]
Abstract
Background: The Zika virus (ZIKV), which is transmitted by mosquito vectors to nonhuman primates and humans, causes devastating outbreaks in the poorest tropical regions of the world. Molecular epidemiology, supported by gold-standard phylogenetic clustering studies using sequence data, has provided valuable information for tracking and controlling the spread of ZIKV. Unsupervised learning (UL), a form of machine learning, can be applied to datasets without the need for known information for training. Methods: In this work, unsupervised Random Forest (URF), followed by dimensionality reduction algorithms such as principal component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), t-distributed stochastic neighbor embedding (t-SNE), and autoencoders, was used to uncover hidden patterns in polymorphic amino acid sites extracted from ZIKV proteome multiple alignments, without the need for an underlying evolutionary model. Results: The four UL algorithms revealed specific host and geographical clustering patterns for ZIKV. Among the four dimensionality reduction (DR) algorithms, UMAP performed best. The four algorithms allowed the identification of imported viruses for specific geographical clusters. The UL dimension coordinates showed a significant correlation with phylogenetic tree branch lengths and significant phylogenetic dependence in Abouheif's Cmean and Pagel's Lambda tests (p value < 0.01), indicating performance comparable to the phylogenetic method. This analytical strategy was generalizable to a large external dengue type 2 dataset. Conclusion: These UL algorithms could be practical evolutionary analytical techniques to track the dispersal of viral pathogens.
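The feature-extraction step mentioned in the Methods (polymorphic amino acid sites taken from a multiple alignment) could look roughly like the sketch below. The abstract does not describe the authors' actual encoding, so the function names, the gap characters, and the one-hot scheme are all assumptions for illustration; the resulting matrix is the kind of input a dimensionality reduction method would take.

```python
def polymorphic_sites(alignment, gap_chars="-X?"):
    """Return indices of alignment columns that contain more than one residue.

    alignment: list of equal-length aligned sequences (strings).
    Characters in gap_chars are ignored when counting distinct residues.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for i in range(length):
        residues = {seq[i] for seq in alignment} - set(gap_chars)
        if len(residues) > 1:
            sites.append(i)
    return sites


def one_hot(alignment, sites, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Encode the selected columns as a binary feature matrix (list of rows)."""
    rows = []
    for seq in alignment:
        row = []
        for i in sites:
            row.extend(1 if seq[i] == aa else 0 for aa in alphabet)
        rows.append(row)
    return rows


if __name__ == "__main__":
    aln = ["MKTA", "MRTA", "MK-A"]   # toy alignment, not real ZIKV data
    sites = polymorphic_sites(aln)
    print(sites, len(one_hot(aln, sites)[0]))
```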
Affiliation(s)
- Edgar E. Lara-Ramírez
- Laboratorio de Biotecnología Farmacéutica, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México
| | - Gildardo Rivera
- Laboratorio de Biotecnología Farmacéutica, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México
| | - Amanda Alejandra Oliva-Hernández
- Laboratorio de Biotecnología Experimental, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México
| | - Virgilio Bocanegra-Garcia
- Laboratorio de Interacción Ambiente Microorganismo, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México
| | - Jesús Adrián López
- Laboratorio de microRNAs y Cáncer, Unidad Académica de Ciencias Biológicas, Universidad Autónoma de Zacatecas, Zacatecas, Zacatecas, México
| | - Xianwu Guo
- Laboratorio de Biotecnología Genómica, Centro de Biotecnología Genómica, Instituto Politécnico Nacional, Reynosa, Tamaulipas, México
|
12
|
Xie R, Adam DC, Hu S, Cowling BJ, Gascuel O, Zhukova A, Dhanasekaran V. Integrating Contact Tracing Data to Enhance Outbreak Phylodynamic Inference: A Deep Learning Approach. Mol Biol Evol 2024; 41:msae232. [PMID: 39497507 PMCID: PMC11600589 DOI: 10.1093/molbev/msae232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 09/27/2024] [Accepted: 10/24/2024] [Indexed: 11/28/2024] Open
Abstract
Phylodynamics is central to understanding infectious disease dynamics through the integration of genomic and epidemiological data. Despite advancements, including the application of deep learning to overcome computational limitations, significant challenges persist due to data inadequacies and statistical unidentifiability of key parameters. These issues are particularly pronounced in poorly resolved phylogenies, commonly observed in outbreaks such as SARS-CoV-2. In this study, we conducted a thorough evaluation of PhyloDeep, a deep learning inference tool for phylodynamics, assessing its performance on poorly resolved phylogenies. Our findings reveal the limited predictive accuracy of PhyloDeep (and other state-of-the-art approaches) in these scenarios. However, models trained on poorly resolved, realistically simulated trees demonstrate improved predictive power, despite not being infallible, especially in scenarios with superspreading dynamics, whose parameters are challenging to capture accurately. Notably, we observe markedly improved performance through the integration of minimal contact tracing data, which refines poorly resolved trees. Applying this approach to a sample of SARS-CoV-2 sequences partially matched to contact tracing from Hong Kong yields informative estimates of superspreading potential, extending beyond the scope of contact tracing data alone. Our findings demonstrate the potential for enhancing phylodynamic analysis through complementary data integration, ultimately increasing the precision of epidemiological predictions crucial for public health decision-making and outbreak control.
Affiliation(s)
- Ruopeng Xie
- School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
- HKU-Pasteur Research Pole, School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
| | - Dillon C Adam
- School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
| | - Shu Hu
- School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
- HKU-Pasteur Research Pole, School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
| | - Benjamin J Cowling
- School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
- Laboratory of Data Discovery for Health, Hong Kong Science and Technology Park, New Territories, Hong Kong S.A.R., China
| | - Olivier Gascuel
- Biologie intégrative des populations, Evolution moléculaire (BIPEM), Institut de Systématique, Evolution, Biodiversité (ISYEB, UMR 7205—CNRS, MNHN, SU, EPHE, UA), Muséum National d’Histoire Naturelle, Paris 75005 France
| | - Anna Zhukova
- Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, Paris 75015, France
- G5 Evolutionary Dynamics of Infectious Diseases, Institut Pasteur, Université de Paris, Paris 75015, France
| | - Vijaykrishna Dhanasekaran
- School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
- HKU-Pasteur Research Pole, School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong S.A.R., China
|
13
|
Janzen T, Etienne RS. Phylogenetic tree statistics: A systematic overview using the new R package 'treestats'. Mol Phylogenet Evol 2024; 200:108168. [PMID: 39117295 DOI: 10.1016/j.ympev.2024.108168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 07/19/2024] [Accepted: 08/04/2024] [Indexed: 08/10/2024]
Abstract
Phylogenetic trees are believed to contain a wealth of information on diversification processes. However, comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of low-dimensional summary statistics. Currently, it remains unexplored to what extent these summary statistics cover the same underlying information and what summary statistics best explain observed variation across phylogenies. Furthermore, a large subset of available summary statistics focusses on measuring the topological features of a phylogenetic tree, but these statistics are often only explored at the extreme cases of the fully balanced or fully imbalanced tree and not for trees of intermediate balance. Here, we introduce a new R package called 'treestats', which provides speed-optimized code to compute 70 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient. We find that almost all summary statistics are correlated with tree size, and that it is difficult, if not impossible, to correct for tree size unless the tree-generating model is known. Furthermore, we find that across empirical and simulated trees, at least three large clusters of correlated summary statistics can be found, where statistics group together based on the information used (topology or branching times). However, the finer-grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the tree-generating model (in simulation studies). Amongst statistics describing the (im)balance of a tree, we find that almost all statistics vary non-linearly, and sometimes even non-monotonically, with our generated balance gradient. This indicates that balance is perhaps a more complex property of a tree than previously thought. Furthermore, using our new imbalancing algorithm, we devise a numerical test to identify balance statistics, and identify several statistics as balance statistics that were not previously considered as such. Lastly, our results lead to several recommendations on which statistics to select when analyzing and comparing phylogenetic trees.
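To make the idea of a topological balance statistic concrete, the sketch below computes the Colless index, a standard imbalance measure, for the two extreme cases the abstract mentions (a fully balanced and a fully imbalanced tree). This is plain Python, not the 'treestats' package; the nested-tuple tree encoding and the function name are assumptions for this example, and the abstract does not list which statistics the package itself provides.

```python
def colless(tree):
    """Return (number_of_tips, Colless index) for a rooted binary tree.

    A tree is either a tip (any non-tuple object) or a 2-tuple (left, right).
    The Colless index sums |tips(left) - tips(right)| over all internal nodes.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


# Hypothetical four-tip examples of the two extremes.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

if __name__ == "__main__":
    print(colless(balanced))     # (4, 0)
    print(colless(caterpillar))  # (4, 3)
```

Because the raw index grows with the number of tips, comparisons across trees of different sizes need some normalisation, which ties in with the abstract's point that almost all statistics are correlated with tree size.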
Affiliation(s)
- Thijs Janzen
- Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands.
| | - Rampal S Etienne
- Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, the Netherlands
|
14
|
Maestri R, Perez-Lamarque B, Zhukova A, Morlon H. Recent evolutionary origin and localized diversity hotspots of mammalian coronaviruses. eLife 2024; 13:RP91745. [PMID: 39196812 PMCID: PMC11357359 DOI: 10.7554/elife.91745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/30/2024] Open
Abstract
Several coronaviruses infect humans, with three, including SARS-CoV-2, causing diseases. While coronaviruses are especially prone to induce pandemics, we know little about their evolutionary history, host-to-host transmissions, and biogeography. One of the difficulties lies in dating the origination of the family, a particularly challenging task for RNA viruses in general. Previous cophylogenetic tests of virus-host associations, including in the Coronaviridae family, have suggested a virus-host codiversification history stretching many millions of years. Here, we establish a framework for robustly testing scenarios of ancient origination and codiversification versus recent origination and diversification by host switches. Applied to coronaviruses and their mammalian hosts, our results support a scenario of recent origination of coronaviruses in bats and diversification by host switches, with preferential host switches within mammalian orders. Hotspots of coronavirus diversity, concentrated in East Asia and Europe, are consistent with this scenario of relatively recent origination and localized host switches. Spillovers from bats to other species are rare, but are more likely to be towards humans than towards any other mammal species, implicating humans as the evolutionary intermediate host. The high host-switching rates within orders, as well as between humans, domesticated mammals, and non-flying wild mammals, indicate the potential for rapid additional spreading of coronaviruses across the world. Our results suggest that the evolutionary history of extant mammalian coronaviruses is recent, and that cases of long-term virus-host codiversification have been largely over-estimated.
Affiliation(s)
- Renan Maestri
- Institut de Biologie de l'École Normale Supérieure (IBENS), École Normale Supérieure, CNRS, INSERM, Université PSL, Paris, France
- Departamento de Ecologia, Instituto de Biociências, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
| | - Benoît Perez-Lamarque
- Institut de Biologie de l'École Normale Supérieure (IBENS), École Normale Supérieure, CNRS, INSERM, Université PSL, Paris, France
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum national d’histoire naturelle, CNRS, Sorbonne Université, EPHE, UA, Paris, France
| | - Anna Zhukova
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, France
| | - Hélène Morlon
- Institut de Biologie de l'École Normale Supérieure (IBENS), École Normale Supérieure, CNRS, INSERM, Université PSL, Paris, France
|
15
|
Khurana MP, Curran-Sebastian J, Scheidwasser N, Morgenstern C, Rasmussen M, Fonager J, Stegger M, Tang MHE, Juul JL, Escobar-Herrera LA, Møller FT, Albertsen M, Kraemer MUG, du Plessis L, Jokelainen P, Lehmann S, Krause TG, Ullum H, Duchêne DA, Mortensen LH, Bhatt S. High-resolution epidemiological landscape from ~290,000 SARS-CoV-2 genomes from Denmark. Nat Commun 2024; 15:7123. [PMID: 39164246 PMCID: PMC11335946 DOI: 10.1038/s41467-024-51371-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 08/01/2024] [Indexed: 08/22/2024] Open
Abstract
Vast amounts of pathogen genomic, demographic and spatial data are transforming our understanding of SARS-CoV-2 emergence and spread. We examined the drivers of molecular evolution and spread of 291,791 SARS-CoV-2 genomes from Denmark in 2021. With a sequencing rate consistently exceeding 60%, and up to 80% of PCR-positive samples between March and November, the viral genome set is broadly whole-epidemic representative. We identify a consistent rise in viral diversity over time, with notable spikes upon the importation of novel variants (e.g., Delta and Omicron). By linking genomic data with rich individual-level demographic data from national registers, we find that individuals aged < 15 and > 75 years had a lower contribution to molecular change (i.e., branch lengths) compared to other age groups, but similar molecular evolutionary rates, suggesting a lower likelihood of introducing novel variants. Similarly, we find greater molecular change among vaccinated individuals, suggestive of immune evasion. We also observe evidence that transmission in rural areas follows predictable diffusion processes. Conversely, transmission in urban areas is, as expected, more complex due to high mobility, emphasising the role of population structure in driving virus spread. Our analyses highlight the added value of integrating genomic data with detailed demographic and spatial information, particularly in the absence of structured infection surveys.
Affiliation(s)
- Mark P Khurana
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark.
| | - Jacob Curran-Sebastian
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
| | - Neil Scheidwasser
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
| | - Christian Morgenstern
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, Faculty of Medicine, Imperial College London, London, UK
| | - Morten Rasmussen
- Virus Research and Development Laboratory, Statens Serum Institut, Copenhagen, Denmark
| | - Jannik Fonager
- Virus Research and Development Laboratory, Statens Serum Institut, Copenhagen, Denmark
| | - Marc Stegger
- Department of Bacteria, Parasites and Fungi, Statens Serum Institut, Copenhagen, Denmark
- Antimicrobial Resistance and Infectious Diseases Laboratory, Harry Butler Institute, Murdoch University, Murdoch, WA, Australia
| | - Man-Hung Eric Tang
- Department of Bacteria, Parasites and Fungi, Statens Serum Institut, Copenhagen, Denmark
| | - Jonas L Juul
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Mads Albertsen
- Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark
| | - Louis du Plessis
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
| | - Pikka Jokelainen
- Infectious Disease Preparedness, Statens Serum Institut, Copenhagen, Denmark
| | - Sune Lehmann
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Tyra G Krause
- Epidemiological Infectious Disease Preparedness, Statens Serum Institut Copenhagen, Copenhagen, Denmark
| | - David A Duchêne
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
| | - Laust H Mortensen
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
- Statistics Denmark, Copenhagen, Denmark
| | - Samir Bhatt
- Section of Epidemiology, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, Faculty of Medicine, Imperial College London, London, UK
|
16
|
Soewongsono AC, Landis MJ. A Diffusion-Based Approach for Simulating Forward-in-Time State-Dependent Speciation and Extinction Dynamics. Bull Math Biol 2024; 86:101. [PMID: 38970749 DOI: 10.1007/s11538-024-01337-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 06/27/2024] [Indexed: 07/08/2024]
Abstract
We establish a general framework using a diffusion approximation to simulate forward-in-time state counts or frequencies for cladogenetic state-dependent speciation-extinction (ClaSSE) models. We apply the framework to various two- and three-region geographic-state speciation-extinction (GeoSSE) models. We show that the species range state dynamics simulated under tree-based and diffusion-based processes are comparable. We derive a method to infer rate parameters that are compatible with given observed stationary state frequencies and obtain an analytical result to compute stationary state frequencies for a given set of rate parameters. We also describe a procedure to find the time to reach the stationary frequencies of a ClaSSE model using our diffusion-based approach, which we demonstrate using a worked example for a two-region GeoSSE model. Finally, we discuss how the diffusion framework can be applied to formalize relationships between evolutionary patterns and processes under state-dependent diversification scenarios.
Affiliation(s)
- Albert C Soewongsono
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO, 63130, USA.
| | - Michael J Landis
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO, 63130, USA
|
17
|
Xu P, Liang S, Hahn A, Zhao V, Lo WT‘J, Haller BC, Sobkowiak B, Chitwood MH, Colijn C, Cohen T, Rhee KY, Messer PW, Wells MT, Clark AG, Kim J. e3SIM: epidemiological-ecological-evolutionary simulation framework for genomic epidemiology. bioRxiv 2024:2024.06.29.601123. [PMID: 39005464 PMCID: PMC11244936 DOI: 10.1101/2024.06.29.601123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Infectious disease dynamics are driven by the complex interplay of epidemiological, ecological, and evolutionary processes. Accurately modeling these interactions is crucial for understanding pathogen spread and informing public health strategies. However, existing simulators often fail to capture the dynamic interplay between these processes, resulting in oversimplified models that do not fully reflect real-world complexities in which the pathogen's genetic evolution dynamically influences disease transmission. We introduce the epidemiological-ecological-evolutionary simulator (e3SIM), an open-source framework that concurrently models the transmission dynamics and molecular evolution of pathogens within a host population while integrating environmental factors. Using an agent-based, discrete-generation, forward-in-time approach, e3SIM incorporates compartmental models, host-population contact networks, and quantitative-trait models for pathogens. This integration allows for realistic simulations of disease spread and pathogen evolution. Key features include a modular and scalable design, flexibility in modeling various epidemiological and population-genetic complexities, incorporation of time-varying environmental factors, and a user-friendly graphical interface. We demonstrate e3SIM's capabilities through simulations of realistic outbreak scenarios with SARS-CoV-2 and Mycobacterium tuberculosis, illustrating its flexibility for studying the genomic epidemiology of diverse pathogen types.
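The phrase "agent-based, discrete-generation, forward-in-time approach" combined with a compartmental model on a host contact network can be illustrated with a very small SIR sketch. This is not e3SIM code and omits pathogen evolution and environmental factors entirely; the function `step`, the state labels, and the parameters `beta` (per-contact transmission probability) and `gamma` (recovery probability) are assumptions for this example.

```python
import random


def step(states, contacts, beta, gamma, rng):
    """Advance one discrete generation of an SIR process on a contact network.

    states: dict node -> "S", "I" or "R".
    contacts: dict node -> list of neighbouring nodes.
    Infections and recoveries are decided from the states at the start
    of the generation, so the update is synchronous.
    """
    new_states = dict(states)
    for node, state in states.items():
        if state != "I":
            continue
        for nb in contacts[node]:
            # Each infected host may infect each susceptible contact.
            if states[nb] == "S" and rng.random() < beta:
                new_states[nb] = "I"
        if rng.random() < gamma:
            new_states[node] = "R"
    return new_states


if __name__ == "__main__":
    # Hypothetical ring network of four hosts with one initial infection.
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    states = {0: "I", 1: "S", 2: "S", 3: "S"}
    rng = random.Random(7)
    for generation in range(5):
        states = step(states, ring, beta=0.5, gamma=0.3, rng=rng)
        print(generation, states)
```

A simulator that couples transmission with evolution would additionally track a genome per infection and let it mutate each generation, which is the part this sketch leaves out.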
Affiliation(s)
- Peiyu Xu
- Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
| | - Shenni Liang
- Department of Computational Science, Cornell University, Ithaca, NY, USA
| | - Andrew Hahn
- Department of Computational Science, Cornell University, Ithaca, NY, USA
| | - Vivian Zhao
- Department of Computational Science, Cornell University, Ithaca, NY, USA
| | - Wai Tung ‘Jack’ Lo
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Benjamin C. Haller
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Benjamin Sobkowiak
- Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
| | - Melanie H. Chitwood
- Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
| | - Caroline Colijn
- Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
| | - Ted Cohen
- Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
| | - Kyu Y. Rhee
- Department of Medicine, Weill Cornell Medicine, New York, NY, USA
| | - Philipp W. Messer
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Martin T. Wells
- Department of Statistics and Data Science, Cornell University, Ithaca, NY, USA
| | - Andrew G. Clark
- Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
| | - Jaehee Kim
- Department of Computational Biology, Cornell University, Ithaca, NY, USA
|
18
|
Mo YK, Hahn MW, Smith ML. Applications of machine learning in phylogenetics. Mol Phylogenet Evol 2024; 196:108066. [PMID: 38565358 DOI: 10.1016/j.ympev.2024.108066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 02/16/2024] [Accepted: 03/21/2024] [Indexed: 04/04/2024]
Abstract
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
Affiliation(s)
- Yu K Mo
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Matthew W Hahn
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA; Department of Biology, Indiana University, Bloomington, IN 47405, USA
| | - Megan L Smith
- Department of Biological Sciences, Mississippi State University, Starkville, MS 39762, USA.
| |
|
19
|
Soewongsono AC, Landis MJ. A Diffusion-Based Approach for Simulating Forward-in-Time State-Dependent Speciation and Extinction Dynamics. ARXIV 2024:arXiv:2402.00246v2. [PMID: 38351931 PMCID: PMC10862938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
We establish a general framework using a diffusion approximation to simulate forward-in-time state counts or frequencies for cladogenetic state-dependent speciation-extinction (ClaSSE) models. We apply the framework to various two- and three-region geographic-state speciation-extinction (GeoSSE) models. We show that the species range state dynamics simulated under tree-based and diffusion-based processes are comparable. We derive a method to infer rate parameters that are compatible with given observed stationary state frequencies and obtain an analytical result to compute stationary state frequencies for a given set of rate parameters. We also describe a procedure to find the time to reach the stationary frequencies of a ClaSSE model using our diffusion-based approach, which we demonstrate using a worked example for a two-region GeoSSE model. Finally, we discuss how the diffusion framework can be applied to formalize relationships between evolutionary patterns and processes under state-dependent diversification scenarios.
Affiliation(s)
- Albert C Soewongsono
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, Missouri, 63130, USA
| | - Michael J Landis
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, Missouri, 63130, USA
| |
|
20
|
Lin Q, Goldberg EE, Leitner T, Molina-París C, King AA, Romero-Severson EO. The Number and Pattern of Viral Genomic Reassortments are not Necessarily Identifiable from Segment Trees. Mol Biol Evol 2024; 41:msae078. [PMID: 38648521 PMCID: PMC11152448 DOI: 10.1093/molbev/msae078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 02/23/2024] [Accepted: 04/09/2024] [Indexed: 04/25/2024]
Abstract
Reassortment is an evolutionary process common in viruses with segmented genomes. These viruses can swap whole genomic segments during cellular co-infection, giving rise to novel progeny formed from the mixture of parental segments. Since large-scale genome rearrangements have the potential to generate new phenotypes, reassortment is important to both evolutionary biology and public health research. However, statistical inference of the pattern of reassortment events from phylogenetic data is exceptionally difficult, potentially involving inference of general graphs in which individual segment trees are embedded. In this paper, we argue that, in general, the number and pattern of reassortment events are not identifiable from segment trees alone, even with theoretically ideal data. We call this fact the fundamental problem of reassortment, which we illustrate using the concept of the "first-infection tree," a potentially counterfactual genealogy that would have been observed in the segment trees had no reassortment occurred. Further, we illustrate four additional problems that can arise logically in the inference of reassortment events and show, using simulated data, that these problems are not rare and can potentially distort our observation of reassortment even in small data sets. Finally, we discuss how existing methods can be augmented or adapted to account for not only the fundamental problem of reassortment, but also the four additional situations that can complicate the inference of reassortment.
Affiliation(s)
- Qianying Lin
- Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Emma E Goldberg
- Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Thomas Leitner
- Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Carmen Molina-París
- Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Aaron A King
- Department of Ecology & Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
- Department of Mathematics, University of Michigan, Ann Arbor, MI, USA
- Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI, USA
- Santa Fe Institute, Santa Fe, NM, USA
| | - Ethan O Romero-Severson
- Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
| |
|
21
|
Thompson A, Liebeskind BJ, Scully EJ, Landis MJ. Deep Learning and Likelihood Approaches for Viral Phylogeography Converge on the Same Answers Whether the Inference Model Is Right or Wrong. Syst Biol 2024; 73:183-206. [PMID: 38189575 PMCID: PMC11249978 DOI: 10.1093/sysbio/syad074] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 02/15/2023] [Revised: 11/22/2023] [Accepted: 01/05/2024] [Indexed: 01/09/2024]
Abstract
Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are often computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real-time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare, and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among 5 locations and found they achieve close to the same levels of accuracy as Bayesian inference under the true simulation model. We compared robustness to model misspecification of a trained neural network to that of a Bayesian method. We found that both models had comparable performance, converging on similar biases. We also implemented a method of uncertainty quantification called conformalized quantile regression, which we demonstrate has similar patterns of sensitivity to model misspecification as Bayesian highest posterior density (HPD) intervals and greatly overlaps with HPDs, but has lower precision (is more conservative). Finally, we trained and tested a neural network against phylogeographic data from a recent study of the SARS-CoV-2 pandemic in Europe and obtained similar estimates of region-specific epidemiological parameters and the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over 3 orders of magnitude faster after training. Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.
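The conformalized quantile regression step named in this abstract can be illustrated with a minimal sketch. It assumes the usual split-conformal recipe: a quantile regressor has already produced lower and upper predictions on a held-out calibration set, and the interval is widened by an empirical quantile of the calibration errors. The function name `cqr_intervals` and its arguments are hypothetical and are not taken from the authors' software.

```python
import math

def cqr_intervals(cal_lo, cal_hi, cal_y, test_lo, test_hi, alpha=0.1):
    """Widen predicted quantile intervals using calibration data (split-conformal CQR sketch)."""
    n = len(cal_y)
    # Conformity score: distance by which the true value falls outside the
    # predicted interval (negative when it lies inside).
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    # Finite-sample-corrected rank of the score used as the correction.
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the interval cannot be bounded.
    q = math.inf if k > n else scores[k - 1]
    return [(lo - q, hi + q) for lo, hi in zip(test_lo, test_hi)]

# Three calibration points, one test interval, 50% target coverage.
intervals = cqr_intervals([0, 0, 0], [1, 1, 1], [0.5, 1.5, 3.0], [2.0], [4.0], alpha=0.5)
print(intervals)
```

The correction `q` is added symmetrically to both ends, so the adjusted interval is never narrower than the raw one whenever `q` is positive, which is consistent with the more conservative behaviour the abstract reports.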
Affiliation(s)
- Ammon Thompson
- Participant in an Education Program Sponsored by U.S. Department of Defense (DOD) at the National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
| | | | - Erik J Scully
- National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
| | - Michael J Landis
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO 63130, USA
| |
|
22
|
Bouckaert RR. Variational Bayesian phylogenies through matrix representation of tree space. PeerJ 2024; 12:e17276. [PMID: 38699195 PMCID: PMC11064865 DOI: 10.7717/peerj.17276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2023] [Accepted: 04/01/2024] [Indexed: 05/05/2024]
Abstract
In this article, we study the distance matrix as a representation of a phylogeny by way of hierarchical clustering. By defining a multivariate normal distribution on (a subset of) the entries in a matrix, this allows us to represent a distribution over rooted time trees. Here, we demonstrate tree distributions can be represented accurately this way for a number of published tree distributions. Though such a representation does not map to unique trees, restriction to a subspace, in particular one we call a "cube", makes the representation bijective at the cost of not being able to represent all possible trees. We introduce an algorithm "cubeVB" specifically for cubes and show through well calibrated simulation study that it is possible to recover parameters of interest like tree height and length. Although a cube cannot represent all of tree space, it is a great improvement over a single summary tree, and it opens up exciting new opportunities for scaling up Bayesian phylogenetic inference. We also demonstrate how to use a matrix representation of a tree distribution to get better summary trees than commonly used maximum clade credibility trees. An open source implementation of the cubeVB algorithm is available from https://github.com/rbouckaert/cubevb as the cubevb package for BEAST 2.
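The idea of reading a rooted time tree off a distance matrix by hierarchical clustering can be sketched as follows. This is an illustration only, not the cubeVB algorithm: it assumes single-linkage clustering and places each internal node at half the linking distance, and the function name `single_linkage_tree` and the nested-tuple tree format are hypothetical.

```python
from itertools import combinations

def single_linkage_tree(labels, dist):
    """Return a rooted tree as nested tuples (height, left, right); leaves are label strings."""
    # Each active cluster maps an id to (subtree, indices of its member taxa).
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        def link(pair):
            # Single linkage: the smallest distance between any two members.
            a, b = pair
            return min(dist[i][j] for i in clusters[a][1] for j in clusters[b][1])
        a, b = min(combinations(sorted(clusters), 2), key=link)
        # Assumed ultrametric convention: node height is half the pairwise distance.
        height = link((a, b)) / 2.0
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((height, tree_a, tree_b), members_a + members_b)
        next_id += 1
    (tree, _members), = clusters.values()
    return tree

# Three taxa: A and B are close, C is distant from both.
dist = [[0, 2, 6],
        [2, 0, 6],
        [6, 6, 0]]
print(single_linkage_tree(["A", "B", "C"], dist))
```

Because many different matrices cluster to the same tree, the mapping from matrices to trees is not one-to-one, which is the non-uniqueness the abstract addresses by restricting to a subspace.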
Affiliation(s)
- Remco R. Bouckaert
- School of Computer Science, University of Auckland, Auckland, New Zealand
| |
|
23
|
Sun C, Fang R, Salemi M, Prosperi M, Rife Magalis B. DeepDynaForecast: Phylogenetic-informed graph deep learning for epidemic transmission dynamic prediction. PLoS Comput Biol 2024; 20:e1011351. [PMID: 38598563 PMCID: PMC11034642 DOI: 10.1371/journal.pcbi.1011351] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 04/22/2024] [Accepted: 03/11/2024] [Indexed: 04/12/2024]
Abstract
In the midst of an outbreak or sustained epidemic, reliable prediction of transmission risks and patterns of spread is critical to inform public health programs. Projections of transmission growth or decline among specific risk groups can aid in optimizing interventions, particularly when resources are limited. Phylogenetic trees have been widely used in the detection of transmission chains and high-risk populations. Moreover, tree topology and the incorporation of population parameters (phylodynamics) can be useful in reconstructing the evolutionary dynamics of an epidemic across space and time among individuals. We now demonstrate the utility of phylodynamic trees for transmission modeling and forecasting, developing a phylogeny-based deep learning system, referred to as DeepDynaForecast. Our approach leverages a primal-dual graph learning structure with shortcut multi-layer aggregation, which is suited for the early identification and prediction of transmission dynamics in emerging high-risk groups. We demonstrate the accuracy of DeepDynaForecast using simulated outbreak data and the utility of the learned model using empirical, large-scale data from the human immunodeficiency virus epidemic in Florida between 2012 and 2020. Our framework is available as open-source software (MIT license) at github.com/lab-smile/DeepDynaForcast.
Affiliation(s)
- Chaoyue Sun
- Department of Electrical and Computer Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida, United States of America
| | - Ruogu Fang
- Department of Electrical and Computer Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida, United States of America
- J. Crayton Pruitt Family Department of Biomedical Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida, United States of America
- Center for Cognitive Aging and Memory, McKnight Brain Institute, University of Florida, Gainesville, Florida, United States of America
| | - Marco Salemi
- Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, Gainesville, Florida, United States of America
- Emerging Pathogens Institute, University of Florida, Gainesville, Florida, United States of America
| | - Mattia Prosperi
- Emerging Pathogens Institute, University of Florida, Gainesville, Florida, United States of America
- Department of Epidemiology, University of Florida, Gainesville, Florida, United States of America
| | - Brittany Rife Magalis
- Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, Gainesville, Florida, United States of America
- Emerging Pathogens Institute, University of Florida, Gainesville, Florida, United States of America
| |
|
24
|
Mohammad N, Huguenin A, Lefebvre A, Menvielle L, Toubas D, Ranque S, Villena I, Tannier X, Normand AC, Piarroux R. Nosocomial transmission of Aspergillus flavus in a neonatal intensive care unit: Long-term persistence in environment and interest of MALDI-ToF mass-spectrometry coupled with convolutional neural network for rapid clone recognition. Med Mycol 2024; 62:myad136. [PMID: 38142226 DOI: 10.1093/mmy/myad136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 11/28/2023] [Accepted: 12/21/2023] [Indexed: 12/25/2023]
Abstract
Aspergillosis of the newborn remains a rare but severe disease. We report four cases of primary cutaneous Aspergillus flavus infections in premature newborns linked to incubators contamination by putative clonal strains. Our objective was to evaluate the ability of matrix-assisted laser desorption/ionisation time of flight (MALDI-TOF) coupled to convolutional neural network (CNN) for clone recognition in a context where only a very small number of strains are available for machine learning. Clinical and environmental A. flavus isolates (n = 64) were studied, 15 were epidemiologically related to the four cases. All strains were typed using microsatellite length polymorphism. We found a common genotype for 9/15 related strains. The isolates of this common genotype were selected to obtain a training dataset (6 clonal isolates/25 non-clonal) and a test dataset (3 clonal isolates/31 non-clonal), and spectra were analysed with a simple CNN model. On the test dataset using CNN model, all 31 non-clonal isolates were correctly classified, 2/3 clonal isolates were unambiguously correctly classified, whereas the third strain was undetermined (i.e., the CNN model was unable to discriminate between GT8 and non-GT8). Clonal strains of A. flavus have persisted in the neonatal intensive care unit for several years. Indeed, two strains of A. flavus isolated from incubators in September 2007 are identical to the strain responsible for the second case that occurred 3 years later. MALDI-TOF is a promising tool for detecting clonal isolates of A. flavus using CNN even with a limited training set for limited cost and handling time.
Affiliation(s)
- Noshine Mohammad
- Sorbonne Université, INSERM, Institut Pierre-Louis d'Epidémiologie et de Santé Publique, AP-HP, Paris, France
- Groupe Hospitalier Pitié-Salpêtrière, Service de Parasitologie-Mycologie, Paris, France
| | - Antoine Huguenin
- Laboratoire de Parasitologie-Mycologie, Pôle de Biologie et de Pathologie, CHU de Reims, Reims, France
- Université de Reims Champagne Ardenne, ESCAPE EA7510, Reims, France
| | | | - Laura Menvielle
- CHU de Reims, Hôpital Américain, Service de réanimation néonatale, 45 rue Cognaq Jay, Reims, France
| | - Dominique Toubas
- Laboratoire de Parasitologie-Mycologie, Pôle de Biologie et de Pathologie, CHU de Reims, Reims, France
- Université de Reims Champagne Ardenne, ESCAPE EA7510, Reims, France
- Equipe Opérationnelle d'Hygiène, CHU de Reims, France
- CHU de Reims, Hôpital Américain, Service de réanimation néonatale, 45 rue Cognaq Jay, Reims, France
- BioSpecT (Translational BioSpectroscopy) EA 7506, SFR Santé, Université de Reims Champagne-Ardenne, Reims, France
| | - Stéphane Ranque
- IHU-Méditerranée Infection, Marseille, France
- Aix-Marseille Université, AP-HM, IRD, SSA, VITROME, Marseille, France
| | - Isabelle Villena
- Laboratoire de Parasitologie-Mycologie, Pôle de Biologie et de Pathologie, CHU de Reims, Reims, France
- Université de Reims Champagne Ardenne, ESCAPE EA7510, Reims, France
| | - Xavier Tannier
- Sorbonne Université, INSERM, Université Sorbonne Paris Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des connaissances en e-Santé, LIMICS, Paris, France
| | - Anne-Cécile Normand
- Groupe Hospitalier Pitié-Salpêtrière, Service de Parasitologie-Mycologie, Paris, France
| | - Renaud Piarroux
- Sorbonne Université, INSERM, Institut Pierre-Louis d'Epidémiologie et de Santé Publique, AP-HP, Paris, France
- Groupe Hospitalier Pitié-Salpêtrière, Service de Parasitologie-Mycologie, Paris, France
| |
|
25
|
Zhukova A, Hecht F, Maday Y, Gascuel O. Fast and Accurate Maximum-Likelihood Estimation of Multi-Type Birth-Death Epidemiological Models from Phylogenetic Trees. Syst Biol 2023; 72:1387-1402. [PMID: 37703335 PMCID: PMC10924745 DOI: 10.1093/sysbio/syad059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2022] [Revised: 09/05/2023] [Accepted: 09/07/2023] [Indexed: 09/15/2023]
Abstract
Multi-type birth-death (MTBD) models are phylodynamic analogies of compartmental models in classical epidemiology. They serve to infer such epidemiological parameters as the average number of secondary infections Re and the infectious time from a phylogenetic tree (a genealogy of pathogen sequences). The representatives of this model family focus on various aspects of pathogen epidemics. For instance, the birth-death exposed-infectious (BDEI) model describes the transmission of pathogens featuring an incubation period (when there is a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits its estimation along with other parameters. With constantly growing sequencing data, MTBD models should be extremely useful for unravelling information on pathogen epidemics. However, existing implementations of these models in a phylodynamic framework have not yet caught up with the sequencing speed. Computing time and numerical instability issues limit their applicability to medium data sets (≤ 500 samples), while the accuracy of estimations should increase with more data. We propose a new highly parallelizable formulation of ordinary differential equations for MTBD models. We also extend them to forests to represent situations when a (sub-)epidemic started from several cases (e.g., multiple introductions to a country). We implemented it for the BDEI model in a maximum likelihood framework using a combination of numerical analysis methods for efficient equation resolution. Our implementation estimates epidemiological parameter values and their confidence intervals in two minutes on a phylogenetic tree of 10,000 samples. Comparison to the existing implementations on simulated data shows that it is not only much faster but also more accurate. An application of our tool to the 2014 Ebola epidemic in Sierra Leone is also convincing, with very fast calculation and precise estimates. As MTBD models are closely related to Cladogenetic State Speciation and Extinction (ClaSSE)-like models, our findings could also be easily transferred to the macroevolution domain.
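The epidemiological quantities mentioned in this abstract follow from the BDEI rate parameters by simple arithmetic. The sketch below assumes the commonly used parameterization, with a becoming-infectious rate `mu`, a transmission rate `lam`, and a removal rate `psi`; these symbol names and the helper `bdei_summary` are assumptions for illustration and do not reproduce the authors' implementation.

```python
def bdei_summary(mu, lam, psi):
    """Derive epidemiological summaries from BDEI rates (all rates per unit time)."""
    if min(mu, lam, psi) <= 0:
        raise ValueError("all rates must be positive")
    return {
        # Average number of secondary infections per infectious individual.
        "Re": lam / psi,
        # Mean time spent infectious before removal.
        "infectious_time": 1.0 / psi,
        # Mean delay between infection and becoming infectious.
        "incubation_period": 1.0 / mu,
    }

# Hypothetical rates per day, chosen only to make the arithmetic easy to follow.
print(bdei_summary(mu=0.5, lam=0.6, psi=0.25))
```

With these hypothetical values, an infectious period of 4 days and 0.6 transmissions per day give Re = 2.4, so the epidemic would be growing.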
Affiliation(s)
- Anna Zhukova
- Unité Bioinformatique Evolutive, Institut Pasteur, Université de Paris, 28 rue du docteur Roux, 75015 Paris, France
- Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, 28 rue du docteur Roux, 75015 Paris, France
| | - Frédéric Hecht
- Sorbonne Université, CNRS, Université Paris Cité, Laboratoire Jacques-Louis Lions (LJLL), 4 place Jussieu, F-75005 Paris, France
| | - Yvon Maday
- Sorbonne Université, CNRS, Université Paris Cité, Laboratoire Jacques-Louis Lions (LJLL), 4 place Jussieu, F-75005 Paris, France
- Institut Universitaire de France, 1 rue Descartes, 75231 Paris CEDEX 05, France
| | - Olivier Gascuel
- Unité Bioinformatique Evolutive, Institut Pasteur, Université de Paris, 28 rue du docteur Roux, 75015 Paris, France
- Institut de Systématique, Evolution, Biodiversité (ISYEB) - UMR 7205 CNRS, Museum National d’Histoire Naturelle, SU, EPHE & UA, 57 rue Cuvier, CP 50 75005 Paris, France
| |
|
26
|
Lambert S, Voznica J, Morlon H. Deep Learning from Phylogenies for Diversification Analyses. Syst Biol 2023; 72:1262-1279. [PMID: 37556735 DOI: 10.1093/sysbio/syad044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2022] [Revised: 06/20/2023] [Accepted: 08/08/2023] [Indexed: 08/11/2023]
Abstract
Birth-death (BD) models are widely used in combination with species phylogenies to study past diversification dynamics. Current inference approaches typically rely on likelihood-based methods. These methods are not generalizable, as a new likelihood formula must be established each time a new model is proposed; for some models, such a formula is not even tractable. Deep learning can bring solutions in such situations, as deep neural networks can be trained to learn the relation between simulations and parameter values as a regression problem. In this paper, we adapt a recently developed deep learning method from pathogen phylodynamics to the case of diversification inference, and we extend its applicability to the case of the inference of state-dependent diversification models from phylogenies associated with trait data. We demonstrate the accuracy and time efficiency of the approach for the time-constant homogeneous BD model and the Binary-State Speciation and Extinction model. Finally, we illustrate the use of the proposed inference machinery by reanalyzing a phylogeny of primates and their associated ecological role as seed dispersers. Deep learning inference provides at least the same accuracy as likelihood-based inference while being faster by several orders of magnitude, offering a promising new inference approach for the deployment of future models in the field.
Affiliation(s)
- Sophia Lambert
- Institut de Biologie de l'École Normale Supérieure, École Normale Supérieure, CNRS, INSERM, Université Paris Sciences et Lettres, 46 Rue d'Ulm, 75005 Paris, France
- Institute of Ecology and Evolution, Department of Biology, 5289 University of Oregon, Eugene, OR 97403, USA
| | - Jakub Voznica
- Institut Pasteur, Université Paris Cité, Unité Bioinformatique Evolutive, 25-28 Rue du Dr Roux, 75015 Paris, France
- Unité de Biologie Computationnelle, USR 3756 CNRS, 25-28 Rue du Dr Roux, 75015 Paris, France
| | - Hélène Morlon
- Institut de Biologie de l'École Normale Supérieure, École Normale Supérieure, CNRS, INSERM, Université Paris Sciences et Lettres, 46 Rue d'Ulm, 75005 Paris, France
| |
|
27
|
Penn MJ, Scheidwasser N, Penn J, Donnelly CA, Duchêne DA, Bhatt S. Leaping through Tree Space: Continuous Phylogenetic Inference for Rooted and Unrooted Trees. Genome Biol Evol 2023; 15:evad213. [PMID: 38085949 PMCID: PMC10745275 DOI: 10.1093/gbe/evad213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/16/2023] [Indexed: 12/24/2023]
Abstract
Phylogenetics is now fundamental in life sciences, providing insights into the earliest branches of life and the origins and spread of epidemics. However, finding suitable phylogenies from the vast space of possible trees remains challenging. To address this problem, for the first time, we perform both tree exploration and inference in a continuous space where the computation of gradients is possible. This continuous relaxation allows for major leaps across tree space in both rooted and unrooted trees, and is less susceptible to convergence to local minima. Our approach outperforms the current best methods for inference on unrooted trees and, in simulation, accurately infers the tree and root in ultrametric cases. The approach is effective in cases of empirical data with negligible amounts of data, which we demonstrate on the phylogeny of jawed vertebrates. Indeed, only a few genes with an ultrametric signal were generally sufficient for resolving the major lineages of vertebrates. Optimization is possible via automatic differentiation and our method presents an effective way forward for exploring the most difficult, data-deficient phylogenetic questions.
Affiliation(s)
- Matthew J Penn
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Neil Scheidwasser
- Section of Epidemiology, University of Copenhagen, Copenhagen, Denmark
| | - Joseph Penn
- Department of Physics, University of Oxford, Oxford, United Kingdom
| | - Christl A Donnelly
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Pandemic Sciences Institute, University of Oxford, Oxford, United Kingdom
- Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom
| | - David A Duchêne
- Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Samir Bhatt
- Section of Epidemiology, University of Copenhagen, Copenhagen, Denmark
- Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom
| |
|
28
|
Smith ML, Hahn MW. Phylogenetic inference using generative adversarial networks. Bioinformatics 2023; 39:btad543. [PMID: 37669126 PMCID: PMC10500083 DOI: 10.1093/bioinformatics/btad543] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 04/28/2023] [Revised: 08/25/2023] [Accepted: 09/04/2023] [Indexed: 09/07/2023]
Abstract
MOTIVATION The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics. RESULTS We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics. AVAILABILITY AND IMPLEMENTATION phyloGAN is available on github: https://github.com/meganlsmith/phyloGAN/.
Affiliation(s)
- Megan L Smith
- Department of Biology, Indiana University, 1001 E 3rd St, Bloomington, IN 47405, United States
| | - Matthew W Hahn
- Department of Biology, Indiana University, 1001 E 3rd St, Bloomington, IN 47405, United States
- Department of Computer Science, Indiana University, 700 N Woodlawn Avenue, Bloomington, IN 47408, United States
| |
|
29
|
Hederman AP, Ackerman ME. Leveraging deep learning to improve vaccine design. Trends Immunol 2023; 44:333-344. [PMID: 37003949 PMCID: PMC10485910 DOI: 10.1016/j.it.2023.03.002] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 02/22/2023] [Revised: 03/05/2023] [Accepted: 03/05/2023] [Indexed: 04/03/2023]
Abstract
Deep learning has led to incredible breakthroughs in areas of research, from self-driving vehicles to solutions, to formal mathematical proofs. In the biomedical sciences, however, the revolutionary results seen in other fields are only now beginning to be realized. Vaccine research and development efforts represent an application with high public health significance. Protein structure prediction, immune repertoire analysis, and phylogenetics are three principal areas in which deep learning is poised to provide key advances. Here, we opine on some of the current challenges with deep learning and how they are being addressed. Despite the nascent stage of deep learning applications in immunological studies, there is ample opportunity to utilize this new technology to address the most challenging and burdensome infectious diseases confronting global populations.
Affiliation(s)
| | - Margaret E Ackerman
- Thayer School of Engineering, Dartmouth College, Hanover, NH, USA; Department of Microbiology and Immunology, Geisel School of Medicine, Hanover, NH, USA.
| |
|
30
|
Barzilai LP, Schrago CG. Signatures of natural selection in tree topology shape of serially sampled viral phylogenies. Mol Phylogenet Evol 2023; 183:107776. [PMID: 36990305 DOI: 10.1016/j.ympev.2023.107776] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/10/2022] [Revised: 02/24/2023] [Accepted: 03/24/2023] [Indexed: 03/29/2023]
Abstract
Tree shape metrics can be computed fast for trees of any size, which makes them promising alternatives to intensive statistical methods and parameter-rich evolutionary models in the era of massive data availability. Previous studies have demonstrated their effectiveness in unveiling important parameters in viral evolutionary dynamics, although the impact of natural selection on the shape of tree topologies has not been thoroughly investigated. We carried out a forward-time and individual-based simulation to investigate whether tree shape metrics of several kinds could predict the selection regime employed to generate the data. To examine the impact of the genetic diversity of the founder viral population, simulations were run under two opposing starting configurations of the genetic diversity of the infecting viral population. We found that four evolutionary regimes, namely, negative, positive, and frequency-dependent selection, as well as neutral evolution, were successfully distinguished by tree topology shape metrics. Two metrics from the Laplacian spectral density profile (principal eigenvalue and peakedness) and the number of cherries were the most informative for indicating selection type. The genetic diversity of the founder population had an impact on differentiating evolutionary scenarios. Tree imbalance, which has been frequently associated with the action of natural selection on intrahost viral diversity, was also characteristic of neutrally evolving serially sampled data. Metrics calculated from empirical analysis of HIV datasets indicated that most tree topologies exhibited shapes closer to the frequency-dependent selection or neutral evolution regimes.
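Two of the simplest tree shape metrics, the number of cherries and an imbalance score, can be computed in a single traversal, which is why such metrics scale to very large trees. The sketch below assumes a strictly binary rooted tree stored as nested 2-tuples with strings as leaves, and uses the Colless index as the imbalance measure; the representation and the function name `tree_shape` are hypothetical, and the Laplacian spectral metrics from the abstract are not covered.

```python
def tree_shape(node):
    """Return (n_leaves, n_cherries, colless_index) for a binary tree of nested 2-tuples."""
    if not isinstance(node, tuple):
        # A leaf contributes one tip and no cherries or imbalance.
        return 1, 0, 0
    left, right = node
    n_left, cherries_left, colless_left = tree_shape(left)
    n_right, cherries_right, colless_right = tree_shape(right)
    # A cherry is an internal node whose two children are both leaves.
    cherry = 1 if n_left == 1 and n_right == 1 else 0
    # Colless index: sum over internal nodes of the tip-count difference.
    colless = colless_left + colless_right + abs(n_left - n_right)
    return n_left + n_right, cherries_left + cherries_right + cherry, colless

caterpillar = ((("A", "B"), "C"), "D")   # maximally imbalanced 4-tip tree
balanced = (("A", "B"), ("C", "D"))      # perfectly balanced 4-tip tree
print(tree_shape(caterpillar))
print(tree_shape(balanced))
```

The two four-tip examples bracket the possible range: the ladder-like tree has one cherry and the largest imbalance, while the balanced tree has two cherries and zero imbalance.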
|
31
|
Towards precision medicine: Omics approach for COVID-19. BIOSAFETY AND HEALTH 2023; 5:78-88. [PMID: 36687209 PMCID: PMC9846903 DOI: 10.1016/j.bsheal.2023.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/14/2022] [Revised: 01/15/2023] [Accepted: 01/16/2023] [Indexed: 01/19/2023]
Abstract
The coronavirus disease 2019 (COVID-19) pandemic had a devastating impact on human society. Beginning with genome surveillance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the development of omics technologies brought a clearer understanding of the complex SARS-CoV-2 and COVID-19. Here, we reviewed how omics, including genomics, proteomics, single-cell multi-omics, and clinical phenomics, play roles in answering biological and clinical questions about COVID-19. Large-scale sequencing and advanced analysis methods facilitate COVID-19 discovery from virus evolution and severity risk prediction to potential treatment identification. Omics would indicate precise and globalized prevention and medicine for the COVID-19 pandemic under the utilization of big data capability and phenotypes refinement. Furthermore, decoding the evolution rule of SARS-CoV-2 by deep learning models is promising to forecast new variants and achieve more precise data to predict future pandemics and prevent them on time.
|