1
Magee AF, Holbrook AJ, Pekar JE, Caviedes-Solis IW, Matsen IV FA, Baele G, Wertheim JO, Ji X, Lemey P, Suchard MA. Random-Effects Substitution Models for Phylogenetics via Scalable Gradient Approximations. Syst Biol 2024; 73:562-578. [PMID: 38712512 PMCID: PMC11498053 DOI: 10.1093/sysbio/syae019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 02/26/2024] [Accepted: 05/02/2024] [Indexed: 05/08/2024] Open
Abstract
Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random-effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.
Affiliation(s)
- Andrew F Magee
- Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
- Andrew J Holbrook
- Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
- Jonathan E Pekar
- Bioinformatics and Systems Biology Graduate Program, University of California - San Diego, La Jolla, CA, USA
- Department of Biomedical Informatics, University of California - San Diego, La Jolla, CA, USA
- Fredrick A Matsen IV
- Howard Hughes Medical Institute, Seattle, Washington, USA
- Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Department of Statistics, University of Washington, Seattle, Washington, USA
- Guy Baele
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Joel O Wertheim
- Department of Medicine, University of California - San Diego, La Jolla, CA, USA
- Xiang Ji
- Department of Mathematics, Tulane University, New Orleans, LA, USA
- Philippe Lemey
- Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A Suchard
- Department of Biostatistics, Jonathan and Karin Fielding School of Public Health, University of California - Los Angeles, Los Angeles, CA, USA
- Department of Biomathematics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, David Geffen School of Medicine at UCLA, University of California - Los Angeles, Los Angeles, CA, USA
2
Thompson A, Liebeskind BJ, Scully EJ, Landis MJ. Deep Learning and Likelihood Approaches for Viral Phylogeography Converge on the Same Answers Whether the Inference Model Is Right or Wrong. Syst Biol 2024; 73:183-206. [PMID: 38189575 PMCID: PMC11249978 DOI: 10.1093/sysbio/syad074] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 02/15/2023] [Revised: 11/22/2023] [Accepted: 01/05/2024] [Indexed: 01/09/2024] Open
Abstract
Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are often computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare, and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among 5 locations and found they achieve close to the same levels of accuracy as Bayesian inference under the true simulation model. We compared robustness to model misspecification of a trained neural network to that of a Bayesian method. We found that both models had comparable performance, converging on similar biases. We also implemented a method of uncertainty quantification called conformalized quantile regression, which we demonstrate has patterns of sensitivity to model misspecification similar to those of Bayesian highest posterior density (HPD) intervals; its intervals greatly overlap with HPDs but have lower precision (are more conservative). Finally, we trained and tested a neural network against phylogeographic data from a recent study of the SARS-CoV-2 pandemic in Europe and obtained similar estimates of region-specific epidemiological parameters and the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over 3 orders of magnitude faster after training. Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.
Affiliation(s)
- Ammon Thompson
- Participant in an Education Program Sponsored by U.S. Department of Defense (DOD) at the National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
- Erik J Scully
- National Geospatial-Intelligence Agency, Springfield, VA 22150, USA
- Michael J Landis
- Department of Biology, Washington University in St. Louis, Rebstock Hall, St. Louis, MO 63130, USA
3
Chen Z, Lemey P, Yu H. Approaches and challenges to inferring the geographical source of infectious disease outbreaks using genomic data. Lancet Microbe 2024; 5:e81-e92. [PMID: 38042165 DOI: 10.1016/s2666-5247(23)00296-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 06/13/2023] [Revised: 09/03/2023] [Accepted: 09/13/2023] [Indexed: 12/04/2023]
Abstract
Genomic data hold increasing potential in the elucidation of transmission dynamics and geographical sources of infectious disease outbreaks. Phylogeographic methods that use epidemiological and genomic data obtained from surveillance enable us to infer the history of spatial transmission that is naturally embedded in the topology of phylogenetic trees as a record of the dispersal of infectious agents between geographical locations. In this Review, we provide an overview of phylogeographic approaches widely used for reconstructing the geographical sources of outbreaks of interest. These approaches can be classified into ancestral trait or state reconstruction and structured population models, with structured population models including popular structured coalescent and birth-death models. We also describe the major challenges associated with sequencing technologies, surveillance strategies, data sharing, and analysis frameworks that became apparent during the generation of large-scale genomic data in recent years, extending beyond inference approaches. Finally, we highlight the role of genomic data in geographical source inference and clarify how this enhances understanding and molecular investigations of outbreak sources.
Affiliation(s)
- Zhiyuan Chen
- School of Public Health, Fudan University, Key Laboratory of Public Health Safety, Ministry of Education, Shanghai, China
- Philippe Lemey
- Department of Microbiology, Immunology and Transplantation, Rega Institute, Laboratory of Clinical and Evolutionary Virology, KU Leuven, Leuven, Belgium
- Hongjie Yu
- School of Public Health, Fudan University, Key Laboratory of Public Health Safety, Ministry of Education, Shanghai, China
4
Gao J, May MR, Rannala B, Moore BR. PrioriTree: a utility for improving phylodynamic analyses in BEAST. Bioinformatics 2023; 39:6967033. [PMID: 36592035 PMCID: PMC9841403 DOI: 10.1093/bioinformatics/btac849] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/25/2022] [Revised: 12/20/2022] [Accepted: 12/30/2022] [Indexed: 01/03/2023] Open
Abstract
SUMMARY: Phylodynamic methods are central to studies of the geographic and demographic history of disease outbreaks. Inference under discrete-geographic phylodynamic models (which involve many parameters that must be inferred from minimal information) is inherently sensitive to our prior beliefs about the model parameters. We present an interactive utility, PrioriTree, to help researchers identify and accommodate prior sensitivity in discrete-geographic inferences. Specifically, PrioriTree provides a suite of functions to generate input files for, and summarize output from, BEAST analyses for performing robust Bayesian inference, data-cloning analyses and assessing the relative and absolute fit of candidate discrete-geographic (prior) models to empirical datasets.
AVAILABILITY AND IMPLEMENTATION: PrioriTree is distributed as an R package available at https://github.com/jsigao/prioritree, with a comprehensive user manual provided at https://bookdown.org/jsigao/prioritree_manual/.
Affiliation(s)
- Jiansi Gao
- To whom correspondence should be addressed
- Michael R May
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
- Department of Integrative Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Bruce Rannala
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA
- Brian R Moore
- Department of Evolution and Ecology, University of California, Davis, Davis, CA 95616, USA