1
Liu Y, Chen H, Duan W, Zhang X, He X, Nielsen R, Ma L, Zhai W. Predicting Egg Passage Adaptations to Design Better Vaccines for the H3N2 Influenza Virus. Viruses 2022; 14:v14092065. [PMID: 36146872 PMCID: PMC9501976 DOI: 10.3390/v14092065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Revised: 09/12/2022] [Accepted: 09/13/2022] [Indexed: 11/16/2022] Open
Abstract
Seasonal H3N2 influenza evolves rapidly, leading to an extremely poor vaccine efficacy. Substitutions employed during vaccine production using embryonated eggs (i.e., egg passage adaptation) contribute to the poor vaccine efficacy (VE), but the evolutionary mechanism remains elusive. Using an unprecedented number of hemagglutinin sequences (n = 89,853), we found that the fitness landscape of passage adaptation is dominated by pervasive epistasis between two leading residues (186 and 194) and multiple other positions. Convergent evolutionary paths driven by strong epistasis explain most of the variation in VE, which has resulted in extremely poor vaccines for the past decade. Leveraging the unique fitness landscape, we developed a novel machine learning model that can predict egg passage substitutions for any candidate vaccine strain before the passage experiment, providing a unique opportunity for the selection of optimal vaccine viruses. Our study presents one of the most comprehensive characterizations of the fitness landscape of a virus and demonstrates that evolutionary trajectories can be harnessed for improved influenza vaccines.
Affiliation(s)
- Yunsong Liu
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of the Chinese Academy of Sciences, Beijing 100049, China
- Hui Chen
  - Human Genetics, Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore 138672, Singapore
- Wenyuan Duan
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of the Chinese Academy of Sciences, Beijing 100049, China
- Xinyi Zhang
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of the Chinese Academy of Sciences, Beijing 100049, China
- Xionglei He
  - MOE Key Laboratory of Gene Function and Regulation, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou 510275, China
- Rasmus Nielsen
  - Department of Integrative Biology, University of California-Berkeley, Berkeley, CA 94707, USA
  - Department of Statistics, University of California-Berkeley, Berkeley, CA 94707, USA
  - Globe Institute, University of Copenhagen, 1350 København, Copenhagen, Denmark
- Liang Ma
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- Weiwei Zhai
  - Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
  - University of the Chinese Academy of Sciences, Beijing 100049, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
2
Latrille T, Lanore V, Lartillot N. Inferring long-term effective population size with Mutation-Selection Models. Mol Biol Evol 2021; 38:4573-4587. [PMID: 34191010 PMCID: PMC8476147 DOI: 10.1093/molbev/msab160] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/27/2022] Open
Abstract
Mutation–selection phylogenetic codon models are grounded on population genetics first principles and represent a principled approach for investigating the intricate interplay between mutation, selection, and drift. In their current form, mutation–selection codon models are entirely characterized by the collection of site-specific amino-acid fitness profiles. However, thus far, they have relied on the assumption of a constant genetic drift, translating into a unique effective population size (Ne) across the phylogeny, clearly an unrealistic assumption. This assumption can be alleviated by introducing variation in Ne between lineages. In addition to Ne, the mutation rate (μ) is susceptible to vary between lineages, and both should covary with life-history traits (LHTs). This suggests that the model should more globally account for the joint evolutionary process followed by all of these lineage-specific variables (Ne, μ, and LHTs). In this direction, we introduce an extended mutation–selection model jointly reconstructing in a Bayesian Monte Carlo framework the fitness landscape across sites and long-term trends in Ne, μ, and LHTs along the phylogeny, from an alignment of DNA coding sequences and a matrix of observed LHTs in extant species. The model was tested against simulated data and applied to empirical data in mammals, isopods, and primates. The reconstructed history of Ne in these groups appears to correlate with LHTs or ecological variables in a way that suggests that the reconstruction is reasonable, at least in its global trends. On the other hand, the range of variation in Ne inferred across species is surprisingly narrow. This last point suggests that some of the assumptions of the model, in particular concerning the assumed absence of epistatic interactions between sites, are potentially problematic.
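As a rough illustration of how mutation, selection, and drift combine in a mutation-selection codon model, the sketch below uses the standard Halpern-Bruno form: the substitution rate from codon i to codon j is the mutation rate multiplied by a relative fixation probability S / (1 - exp(-S)), where S is a scaled selection coefficient. The function names, the scaling S = 4 * Ne * (f_j - f_i), and the numeric values are illustrative assumptions, not the parameterization or software of the cited paper.

```python
import math


def relative_fixation(S):
    """Fixation probability of a mutation relative to a neutral one.

    S is the scaled selection coefficient. Returns S / (1 - exp(-S)),
    which tends to 1 as S approaches 0.
    """
    if abs(S) < 1e-8:
        return 1.0 + S / 2.0  # first-order Taylor expansion near S = 0
    if S < -700.0:
        return 0.0  # avoids overflow in expm1; the true value is negligible
    return S / -math.expm1(-S)


def substitution_rate(mu_ij, fitness_i, fitness_j, Ne):
    """Substitution rate i -> j: mutation rate times relative fixation.

    Assumes S = 4 * Ne * (fitness_j - fitness_i), a hypothetical scaling.
    """
    S = 4.0 * Ne * (fitness_j - fitness_i)
    return mu_ij * relative_fixation(S)


if __name__ == "__main__":
    # Same fitness difference under two population sizes (made-up numbers).
    for Ne in (1e3, 1e5):
        up = substitution_rate(1e-8, 0.0, 1e-5, Ne)
        down = substitution_rate(1e-8, 1e-5, 0.0, Ne)
        print(f"Ne={Ne:g}: advantageous={up:.3e}, deleterious={down:.3e}")
```

Under these assumptions a larger Ne makes selection more efficient: the same deleterious change fixes less often, which is why lineage-specific Ne leaves a signal in substitution patterns.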
Affiliation(s)
- T Latrille
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, F-69622, Villeurbanne, France
  - École Normale Supérieure de Lyon, Université de Lyon, Université Lyon 1, Lyon, France
- V Lanore
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, F-69622, Villeurbanne, France
- N Lartillot
  - Université de Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, F-69622, Villeurbanne, France
3
Landis M, Edwards EJ, Donoghue MJ. Modeling Phylogenetic Biome Shifts on a Planet with a Past. Syst Biol 2020; 70:86-107. [PMID: 32514540 DOI: 10.1093/sysbio/syaa045] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 11/07/2020] [Accepted: 05/27/2020] [Indexed: 12/30/2022] Open
Abstract
The spatial distribution of biomes has changed considerably over deep time, so the geographical opportunity for an evolutionary lineage to shift into a new biome may depend on how the availability and connectivity of biomes has varied temporally. To better understand how lineages shift between biomes in space and time, we developed a phylogenetic biome shift model in which each lineage shifts between biomes and disperses between regions at rates that depend on the lineage's biome affinity and location relative to the spatial distribution of biomes at any given time. To study the behavior of the biome shift model in an empirical setting, we developed a literature-based representation of paleobiome structure for three mesic forest biomes, six regions, and eight time strata, ranging from the Late Cretaceous (100 Ma) through the present. We then fitted the model to a time-calibrated phylogeny of 119 Viburnum species to compare how the results responded to various realistic or unrealistic assumptions about paleobiome structure. Ancestral biome estimates that account for paleobiome dynamics reconstructed a warm temperate (or tropical) origin of Viburnum, which is consistent with previous fossil-based estimates of ancestral biomes. Imposing unrealistic paleobiome distributions led to ancestral biome estimates that eliminated support for tropical origins, and instead inflated support for cold temperate ancestry throughout the warmer Paleocene and Eocene. The biome shift model we describe is applicable to the study of evolutionary systems beyond Viburnum, and the core mechanisms of our model are extensible to the design of richer phylogenetic models of historical biogeography and/or lineage diversification. 
We conclude that biome shift models that account for dynamic geographical opportunities are important for inferring ancestral biomes that are compatible with our understanding of Earth history. [Ancestral states; biome shifts; historical biogeography; niche conservatism; phylogenetics.]
Affiliation(s)
- Michael Landis
  - Department of Biology, Washington University in St. Louis, One Brookings Drive, St. Louis, MO 63130, USA
  - Department of Ecology & Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT 06520, USA
- Erika J Edwards
  - Department of Ecology & Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT 06520, USA
  - Division of Botany, Yale Peabody Museum of Natural History, P.O. Box 208118, New Haven, CT 06520, USA
- Michael J Donoghue
  - Department of Ecology & Evolutionary Biology, Yale University, PO Box 208106, New Haven, CT 06520, USA
  - Division of Botany, Yale Peabody Museum of Natural History, P.O. Box 208118, New Haven, CT 06520, USA
4
Freyman WA, Höhna S. Stochastic Character Mapping of State-Dependent Diversification Reveals the Tempo of Evolutionary Decline in Self-Compatible Onagraceae Lineages. Syst Biol 2018; 68:505-519. [DOI: 10.1093/sysbio/syy078] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 01/16/2018] [Revised: 11/05/2018] [Accepted: 11/13/2018] [Indexed: 11/13/2022] Open
Affiliation(s)
- William A Freyman
  - Department of Integrative Biology, University of California, Berkeley, 3040 Valley Life Sciences Building #3140, CA 94720, USA
- Sebastian Höhna
  - Division of Evolutionary Biology, Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, 80539 Munich, Germany
5
Landis MJ, Freyman WA, Baldwin BG. Retracing the Hawaiian silversword radiation despite phylogenetic, biogeographic, and paleogeographic uncertainty. Evolution 2018; 72:2343-2359. [PMID: 30198108 DOI: 10.1111/evo.13594] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 04/27/2018] [Accepted: 08/17/2018] [Indexed: 12/25/2022]
Abstract
The Hawaiian silversword alliance (Asteraceae) is an iconic adaptive radiation. However, like many island plant lineages, no fossils have been assigned to the clade. As a result, the clade's age and diversification rate are not known precisely, making it difficult to test biogeographic hypotheses about the radiation. In lieu of fossils, paleogeographically structured biogeographic processes may inform species divergence times; for example, an island must first exist for a clade to radiate upon it. We date the silversword clade and test biogeographic hypotheses about its radiation across the Hawaiian Archipelago by modeling interactions between species relationships, molecular evolution, biogeographic scenarios, divergence times, and island origination times using the Bayesian phylogenetic framework, RevBayes. The ancestor of living silverswords most likely colonized the modern Hawaiian Islands once from the mainland approximately 5.1 Ma, with the most recent common ancestor of extant silversword lineages first appearing approximately 3.5 Ma. Applying an event-based test of the progression rule of island biogeography, we found strong evidence that the dispersal process favors old-to-young directionality, but strong evidence for diversification continuing unabated into later phases of island ontogeny, particularly for Kaua'i. This work serves as a general example for how diversification studies benefit from incorporating biogeographic and paleogeographic components.
Affiliation(s)
- Michael J Landis
  - Department of Ecology & Evolution, Yale University, New Haven, Connecticut 06511
- William A Freyman
  - Department of Ecology, Evolution, & Behavior, University of Minnesota, Saint Paul, Minnesota 55108
  - Department of Integrative Biology, University of California, Berkeley, California 94720
  - Jepson Herbarium, University of California, Berkeley, California 94720
- Bruce G Baldwin
  - Department of Integrative Biology, University of California, Berkeley, California 94720
  - Jepson Herbarium, University of California, Berkeley, California 94720
6
Lee HJ, Kishino H, Rodrigue N, Thorne JL. Grouping substitution types into different relaxed molecular clocks. Philos Trans R Soc Lond B Biol Sci 2017; 371:rstb.2015.0141. [PMID: 27325837 DOI: 10.1098/rstb.2015.0141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 01/07/2016] [Indexed: 11/12/2022] Open
Abstract
Different types of nucleotide substitutions experience different patterns of rate change over time. We propose clustering context-dependent (or context-independent) nucleotide substitution types according to how their rates change and then using the grouping for divergence time estimation. With our models, relative rates among types that are in the same group are fixed, whereas absolute rates of the types within a group change over time according to a shared relaxed molecular clock. We illustrate our procedure by analysing a 0.15 Mb intergenic region to infer divergence times relating eight primates. The different groupings of substitution types that we explore have little effect on the posterior means of divergence times, but the widths of the credibility intervals decrease as the number of groups increases. This article is part of the themed issue 'Dating species divergences using rocks and clocks'.
Affiliation(s)
- Hui-Jie Lee
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Hirohisa Kishino
  - Laboratory of Biometrics and Bioinformatics, University of Tokyo, Tokyo, Japan
- Nicolas Rodrigue
  - Department of Biology, Institute of Biochemistry, and School of Mathematics and Statistics, Carleton University, Ottawa, Ontario, Canada
- Jeffrey L Thorne
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
7
Davydov II, Robinson-Rechavi M, Salamin N. State aggregation for fast likelihood computations in molecular evolution. Bioinformatics 2017; 33:354-362. [PMID: 28172542 PMCID: PMC5408795 DOI: 10.1093/bioinformatics/btw632] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/13/2016] [Revised: 09/07/2016] [Accepted: 09/23/2016] [Indexed: 12/24/2022] Open
Abstract
Motivation: Codon models are widely used to identify the signature of selection at the molecular level and to test for changes in selective pressure during the evolution of genes encoding proteins. The large size of the state space of the Markov processes used to model codon evolution makes it difficult to use these models with large biological datasets. We propose here to use state aggregation to reduce the state space of codon models and, thus, improve the computational performance of likelihood estimation on these models. Results: We show that this heuristic speeds up the computations of the M0 and branch-site models up to 6.8 times. We also show through simulations that state aggregation does not introduce a detectable bias. We analyzed a real dataset and show that aggregation provides highly correlated predictions compared to the full likelihood computations. Finally, state aggregation is a very general approach and can be applied to any continuous-time Markov process-based model with large state space, such as amino acid and coevolution models. We therefore discuss different ways to apply state aggregation to Markov models used in phylogenetics. Availability and Implementation: The heuristic is implemented in the godon package (https://bitbucket.org/Davydov/godon) and in a version of FastCodeML (https://gitlab.isb-sib.ch/phylo/fastcodeml). Contact: nicolas.salamin@unil.ch Supplementary Information: Supplementary data are available at Bioinformatics online.
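To make the idea of state aggregation concrete, the sketch below lumps the states of a small continuous-time Markov chain into groups and builds the reduced rate matrix. The weighted-average rule used here (outgoing flow from a group, weighted by per-state weights such as equilibrium frequencies) is one common way to aggregate and is an assumption for illustration; the function name `aggregate_rate_matrix` is hypothetical and this is not the godon or FastCodeML implementation.

```python
def aggregate_rate_matrix(Q, groups, weights):
    """Lump CTMC states into groups and return the smaller rate matrix.

    Q       : square list-of-lists rate matrix (rows sum to zero)
    groups  : list of lists of state indices forming a partition
    weights : nonnegative weight per original state (e.g. frequencies)
    """
    m = len(groups)
    agg = [[0.0] * m for _ in range(m)]
    for a, src in enumerate(groups):
        total_w = sum(weights[i] for i in src)
        for b, dst in enumerate(groups):
            if a == b:
                continue
            # Weighted total rate of leaving group a for group b.
            flow = sum(weights[i] * Q[i][j] for i in src for j in dst)
            agg[a][b] = flow / total_w
        # Diagonal is set so that the row sums to zero.
        agg[a][a] = -sum(agg[a])
    return agg


if __name__ == "__main__":
    # Toy 3-state chain; states 1 and 2 are merged into one group.
    Q = [[-3.0, 1.0, 2.0],
         [1.0, -2.0, 1.0],
         [2.0, 2.0, -4.0]]
    print(aggregate_rate_matrix(Q, [[0], [1, 2]], [1.0, 1.0, 1.0]))
```

The reduced matrix has one row per group, so any downstream cost that scales with the number of states shrinks accordingly; the price is that transitions inside a group are no longer modelled.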
Affiliation(s)
- Iakov I Davydov
  - Department of Ecology and Evolution, Biophore, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Genopode, Quartier Sorge, Lausanne, Switzerland
- Marc Robinson-Rechavi
  - Department of Ecology and Evolution, Biophore, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Genopode, Quartier Sorge, Lausanne, Switzerland
- Nicolas Salamin
  - Department of Ecology and Evolution, Biophore, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Genopode, Quartier Sorge, Lausanne, Switzerland
8
Lee HJ, Rodrigue N, Thorne JL. Relaxing the Molecular Clock to Different Degrees for Different Substitution Types. Mol Biol Evol 2015; 32:1948-61. [PMID: 25931515 PMCID: PMC4833082 DOI: 10.1093/molbev/msv099] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/06/2023] Open
Abstract
Rates of molecular evolution can vary over time. Diverse statistical techniques for divergence time estimation have been developed to accommodate this variation. These typically require that all sequence (or codon) positions at a locus change independently of one another. They also generally assume that the rates of different types of nucleotide substitutions vary across a phylogeny in the same way. This permits divergence time estimation procedures to employ an instantaneous rate matrix with relative rates that do not differ among branches. However, previous studies have suggested that some substitution types (e.g., CpG to TpG changes in mammals) are more clock-like than others. As has been previously noted, this is biologically plausible given the mutational mechanism of CpG to TpG changes. Through stochastic mapping of sequence histories from context-independent substitution models, our approach allows for context-dependent nucleotide substitutions to change their relative rates over time. We apply our approach to the analysis of a 0.15 Mb intergenic region from eight primates. In accord with previous findings, we find comparatively little rate variation over time for CpG to TpG substitutions but we find more for other substitution types. We conclude by discussing the limitations and prospects of our approach.
Affiliation(s)
- Hui-Jie Lee
  - Department of Statistics, North Carolina State University
- Jeffrey L Thorne
  - Department of Statistics, North Carolina State University
  - Department of Biological Sciences, North Carolina State University
9
Höhna S, Heath TA, Boussau B, Landis MJ, Ronquist F, Huelsenbeck JP. Probabilistic graphical model representation in phylogenetics. Syst Biol 2014; 63:753-71. [PMID: 24951559 PMCID: PMC4184382 DOI: 10.1093/sysbio/syu039] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Indexed: 11/15/2022] Open
Abstract
Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (i) reproducibility of an analysis, (ii) model development, and (iii) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and nonspecialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well-suited to teach statistical models, to facilitate communication among phylogeneticists and in the development of generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis–Hastings or Gibbs sampling of the posterior distribution. [Computation; graphical models; inference; modularization; statistical phylogenetics; tree plate.]
Affiliation(s)
- Sebastian Höhna
  - Department of Mathematics, Stockholm University, Stockholm, SE-106 91 Stockholm, Sweden; Department of Evolution and Ecology, University of California, Davis, Storer Hall, One Shields Avenue, Davis, CA 95616, USA; Department of Integrative Biology, University of California, Berkeley, CA 94720, USA; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA; Bioinformatics and Evolutionary Genomics, Université de Lyon, Villeurbanne, France; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405 Stockholm, Sweden; and Department of Biological Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Tracy A Heath
  - Department of Mathematics, Stockholm University, Stockholm, SE-106 91 Stockholm, Sweden; Department of Evolution and Ecology, University of California, Davis, Storer Hall, One Shields Avenue, Davis, CA 95616, USA; Department of Integrative Biology, University of California, Berkeley, CA 94720, USA; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA; Bioinformatics and Evolutionary Genomics, Université de Lyon, Villeurbanne, France; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405 Stockholm, Sweden; and Department of Biological Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Bastien Boussau
  - Department of Mathematics, Stockholm University, Stockholm, SE-106 91 Stockholm, Sweden; Department of Evolution and Ecology, University of California, Davis, Storer Hall, One Shields Avenue, Davis, CA 95616, USA; Department of Integrative Biology, University of California, Berkeley, CA 94720, USA; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA; Bioinformatics and Evolutionary Genomics, Université de Lyon, Villeurbanne, France; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405 Stockholm, Sweden; and Department of Biological Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Michael J Landis
  - Department of Mathematics, Stockholm University, Stockholm, SE-106 91 Stockholm, Sweden; Department of Evolution and Ecology, University of California, Davis, Storer Hall, One Shields Avenue, Davis, CA 95616, USA; Department of Integrative Biology, University of California, Berkeley, CA 94720, USA; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA; Bioinformatics and Evolutionary Genomics, Université de Lyon, Villeurbanne, France; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405 Stockholm, Sweden; and Department of Biological Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Fredrik Ronquist
  - Department of Mathematics, Stockholm University, Stockholm, SE-106 91 Stockholm, Sweden; Department of Evolution and Ecology, University of California, Davis, Storer Hall, One Shields Avenue, Davis, CA 95616, USA; Department of Integrative Biology, University of California, Berkeley, CA 94720, USA; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA; Bioinformatics and Evolutionary Genomics, Université de Lyon, Villeurbanne, France; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405 Stockholm, Sweden; and Department of Biological Science, King Abdulaziz University, Jeddah, Saudi Arabia
- John P Huelsenbeck
  - Department of Mathematics, Stockholm University, Stockholm, SE-106 91 Stockholm, Sweden; Department of Evolution and Ecology, University of California, Davis, Storer Hall, One Shields Avenue, Davis, CA 95616, USA; Department of Integrative Biology, University of California, Berkeley, CA 94720, USA; Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045, USA; Bioinformatics and Evolutionary Genomics, Université de Lyon, Villeurbanne, France; Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405 Stockholm, Sweden; and Department of Biological Science, King Abdulaziz University, Jeddah, Saudi Arabia
10
Abstract
Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating the species/organisms carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and work well for small state spaces. The computations slow down considerably for larger state spaces (e.g., space of codons), because current methodology relies on exponentiating CTMC infinitesimal rate matrices, an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, which does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. The computational complexity of our MCMC method grows as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and increase the computational efficiency of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be for our method to outperform matrix exponentiation approaches. We show that even on the moderately large state space of codons our MCMC method can be significantly faster than currently used matrix exponentiation methods.
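For readers unfamiliar with uniformization, the sketch below shows the underlying identity rather than the MCMC sampler described in the abstract. With a rate λ at least as large as every exit rate, and B = I + Q/λ, the transition probabilities satisfy P(t) = Σ_k Poisson(k; λt) B^k. The code computes one row of P(t) using only vector-matrix products, each costing on the order of the state space size squared, with no matrix exponential. The function name `transition_row`, the truncation rule, and the toy rate matrix are illustrative assumptions.

```python
import math


def transition_row(Q, start, t, tol=1e-12, max_terms=10_000):
    """Row `start` of P(t) = exp(Q t), computed by uniformization.

    Q is a square list-of-lists rate matrix whose rows sum to zero.
    Intended for moderate lam * t; exp(-lam * t) underflows otherwise.
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))  # uniformization rate
    v = [0.0] * n
    v[start] = 1.0
    if lam == 0.0 or t == 0.0:
        return v
    # Transition matrix of the uniformized discrete-time chain.
    B = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    weight = math.exp(-lam * t)  # Poisson probability of zero jumps
    total = weight
    row = [weight * x for x in v]
    for k in range(1, max_terms):
        if 1.0 - total <= tol:
            break  # remaining Poisson mass is below the tolerance
        v = [sum(v[i] * B[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k  # Poisson probability of k jumps
        total += weight
        row = [r + weight * x for r, x in zip(row, v)]
    return row


if __name__ == "__main__":
    Q = [[-1.0, 1.0], [2.0, -2.0]]  # toy two-state chain
    print(transition_row(Q, 0, 0.5))
```

If Q is sparse, the vector-matrix product can skip zero entries, which is the kind of saving the abstract alludes to; this dense version does not exploit that.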
Affiliation(s)
- Jan Irvahn
  - Department of Statistics, University of Washington, Seattle, Washington
11
Vaughan TG, Kühnert D, Popinga A, Welch D, Drummond AJ. Efficient Bayesian inference under the structured coalescent. Bioinformatics 2014; 30:2272-9. [PMID: 24753484 PMCID: PMC4207426 DOI: 10.1093/bioinformatics/btu201] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Indexed: 12/28/2022] Open
Abstract
Motivation: Population structure significantly affects evolutionary dynamics. Such structure may be due to spatial segregation, but may also reflect any other gene-flow-limiting aspect of a model. In combination with the structured coalescent, this fact can be used to inform phylogenetic tree reconstruction, as well as to infer parameters such as migration rates and subpopulation sizes from annotated sequence data. However, conducting Bayesian inference under the structured coalescent is impeded by the difficulty of constructing Markov Chain Monte Carlo (MCMC) sampling algorithms (samplers) capable of efficiently exploring the state space. Results: In this article, we present a new MCMC sampler capable of sampling from posterior distributions over structured trees: timed phylogenetic trees in which lineages are associated with the distinct subpopulation in which they lie. The sampler includes a set of MCMC proposal functions that offer significant mixing improvements over a previously published method. Furthermore, its implementation as a BEAST 2 package ensures maximum flexibility with respect to model and prior specification. We demonstrate the usefulness of this new sampler by using it to infer migration rates and effective population sizes of H3N2 influenza between New Zealand, New York and Hong Kong from publicly available hemagglutinin (HA) gene sequences under the structured coalescent. Availability and implementation: The sampler has been implemented as a publicly available BEAST 2 package that is distributed under version 3 of the GNU General Public License at http://compevol.github.io/MultiTypeTree. Contact: tgvaughan@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Timothy G Vaughan
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North 4442, New Zealand, Institute of Integrative Biology, Swiss Federal Institute of Technology (ETH), Zurich 8092, Switzerland and Department of Computer Science, University of Auckland, Auckland 1142, New Zealand
- Denise Kühnert
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North 4442, New Zealand, Institute of Integrative Biology, Swiss Federal Institute of Technology (ETH), Zurich 8092, Switzerland and Department of Computer Science, University of Auckland, Auckland 1142, New Zealand
- Alex Popinga
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North 4442, New Zealand, Institute of Integrative Biology, Swiss Federal Institute of Technology (ETH), Zurich 8092, Switzerland and Department of Computer Science, University of Auckland, Auckland 1142, New Zealand
- David Welch
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North 4442, New Zealand, Institute of Integrative Biology, Swiss Federal Institute of Technology (ETH), Zurich 8092, Switzerland and Department of Computer Science, University of Auckland, Auckland 1142, New Zealand
- Alexei J Drummond
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North 4442, New Zealand, Institute of Integrative Biology, Swiss Federal Institute of Technology (ETH), Zurich 8092, Switzerland and Department of Computer Science, University of Auckland, Auckland 1142, New Zealand
12
Lartillot N. A phylogenetic Kalman filter for ancestral trait reconstruction using molecular data. Bioinformatics 2013; 30:488-96. [PMID: 24318999 DOI: 10.1093/bioinformatics/btt707] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Abstract
Motivation: Correlation between life history or ecological traits and genomic features such as nucleotide or amino acid composition can be used for reconstructing the evolutionary history of the traits of interest along phylogenies. Thus far, however, such ancestral reconstructions have been done using simple linear regression approaches that do not account for phylogenetic inertia. These reconstructions could instead be seen as a genuine comparative regression problem, such as formalized by classical generalized least-square comparative methods, in which the trait of interest and the molecular predictor are represented as correlated Brownian characters coevolving along the phylogeny. Results: Here, a Bayesian sampler is introduced, representing an alternative and more efficient algorithmic solution to this comparative regression problem, compared with currently existing generalized least-square approaches. Technically, ancestral trait reconstruction based on a molecular predictor is shown to be formally equivalent to a phylogenetic Kalman filter problem, for which backward and forward recursions are developed and implemented in the context of a Markov chain Monte Carlo sampler. The comparative regression method results in more accurate reconstructions and a more faithful representation of uncertainty, compared with simple linear regression. Application to the reconstruction of the evolution of optimal growth temperature in Archaea, using GC composition in ribosomal RNA stems and amino acid composition of a sample of protein-coding genes, confirms previous findings, in particular, pointing to a hyperthermophilic ancestor for the kingdom. Availability and implementation: The program is freely available at www.phylobayes.org.
Affiliation(s)
- Nicolas Lartillot
- Laboratoire de Biométrie et Biologie Évolutive, Centre National de la Recherche Scientifique, UMR 5558. Université Lyon 1, F-69622 Villeurbanne, France and Centre Robert-Cedergren pour la Bioinformatique, Département de Biochimie, Université de Montréal, Québec, Canada
13
Lartillot N, Rodrigue N, Stubbs D, Richer J. PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst Biol 2013; 62:611-5. [PMID: 23564032 DOI: 10.1093/sysbio/syt022] [Citation(s) in RCA: 581] [Impact Index Per Article: 48.4] [Indexed: 11/13/2022]
Abstract
Modeling across site variation of the substitution process is increasingly recognized as important for obtaining more accurate phylogenetic reconstructions. Both finite and infinite mixture models have been proposed and have been shown to significantly improve on classical single-matrix models. Compared with their finite counterparts, infinite mixtures have a greater expressivity. However, they are computationally more challenging. This has resulted in practical compromises in the design of infinite mixture models. In particular, a fast but simplified version of a Dirichlet process model over equilibrium frequency profiles implemented in PhyloBayes has often been used in recent phylogenomics studies, while more refined model structures, more realistic and empirically more fit, have been practically out of reach. We introduce a message passing interface version of PhyloBayes, implementing the Dirichlet process mixture models as well as more classical empirical matrices and finite mixtures. The parallelization is made efficient thanks to the combination of two algorithmic strategies: a partial Gibbs sampling update of the tree topology and the use of a truncated stick-breaking representation for the Dirichlet process prior. The implementation shows close to linear gains in computational speed for up to 64 cores, thus allowing faster phylogenetic reconstruction under complex mixture models. PhyloBayes MPI is freely available from our website www.phylobayes.org.
Affiliation(s)
- Nicolas Lartillot
- Centre Robert Cedergren pour la Bioinformatique, Département de Biochimie, Université de Montréal, C.P. 6128, Succursale Centre-ville. Montréal, Québec H3C 3J7, Canada.
14
Murrell B, Moola S, Mabona A, Weighill T, Sheward D, Kosakovsky Pond SL, Scheffler K. FUBAR: a fast, unconstrained Bayesian approximation for inferring selection. Mol Biol Evol 2013; 30:1196-205. [PMID: 23420840 DOI: 10.1093/molbev/mst030] [Citation(s) in RCA: 929] [Impact Index Per Article: 77.4] [Indexed: 01/21/2023]
Abstract
Model-based analyses of natural selection often categorize sites into a relatively small number of site classes. Forcing each site to belong to one of these classes places unrealistic constraints on the distribution of selection parameters, which can result in misleading inference due to model misspecification. We present an approximate hierarchical Bayesian method using a Markov chain Monte Carlo (MCMC) routine that ensures robustness against model misspecification by averaging over a large number of predefined site classes. This leaves the distribution of selection parameters essentially unconstrained, and also allows sites experiencing positive and purifying selection to be identified orders of magnitude faster than by existing methods. We demonstrate that popular random effects likelihood methods can produce misleading results when sites assigned to the same site class experience different levels of positive or purifying selection--an unavoidable scenario when using a small number of site classes. Our Fast Unconstrained Bayesian AppRoximation (FUBAR) is unaffected by this problem, while achieving higher power than existing unconstrained (fixed effects likelihood) methods. The speed advantage of FUBAR allows us to analyze larger data sets than other methods: We illustrate this on a large influenza hemagglutinin data set (3,142 sequences). FUBAR is available as a batch file within the latest HyPhy distribution (http://www.hyphy.org), as well as on the Datamonkey web server (http://www.datamonkey.org/).
Affiliation(s)
- Ben Murrell
- Department of Mathematical Sciences, Stellenbosch University, Stellenbosch, South Africa
15
On the statistical interpretation of site-specific variables in phylogeny-based substitution models. Genetics 2012; 193:557-64. [PMID: 23222651 DOI: 10.1534/genetics.112.145722] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Indexed: 01/23/2023]
Abstract
Phylogeny-based modeling of heterogeneity across the positions of multiple-sequence alignments has generally been approached from two main perspectives. The first treats site specificities as random variables drawn from a statistical law, and the likelihood function takes the form of an integral over this law. The second assigns distinct variables to each position, and, in a maximum-likelihood context, adjusts these variables, along with global parameters, to optimize a joint likelihood function. Here, it is emphasized that while the first approach directly enjoys the statistical guarantees of traditional likelihood theory, the latter does not, and should be approached with particular caution when the site-specific variables are high dimensional. Using a phylogeny-based mutation-selection framework, it is shown that the difference in interpretation of site-specific variables explains the incongruities in recent studies regarding distributions of selection coefficients.
16
Lemey P, Minin VN, Bielejec F, Kosakovsky Pond SL, Suchard MA. A counting renaissance: combining stochastic mapping and empirical Bayes to quickly detect amino acid sites under positive selection. Bioinformatics 2012; 28:3248-56. [PMID: 23064000 DOI: 10.1093/bioinformatics/bts580] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.8] [Indexed: 01/13/2023]
Abstract
Motivation: Statistical methods for comparing relative rates of synonymous and non-synonymous substitutions maintain a central role in detecting positive selection. To identify selection, researchers often estimate the ratio of these relative rates (dN/dS) at individual alignment sites. Fitting a codon substitution model that captures heterogeneity in dN/dS across sites provides a reliable way to perform such estimation, but it remains computationally prohibitive for massive datasets. By using crude estimates of the numbers of synonymous and non-synonymous substitutions at each site, counting approaches scale well to large datasets, but they fail to account for ancestral state reconstruction uncertainty and to provide site-specific dN/dS estimates. Results: We propose a hybrid solution that borrows the computational strength of counting methods, but augments these methods with empirical Bayes modeling to produce a relatively fast and reliable method capable of estimating site-specific dN/dS values in large datasets. Importantly, our hybrid approach, set in a Bayesian framework, integrates over the posterior distribution of phylogenies and ancestral reconstructions to quantify uncertainty about site-specific dN/dS estimates. Simulations demonstrate that this method competes well with more-principled statistical procedures and, in some cases, even outperforms them. We illustrate the utility of our method using human immunodeficiency virus, feline panleukopenia and canine parvovirus evolution examples.
Affiliation(s)
- Philippe Lemey
- Department of Microbiology and Immunology, Rega Institute, KU Leuven, B-3000 Leuven, Belgium.
17
Romiguier J, Figuet E, Galtier N, Douzery EJP, Boussau B, Dutheil JY, Ranwez V. Fast and robust characterization of time-heterogeneous sequence evolutionary processes using substitution mapping. PLoS One 2012; 7:e33852. [PMID: 22479459 PMCID: PMC3313935 DOI: 10.1371/journal.pone.0033852] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Received: 12/16/2011] [Accepted: 02/22/2012] [Indexed: 12/22/2022]
Abstract
Genes and genomes do not evolve similarly in all branches of the tree of life. Detecting and characterizing the heterogeneity in time, and between lineages, of the nucleotide (or amino acid) substitution process is an important goal of current molecular evolutionary research. This task is typically achieved through the use of non-homogeneous models of sequence evolution, which being highly parametrized and computationally-demanding are not appropriate for large-scale analyses. Here we investigate an alternative methodological option based on probabilistic substitution mapping. The idea is to first reconstruct the substitutional history of each site of an alignment under a homogeneous model of sequence evolution, then to characterize variations in the substitution process across lineages based on substitution counts. Using simulated and published datasets, we demonstrate that probabilistic substitution mapping is robust in that it typically provides accurate reconstruction of sequence ancestry even when the true process is heterogeneous, but a homogeneous model is adopted. Consequently, we show that the new approach is essentially as efficient as and extremely faster than (up to 25 000 times) existing methods, thus paving the way for a systematic survey of substitution process heterogeneity across genes and lineages.
Affiliation(s)
- Jonathan Romiguier
- Institut des Sciences de l'Evolution de Montpellier, CNRS-Université Montpellier 2, Montpellier, France.
18
Dutheil JY, Galtier N, Romiguier J, Douzery EJ, Ranwez V, Boussau B. Efficient Selection of Branch-Specific Models of Sequence Evolution. Mol Biol Evol 2012; 29:1861-74. [DOI: 10.1093/molbev/mss059] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Indexed: 11/13/2022]
19
Choi B, Rempala GA. Inference for discretely observed stochastic kinetic networks with applications to epidemic modeling. Biostatistics 2011; 13:153-65. [PMID: 21835814 DOI: 10.1093/biostatistics/kxr019] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 11/12/2022]
Abstract
We present a new method for Bayesian Markov Chain Monte Carlo-based inference in certain types of stochastic models, suitable for modeling noisy epidemic data. We apply the so-called uniformization representation of a Markov process, in order to efficiently generate appropriate conditional distributions in the Gibbs sampler algorithm. The approach is shown to work well in various data-poor settings, that is, when only partial information about the epidemic process is available, as illustrated on the synthetic data from SIR-type epidemics and the Center for Disease Control and Prevention data from the onset of the H1N1 pandemic in the United States.
Affiliation(s)
- Boseung Choi
- Department of Computer Science and Statistics, Daegu University, Gyeongbuk 712-714, Republic of Korea
20
Rodrigue N, Aris-Brosou S. Fast Bayesian choice of phylogenetic models: prospecting data augmentation-based thermodynamic integration. Syst Biol 2011; 60:881-7. [PMID: 21804092 DOI: 10.1093/sysbio/syr065] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Affiliation(s)
- Nicolas Rodrigue
- Department of Biology and Center for Advanced Research in Environmental Genomics, University of Ottawa, 30 Marie Curie Pvt., Ottawa, ON, Canada
21
Lakner C, Holder MT, Goldman N, Naylor GJP. What's in a Likelihood? Simple Models of Protein Evolution and the Contribution of Structurally Viable Reconstructions to the Likelihood. Syst Biol 2011; 60:161-74. [DOI: 10.1093/sysbio/syq088] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Affiliation(s)
- Clemens Lakner
- Department of Biological Science, Section of Ecology and Evolution
- Department of Scientific Computing, Florida State University, Tallahassee, FL 32306-4120, USA
- Mark T. Holder
- Department of Ecology and Evolution, University of Kansas, 6031 Haworth, 1200 Sunnyside Avenue, Lawrence, KS 66045
- Nick Goldman
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Gavin J. P. Naylor
- Department of Scientific Computing, Florida State University, Tallahassee, FL 32306-4120, USA
22
Lartillot N, Poujol R. A Phylogenetic Model for Investigating Correlated Evolution of Substitution Rates and Continuous Phenotypic Characters. Mol Biol Evol 2010; 28:729-44. [DOI: 10.1093/molbev/msq244] [Citation(s) in RCA: 163] [Impact Index Per Article: 10.9] [Indexed: 11/13/2022]
23
Andrieux D. Thermodynamic large fluctuations from uniformized dynamics. Phys Rev E Stat Nonlin Soft Matter Phys 2010; 82:031124. [PMID: 21230042 DOI: 10.1103/physreve.82.031124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2010] [Revised: 06/07/2010] [Indexed: 05/30/2023]
Abstract
Large fluctuations have received considerable attention as they encode information on the fine-scale dynamics. Large deviation relations known as fluctuation theorems also capture crucial nonequilibrium thermodynamical properties. Here we report that, using the technique of uniformization, the thermodynamic large deviation functions of continuous-time Markov processes can be obtained from Markov chains evolving in discrete time. This formulation offers theoretical and numerical approaches to explore large deviation properties. In particular, the time evolution of autonomous and nonautonomous processes can be expressed in terms of a single Poisson rate. In this way the uniformization procedure leads to a simple and efficient way to simulate stochastic trajectories that reproduce the exact fluxes statistics. We illustrate the formalism for the current fluctuations in a stochastic pump model.
Affiliation(s)
- David Andrieux
- Department of Neurobiology and Kavli Institute for Neuroscience, Yale University School of Medicine, New Haven, Connecticut 06510, USA
24
Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci U S A 2010; 107:4629-34. [PMID: 20176949 DOI: 10.1073/pnas.0910915107] [Citation(s) in RCA: 131] [Impact Index Per Article: 8.7] [Indexed: 11/18/2022]
Abstract
Modeling the interplay between mutation and selection at the molecular level is key to evolutionary studies. To this end, codon-based evolutionary models have been proposed as pertinent means of studying long-range evolutionary patterns and are widely used. However, these approaches have not yet consolidated results from amino acid level phylogenetic studies showing that selection acting on proteins displays strong site-specific effects, which translate into heterogeneous amino acid propensities across the columns of alignments; related codon-level studies have instead focused on either modeling a single selective context for all codon columns, or a separate selective context for each codon column, with the former strategy deemed too simplistic and the latter deemed overparameterized. Here, we integrate recent developments in nonparametric statistical approaches to propose a probabilistic model that accounts for the heterogeneity of amino acid fitness profiles across the coding positions of a gene. We apply the model to a dozen real protein-coding gene alignments and find it to produce biologically plausible inferences, for instance, as pertaining to site-specific amino acid constraints, as well as distributions of scaled selection coefficients. In their account of mutational features as well as the heterogeneous regimes of selection at the amino acid level, the modeling approaches studied here can form a backdrop for several extensions, accounting for other selective features, for variable population size, or for subtleties of mutational features, all with parameterizations couched within population-genetic theory.
25
Zhou Y, Brinkmann H, Rodrigue N, Lartillot N, Philippe H. A Dirichlet Process Covarion Mixture Model and Its Assessments Using Posterior Predictive Discrepancy Tests. Mol Biol Evol 2009; 27:371-84. [DOI: 10.1093/molbev/msp248] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022]
26
de Koning APJ, Gu W, Pollock DD. Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories. Mol Biol Evol 2009; 27:249-65. [PMID: 19783593 DOI: 10.1093/molbev/msp228] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
Likelihood-based approaches can reconstruct evolutionary processes in greater detail and with better precision from larger data sets. The extremely large comparative genomic data sets that are now being generated thus create new opportunities for understanding molecular evolution, but analysis of such large quantities of data poses escalating computational challenges. Recently developed Markov chain Monte Carlo methods that augment substitution histories are a promising approach to alleviate these computational costs. We analyzed the computational costs of several such approaches, considering how they scale with model and data set complexity. This provided a theoretical framework to understand the most important computational bottlenecks, leading us to combine novel variations of our conditional pathway integration approach with recent advances made by others. The resulting technique ("partial sampling" of substitution histories) is considerably faster than all other approaches we considered. It is accurate, simple to implement, and scales exceptionally well with dimensions of model complexity and data set size. In particular, the time complexity of sampling unobserved substitution histories using the new method is much faster than previously existing methods, and model parameter and branch length updates are independent of data set size. We compared the performance of methods on a 224-taxon set of mammalian cytochrome-b sequences. For a simple nucleotide substitution model, partial sampling was at least 10 times faster than the PhyloBayes program, which samples substitutions in continuous time, and about 100 times faster than when using fully integrated substitution histories. Under a general reversible model of amino acid substitution, the partial sampling method was 1,600 times faster than when using fully integrated substitution histories, confirming significantly improved scaling with model state-space complexity. 
Partial sampling of substitutions thus dramatically improves the utility of likelihood approaches for analyzing complex evolutionary processes on large data sets.
Affiliation(s)
- A P Jason de Koning
- Department of Biochemistry and Molecular Genetics, and Consortium for Comparative Genomics, University of Colorado Denver School of Medicine, USA
27
Hobolth A, Stone EA. Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. Ann Appl Stat 2009; 3:1204. [PMID: 20148133 PMCID: PMC2818752 DOI: 10.1214/09-aoas247] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Indexed: 11/19/2022]
Abstract
Analyses of serially-sampled data often begin with the assumption that the observations represent discrete samples from a latent continuous-time stochastic process. The continuous-time Markov chain (CTMC) is one such generative model whose popularity extends to a variety of disciplines ranging from computational finance to human genetics and genomics. A common theme among these diverse applications is the need to simulate sample paths of a CTMC conditional on realized data that is discretely observed. Here we present a general solution to this sampling problem when the CTMC is defined on a discrete and finite state space. Specifically, we consider the generation of sample paths, including intermediate states and times of transition, from a CTMC whose beginning and ending states are known across a time interval of length T. We first unify the literature through a discussion of the three predominant approaches: (1) modified rejection sampling, (2) direct sampling, and (3) uniformization. We then give analytical results for the complexity and efficiency of each method in terms of the instantaneous transition rate matrix Q of the CTMC, its beginning and ending states, and the length of sampling time T. In doing so, we show that no method dominates the others across all model specifications, and we give explicit proof of which method prevails for any given Q, T, and endpoints. Finally, we introduce and compare three applications of CTMCs to demonstrate the pitfalls of choosing an inefficient sampler.
Affiliation(s)
- Asger Hobolth
- Department of Mathematical Sciences, Aarhus University, Denmark
- Eric A. Stone
- Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA
28
Minin VN, Suchard MA. Fast, accurate and simulation-free stochastic mapping. Philos Trans R Soc Lond B Biol Sci 2009; 363:3985-95. [PMID: 18852111 DOI: 10.1098/rstb.2008.0176] [Citation(s) in RCA: 120] [Impact Index Per Article: 7.5] [Indexed: 11/12/2022]
Abstract
Mapping evolutionary trajectories of discrete traits onto phylogenies receives considerable attention in evolutionary biology. Given the trait observations at the tips of a phylogenetic tree, researchers are often interested in where on the tree the trait changes its state and whether some changes are preferential in certain parts of the tree. In a model-based phylogenetic framework, such questions translate into characterizing probabilistic properties of evolutionary trajectories. Current methods of assessing these properties rely on computationally expensive simulations. In this paper, we present an efficient, simulation-free algorithm for computing two important and ubiquitous evolutionary trajectory properties. The first is the mean number of trait changes, where changes can be divided into classes of interest (e.g. synonymous/non-synonymous mutations). The mean evolutionary reward, accrued proportionally to the time a trait occupies each of its states, is the second property. To illustrate the usefulness of our results, we first employ our simulation-free stochastic mapping to execute a posterior predictive test of correlation between two evolutionary traits. We conclude by mapping synonymous and non-synonymous mutations onto branches of an HIV intrahost phylogenetic tree and comparing selection pressure on terminal and internal tree branches.
Affiliation(s)
- Vladimir N Minin
- Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA.
29
Rodrigue N, Kleinman CL, Philippe H, Lartillot N. Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 2009; 26:1663-76. [DOI: 10.1093/molbev/msp078] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022]
30
Abstract
Probabilistic models of sequence evolution are in widespread use in phylogenetics and molecular sequence evolution. These models have become increasingly sophisticated and combined with statistical model comparison techniques have helped to shed light on how genes and proteins evolve. Models of codon evolution have been particularly useful, because, in addition to providing a significant improvement in model realism for protein-coding sequences, codon models can also be designed to test hypotheses about the selective pressures that shape the evolution of the sequences. Such models typically assume a phylogeny and can be used to identify sites or lineages that have evolved adaptively. Recently some of the key assumptions that underlie phylogenetic tests of selection have been questioned, such as the assumption that the rate of synonymous changes is constant across sites or that a single phylogenetic tree can be assumed at all sites for recombining sequences. While some of these issues have been addressed through the development of novel methods, others remain as caveats that need to be considered on a case-by-case basis. Here, we outline the theory of codon models and their application to the detection of positive selection. We review some of the more recent developments that have improved their power and utility, laying a foundation for further advances in the modeling of coding sequence evolution.
Affiliation(s)
- Wayne Delport
- University of Cape Town, Observatory, 7925, Cape Town, South Africa
|
31
|
Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 2008; 26:255-71. [PMID: 18922761 DOI: 10.1093/molbev/msn232] [Citation(s) in RCA: 104] [Impact Index Per Article: 6.1] [Indexed: 11/12/2022]
Abstract
This review is motivated by the true explosion in the number of recent studies both developing and ameliorating probabilistic models of codon evolution. Traditionally parametric, the first codon models focused on estimating the effects of selective pressure on the protein via an explicit parameter in the maximum likelihood framework. Likelihood ratio tests of nested codon models armed the biologists with powerful tools, which provided unambiguous evidence for positive selection in real data. This, in turn, triggered a new wave of methodological developments. The new generation of models views the codon evolution process in a more sophisticated way, relaxing several mathematical assumptions. These models make a greater use of physicochemical amino acid properties, genetic code machinery, and the large amounts of data from the public domain. The overview of the most recent advances on modeling codon evolution is presented here, and a wide range of their applications to real data is discussed. On the downside, availability of a large variety of models, each accounting for various biological factors, increases the margin for misinterpretation; the biological meaning of certain parameters may vary among models, and model selection procedures also deserve greater attention. Solid understanding of the modeling assumptions and their applicability is essential for successful statistical data analysis.
Affiliation(s)
- Maria Anisimova
- Institute of Computational Science, Swiss Federal Institute of Technology, Zurich, Switzerland.
|