1. Silvestro D, Latrille T, Salamin N. Toward a Semi-Supervised Learning Approach to Phylogenetic Estimation. Syst Biol 2024; 73:789-806. [PMID: 38916476 PMCID: PMC11639169 DOI: 10.1093/sysbio/syae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/21/2024] [Accepted: 06/24/2024] [Indexed: 06/26/2024] Open
Abstract
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep-learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
Affiliation(s)
- Daniele Silvestro
  - Department of Biology, University of Fribourg and Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
  - Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, 40530 Gothenburg, Sweden
- Thibault Latrille
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Nicolas Salamin
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
2. Goyal R, Carnegie N, Slipher S, Turk P, Little SJ, De Gruttola V. Estimating contact network properties by integrating multiple data sources associated with infectious diseases. Stat Med 2023; 42:3593-3615. [PMID: 37392149 PMCID: PMC10825904 DOI: 10.1002/sim.9816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Revised: 05/09/2023] [Accepted: 05/19/2023] [Indexed: 07/03/2023]
Abstract
To effectively mitigate the spread of communicable diseases, it is necessary to understand the interactions that enable disease transmission among individuals in a population; we refer to the set of these interactions as a contact network. The structure of the contact network can have profound effects on both the spread of infectious diseases and the effectiveness of control programs. Therefore, understanding the contact network permits more efficient use of resources. Measuring the structure of the network, however, is a challenging problem. We present a Bayesian approach to integrate multiple data sources associated with the transmission of infectious diseases to more precisely and accurately estimate important properties of the contact network. An important aspect of the approach is the use of the congruence class models for networks. We conduct simulation studies modeling pathogens resembling SARS-CoV-2 and HIV to assess the method; subsequently, we apply our approach to HIV data from the University of California San Diego Primary Infection Resource Consortium. Based on simulation studies, we demonstrate that the integration of epidemiological and viral genetic data with risk behavior survey data can lead to large decreases in mean squared error (MSE) in contact network estimates compared to estimates based strictly on risk behavior information. This decrease in MSE is present even in settings where the risk behavior surveys contain measurement error. Through these simulations, we also highlight certain settings where the approach does not improve MSE.
Affiliation(s)
- Ravi Goyal
  - Division of Infectious Diseases and Global Public, University of California San Diego, San Diego, California, USA
- Sally Slipher
  - Department of Mathematical Sciences, Montana State University, Bozeman, Montana, USA
- Philip Turk
  - Department of Data Science, University of Mississippi Medical Center, Jackson, Mississippi, USA
- Susan J Little
  - Division of Infectious Diseases and Global Public, University of California San Diego, La Jolla, California, USA
- Victor De Gruttola
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
3. Gupta MK, Vadde R. Next-generation development and application of codon model in evolution. Front Genet 2023; 14:1091575. [PMID: 36777719 PMCID: PMC9911445 DOI: 10.3389/fgene.2023.1091575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 01/17/2023] [Indexed: 01/28/2023] Open
Abstract
To date, numerous nucleotide, amino acid, and codon substitution models have been developed to estimate the evolutionary history of a sequence or organism more comprehensively. Of these three, the codon substitution model is the most powerful. Codon models have been used extensively to detect selective pressure on proteins and codon usage bias, and for ancestral and phylogenetic reconstruction. However, because they are more computationally demanding than nucleotide and amino acid substitution models, only a few studies have employed codon substitution models to understand the heterogeneity of the evolutionary process in genome-scale analyses. Hence, there is a standing question of how to develop more robust but less computationally demanding codon substitution models that yield more accurate results. In this review article, the authors examine the basis on which different types of codon substitution models were developed and how this information can be used to develop more robust but less computationally demanding models. The codon substitution model makes it possible to detect the selection regime under which a gene or gene region is evolving, codon usage bias in an organism or tissue-specific region, and the phylogenetic relationships between lineages more accurately than nucleotide and amino acid substitution models. Thus, in the near future, codon models could be applied in conservation, breeding, and medicine.
4. Hufsky F, Abecasis A, Agudelo-Romero P, Bletsa M, Brown K, Claus C, Deinhardt-Emmer S, Deng L, Friedel CC, Gismondi MI, Kostaki EG, Kühnert D, Kulkarni-Kale U, Metzner KJ, Meyer IM, Miozzi L, Nishimura L, Paraskevopoulou S, Pérez-Cataluña A, Rahlff J, Thomson E, Tumescheit C, van der Hoek L, Van Espen L, Vandamme AM, Zaheri M, Zuckerman N, Marz M. Women in the European Virus Bioinformatics Center. Viruses 2022; 14:1522. [PMID: 35891501 PMCID: PMC9319252 DOI: 10.3390/v14071522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 07/05/2022] [Accepted: 07/07/2022] [Indexed: 02/01/2023] Open
Abstract
Viruses are the cause of a considerable burden to human, animal and plant health, while on the other hand playing an important role in regulating entire ecosystems. The power of new sequencing technologies combined with new tools for processing "Big Data" offers unprecedented opportunities to answer fundamental questions in virology. Virologists have an urgent need for virus-specific bioinformatics tools. These developments have led to the formation of the European Virus Bioinformatics Center, a network of experts in virology and bioinformatics who are joining forces to enable extensive exchange and collaboration between these research areas. The EVBC strives to provide talented researchers with a supportive environment free of gender bias, but the gender gap in science, especially in math-intensive fields such as computer science, persists. To bring more talented women into research and keep them there, we need to highlight role models to spark their interest, and we need to ensure that female scientists are not kept at lower levels but are given the opportunity to lead the field. Here we showcase the work of the EVBC and highlight the achievements of some outstanding women experts in virology and viral bioinformatics.
Affiliation(s)
- Franziska Hufsky
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
- Ana Abecasis
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Global Health and Tropical Medicine, Institute of Hygiene and Tropical Medicine, New University of Lisbon, 1349-008 Lisbon, Portugal
- Patricia Agudelo-Romero
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Wal-Yan Respiratory Research Centre, Telethon Kids Institute, University of Western Australia, Nedlands, WA 6009, Australia
- Magda Bletsa
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, 115 27 Athens, Greece
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
- Katherine Brown
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Division of Virology, Department of Pathology, University of Cambridge, Cambridge CB2 1TN, UK
- Claudia Claus
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Medical Microbiology and Virology, Medical Faculty, Leipzig University, 04103 Leipzig, Germany
- Stefanie Deinhardt-Emmer
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Medical Microbiology, Jena University Hospital, 07747 Jena, Germany
- Li Deng
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Virology, Helmholtz Centre Munich-German Research Center for Environmental Health, 85764 Neuherberg, Germany
  - Microbial Disease Prevention, School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Caroline C. Friedel
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Informatics, Ludwig-Maximilians-Universität München, 80333 Munich, Germany
- María Inés Gismondi
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Agrobiotechnology and Molecular Biology (IABIMO), National Institute for Agriculture Technology (INTA), National Research Council (CONICET), Hurlingham B1686IGC, Argentina
  - Department of Basic Sciences, National University of Luján, Luján B6702MZP, Argentina
- Evangelia Georgia Kostaki
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, 115 27 Athens, Greece
- Denise Kühnert
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Transmission, Infection, Diversification and Evolution Group, Max Planck Institute for the Science of Human History, 07745 Jena, Germany
- Urmila Kulkarni-Kale
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Bioinformatics Centre, Savitribai Phule Pune University, Pune 411007, India
- Karin J. Metzner
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, 8091 Zurich, Switzerland
  - Institute of Medical Virology, University of Zurich, 8057 Zurich, Switzerland
- Irmtraud M. Meyer
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 10115 Berlin, Germany
  - Institute of Chemistry and Biochemistry, Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, 14195 Berlin, Germany
  - Faculty of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
- Laura Miozzi
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute for Sustainable Plant Protection, National Research Council of Italy, 10135 Torino, Italy
- Luca Nishimura
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima 411-8540, Japan
  - Human Genetics Laboratory, National Institute of Genetics, Mishima 411-8540, Japan
- Sofia Paraskevopoulou
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Methods Development and Research Infrastructure, Bioinformatics and Systems Biology, Robert Koch Institute, 13353 Berlin, Germany
- Alba Pérez-Cataluña
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - VISAFELab, Department of Preservation and Food Safety Technologies, Institute of Agrochemistry and Food Technology, IATA-CSIC, 46980 Valencia, Spain
- Janina Rahlff
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Centre for Ecology and Evolution in Microbial Model Systems (EEMiS), Department of Biology and Environmental Science, Linnaeus University, 391 82 Kalmar, Sweden
- Emma Thomson
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Queen Elizabeth University Hospital, NHS Greater Glasgow and Clyde, Glasgow G51 4TF, UK
  - MRC-University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK
- Charlotte Tumescheit
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - School of Biological Sciences, Seoul National University, Seoul 08826, Korea
- Lia van der Hoek
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Laboratory of Experimental Virology, Department of Medical Microbiology and Infection Prevention, Amsterdam UMC, University of Amsterdam, 1012 WX Amsterdam, The Netherlands
  - Amsterdam Institute for Infection and Immunity, 1100 DD Amsterdam, The Netherlands
- Lore Van Espen
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
- Anne-Mieke Vandamme
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
  - Global Health and Tropical Medicine, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, 1349-008 Lisbon, Portugal
  - Institute for the Future, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
- Maryam Zaheri
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Institute of Medical Virology, University of Zurich, 8057 Zurich, Switzerland
- Neta Zuckerman
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - Central Virology Laboratory, Public Health Services, Ministry of Health and Sheba Medical Center, Ramat Gan 52621, Israel
- Manja Marz
  - European Virus Bioinformatics Center, 07743 Jena, Germany
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, 07743 Jena, Germany
5. Sahm A, Koch P, Horvath S, Hoffmann S. An analysis of methylome evolution in primates. Mol Biol Evol 2021; 38:4700-4714. [PMID: 34175932 PMCID: PMC8557466 DOI: 10.1093/molbev/msab189] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022] Open
Abstract
Although the investigation of the epigenome is becoming increasingly important, little is known about the long-term evolution of epigenetic marks, and systematic investigation strategies are still lacking. Here, we systematically demonstrate the transfer of classic phylogenetic methods, such as maximum likelihood based on substitution models, parsimony, and distance-based methods, to interval-scaled epigenetic data. Using a great apes blood data set, we demonstrate that DNA methylation is evolutionarily conserved at the level of individual CpGs in promoters, enhancers, and genic regions. Our analysis also reveals that this epigenomic conservation is significantly correlated with its transcription factor binding density. Binding sites for transcription factors involved in neuron differentiation and components of AP-1 evolve at a significantly higher rate at the methylation level than at the nucleotide level. Moreover, our models suggest an accelerated epigenomic evolution at binding sites of BRCA1, chromobox homolog protein 2, and factors of the polycomb repressor 2 complex in humans. For most genomic regions, the methylation-based reconstruction of phylogenetic trees is on par with sequence-based reconstruction. Most strikingly, phylogenetic reconstruction using methylation rates in enhancer regions was ineffective regardless of the chosen model. We identify a set of phylogenetically uninformative CpG sites enriched in enhancers controlling immune-related genes.
Affiliation(s)
- Arne Sahm
  - Computational Biology Group, Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany
- Philipp Koch
  - Core Facility Life Science Computing, Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany
- Steve Horvath
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA
- Steve Hoffmann
  - Computational Biology Group, Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany
6. Extra base hits: Widespread empirical support for instantaneous multiple-nucleotide changes. PLoS One 2021; 16:e0248337. [PMID: 33711070 PMCID: PMC7954308 DOI: 10.1371/journal.pone.0248337] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/08/2020] [Accepted: 02/24/2021] [Indexed: 01/03/2023] Open
Abstract
Despite many attempts to introduce evolutionary models that permit substitutions to instantly alter more than one nucleotide in a codon, the prevailing wisdom remains that such changes are rare and generally negligible or are reflective of non-biological artifacts, such as alignment errors. Codon models continue to posit that only single-nucleotide changes have non-zero rates. Here, we develop and test a simple hierarchy of codon-substitution models with non-zero evolutionary rates for only one-nucleotide (1H), one- and two-nucleotide (2H), or any (3H) codon substitutions. Using over 42,000 empirical alignments, we find widespread statistical support for multiple hits: 61% of alignments prefer models with 2H allowed, and 23% prefer models with 3H allowed. Analyses of simulated data suggest that these results are not likely to be due to simple artifacts such as model misspecification or alignment errors. Further modeling reveals that synonymous codon island jumping among codons encoding serine, especially along short branches, contributes significantly to this 3H signal. While serine codons were prominently involved in multiple-hit substitutions, there were other common exchanges contributing to better model fit. It appears that a small subset of sites in most alignments have unusual evolutionary dynamics not well explained by existing model formalisms, and that commonly estimated quantities, such as dN/dS ratios, may be biased by model misspecification. Our findings highlight the need for continued evaluation of assumptions underlying workhorse evolutionary models and subsequent evolutionary inference techniques. We provide a software implementation for evolutionary biologists to assess the potential impact of extra base hits in their data in the HyPhy package and in the Datamonkey.org server.
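The 1H/2H/3H hierarchy described in this abstract can be illustrated with a minimal Python sketch. It is not the HyPhy implementation: the function names (`n_hits`, `has_nonzero_rate`, `count_pairs_by_hits`) are hypothetical, the sketch only records which codon pairs are permitted a non-zero rate under each model and says nothing about the rate values themselves, and the pair count enumerates all 64 triplets, including stop codons that real codon models would exclude.

```python
from itertools import product

BASES = "ACGT"

# Maximum number of nucleotides that may change in one instantaneous
# substitution under each model of the hierarchy (labels from the abstract).
MAX_HITS = {"1H": 1, "2H": 2, "3H": 3}


def n_hits(codon_a, codon_b):
    """Return the number of positions (0 to 3) at which two codons differ."""
    if len(codon_a) != 3 or len(codon_b) != 3:
        raise ValueError("codons must be exactly three nucleotides long")
    return sum(a != b for a, b in zip(codon_a, codon_b))


def has_nonzero_rate(codon_a, codon_b, model):
    """True if the model permits an instantaneous change between the codons."""
    hits = n_hits(codon_a, codon_b)
    return 1 <= hits <= MAX_HITS[model]


def count_pairs_by_hits():
    """Count ordered pairs of distinct triplets by number of differing sites.

    Assumption: all 64 triplets are enumerated, stop codons included.
    """
    codons = ["".join(c) for c in product(BASES, repeat=3)]
    counts = {1: 0, 2: 0, 3: 0}
    for a in codons:
        for b in codons:
            hits = n_hits(a, b)
            if hits:
                counts[hits] += 1
    return counts


if __name__ == "__main__":
    # AGT and TCA both encode serine but differ at all three positions,
    # so a direct change between them is only possible under 3H.
    for model in ("1H", "2H", "3H"):
        print(model, has_nonzero_rate("AGT", "TCA", model))
    print(count_pairs_by_hits())
```

The serine pair in the usage example is the kind of exchange the abstract refers to as synonymous codon island jumping: the two serine codon families (TCN and AGT/AGC) cannot be connected by a single-nucleotide synonymous change.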
7. Jones CT, Youssef N, Susko E, Bielawski JP. A Phenotype-Genotype Codon Model for Detecting Adaptive Evolution. Syst Biol 2021; 69:722-738. [PMID: 31730199 DOI: 10.1093/sysbio/syz075] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/07/2019] [Revised: 11/09/2019] [Accepted: 11/11/2019] [Indexed: 01/03/2023] Open
Abstract
A central objective in biology is to link adaptive evolution in a gene to structural and/or functional phenotypic novelties. Yet most analytic methods make inferences mainly from either phenotypic data or genetic data alone. A small number of models have been developed to infer correlations between the rate of molecular evolution and changes in a discrete or continuous life history trait. But such correlations are not necessarily evidence of adaptation. Here, we present a novel approach called the phenotype-genotype branch-site model (PG-BSM) designed to detect evidence of adaptive codon evolution associated with discrete-state phenotype evolution. An episode of adaptation is inferred under standard codon substitution models when there is evidence of positive selection in the form of an elevation in the nonsynonymous-to-synonymous rate ratio $\omega$ to a value $\omega > 1$. As it is becoming increasingly clear that $\omega > 1$ can occur without adaptation, the PG-BSM was formulated to infer an instance of adaptive evolution without appealing to evidence of positive selection. The null model makes use of a covarion-like component to account for general heterotachy (i.e., random changes in the evolutionary rate at a site over time). The alternative model employs samples of the phenotypic evolutionary history to test for phenomenological patterns of heterotachy consistent with specific mechanisms of molecular adaptation. These include 1) a persistent increase/decrease in $\omega$ at a site following a change in phenotype (the pattern) consistent with an increase/decrease in the functional importance of the site (the mechanism); and 2) a transient increase in $\omega$ at a site along a branch over which the phenotype changed (the pattern) consistent with a change in the site's optimal amino acid (the mechanism). 
Rejection of the null is followed by post hoc analyses to identify sites with strongest evidence for adaptation in association with changes in the phenotype as well as the most likely evolutionary history of the phenotype. Simulation studies based on a novel method for generating mechanistically realistic signatures of molecular adaptation show that the PG-BSM has good statistical properties. Analyses of real alignments show that site patterns identified post hoc are consistent with the specific mechanisms of adaptation included in the alternate model. Further simulation studies show that the covarion-like component of the PG-BSM plays a crucial role in mitigating recently discovered statistical pathologies associated with confounding by accounting for heterotachy-by-any-cause. [Adaptive evolution; branch-site model; confounding; mutation-selection; phenotype-genotype.].
Affiliation(s)
- Christopher T Jones
  - Department of Mathematics and Statistics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
- Noor Youssef
  - Department of Biology, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
- Joseph P Bielawski
  - Department of Mathematics and Statistics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
  - Department of Biology, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
Collapse
|
8
|
Picard L, Ganivet Q, Allatif O, Cimarelli A, Guéguen L, Etienne L. DGINN, an automated and highly-flexible pipeline for the detection of genetic innovations on protein-coding genes. Nucleic Acids Res 2020; 48:e103. [PMID: 32941639 PMCID: PMC7544217 DOI: 10.1093/nar/gkaa680] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 02/26/2020] [Revised: 06/29/2020] [Accepted: 09/04/2020] [Indexed: 12/13/2022]
Abstract
Adaptive evolution has shaped major biological processes. Finding the protein-coding genes and the sites that have been subjected to adaptation during evolutionary time is a major endeavor. However, very few methods fully automate the identification of positively selected genes, and widespread sources of genetic innovations such as gene duplication and recombination are absent from most pipelines. Here, we developed DGINN, a highly-flexible and public pipeline to Detect Genetic INNovations and adaptive evolution in protein-coding genes. DGINN automates, from a gene's sequence, all steps of the evolutionary analyses necessary to detect the aforementioned innovations, including the search for homologs in databases, assignation of orthology groups, identification of duplication and recombination events, as well as detection of positive selection using five methods to increase precision and ranking of genes when a large panel is analyzed. DGINN was validated on nineteen genes with previously-characterized evolutionary histories in primates, including some engaged in host-pathogen arms-races. Our results confirm and also expand results from the literature, including novel findings on the Guanylate-binding protein family, GBPs. This establishes DGINN as an efficient tool to automatically detect genetic innovations and adaptive evolution in diverse datasets, from the user's gene of interest to a large gene list in any species range.
Affiliation(s)
- Lea Picard
- CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Inserm U1111, Université Claude Bernard Lyon 1, CNRS UMR5308, ENS de Lyon, Lyon, France
- Laboratoire de Biologie et Biométrie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, Villeurbanne, France
- Quentin Ganivet
- Laboratoire de Biologie et Biométrie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, Villeurbanne, France
- Omran Allatif
- CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Inserm U1111, Université Claude Bernard Lyon 1, CNRS UMR5308, ENS de Lyon, Lyon, France
- Andrea Cimarelli
- CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Inserm U1111, Université Claude Bernard Lyon 1, CNRS UMR5308, ENS de Lyon, Lyon, France
- Laurent Guéguen
- Laboratoire de Biologie et Biométrie Evolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1, Villeurbanne, France
- Swedish Collegium for Advanced Study, Uppsala, Sweden
- Lucie Etienne
- CIRI - Centre International de Recherche en Infectiologie, Univ Lyon, Inserm U1111, Université Claude Bernard Lyon 1, CNRS UMR5308, ENS de Lyon, Lyon, France
9
Wisotsky SR, Kosakovsky Pond SL, Shank SD, Muse SV. Synonymous Site-to-Site Substitution Rate Variation Dramatically Inflates False Positive Rates of Selection Analyses: Ignore at Your Own Peril. Mol Biol Evol 2020; 37:2430-2439. [PMID: 32068869 PMCID: PMC7403620 DOI: 10.1093/molbev/msaa037] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Indexed: 12/11/2022]
Abstract
Most molecular evolutionary studies of natural selection maintain the decades-old assumption that synonymous substitution rate variation (SRV) across sites within genes occurs at levels that are either nonexistent or negligible. However, numerous studies challenge this assumption from a biological perspective and show that SRV is comparable in magnitude to that of nonsynonymous substitution rate variation. We evaluated the impact of this assumption on methods for inferring selection at the molecular level by incorporating SRV into an existing method (BUSTED) for detecting signatures of episodic diversifying selection in genes. Using simulated data we found that failing to account for even moderate levels of SRV in selection testing is likely to produce intolerably high false positive rates. To evaluate the effect of the SRV assumption on actual inferences we compared results of tests with and without the assumption in an empirical analysis of over 13,000 Euteleostomi (bony vertebrate) gene alignments from the Selectome database. This exercise reveals that close to 50% of positive results (i.e., evidence for selection) in empirical analyses disappear when SRV is modeled as part of the statistical analysis and are thus candidates for being false positives. The results from this work add to a growing literature establishing that tests of selection are much more sensitive to certain model assumptions than previously believed.
Affiliation(s)
- Sadie R Wisotsky
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA
- Stephen D Shank
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA
- Spencer V Muse
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC
- Department of Statistics, North Carolina State University, Raleigh, NC
10
Moshiri N, Ragonnet-Cronin M, Wertheim JO, Mirarab S. FAVITES: simultaneous simulation of transmission networks, phylogenetic trees and sequences. Bioinformatics 2020; 35:1852-1861. [PMID: 30395173 DOI: 10.1093/bioinformatics/bty921] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 06/15/2018] [Revised: 10/29/2018] [Accepted: 11/01/2018] [Indexed: 11/14/2022]
Abstract
MOTIVATION: The ability to simulate epidemics as a function of model parameters allows insights that are unobtainable from real datasets. Further, reconstructing transmission networks for fast-evolving viruses like Human Immunodeficiency Virus (HIV) may have the potential to greatly enhance epidemic intervention, but transmission network reconstruction methods have been inadequately studied, largely because it is difficult to obtain 'truth' sets on which to test them and properly measure their performance.
RESULTS: We introduce FrAmework for VIral Transmission and Evolution Simulation (FAVITES), a robust framework for simulating realistic datasets for epidemics that are caused by fast-evolving pathogens like HIV. FAVITES creates a generative model to produce contact networks, transmission networks, phylogenetic trees and sequence datasets, and to add error to the data. FAVITES is designed to be extensible by dividing the generative model into modules, each of which is expressed as a fixed API that can be implemented using various models. We use FAVITES to simulate HIV datasets and study the realism of the simulated datasets. We then use the simulated data to study the impact of the increased treatment efforts on epidemiological outcomes. We also study two transmission network reconstruction methods and their effectiveness in detecting fast-growing clusters.
AVAILABILITY AND IMPLEMENTATION: FAVITES is available at https://github.com/niemasd/FAVITES, and a Docker image can be found on DockerHub (https://hub.docker.com/r/niemasd/favites).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Niema Moshiri
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, La Jolla, USA
- Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, La Jolla, USA
11
Jones CT, Youssef N, Susko E, Bielawski JP. Phenomenological Load on Model Parameters Can Lead to False Biological Conclusions. Mol Biol Evol 2019; 35:1473-1488. [PMID: 29596684 DOI: 10.1093/molbev/msy049] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 12/15/2022]
Abstract
When a substitution model is fitted to an alignment using maximum likelihood, its parameters are adjusted to account for as much site-pattern variation as possible. A parameter might therefore absorb a substantial quantity of the total variance in an alignment (or more formally, bring about a substantial reduction in the deviance of the fitted model) even if the process it represents played no role in the generation of the data. When this occurs, we say that the parameter estimate carries phenomenological load (PL). Large PL in a parameter estimate is a concern because it not only invalidates its mechanistic interpretation (if it has one) but also increases the likelihood that it will be found to be statistically significant. The problem of PL was not identified in the past because most off-the-shelf substitution models make simplifying assumptions that preclude the generation of realistic levels of variation. In this study, we use the more realistic mutation-selection framework as the basis of a generating model formulated to produce data that mimic an alignment of mammalian mitochondrial DNA. We show that a parameter estimate can carry PL when 1) the substitution model is underspecified and 2) the parameter represents a process that is confounded with other processes represented in the data-generating model. We then provide a method that can be used to identify signal for the process that a given parameter represents despite the existence of PL.
Affiliation(s)
- Christopher T Jones
- Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
- Noor Youssef
- Department of Biology, Dalhousie University, Halifax, NS, Canada
- Edward Susko
- Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
12
Dunn KA, Kenney T, Gu H, Bielawski JP. Improved inference of site-specific positive selection under a generalized parametric codon model when there are multinucleotide mutations and multiple nonsynonymous rates. BMC Evol Biol 2019; 19:22. [PMID: 30642241 PMCID: PMC6332903 DOI: 10.1186/s12862-018-1326-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/08/2018] [Accepted: 12/11/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: An excess of nonsynonymous substitutions, over neutrality, is considered evidence of positive Darwinian selection. Inference for proteins often relies on estimation of the nonsynonymous to synonymous ratio (ω = dN/dS) within a codon model. However, to ease computational difficulties, ω is typically estimated assuming an idealized substitution process where (i) all nonsynonymous substitutions have the same rate (regardless of impact on organism fitness) and (ii) instantaneous double and triple (DT) nucleotide mutations have zero probability (despite evidence that they can occur). It follows that estimates of ω represent an imperfect summary of the intensity of selection, and that tests based on the ω > 1 threshold could be negatively impacted.
RESULTS: We developed a general-purpose parametric (GPP) modelling framework for codons. This novel approach allows specification of all possible instantaneous codon substitutions, including multiple nonsynonymous rates (MNRs) and instantaneous DT nucleotide changes. Existing codon models are specified as special cases of the GPP model. We use GPP models to implement likelihood ratio tests (LRTs) for ω > 1 that accommodate MNRs and DT mutations. Through both simulation and real data analysis, we find that failure to model MNRs and DT mutations reduces power in some cases and inflates false positives in others. False positives under traditional M2a and M8 models were very sensitive to DT changes. This was exacerbated by the choice of frequency parameterization (GY vs. MG), with false-positive rates sometimes > 90% under MG. By including MNRs and DT mutations, accuracy and power were greatly improved under the GPP framework. However, we also find that over-parameterized models can perform less well, and this can contribute to degraded performance of LRTs.
CONCLUSIONS: We suggest GPP models should be used alongside traditional codon models. Further, all codon models should be deployed within an experimental design that includes (i) assessing robustness to model assumptions, and (ii) investigation of non-standard behaviour of MLEs. As the goal of every analysis is to avoid false conclusions, more work is needed on model selection methods that consider both the increase in fit engendered by a model parameter and the degree to which that parameter is affected by un-modelled evolutionary processes.
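The likelihood ratio tests for ω > 1 mentioned in this abstract compare the maximized log-likelihoods of a null and an alternative codon model. The following is a minimal sketch of that arithmetic only, not the authors' implementation: the function name `lrt_p_value` and the log-likelihood values are hypothetical, and the chi-square reference distribution is an approximation, since tests of ω > 1 involve boundary constraints and, as the abstract notes, LRT behaviour can degrade under over-parameterized models.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float, df: int) -> tuple:
    """Return (test statistic, approximate p-value) for a nested-model LRT.

    The statistic is 2 * (lnL_alt - lnL_null). The p-value uses the
    closed-form chi-square survival function, available here only for
    df = 1 or df = 2 (an assumption made to avoid external libraries).
    """
    # Numerical optimisation can leave the statistic slightly negative.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        p_value = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only supports df = 1 or df = 2")
    return stat, p_value


# Hypothetical log-likelihoods for a null model (no sites with omega > 1)
# and an alternative with two extra free parameters.
stat, p_value = lrt_p_value(lnl_null=-1234.5, lnl_alt=-1230.0, df=2)
print(f"2*delta lnL = {stat:.2f}, approximate p = {p_value:.4f}")
```

A small p-value here only says the alternative fits better; whether that fit reflects positive selection or un-modelled processes is exactly the concern the abstract raises.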
Affiliation(s)
- Katherine A. Dunn
- Department of Biology, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Toby Kenney
- Department of Mathematics & Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Hong Gu
- Department of Mathematics & Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Joseph P. Bielawski
- Department of Biology, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Department of Mathematics & Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Centre for Comparative Genomics and Evolutionary Bioinformatics (CGEB) at Dalhousie University, Halifax, Canada
13
Looking for Darwin in Genomic Sequences: Validity and Success Depends on the Relationship Between Model and Data. Methods Mol Biol 2019; 1910:399-426. [PMID: 31278672 DOI: 10.1007/978-1-4939-9074-0_13] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/01/2023]
Abstract
Codon substitution models (CSMs) are commonly used to infer the history of natural selection for a set of protein-coding sequences, often with the explicit goal of detecting the signature of positive Darwinian selection. However, the validity and success of CSMs used in conjunction with the maximum likelihood (ML) framework are sometimes challenged with claims that the approach might too often support false conclusions. In this chapter, we use a case study approach to identify four legitimate statistical difficulties associated with inference of evolutionary events using CSMs. These include: (1) model misspecification, (2) low information content, (3) the confounding of processes, and (4) phenomenological load, or PL. While past criticisms of CSMs can be connected to these issues, the historical critiques were often misdirected, or overstated, because they failed to recognize that the success of any model-based approach depends on the relationship between model and data. Here, we explore this relationship and provide a candid assessment of the limitations of CSMs to extract historical information from extant sequences. To aid in this assessment, we provide a brief overview of: (1) a more realistic way of thinking about the process of codon evolution framed in terms of population genetic parameters, and (2) a novel presentation of the ML statistical framework. We then divide the development of CSMs into two broad phases of scientific activity and show that the latter phase is characterized by increases in model complexity that can sometimes negatively impact inference of evolutionary mechanisms. Such problems are not yet widely appreciated by the users of CSMs. These problems can be avoided by using a model that is appropriate for the data; but understanding the relationship between the data and a fitted model is a difficult task. We argue that the only way to properly understand that relationship is to perform in silico experiments using a generating process that can mimic the data as closely as possible. The mutation-selection modeling framework (MutSel) is presented as the basis of such a generating process. We contend that if complex CSMs continue to be developed for testing explicit mechanistic hypotheses, then additional analyses such as those described here (e.g., penalized LRTs and estimation of PL) will need to be applied alongside the more traditional inferential methods.
14
Ashkenazy H, Sela I, Levy Karin E, Landan G, Pupko T. Multiple Sequence Alignment Averaging Improves Phylogeny Reconstruction. Syst Biol 2019; 68:117-130. [PMID: 29771363 PMCID: PMC6657586 DOI: 10.1093/sysbio/syy036] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/12/2017] [Revised: 05/07/2018] [Accepted: 05/09/2018] [Indexed: 01/11/2023]
Abstract
The classic methodology of inferring a phylogenetic tree from sequence data is composed of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet, inferred MSAs were shown to be inaccurate and alignment errors reduce tree inference accuracy. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work we explore an approach, in which instead of relying on a single MSA, we generate a large set of alternative MSAs and concatenate them into a single SuperMSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA computed by alignment algorithms. Using simulations, we demonstrate that this approach results, on average, in more accurate trees compared to 1) using an unfiltered MSA and 2) using a single MSA with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether or not the extra effort of computing a SuperMSA and inferring a tree from it is beneficial. Based on these assessments, we expect our methodology to be useful for many cases in which diverged sequences are analyzed. The option to generate such a SuperMSA is available at http://guidance.tau.ac.il.
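The concatenation step described in this abstract can be illustrated with a short sketch. It assumes each alternative MSA is held as a dictionary mapping a taxon name to its aligned row; the function name `concatenate_msas` and the toy alignments are hypothetical, and the sketch does not reproduce how the GUIDANCE server generates or selects the alternative alignments.

```python
from typing import Dict, List


def concatenate_msas(msas: List[Dict[str, str]]) -> Dict[str, str]:
    """Join alternative alignments of the same sequences into one SuperMSA.

    Each alignment contributes all of its columns, so the result has as
    many columns as the alternatives have in total.
    """
    if not msas:
        raise ValueError("need at least one alignment")
    taxa = sorted(msas[0])
    for msa in msas:
        if sorted(msa) != taxa:
            raise ValueError("all alignments must contain the same taxa")
        if len({len(row) for row in msa.values()}) != 1:
            raise ValueError("rows within an alignment must have equal length")
    # Append the rows of each alternative alignment, taxon by taxon.
    return {taxon: "".join(msa[taxon] for msa in msas) for taxon in taxa}


# Two hypothetical alternative alignments of the same three sequences.
alt_1 = {"a": "AC-GT", "b": "ACTGT", "c": "A--GT"}
alt_2 = {"a": "ACG-T", "b": "ACTGT", "c": "A-G-T"}
super_msa = concatenate_msas([alt_1, alt_2])
for taxon, row in super_msa.items():
    print(taxon, row)
```

Columns that appear in only one of the alternatives are retained in the concatenated alignment, which is the source of the additional phylogenetic signal the abstract refers to.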
Affiliation(s)
- Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
- Itamar Sela
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Eli Levy Karin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
- Department of Molecular Biology & Ecology of Plants, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Giddy Landan
- Institute of Microbiology, Christian-Albrechts-University of Kiel, 24118 Kiel, Germany
- Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
15
Levinstein Hallak K, Tzur S, Rosset S. Big data analysis of human mitochondrial DNA substitution models: a regression approach. BMC Genomics 2018; 19:759. [PMID: 30340456 PMCID: PMC6195736 DOI: 10.1186/s12864-018-5123-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/21/2018] [Accepted: 09/27/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: We study Phylotree, a comprehensive representation of the phylogeny of global human mitochondrial DNA (mtDNA) variations, to better understand the mtDNA substitution mechanism and its most influential factors. We consider a substitution model in which a set of genetic features may predict the rate at which mtDNA substitutions occur. To find an appropriate model, an exhaustive analysis of the effect of multiple factors on the substitution rate is performed through Negative Binomial and Poisson regressions. We examine three different inclusion options for each categorical factor: omission, inclusion as an explanatory variable, and by-value partitioning. The examined factors include genes, codon position, a CpG indicator, directionality, nucleotide, amino acid, codon, and context (neighboring nucleotides), in addition to other site-based factors. Partitioning a model by a factor's value results in several sub-models (one for each value), where the likelihoods of the sub-models can be combined to form a score for the entire model. Eventually, the leading models are considered as viable candidates for explaining mtDNA substitution rates.
RESULTS: Initially, we introduce a novel clustering technique on genes, based on three similarity tests between pairs of genes, supporting previous results regarding gene functionalities in the mtDNA. These clusters are then used as a factor in our models. We present leading models for the protein-coding genes, rRNA and tRNA genes and the control region, showing it is disadvantageous to separate the models of transitions/transversions, or synonymous/non-synonymous substitutions. We identify a context effect that cannot be attributed solely to protein-level constraints or CpG pairs. For protein-coding genes, we show that the substitution model should be partitioned into sub-models according to the codon position and input codon; additionally, we confirm that gene identity and cluster have no significant effect once the above factors are accounted for.
CONCLUSIONS: We leverage the large, high-confidence Phylotree mtDNA phylogeny to develop a new statistical approach. We model the substitution rates using regressions, allowing consideration of many factors simultaneously. This admits the use of model selection tools that help to identify the set of factors best explaining the mutational dynamics when considered in tandem.
Affiliation(s)
- Keren Levinstein Hallak
- Department of Statistics and Operations Research, School of Mathematical Sciences, Tel-Aviv University, 6997801, Tel-Aviv, Israel
- Shay Tzur
- Braun School of Public Health and Community Medicine, The Hebrew University of Jerusalem, 9112102, Jerusalem, Israel
- Saharon Rosset
- Department of Statistics and Operations Research, School of Mathematical Sciences, Tel-Aviv University, 6997801, Tel-Aviv, Israel
16
|
Oda H, Ota M, Toh H. Profile comparison revealed deviation from structural constraint at the positively selected sites. Biosystems 2016; 147:67-77. [PMID: 27443483 DOI: 10.1016/j.biosystems.2016.07.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2015] [Revised: 07/13/2016] [Accepted: 07/16/2016] [Indexed: 11/18/2022]
Abstract
The amino acid substitutions at a site are affected by a mixture of various constraints. It is also known that amino acid substitutions are accelerated at sites under positive selection. However, the relationship between the substitutions at positively selected sites and the constraints has not been thoroughly examined. Advances in computational biology have enabled us to divide the mixture of constraints into the structural constraint and the remaining constraints by using the amino acid sequences and the tertiary structures, with the latter expressed as the deviation of the mixture of constraints from the structural constraint. Here, two types of profiles, or matrices of size 20 × (number of sites), are compared. One of the profiles represents the mixture of constraints and is generated from a multiple amino acid sequence alignment, whereas the other is designed to represent the structural constraints. We applied the profile comparison method to proteins under positive selection to examine the relationship between positive selection and constraints. The results suggested that the constraint at a site under positive selection tends to deviate from the structural constraint at the site.
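To make the idea of a 20 × (number of sites) profile concrete, the sketch below builds a frequency profile from a toy amino acid alignment. It is an illustration under stated assumptions, not the authors' method: the function name `alignment_profile`, the toy rows, and the choice of plain column frequencies (gaps ignored, no pseudocounts or sequence weighting) are all hypothetical.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def alignment_profile(rows: List[str]) -> List[List[float]]:
    """Return a 20 x L matrix of per-column amino acid frequencies.

    Row i corresponds to AMINO_ACIDS[i]; column j to alignment site j.
    Gaps and non-standard characters are ignored when counting.
    """
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    profile = [[0.0] * length for _ in AMINO_ACIDS]
    for site in range(length):
        counts = Counter(row[site] for row in rows if row[site] in AMINO_ACIDS)
        total = sum(counts.values())
        if total == 0:
            continue  # column of gaps only: leave frequencies at zero
        for i, residue in enumerate(AMINO_ACIDS):
            profile[i][site] = counts[residue] / total
    return profile


# Hypothetical three-sequence, three-site alignment.
toy_rows = ["ACD", "ACE", "A-E"]
profile = alignment_profile(toy_rows)
print(len(profile), "x", len(profile[0]))
```

A second profile of the same shape, derived from structural information, could then be compared with this one site by site; the particular comparison measure used in the paper is not given in the abstract.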
Affiliation(s)
- Hiroyuki Oda
- Graduate School of Systems Life Sciences, Kyushu University, 744 Motooka Nishi-ku, Fukuoka 819-0395, Japan.
- Motonori Ota
- Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya City, Aichi 464-8601, Japan
- Hiroyuki Toh
- Department of Biomedical Chemistry, School of Science and Technology, Kwansei Gakuin University, 2-1 Gakuen, Sanda, Hyogo 669-1337, Japan
17
Kryuchkova-Mostacci N, Robinson-Rechavi M. Tissue-Specific Evolution of Protein Coding Genes in Human and Mouse. PLoS One 2015; 10:e0131673. [PMID: 26121354 PMCID: PMC4488272 DOI: 10.1371/journal.pone.0131673] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 03/23/2015] [Accepted: 06/04/2015] [Indexed: 12/23/2022]
Abstract
Protein-coding genes evolve at different rates, and the influence of different parameters, from gene size to expression level, has been extensively studied. While in yeast gene expression level is the major causal factor of gene evolutionary rate, the situation is more complex in animals. Here we investigate these relations further, especially taking into account gene expression in different organs as well as indirect correlations between parameters. We used RNA-seq data from two large datasets, covering 22 mouse tissues and 27 human tissues. Over all tissues, evolutionary rate only correlates weakly with levels and breadth of expression. The strongest explanatory factors of purifying selection are GC content, expression in many developmental stages, and expression in brain tissues. While the main component of evolutionary rate is purifying selection, we also find tissue-specific patterns for sites under neutral evolution and for positive selection. We observe fast evolution of genes expressed in testis, but also in other tissues, notably liver, which is explained by weak purifying selection rather than by positive selection.
Affiliation(s)
- Nadezda Kryuchkova-Mostacci
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland