1
Chen J, Li Q, Xia S, Arsala D, Sosa D, Wang D, Long M. The Rapid Evolution of De Novo Proteins in Structure and Complex. Genome Biol Evol 2024; 16:evae107. [PMID: 38753069 DOI: 10.1093/gbe/evae107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/10/2024] [Indexed: 06/06/2024] Open
Abstract
Recent genome-wide studies in rice have established that de novo genes, evolving from noncoding sequences, enhance protein diversity through a stepwise process. However, the pattern and rate of their evolution in protein structure over time remain unclear. Here, we addressed these issues within a surprisingly short evolutionary timescale (<1 million years for 97% of Oryza de novo genes) with comparative approaches to gene duplicates. We found that de novo genes evolve faster than gene duplicates in the intrinsically disordered regions (such as random coils), secondary structure elements (such as α helix and β strand), hydrophobicity, and molecular recognition features. In de novo proteins, specifically, we observed an 8% to 14% decay in random coils and intrinsically disordered region lengths and a 2.3% to 6.5% increase in structured elements, hydrophobicity, and molecular recognition features, per million years on average. These patterns of structural evolution align with changes in amino acid composition over time as well. We also revealed higher positive charges but smaller molecular weights for de novo proteins than duplicates. Tertiary structure predictions showed that most de novo proteins, though not typically well folded on their own, readily form low-energy and compact complexes with other proteins facilitated by extensive residue contacts and conformational flexibility, suggesting a faster-binding scenario in de novo proteins to promote interaction. These analyses illuminate a rapid evolution of protein structure in de novo genes in rice genomes, originating from noncoding sequences, highlighting their quick transformation into active, protein complex-forming components within a remarkably short evolutionary timeframe.
Affiliation(s)
- Jianhai Chen
  - Department of Ecology and Evolution, The University of Chicago, Chicago, IL 60637, USA
- Qingrong Li
  - Division of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Cellular & Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Shengqian Xia
  - Department of Ecology and Evolution, The University of Chicago, Chicago, IL 60637, USA
- Deanna Arsala
  - Department of Ecology and Evolution, The University of Chicago, Chicago, IL 60637, USA
- Dylan Sosa
  - Department of Ecology and Evolution, The University of Chicago, Chicago, IL 60637, USA
- Dong Wang
  - Division of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA 92093, USA
  - Department of Cellular & Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Manyuan Long
  - Department of Ecology and Evolution, The University of Chicago, Chicago, IL 60637, USA
2
Middendorf L, Eicholt LA. Random, de novo, and conserved proteins: How structure and disorder predictors perform differently. Proteins 2024; 92:757-767. [PMID: 38226524 DOI: 10.1002/prot.26652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 10/18/2023] [Accepted: 12/01/2023] [Indexed: 01/17/2024]
Abstract
Understanding the emergence and structural characteristics of de novo and random proteins is crucial for unraveling protein evolution and designing novel enzymes. However, experimental determination of their structures remains challenging. Recent advancements in protein structure prediction, particularly with AlphaFold2 (AF2), have expanded our knowledge of protein structures, but their applicability to de novo and random proteins is unclear. In this study, we investigate the structural predictions and confidence scores of AF2 and protein language model-based predictor ESMFold for de novo and conserved proteins from Drosophila and a dataset of comparable random proteins. We find that the structural predictions for de novo and random proteins differ significantly from conserved proteins. Interestingly, a positive correlation between disorder and confidence scores (pLDDT) is observed for de novo and random proteins, in contrast to the negative correlation observed for conserved proteins. Furthermore, the performance of structure predictors for de novo and random proteins is hampered by the lack of sequence identity. We also observe fluctuating median predicted disorder among different sequence length quartiles for random proteins, suggesting an influence of sequence length on disorder predictions. In conclusion, while structure predictors provide initial insights into the structural composition of de novo and random proteins, their accuracy and applicability to such proteins remain limited. Experimental determination of their structures is necessary for a comprehensive understanding. The positive correlation between disorder and pLDDT could imply a potential for conditional folding and transient binding interactions of de novo and random proteins.
Affiliation(s)
- Lasse Middendorf
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
- Lars A Eicholt
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
3
Kore H, Datta KK, Nagaraj SH, Gowda H. Protein-coding potential of non-canonical open reading frames in human transcriptome. Biochem Biophys Res Commun 2023; 684:149040. [PMID: 37897910 DOI: 10.1016/j.bbrc.2023.09.068] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/13/2023] [Revised: 09/09/2023] [Accepted: 09/23/2023] [Indexed: 10/30/2023]
Abstract
In recent years, proteogenomics and ribosome profiling studies have identified a large number of proteins encoded by noncoding regions in the human genome. They are encoded by small open reading frames (sORFs) in the untranslated regions (UTRs) of mRNAs and long non-coding RNAs (lncRNAs). These sORF encoded proteins (SEPs) are often <150AA and show poor evolutionary conservation. A subset of them have been functionally characterized and shown to play an important role in fundamental biological processes including cardiac and muscle function, DNA repair, embryonic development and various human diseases. How many novel protein-coding regions exist in the human genome and what fraction of them are functionally important remains a mystery. In this review, we discuss current progress in unraveling SEPs, approaches used for their identification, their limitations and reliability of these identifications. We also discuss functionally characterized SEPs and their involvement in various biological processes and diseases. Lastly, we provide insights into their distinctive features compared to canonical proteins and challenges associated with annotating these in protein reference databases.
Affiliation(s)
- Hitesh Kore
  - Centre for Genomics and Personalised Health, Queensland University of Technology, Brisbane, Queensland, 4059, Australia
  - Cancer Precision Medicine Group, QIMR Berghofer Medical Research Institute, 300 Herston Road, Herston, Queensland, 4006, Australia
  - Faculty of Health, Queensland University of Technology, Brisbane, Queensland, 4059, Australia
- Keshava K Datta
  - Proteomics and Metabolomics Platform, La Trobe University, Melbourne, VIC, 3083, Australia
- Shivashankar H Nagaraj
  - Centre for Genomics and Personalised Health, Queensland University of Technology, Brisbane, Queensland, 4059, Australia
  - Faculty of Health, Queensland University of Technology, Brisbane, Queensland, 4059, Australia
- Harsha Gowda
  - Centre for Genomics and Personalised Health, Queensland University of Technology, Brisbane, Queensland, 4059, Australia
  - Cancer Precision Medicine Group, QIMR Berghofer Medical Research Institute, 300 Herston Road, Herston, Queensland, 4006, Australia
  - Faculty of Health, Queensland University of Technology, Brisbane, Queensland, 4059, Australia
  - Faculty of Medicine, The University of Queensland, Queensland, 4072, Australia
4
Knoshaug EP, Sun P, Nag A, Nguyen H, Mattoon EM, Zhang N, Liu J, Chen C, Cheng J, Zhang R, St. John P, Umen J. Identification and preliminary characterization of conserved uncharacterized proteins from Chlamydomonas reinhardtii, Arabidopsis thaliana, and Setaria viridis. Plant Direct 2023; 7:e527. [PMID: 38044962 PMCID: PMC10690477 DOI: 10.1002/pld3.527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 08/03/2023] [Accepted: 08/11/2023] [Indexed: 12/05/2023]
Abstract
The rapid accumulation of sequenced plant genomes in the past decade has outpaced the still difficult problem of genome-wide protein-coding gene annotation. A substantial fraction of protein-coding genes in all plant genomes are poorly annotated or unannotated and remain functionally uncharacterized. We identified unannotated proteins in three model organisms representing distinct branches of the green lineage (Viridiplantae): Arabidopsis thaliana (eudicot), Setaria viridis (monocot), and Chlamydomonas reinhardtii (Chlorophyte alga). Using similarity searching, we identified a subset of unannotated proteins that were conserved between these species and defined them as Deep Green proteins. Bioinformatic, genomic, and structural predictions were performed to begin classifying Deep Green genes and proteins. Compared to whole proteomes for each species, the Deep Green set was enriched for proteins with predicted chloroplast targeting signals predictive of photosynthetic or plastid functions, a result that was consistent with enrichment for daylight phase diurnal expression patterning. Structural predictions using AlphaFold and comparisons to known structures showed that a significant proportion of Deep Green proteins may possess novel folds. Though only available for three organisms, the Deep Green genes and proteins provide a starting resource of high-value targets for further investigation of potentially new protein structures and functions conserved across the green lineage.
Affiliation(s)
- Eric P. Knoshaug
  - Biosciences Center, National Renewable Energy Laboratory, Golden, Colorado, USA
- Peipei Sun
  - Donald Danforth Plant Science Center, St. Louis, MO, USA
- Ambarish Nag
  - Computational Sciences Center, National Renewable Energy Laboratory, Golden, Colorado, USA
- Huong Nguyen
  - Donald Danforth Plant Science Center, St. Louis, MO, USA
  - Institute of Genomics for Crop Abiotic Stress Tolerance, Department of Plant and Soil Science, Texas Tech University, Lubbock, Texas, USA
- Erin M. Mattoon
  - Donald Danforth Plant Science Center, St. Louis, MO, USA
  - Plant and Microbial Biosciences Program, Division of Biology and Biomedical Sciences, Washington University in Saint Louis, St. Louis, Missouri, USA
- Jian Liu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Chen Chen
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Ru Zhang
  - Donald Danforth Plant Science Center, St. Louis, MO, USA
- Peter St. John
  - Biosciences Center, National Renewable Energy Laboratory, Golden, Colorado, USA
- James Umen
  - Donald Danforth Plant Science Center, St. Louis, MO, USA
5
Aubel M, Eicholt L, Bornberg-Bauer E. Assessing structure and disorder prediction tools for de novo emerged proteins in the age of machine learning. F1000Res 2023; 12:347. [PMID: 37113259 PMCID: PMC10126731 DOI: 10.12688/f1000research.130443.1] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Accepted: 03/17/2023] [Indexed: 03/31/2023] Open
Abstract
Background: De novo protein coding genes emerge from scratch in the non-coding regions of the genome and have, per definition, no homology to other genes. Therefore, their encoded de novo proteins belong to the so-called "dark protein space". So far, only four de novo protein structures have been experimentally approximated. Low homology, presumed high disorder and limited structures result in low confidence structural predictions for de novo proteins in most cases. Here, we look at the most widely used structure and disorder predictors and assess their applicability for de novo emerged proteins. Since AlphaFold2 is based on the generation of multiple sequence alignments and was trained on solved structures of largely conserved and globular proteins, its performance on de novo proteins remains unknown. More recently, natural language models of proteins have been used for alignment-free structure predictions, potentially making them more suitable for de novo proteins than AlphaFold2. Methods: We applied different disorder predictors (IUPred3 short/long, flDPnn) and structure predictors, AlphaFold2 on the one hand and language-based models (Omegafold, ESMfold, RGN2) on the other hand, to four de novo proteins with experimental evidence on structure. We compared the resulting predictions between the different predictors as well as to the existing experimental evidence. Results: Results from IUPred, the most widely used disorder predictor, depend heavily on the choice of parameters and differ significantly from flDPnn which has been found to outperform most other predictors in a comparative assessment study recently. Similarly, different structure predictors yielded varying results and confidence scores for de novo proteins. 
Conclusions: We suggest that, while in some cases protein language model based approaches might be more accurate than AlphaFold2, the structure prediction of de novo emerged proteins remains a difficult task for any predictor, be it disorder or structure.
Affiliation(s)
- Margaux Aubel
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, 48149, Germany
- Lars Eicholt
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, 48149, Germany
- Erich Bornberg-Bauer
  - Institute for Evolution and Biodiversity, University of Muenster, Muenster, 48149, Germany
  - Department Protein Evolution, Max Planck-Institute for Biology, Tuebingen, 72076, Germany
6
Karlowski WM, Varshney D, Zielezinski A. Taxonomically Restricted Genes in Bacillus may Form Clusters of Homologs and Can be Traced to a Large Reservoir of Noncoding Sequences. Genome Biol Evol 2023; 15:7039703. [PMID: 36790099 PMCID: PMC10003748 DOI: 10.1093/gbe/evad023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Revised: 01/09/2023] [Accepted: 02/08/2023] [Indexed: 02/16/2023] Open
Abstract
Taxonomically restricted genes (TRGs) are unique for a defined group of organisms and may act as potential genetic determinants of lineage-specific, biological properties. Here, we explore the TRGs of highly diverse and economically important Bacillus bacteria by examining commonly used TRG identification parameters and data sources. We show the significant effects of sequence similarity thresholds, composition, and the size of the reference database in the identification process. Subsequently, we applied stringent TRG search parameters and expanded the identification procedure by incorporating an analysis of noncoding and non-syntenic regions of non-Bacillus genomes. A multiplex annotation procedure minimized the number of false-positive TRG predictions and showed nearly one-third of the alleged TRGs could be mapped to genes missed in genome annotations. We traced the putative origin of TRGs by identifying homologous, noncoding genomic regions in non-Bacillus species and detected sequence changes that could transform these regions into protein-coding genes. In addition, our analysis indicated that Bacillus TRGs represent a specific group of genes mostly showing intermediate sequence properties between genes that are conserved across multiple taxa and nonannotated peptides encoded by open reading frames.
Affiliation(s)
- Wojciech M Karlowski
  - Department of Computational Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego 6, Poznan, Poland
- Deepti Varshney
  - Department of Computational Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego 6, Poznan, Poland
- Andrzej Zielezinski
  - Department of Computational Biology, Adam Mickiewicz University in Poznan, Uniwersytetu Poznanskiego 6, Poznan, Poland
7
Intrinsically Disordered Proteins: An Overview. Int J Mol Sci 2022; 23:ijms232214050. [PMID: 36430530 PMCID: PMC9693201 DOI: 10.3390/ijms232214050] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 09/04/2022] [Revised: 11/07/2022] [Accepted: 11/08/2022] [Indexed: 11/16/2022] Open
Abstract
Many proteins and protein segments cannot attain a single stable three-dimensional structure under physiological conditions; instead, they adopt multiple interconverting conformational states. Such intrinsically disordered proteins or protein segments are highly abundant across proteomes, and are involved in various effector functions. This review focuses on different aspects of disordered proteins and disordered protein regions, which form the basis of the so-called "Disorder-function paradigm" of proteins. Additionally, various experimental approaches and computational tools used for characterizing disordered regions in proteins are discussed. Finally, the role of disordered proteins in diseases and their utility as potential drug targets are explored.
8
Parikh SB, Houghton C, Van Oss SB, Wacholder A, Carvunis A. Origins, evolution, and physiological implications of de novo genes in yeast. Yeast 2022; 39:471-481. [PMID: 35959631 PMCID: PMC9544372 DOI: 10.1002/yea.3810] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/20/2022] [Revised: 08/08/2022] [Accepted: 08/09/2022] [Indexed: 12/03/2022] Open
Abstract
De novo gene birth is the process by which new genes emerge in sequences that were previously noncoding. Over the past decade, researchers have taken advantage of the power of yeast as a model and a tool to study the evolutionary mechanisms and physiological implications of de novo gene birth. We summarize the mechanisms that have been proposed to explicate how noncoding sequences can become protein-coding genes, highlighting the discovery of pervasive translation of the yeast transcriptome and its presumed impact on evolutionary innovation. We summarize current best practices for the identification and characterization of de novo genes. Crucially, we explain that the field is still in its nascency, with the physiological roles of most young yeast de novo genes identified thus far still utterly unknown. We hope this review inspires researchers to investigate the true contribution of de novo gene birth to cellular physiology and phenotypic diversity across yeast strains and species.
Affiliation(s)
- Saurin B. Parikh
  - Department of Computational and Systems Biology, School of Medicine, Pittsburgh Center for Evolutionary Biology and Evolution, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- Carly Houghton
  - Department of Computational and Systems Biology, School of Medicine, Pittsburgh Center for Evolutionary Biology and Evolution, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- S. Branden Van Oss
  - Department of Computational and Systems Biology, School of Medicine, Pittsburgh Center for Evolutionary Biology and Evolution, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- Aaron Wacholder
  - Department of Computational and Systems Biology, School of Medicine, Pittsburgh Center for Evolutionary Biology and Evolution, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- Anne‐Ruxandra Carvunis
  - Department of Computational and Systems Biology, School of Medicine, Pittsburgh Center for Evolutionary Biology and Evolution, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
9
Sangster AG, Zarin T, Moses AM. Evolution of short linear motifs and disordered proteins Topic: yeast as model system to study evolution. Curr Opin Genet Dev 2022; 76:101964. [PMID: 35939968 DOI: 10.1016/j.gde.2022.101964] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/04/2022] [Revised: 06/29/2022] [Accepted: 07/08/2022] [Indexed: 11/26/2022]
Abstract
Evolutionary preservation of protein structure had a major influence on the field of molecular evolution: changes in individual amino acids that did not disrupt protein folding would either have no effect or subtly change the 'lock' so that it could fit a new 'key'. Homology of individual amino acids could be confidently assigned through sequence alignments, and models of evolution could be tested. This view of molecular evolution excluded large regions of proteins that could not be confidently aligned, such as intrinsically disordered regions (IDRs) that do not fold into stable structures. In the last decade, major progress has been made in understanding the evolution of IDRs, much of it facilitated by new experimental and computational approaches in yeast. Here, we review this progress as well as several still outstanding questions.
Affiliation(s)
- Ami G Sangster
  - Cell & Systems Biology, University of Toronto, 25 Harbord St., Toronto, ON M5S 3G5, Canada
- Taraneh Zarin
  - Cell & Systems Biology, University of Toronto, 25 Harbord St., Toronto, ON M5S 3G5, Canada. https://twitter.com/@taraneh_z
- Alan M Moses
  - Cell & Systems Biology, University of Toronto, 25 Harbord St., Toronto, ON M5S 3G5, Canada
10
Kosinski LJ, Aviles NR, Gomez K, Masel J. Random peptides rich in small and disorder-promoting amino acids are less likely to be harmful. Genome Biol Evol 2022; 14:evac085. [PMID: 35668555 PMCID: PMC9210321 DOI: 10.1093/gbe/evac085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2021] [Revised: 04/01/2022] [Accepted: 05/27/2022] [Indexed: 11/15/2022] Open
Abstract
Proteins are the workhorses of the cell, yet they carry great potential for harm via misfolding and aggregation. Despite the dangers, proteins are sometimes born de novo from non-coding DNA. Proteins are more likely to be born from non-coding regions that produce peptides that do little to no harm when translated than from regions that produce harmful peptides. To investigate which newborn proteins are most likely to "first, do no harm", we estimate fitnesses from an experiment that competed Escherichia coli lineages that each expressed a unique random peptide. A variety of peptide metrics significantly predict lineage fitness, but this predictive power stems from simple amino acid frequencies rather than the ordering of amino acids. Amino acids that are smaller and that promote intrinsic structural disorder have more benign fitness effects. We validate that the amino acids that indicate benign effects in random peptides expressed in E. coli also do so in an independent dataset of random N-terminal tags in which it is possible to control for expression level. The same amino acids are also enriched in young animal proteins.
Affiliation(s)
- Luke J Kosinski
  - Department of Molecular and Cellular Biology, University of Arizona, Tucson, USA
- Nathan R Aviles
  - Graduate Interdisciplinary Program in Statistics, University of Arizona, Tucson, USA
- Kevin Gomez
  - Graduate Interdisciplinary Program in Applied Math, University of Arizona, Tucson, USA
- Joanna Masel
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, USA
11
Cherezov RO, Vorontsova JE, Simonova OB. The Phenomenon of Evolutionary “De Novo Generation” of Genes. Russ J Dev Biol 2021. [DOI: 10.1134/s1062360421060035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
12
Papadopoulos C, Callebaut I, Gelly JC, Hatin I, Namy O, Renard M, Lespinet O, Lopes A. Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution. Genome Res 2021; 31:2303-2315. [PMID: 34810219 PMCID: PMC8647833 DOI: 10.1101/gr.275638.121] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/13/2021] [Accepted: 09/23/2021] [Indexed: 01/08/2023]
Abstract
The noncoding genome plays an important role in de novo gene birth and in the emergence of genetic novelty. Nevertheless, how noncoding sequences' properties could promote the birth of novel genes and shape the evolution and the structural diversity of proteins remains unclear. Therefore, by combining different bioinformatic approaches, we characterized the fold potential diversity of the amino acid sequences encoded by all intergenic open reading frames (ORFs) of S. cerevisiae with the aim of (1) exploring whether the structural states' diversity of proteomes is already present in noncoding sequences, and (2) estimating the potential of the noncoding genome to produce novel protein bricks that could either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Moreover, they encompass the large structural state diversity of canonical proteins, with the majority predicted as foldable. Then, we investigated the early stages of de novo gene birth by reconstructing the ancestral sequences of 70 yeast de novo genes and characterized the sequence and structural properties of intergenic ORFs with a strong translation signal. This enabled us to highlight sequence and structural factors determining de novo gene emergence. Finally, we showed a strong correlation between the fold potential of de novo proteins and one of their ancestral amino acid sequences, reflecting the relationship between the noncoding genome and the protein structure universe.
Affiliation(s)
- Chris Papadopoulos
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
- Isabelle Callebaut
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005 Paris, France
- Jean-Christophe Gelly
  - Université de Paris, Biologie Intégrée du Globule Rouge, UMR_S1134, BIGR, INSERM, F-75015 Paris, France
  - Laboratoire d'Excellence GR-Ex, 75015 Paris, France
  - Institut National de la Transfusion Sanguine, F-75015 Paris, France
- Isabelle Hatin
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
- Olivier Namy
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
- Maxime Renard
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
- Olivier Lespinet
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
- Anne Lopes
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
13
Castro JF, Tautz D. The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells. Genes (Basel) 2021; 12:1913. [PMID: 34946861 PMCID: PMC8702183 DOI: 10.3390/genes12121913] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/17/2021] [Revised: 11/22/2021] [Accepted: 11/26/2021] [Indexed: 12/21/2022] Open
Abstract
We study the potential for the de novo evolution of genes from random nucleotide sequences using libraries of E. coli expressing random sequence peptides. We assess the effects of such peptides on cell growth by monitoring frequency changes in individual clones in a complex library through four serial passages. Using a new analysis pipeline that allows the tracing of peptides of all lengths, we find that over half of the peptides have consistent effects on cell growth. Across nine different experiments, around 16% of clones increase in frequency and 36% decrease, with some variation between individual experiments. Shorter peptides (8-20 residues) are more likely to increase in frequency, whereas longer ones are more likely to decrease. GC content, amino acid composition, intrinsic disorder, and aggregation propensity show slightly different patterns between peptide groups. Sequences that increase in frequency tend to be more disordered with lower aggregation propensity. This coincides with the observation that young genes with more disordered structures are better tolerated in genomes. Our data indicate that random sequences can be a source of evolutionary innovation, since a large fraction of them are well tolerated by the cells or can provide a growth advantage.
Affiliation(s)
- Diethard Tautz
  - Max Planck Institute for Evolutionary Biology, August-Thienemann Strasse 2, 24306 Plön, Germany
14
Fesenko I, Shabalina SA, Mamaeva A, Knyazev A, Glushkevich A, Lyapina I, Ziganshin R, Kovalchuk S, Kharlampieva D, Lazarev V, Taliansky M, Koonin EV. A vast pool of lineage-specific microproteins encoded by long non-coding RNAs in plants. Nucleic Acids Res 2021; 49:10328-10346. [PMID: 34570232 DOI: 10.1093/nar/gkab816] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 06/03/2021] [Revised: 08/17/2021] [Accepted: 09/17/2021] [Indexed: 12/17/2022] Open
Abstract
Pervasive transcription of eukaryotic genomes results in expression of long non-coding RNAs (lncRNAs), most of which are poorly conserved in evolution and appear to be non-functional. However, some lncRNAs have been shown to perform specific functions, in particular, transcription regulation. Thousands of small open reading frames (smORFs, <100 codons) located on lncRNAs potentially might be translated into peptides or microproteins. We report a comprehensive analysis of the conservation and evolutionary trajectories of lncRNAs-smORFs from the moss Physcomitrium patens across transcriptomes of 479 plant species. Although thousands of smORFs are subject to substantial purifying selection, the majority of the smORFs appear to be evolutionarily young and could represent a major pool for functional innovation. Using nanopore RNA sequencing, we show that, on average, the transcriptional level of conserved smORFs is higher than that of non-conserved smORFs. Proteomic analysis confirmed translation of 82 novel species-specific smORFs. Numerous conserved smORFs containing low complexity regions (LCRs) or transmembrane domains were identified, and the biological functions of a selected LCR-smORF were demonstrated experimentally. Thus, microproteins encoded by smORFs are a major, functionally diverse component of the plant proteome.
Affiliation(s)
- Igor Fesenko
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation
| | - Svetlana A Shabalina
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Anna Mamaeva
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation
| | - Andrey Knyazev
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation
| | - Anna Glushkevich
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation
| | - Irina Lyapina
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation
| | - Rustam Ziganshin
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation
| | - Sergey Kovalchuk
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation
| | - Daria Kharlampieva
- Department of Cell Biology, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, Moscow 119435, Russian Federation
| | - Vassili Lazarev
- Department of Cell Biology, Federal Research and Clinical Center of Physical-Chemical Medicine of Federal Medical Biological Agency, Moscow 119435, Russian Federation.,Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Moscow region, 141701, Russian Federation
| | - Michael Taliansky
- Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry of the Russian Academy of Sciences, Moscow 117997, Russian Federation.,The James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| | - Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
|
15
|
Homopeptide and homocodon levels across fungi are coupled to GC/AT-bias and intrinsic disorder, with unique behaviours for some amino acids. Sci Rep 2021; 11:10025. [PMID: 33976321 PMCID: PMC8113271 DOI: 10.1038/s41598-021-89650-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/30/2020] [Accepted: 04/22/2021] [Indexed: 11/09/2022] Open
Abstract
Homopeptides (runs of one amino-acid type) are evolutionarily important since they are prone to expand/contract during DNA replication, recombination and repair. To gain insight into the genomic/proteomic traits driving their variation, we analyzed how homopeptides and homocodons (which are pure codon repeats) vary across 405 Dikarya, and probed their linkage to genome GC/AT bias and other factors. We find that amino-acid homopeptide frequencies vary diversely between clades, with the AT-rich Saccharomycotina trending distinctly. As organisms evolve, homocodon and homopeptide numbers are majorly coupled to GC/AT-bias, exhibiting a bi-furcated correlation with degree of AT- or GC-bias. Mid-GC/AT genomes tend to have markedly fewer simply because they are mid-GC/AT. Despite these trends, homopeptides tend to be GC-biased relative to other parts of coding sequences, even in AT-rich organisms, indicating they absorb AT bias less or are inherently more GC-rich. The most frequent and most variable homopeptide amino acids favour intrinsic disorder, and there are an opposing correlation and anti-correlation versus homopeptide levels for intrinsic disorder and structured-domain content respectively. Specific homopeptides show unique behaviours that we suggest are linked to inherent slippage probabilities during DNA replication and recombination, such as poly-glutamine, which is an evolutionarily very variable homopeptide with a codon repertoire unbiased for GC/AT, and poly-lysine whose homocodons are overwhelmingly made from the codon AAG.
|
16
|
James JE, Willis SM, Nelson PG, Weibel C, Kosinski LJ, Masel J. Universal and taxon-specific trends in protein sequences as a function of age. eLife 2021; 10:e57347. [PMID: 33416492 PMCID: PMC7819706 DOI: 10.7554/elife.57347] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 03/28/2020] [Accepted: 01/05/2021] [Indexed: 01/12/2023] Open
Abstract
Extant protein-coding sequences span a huge range of ages, from those that emerged only recently to those present in the last universal common ancestor. Because evolution has had less time to act on young sequences, there might be 'phylostratigraphy' trends in any properties that evolve slowly with age. A long-term reduction in hydrophobicity and hydrophobic clustering was found in previous, taxonomically restricted studies. Here we perform integrated phylostratigraphy across 435 fully sequenced species, using sensitive HMM methods to detect protein domain homology. We find that the reduction in hydrophobic clustering is universal across lineages. However, only young animal domains have a tendency to have higher structural disorder. Among ancient domains, trends in amino acid composition reflect the order of recruitment into the genetic code, suggesting that the composition of the contemporary descendants of ancient sequences reflects amino acid availability during the earliest stages of life, when these sequences first emerged.
Affiliation(s)
- Jennifer E James
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, United States
| | - Sara M Willis
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, United States
| | - Paul G Nelson
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, United States
| | - Catherine Weibel
- Department of Physics, University of Arizona, Tucson, United States
- Department of Mathematics, University of Arizona, Tucson, United States
| | - Luke J Kosinski
- Department of Molecular and Cellular Biology, University of Arizona, Tucson, United States
| | - Joanna Masel
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, United States
| |
|
17
|
Dowling D, Schmitz JF, Bornberg-Bauer E. Stochastic Gain and Loss of Novel Transcribed Open Reading Frames in the Human Lineage. Genome Biol Evol 2020; 12:2183-2195. [PMID: 33210146 PMCID: PMC7674706 DOI: 10.1093/gbe/evaa194] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Accepted: 09/12/2020] [Indexed: 12/12/2022] Open
Abstract
In addition to known genes, much of the human genome is transcribed into RNA. Chance formation of novel open reading frames (ORFs) can lead to the translation of myriad new proteins. Some of these ORFs may yield advantageous adaptive de novo proteins. However, widespread translation of noncoding DNA can also produce hazardous protein molecules, which can misfold and/or form toxic aggregates. The dynamics of how de novo proteins emerge from potentially toxic raw materials and what influences their long-term survival are unknown. Here, using transcriptomic data from human and five other primates, we generate a set of transcribed human ORFs at six conservation levels to investigate which properties influence the early emergence and long-term retention of these expressed ORFs. As these taxa diverged from each other relatively recently, we present a fine scale view of the evolution of novel sequences over recent evolutionary time. We find that novel human-restricted ORFs are preferentially located on GC-rich gene-dense chromosomes, suggesting their retention is linked to pre-existing genes. Sequence properties such as intrinsic structural disorder and aggregation propensity, which have been proposed to play a role in survival of de novo genes, remain unchanged over time. Even very young sequences code for proteins with low aggregation propensities, suggesting that genomic regions with many novel transcribed ORFs are concomitantly less likely to produce ORFs which code for harmful toxic proteins. Our data indicate that the survival of these novel ORFs is largely stochastic rather than shaped by selection.
Affiliation(s)
- Daniel Dowling
- Institute for Evolution and Biodiversity, University of Münster, Germany
| | - Jonathan F Schmitz
- Institute for Evolution and Biodiversity, University of Münster, Germany
| | | |
|
18
|
Poot Velez AH, Fontove F, Del Rio G. Protein-Protein Interactions Efficiently Modeled by Residue Cluster Classes. Int J Mol Sci 2020; 21:E4787. [PMID: 32640745 PMCID: PMC7370293 DOI: 10.3390/ijms21134787] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/19/2020] [Revised: 06/20/2020] [Accepted: 06/28/2020] [Indexed: 01/22/2023] Open
Abstract
Predicting protein-protein interactions (PPI) represents an important challenge in structural bioinformatics. Current computational methods display different degrees of accuracy when predicting these interactions. Different factors were proposed to help improve these predictions, including choosing the proper descriptors of proteins to represent these interactions, among others. In the current work, we provide a representative protein structure that is amenable to PPI classification using machine learning approaches, referred to as residue cluster classes. Through sampling and optimization, we identified the best algorithm-parameter pair to classify PPI from more than 360 different training sets. We tested these classifiers against PPI datasets that were not included in the training set but shared sequence similarity with proteins in the training set to reproduce the situation of most proteins sharing sequence similarity with others. We identified a model with almost no PPI error (96-99% of correctly classified instances) and showed that residue cluster classes of protein pairs displayed a distinct pattern between positive and negative protein interactions. Our results indicated that residue cluster classes are structural features relevant to model PPI and provide a novel tool to mathematically model the protein structure/function relationship.
Affiliation(s)
- Albros Hermes Poot Velez
- Department of biochemistry and structural biology, Instituto de fisiologia celular, UNAM, Mexico City 04510, Mexico
| | | | - Gabriel Del Rio
- Department of biochemistry and structural biology, Instituto de fisiologia celular, UNAM, Mexico City 04510, Mexico
| |
|
19
|
Desnos-Ollivier M, Maufrais C, Pihet M, Aznar C, Dromer F. Epidemiological investigation for grouped cases of Trichosporon asahii using whole genome and IGS1 sequencing. Mycoses 2020; 63:942-951. [PMID: 32506754 DOI: 10.1111/myc.13126] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/30/2020] [Revised: 05/28/2020] [Accepted: 05/28/2020] [Indexed: 12/30/2022]
Abstract
BACKGROUND: Trichosporonosis is a rare invasive infection in humans mainly due to Trichosporon asahii, and especially recovered from patients having haematological malignancy. Since 2012, IGS1 region sequencing is used as a genotyping method to distinguish isolates, with high frequency of one haplotype worldwide and a geographic specificity for some haplotypes. OBJECTIVES: We compared the IGS1 genotyping method and whole genome sequencing (WGS) to study the relationship between clinical isolates involved in two grouped cases in France. METHODS: IGS1 sequencing and antifungal susceptibility testing were performed for 54 clinical isolates. Clinical data for 28 isolates included in surveillance programs were analysed. Whole genome was sequenced for 32 clinical isolates and the type strain. RESULTS: All isolates were intrinsically resistant to flucytosine, while voriconazole had the most potent in vitro activity. The majority of the isolates was recovered from patients with haematological malignancies (42.86%), with a high proportion of children (<15 yrs-old, 32.14%) and a high mortality rate at three months (46.15%). Based on the WGS analysis, isolates exhibiting IGS1 haplotype 1, 3 and 7 belonged to different clades. Five isolates recovered during the first grouped cases had the same IGS1 haplotype and shared 99% of SNPs similarity. For the second grouped cases, four isolates had 98.7% of SNPs similarity while the isolate recovered 4 years earlier was totally unlinked. CONCLUSIONS: We confirmed the usefulness of IGS1 sequencing for grouped cases infection of T. asahii. We underlined its limitation for the study of population structure and the utility of WGS analysis for the study of epidemiologically unrelated isolates.
Affiliation(s)
- Marie Desnos-Ollivier
- Molecular Mycology Unit, UMR2000, Institut Pasteur, CNRS, National Reference Center for Invasive Mycoses & Antifungals, Paris, France
| | - Corinne Maufrais
- Center of Bioinformatics for Biology, Institut Pasteur, Paris, France
| | - Marc Pihet
- Laboratoire de Parasitologie-Mycologie, Centre Hospitalier Universitaire d'Angers, Angers, France
| | - Christine Aznar
- Laboratoire de Parasitologie-Mycologie, Centre Hospitalier de Cayenne, Cayenne, France
| | - Françoise Dromer
- Molecular Mycology Unit, UMR2000, Institut Pasteur, CNRS, National Reference Center for Invasive Mycoses & Antifungals, Paris, France
| | | |
|
20
|
Evolution of novel genes in three-spined stickleback populations. Heredity (Edinb) 2020; 125:50-59. [PMID: 32499660 PMCID: PMC7413265 DOI: 10.1038/s41437-020-0319-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/24/2019] [Revised: 04/27/2020] [Accepted: 04/30/2020] [Indexed: 12/22/2022] Open
Abstract
Eukaryotic genomes frequently acquire new protein-coding genes which may significantly impact an organism’s fitness. Novel genes can be created, for example, by duplication of large genomic regions or de novo, from previously non-coding DNA. Either way, creation of a novel transcript is an essential early step during novel gene emergence. Most studies on the gain-and-loss dynamics of novel genes so far have compared genomes between species, constraining analyses to genes that have remained fixed over long time scales. However, the importance of novel genes for rapid adaptation among populations has recently been shown. Therefore, since little is known about the evolutionary dynamics of transcripts across natural populations, we here study transcriptomes from several tissues and nine geographically distinct populations of an ecological model species, the three-spined stickleback. Our findings suggest that novel genes typically start out as transcripts with low expression and high tissue specificity. Early expression regulation appears to be mediated by gene-body methylation. Although most new and narrowly expressed genes are rapidly lost, those that survive and subsequently spread through populations tend to gain broader and higher expression levels. The properties of the encoded proteins, such as disorder and aggregation propensity, hardly change. Correspondingly, young novel genes are not preferentially under positive selection but older novel genes more often overlap with FST outlier regions. Taken together, expression of the surviving novel genes is rapidly regulated, probably via epigenetic mechanisms, while structural properties of encoded proteins are non-debilitating and might only change much later.
|
21
|
Heames B, Schmitz J, Bornberg-Bauer E. A Continuum of Evolving De Novo Genes Drives Protein-Coding Novelty in Drosophila. J Mol Evol 2020; 88:382-398. [PMID: 32253450 PMCID: PMC7162840 DOI: 10.1007/s00239-020-09939-z] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 08/14/2019] [Accepted: 03/13/2020] [Indexed: 12/13/2022]
Abstract
Orphan genes, lacking detectable homologs in outgroup species, typically represent 10-30% of eukaryotic genomes. Efforts to find the source of these young genes indicate that de novo emergence from non-coding DNA may in part explain their prevalence. Here, we investigate the roots of orphan gene emergence in the Drosophila genus. Across the annotated proteomes of twelve species, we find 6297 orphan genes within 4953 taxon-specific clusters of orthologs. By inferring the ancestral DNA as non-coding for between 550 and 2467 (8.7-39.2%) of these genes, we describe for the first time how de novo emergence contributes to the abundance of clade-specific Drosophila genes. In support of them having functional roles, we show that de novo genes have robust expression and translational support. However, the distinct nucleotide sequences of de novo genes, which have characteristics intermediate between intergenic regions and conserved genes, reflect their recent birth from non-coding DNA. We find that de novo genes encode more disordered proteins than both older genes and intergenic regions. Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over large evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of an initial pool of novel genes.
Affiliation(s)
- Brennen Heames
- Institute for Evolution and Biodiversity, 48149, Münster, Germany
| | - Jonathan Schmitz
- Institute for Evolution and Biodiversity, 48149, Münster, Germany
| | | |
|
22
|
Vakirlis N, Acar O, Hsu B, Castilho Coelho N, Van Oss SB, Wacholder A, Medetgul-Ernar K, Bowman RW, Hines CP, Iannotta J, Parikh SB, McLysaght A, Camacho CJ, O'Donnell AF, Ideker T, Carvunis AR. De novo emergence of adaptive membrane proteins from thymine-rich genomic sequences. Nat Commun 2020; 11:781. [PMID: 32034123 PMCID: PMC7005711 DOI: 10.1038/s41467-020-14500-z] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 04/25/2019] [Accepted: 12/20/2019] [Indexed: 11/14/2022] Open
Abstract
Recent evidence demonstrates that novel protein-coding genes can arise de novo from non-genic loci. This evolutionary innovation is thought to be facilitated by the pervasive translation of non-genic transcripts, which exposes a reservoir of variable polypeptides to natural selection. Here, we systematically characterize how these de novo emerging coding sequences impact fitness in budding yeast. Disruption of emerging sequences is generally inconsequential for fitness in the laboratory and in natural populations. Overexpression of emerging sequences, however, is enriched in adaptive fitness effects compared to overexpression of established genes. We find that adaptive emerging sequences tend to encode putative transmembrane domains, and that thymine-rich intergenic regions harbor a widespread potential to produce transmembrane domains. These findings, together with in-depth examination of the de novo emerging YBR196C-A locus, suggest a novel evolutionary model whereby adaptive transmembrane polypeptides emerge de novo from thymine-rich non-genic regions and subsequently accumulate changes molded by natural selection. There is increasing evidence that protein-coding genes can emerge de novo from noncoding genomic regions. Vakirlis et al. propose that sequences encoding transmembrane polypeptides can emerge de novo in thymine-rich genomic regions and provide organisms with fitness benefits.
Affiliation(s)
- Nikolaos Vakirlis
- Smurfit Institute of Genetics, Trinity College Dublin, University of Dublin, Dublin, 2, Ireland
| | - Omer Acar
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| | - Brian Hsu
- Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA, 92093, United States
| | - Nelson Castilho Coelho
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| | - S Branden Van Oss
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| | - Aaron Wacholder
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| | - Kate Medetgul-Ernar
- Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA, 92093, United States
| | - Ray W Bowman
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, 15260, United States
| | - Cameron P Hines
- Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA, 92093, United States
| | - John Iannotta
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| | - Saurin Bipin Parikh
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| | - Aoife McLysaght
- Smurfit Institute of Genetics, Trinity College Dublin, University of Dublin, Dublin, 2, Ireland
| | - Carlos J Camacho
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| | - Allyson F O'Donnell
- Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, 15260, United States
| | - Trey Ideker
- Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA, 92093, United States.
| | - Anne-Ruxandra Carvunis
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States.,Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15213, United States
| |
|
23
|
Oldfield CJ, Peng Z, Uversky VN, Kurgan L. Codon selection reduces GC content bias in nucleic acids encoding for intrinsically disordered proteins. Cell Mol Life Sci 2020; 77:149-160. [PMID: 31175370 PMCID: PMC11104855 DOI: 10.1007/s00018-019-03166-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/03/2019] [Revised: 05/14/2019] [Accepted: 05/28/2019] [Indexed: 02/06/2023]
Abstract
Protein-coding nucleic acids exhibit composition and codon biases between sequences coding for intrinsically disordered regions (IDRs) and those coding for structured regions. IDRs are regions of proteins that are folding self-insufficient and which function without the prerequisite of folded structure. Several authors have investigated composition bias or codon selection in regions encoding for IDRs, primarily in Eukaryota, and concluded that elevated GC content is the result of the biased amino acid composition of IDRs. We substantively extend previous work by examining GC content in regions encoding IDRs, from 44 species in Eukaryota, Archaea, and Bacteria, spanning a wide range of GC content. We confirm that regions coding for IDRs show a significantly elevated GC content, even across all domains of life. Although this is largely attributable to the amino acid composition bias of IDRs, we show that this bias is independent of the overall GC content and, most importantly, we are the first to observe that GC content bias in IDRs is significantly different than expected from IDR amino acid composition alone. We empirically find compensatory codon selection that reduces the observed GC content bias in IDRs. This selection is dependent on the overall GC content of the organism. The codon selection bias manifests as use of infrequent, AT-rich codons in encoding IDRs. Further, we find these relationships to be independent of the intrinsic disorder prediction method used, and independent of estimated translation efficiency. These observations are consistent with the previous work, and we speculate on whether the observed biases are causal or symptomatic of other driving forces.
Affiliation(s)
- Christopher J Oldfield
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA.
| | - Zhenling Peng
- Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
| | - Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, 33612, USA
- Institute for Biological Instrumentation, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA.
| |
|
24
|
Almagro Armenteros JJ, Salvatore M, Emanuelsson O, Winther O, von Heijne G, Elofsson A, Nielsen H. Detecting sequence signals in targeting peptides using deep learning. Life Sci Alliance 2019; 2:2/5/e201900429. [PMID: 31570514 DOI: 10.1101/639203] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/15/2019] [Revised: 09/18/2019] [Accepted: 09/18/2019] [Indexed: 05/25/2023] Open
Abstract
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, that is, the one following the initial methionine, has a strong influence on the classification. We observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position 2, compared with 20% in other plant proteins. We also note that in fungi and single-celled eukaryotes, less than 30% of the targeting peptides have an amino acid that allows the removal of the N-terminal methionine compared with 60% for the proteins without targeting peptide. The importance of this feature for predictions has not been highlighted before.
Affiliation(s)
- Jose Juan Almagro Armenteros
- Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Marco Salvatore
- Science for Life Laboratory, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
| | - Olof Emanuelsson
- Science for Life Laboratory, Solna, Sweden
- Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH-Royal Institute of Technology, Stockholm, Sweden
| | - Ole Winther
- DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
- Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark
- Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark
| | - Gunnar von Heijne
- Science for Life Laboratory, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
| | - Arne Elofsson
- Science for Life Laboratory, Solna, Sweden
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
| | - Henrik Nielsen
- Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
| |
|
25
|
Almagro Armenteros JJ, Salvatore M, Emanuelsson O, Winther O, von Heijne G, Elofsson A, Nielsen H. Detecting sequence signals in targeting peptides using deep learning. Life Sci Alliance 2019; 2:2/5/e201900429. [PMID: 31570514 PMCID: PMC6769257 DOI: 10.26508/lsa.201900429] [Citation(s) in RCA: 410] [Impact Index Per Article: 82.0] [Received: 05/15/2019] [Revised: 09/18/2019] [Accepted: 09/18/2019] [Indexed: 11/24/2022] Open
Abstract
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, that is, the one following the initial methionine, has a strong influence on the classification. We observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position 2, compared with 20% in other plant proteins. We also note that in fungi and single-celled eukaryotes, less than 30% of the targeting peptides have an amino acid that allows the removal of the N-terminal methionine compared with 60% for the proteins without targeting peptide. The importance of this feature for predictions has not been highlighted before.
Affiliation(s)
- Jose Juan Almagro Armenteros
- Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Marco Salvatore
- Science for Life Laboratory, Solna, Sweden.,Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
| | - Olof Emanuelsson
- Science for Life Laboratory, Solna, Sweden.,Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH-Royal Institute of Technology, Stockholm, Sweden
| | - Ole Winther
- DTU Compute, Technical University of Denmark, Kongen Lyngby, Denmark.,Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark.,Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark
| | - Gunnar von Heijne
- Science for Life Laboratory, Solna, Sweden.,Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
| | - Arne Elofsson
- Science for Life Laboratory, Solna, Sweden .,Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
| | - Henrik Nielsen
- Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongen Lyngby, Denmark
| |
Collapse
|
26
|
Nielly-Thibault L, Landry CR. Differences Between the Raw Material and the Products of de Novo Gene Birth Can Result from Mutational Biases. Genetics 2019; 212:1353-1366. [PMID: 31227545 PMCID: PMC6707459 DOI: 10.1534/genetics.119.302187] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/18/2019] [Accepted: 06/14/2019] [Indexed: 12/03/2022] Open
Abstract
Proteins are among the most important constituents of biological systems. Because all protein-coding genes have a noncoding ancestral form, the properties of noncoding sequences and how they shape the birth of novel proteins may influence the structure and function of all proteins. Differences between the properties of young proteins and random expectations from noncoding sequences have previously been interpreted as the result of natural selection. However, interpreting such deviations requires a yet-unattained understanding of the raw material of de novo gene birth and its relation to novel functional proteins. We mathematically show that the average properties and selective filtering of the "junk" polypeptides of which this raw material is composed are not the only factors influencing the properties of novel functional proteins. We find that in some biological scenarios, they also depend on the variance of the properties of junk polypeptides and their correlation with the rate of allelic turnover, which may itself depend on mutational biases. This suggests for instance that any property of polypeptides that accelerates their exploration of the sequence space could be overrepresented in novel functional proteins, even if it has a limited effect on adaptive value. To exemplify the use of our general theoretical results, we build a simple model that predicts the mean length and mean intrinsic disorder of novel functional proteins from the genomic GC content and a single evolutionary parameter. This work provides a theoretical framework that can guide the prediction and interpretation of results when studying the de novo emergence of protein-coding genes.
Affiliation(s)
- Lou Nielly-Thibault
  - Institut de Biologie Intégrative et des Systèmes, Université Laval, Quebec, Quebec G1V 0A6, Canada
  - Département de Biologie, Université Laval, Quebec, Quebec G1V 0A6, Canada
  - Département de Biochimie, de Microbiologie et de Bio-Informatique, Université Laval, Quebec, Quebec G1V 0A6, Canada
  - PROTEO, Quebec, Quebec G1V 0A6, Canada
- Christian R Landry
  - Institut de Biologie Intégrative et des Systèmes, Université Laval, Quebec, Quebec G1V 0A6, Canada
  - Département de Biologie, Université Laval, Quebec, Quebec G1V 0A6, Canada
  - Département de Biochimie, de Microbiologie et de Bio-Informatique, Université Laval, Quebec, Quebec G1V 0A6, Canada
  - PROTEO, Quebec, Quebec G1V 0A6, Canada
|
27
|
Basile W, Salvatore M, Bassot C, Elofsson A. Why do eukaryotic proteins contain more intrinsically disordered regions? PLoS Comput Biol 2019; 15:e1007186. [PMID: 31329574 PMCID: PMC6675126 DOI: 10.1371/journal.pcbi.1007186] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 12/28/2018] [Revised: 08/01/2019] [Accepted: 06/14/2019] [Indexed: 12/12/2022] Open
Abstract
Intrinsic disorder is more abundant in eukaryotic than prokaryotic proteins. Methods predicting intrinsic disorder are based on the amino acid sequence of a protein. Therefore, there must exist an underlying difference in the sequences between eukaryotic and prokaryotic proteins causing the (predicted) difference in intrinsic disorder. By comparing proteins from complete eukaryotic and prokaryotic proteomes, we show that the difference in intrinsic disorder emerges from the linker regions connecting Pfam domains. Eukaryotic proteins have more extended linker regions, and in addition, the eukaryotic linkers are significantly more disordered, 38% vs. 12-16% disordered residues. Next, we examined the underlying reason for the increase in disorder in eukaryotic linkers, and we found that the changes in abundance of only three amino acids cause the increase. Eukaryotic proteins contain 8.6% serine, while prokaryotic proteins have 6.5%; eukaryotic proteins also contain 5.4% proline and 5.3% isoleucine compared with 4.0% proline and ≈7.5% isoleucine in the prokaryotes. All three differences contribute to the increased disorder in eukaryotic proteins. It is tempting to speculate that the increase in serine frequencies in eukaryotes is related to regulation by kinases, but direct evidence for this is lacking. The differences are observed in all phyla, protein families, structural regions and type of protein but are most pronounced in disordered and linker regions. The observation that differences in the abundance of three amino acids cause the difference in disorder between eukaryotic and prokaryotic proteins raises the question: Are amino acid frequencies different in eukaryotic linkers because the linkers are more disordered or do the differences cause the increased disorder?
Intrinsic disorder is essential for various functions in eukaryotic cells and is a signature of eukaryotic proteins. Here, we try to understand the origin of the difference in disorder between eukaryotic and prokaryotic proteins. We show that eukaryotic proteins contain more extended linker regions and that these linker regions are significantly more disordered. Further, we show, for the first time, that the difference in disorder originates from a systematic difference in amino acid frequencies between eukaryotic and prokaryotic proteins. Three amino acids contribute to the difference in disorder; serine and proline are more abundant in eukaryotic linkers, while isoleucine is less frequent. These shifts in frequencies are observed in all phyla, protein families, structural regions and type of protein but are most pronounced in disordered and linker regions. It is tempting to speculate that the increase in serine frequencies in eukaryotes is related to regulation by kinases, but direct evidence for this is lacking. In any case, the widespread occurrence of the shifts in abundance indicates that the differences are ancient and caused by some not yet fully understood selective difference acting on eukaryotic and prokaryotic proteins.
Affiliation(s)
- Walter Basile
  - Science for Life Laboratory, Stockholm University, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Marco Salvatore
  - Science for Life Laboratory, Stockholm University, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Claudio Bassot
  - Science for Life Laboratory, Stockholm University, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Arne Elofsson
  - Science for Life Laboratory, Stockholm University, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
  - Swedish e-Science Research Center (SeRC), Stockholm, Sweden
|
28
|
Affiliation(s)
- Stephen Branden Van Oss
  - Department of Computational and Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
- Anne-Ruxandra Carvunis
  - Department of Computational and Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
|
29
|
Vakirlis N, Hebert AS, Opulente DA, Achaz G, Hittinger CT, Fischer G, Coon JJ, Lafontaine I. A Molecular Portrait of De Novo Genes in Yeasts. Mol Biol Evol 2019; 35:631-645. [PMID: 29220506 DOI: 10.1093/molbev/msx315] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Indexed: 12/13/2022] Open
Abstract
New genes, with novel protein functions, can evolve "from scratch" out of intergenic sequences. These de novo genes can integrate the cell's genetic network and drive important phenotypic innovations. Therefore, identifying de novo genes and understanding how the transition from noncoding to coding occurs are key problems in evolutionary biology. However, identifying de novo genes is a difficult task, hampered by the presence of remote homologs, fast evolving sequences and erroneously annotated protein coding genes. To overcome these limitations, we developed a procedure that handles the usual pitfalls in de novo gene identification and predicted the emergence of 703 de novo gene candidates in 15 yeast species from 2 genera whose phylogeny spans at least 100 million years of evolution. We validated 85 candidates by proteomic data, providing new translation evidence for 25 of them through mass spectrometry experiments. We also unambiguously identified the mutations that enabled the transition from noncoding to coding for 30 Saccharomyces de novo genes. We established that de novo gene origination is a widespread phenomenon in yeasts, only a few being ultimately maintained by selection. We also found that de novo genes preferentially emerge next to divergent promoters in GC-rich intergenic regions where the probability of finding a fortuitous and transcribed ORF is the highest. Finally, we found a more than 3-fold enrichment of de novo genes at recombination hot spots, which are GC-rich and nucleosome-free regions, suggesting that meiotic recombination contributes to de novo gene emergence in yeasts.
Affiliation(s)
- Nikolaos Vakirlis
  - Sorbonne Universités, UPMC Univ Paris 06, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative UMR7238, 75005 Paris, France
- Alex S Hebert
  - Genome Center of Wisconsin, University of Wisconsin-Madison, Madison, WI
  - DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI
- Dana A Opulente
  - Laboratory of Genetics, Genome Center of Wisconsin, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI
- Guillaume Achaz
  - Atelier de BioInformatique, ISyEB UMR7205 Muséum National d'Histoire Naturelle, Paris, France
  - SMILE Group, CIRB UMR7241, Collège de France, Paris, France
- Chris Todd Hittinger
  - DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI
  - Laboratory of Genetics, Genome Center of Wisconsin, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI
- Gilles Fischer
  - Sorbonne Universités, UPMC Univ Paris 06, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative UMR7238, 75005 Paris, France
- Joshua J Coon
  - Genome Center of Wisconsin, University of Wisconsin-Madison, Madison, WI
  - DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI
  - Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI
  - Morgridge Institute for Research, Madison, WI
- Ingrid Lafontaine
  - Atelier de BioInformatique, ISyEB UMR7205 Muséum National d'Histoire Naturelle, Paris, France
  - Sorbonne Universités, UPMC Univ Paris 06, CNRS, Institut de Biologie Physico-Chimique, Physiologie Membranaire et Moléculaire du Chloroplaste UMR7141, 75005 Paris, France
|
30
|
Castillo AI, Nelson ADL, Lyons E. Tail Wags the Dog? Functional Gene Classes Driving Genome-Wide GC Content in Plasmodium spp. Genome Biol Evol 2019; 11:497-507. [PMID: 30689842 PMCID: PMC6385630 DOI: 10.1093/gbe/evz015] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Accepted: 01/18/2019] [Indexed: 01/16/2023] Open
Abstract
Plasmodium parasites are valuable models to understand how nucleotide composition affects mutation, diversification, and adaptation. No other observed eukaryotes have undergone such large changes in genomic Guanine-Cytosine (GC) content as seen in the genus Plasmodium (∼30% within 35-40 Myr). Although mutational biases are known to influence GC content in the human-infective Plasmodium vivax and Plasmodium falciparum, no study has addressed how different gene functional classes contribute to genus-wide compositional changes, or if Plasmodium GC content variation is driven by natural selection. Here, we tested the hypothesis that certain gene processes and functions drive variation in global GC content between Plasmodium species. We performed a large-scale comparative genomic analysis using the genomes and predicted genes of 17 Plasmodium species encompassing a wide genomic GC content range. Genic GC content was sorted and divided into ten equally sized quantiles that were then assessed for functional enrichment classes. Consistent with the hypothesis that selection on gene classes may drive genomic GC content, trans-membrane proteins were enriched within extreme GC content quantiles (Q1 and Q10). Specifically, variant surface antigens, which primarily interact with vertebrate immune systems, showed skewed GC content distributions compared with other trans-membrane proteins. Although a definitive causation linking GC content, expression, and positive selection within variant surface antigens from Plasmodium vivax, Plasmodium berghei, and Plasmodium falciparum could not be established, we found that regardless of genomic nucleotide composition, genic GC content and expression were positively correlated during trophozoite stages. Overall, these data suggest that, alongside mutational biases, functional protein classes drive Plasmodium GC content change.
Affiliation(s)
- Andreina I Castillo
  - School of Environmental Science, Policy, and Management, University of California, Berkeley
- Eric Lyons
  - BIO5 Institute, School of Plant Sciences, University of Arizona
|
31
|
Casola C. From De Novo to "De Nono": The Majority of Novel Protein-Coding Genes Identified with Phylostratigraphy Are Old Genes or Recent Duplicates. Genome Biol Evol 2018; 10:2906-2918. [PMID: 30346517 PMCID: PMC6239577 DOI: 10.1093/gbe/evy231] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Accepted: 10/10/2018] [Indexed: 12/11/2022] Open
Abstract
The evolution of novel protein-coding genes from noncoding regions of the genome is one of the most compelling pieces of evidence for genetic innovations in nature. One popular approach to identify de novo genes is phylostratigraphy, which consists of determining the approximate time of origin (age) of a gene based on its distribution along a species phylogeny. Several studies have revealed significant flaws in determining the age of genes, including de novo genes, using phylostratigraphy alone. However, the rate of false positives in de novo gene surveys, based on phylostratigraphy, remains unknown. Here, I reanalyze the findings from three studies, two of which identified tens to hundreds of rodent-specific de novo genes adopting a phylostratigraphy-centered approach. Most putative de novo genes discovered in these investigations are no longer included in recently updated mouse gene sets. Using a combination of synteny information and sequence similarity searches, I show that ∼60% of the remaining 381 putative de novo genes share homology with genes from other vertebrates, originated through gene duplication, and/or share no synteny information with nonrodent mammals. These results led to an estimated rate of ∼12 de novo genes per million years in mouse. Contrary to a previous study (Wilson BA, Foy SG, Neme R, Masel J. 2017. Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nat Ecol Evol. 1:0146), I found no evidence supporting the preadaptation hypothesis of de novo gene formation. Nearly half of the de novo genes confirmed in this study are within older genes, indicating that co-option of preexisting regulatory regions and a higher GC content may facilitate the origin of novel genes.
Affiliation(s)
- Claudio Casola
  - Department of Ecosystem Science and Management, Texas A&M University
|
32
|
Incipient de novo genes can evolve from frozen accidents that escaped rapid transcript turnover. Nat Ecol Evol 2018; 2:1626-1632. [DOI: 10.1038/s41559-018-0639-7] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 07/24/2017] [Accepted: 07/09/2018] [Indexed: 11/08/2022]
|
33
|
Abstract
De novo genes are very important for evolutionary innovation. However, how these genes originate and spread remains largely unknown. To better understand this, we rigorously searched for de novo genes in Saccharomyces cerevisiae S288C and examined their spread and fixation in the population. Here, we identified 84 de novo genes in S. cerevisiae S288C since the divergence with their sister groups. Transcriptome and ribosome profiling data revealed at least 8 (10%) and 28 (33%) de novo genes being expressed and translated only under specific conditions, respectively. DNA microarray data, based on 2-fold change, showed that 87% of the de novo genes are regulated during various biological processes, such as nutrient utilization and sporulation. Our comparative and evolutionary analyses further revealed that some factors, including single nucleotide polymorphism (SNP)/indel mutation, high GC content, and DNA shuffling, contribute to the birth of de novo genes, while domestication and natural selection drive the spread and fixation of these genes. Finally, we also provide evidence suggesting the possible parallel origin of a de novo gene between S. cerevisiae and Saccharomyces paradoxus. Together, our study provides several new insights into the origin and spread of de novo genes.
Emergence of de novo genes has occurred in many lineages during evolution, but the birth, spread, and function of these genes remain unresolved. Here we have searched for de novo genes from Saccharomyces cerevisiae S288C using rigorous methods, which reduced the effects of bad annotation and genomic gaps on the identification of de novo genes. Through this analysis, we have found 84 new genes originating de novo from previously noncoding regions, 87% of which are very likely involved in various biological processes. We noticed that 10% and 33% of de novo genes were only expressed and translated under specific conditions, respectively; therefore, verification of de novo genes through transcriptome and ribosome profiling, especially from limited expression data, may underestimate the number of bona fide new genes. We further show that SNP/indel mutation, high GC content, and DNA shuffling could be involved in the birth of de novo genes, while domestication and natural selection drive the spread and fixation of these genes. Finally, we provide evidence suggesting the possible parallel origin of a new gene.
|