1
|
Glaser-Schmitt A, Lebherz M, Saydam E, Bornberg-Bauer E, Parsch J. Expression of De Novo Open Reading Frames in Natural Populations of Drosophila melanogaster. JOURNAL OF EXPERIMENTAL ZOOLOGY. PART B, MOLECULAR AND DEVELOPMENTAL EVOLUTION 2025. [PMID: 40231390 DOI: 10.1002/jez.b.23297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/12/2025] [Revised: 03/14/2025] [Accepted: 04/03/2025] [Indexed: 04/16/2025]
Abstract
De novo genes, which originate from noncoding DNA, are known to have a high rate of turnover over short evolutionary timescales, such as within a species. Thus, their expression is often lineage- or genetic background-specific. However, little is known about their levels and breadth of expression as populations of a species diverge. In this study, we utilized publicly available RNA-seq data to examine the expression of newly evolved open reading frames (neORFs) in comparison to non- and protein-coding genes in Drosophila melanogaster populations from the derived species range in Europe and the ancestral range in sub-Saharan Africa. Our datasets included two adult tissue types as well as whole bodies at two temperatures for both sexes and three larval/prepupal developmental stages in a single tissue and sex, which allowed us to examine neORF expression and divergence across multiple sample types as well as sex and population. We detected a relatively large proportion (approximately 50%) of annotated neORFs as expressed in the population samples, with neORFs often showing greater expression divergence between populations than non- or protein-coding genes. However, differential expression of neORFs between populations tended to occur in a sample type-specific manner. On the other hand, neORFs displayed less sex-biased expression than the other two gene classes, with the majority of sex-biased neORFs detected in whole bodies, which may be attributable to the presence of the gonads. We also found that neORFs shared among multiple lines in the original set of inbred lines in which they were first detected were more likely to be both expressed and differentially expressed in the new population samples, suggesting that neORFs at a higher frequency (i.e. present in more individuals) within a species are more likely to be functional.
Collapse
Affiliation(s)
- Amanda Glaser-Schmitt
- Division of Evolutionary Biology, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Bavaria, Germany
| | - Marie Lebherz
- Institute for Evolution and Biodiversity, University of Münster, Münster, North Rhine-Westphalia, Germany
| | - Ezgi Saydam
- Division of Evolutionary Biology, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Bavaria, Germany
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, University of Münster, Münster, North Rhine-Westphalia, Germany
| | - John Parsch
- Division of Evolutionary Biology, Faculty of Biology, Ludwig-Maximilians-Universität München, Munich, Bavaria, Germany
| |
Collapse
|
2
|
Hannon Bozorgmehr J. The De Novo Emergence of Two Brain Genes in the Human Lineage Appears to be Unsupported. J Mol Evol 2025; 93:3-10. [PMID: 39725692 DOI: 10.1007/s00239-024-10227-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024]
Abstract
Recently, certain studies have claimed that cognitive features and pathologies unique to humans can be traced to certain changes in the nervous system. These are caused by genes that have likely evolved "from scratch," not having any coding precursors. The translated proteins would not appear outside of the human lineage and any orthologs in other species should be non-coding. This contrasts with research that has identified a decisive role for duplication, and modifications to regulatory sequences, for such phenotypic traits. Closer examination, however, reveals that the inferred lineage-specific emergence of at least two of these genes is likely a misinterpretation owing to a lack of peptide verification, experimental oversights, and insufficient species comparisons. A possible pseudogenic origin is proposed for one of them. The implications of these claims for the study of molecular evolution are discussed.
Collapse
|
3
|
de Souza EV, Dalberto PF, Miranda AC, Saghatelian A, Pinto AM, Basso LA, Machado P, Bizarro CV. Large-scale proteogenomics characterization of microproteins in Mycobacterium tuberculosis. Sci Rep 2024; 14:31186. [PMID: 39732784 DOI: 10.1038/s41598-024-82465-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2023] [Accepted: 12/05/2024] [Indexed: 12/30/2024] Open
Abstract
Tuberculosis remains a burden to this day, due to the rise of multi and extensively drug-resistant bacterial strains. The genome of Mycobacterium tuberculosis (Mtb) strain H37Rv underwent an annotation process that excluded small Open Reading Frames (smORFs), which encode a class of peptides and small proteins collectively known as microproteins. As a result, there is an overlooked part of its proteome that is a rich source of potentially essential, druggable molecular targets. Here, we employed our recently developed proteogenomics pipeline to identify novel microproteins encoded by non-canonical smORFs in the genome of Mtb using hundreds of mass spectrometry experiments in a large-scale approach. We found protein evidence for hundreds of unannotated microproteins and identified smORFs essential for bacterial survival and involved in bacterial growth and virulence. Moreover, many smORFs are co-expressed and share operons with a myriad of biologically relevant genes and play a role in antibiotic response. Together, our data presents a resource of unknown genes that play a role in the success of Mtb as a widespread pathogen.
Collapse
Affiliation(s)
- Eduardo V de Souza
- Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF) and Instituto Nacional de Ciência e Tecnologia em Tuberculose (INCT-TB), Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
- Programa de Pós-Graduação em Biologia Celular e Molecular, Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
- Clayton Foundation Laboratories for Peptide Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
| | - Pedro F Dalberto
- Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF) and Instituto Nacional de Ciência e Tecnologia em Tuberculose (INCT-TB), Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
| | - Adriana C Miranda
- Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF) and Instituto Nacional de Ciência e Tecnologia em Tuberculose (INCT-TB), Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
- Programa de Pós-Graduação em Biologia Celular e Molecular, Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
| | - Alan Saghatelian
- Clayton Foundation Laboratories for Peptide Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
| | - Antonio M Pinto
- Clayton Foundation Laboratories for Peptide Biology, Salk Institute for Biological Studies, La Jolla, CA, USA
| | - Luiz A Basso
- Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF) and Instituto Nacional de Ciência e Tecnologia em Tuberculose (INCT-TB), Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
- Programa de Pós-Graduação em Biologia Celular e Molecular, Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
| | - Pablo Machado
- Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF) and Instituto Nacional de Ciência e Tecnologia em Tuberculose (INCT-TB), Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
- Programa de Pós-Graduação em Biologia Celular e Molecular, Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil
| | - Cristiano V Bizarro
- Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF) and Instituto Nacional de Ciência e Tecnologia em Tuberculose (INCT-TB), Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil.
- Programa de Pós-Graduação em Biologia Celular e Molecular, Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Rio Grande do Sul, 90619-900, Brazil.
| |
Collapse
|
4
|
Rives N, Lamba V, Cheng CHC, Zhuang X. Diverse Origins of Near-Identical Antifreeze Proteins in Unrelated Fish Lineages Provide Insights Into Evolutionary Mechanisms of New Gene Birth and Protein Sequence Convergence. Mol Biol Evol 2024; 41:msae182. [PMID: 39213383 PMCID: PMC11403476 DOI: 10.1093/molbev/msae182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2024] [Revised: 07/04/2024] [Accepted: 08/13/2024] [Indexed: 09/04/2024] Open
Abstract
Determining the origins of novel genes and the mechanisms driving the emergence of new functions is challenging yet crucial for understanding evolutionary innovations. Recently evolved fish antifreeze proteins (AFPs) offer a unique opportunity to explore these processes, particularly the near-identical type I AFP (AFPI) found in four phylogenetically divergent fish taxa. This study tested the hypothesis of protein sequence convergence beyond functional convergence in three unrelated AFPI-bearing fish lineages. Through comprehensive comparative analyses of newly sequenced genomes of winter flounder and grubby sculpin, along with available high-quality genomes of cunner and 14 other related species, the study revealed that near-identical AFPI proteins originated from distinct genetic precursors in each lineage. Each lineage independently evolved a de novo coding region for the novel ice-binding protein while repurposing fragments from their respective ancestors into potential regulatory regions, representing partial de novo origination-a process that bridges de novo gene formation and the neofunctionalization of duplicated genes. The study supports existing models of new gene origination and introduces new ones: the innovation-amplification-divergence model, where novel changes precede gene duplication; the newly proposed duplication-degeneration-divergence model, which describes new functions arising from degenerated pseudogenes; and the duplication-degeneration-divergence gene fission model, where each new sibling gene differentially degenerates and renovates distinct functional domains from their parental gene. These findings highlight the diverse evolutionary pathways through which a novel functional gene with convergent sequences at the protein level can evolve across divergent species, advancing our understanding of the mechanistic intricacies in new gene formation.
Collapse
Affiliation(s)
- Nathan Rives
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR, USA
| | - Vinita Lamba
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR, USA
| | - C H Christina Cheng
- Department of Evolution, Ecology and Behavior, University of Illinois, Urbana-Champaign, IL, USA
| | - Xuan Zhuang
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR, USA
| |
Collapse
|
5
|
Middendorf L, Ravi Iyengar B, Eicholt LA. Sequence, Structure, and Functional Space of Drosophila De Novo Proteins. Genome Biol Evol 2024; 16:evae176. [PMID: 39212966 PMCID: PMC11363682 DOI: 10.1093/gbe/evae176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/29/2024] [Indexed: 09/04/2024] Open
Abstract
During de novo emergence, new protein coding genes emerge from previously nongenic sequences. The de novo proteins they encode are dissimilar in composition and predicted biochemical properties to conserved proteins. However, functional de novo proteins indeed exist. Both identification of functional de novo proteins and their structural characterization are experimentally laborious. To identify functional and structured de novo proteins in silico, we applied recently developed machine learning based tools and found that most de novo proteins are indeed different from conserved proteins both in their structure and sequence. However, some de novo proteins are predicted to adopt known protein folds, participate in cellular reactions, and to form biomolecular condensates. Apart from broadening our understanding of de novo protein evolution, our study also provides a large set of testable hypotheses for focused experimental studies on structure and function of de novo proteins in Drosophila.
Collapse
Affiliation(s)
- Lasse Middendorf
- Institute for Evolution and Biodiversity, University of Muenster, Huefferstrasse 1, 48149 Muenster, Germany
| | - Bharat Ravi Iyengar
- Institute for Evolution and Biodiversity, University of Muenster, Huefferstrasse 1, 48149 Muenster, Germany
| | - Lars A Eicholt
- Institute for Evolution and Biodiversity, University of Muenster, Huefferstrasse 1, 48149 Muenster, Germany
| |
Collapse
|
6
|
Iyengar BR, Grandchamp A, Bornberg-Bauer E. How antisense transcripts can evolve to encode novel proteins. Nat Commun 2024; 15:6187. [PMID: 39043684 PMCID: PMC11266595 DOI: 10.1038/s41467-024-50550-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2023] [Accepted: 07/12/2024] [Indexed: 07/25/2024] Open
Abstract
Protein coding features can emerge de novo in non coding transcripts, resulting in emergence of new protein coding genes. Studies across many species show that a large fraction of evolutionarily novel non-coding RNAs have an antisense overlap with protein coding genes. The open reading frames (ORFs) in these antisense RNAs could also overlap with existing ORFs. In this study, we investigate how the evolution an ORF could be constrained by its overlap with an existing ORF in three different reading frames. Using a combination of mathematical modeling and genome/transcriptome data analysis in two different model organisms, we show that antisense overlap can increase the likelihood of ORF emergence and reduce the likelihood of ORF loss, especially in one of the three reading frames. In addition to rationalising the repeatedly reported prevalence of de novo emerged genes in antisense transcripts, our work also provides a generic modeling and an analytical framework that can be used to understand evolution of antisense genes.
Collapse
Affiliation(s)
- Bharat Ravi Iyengar
- Institute for Evolution and Biodiversity, University of Münster, Hüfferstrasse 1, Münster, Germany.
| | - Anna Grandchamp
- Institute for Evolution and Biodiversity, University of Münster, Hüfferstrasse 1, Münster, Germany
- Aix-Marseille Université, INSERM, TAGC, Marseille, France
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, University of Münster, Hüfferstrasse 1, Münster, Germany
- Department of Protein Evolution, Max Planck Institute for Biology Tübingen, Max-Planck-Ring 5, Tübingen, Germany
| |
Collapse
|
7
|
Rives N, Lamba V, Cheng CHC, Zhuang X. Diverse origins of near-identical antifreeze proteins in unrelated fish lineages provide insights into evolutionary mechanisms of new gene birth and protein sequence convergence. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.12.584730. [PMID: 38559027 PMCID: PMC10980009 DOI: 10.1101/2024.03.12.584730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/04/2024]
Abstract
Determining the origins of novel genes and the genetic mechanisms underlying the emergence of new functions is challenging yet crucial for understanding evolutionary innovations. The convergently evolved fish antifreeze proteins provide excellent opportunities to investigate evolutionary origins and pathways of new genes. Particularly notable is the near-identical type I antifreeze proteins (AFPI) in four phylogenetically divergent fish taxa. This study tested the hypothesis of protein sequence convergence beyond functional convergence in three unrelated AFPI-bearing fish lineages, revealing different paths by which a similar protein arose from diverse genomic resources. Comprehensive comparative analyses of de novo sequenced genome of the winter flounder and grubby sculpin, available high-quality genome of the cunner and 14 other relevant species found that the near-identical AFPI originated from a distinct genetic precursor in each lineage. Each independently evolved a coding region for the novel ice-binding protein while retaining sequence identity in the regulatory regions with their respective ancestor. The deduced evolutionary processes and molecular mechanisms are consistent with the Innovation-Amplification-Divergence (IAD) model applicable to AFPI formation in all three lineages, a new Duplication-Degeneration-Divergence (DDD) model we propose for the sculpin lineage, and a DDD model with gene fission for the cunner lineage. This investigation illustrates the multiple ways by which a novel functional gene with sequence convergence at the protein level could evolve across divergent species, advancing our understanding of the mechanistic intricacies in new gene formation.
Collapse
|
8
|
Delihas N. Evolution of a Human-Specific De Novo Open Reading Frame and Its Linked Transcriptional Silencer. Int J Mol Sci 2024; 25:3924. [PMID: 38612733 PMCID: PMC11011693 DOI: 10.3390/ijms25073924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2024] [Revised: 03/23/2024] [Accepted: 03/26/2024] [Indexed: 04/14/2024] Open
Abstract
In the human genome, two short open reading frames (ORFs) separated by a transcriptional silencer and a small intervening sequence stem from the gene SMIM45. The two ORFs show different translational characteristics, and they also show divergent patterns of evolutionary development. The studies presented here describe the evolution of the components of SMIM45. One ORF consists of an ultra-conserved 68 amino acid (aa) sequence, whose origins can be traced beyond the evolutionary age of divergence of the elephant shark, ~462 MYA. The silencer also has ancient origins, but it has a complex and divergent pattern of evolutionary formation, as it overlaps both at the 68 aa ORF and the intervening sequence. The other ORF consists of 107 aa. It develops during primate evolution but is found to originate de novo from an ancestral non-coding genomic region with root origins within the Afrothere clade of placental mammals, whose evolutionary age of divergence is ~99 MYA. The formation of the complete 107 aa ORF during primate evolution is outlined, whereby sequence development is found to occur through biased mutations, with disruptive random mutations that also occur but lead to a dead-end. The 107 aa ORF is of particular significance, as there is evidence to suggest it is a protein that may function in human brain development. Its evolutionary formation presents a view of a human-specific ORF and its linked silencer that were predetermined in non-primate ancestral species. The genomic position of the silencer offers interesting possibilities for the regulation of transcription of the 107 aa ORF. A hypothesis is presented with respect to possible spatiotemporal expression of the 107 aa ORF in embryonic tissues.
Collapse
Affiliation(s)
- Nicholas Delihas
- Department of Microbiology and Immunology, Renaissance School of Medicine, Stony Brook University, Stony Brook, NY 11794, USA
| |
Collapse
|
9
|
Grandchamp A, Czuppon P, Bornberg-Bauer E. Quantification and modeling of turnover dynamics of de novo transcripts in Drosophila melanogaster. Nucleic Acids Res 2024; 52:274-287. [PMID: 38000384 PMCID: PMC10783523 DOI: 10.1093/nar/gkad1079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Revised: 10/13/2023] [Accepted: 10/28/2023] [Indexed: 11/26/2023] Open
Abstract
Most of the transcribed eukaryotic genomes are composed of non-coding transcripts. Among these transcripts, some are newly transcribed when compared to outgroups and are referred to as de novo transcripts. De novo transcripts have been shown to play a major role in genomic innovations. However, little is known about the rates at which de novo transcripts are gained and lost in individuals of the same species. Here, we address this gap and estimate the de novo transcript turnover rate with an evolutionary model. We use DNA long reads and RNA short reads from seven geographically remote samples of inbred individuals of Drosophila melanogaster to detect de novo transcripts that are gained on a short evolutionary time scale. Overall, each sampled individual contains around 2500 unspliced de novo transcripts, with most of them being sample specific. We estimate that around 0.15 transcripts are gained per year, and that each gained transcript is lost at a rate around 5× 10-5 per year. This high turnover of transcripts suggests frequent exploration of new genomic sequences within species. These rate estimates are essential to comprehend the process and timescale of de novo gene birth.
Collapse
Affiliation(s)
- Anna Grandchamp
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | - Peter Czuppon
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Department of Protein Evolution, Max Planck Institute for Biology, Tübingen, Germany
| |
Collapse
|
10
|
Ardern Z. Alternative Reading Frames are an Underappreciated Source of Protein Sequence Novelty. J Mol Evol 2023; 91:570-580. [PMID: 37326679 DOI: 10.1007/s00239-023-10122-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2022] [Accepted: 05/31/2023] [Indexed: 06/17/2023]
Abstract
Protein-coding DNA sequences can be translated into completely different amino acid sequences if the nucleotide triplets used are shifted by a non-triplet amount on the same DNA strand or by translating codons from the opposite strand. Such "alternative reading frames" of protein-coding genes are a major contributor to the evolution of novel protein products. Recent studies demonstrating this include examples across the three domains of cellular life and in viruses. These sequences increase the number of trials potentially available for the evolutionary invention of new genes and also have unusual properties which may facilitate gene origin. There is evidence that the structure of the standard genetic code contributes to the features and gene-likeness of some alternative frame sequences. These findings have important implications across diverse areas of molecular biology, including for genome annotation, structural biology, and evolutionary genomics.
Collapse
|
11
|
Sanejouand YH. On the Unknown Proteins of Eukaryotic Proteomes. J Mol Evol 2023:10.1007/s00239-023-10116-1. [PMID: 37219573 DOI: 10.1007/s00239-023-10116-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2022] [Accepted: 05/07/2023] [Indexed: 05/24/2023]
Abstract
To study unknown proteins on a large scale, a reference system has been set up for the three better studied eukaryotic kingdoms, built with 36 proteomes as taxonomically diverse as possible. Proteins from 362 other eukaryotic proteomes with no known homologue in this set were then analyzed, focusing noteworthy on singletons, that is, on such proteins with no known homologue in their own proteome. Consistently, for a given species, no more than 12% of the singletons thus found are known at the protein level, according to Uniprot. In addition, since they rely on the information found in the alignment of homologous sequences, predictions of AlphaFold2 for their tridimensional structure are poor. In the case of metazoan species, the number of singletons rarely exceeds 1000 for the species the closest to the reference system (divergence times below 75 Myr). Interestingly, in the cases of viridiplantae and fungi, larger amounts of singletons are found for such species, as if the timescale on which singletons are added to proteomes were different in metazoa and in other eukaryotic kingdoms. In order to confirm this phenomenon, further studies of proteomes closer to those of the reference system are, however, needed.
Collapse
Affiliation(s)
- Yves-Henri Sanejouand
- US2B, UMR 6286 of CNRS, Nantes University, rue de la Houssinière, 44322, Nantes, France.
| |
Collapse
|
12
|
Liu J, Yuan R, Shao W, Wang J, Silman I, Sussman JL. Do "Newly Born" orphan proteins resemble "Never Born" proteins? A study using three deep learning algorithms. Proteins 2023. [PMID: 37092778 DOI: 10.1002/prot.26496] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2022] [Revised: 02/26/2023] [Accepted: 04/01/2023] [Indexed: 04/25/2023]
Abstract
"Newly Born" proteins, devoid of detectable homology to any other proteins, known as orphan proteins, occur in a single species or within a taxonomically restricted gene family. They are generated by the expression of novel open reading frames, and appear throughout evolution. We were curious if three recently developed programs for predicting protein structures, namely, AlphaFold2, RoseTTAFold, and ESMFold, might be of value for comparison of such "Newly Born" proteins to random polypeptides with amino acid content similar to that of native proteins, which have been called "Never Born" proteins. The programs were used to compare the structures of two sets of "Never Born" proteins that had been expressed-Group 1, which had been shown experimentally to possess substantial secondary structure, and Group 3, which had been shown to be intrinsically disordered. Overall, although the models generated were scored as being of low quality, they nevertheless revealed some general principles. Specifically, all four members of Group 1 were predicted to be compact by all three algorithms, in agreement with the experimental data, whereas the members of Group 3 were predicted to be very extended, as would be expected for intrinsically disordered proteins, again consistent with the experimental data. These predicted differences were shown to be statistically significant by comparing their accessible surface areas. The three programs were then used to predict the structures of three orphan proteins whose crystal structures had been solved, two of which display novel folds. Surprisingly, only for the protein which did not have a novel fold, and was taxonomically restricted, rather than being a true orphan, did all three algorithms predict very similar, high-quality structures, closely resembling the crystal structure. Finally, they were used to predict the structures of seven orphan proteins with well-identified biological functions, whose 3D structures are not known. Two proteins, which were predicted to be disordered based on their sequences, are predicted by all three structure algorithms to be extended structures. The other five were predicted to be compact structures with only two exceptions in the case of AlphaFold2. All three prediction algorithms make remarkably similar and high-quality predictions for one large protein, HCO_11565, from a nematode. It is conjectured that this is due to many homologs in the taxonomically restricted family of which it is a member, and to the fact that the Dali server revealed several nonrelated proteins with similar folds. An animated Interactive 3D Complement (I3DC) is available in Proteopedia at http://proteopedia.org/w/Journal:Proteins:3.
Collapse
Affiliation(s)
- Jing Liu
- Department of Biotechnology and Food Engineering, Guangdong Technion-Israel Institute of Technology, Shantou, China
- Faculty of Biotechnology and Food Engineering, Technion-Israel Institute of Technology, Haifa, Israel
| | - Rongqing Yuan
- Department of Chemistry, Tsinghua University, Beijing, China
| | - Wei Shao
- School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai, China
| | - Jitong Wang
- Department of Chemistry, Tsinghua University, Beijing, China
| | - Israel Silman
- Department of Brain Sciences, The Weizmann Institute of Science, Rehovot, Israel
| | - Joel L Sussman
- Department of Chemical and Structural Biology, The Weizmann Institute of Science, Rehovot, Israel
| |
Collapse
|
13
|
Iyengar BR, Bornberg-Bauer E. Neutral Models of De Novo Gene Emergence Suggest that Gene Evolution has a Preferred Trajectory. Mol Biol Evol 2023; 40:msad079. [PMID: 37011142 PMCID: PMC10118301 DOI: 10.1093/molbev/msad079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2023] [Revised: 03/01/2023] [Accepted: 03/28/2023] [Indexed: 04/05/2023] Open
Abstract
New protein coding genes can emerge from genomic regions that previously did not contain any genes, via a process called de novo gene emergence. To synthesize a protein, DNA must be transcribed as well as translated. Both processes need certain DNA sequence features. Stable transcription requires promoters and a polyadenylation signal, while translation requires at least an open reading frame. We develop mathematical models based on mutation probabilities, and the assumption of neutral evolution, to find out how quickly genes emerge and are lost. We also investigate the effect of the order by which DNA features evolve, and if sequence composition is biased by mutation rate. We rationalize how genes are lost much more rapidly than they emerge, and how they preferentially arise in regions that are already transcribed. Our study not only answers some fundamental questions on the topic of de novo emergence but also provides a modeling framework for future studies.
Collapse
Affiliation(s)
- Bharat Ravi Iyengar
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
- Department of Protein Evolution, Max Planck Institute for Biology Tübingen, Tübingen, Germany
| |
Collapse
|
14
|
Heames B, Buchel F, Aubel M, Tretyachenko V, Loginov D, Novák P, Lange A, Bornberg-Bauer E, Hlouchová K. Experimental characterization of de novo proteins and their unevolved random-sequence counterparts. Nat Ecol Evol 2023; 7:570-580. [PMID: 37024625 PMCID: PMC10089919 DOI: 10.1038/s41559-023-02010-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2022] [Accepted: 02/10/2023] [Indexed: 04/08/2023]
Abstract
De novo gene emergence provides a route for new proteins to be formed from previously non-coding DNA. Proteins born in this way are considered random sequences and typically assumed to lack defined structure. While it remains unclear how likely a de novo protein is to assume a soluble and stable tertiary structure, intersecting evidence from random sequence and de novo-designed proteins suggests that native-like biophysical properties are abundant in sequence space. Taking putative de novo proteins identified in human and fly, we experimentally characterize a library of these sequences to assess their solubility and structure propensity. We compare this library to a set of synthetic random proteins with no evolutionary history. Bioinformatic prediction suggests that de novo proteins may have remarkably similar distributions of biophysical properties to unevolved random sequences of a given length and amino acid composition. However, upon expression in vitro, de novo proteins exhibit moderately higher solubility which is further induced by the DnaK chaperone system. We suggest that while synthetic random sequences are a useful proxy for de novo proteins in terms of structure propensity, de novo proteins may be better integrated in the cellular system than random expectation, given their higher solubility.
Collapse
Affiliation(s)
- Brennen Heames
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | - Filip Buchel
- Department of Cell Biology, Charles University, BIOCEV, Prague, Czech Republic
- Department of Biochemistry, Charles University, Prague, Czech Republic
| | - Margaux Aubel
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | | | - Dmitry Loginov
- Institute of Microbiology, Czech Academy of Sciences, Prague, Czech Republic
| | - Petr Novák
- Institute of Microbiology, Czech Academy of Sciences, Prague, Czech Republic
| | - Andreas Lange
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany.
- Department of Protein Evolution, MPI for Developmental Biology, Tübingen, Germany.
| | - Klára Hlouchová
- Department of Cell Biology, Charles University, BIOCEV, Prague, Czech Republic.
- Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Prague, Czech Republic.
| |
Collapse
|
15
|
Eicholt LA, Aubel M, Berk K, Bornberg‐Bauer E, Lange A. Heterologous expression of naturally evolved putative de novo proteins with chaperones. Protein Sci 2022; 31:e4371. [PMID: 35900020 PMCID: PMC9278007 DOI: 10.1002/pro.4371] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2022] [Revised: 05/03/2022] [Accepted: 05/14/2022] [Indexed: 11/23/2022]
Abstract
Over the past decade, evidence has accumulated that new protein-coding genes can emerge de novo from previously non-coding DNA. Most studies have focused on large scale computational predictions of de novo protein-coding genes across a wide range of organisms. In contrast, experimental data concerning the folding and function of de novo proteins are scarce. This might be due to difficulties in handling de novo proteins in vitro, as most are short and predicted to be disordered. Here, we propose a guideline for the effective expression of eukaryotic de novo proteins in Escherichia coli. We used 11 sequences from Drosophila melanogaster and 10 from Homo sapiens, that are predicted de novo proteins from former studies, for heterologous expression. The candidate de novo proteins have varying secondary structure and disorder content. Using multiple combinations of purification tags, E. coli expression strains, and chaperone systems, we were able to increase the number of solubly expressed putative de novo proteins from 30% to 62%. Our findings indicate that the best combination for expressing putative de novo proteins in E. coli is a GST-tag with T7 Express cells and co-expressed chaperones. We found that, overall, proteins with higher predicted disorder were easier to express. STATEMENT: Today, we know that proteins do not only evolve by duplication and divergence of existing proteins but also arise from previously non-coding DNA. These proteins are called de novo proteins. Their properties are still poorly understood and their experimental analysis faces major obstacles. Here, we aim to present a starting point for soluble expression of de novo proteins with the help of chaperones and thereby enable further characterization.
Collapse
Affiliation(s)
- Lars A. Eicholt
- Institute for Evolution and BiodiversityUniversity of MuensterMünsterGermany
| | - Margaux Aubel
- Institute for Evolution and BiodiversityUniversity of MuensterMünsterGermany
| | - Katrin Berk
- Institute for Evolution and BiodiversityUniversity of MuensterMünsterGermany
| | - Erich Bornberg‐Bauer
- Institute for Evolution and BiodiversityUniversity of MuensterMünsterGermany
- Max Planck‐Institute for Biology TuebingenTübingenGermany
| | - Andreas Lange
- Institute for Evolution and BiodiversityUniversity of MuensterMünsterGermany
| |
Collapse
|
16
|
Jayaraman V, Toledo‐Patiño S, Noda‐García L, Laurino P. Mechanisms of protein evolution. Protein Sci 2022; 31:e4362. [PMID: 35762715 PMCID: PMC9214755 DOI: 10.1002/pro.4362] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2022] [Revised: 05/11/2022] [Accepted: 05/14/2022] [Indexed: 11/06/2022]
Abstract
How do proteins evolve? How do changes in sequence mediate changes in protein structure, and in turn in function? This question has multiple angles, ranging from biochemistry and biophysics to evolutionary biology. This review provides a brief integrated view of some key mechanistic aspects of protein evolution. First, we explain how protein evolution is primarily driven by randomly acquired genetic mutations and selection for function, and how these mutations can even give rise to completely new folds. Then, we also comment on how phenotypic protein variability, including promiscuity, transcriptional and translational errors, may also accelerate this process, possibly via "plasticity-first" mechanisms. Finally, we highlight open questions in the field of protein evolution, with respect to the emergence of more sophisticated protein systems such as protein complexes, pathways, and the emergence of pre-LUCA enzymes.
Collapse
Affiliation(s)
- Vijay Jayaraman
- Department of Molecular Cell BiologyWeizmann Institute of ScienceRehovotIsrael
| | - Saacnicteh Toledo‐Patiño
- Protein Engineering and Evolution UnitOkinawa Institute of Science and Technology Graduate UniversityOkinawaJapan
| | - Lianet Noda‐García
- Department of Plant Pathology and Microbiology, Institute of Environmental Sciences, Robert H. Smith Faculty of Agriculture, Food and EnvironmentHebrew University of JerusalemRehovotIsrael
| | - Paola Laurino
- Protein Engineering and Evolution UnitOkinawa Institute of Science and Technology Graduate UniversityOkinawaJapan
| |
Collapse
|
17
|
Dolatabadian A, Fernando WGD. Genomic Variations and Mutational Events Associated with Plant-Pathogen Interactions. BIOLOGY 2022; 11:421. [PMID: 35336795 PMCID: PMC8945218 DOI: 10.3390/biology11030421] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/07/2022] [Revised: 03/07/2022] [Accepted: 03/08/2022] [Indexed: 12/23/2022]
Abstract
Phytopathologists are actively researching the molecular basis of plant-pathogen interactions. The mechanisms of responses to pathogens have been studied extensively in model crop plant species and natural populations. Today, with the rapid expansion of genomic technologies such as DNA sequencing, transcriptomics, proteomics, and metabolomics, as well as the development of new methods and protocols, data analysis, and bioinformatics, it is now possible to assess the role of genetic variation in plant-microbe interactions and to understand the underlying molecular mechanisms of plant defense and microbe pathogenicity with ever-greater resolution and accuracy. Genetic variation is an important force in evolution that enables organisms to survive in stressful environments. Moreover, understanding the role of genetic variation and mutational events is essential for crop breeders to produce improved cultivars. This review focuses on genetic variations and mutational events associated with plant-pathogen interactions and discusses how these genome compartments enhance plants' and pathogens' evolutionary processes.
Collapse
Affiliation(s)
- Aria Dolatabadian
- Department of Plant Science, Faculty of Agricultural and Food Sciences, University of Manitoba, Winnipeg, MB R3T 2N2, Canada;
| | | |
Collapse
|
18
|
New Genomic Signals Underlying the Emergence of Human Proto-Genes. Genes (Basel) 2022; 13:genes13020284. [PMID: 35205330 PMCID: PMC8871994 DOI: 10.3390/genes13020284] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2021] [Revised: 01/20/2022] [Accepted: 01/24/2022] [Indexed: 12/04/2022] Open
Abstract
De novo genes are novel genes which emerge from non-coding DNA. Until now, little is known about de novo genes’ properties, correlated to their age and mechanisms of emergence. In this study, we investigate four related properties: introns, upstream regulatory motifs, 5′ Untranslated regions (UTRs) and protein domains, in 23,135 human proto-genes. We found that proto-genes contain introns, whose number and position correlates with the genomic position of proto-gene emergence. The origin of these introns is debated, as our results suggest that 41% of proto-genes might have captured existing introns, and 13.7% of them do not splice the ORF. We show that proto-genes which emerged via overprinting tend to be more enriched in core promotor motifs, while intergenic and intronic genes are more enriched in enhancers, even if the TATA motif is most commonly found upstream in these genes. Intergenic and intronic 5′ UTRs of proto-genes have a lower potential to stabilise mRNA structures than exonic proto-genes and established human genes. Finally, we confirm that proteins expressed by proto-genes gain new putative domains with age. Overall, we find that regulatory motifs inducing transcription and translation of previously non-coding sequences may facilitate proto-gene emergence. Our study demonstrates that introns, 5′ UTRs, and domains have specific properties in proto-genes. We also emphasize that the genomic positions of de novo genes strongly impacts these properties.
Collapse
|
19
|
Nikolaidis M, Markoulatos P, Van de Peer Y, Oliver SG, Amoutzias GD. The Neighborhood of the Spike Gene Is a Hotspot for Modular Intertypic Homologous and Nonhomologous Recombination in Coronavirus Genomes. Mol Biol Evol 2022; 39:msab292. [PMID: 34638137 PMCID: PMC8549283 DOI: 10.1093/molbev/msab292] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
Coronaviruses (CoVs) have very large RNA viral genomes with a distinct genomic architecture of core and accessory open reading frames (ORFs). It is of utmost importance to understand their patterns and limits of homologous and nonhomologous recombination, because such events may affect the emergence of novel CoV strains, alter their host range, infection rate, tissue tropism pathogenicity, and their ability to escape vaccination programs. Intratypic recombination among closely related CoVs of the same subgenus has often been reported; however, the patterns and limits of genomic exchange between more distantly related CoV lineages (intertypic recombination) need further investigation. Here, we report computational/evolutionary analyses that clearly demonstrate a substantial ability for CoVs of different subgenera to recombine. Furthermore, we show that CoVs can obtain-through nonhomologous recombination-accessory ORFs from core ORFs, exchange accessory ORFs with different CoV genera, with other viruses (i.e., toroviruses, influenza C/D, reoviruses, rotaviruses, astroviruses) and even with hosts. Intriguingly, most of these radical events result from double crossovers surrounding the Spike ORF, thus highlighting both the instability and mobile nature of this genomic region. Although many such events have often occurred during the evolution of various CoVs, the genomic architecture of the relatively young SARS-CoV/SARS-CoV-2 lineage so far appears to be stable.
Collapse
Affiliation(s)
- Marios Nikolaidis
- Bioinformatics Laboratory, Department of Biochemistry and Biotechnology, University of Thessaly, Larissa, Greece
| | - Panayotis Markoulatos
- Microbial Biotechnology-Molecular Bacteriology-Virology Laboratory, Department of Biochemistry and Biotechnology, University of Thessaly, Larissa, Greece
| | - Yves Van de Peer
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Center for Plant Systems Biology, VIB, Ghent, Belgium
- Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa
- College of Horticulture, Nanjing Agricultural University, Nanjing, China
| | - Stephen G Oliver
- Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
| | - Grigorios D Amoutzias
- Bioinformatics Laboratory, Department of Biochemistry and Biotechnology, University of Thessaly, Larissa, Greece
| |
Collapse
|
20
|
Amoutzias GD, Nikolaidis M, Tryfonopoulou E, Chlichlia K, Markoulatos P, Oliver SG. The Remarkable Evolutionary Plasticity of Coronaviruses by Mutation and Recombination: Insights for the COVID-19 Pandemic and the Future Evolutionary Paths of SARS-CoV-2. Viruses 2022; 14:78. [PMID: 35062282 PMCID: PMC8778387 DOI: 10.3390/v14010078] [Citation(s) in RCA: 55] [Impact Index Per Article: 18.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2021] [Revised: 12/22/2021] [Accepted: 12/31/2021] [Indexed: 12/13/2022] Open
Abstract
Coronaviruses (CoVs) constitute a large and diverse subfamily of positive-sense single-stranded RNA viruses. They are found in many mammals and birds and have great importance for the health of humans and farm animals. The current SARS-CoV-2 pandemic, as well as many previous epidemics in humans that were of zoonotic origin, highlights the importance of studying the evolution of the entire CoV subfamily in order to understand how novel strains emerge and which molecular processes affect their adaptation, transmissibility, host/tissue tropism, and patho non-homologous genicity. In this review, we focus on studies over the last two years that reveal the impact of point mutations, insertions/deletions, and intratypic/intertypic homologous and non-homologous recombination events on the evolution of CoVs. We discuss whether the next generations of CoV vaccines should be directed against other CoV proteins in addition to or instead of spike. Based on the observed patterns of molecular evolution for the entire subfamily, we discuss five scenarios for the future evolutionary path of SARS-CoV-2 and the COVID-19 pandemic. Finally, within this evolutionary context, we discuss the recently emerged Omicron (B.1.1.529) VoC.
Collapse
Affiliation(s)
- Grigorios D. Amoutzias
- Bioinformatics Laboratory, Department of Biochemistry and Biotechnology, University of Thessaly, 41500 Larissa, Greece;
| | - Marios Nikolaidis
- Bioinformatics Laboratory, Department of Biochemistry and Biotechnology, University of Thessaly, 41500 Larissa, Greece;
| | - Eleni Tryfonopoulou
- Laboratory of Molecular Immunology, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus-Dragana, 68100 Alexandroupolis, Greece; (E.T.); (K.C.)
| | - Katerina Chlichlia
- Laboratory of Molecular Immunology, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus-Dragana, 68100 Alexandroupolis, Greece; (E.T.); (K.C.)
| | - Panayotis Markoulatos
- Microbial Biotechnology-Molecular Bacteriology-Virology Laboratory, Department of Biochemistry and Biotechnology, University of Thessaly, 41500 Larissa, Greece;
| | - Stephen G. Oliver
- Department of Biochemistry, University of Cambridge, Sanger Building, 80 Tennis Court Road, Cambridge CB2 1GA, UK
| |
Collapse
|
21
|
Abstract
Influenza A virus has long been known to encode 10 major polypeptides, produced, almost without exception, by every natural isolate of the virus. These polypeptides are expressed in readily detectable amounts during infection and are either fully essential or their loss severely attenuates virus replication. More recent work has shown that this core proteome is elaborated by expression of a suite of accessory gene products that tend to be expressed at lower levels through noncanonical transcriptional and/or translational events. Expression and activity of these accessory proteins varies between virus strains and is nonessential (sometimes inconsequential) for virus replication in cell culture, but in many cases has been shown to affect virulence and/or transmission in vivo. This review describes, when known, the expression mechanisms and functions of this influenza A virus accessory proteome and discusses its significance and evolution.
Collapse
Affiliation(s)
- Rute M Pinto
- The Roslin Institute, University of Edinburgh, Midlothian EH25 9RG, United Kingdom
| | - Samantha Lycett
- The Roslin Institute, University of Edinburgh, Midlothian EH25 9RG, United Kingdom
| | - Eleanor Gaunt
- The Roslin Institute, University of Edinburgh, Midlothian EH25 9RG, United Kingdom
| | - Paul Digard
- The Roslin Institute, University of Edinburgh, Midlothian EH25 9RG, United Kingdom
| |
Collapse
|
22
|
Claverie JM, Santini S. Validation of predicted anonymous proteins simply using Fisher's exact test. BIOINFORMATICS ADVANCES 2021; 1:vbab034. [PMID: 36700095 PMCID: PMC9710694 DOI: 10.1093/bioadv/vbab034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/15/2021] [Revised: 11/03/2021] [Accepted: 11/10/2021] [Indexed: 01/28/2023]
Abstract
Motivation Genomes sequencing has become the primary (and often the sole) experimental method to characterize newly discovered organisms, in particular from the microbial world (bacteria, archaea, viruses). This generates an ever increasing number of predicted proteins the existence of which is unwarranted, in particular among those without homolog in model organisms. As a last resort, the computation of the selection pressure from pairwise alignments of the corresponding 'Open Reading Frames' (ORFs) can be used to validate their existences. However, this approach is error-prone, as not usually associated with a significance test. Results We introduce the use of the straightforward Fisher's exact test as a postprocessing of the results provided by the popular CODEML sequence comparison software. The respective rates of nucleotide changes at the nonsynonymous versus synonymous position (as determined by CODEML) are turned into entries into a 2 × 2 contingency table, the probability of which is computed under the Null hypothesis that they should not behave differently if the ORFs do not encode actual proteins. Using the genome sequences of two recently isolated giant viruses, we show that strong negative selection pressures do not always provide a solid argument in favor of the existence of proteins.
Collapse
Affiliation(s)
- Jean-Michel Claverie
- Aix-Marseille University, CNRS, IGS (UMR7256), IMM (FR3479), Luminy, Marseille F-13288, France,To whom correspondence should be addressed.
| | - Sébastien Santini
- Aix-Marseille University, CNRS, IGS (UMR7256), IMM (FR3479), Luminy, Marseille F-13288, France
| |
Collapse
|
23
|
Zhuang X, Cheng CHC. Propagation of a De Novo Gene under Natural Selection: Antifreeze Glycoprotein Genes and Their Evolutionary History in Codfishes. Genes (Basel) 2021; 12:genes12111777. [PMID: 34828383 PMCID: PMC8622921 DOI: 10.3390/genes12111777] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Revised: 11/08/2021] [Accepted: 11/08/2021] [Indexed: 11/16/2022] Open
Abstract
The de novo birth of functional genes from non-coding DNA as an important contributor to new gene formation is increasingly supported by evidence from diverse eukaryotic lineages. However, many uncertainties remain, including how the incipient de novo genes would continue to evolve and the molecular mechanisms underlying their evolutionary trajectory. Here we address these questions by investigating evolutionary history of the de novo antifreeze glycoprotein (AFGP) gene and gene family in gadid (codfish) lineages. We examined AFGP phenotype on a phylogenetic framework encompassing a broad sampling of gadids from freezing and non-freezing habitats. In three select species representing different AFGP-bearing clades, we analyzed all AFGP gene family members and the broader scale AFGP genomic regions in detail. Codon usage analyses suggest that motif duplication produced the intragenic AFGP tripeptide coding repeats, and rapid sequence divergence post-duplication stabilized the recombination-prone long repetitive coding region. Genomic loci analyses support AFGP originated once from a single ancestral genomic origin, and shed light on how the de novo gene proliferated into a gene family. Results also show the processes of gene duplication and gene loss are distinctive in separate clades, and both genotype and phenotype are commensurate with differential local selective pressures.
Collapse
Affiliation(s)
- Xuan Zhuang
- Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- Correspondence: (X.Z.); (C.-H.C.C.)
| | - C.-H. Christina Cheng
- Department of Evolution, Ecology, and Behavior, University of Illinois, Urbana-Champaign, IL 61801, USA
- Correspondence: (X.Z.); (C.-H.C.C.)
| |
Collapse
|
24
|
Rivard EL, Ludwig AG, Patel PH, Grandchamp A, Arnold SE, Berger A, Scott EM, Kelly BJ, Mascha GC, Bornberg-Bauer E, Findlay GD. A putative de novo evolved gene required for spermatid chromatin condensation in Drosophila melanogaster. PLoS Genet 2021; 17:e1009787. [PMID: 34478447 PMCID: PMC8445463 DOI: 10.1371/journal.pgen.1009787] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2021] [Revised: 09/16/2021] [Accepted: 08/19/2021] [Indexed: 02/07/2023] Open
Abstract
Comparative genomics has enabled the identification of genes that potentially evolved de novo from non-coding sequences. Many such genes are expressed in male reproductive tissues, but their functions remain poorly understood. To address this, we conducted a functional genetic screen of over 40 putative de novo genes with testis-enriched expression in Drosophila melanogaster and identified one gene, atlas, required for male fertility. Detailed genetic and cytological analyses showed that atlas is required for proper chromatin condensation during the final stages of spermatogenesis. Atlas protein is expressed in spermatid nuclei and facilitates the transition from histone- to protamine-based chromatin packaging. Complementary evolutionary analyses revealed the complex evolutionary history of atlas. The protein-coding portion of the gene likely arose at the base of the Drosophila genus on the X chromosome but was unlikely to be essential, as it was then lost in several independent lineages. Within the last ~15 million years, however, the gene moved to an autosome, where it fused with a conserved non-coding RNA and evolved a non-redundant role in male fertility. Altogether, this study provides insight into the integration of novel genes into biological processes, the links between genomic innovation and functional evolution, and the genetic control of a fundamental developmental process, gametogenesis.
Collapse
Affiliation(s)
- Emily L. Rivard
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| | - Andrew G. Ludwig
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| | - Prajal H. Patel
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| | | | - Sarah E. Arnold
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| | | | - Emilie M. Scott
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| | - Brendan J. Kelly
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| | - Grace C. Mascha
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| | - Erich Bornberg-Bauer
- University of Münster, Münster, Germany
- Max Planck Institute for Developmental Biology, Tübingen, Germany
| | - Geoffrey D. Findlay
- College of the Holy Cross, Worcester, Massachusetts, United States of America
| |
Collapse
|
25
|
Heizinger L, Merkl R. Evidence for the preferential reuse of sub-domain motifs in primordial protein folds. Proteins 2021; 89:1167-1179. [PMID: 33957009 DOI: 10.1002/prot.26089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2021] [Revised: 04/15/2021] [Accepted: 04/28/2021] [Indexed: 11/06/2022]
Abstract
A comparison of protein backbones makes clear that not more than approximately 1400 different folds exist, each specifying the three-dimensional topology of a protein domain. Large proteins are composed of specific domain combinations and many domains can accommodate different functions. These findings confirm that the reuse of domains is key for the evolution of multi-domain proteins. If reuse was also the driving force for domain evolution, ancestral fragments of sub-domain size exist that are shared between domains possessing significantly different topologies. For the fully automated detection of putatively ancestral motifs, we developed the algorithm Fragstatt that compares proteins pairwise to identify fragments, that is, instantiations of the same motif. To reach maximal sensitivity, Fragstatt compares sequences by means of cascaded alignments of profile Hidden Markov Models. If the fragment sequences are sufficiently similar, the program determines and scores the structural concordance of the fragments. By analyzing a comprehensive set of proteins from the CATH database, Fragstatt identified 12 532 partially overlapping and structurally similar motifs that clustered to 134 unique motifs. The dissemination of these motifs is limited: We found only two domain topologies that contain two different motifs and generally, these motifs occur in not more than 18% of the CATH topologies. Interestingly, motifs are enriched in topologies that are considered ancestral. Thus, our findings suggest that the reuse of sub-domain sized fragments was relevant in early phases of protein evolution and became less important later on.
Collapse
Affiliation(s)
- Leonhard Heizinger
- Institute of Biophysics and Physical Biochemistry, University of Regensburg, Regensburg, Germany
| | - Rainer Merkl
- Institute of Biophysics and Physical Biochemistry, University of Regensburg, Regensburg, Germany
| |
Collapse
|
26
|
Lange A, Patel PH, Heames B, Damry AM, Saenger T, Jackson CJ, Findlay GD, Bornberg-Bauer E. Structural and functional characterization of a putative de novo gene in Drosophila. Nat Commun 2021; 12:1667. [PMID: 33712569 PMCID: PMC7954818 DOI: 10.1038/s41467-021-21667-6] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2020] [Accepted: 02/03/2021] [Indexed: 11/26/2022] Open
Abstract
Comparative genomic studies have repeatedly shown that new protein-coding genes can emerge de novo from noncoding DNA. Still unknown is how and when the structures of encoded de novo proteins emerge and evolve. Combining biochemical, genetic and evolutionary analyses, we elucidate the function and structure of goddard, a gene which appears to have evolved de novo at least 50 million years ago within the Drosophila genus. Previous studies found that goddard is required for male fertility. Here, we show that Goddard protein localizes to elongating sperm axonemes and that in its absence, elongated spermatids fail to undergo individualization. Combining modelling, NMR and circular dichroism (CD) data, we show that Goddard protein contains a large central α-helix, but is otherwise partially disordered. We find similar results for Goddard's orthologs from divergent fly species and their reconstructed ancestral sequences. Accordingly, Goddard's structure appears to have been maintained with only minor changes over millions of years.
Collapse
Affiliation(s)
- Andreas Lange
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | - Prajal H Patel
- Department of Biology, College of the Holy Cross, Worcester, MA, USA
| | - Brennen Heames
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany
| | - Adam M Damry
- Research School of Chemistry, ANU College of Science, Canberra, Australia
| | - Thorsten Saenger
- Department of Pediatric Kidney, Liver and Metabolic Diseases, Hannover Medical School, Hannover, Germany
| | - Colin J Jackson
- Research School of Chemistry, ANU College of Science, Canberra, Australia
| | | | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, University of Münster, Münster, Germany.
| |
Collapse
|
27
|
Structure and function of naturally evolved de novo proteins. Curr Opin Struct Biol 2021; 68:175-183. [PMID: 33567396 DOI: 10.1016/j.sbi.2020.11.010] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2020] [Revised: 11/16/2020] [Accepted: 11/27/2020] [Indexed: 01/05/2023]
Abstract
Comparative evolutionary genomics has revealed that novel protein coding genes can emerge randomly from non-coding DNA. While most of the myriad of transcripts which continuously emerge vanish rapidly, some attain regulatory regions, become translated and survive. More surprisingly, sequence properties of de novo proteins are almost indistinguishable from randomly obtained sequences, yet de novo proteins may gain functions and integrate into eukaryotic cellular networks quite easily. We here discuss current knowledge on de novo proteins, their structures, functions and evolution. Since the existence of de novo proteins seems at odds with decade-long attempts to construct proteins with novel structures and functions from scratch, we suggest that a better understanding of de novo protein evolution may fuel new strategies for protein design.
Collapse
|
28
|
Dowling D, Schmitz JF, Bornberg-Bauer E. Stochastic Gain and Loss of Novel Transcribed Open Reading Frames in the Human Lineage. Genome Biol Evol 2020; 12:2183-2195. [PMID: 33210146 PMCID: PMC7674706 DOI: 10.1093/gbe/evaa194] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/12/2020] [Indexed: 12/12/2022] Open
Abstract
In addition to known genes, much of the human genome is transcribed into RNA. Chance formation of novel open reading frames (ORFs) can lead to the translation of myriad new proteins. Some of these ORFs may yield advantageous adaptive de novo proteins. However, widespread translation of noncoding DNA can also produce hazardous protein molecules, which can misfold and/or form toxic aggregates. The dynamics of how de novo proteins emerge from potentially toxic raw materials and what influences their long-term survival are unknown. Here, using transcriptomic data from human and five other primates, we generate a set of transcribed human ORFs at six conservation levels to investigate which properties influence the early emergence and long-term retention of these expressed ORFs. As these taxa diverged from each other relatively recently, we present a fine scale view of the evolution of novel sequences over recent evolutionary time. We find that novel human-restricted ORFs are preferentially located on GC-rich gene-dense chromosomes, suggesting their retention is linked to pre-existing genes. Sequence properties such as intrinsic structural disorder and aggregation propensity-which have been proposed to play a role in survival of de novo genes-remain unchanged over time. Even very young sequences code for proteins with low aggregation propensities, suggesting that genomic regions with many novel transcribed ORFs are concomitantly less likely to produce ORFs which code for harmful toxic proteins. Our data indicate that the survival of these novel ORFs is largely stochastic rather than shaped by selection.
Collapse
Affiliation(s)
- Daniel Dowling
- Institute for Evolution and Biodiversity, University of Münster, Germany
| | - Jonathan F Schmitz
- Institute for Evolution and Biodiversity, University of Münster, Germany
| | | |
Collapse
|
29
|
Ai Y, Wu S, Zou C, Wei H. LINC00941 promotes oral squamous cell carcinoma progression via activating CAPRIN2 and canonical WNT/β-catenin signaling pathway. J Cell Mol Med 2020; 24:10512-10524. [PMID: 32691935 PMCID: PMC7521336 DOI: 10.1111/jcmm.15667] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2020] [Revised: 06/19/2020] [Accepted: 07/04/2020] [Indexed: 12/25/2022] Open
Abstract
Dysregulation of long non-coding RNAs (lncRNAs) has been implicated in many cancer developments. Previous studies showed that lncRNA LINC00941 was aberrantly expressed in oral squamous cell carcinoma (OSCC). However, its role in OSCC development remains elusive. In this study, we demonstrated that in OSCC cells, EP300 activates LINC00941 transcription through up-regulating its promoter H3K27ac modification. Up-regulated LINC00941 in turn activates CAPRIN2 expression by looping to CAPRIN2 promoter. Functional assays suggest that both LINC00941 and CAPRIN2 play pivotal roles in promoting OSCC cell proliferation and colony formation. In vivo assay further confirmed the role of LINC00941 in promoting OSCC cell tumour formation. Lastly, we showed that the role of LINC00941 and CAPRIN2 in OSCC progression was mediated through activating the canonical WNT/β-catenin signaling pathway. Thus, LINC00941/CAPRIN2/ WNT/β-catenin signaling pathway provides new therapeutic targets for OSCC treatment.
Collapse
MESH Headings
- Animals
- CRISPR-Cas Systems
- Carcinoma, Squamous Cell/genetics
- Carcinoma, Squamous Cell/pathology
- Cell Division
- Cells, Cultured
- DNA, Neoplasm/genetics
- DNA, Neoplasm/ultrastructure
- Disease Progression
- E1A-Associated p300 Protein/physiology
- Gene Expression Regulation, Neoplastic
- Genes, Reporter
- Histone Code
- Keratinocytes
- Mice
- Mice, Inbred BALB C
- Mice, Nude
- Mouth Neoplasms/genetics
- Mouth Neoplasms/pathology
- Neoplasm Proteins/biosynthesis
- Neoplasm Proteins/genetics
- Neoplasm Proteins/physiology
- Neoplasm Transplantation
- Promoter Regions, Genetic/genetics
- RNA, Long Noncoding/genetics
- RNA, Long Noncoding/metabolism
- RNA, Neoplasm/biosynthesis
- RNA, Neoplasm/genetics
- RNA, Neoplasm/physiology
- RNA-Binding Proteins/biosynthesis
- RNA-Binding Proteins/genetics
- RNA-Binding Proteins/physiology
- Recombinant Proteins/metabolism
- Tumor Stem Cell Assay
- Up-Regulation
- Wnt Signaling Pathway/genetics
- Wnt Signaling Pathway/physiology
- RNA, Guide, CRISPR-Cas Systems
Collapse
Affiliation(s)
- Yilong Ai
- Foshan Stomatological HospitalSchool of Stomatology and MedicineFoshan UniversityFoshan, GuangdongChina
| | - Siyuan Wu
- Foshan Stomatological HospitalSchool of Stomatology and MedicineFoshan UniversityFoshan, GuangdongChina
| | - Chen Zou
- Foshan Stomatological HospitalSchool of Stomatology and MedicineFoshan UniversityFoshan, GuangdongChina
| | - Haigang Wei
- Foshan Stomatological HospitalSchool of Stomatology and MedicineFoshan UniversityFoshan, GuangdongChina
| |
Collapse
|
30
|
Heames B, Schmitz J, Bornberg-Bauer E. A Continuum of Evolving De Novo Genes Drives Protein-Coding Novelty in Drosophila. J Mol Evol 2020; 88:382-398. [PMID: 32253450 PMCID: PMC7162840 DOI: 10.1007/s00239-020-09939-z] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2019] [Accepted: 03/13/2020] [Indexed: 12/13/2022]
Abstract
Orphan genes, lacking detectable homologs in outgroup species, typically represent 10-30% of eukaryotic genomes. Efforts to find the source of these young genes indicate that de novo emergence from non-coding DNA may in part explain their prevalence. Here, we investigate the roots of orphan gene emergence in the Drosophila genus. Across the annotated proteomes of twelve species, we find 6297 orphan genes within 4953 taxon-specific clusters of orthologs. By inferring the ancestral DNA as non-coding for between 550 and 2467 (8.7-39.2%) of these genes, we describe for the first time how de novo emergence contributes to the abundance of clade-specific Drosophila genes. In support of them having functional roles, we show that de novo genes have robust expression and translational support. However, the distinct nucleotide sequences of de novo genes, which have characteristics intermediate between intergenic regions and conserved genes, reflect their recent birth from non-coding DNA. We find that de novo genes encode more disordered proteins than both older genes and intergenic regions. Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over large evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of an initial pool of novel genes.
Collapse
Affiliation(s)
- Brennen Heames
- Institute for Evolution and Biodiversity, 48149, Münster, Germany
| | - Jonathan Schmitz
- Institute for Evolution and Biodiversity, 48149, Münster, Germany
| | | |
Collapse
|
31
|
Rödelsperger C, Prabh N, Sommer RJ. New Gene Origin and Deep Taxon Phylogenomics: Opportunities and Challenges. Trends Genet 2019; 35:914-922. [DOI: 10.1016/j.tig.2019.08.007] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2019] [Revised: 08/07/2019] [Accepted: 08/29/2019] [Indexed: 01/22/2023]
|
32
|
Perochon A, Kahla A, Vranić M, Jia J, Malla KB, Craze M, Wallington E, Doohan FM. A wheat NAC interacts with an orphan protein and enhances resistance to Fusarium head blight disease. PLANT BIOTECHNOLOGY JOURNAL 2019; 17:1892-1904. [PMID: 30821405 PMCID: PMC6737021 DOI: 10.1111/pbi.13105] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/27/2018] [Revised: 02/19/2019] [Accepted: 02/21/2019] [Indexed: 05/05/2023]
Abstract
Taxonomically-restricted orphan genes play an important role in environmental adaptation, as recently demonstrated by the fact that the Pooideae-specific orphan TaFROG (Triticum aestivum Fusarium Resistance Orphan Gene) enhanced wheat resistance to the economically devastating Fusarium head blight (FHB) disease. Like most orphan genes, little is known about the cellular function of the encoded protein TaFROG, other than it interacts with the central stress regulator TaSnRK1α. Here, we functionally characterized a wheat (T. aestivum) NAC-like transcription factor TaNACL-D1 that interacts with TaFROG and investigated its' role in FHB using studies to assess motif analyses, yeast transactivation, protein-protein interaction, gene expression and the disease response of wheat lines overexpressing TaNACL-D1. TaNACL-D1 is a Poaceae-divergent NAC transcription factor that encodes a Triticeae-specific protein C-terminal region with transcriptional activity and a nuclear localisation signal. The TaNACL-D1/TaFROG interaction was detected in yeast and confirmed in planta, within the nucleus. Analysis of multi-protein interactions indicated that TaFROG could form simultaneously distinct protein complexes with TaNACL-D1 and TaSnRK1α in planta. TaNACL-D1 and TaFROG are co-expressed as an early response to both the causal fungal agent of FHB, Fusarium graminearum and its virulence factor deoxynivalenol (DON). Wheat lines overexpressing TaNACL-D1 were more resistant to FHB disease than wild type plants. Thus, we conclude that the orphan protein TaFROG interacts with TaNACL-D1, a NAC transcription factor that forms part of the disease response evolved within the Triticeae.
Collapse
Affiliation(s)
- Alexandre Perochon
- UCD School of Biology and Environmental Science and Earth InstituteCollege of ScienceUniversity College DublinBelfield, Dublin 4Ireland
| | - Amal Kahla
- UCD School of Biology and Environmental Science and Earth InstituteCollege of ScienceUniversity College DublinBelfield, Dublin 4Ireland
| | - Monika Vranić
- UCD School of Biology and Environmental Science and Earth InstituteCollege of ScienceUniversity College DublinBelfield, Dublin 4Ireland
| | - Jianguang Jia
- UCD School of Biology and Environmental Science and Earth InstituteCollege of ScienceUniversity College DublinBelfield, Dublin 4Ireland
| | - Keshav B. Malla
- UCD School of Biology and Environmental Science and Earth InstituteCollege of ScienceUniversity College DublinBelfield, Dublin 4Ireland
| | | | | | - Fiona M. Doohan
- UCD School of Biology and Environmental Science and Earth InstituteCollege of ScienceUniversity College DublinBelfield, Dublin 4Ireland
| |
Collapse
|
33
|
Genome-Wide Profiling of Polyadenylation Events in Maize Using High-Throughput Transcriptomic Sequences. G3-GENES GENOMES GENETICS 2019; 9:2749-2760. [PMID: 31239292 PMCID: PMC6686930 DOI: 10.1534/g3.119.400196] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
Polyadenylation is an essential post-transcriptional modification of eukaryotic transcripts that plays critical role in transcript stability, localization, transport, and translational efficiency. About 70% genes in plants contain alternative polyadenylation (APA) sites. Despite availability of vast amount of sequencing data, to date, a comprehensive map of the polyadenylation events in maize is not available. Here, 9.48 billion RNA-Seq reads were analyzed to characterize 95,345 Poly(A) Clusters (PAC) in 23,705 (51%) maize genes. Of these, 76% were APA genes. However, most APA genes (55%) expressed a dominant PAC rather than favoring multiple PACs equally. The lincRNA genes with PACs were significantly longer in length than the genes without any PAC and about 48% genes had APA sites. Heterogeneity was observed in 52% of the PACs supporting the imprecise nature of the polyadenylation process. Genomic distribution revealed that the majority of the PACs (78%) were located in the genic regions. Unlike previous studies, large number of PACs were observed in the intergenic (n = 21,264), 5′-UTR (735), CDS (2,542), and the intronic regions (12,841). The CDS and introns with PACs were longer in length than without PACs, whereas intergenic PACs were more often associated with transcripts that lacked annotated 3′-UTRs. Nucleotide composition around PACs demonstrated AT-richness and the common upstream motif was AAUAAA, which is consistent with other plants. According to this study, only 2,830 genes still maintained the use of AAUAAA motif. This large-scale data provides useful insights about the gene expression regulation and could be utilized as evidence to validate the annotation of transcript ends.
Collapse
|
34
|
Affiliation(s)
- Stephen Branden Van Oss
- Department of Computational and Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
| | - Anne-Ruxandra Carvunis
- Department of Computational and Systems Biology, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
| |
Collapse
|
35
|
|
36
|
Wang Y, Zeng Z, Liu TL, Sun L, Yao Q, Chen KP. TA, GT and AC are significantly under-represented in open reading frames of prokaryotic and eukaryotic protein-coding genes. Mol Genet Genomics 2019; 294:637-647. [PMID: 30758669 DOI: 10.1007/s00438-019-01535-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2018] [Accepted: 01/31/2019] [Indexed: 01/01/2023]
Abstract
Genomes can be considered a combination of 16 dinucleotides. Analysing the relative abundance of different dinucleotides may reveal important features of genome evolution. In present study, we conducted extensive surveys on the relative abundances of dinucleotides in various genomic components of 28 bacterial, 20 archaean, 19 fungal, 24 plant and 29 animal species. We found that TA, GT and AC are significantly under-represented in open reading frames of all organisms and in intergenic regions and introns of most organisms. Specific dinucleotides are of greatly varied usage at different codon positions. The significantly low representations of TA, GT and AC are considered the evolutionary consequences of preventing formation of pre-mature stop codons and of reducing intron-splicing options in candidate primary mRNA sequences. These data suggest that a reduction of TA and GT occurred on both strands of the DNA sequence at an early stage of de novo gene birth. Interestingly, GT and AC are also significantly under-represented in current prokaryotic genomes, suggesting that ancient prokaryotic protein-coding genes might have contained introns. The greatly varied usages of specific dinucleotides at different codon positions are considered evolutionary accommodations to compensate the unavailability of specific codons and to avoid formation of pre-mature stop codons. This is the first report presenting data of dinucleotide relative abundance to indicate the possible existence of spliceosomal introns in ancient prokaryotic genes and to hypothesize early steps of de novo gene birth.
Collapse
Affiliation(s)
- Yong Wang
- School of Food and Biological Engineering, Jiangsu University, 301 Xuefu Road, Zhenjiang, 212013, China.
| | - Zhen Zeng
- Institute of Life Sciences, Jiangsu University, 301 Xuefu Road, Zhenjiang, 212013, China
| | - Tian-Lei Liu
- School of Food and Biological Engineering, Jiangsu University, 301 Xuefu Road, Zhenjiang, 212013, China
| | - Ling Sun
- School of Food and Biological Engineering, Jiangsu University, 301 Xuefu Road, Zhenjiang, 212013, China
| | - Qin Yao
- Institute of Life Sciences, Jiangsu University, 301 Xuefu Road, Zhenjiang, 212013, China
| | - Ke-Ping Chen
- Institute of Life Sciences, Jiangsu University, 301 Xuefu Road, Zhenjiang, 212013, China
| |
Collapse
|
37
|
Whittle CA, Extavour CG. Contrasting patterns of molecular evolution in metazoan germ line genes. BMC Evol Biol 2019; 19:53. [PMID: 30744572 PMCID: PMC6371493 DOI: 10.1186/s12862-019-1363-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2018] [Accepted: 01/14/2019] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Germ lines are the cell lineages that give rise to the sperm and eggs in animals. The germ lines first arise from primordial germ cells (PGCs) during embryogenesis: these form from either a presumed derived mode of preformed germ plasm (inheritance) or from an ancestral mechanism of inductive cell-cell signalling (induction). Numerous genes involved in germ line specification and development have been identified and functionally studied. However, little is known about the molecular evolutionary dynamics of germ line genes in metazoan model systems. RESULTS Here, we studied the molecular evolution of germ line genes within three metazoan model systems. These include the genus Drosophila (N=34 genes, inheritance), the fellow insect Apis (N=30, induction), and their more distant relative Caenorhabditis (N=23, inheritance). Using multiple species and established phylogenies in each genus, we report that germ line genes exhibited marked variation in the constraint on protein sequence divergence (dN/dS) and codon usage bias (CUB) within each genus. Importantly, we found that de novo lineage-specific inheritance (LSI) genes in Drosophila (osk, pgc) and in Caenorhabditis (pie-1, pgl-1), which are essential to germ plasm functions under the derived inheritance mode, displayed rapid protein sequence divergence relative to the other germ line genes within each respective genus. We show this may reflect the evolution of specialized germ plasm functions and/or low pleiotropy of LSI genes, features not shared with other germ line genes. In addition, we observed that the relative ranking of dN/dS and of CUB between genera were each more strongly correlated between Drosophila and Caenorhabditis, from different phyla, than between Drosophila and its insect relative Apis, suggesting taxonomic differences in how germ line genes have evolved. CONCLUSIONS Taken together, the present results advance our understanding of the evolution of animal germ line genes within three well-known metazoan models. Further, the findings provide insights to the molecular evolution of germ line genes with respect to LSI status, pleiotropy, adaptive evolution as well as PGC-specification mode.
Collapse
Affiliation(s)
- Carrie A Whittle
- Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Avenue, Cambridge, MA, 02138, USA
| | - Cassandra G Extavour
- Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Avenue, Cambridge, MA, 02138, USA.
- Department of Molecular and Cellular Biology, Harvard University, 16 Divinity Avenue, Cambridge, MA, 02138, USA.
| |
Collapse
|
38
|
Exaptation at the molecular genetic level. SCIENCE CHINA-LIFE SCIENCES 2018; 62:437-452. [PMID: 30798493 DOI: 10.1007/s11427-018-9447-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/10/2018] [Accepted: 12/01/2018] [Indexed: 12/22/2022]
Abstract
The realization that body parts of animals and plants can be recruited or coopted for novel functions dates back to, or even predates the observations of Darwin. S.J. Gould and E.S. Vrba recognized a mode of evolution of characters that differs from adaptation. The umbrella term aptation was supplemented with the concept of exaptation. Unlike adaptations, which are restricted to features built by selection for their current role, exaptations are features that currently enhance fitness, even though their present role was not a result of natural selection. Exaptations can also arise from nonaptations; these are characters which had previously been evolving neutrally. All nonaptations are potential exaptations. The concept of exaptation was expanded to the molecular genetic level which aided greatly in understanding the enormous potential of neutrally evolving repetitive DNA-including transposed elements, formerly considered junk DNA-for the evolution of genes and genomes. The distinction between adaptations and exaptations is outlined in this review and examples are given. Also elaborated on is the fact that such distinctions are sometimes more difficult to determine; this is a widespread phenomenon in biology, where continua abound and clear borders between states and definitions are rare.
Collapse
|
39
|
Incipient de novo genes can evolve from frozen accidents that escaped rapid transcript turnover. Nat Ecol Evol 2018; 2:1626-1632. [DOI: 10.1038/s41559-018-0639-7] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2017] [Accepted: 07/09/2018] [Indexed: 11/08/2022]
|
40
|
Klasberg S, Bitard-Feildel T, Callebaut I, Bornberg-Bauer E. Origins and structural properties of novel and de novo protein domains during insect evolution. FEBS J 2018; 285:2605-2625. [PMID: 29802682 DOI: 10.1111/febs.14504] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2017] [Revised: 04/12/2018] [Accepted: 05/11/2018] [Indexed: 12/11/2022]
Abstract
Over long time scales, protein evolution is characterized by modular rearrangements of protein domains. Such rearrangements are mainly caused by gene duplication, fusion and terminal losses. To better understand domain emergence mechanisms we investigated 32 insect genomes covering a speciation gradient ranging from ~ 2 to ~ 390 mya. We use established domain models and foldable domains delineated by hydrophobic cluster analysis (HCA), which does not require homologous sequences, to also identify domains which have likely arisen de novo, that is, from previously noncoding DNA. Our results indicate that most novel domains emerge terminally as they originate from ORF extensions while fewer arise in middle arrangements, resulting from exonization of intronic or intergenic regions. Many novel domains rapidly migrate between terminal or middle positions and single- and multidomain arrangements. Young domains, such as most HCA-defined domains, are under strong selection pressure as they show signals of purifying selection. De novo domains, linked to ancient domains or defined by HCA, have higher degrees of intrinsic disorder and disorder-to-order transition upon binding than ancient domains. However, the corresponding DNA sequences of the novel domains of de novo origins could only rarely be found in sister genomes. We conclude that novel domains are often recruited by other proteins and undergo important structural modifications shortly after their emergence, but evolve too fast to be characterized by cross-species comparisons alone.
Collapse
Affiliation(s)
- Steffen Klasberg
- Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Germany
| | - Tristan Bitard-Feildel
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
| | - Isabelle Callebaut
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Germany
| |
Collapse
|
41
|
Diversity and evolution of the emerging Pandoraviridae family. Nat Commun 2018; 9:2285. [PMID: 29891839 PMCID: PMC5995976 DOI: 10.1038/s41467-018-04698-4] [Citation(s) in RCA: 93] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2018] [Accepted: 05/17/2018] [Indexed: 02/02/2023] Open
Abstract
With DNA genomes reaching 2.5 Mb packed in particles of bacterium-like shape and dimension, the first two Acanthamoeba-infecting pandoraviruses remained up to now the most complex viruses since their discovery in 2013. Our isolation of three new strains from distant locations and environments is now used to perform the first comparative genomics analysis of the emerging worldwide-distributed Pandoraviridae family. Thorough annotation of the genomes combining transcriptomic, proteomic, and bioinformatic analyses reveals many non-coding transcripts and significantly reduces the former set of predicted protein-coding genes. Here we show that the pandoraviruses exhibit an open pan-genome, the enormous size of which is not adequately explained by gene duplications or horizontal transfers. As most of the strain-specific genes have no extant homolog and exhibit statistical features comparable to intergenic regions, we suggest that de novo gene creation could contribute to the evolution of the giant pandoravirus genomes. Giant viruses are visible by light microscopy and have unusually long genomes. Here, the authors report three new members of the Pandoraviridae family and investigate their evolution and diversity.
Collapse
|
42
|
Banerjee S, Chakraborty S. Protein intrinsic disorder negatively associates with gene age in different eukaryotic lineages. MOLECULAR BIOSYSTEMS 2018; 13:2044-2055. [PMID: 28783193 DOI: 10.1039/c7mb00230k] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
The emergence of new protein-coding genes in a specific lineage or species provides raw materials for evolutionary adaptations. Until recently, the biology of new genes emerging particularly from non-genic sequences remained unexplored. Although the new genes are subjected to variable selection pressure and face rapid deletion, some of them become functional and are retained in the gene pool. To acquire functional novelties, new genes often get integrated into the pre-existing ancestral networks. However, the mechanism by which young proteins acquire novel interactions remains unanswered till date. Since structural orientation contributes hugely to the mode of proteins' physical interactions, in this regard, we put forward an interesting question - Do new genes encode proteins with stable folds? Addressing the question, we demonstrated that the intrinsic disorder inversely correlates with the evolutionary gene ages - i.e. young proteins are richer in intrinsic disorder than the ancient ones. We further noted that young proteins, which are initially poorly connected hubs, prefer to be structurally more disordered than well-connected ancient proteins. The phenomenon strikingly defies the usual trend of well-connected proteins being highly disordered in structure. We justified that structural disorder might help poorly connected young proteins to undergo promiscuous interactions, which provides the foundation for novel protein interactions. The study focuses on the evolutionary perspectives of young proteins in the light of structural adaptations.
Collapse
Affiliation(s)
- Sanghita Banerjee
- Machine Intelligence Unit, Indian Statistical Institute, 203 Barrackpore Trunk Road, Kolkata 700108, India.
| | | |
Collapse
|
43
|
Foldability of a Natural De Novo Evolved Protein. Structure 2017; 25:1687-1696.e4. [PMID: 29033289 DOI: 10.1016/j.str.2017.09.006] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2017] [Revised: 07/22/2017] [Accepted: 09/15/2017] [Indexed: 12/13/2022]
Abstract
The de novo evolution of protein-coding genes from noncoding DNA is emerging as a source of molecular innovation in biology. Studies of random sequence libraries, however, suggest that young de novo proteins will not fold into compact, specific structures typical of native globular proteins. Here we show that Bsc4, a functional, natural de novo protein encoded by a gene that evolved recently from noncoding DNA in the yeast S. cerevisiae, folds to a partially specific three-dimensional structure. Bsc4 forms soluble, compact oligomers with high β sheet content and a hydrophobic core, and undergoes cooperative, reversible denaturation. Bsc4 lacks a specific quaternary state, however, existing instead as a continuous distribution of oligomer sizes, and binds dyes indicative of amyloid oligomers or molten globules. The combination of native-like and non-native-like properties suggests a rudimentary fold that could potentially act as a functional intermediate in the emergence of new folded proteins de novo.
Collapse
|
44
|
Lopez-Ezquerra A, Harrison MC, Bornberg-Bauer E. Comparative analysis of lincRNA in insect species. BMC Evol Biol 2017; 17:155. [PMID: 28673235 PMCID: PMC5494802 DOI: 10.1186/s12862-017-0985-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2017] [Accepted: 06/02/2017] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND The ever increasing availability of genomes makes it possible to investigate and compare not only the genomic complements of genes and proteins, but also of RNAs. One class of RNAs, the long noncoding RNAs (lncRNAs) and, in particular, their subclass of long intergenic noncoding RNAs (lincRNAs) have recently gained much attention because of their roles in regulation of important biological processes such as immune response or cell differentiation and as possible evolutionary precursors for protein coding genes. lincRNAs seem to be poorly conserved at the sequence level but at least some lincRNAs have conserved structural elements and syntenic genomic positions. Previous studies showed that transposable elements are a main contribution to the evolution of lincRNAs in mammals. In contrast, plant lincRNA emergence and evolution has been linked with local duplication events. However, little is known about their evolutionary dynamics in general and in insect genomes in particular. RESULTS Here we compared lincRNAs between seven insect genomes and investigated possible evolutionary changes and functional roles. We find very low sequence conservation between different species and that similarities within a species are mostly due to their association with transposable elements (TE) and simple repeats. Furthermore, we find that TEs are less frequent in lincRNA exons than in their introns, indicating that TEs may have been removed by selection. When we analysed the predicted thermodynamic stabilities of lincRNAs we found that they are more stable than their randomized controls which might indicate some selection pressure to maintain certain structural elements. We list several of the most stable lincRNAs which could serve as prime candidates for future functional studies. We also discuss the possibility of de novo protein coding genes emerging from lincRNAs. This is because lincRNAs with high GC content and potentially with longer open reading frames (ORF) are candidate loci where de novo gene emergence might occur. CONCLUSION The processes responsible for the emergence and diversification of lincRNAs in insects remain unclear. Both duplication and transposable elements may be important for the creation of new lincRNAs in insects.
Collapse
Affiliation(s)
- Alberto Lopez-Ezquerra
- Institute of Evolution and Biodiversity, University of Münster, Hüfferstrasse,1, Münster, Münster, Germany
| | - Mark C Harrison
- Institute of Evolution and Biodiversity, University of Münster, Hüfferstrasse,1, Münster, Münster, Germany
| | - Erich Bornberg-Bauer
- Institute of Evolution and Biodiversity, University of Münster, Hüfferstrasse,1, Münster, Münster, Germany.
| |
Collapse
|
45
|
Abstract
In 2004, when the protein estimate from the finished human genome was only 24,000, the surprise was compounded as reviewed estimates fell to 19,000 by 2014. However, variability in the total canonical protein counts (i.e. excluding alternative splice forms) of open reading frames (ORFs) in different annotation portals persists. This work assesses these differences and possible causes. A 16-year analysis of Ensembl and UniProtKB/Swiss-Prot shows convergence to a protein number of ~20,000. The former had shown some yo-yoing, but both have now plateaued. Nine major annotation portals, reviewed at the beginning of 2017, gave a spread of counts from 21,819 down to 18,891. The 4-way cross-reference concordance (within UniProt) between Ensembl, Swiss-Prot, Entrez Gene and the Human Gene Nomenclature Committee (HGNC) drops to 18,690, indicating methodological differences in protein definitions and experimental existence support between sources. The Swiss-Prot and neXtProt evidence criteria include mass spectrometry peptide verification and also cross-references for antibody detection from the Human Protein Atlas. Notwithstanding, hundreds of Swiss-Prot entries are classified as non-coding biotypes by HGNC. The only inference that protein numbers might still rise comes from numerous reports of small ORF (smORF) discovery. However, while there have been recent cases of protein verifications from previous miss-annotation of non-coding RNA, very few have passed the Swiss-Prot curation and genome annotation thresholds. The post-genomic era has seen both advances in data generation and improvements in the human reference assembly. Notwithstanding, current numbers, while persistently discordant, show that the earlier yo-yoing has largely ceased. Given the importance to biology and biomedicine of defining the canonical human proteome, the task will need more collaborative inter-source curation combined with broader and deeper experimental confirmation in vivo and in vitro of proteins predicted in silico. The eventual closure could be well be below ~19,000.
Collapse
Affiliation(s)
- Christopher Southan
- IUPHAR/BPS Guide to Pharmacology, Centre for Integrative Physiology, University of Edinburgh, Edinburgh, EH8 9XD, UK
| |
Collapse
|