1
|
Fang A, Zhang Z, Zhou A, Zitnik M. ATOMICA: Learning Universal Representations of Intermolecular Interactions. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.04.02.646906. [PMID: 40291688 PMCID: PMC12026499 DOI: 10.1101/2025.04.02.646906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/30/2025]
Abstract
Molecular interactions underlie nearly all biological processes, but most machine learning models treat molecules in isolation or specialize in a single type of interaction, such as protein-ligand or protein-protein binding. This siloed approach prevents generalization across biomolecular classes and limits the ability to model interaction interfaces systematically. We introduce ATOMICA, a geometric deep learning model that learns atomic-scale representations of intermolecular interfaces across diverse biomolecular modalities, including small molecules, metal ions, amino acids, and nucleic acids. ATOMICA uses a self-supervised denoising and masking objective to train on 2,037,972 interaction complexes and generate hierarchical embeddings at the levels of atoms, chemical blocks, and molecular interfaces. The model generalizes across molecular classes and recovers shared physicochemical features without supervision. Its latent space captures compositional and chemical similarities across interaction types and follows scaling laws that improve representation quality with increasing biomolecular data modalities. We apply ATOMICA to construct five modality-specific interfaceome networks, termed ATOMICAN et s, which connect proteins based on interaction similarity with ions, small molecules, nucleic acids, lipids, and proteins. These networks identify disease pathways across 27 conditions and predict disease-associated proteins in autoimmune neuropathies and lymphoma. Finally, we use ATOMICA to annotate the dark proteome-proteins lacking known structure or function-by predicting 2,646 previously uncharacterized ligand-binding sites. These include putative zinc finger motifs and transmembrane cytochrome subunits, demonstrating that ATOMICA enables systematic annotation of molecular interactions across the proteome.
Collapse
|
2
|
Yang H, Li Q, Stroup EK, Wang S, Ji Z. Widespread stable noncanonical peptides identified by integrated analyses of ribosome profiling and ORF features. Nat Commun 2024; 15:1932. [PMID: 38431639 PMCID: PMC10908861 DOI: 10.1038/s41467-024-46240-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Accepted: 02/18/2024] [Indexed: 03/05/2024] Open
Abstract
Studies have revealed dozens of functional peptides in putative 'noncoding' regions and raised the question of how many proteins are encoded by noncanonical open reading frames (ORFs). Here, we comprehensively annotate genome-wide translated ORFs across five eukaryotes (human, mouse, zebrafish, worm, and yeast) by analyzing ribosome profiling data. We develop a logistic regression model named PepScore based on ORF features (expected length, encoded domain, and conservation) to calculate the probability that the encoded peptide is stable in humans. Systematic ectopic expression validates PepScore and shows that stable complex-associating microproteins can be encoded in 5'/3' untranslated regions and overlapping coding regions of mRNAs besides annotated noncoding RNAs. Stable noncanonical proteins follow conventional rules and localize to different subcellular compartments. Inhibition of proteasomal/lysosomal degradation pathways can stabilize some peptides especially those with moderate PepScores, but cannot rescue the expression of short ones with low PepScores suggesting they are directly degraded by cellular proteases. The majority of human noncanonical peptides with high PepScores show longer lengths but low conservation across species/mammals, and hundreds contain trait-associated genetic variants. Our study presents a statistical framework to identify stable noncanonical peptides in the genome and provides a valuable resource for functional characterization of noncanonical translation during development and disease.
Collapse
Affiliation(s)
- Haiwang Yang
- Department of Pharmacology, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
| | - Qianru Li
- Department of Pharmacology, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
| | - Emily K Stroup
- Department of Pharmacology, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
| | - Sheng Wang
- Department of Biomedical Engineering, McCormick School of Engineering, Northwestern University, Evanston, IL, 60628, USA
| | - Zhe Ji
- Department of Pharmacology, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA.
- Department of Biomedical Engineering, McCormick School of Engineering, Northwestern University, Evanston, IL, 60628, USA.
| |
Collapse
|
3
|
Bruley A, Bitard-Feildel T, Callebaut I, Duprat E. A sequence-based foldability score combined with AlphaFold2 predictions to disentangle the protein order/disorder continuum. Proteins 2023; 91:466-484. [PMID: 36306150 DOI: 10.1002/prot.26441] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2022] [Revised: 10/14/2022] [Accepted: 10/18/2022] [Indexed: 11/11/2022]
Abstract
Order and disorder govern protein functions, but there is a great diversity in disorder, from regions that are-and stay-fully disordered to conditional order. This diversity is still difficult to decipher even though it is encoded in the amino acid sequences. Here, we developed an analytic Python package, named pyHCA, to estimate the foldability of a protein segment from the only information of its amino acid sequence and based on a measure of its density in regular secondary structures associated with hydrophobic clusters, as defined by the hydrophobic cluster analysis (HCA) approach. The tool was designed by optimizing the separation between foldable segments from databases of disorder (DisProt) and order (SCOPe [soluble domains] and OPM [transmembrane domains]). It allows to specify the ratio between order, embodied by regular secondary structures (either participating in the hydrophobic core of well-folded 3D structures or conditionally formed in intrinsically disordered regions) and disorder. We illustrated the relevance of pyHCA with several examples and applied it to the sequences of the proteomes of 21 species ranging from prokaryotes and archaea to unicellular and multicellular eukaryotes, for which structure models are provided in the AlphaFold protein structure database. Cases of low-confidence scores related to disorder were distinguished from those of sequences that we identified as foldable but are still excluded from accurate modeling by AlphaFold2 due to a lack of sequence homologs or to compositional biases. Overall, our approach is complementary to AlphaFold2, providing guides to map structural innovations through evolutionary processes, at proteome and gene scales.
Collapse
Affiliation(s)
- Apolline Bruley
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| | - Tristan Bitard-Feildel
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| | - Isabelle Callebaut
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| | - Elodie Duprat
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| |
Collapse
|
4
|
Bruley A, Mornon JP, Duprat E, Callebaut I. Digging into the 3D Structure Predictions of AlphaFold2 with Low Confidence: Disorder and Beyond. Biomolecules 2022; 12:1467. [PMID: 36291675 PMCID: PMC9599455 DOI: 10.3390/biom12101467] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2022] [Revised: 10/04/2022] [Accepted: 10/05/2022] [Indexed: 01/12/2023] Open
Abstract
AlphaFold2 (AF2) has created a breakthrough in biology by providing three-dimensional structure models for whole-proteome sequences, with unprecedented levels of accuracy. In addition, the AF2 pLDDT score, related to the model confidence, has been shown to provide a good measure of residue-wise disorder. Here, we combined AF2 predictions with pyHCA, a tool we previously developed to identify foldable segments and estimate their order/disorder ratio, from a single protein sequence. We focused our analysis on the AF2 predictions available for 21 reference proteomes (AFDB v1), in particular on their long foldable segments (>30 amino acids) that exhibit characteristics of soluble domains, as estimated by pyHCA. Among these segments, we provided a global analysis of those with very low pLDDT values along their entire length and compared their characteristics to those of segments with very high pLDDT values. We highlighted cases containing conditional order, as well as cases that could form well-folded structures but escape the AF2 prediction due to a shallow multiple sequence alignment and/or undocumented structure or fold. AF2 and pyHCA can therefore be advantageously combined to unravel cryptic structural features in whole proteomes and to refine predictions for different flavors of disorder.
Collapse
|
5
|
Vanni C, Schechter MS, Acinas SG, Barberán A, Buttigieg PL, Casamayor EO, Delmont TO, Duarte CM, Eren AM, Finn RD, Kottmann R, Mitchell A, Sánchez P, Siren K, Steinegger M, Gloeckner FO, Fernàndez-Guerra A. Unifying the known and unknown microbial coding sequence space. eLife 2022; 11:e67667. [PMID: 35356891 PMCID: PMC9132574 DOI: 10.7554/elife.67667] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2021] [Accepted: 03/30/2022] [Indexed: 12/02/2022] Open
Abstract
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.
Collapse
Affiliation(s)
- Chiara Vanni
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine MicrobiologyBremenGermany
- Jacobs University BremenBremenGermany
| | - Matthew S Schechter
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine MicrobiologyBremenGermany
- Department of Medicine, University of ChicagoChicagoUnited States
| | - Silvia G Acinas
- Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC)BarcelonaSpain
| | - Albert Barberán
- Department of Environmental Science, University of ArizonaTucsonUnited States
| | - Pier Luigi Buttigieg
- Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Alfred Wegener InstituteBremerhavenGermany
| | - Emilio O Casamayor
- Center for Advanced Studies of Blanes CEAB-CSIC, Spanish Council for ResearchBlanesSpain
| | - Tom O Delmont
- Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-SaclayEvryFrance
| | - Carlos M Duarte
- Red Sea Research Centre and Computational Bioscience Research Center, King Abdullah University of Science and TechnologyThuwalSaudi Arabia
| | - A Murat Eren
- Department of Medicine, University of ChicagoChicagoUnited States
- Josephine Bay Paul Center, Marine Biological LaboratoryWoods HoleUnited States
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome CampusHinxtonUnited Kingdom
| | - Renzo Kottmann
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine MicrobiologyBremenGermany
| | - Alex Mitchell
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome CampusHinxtonUnited Kingdom
| | - Pablo Sánchez
- Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC)BarcelonaSpain
| | - Kimmo Siren
- Section for Evolutionary Genomics, The GLOBE Institute, University of CopenhagenCopenhagenDenmark
| | - Martin Steinegger
- School of Biological Sciences, Seoul National UniversitySeoulRepublic of Korea
- Institute of Molecular Biology and Genetics, Seoul National UniversitySeoulRepublic of Korea
| | - Frank Oliver Gloeckner
- Jacobs University BremenBremenGermany
- University of Bremen and Life Sciences and ChemistryBremenGermany
- Computing Center, Helmholtz Center for Polar and Marine ResearchBremerhavenGermany
| | - Antonio Fernàndez-Guerra
- Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine MicrobiologyBremenGermany
- Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of CopenhagenCopenhagenDenmark
| |
Collapse
|
6
|
Kulkarni P, Bhattacharya S, Achuthan S, Behal A, Jolly MK, Kotnala S, Mohanty A, Rangarajan G, Salgia R, Uversky V. Intrinsically Disordered Proteins: Critical Components of the Wetware. Chem Rev 2022; 122:6614-6633. [PMID: 35170314 PMCID: PMC9250291 DOI: 10.1021/acs.chemrev.1c00848] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
Despite the wealth of knowledge gained about intrinsically disordered proteins (IDPs) since their discovery, there are several aspects that remain unexplored and, hence, poorly understood. A living cell is a complex adaptive system that can be described as a wetware─a metaphor used to describe the cell as a computer comprising both hardware and software and attuned to logic gates─capable of "making" decisions. In this focused Review, we discuss how IDPs, as critical components of the wetware, influence cell-fate decisions by wiring protein interaction networks to keep them minimally frustrated. Because IDPs lie between order and chaos, we explore the possibility that they can be modeled as attractors. Further, we discuss how the conformational dynamics of IDPs manifests itself as conformational noise, which can potentially amplify transcriptional noise to stochastically switch cellular phenotypes. Finally, we explore the potential role of IDPs in prebiotic evolution, in forming proteinaceous membrane-less organelles, in the origin of multicellularity, and in protein conformation-based transgenerational inheritance of acquired characteristics. Together, these ideas provide a new conceptual framework to discern how IDPs may perform critical biological functions despite their lack of structure.
Collapse
Affiliation(s)
- Prakash Kulkarni
- Department of Medical Oncology and Therapeutics Research, City of Hope National Medical Center, Duarte, CA, USA
| | - Supriyo Bhattacharya
- Integrative Genomics Core, City of Hope National Medical Center, Duarte, CA, USA
| | - Srisairam Achuthan
- Division of Research Informatics, Center for Informatics, City of Hope National Medical Center, Duarte, CA 91010, USA
| | - Amita Behal
- Department of Medical Oncology and Therapeutics Research, City of Hope National Medical Center, Duarte, CA, USA
| | - Mohit Kumar Jolly
- Center for BioSystems Science and Engineering, Indian Institute of Science, Bangalore 560012, India
| | - Sourabh Kotnala
- Department of Medical Oncology and Therapeutics Research, City of Hope National Medical Center, Duarte, CA, USA
| | - Atish Mohanty
- Department of Medical Oncology and Therapeutics Research, City of Hope National Medical Center, Duarte, CA, USA
| | - Govindan Rangarajan
- Department of Mathematics, Indian Institute of Science, Bangalore 560012, India
- Center for Neuroscience, Indian Institute of Science, Bangalore 560012, India
| | - Ravi Salgia
- Department of Medical Oncology and Therapeutics Research, City of Hope National Medical Center, Duarte, CA, USA
| | - Vladimir Uversky
- Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
- Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Institutskiy pereulok, 9, Dolgoprudny, Moscow region 141700, Russia
| |
Collapse
|
7
|
Papadopoulos C, Chevrollier N, Lopes A. Exploring the Peptide Potential of Genomes. Methods Mol Biol 2022; 2405:63-82. [PMID: 35298808 DOI: 10.1007/978-1-0716-1855-4_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Recent studies attribute a central role to the noncoding genome in the emergence of novel genes. The widespread transcription of noncoding regions and the pervasive translation of the resulting RNAs offer to the organisms a vast reservoir of novel peptides. Although the majority of these peptides are anticipated as deleterious or neutral, and thereby expected to be degraded right away or short-lived in evolutionary history, some of them can confer an advantage to the organism. The latter can be further subjected to natural selection and be established as novel genes. In any case, characterizing the structural properties of these pervasively translated peptides is crucial to understand (1) their impact on the cell and (2) how some of these peptides, derived from presumed noncoding regions, can give rise to structured and functional de novo proteins. Therefore, we present a protocol that aims to explore the potential of a genome to produce novel peptides. It consists in annotating all the open reading frames (ORFs) of a genome (i.e., coding and noncoding ones) and characterizing the fold potential and other structural properties of their corresponding potential peptides. Here, we apply our protocol to a small genome and show how to apply it to very large genomes. Finally, we present a case study which aims to probe the fold potential of a set of 721 translated ORFs in mouse lncRNAs, identified with ribosome profiling experiments. Interestingly, we show that the distribution of their fold potential is different from that of the nontranslated lncRNAs and more generally from the other noncoding ORFs of the mouse.
Collapse
Affiliation(s)
- Chris Papadopoulos
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette, cedex, France
| | - Nicolas Chevrollier
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette, cedex, France
| | - Anne Lopes
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette, cedex, France.
| |
Collapse
|
8
|
Papadopoulos C, Callebaut I, Gelly JC, Hatin I, Namy O, Renard M, Lespinet O, Lopes A. Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution. Genome Res 2021; 31:2303-2315. [PMID: 34810219 PMCID: PMC8647833 DOI: 10.1101/gr.275638.121] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Accepted: 09/23/2021] [Indexed: 01/08/2023]
Abstract
The noncoding genome plays an important role in de novo gene birth and in the emergence of genetic novelty. Nevertheless, how noncoding sequences' properties could promote the birth of novel genes and shape the evolution and the structural diversity of proteins remains unclear. Therefore, by combining different bioinformatic approaches, we characterized the fold potential diversity of the amino acid sequences encoded by all intergenic open reading frames (ORFs) of S. cerevisiae with the aim of (1) exploring whether the structural states' diversity of proteomes is already present in noncoding sequences, and (2) estimating the potential of the noncoding genome to produce novel protein bricks that could either give rise to novel genes or be integrated into pre-existing proteins, thus participating in protein structure diversity and evolution. We showed that amino acid sequences encoded by most yeast intergenic ORFs contain the elementary building blocks of protein structures. Moreover, they encompass the large structural state diversity of canonical proteins, with the majority predicted as foldable. Then, we investigated the early stages of de novo gene birth by reconstructing the ancestral sequences of 70 yeast de novo genes and characterized the sequence and structural properties of intergenic ORFs with a strong translation signal. This enabled us to highlight sequence and structural factors determining de novo gene emergence. Finally, we showed a strong correlation between the fold potential of de novo proteins and one of their ancestral amino acid sequences, reflecting the relationship between the noncoding genome and the protein structure universe.
Collapse
Affiliation(s)
- Chris Papadopoulos
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Isabelle Callebaut
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005 Paris, France
| | - Jean-Christophe Gelly
- Université de Paris, Biologie Intégrée du Globule Rouge, UMR_S1134, BIGR, INSERM, F-75015 Paris, France
- Laboratoire d'Excellence GR-Ex, 75015 Paris, France
- Institut National de la Transfusion Sanguine, F-75015 Paris, France
| | - Isabelle Hatin
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Olivier Namy
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Maxime Renard
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Olivier Lespinet
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Anne Lopes
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| |
Collapse
|
9
|
Watson AK, Lopez P, Bapteste E. Hundreds of out-of-frame remodelled gene families in the E. coli pangenome. Mol Biol Evol 2021; 39:6430988. [PMID: 34792602 PMCID: PMC8788219 DOI: 10.1093/molbev/msab329] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
All genomes include gene families with very limited taxonomic distributions that potentially represent new genes and innovations in protein-coding sequence, raising questions on the origins of such genes. Some of these genes are hypothesized to have formed de novo, from noncoding sequences, and recent work has begun to elucidate the processes by which de novo gene formation can occur. A special case of de novo gene formation, overprinting, describes the origin of new genes from noncoding alternative reading frames of existing open reading frames (ORFs). We argue that additionally, out-of-frame gene fission/fusion events of alternative reading frames of ORFs and out-of-frame lateral gene transfers could contribute to the origin of new gene families. To demonstrate this, we developed an original pattern-search in sequence similarity networks, enhancing the use of these graphs, commonly used to detect in-frame remodeled genes. We applied this approach to gene families in 524 complete genomes of Escherichia coli. We identified 767 gene families whose evolutionary history likely included at least one out-of-frame remodeling event. These genes with out-of-frame components represent ∼2.5% of all genes in the E. coli pangenome, suggesting that alternative reading frames of existing ORFs can contribute to a significant proportion of de novo genes in bacteria.
Collapse
Affiliation(s)
- Andrew K Watson
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Sorbonne Université, CNRS, Museum National d'Histoire Naturelle, EPHE, Université des Antilles, 7, quai Saint Bernard, Paris, 75005, France
| | - Philippe Lopez
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Sorbonne Université, CNRS, Museum National d'Histoire Naturelle, EPHE, Université des Antilles, 7, quai Saint Bernard, Paris, 75005, France
| | - Eric Bapteste
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Sorbonne Université, CNRS, Museum National d'Histoire Naturelle, EPHE, Université des Antilles, 7, quai Saint Bernard, Paris, 75005, France
| |
Collapse
|
10
|
Uversky VN. Torches, Candles, Lamps, Lanterns, Flashlights, Spotlights, Night Vision Goggles… You Need Them All to See in Darkness. Proteomics 2020; 19:e1900085. [PMID: 30829430 DOI: 10.1002/pmic.201900085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Articles assembled in the second part of this Special Issue describe some experimental and computational approaches for the structural and functional characterization of intrinsically disordered proteins. Since these tools represent specialized gear for the focused analysis of various aspects of dark proteome, they can be viewed as torches, candles, lamps, lanterns, flashlights, spotlights, night vision goggles, and other means needed to see in darkness.
Collapse
Affiliation(s)
- Vladimir N Uversky
- Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL, 33612, USA.,Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, Moscow, 142290, Russia
| |
Collapse
|
11
|
Lamiable A, Bitard-Feildel T, Rebehmed J, Quintus F, Schoentgen F, Mornon JP, Callebaut I. A topology-based investigation of protein interaction sites using Hydrophobic Cluster Analysis. Biochimie 2019; 167:68-80. [PMID: 31525399 DOI: 10.1016/j.biochi.2019.09.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2019] [Accepted: 09/11/2019] [Indexed: 01/20/2023]
Abstract
Hydrophobic clusters, as defined by Hydrophobic Cluster Analysis (HCA), are conditioned binary patterns, made of hydrophobic and non-hydrophobic positions, whose limits fit well those of regular secondary structures. They were proved to be useful for predicting secondary structures in proteins from the only information of a single amino acid sequence and have permitted to assess, in a comprehensive way, the leading role of binary patterns in secondary structure preference towards a particular state. Here, we considered the available experimental 3D structures of protein globular domains to enlarge our previously reported hydrophobic cluster database (HCDB), almost doubling the number of hydrophobic cluster species (each species being defined by a unique binary pattern) that represent the most frequent structural bricks encountered within protein globular domains. We then used this updated HCDB to show that the hydrophobic amino acids of discordant clusters, i.e. those less abundant clusters for which the observed secondary structure is in disagreement with the binary pattern preference of the species to which they belong, are more exposed to solvent and are more involved in protein interfaces than the hydrophobic amino acids of concordant clusters. As amino acid composition differs between concordant/discordant clusters, considering binary patterns may be used to gain novel insights into key features of protein globular domain cores and surfaces. It can also provide useful information on possible conformational plasticity, including disorder to order transitions.
Collapse
Affiliation(s)
- Alexis Lamiable
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| | - Tristan Bitard-Feildel
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| | - Joseph Rebehmed
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France; Lebanese American University, Department of Computer Science and Mathematics, Beirut, Lebanon
| | - Flavien Quintus
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| | - Françoise Schoentgen
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| | - Jean-Paul Mornon
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| | - Isabelle Callebaut
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France.
| |
Collapse
|
12
|
Uversky VN. Bringing Darkness to Light: Intrinsic Disorder as a Means to Dig into the Dark Proteome. Proteomics 2019; 18:e1800352. [PMID: 30334344 DOI: 10.1002/pmic.201800352] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Affiliation(s)
- Vladimir N Uversky
- Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL, 33612, USA.,Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, 142290, Moscow Region, Russia
| |
Collapse
|
13
|
Perdigão N, Rosa A. Dark Proteome Database: Studies on Dark Proteins. High Throughput 2019; 8:ht8020008. [PMID: 30934744 PMCID: PMC6630768 DOI: 10.3390/ht8020008] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2018] [Revised: 03/12/2019] [Accepted: 03/15/2019] [Indexed: 12/27/2022] Open
Abstract
The dark proteome, as we define it, is the part of the proteome where 3D structure has not been observed either by homology modeling or by experimental characterization in the protein universe. From the 550.116 proteins available in Swiss-Prot (as of July 2016), 43.2% of the eukarya universe and 49.2% of the virus universe are part of the dark proteome. In bacteria and archaea, the percentage of the dark proteome presence is significantly less, at 12.6% and 13.3% respectively. In this work, we present a necessary step to complete the dark proteome picture by introducing the map of the dark proteome in the human and in other model organisms of special importance to mankind. The most significant result is that around 40% to 50% of the proteome of these organisms are still in the dark, where the higher percentages belong to higher eukaryotes (mouse and human organisms). Due to the amount of darkness present in the human organism being more than 50%, deeper studies were made, including the identification of ‘dark’ genes that are responsible for the production of so-called dark proteins, as well as the identification of the ‘dark’ tissues where dark proteins are over represented, namely, the heart, cervical mucosa, and natural killer cells. This is a step forward in the direction of gaining a deeper knowledge of the human dark proteome.
Collapse
Affiliation(s)
- Nelson Perdigão
- Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal.
- Instituto de Sistemas e Robótica, 1049-001 Lisbon, Portugal.
| | - Agostinho Rosa
- Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal.
- Instituto de Sistemas e Robótica, 1049-001 Lisbon, Portugal.
| |
Collapse
|
14
|
Faure G, Jézéquel K, Roisné-Hamelin F, Bitard-Feildel T, Lamiable A, Marcand S, Callebaut I. Discovery and Evolution of New Domains in Yeast Heterochromatin Factor Sir4 and Its Partner Esc1. Genome Biol Evol 2019; 11:572-585. [PMID: 30668669 PMCID: PMC6394760 DOI: 10.1093/gbe/evz010] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/20/2019] [Indexed: 12/22/2022] Open
Abstract
Sir4 is a core component of heterochromatin found in yeasts of the Saccharomycetaceae family, whose general hallmark is to harbor a three-loci mating-type system with two silent loci. However, a large part of the Sir4 amino acid sequences has remained unexplored, belonging to the dark proteome. Here, we analyzed the phylogenetic profile of yet undescribed foldable regions present in Sir4 as well as in Esc1, an Sir4-interacting perinuclear anchoring protein. Within Sir4, we identified a new conserved motif (TOC) adjacent to the N-terminal KU-binding motif. We also found that the Esc1-interacting region of Sir4 is a Dbf4-related H-BRCT domain, only present in species possessing the HO endonuclease and in Kluveryomyces lactis. In addition, we found new motifs within Esc1 including a motif (Esc1-F) that is unique to species where Sir4 possesses an H-BRCT domain. Mutagenesis of conserved amino acids of the Sir4 H-BRCT domain, known to play a critical role in the Dbf4 function, shows that the function of this domain is separable from the essential role of Sir4 in transcriptional silencing and the protection from HO-induced cutting in Saccharomyces cerevisiae. In the more distant methylotrophic clade of yeasts, which often harbor a two-loci mating-type system with one silent locus, we also found a yet undescribed H-BRCT domain in a distinct protein, the ISWI2 chromatin-remodeling factor subunit Itc1. This study provides new insights on yeast heterochromatin evolution and emphasizes the interest of using sensitive methods of sequence analysis for identifying hitherto ignored functional regions within the dark proteome.
Collapse
Affiliation(s)
- Guilhem Faure
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France.,National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Kévin Jézéquel
- Institut de Biologie François Jacob, IRCM/SIGRR/LTR, INSERM U1274, Université Paris-Saclay, CEA Paris-Saclay, Paris, France.,National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Florian Roisné-Hamelin
- Institut de Biologie François Jacob, IRCM/SIGRR/LTR, INSERM U1274, Université Paris-Saclay, CEA Paris-Saclay, Paris, France.,National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Tristan Bitard-Feildel
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| | - Alexis Lamiable
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| | - Stéphane Marcand
- Institut de Biologie François Jacob, IRCM/SIGRR/LTR, INSERM U1274, Université Paris-Saclay, CEA Paris-Saclay, Paris, France.,Sorbonne Université, UMR CNRS 7238, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
| | - Isabelle Callebaut
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France.,Sorbonne Université, UMR CNRS 7238, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
| |
Collapse
|
15
|
Bitard‐Feildel T, Lamiable A, Mornon J, Callebaut I. Order in Disorder as Observed by the "Hydrophobic Cluster Analysis" of Protein Sequences. Proteomics 2018; 18:e1800054. [PMID: 30299594 PMCID: PMC7168002 DOI: 10.1002/pmic.201800054] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2018] [Revised: 08/29/2018] [Indexed: 12/17/2022]
Abstract
Hydrophobic cluster analysis (HCA) is an original approach for protein sequence analysis, which provides access to the foldable repertoire of the protein universe, including yet unannotated protein segments ("dark proteome"). Foldable segments correspond to ordered regions, as well as to intrinsically disordered regions (IDRs) undergoing disorder to order transitions. In this review, how HCA can be used to give insight into this last category of foldable segments is illustrated, with examples matching known 3D structures. After reviewing the HCA principles, examples of short foldable segments are given, which often contain short linear motifs, typically matching hydrophobic clusters. These segments become ordered upon contact with partners, with secondary structure preferences generally corresponding to those observed in the 3D structures within the complexes. Such small foldable segments are sometimes larger than the segments of known 3D structures, including flanking hydrophobic clusters that may be critical for interaction specificity or regulation, as well as intervening sequences allowing fuzziness. Cases of larger conditionally disordered domains are also presented, with lower density in hydrophobic clusters than well-folded globular domains or with exposed hydrophobic patches, which are stabilized by interaction with partners.
Collapse
Affiliation(s)
- Tristan Bitard‐Feildel
- Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC)Institut de recherche pour le développement (IRD)UMR CNRS 7590Muséum National d'Histoire NaturelleSorbonne Université75005ParisFrance
- Laboratoire de Biologie Computationnelle et Quantitative (LCQB)Institute of Biology Paris‐Seine (IBPS)Centre national de la recherche scientifique (CNRS)Sorbonne Université75005ParisFrance
| | - Alexis Lamiable
- Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC)Institut de recherche pour le développement (IRD)UMR CNRS 7590Muséum National d'Histoire NaturelleSorbonne Université75005ParisFrance
| | - Jean‐Paul Mornon
- Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC)Institut de recherche pour le développement (IRD)UMR CNRS 7590Muséum National d'Histoire NaturelleSorbonne Université75005ParisFrance
| | - Isabelle Callebaut
- Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC)Institut de recherche pour le développement (IRD)UMR CNRS 7590Muséum National d'Histoire NaturelleSorbonne Université75005ParisFrance
| |
Collapse
|
16
|
Kulkarni P, Uversky VN. Intrinsically Disordered Proteins: The Dark Horse of the Dark Proteome. Proteomics 2018; 18:e1800061. [DOI: 10.1002/pmic.201800061] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2018] [Revised: 09/07/2018] [Indexed: 12/27/2022]
Affiliation(s)
- Prakash Kulkarni
- Department of Medical Oncology and Therapeutics Research; City of Hope National Medical Center; Duarte CA 91010 USA
| | - Vladimir N. Uversky
- Department of Molecular Medicine; Morsani College of Medicine; University of South Florida; Tampa FL 33612 USA
- Laboratory of New methods in Biology; Institute for Biological Instrumentation; Russian Academy of Sciences; Pushchino Moscow Region 142290 Russia
| |
Collapse
|
17
|
Hoffmann B, Elbahnsi A, Lehn P, Décout JL, Pietrucci F, Mornon JP, Callebaut I. Combining theoretical and experimental data to decipher CFTR 3D structures and functions. Cell Mol Life Sci 2018; 75:3829-3855. [PMID: 29779042 PMCID: PMC11105360 DOI: 10.1007/s00018-018-2835-7] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2017] [Revised: 05/04/2018] [Accepted: 05/07/2018] [Indexed: 12/15/2022]
Abstract
Cryo-electron microscopy (cryo-EM) has recently provided invaluable experimental data about the full-length cystic fibrosis transmembrane conductance regulator (CFTR) 3D structure. However, this experimental information deals with inactive states of the channel, either in an apo, quiescent conformation, in which nucleotide-binding domains (NBDs) are widely separated or in an ATP-bound, yet closed conformation. Here, we show that 3D structure models of the open and closed forms of the channel, now further supported by metadynamics simulations and by comparison with the cryo-EM data, could be used to gain some insights into critical features of the conformational transition toward active CFTR forms. These critical elements lie within membrane-spanning domains but also within NBD1 and the N-terminal extension, in which conformational plasticity is predicted to occur to help the interaction with filamin, one of the CFTR cellular partners.
Collapse
Affiliation(s)
- Brice Hoffmann
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
- Iktos, Paris, France
| | - Ahmad Elbahnsi
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| | - Pierre Lehn
- INSERM U1078, SFR ScInBioS, Université de Bretagne Occidentale, Brest, France
| | | | - Fabio Pietrucci
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| | - Jean-Paul Mornon
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France.
| | - Isabelle Callebaut
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, 75005, Paris, France
| |
Collapse
|
18
|
Klasberg S, Bitard-Feildel T, Callebaut I, Bornberg-Bauer E. Origins and structural properties of novel and de novo protein domains during insect evolution. FEBS J 2018; 285:2605-2625. [PMID: 29802682 DOI: 10.1111/febs.14504] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2017] [Revised: 04/12/2018] [Accepted: 05/11/2018] [Indexed: 12/11/2022]
Abstract
Over long time scales, protein evolution is characterized by modular rearrangements of protein domains. Such rearrangements are mainly caused by gene duplication, fusion and terminal losses. To better understand domain emergence mechanisms we investigated 32 insect genomes covering a speciation gradient ranging from ~ 2 to ~ 390 mya. We use established domain models and foldable domains delineated by hydrophobic cluster analysis (HCA), which does not require homologous sequences, to also identify domains which have likely arisen de novo, that is, from previously noncoding DNA. Our results indicate that most novel domains emerge terminally as they originate from ORF extensions while fewer arise in middle arrangements, resulting from exonization of intronic or intergenic regions. Many novel domains rapidly migrate between terminal or middle positions and single- and multidomain arrangements. Young domains, such as most HCA-defined domains, are under strong selection pressure as they show signals of purifying selection. De novo domains, linked to ancient domains or defined by HCA, have higher degrees of intrinsic disorder and disorder-to-order transition upon binding than ancient domains. However, the corresponding DNA sequences of the novel domains of de novo origins could only rarely be found in sister genomes. We conclude that novel domains are often recruited by other proteins and undergo important structural modifications shortly after their emergence, but evolve too fast to be characterized by cross-species comparisons alone.
Collapse
Affiliation(s)
- Steffen Klasberg
- Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Germany
| | - Tristan Bitard-Feildel
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
| | - Isabelle Callebaut
- Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Germany
| |
Collapse
|
19
|
Abstract
The vast, mostly unknown protein universe can be explored by analyzing protein sequences as a string of domains. A broader coverage can be achieved when these domains, the essential blocks in protein evolution, are detected using sequence profiles. Using clustering to collapse redundant profiles into unique function words (UFWs), we find that over the years 2009–2016, the number of UFWs saturates while the number of sequences matched by a combination of two or more UFWs grows exponentially. Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of “words” or UFWs (57% shared), the “sentences” (MDAs) are different (1.3% shared).
Collapse
|
20
|
Hücker SM, Ardern Z, Goldberg T, Schafferhans A, Bernhofer M, Vestergaard G, Nelson CW, Schloter M, Rost B, Scherer S, Neuhaus K. Discovery of numerous novel small genes in the intergenic regions of the Escherichia coli O157:H7 Sakai genome. PLoS One 2017; 12:e0184119. [PMID: 28902868 PMCID: PMC5597208 DOI: 10.1371/journal.pone.0184119] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2017] [Accepted: 08/20/2017] [Indexed: 12/29/2022] Open
Abstract
In the past, short protein-coding genes were often disregarded by genome annotation pipelines. Transcriptome sequencing (RNAseq) signals outside of annotated genes have usually been interpreted to indicate either ncRNA or pervasive transcription. Therefore, in addition to the transcriptome, the translatome (RIBOseq) of the enteric pathogen Escherichia coli O157:H7 strain Sakai was determined at two optimal growth conditions and a severe stress condition combining low temperature and high osmotic pressure. All intergenic open reading frames potentially encoding a protein of ≥ 30 amino acids were investigated with regard to coverage by transcription and translation signals and their translatability expressed by the ribosomal coverage value. This led to discovery of 465 unique, putative novel genes not yet annotated in this E. coli strain, which are evenly distributed over both DNA strands of the genome. For 255 of the novel genes, annotated homologs in other bacteria were found, and a machine-learning algorithm, trained on small protein-coding E. coli genes, predicted that 89% of these translated open reading frames represent bona fide genes. The remaining 210 putative novel genes without annotated homologs were compared to the 255 novel genes with homologs and to 250 short annotated genes of this E. coli strain. All three groups turned out to be similar with respect to their translatability distribution, fractions of differentially regulated genes, secondary structure composition, and the distribution of evolutionary constraint, suggesting that both novel groups represent legitimate genes. However, the machine-learning algorithm only recognized a small fraction of the 210 genes without annotated homologs. It is possible that these genes represent a novel group of genes, which have unusual features dissimilar to the genes of the machine-learning algorithm training set.
Collapse
Affiliation(s)
- Sarah M. Hücker
- Chair for Microbial Ecology, Technische Universität München, Freising, Germany
- ZIEL - Institute for Food & Health, Technische Universität München, Freising, Germany
| | - Zachary Ardern
- Chair for Microbial Ecology, Technische Universität München, Freising, Germany
- ZIEL - Institute for Food & Health, Technische Universität München, Freising, Germany
| | - Tatyana Goldberg
- Department of Informatics—Bioinformatics & TUM-IAS, Technische Universität München, Garching, Germany
| | - Andrea Schafferhans
- Department of Informatics—Bioinformatics & TUM-IAS, Technische Universität München, Garching, Germany
| | - Michael Bernhofer
- Department of Informatics—Bioinformatics & TUM-IAS, Technische Universität München, Garching, Germany
| | - Gisle Vestergaard
- Research Unit Environmental Genomics, Helmholtz Zentrum München, Neuherberg, Germany
| | - Chase W. Nelson
- Sackler Institute for Comparative Genomics, American Museum of Natural History New York, New York, United States of America
| | - Michael Schloter
- Research Unit Environmental Genomics, Helmholtz Zentrum München, Neuherberg, Germany
| | - Burkhard Rost
- Department of Informatics—Bioinformatics & TUM-IAS, Technische Universität München, Garching, Germany
| | - Siegfried Scherer
- Chair for Microbial Ecology, Technische Universität München, Freising, Germany
- ZIEL - Institute for Food & Health, Technische Universität München, Freising, Germany
| | - Klaus Neuhaus
- Chair for Microbial Ecology, Technische Universität München, Freising, Germany
- Core Facility Microbiome/NGS, ZIEL - Institute for Food & Health, Technische Universität München, Freising, Germany
- * E-mail:
| |
Collapse
|