1
|
Podoliak E, Lamm GHU, Marin E, Schellbach AV, Fedotov DA, Stetsenko A, Asido M, Maliar N, Bourenkov G, Balandin T, Baeken C, Astashkin R, Schneider TR, Bateman A, Wachtveitl J, Schapiro I, Busskamp V, Guskov A, Gordeliy V, Alekseev A, Kovalev K. A subgroup of light-driven sodium pumps with an additional Schiff base counterion. Nat Commun 2024; 15:3119. [PMID: 38600129 PMCID: PMC11006869 DOI: 10.1038/s41467-024-47469-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Accepted: 04/01/2024] [Indexed: 04/12/2024] Open
Abstract
Light-driven sodium pumps (NaRs) are unique ion-transporting microbial rhodopsins. The major group of NaRs is characterized by an NDQ motif and has two aspartic acid residues in the central region essential for sodium transport. Here we identify a subgroup of the NDQ rhodopsins bearing an additional glutamic acid residue in the close vicinity to the retinal Schiff base. We thoroughly characterize a member of this subgroup, namely the protein ErNaR from Erythrobacter sp. HL-111 and show that the additional glutamic acid results in almost complete loss of pH sensitivity for sodium-pumping activity, which is in contrast to previously studied NaRs. ErNaR is capable of transporting sodium efficiently even at acidic pH levels. X-ray crystallography and single particle cryo-electron microscopy reveal that the additional glutamic acid residue mediates the connection between the other two Schiff base counterions and strongly interacts with the aspartic acid of the characteristic NDQ motif. Hence, it reduces its pKa. Our findings shed light on a subgroup of NaRs and might serve as a basis for their rational optimization for optogenetics.
Collapse
Affiliation(s)
- E Podoliak
- Department of Ophthalmology, University Hospital Bonn, Medical Faculty, Bonn, Germany
| | - G H U Lamm
- Institute of Physical and Theoretical Chemistry, Goethe University Frankfurt, 60438, Frankfurt am Main, Germany
| | - E Marin
- Groningen Institute for Biomolecular Sciences and Biotechnology, University of Groningen, 9747AG, Groningen, the Netherlands
| | - A V Schellbach
- Institute of Physical and Theoretical Chemistry, Goethe University Frankfurt, 60438, Frankfurt am Main, Germany
- School of Chemistry, University of Edinburgh, Edinburgh, EH9 3FJ, UK
| | - D A Fedotov
- Fritz Haber Center for Molecular Dynamics Research, Institute of Chemistry, The Hebrew University of Jerusalem, Jerusalem, 9190401, Israel
| | - A Stetsenko
- Groningen Institute for Biomolecular Sciences and Biotechnology, University of Groningen, 9747AG, Groningen, the Netherlands
| | - M Asido
- Institute of Physical and Theoretical Chemistry, Goethe University Frankfurt, 60438, Frankfurt am Main, Germany
| | - N Maliar
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
| | - G Bourenkov
- European Molecular Biology Laboratory, EMBL Hamburg c/o DESY, 22607, Hamburg, Germany
| | - T Balandin
- Institute of Biological Information Processing (IBI-7: Structural Biochemistry), Forschungszentrum Jülich, Jülich, Germany
- JuStruct: Jülich Center for Structural Biology, Forschungszentrum Jülich, Jülich, Germany
| | - C Baeken
- Institute of Biological Information Processing (IBI-7: Structural Biochemistry), Forschungszentrum Jülich, Jülich, Germany
- JuStruct: Jülich Center for Structural Biology, Forschungszentrum Jülich, Jülich, Germany
| | - R Astashkin
- Univ. Grenoble Alpes, CEA, CNRS, Institut de Biologie Structurale (IBS), 38000, Grenoble, France
| | - T R Schneider
- European Molecular Biology Laboratory, EMBL Hamburg c/o DESY, 22607, Hamburg, Germany
| | - A Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | - J Wachtveitl
- Institute of Physical and Theoretical Chemistry, Goethe University Frankfurt, 60438, Frankfurt am Main, Germany
| | - I Schapiro
- Fritz Haber Center for Molecular Dynamics Research, Institute of Chemistry, The Hebrew University of Jerusalem, Jerusalem, 9190401, Israel
| | - V Busskamp
- Department of Ophthalmology, University Hospital Bonn, Medical Faculty, Bonn, Germany
| | - A Guskov
- Groningen Institute for Biomolecular Sciences and Biotechnology, University of Groningen, 9747AG, Groningen, the Netherlands
| | - V Gordeliy
- Institute of Biological Information Processing (IBI-7: Structural Biochemistry), Forschungszentrum Jülich, Jülich, Germany
- JuStruct: Jülich Center for Structural Biology, Forschungszentrum Jülich, Jülich, Germany
- Univ. Grenoble Alpes, CEA, CNRS, Institut de Biologie Structurale (IBS), 38000, Grenoble, France
| | - A Alekseev
- University Medical Center Göttingen, Institute for Auditory Neuroscience and InnerEarLab, Robert-Koch-Str. 40, 37075, Göttingen, Germany.
- Cluster of Excellence "Multiscale Bioimaging: from Molecular Machines to Networks of Excitable Cells" (MBExC), University of Göttingen, Göttingen, Germany.
| | - K Kovalev
- European Molecular Biology Laboratory, EMBL Hamburg c/o DESY, 22607, Hamburg, Germany.
| |
Collapse
|
2
|
Insana G, Ignatchenko A, Martin M, Bateman A. MBDBMetrics: an online metrics tool to measure the impact of biological data resources. Bioinform Adv 2023; 3:vbad180. [PMID: 38130879 PMCID: PMC10733715 DOI: 10.1093/bioadv/vbad180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/20/2023] [Revised: 11/13/2023] [Indexed: 12/23/2023]
Abstract
Motivation There now exist thousands of molecular biology databases covering every aspect of biological data. This database infrastructure takes significant effort and funding to develop and maintain. The creators of these databases need to make strong justifications to funders to prove their impact or importance. There are many publication metrics and tools available such as Google Scholar to measure citation impact or AltMetrics covering multiple measures including social media coverage. Results In this article, we describe a series of novel impact metrics that have been applied initially to the UniProt database, and now made available via a Google Colab to enable any molecular biology resource to gain several additional metrics. These metrics, powered by freely available APIs from Europe PubMedCentral and SureCHEMBL cover mentions of the resource in full text articles, including which section of the paper the mention occurs in, grant acknowledgements and mentions in patent applications. This tool, that we call MBDBMetrics, is a useful adjunct to existing tools. Availability and implementation The MBDBMetrics tool is available at the following locations: https://colab.research.google.com/drive/1aEmSQR9DGQIZmHAIuQV9mLv7Mw9Ppkin and https://github.com/g-insana/MBDBMetrics.
Collapse
Affiliation(s)
- Giuseppe Insana
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
| | - Alex Ignatchenko
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
| | - Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
| |
Collapse
|
3
|
Monzon AM, Arrías PN, Elofsson A, Mier P, Andrade-Navarro MA, Bevilacqua M, Clementel D, Bateman A, Hirsh L, Fornasari MS, Parisi G, Piovesan D, Kajava AV, Tosatto SCE. A STRP-ed definition of Structured Tandem Repeats in Proteins. J Struct Biol 2023; 215:108023. [PMID: 37652396 DOI: 10.1016/j.jsb.2023.108023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2023] [Revised: 07/31/2023] [Accepted: 08/28/2023] [Indexed: 09/02/2023]
Abstract
Tandem Repeat Proteins (TRPs) are a class of proteins with repetitive amino acid sequences that have been studied extensively for over two decades. Different features at the level of sequence, structure, function and evolution have been attributed to them by various authors. And yet many of its salient features appear only when looking at specific subclasses of protein tandem repeats. Here, we attempt to rationalize the existing knowledge on Tandem Repeat Proteins (TRPs) by pointing out several dichotomies. The emerging picture is more nuanced than generally assumed and allows us to draw some boundaries of what is not a "proper" TRP. We conclude with an operational definition of a specific subset, which we have denominated STRPs (Structural Tandem Repeat Proteins), which separates a subclass of tandem repeats with distinctive features from several other less well-defined types of repeats. We believe that this definition will help researchers in the field to better characterize the biological meaning of this large yet largely understudied group of proteins.
Collapse
Affiliation(s)
- Alexander Miguel Monzon
- Dept. of Information Engineering, University of Padova, via Giovanni Gradenigo 6/B, 35131 Padova, Italy
| | - Paula Nazarena Arrías
- Dept. of Biomedical Sciences, University of Padova, via U. Bassi 58/b, 35121 Padova, Italy
| | - Arne Elofsson
- Dept. of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Tomtebodavägen 23, 171 21 Solna, Sweden
| | - Pablo Mier
- Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University of Mainz, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
| | - Miguel A Andrade-Navarro
- Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University of Mainz, Hanns-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
| | - Martina Bevilacqua
- Dept. of Biomedical Sciences, University of Padova, via U. Bassi 58/b, 35121 Padova, Italy
| | - Damiano Clementel
- Dept. of Biomedical Sciences, University of Padova, via U. Bassi 58/b, 35121 Padova, Italy
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Layla Hirsh
- Dept. of Engineering, Faculty of Science and Engineering, Pontifical Catholic University of Peru, Av. Universitaria 1801 San Miguel, Lima 32, Lima, Peru
| | - Maria Silvina Fornasari
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, Bernal, Buenos Aires, Argentina
| | - Gustavo Parisi
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, Bernal, Buenos Aires, Argentina
| | - Damiano Piovesan
- Dept. of Biomedical Sciences, University of Padova, via U. Bassi 58/b, 35121 Padova, Italy
| | - Andrey V Kajava
- Centre de Recherche en Biologie cellulaire de Montpellier (CRBM), UMR 5237 CNRS, Université Montpellier, 1919 Route de Mende, Cedex 5, 34293 Montpellier, France
| | - Silvio C E Tosatto
- Dept. of Biomedical Sciences, University of Padova, via U. Bassi 58/b, 35121 Padova, Italy.
| |
Collapse
|
4
|
Schneider B, Sweeney BA, Bateman A, Cerny J, Zok T, Szachniuk M. When will RNA get its AlphaFold moment? Nucleic Acids Res 2023; 51:9522-9532. [PMID: 37702120 PMCID: PMC10570031 DOI: 10.1093/nar/gkad726] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2023] [Revised: 08/13/2023] [Accepted: 08/22/2023] [Indexed: 09/14/2023] Open
Abstract
The protein structure prediction problem has been solved for many types of proteins by AlphaFold. Recently, there has been considerable excitement to build off the success of AlphaFold and predict the 3D structures of RNAs. RNA prediction methods use a variety of techniques, from physics-based to machine learning approaches. We believe that there are challenges preventing the successful development of deep learning-based methods like AlphaFold for RNA in the short term. Broadly speaking, the challenges are the limited number of structures and alignments making data-hungry deep learning methods unlikely to succeed. Additionally, there are several issues with the existing structure and sequence data, as they are often of insufficient quality, highly biased and missing key information. Here, we discuss these challenges in detail and suggest some steps to remedy the situation. We believe that it is possible to create an accurate RNA structure prediction method, but it will require solving several data quality and volume issues, usage of data beyond simple sequence alignments, or the development of new less data-hungry machine learning methods.
Collapse
Affiliation(s)
- Bohdan Schneider
- Institute of Biotechnology of the Czech Academy of Sciences, Prumyslova 595, CZ-252 50 Vestec, Czech Republic
| | - Blake Alexander Sweeney
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Jiri Cerny
- Institute of Biotechnology of the Czech Academy of Sciences, Prumyslova 595, CZ-252 50 Vestec, Czech Republic
| | - Tomasz Zok
- Institute of Computing Science and European Centre for Bioinformatics and Genomics, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
| | - Marta Szachniuk
- Institute of Computing Science and European Centre for Bioinformatics and Genomics, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
| |
Collapse
|
5
|
Ma L, Zou D, Liu L, Shireen H, Abbasi AA, Bateman A, Xiao J, Zhao W, Bao Y, Zhang Z. Database Commons: A Catalog of Worldwide Biological Databases. Genomics Proteomics Bioinformatics 2023; 21:1054-1058. [PMID: 36572336 PMCID: PMC10928426 DOI: 10.1016/j.gpb.2022.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Revised: 12/13/2022] [Accepted: 12/14/2022] [Indexed: 12/25/2022]
Abstract
Biological databases serve as a global fundamental infrastructure for the worldwide scientific community, which dramatically aid the transformation of big data into knowledge discovery and drive significant innovations in a wide range of research fields. Given the rapid data production, biological databases continue to increase in size and importance. To build a catalog of worldwide biological databases, we curate a total of 5825 biological databases from 8931 publications, which are geographically distributed in 72 countries/regions and developed by 1975 institutions (as of September 20, 2022). We further devise a z-index, a novel index to characterize the scientific impact of a database, and rank all these biological databases as well as their hosting institutions and countries in terms of citation and z-index. Consequently, we present a series of statistics and trends of worldwide biological databases, yielding a global perspective to better understand their status and impact for life and health sciences. An up-to-date catalog of worldwide biological databases, as well as their curated meta-information and derived statistics, is publicly available at Database Commons (https://ngdc.cncb.ac.cn/databasecommons/).
Collapse
Affiliation(s)
- Lina Ma
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.
| | - Dong Zou
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Lin Liu
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
| | - Huma Shireen
- National Center for Bioinformatics, Programme of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan
| | - Amir A Abbasi
- National Center for Bioinformatics, Programme of Comparative and Evolutionary Genomics, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad 45320, Pakistan
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Jingfa Xiao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Wenming Zhao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yiming Bao
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhang Zhang
- National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.
| |
Collapse
|
6
|
Durairaj J, Waterhouse AM, Mets T, Brodiazhenko T, Abdullah M, Studer G, Tauriello G, Akdel M, Andreeva A, Bateman A, Tenson T, Hauryliuk V, Schwede T, Pereira J. Uncovering new families and folds in the natural protein universe. Nature 2023; 622:646-653. [PMID: 37704037 PMCID: PMC10584680 DOI: 10.1038/s41586-023-06622-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2023] [Accepted: 09/07/2023] [Indexed: 09/15/2023]
Abstract
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database1. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4 . By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to Pfam database2 and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light into uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
Collapse
Affiliation(s)
- Janani Durairaj
- Biozentrum, University of Basel, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland
| | - Andrew M Waterhouse
- Biozentrum, University of Basel, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland
| | - Toomas Mets
- Institute of Technology, University of Tartu, Tartu, Estonia
- Department of Experimental Medical Science, Lund University, Lund, Sweden
| | | | - Minhal Abdullah
- Institute of Technology, University of Tartu, Tartu, Estonia
- Department of Experimental Medical Science, Lund University, Lund, Sweden
| | - Gabriel Studer
- Biozentrum, University of Basel, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland
| | - Gerardo Tauriello
- Biozentrum, University of Basel, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland
| | | | - Antonina Andreeva
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
| | - Tanel Tenson
- Institute of Technology, University of Tartu, Tartu, Estonia
| | - Vasili Hauryliuk
- Institute of Technology, University of Tartu, Tartu, Estonia
- Department of Experimental Medical Science, Lund University, Lund, Sweden
- Science for Life Laboratory, Lund, Sweden
- Virus Centre, Lund University, Lund, Sweden
| | - Torsten Schwede
- Biozentrum, University of Basel, Basel, Switzerland.
- SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland.
| | - Joana Pereira
- Biozentrum, University of Basel, Basel, Switzerland.
- SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland.
| |
Collapse
|
7
|
Ormazábal A, Carletti MS, Saldaño TE, Gonzalez Buitron M, Marchetti J, Palopoli N, Bateman A. Expanding the repertoire of human tandem repeat RNA-binding proteins. PLoS One 2023; 18:e0290890. [PMID: 37729217 PMCID: PMC10511089 DOI: 10.1371/journal.pone.0290890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Accepted: 08/15/2023] [Indexed: 09/22/2023] Open
Abstract
Protein regions consisting of arrays of tandem repeats are known to bind other molecular partners, including nucleic acid molecules. Although the interactions between repeat proteins and DNA are already widely explored, studies characterising tandem repeat RNA-binding proteins are lacking. We performed a large-scale analysis of human proteins devoted to expanding the knowledge about tandem repeat proteins experimentally reported as RNA-binding molecules. This work is timely because of the release of a full set of accurate structural models for the human proteome amenable to repeat detection using structural methods. The main goal of our analysis was to build a comprehensive set of human RNA-binding proteins that contain repeats at the sequence or structure level. Our results showed that the combination of sequence and structural methods finds significantly more tandem repeat proteins than either method alone. We identified 219 tandem repeat proteins that bind RNA molecules and characterised the overlap between repeat regions and RNA-binding regions as a first step towards assessing their functional relationship. We observed differences in the characteristics of repeat regions predicted by sequence-based or structure-based methods in terms of their sequence composition, their functions and their protein domains.
Collapse
Affiliation(s)
- Agustín Ormazábal
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bernal, Buenos Aires, Argentina
- Consejo Nacional de Investigaciones Científicas y Técnicas, Godoy Cruz, Buenos Aires, Argentina
| | - Matías Sebastián Carletti
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bernal, Buenos Aires, Argentina
- Consejo Nacional de Investigaciones Científicas y Técnicas, Godoy Cruz, Buenos Aires, Argentina
| | - Tadeo Enrique Saldaño
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bernal, Buenos Aires, Argentina
- Consejo Nacional de Investigaciones Científicas y Técnicas, Godoy Cruz, Buenos Aires, Argentina
- Departamento de Ciencias Básicas, Facultad de Agronomía, Universidad Nacional del Centro de la Provincia de Buenos Aires, Azul, Buenos Aires, Argentina
| | - Martín Gonzalez Buitron
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bernal, Buenos Aires, Argentina
- Consejo Nacional de Investigaciones Científicas y Técnicas, Godoy Cruz, Buenos Aires, Argentina
| | - Julia Marchetti
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bernal, Buenos Aires, Argentina
| | - Nicolas Palopoli
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, Bernal, Buenos Aires, Argentina
- Consejo Nacional de Investigaciones Científicas y Técnicas, Godoy Cruz, Buenos Aires, Argentina
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
| |
Collapse
|
8
|
Barringer R, Parnell AE, Lafita A, Monzon V, Back CR, Madej M, Potempa J, Nobbs AH, Burston SG, Bateman A, Race PR. Domain shuffling of a highly mutable ligand-binding fold drives adhesin generation across the bacterial kingdom. Proteins 2023; 91:1007-1020. [PMID: 36912614 PMCID: PMC10952558 DOI: 10.1002/prot.26487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2022] [Revised: 02/27/2023] [Accepted: 02/28/2023] [Indexed: 03/14/2023]
Abstract
Bacterial fibrillar adhesins are specialized extracellular polypeptides that promote the attachment of bacteria to the surfaces of other cells or materials. Adhesin-mediated interactions are critical for the establishment and persistence of stable bacterial populations within diverse environmental niches and are important determinants of virulence. The fibronectin (Fn)-binding fibrillar adhesin CshA, and its paralogue CshB, play important roles in host colonization by the oral commensal and opportunistic pathogen Streptococcus gordonii. As paralogues are often catalysts for functional diversification, we have probed the early stages of structural and functional divergence in Csh proteins by determining the X-ray crystal structure of the CshB adhesive domain NR2 and characterizing its Fn-binding properties in vitro. Despite sharing a common fold, CshB_NR2 displays an ~1.7-fold reduction in Fn-binding affinity relative to CshA_NR2. This correlates with reduced electrostatic charge in the Fn-binding cleft. Complementary bioinformatic studies reveal that homologues of CshA/B_NR2 domains are widely distributed in both Gram-positive and Gram-negative bacteria, where they are found housed within functionally cryptic multi-domain polypeptides. Our findings are consistent with the classification of Csh adhesins and their relatives as members of the recently defined polymer adhesin domain (PAD) family of bacterial proteins.
Collapse
Affiliation(s)
- Rob Barringer
- School of BiochemistryUniversity of Bristol, University WalkBristolBS8 1TDUK
| | - Alice E. Parnell
- School of BiochemistryUniversity of Bristol, University WalkBristolBS8 1TDUK
- BrisSynBio Synthetic Biology Research CentreUniversity of Bristol, Life Sciences BuildingTyndall AvenueBristolBS8 1TQUK
| | - Aleix Lafita
- European Molecular Biology LaboratoryEuropean Bioinformatics Institute (EMBL‐EBI)Wellcome Genome CampusHinxtonCB10 1SDUK
| | - Vivian Monzon
- European Molecular Biology LaboratoryEuropean Bioinformatics Institute (EMBL‐EBI)Wellcome Genome CampusHinxtonCB10 1SDUK
| | - Catherine R. Back
- School of BiochemistryUniversity of Bristol, University WalkBristolBS8 1TDUK
- BrisSynBio Synthetic Biology Research CentreUniversity of Bristol, Life Sciences BuildingTyndall AvenueBristolBS8 1TQUK
| | - Mariusz Madej
- Department of Microbiology, Faculty of Biochemistry, Biophysics, and BiotechnologyJagiellonian UniversityKrakowPoland
| | - Jan Potempa
- Department of Microbiology, Faculty of Biochemistry, Biophysics, and BiotechnologyJagiellonian UniversityKrakowPoland
- Department of Oral Immunology and Infectious DiseasesUniversity of Louisville School of DentistryLouisvilleKentuckyUSA
| | - Angela H. Nobbs
- Bristol Dental School, University of BristolLower Maudlin StreetBristolBS1 2LYUK
| | - Steven G. Burston
- School of BiochemistryUniversity of Bristol, University WalkBristolBS8 1TDUK
| | - Alex Bateman
- European Molecular Biology LaboratoryEuropean Bioinformatics Institute (EMBL‐EBI)Wellcome Genome CampusHinxtonCB10 1SDUK
| | - Paul R. Race
- School of BiochemistryUniversity of Bristol, University WalkBristolBS8 1TDUK
- BrisSynBio Synthetic Biology Research CentreUniversity of Bristol, Life Sciences BuildingTyndall AvenueBristolBS8 1TQUK
| |
Collapse
|
9
|
Lannelongue L, Aronson HEG, Bateman A, Birney E, Caplan T, Juckes M, McEntyre J, Morris AD, Reilly G, Inouye M. GREENER principles for environmentally sustainable computational science. Nat Comput Sci 2023; 3:514-521. [PMID: 38177425 DOI: 10.1038/s43588-023-00461-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/06/2022] [Accepted: 05/09/2023] [Indexed: 01/06/2024]
Abstract
The carbon footprint of scientific computing is substantial, but environmentally sustainable computational science (ESCS) is a nascent field with many opportunities to thrive. To realize the immense green opportunities and continued, yet sustainable, growth of computer science, we must take a coordinated approach to our current challenges, including greater awareness and transparency, improved estimation and wider reporting of environmental impacts. Here, we present a snapshot of where ESCS stands today and introduce the GREENER set of principles, as well as guidance for best practices moving forward.
Collapse
Affiliation(s)
- Loïc Lannelongue
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK.
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK.
| | | | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | - Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | | | - Martin Juckes
- RAL Space, Science and Technology Facilities Council, Harwell Campus, Didcot, UK
| | - Johanna McEntyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | | | | | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
- The Alan Turing Institute, London, UK
| |
Collapse
|
10
|
Salazar GA, Luciani A, Watkins X, Kandasaamy S, Rice DL, Blum M, Bateman A, Martin M. Nightingale: web components for protein feature visualization. Bioinform Adv 2023; 3:vbad064. [PMID: 37359723 PMCID: PMC10287899 DOI: 10.1093/bioadv/vbad064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Revised: 04/20/2023] [Accepted: 05/23/2023] [Indexed: 06/28/2023]
Abstract
Motivation The visualization of biological data is a fundamental technique that enables researchers to understand and explain biology. Some of these visualizations have become iconic, for instance: tree views for taxonomy, cartoon rendering of 3D protein structures or tracks to represent features in a gene or protein, for instance in a genome browser. Nightingale provides visualizations in the context of proteins and protein features. Results Nightingale is a library of re-usable data visualization web components that are currently used by UniProt and InterPro, among other projects. The components can be used to display protein sequence features, variants, interaction data, 3D structure, etc. These components are flexible, allowing users to easily view multiple data sources within the same context, as well as compose these components to create a customized view. Availability and implementation Nightingale examples and documentation are freely available at https://ebi-webcomponents.github.io/nightingale/. It is distributed under the MIT license, and its source code can be found at https://github.com/ebi-webcomponents/nightingale.
Collapse
Affiliation(s)
- Gustavo A Salazar
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Aurélien Luciani
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Xavier Watkins
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Swaathi Kandasaamy
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Daniel L Rice
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Matthias Blum
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Alex Bateman
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Maria Martin
- Macromolecules, Structure, Chemistry and Bioimaging Section, European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| |
Collapse
|
11
|
Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, Harris NL, Hill DP, Lee R, Mi H, Moxon S, Mungall CJ, Muruganugan A, Mushayahama T, Sternberg PW, Thomas PD, Van Auken K, Ramsey J, Siegele DA, Chisholm RL, Fey P, Aspromonte MC, Nugnes MV, Quaglia F, Tosatto S, Giglio M, Nadendla S, Antonazzo G, Attrill H, Dos Santos G, Marygold S, Strelets V, Tabone CJ, Thurmond J, Zhou P, Ahmed SH, Asanitthong P, Luna Buitrago D, Erdol MN, Gage MC, Ali Kadhum M, Li KYC, Long M, Michalak A, Pesala A, Pritazahra A, Saverimuttu SCC, Su R, Thurlow KE, Lovering RC, Logie C, Oliferenko S, Blake J, Christie K, Corbani L, Dolan ME, Drabkin HJ, Hill DP, Ni L, Sitnikov D, Smith C, Cuzick A, Seager J, Cooper L, Elser J, Jaiswal P, Gupta P, Jaiswal P, Naithani S, Lera-Ramirez M, Rutherford K, Wood V, De Pons JL, Dwinell MR, Hayman GT, Kaldunski ML, Kwitek AE, Laulederkind SJF, Tutaj MA, Vedi M, Wang SJ, D'Eustachio P, Aimo L, Axelsen K, Bridge A, Hyka-Nouspikel N, Morgat A, Aleksander SA, Cherry JM, Engel SR, Karra K, Miyasato SR, Nash RS, Skrzypek MS, Weng S, Wong ED, Bakker E, Berardini TZ, Reiser L, Auchincloss A, Axelsen K, Argoud-Puy G, Blatter MC, Boutet E, Breuza L, Bridge A, Casals-Casas C, Coudert E, Estreicher A, Livia Famiglietti M, Feuermann M, Gos A, Gruaz-Gumowski N, Hulo C, Hyka-Nouspikel N, Jungo F, Le Mercier P, Lieberherr D, Masson P, Morgat A, Pedruzzi I, Pourcel L, Poux S, Rivoire C, Sundaram S, Bateman A, Bowler-Barnett E, Bye-A-Jee H, Denny P, Ignatchenko A, Ishtiaq R, Lock A, Lussi Y, Magrane M, Martin MJ, Orchard S, Raposo P, Speretta E, Tyagi N, Warner K, Zaru R, Diehl AD, Lee R, Chan J, Diamantakis S, Raciti D, Zarowiecki M, Fisher M, James-Zorn C, Ponferrada V, Zorn A, Ramachandran S, Ruzicka L, Westerfield M. The Gene Ontology knowledgebase in 2023. Genetics 2023; 224:iyad031. [PMID: 36866529 PMCID: PMC10158837 DOI: 10.1093/genetics/iyad031] [Citation(s) in RCA: 218] [Impact Index Per Article: 218.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2022] [Revised: 02/10/2023] [Accepted: 02/11/2023] [Indexed: 03/04/2023] Open
Abstract
The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO-a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations-evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs)-mechanistic models of molecular "pathways" (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.
Collapse
|
12
|
Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar G, Bileschi M, Bork P, Bridge A, Colwell L, Gough J, Haft D, Letunić I, Marchler-Bauer A, Mi H, Natale D, Orengo C, Pandurangan A, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu C, Bateman A. InterPro in 2022. Nucleic Acids Res 2023; 51:D418-D427. [PMID: 36350672 PMCID: PMC9825450 DOI: 10.1093/nar/gkac993] [Citation(s) in RCA: 462] [Impact Index Per Article: 462.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2022] [Revised: 10/12/2022] [Accepted: 10/28/2022] [Indexed: 11/11/2022] Open
Abstract
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide a more user friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.
Collapse
Affiliation(s)
- Typhaine Paysan-Lafosse
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Matthias Blum
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Sara Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Tiago Grego
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Beatriz Lázaro Pinto
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Gustavo A Salazar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | | | - Peer Bork
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Meyerhofstraße 1, 69117 Heidelberg, Germany
- Yonsei Frontier Lab (YFL), Yonsei University, 03722 Seoul, South Korea
- Department of Bioinformatics, Biocenter, University of Würzburg, 97074 Würzburg, Germany
| | - Alan Bridge
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Lucy Colwell
- Google Research, Brain team, Cambridge, MA, USA
- Department of Chemistry, University of Cambridge, Cambridge, UK
| | - Julian Gough
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
| | - Daniel H Haft
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Ivica Letunić
- Biobyte Solutions GmbH, Bothestr 142, 69126 Heidelberg, Germany
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Huaiyu Mi
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Darren A Natale
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Christine A Orengo
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Arun P Pandurangan
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
- Department of Biochemistry, Sanger Building, University of Cambridge, Cambridge, UK
| | - Catherine Rivoire
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Christian J A Sigrist
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Ian Sillitoe
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Narmada Thanki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35131 Padua, Italy
| | - Cathy H Wu
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
- Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| |
Collapse
|
13
|
Thakur M, Bateman A, Brooksbank C, Freeberg M, Harrison M, Hartley M, Keane T, Kleywegt G, Leach A, Levchenko M, Morgan S, McDonagh E, Orchard S, Papatheodorou I, Velankar S, Vizcaino J, Witham R, Zdrazil B, McEntyre J. EMBL's European Bioinformatics Institute (EMBL-EBI) in 2022. Nucleic Acids Res 2023; 51:D9-D17. [PMID: 36477213 PMCID: PMC9825486 DOI: 10.1093/nar/gkac1098] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2022] [Revised: 10/21/2022] [Accepted: 10/31/2022] [Indexed: 12/13/2022] Open
Abstract
The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation. This overview summarises the status of services that EMBL-EBI data resources provide to scientific communities globally. The scale, openness, rich metadata and extensive curation of EMBL-EBI added-value databases makes them particularly well-suited as training sets for deep learning, machine learning and artificial intelligence applications, a selection of which are described here. The data resources at EMBL-EBI can catalyse such developments because they offer sustainable, high-quality data, collected in some cases over decades and made openly availability to any researcher, globally. Our aim is for EMBL-EBI data resources to keep providing the foundations for tools and research insights that transform fields across the life sciences.
Collapse
Affiliation(s)
| | - Alex Bateman
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Cath Brooksbank
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Mallory Freeberg
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Melissa Harrison
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Matthew Hartley
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Thomas Keane
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Gerard Kleywegt
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Andrew Leach
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Mariia Levchenko
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Sarah Morgan
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Ellen M McDonagh
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- OpenTargets, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Sandra Orchard
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Irene Papatheodorou
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Sameer Velankar
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Juan Antonio Vizcaino
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Rick Witham
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Barbara Zdrazil
- Data Services Teams, EMBL’s European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | | |
Collapse
|
14
|
Akdel M, Pires DEV, Pardo EP, Jänes J, Zalevsky AO, Mészáros B, Bryant P, Good LL, Laskowski RA, Pozzati G, Shenoy A, Zhu W, Kundrotas P, Serra VR, Rodrigues CHM, Dunham AS, Burke D, Borkakoti N, Velankar S, Frost A, Basquin J, Lindorff-Larsen K, Bateman A, Kajava AV, Valencia A, Ovchinnikov S, Durairaj J, Ascher DB, Thornton JM, Davey NE, Stein A, Elofsson A, Croll TI, Beltrao P. A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol 2022; 29:1056-1067. [PMID: 36344848 PMCID: PMC9663297 DOI: 10.1038/s41594-022-00849-w] [Citation(s) in RCA: 176] [Impact Index Per Article: 88.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2021] [Accepted: 09/20/2022] [Indexed: 11/09/2022]
Abstract
Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.
Collapse
Affiliation(s)
- Mehmet Akdel
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| | - Douglas E V Pires
- School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
| | - Eduard Porta Pardo
- Josep Carreras Leukaemia Research Institute (IJC), Badalona, Spain
- Barcelona Supercomputing Center (BSC), Barcelona, Spain
| | - Jürgen Jänes
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Arthur O Zalevsky
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russian Federation
| | | | - Patrick Bryant
- Dep of Biochemistry and Biophysics and Science for Life Laboratory, Solna, Sweden
| | - Lydia L Good
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Roman A Laskowski
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Gabriele Pozzati
- Dep of Biochemistry and Biophysics and Science for Life Laboratory, Solna, Sweden
| | - Aditi Shenoy
- Dep of Biochemistry and Biophysics and Science for Life Laboratory, Solna, Sweden
| | - Wensi Zhu
- Dep of Biochemistry and Biophysics and Science for Life Laboratory, Solna, Sweden
| | - Petras Kundrotas
- Dep of Biochemistry and Biophysics and Science for Life Laboratory, Solna, Sweden
| | | | - Carlos H M Rodrigues
- School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
| | - Alistair S Dunham
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - David Burke
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Neera Borkakoti
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Sameer Velankar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Adam Frost
- Department of Biochemistry and Biophysics University of California, San Francisco, CA, USA
| | - Jérôme Basquin
- Department of Structural Cell Biology, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Kresten Lindorff-Larsen
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Andrey V Kajava
- Université de Montpellier, Centre de Recherche en Biologie Cellulaire de Montpellier (CRBM) CNRS, Montpellier, France
| | | | - Sergey Ovchinnikov
- Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA, USA.
| | | | - David B Ascher
- School of Chemistry and Molecular Biology, University of Queensland, Brisbane, Queensland, Australia.
| | - Janet M Thornton
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK.
| | | | - Amelie Stein
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| | - Arne Elofsson
- Dep of Biochemistry and Biophysics and Science for Life Laboratory, Solna, Sweden.
| | - Tristan I Croll
- Cambridge Institute for Medical Research, Department of Haematology, The University of Cambridge, Cambridge, UK.
| | - Pedro Beltrao
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK.
- Institute of Molecular Systems Biology, ETH Zürich, Zürich, Switzerland.
| |
Collapse
|
15
|
Russo ET, Barone F, Bateman A, Cozzini S, Punta M, Laio A. DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. PLoS Comput Biol 2022; 18:e1010610. [PMID: 36260616 PMCID: PMC9621593 DOI: 10.1371/journal.pcbi.1010610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2022] [Revised: 10/31/2022] [Accepted: 09/26/2022] [Indexed: 11/07/2022] Open
Abstract
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
Collapse
Affiliation(s)
| | - Federico Barone
- SISSA, Trieste, Italy
- AREA SCIENCE PARK, Trieste, Italy
- Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
| | | | - Marco Punta
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
- Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, IRCCS San Raffaele Scientific Institute, Milan, Italy
- * E-mail: (MP); (AL)
| | - Alessandro Laio
- SISSA, Trieste, Italy
- ICTP, Trieste, Italy
- * E-mail: (MP); (AL)
| |
Collapse
|
16
|
Monzon V, Paysan-Lafosse T, Wood V, Bateman A. Reciprocal best structure hits: using AlphaFold models to discover distant homologues. Bioinform Adv 2022; 2:vbac072. [PMID: 36408459 PMCID: PMC9666668 DOI: 10.1093/bioadv/vbac072] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/21/2022] [Revised: 09/16/2022] [Accepted: 10/05/2022] [Indexed: 11/17/2022]
Abstract
Motivation The conventional methods to detect homologous protein pairs use the comparison of protein sequences. But the sequences of two homologous proteins may diverge significantly and consequently may be undetectable by standard approaches. The release of the AlphaFold 2.0 software enables the prediction of highly accurate protein structures and opens many opportunities to advance our understanding of protein functions, including the detection of homologous protein structure pairs. Results In this proof-of-concept work, we search for the closest homologous protein pairs using the structure models of five model organisms from the AlphaFold database. We compare the results with homologous protein pairs detected by their sequence similarity and show that the structural matching approach finds a similar set of results. In addition, we detect potential novel homologs solely with the structural matching approach, which can help to understand the function of uncharacterized proteins and make previously overlooked connections between well-characterized proteins. We also observe limitations of our implementation of the structure-based approach, particularly when handling highly disordered proteins or short protein structures. Our work shows that high accuracy protein structure models can be used to discover homologous protein pairs, and we expose areas for improvement of this structural matching approach. Availability and Implementation Information to the discovered homologous protein pairs can be found at the following URL: https://doi.org/10.17863/CAM.87873. The code can be accessed here: https://github.com/VivianMonzon/Reciprocal_Best_Structure_Hits. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Vivian Monzon
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB21 4HH, UK
| | - Typhaine Paysan-Lafosse
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB21 4HH, UK
| | - Valerie Wood
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB21 4HH, UK
| |
Collapse
|
17
|
de Crécy-lagard V, Amorin de Hegedus R, Arighi C, Babor J, Bateman A, Blaby I, Blaby-Haas C, Bridge AJ, Burley SK, Cleveland S, Colwell LJ, Conesa A, Dallago C, Danchin A, de Waard A, Deutschbauer A, Dias R, Ding Y, Fang G, Friedberg I, Gerlt J, Goldford J, Gorelik M, Gyori BM, Henry C, Hutinet G, Jaroch M, Karp PD, Kondratova L, Lu Z, Marchler-Bauer A, Martin MJ, McWhite C, Moghe GD, Monaghan P, Morgat A, Mungall CJ, Natale DA, Nelson WC, O’Donoghue S, Orengo C, O’Toole KH, Radivojac P, Reed C, Roberts RJ, Rodionov D, Rodionova IA, Rudolf JD, Saleh L, Sheynkman G, Thibaud-Nissen F, Thomas PD, Uetz P, Vallenet D, Carter EW, Weigele PR, Wood V, Wood-Charlson EM, Xu J. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022; 2022:6663924. [PMID: 35961013 PMCID: PMC9374478 DOI: 10.1093/database/baac062] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Revised: 06/28/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022]
Abstract
Over the last 25 years, biology has entered the genomic era and is becoming a science of ‘big data’. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3–4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Collapse
Affiliation(s)
- Valérie de Crécy-lagard
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | | | - Cecilia Arighi
- Department of Computer and Information Sciences, University of Delaware , Newark, DE 19713, USA
| | - Jill Babor
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus , Hinxton CB10 1SD, UK
| | - Ian Blaby
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory , Berkeley, CA 94720, USA
| | - Crysten Blaby-Haas
- Biology Department, Brookhaven National Laboratory , Upton, NY 11973, USA
| | - Alan J Bridge
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire , Geneva 4 CH-1211, Switzerland
| | - Stephen K Burley
- RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey , Piscataway, NJ 08854, USA
| | - Stacey Cleveland
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | - Lucy J Colwell
- Departmenf of Chemistry, University of Cambridge , Lensfield Road, Cambridge CB2 1EW, UK
| | - Ana Conesa
- Spanish National Research Council, Institute for Integrative Systems Biology , Paterna, Valencia 46980, Spain
| | - Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology , i12, Boltzmannstr. 3, Garching/Munich 85748, Germany
| | - Antoine Danchin
- School of Biomedical Sciences, Li KaShing Faculty of Medicine, The University of Hong Kong , 21 Sassoon Road, Pokfulam, SAR Hong Kong 999077, China
| | - Anita de Waard
- Research Collaboration Unit, Elsevier , Jericho, VT 05465, USA
| | - Adam Deutschbauer
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory , Berkeley, CA 94720, USA
| | - Raquel Dias
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | - Yousong Ding
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida , Gainesville, FL 32610, USA
| | - Gang Fang
- NYU-Shanghai , Shanghai 200120, China
| | - Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University , Ames, IA 50011, USA
| | - John Gerlt
- Institute for Genomic Biology and Departments of Biochemistry and Chemistry, University of Illinois at Urbana-Champaign , Urbana, IL 61801, USA
| | - Joshua Goldford
- Physics of Living Systems, Massachusetts Institute of Technology , Cambridge, MA 02139, USA
| | - Mark Gorelik
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | - Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School , Boston, MA 02115, USA
| | - Christopher Henry
- Mathematics and Computer Science Division, Argonne National Laboratory , Argonne, IL 60439, USA
| | - Geoffrey Hutinet
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | - Marshall Jaroch
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | - Peter D Karp
- Bioinformatics Research Group, SRI International , Menlo Park, CA 94025, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Maria-Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus , Hinxton CB10 1SD, UK
| | - Claire McWhite
- Lewis-Sigler Institute for Integrative Genomics, Princeton University , Princeton, NJ 08540, USA
| | - Gaurav D Moghe
- Plant Biology Section, School of Integrative Plant Science, Cornell University , Ithaca, NY 14853, USA
| | - Paul Monaghan
- Department of Agricultural Education and Communication, University of Florida , Gainesville, FL 32611, USA
| | - Anne Morgat
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire , Geneva 4 CH-1211, Switzerland
| | - Christopher J Mungall
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory , Berkeley, CA 94720, USA
| | - Darren A Natale
- Georgetown University Medical Center , Washington, DC 20007, USA
| | - William C Nelson
- Biological Sciences Division, Pacific Northwest National Laboratories , Richland, WA 99354, USA
| | - Seán O’Donoghue
- School of Biotechnology and Biomolecular Sciences, University of NSW , Sydney, NSW 2052, Australia
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London , London WC1E 6BT, UK
| | | | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University , Boston, MA 02115, USA
| | - Colbie Reed
- Department of Microbiology and Cell Sciences, University of Florida , Gainesville, FL 32611, USA
| | | | - Dmitri Rodionov
- Sanford Burnham Prebys Medical Discovery Institute , La Jolla, CA 92037, USA
| | - Irina A Rodionova
- Department of Bioengineering, Division of Engineering, University of California at San Diego , La Jolla, CA 92093-0412, USA
| | - Jeffrey D Rudolf
- Department of Chemistry, University of Florida , Gainesville, FL 32611, USA
| | - Lana Saleh
- New England Biolabs , Ipswich, MA 01938, USA
| | - Gloria Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia , Charlottesville, VA, USA
| | - Francoise Thibaud-Nissen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH) , 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California , Los Angeles, CA 90033, USA
| | - Peter Uetz
- Center for Biological Data Science, Virginia Commonwealth University , Richmond, VA 23284, USA
| | - David Vallenet
- LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS , Evry 91057, France
| | - Erica Watson Carter
- Department of Plant Pathology, University of Florida Citrus Research and Education Center , 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
| | | | - Valerie Wood
- Department of Biochemistry, University of Cambridge , Cambridge CB2 1GA, UK
| | - Elisha M Wood-Charlson
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory , Berkeley, CA 94720, USA
| | - Jin Xu
- Department of Plant Pathology, University of Florida Citrus Research and Education Center , 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
| |
Collapse
|
18
|
Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, Bateman A, DePristo MA, Colwell LJ. Using deep learning to annotate the protein universe. Nat Biotechnol 2022; 40:932-937. [PMID: 35190689 DOI: 10.1038/s41587-021-01179-w] [Citation(s) in RCA: 71] [Impact Index Per Article: 35.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2021] [Accepted: 12/02/2021] [Indexed: 12/30/2022]
Abstract
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.
Collapse
Affiliation(s)
| | | | | | - Theo Sanderson
- Google Research, Cambridge, MA, USA
- The Francis Crick Institute, London, UK
| | - Brandon Carter
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
| | | | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
| | - Mark A DePristo
- Google Research, Cambridge, MA, USA
- BigHat Biosciences, San Mateo, CA, USA
| | - Lucy J Colwell
- Google Research, Cambridge, MA, USA.
- Department of Chemistry, University of Cambridge, Cambridge, UK.
| |
Collapse
|
19
|
Monzon V, Haft DH, Bateman A. Folding the unfoldable: using AlphaFold to explore spurious proteins. Bioinform Adv 2022; 2:vbab043. [PMID: 36699409 PMCID: PMC9710616 DOI: 10.1093/bioadv/vbab043] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/22/2021] [Revised: 12/07/2021] [Accepted: 12/14/2021] [Indexed: 01/28/2023]
Abstract
Motivation The release of AlphaFold 2.0 has revolutionized our ability to determine protein structures from sequences. This tool also inadvertently opens up many unanticipated opportunities. In this article, we investigate the AntiFam resource, which contains 250 protein sequence families that we believe to be spurious protein translations. We would not expect proteins belonging to these families to fold into well-ordered globular structures. To test this hypothesis, we have attempted to computationally determine the structure of a representative sequence from all AntiFam 6.0 families. Results Although the large majority of families showed no evidence of globular structure, we have identified one example for which a globular structure is predicted. Proteins in this AntiFam entry indeed seem likely to be bona fide proteins, based on additional considerations, and thus AlphaFold provides a useful quality control for the AntiFam database. Conversely, known spurious proteins offer useful set of quality controls for AlphaFold. We have identified a trend that the mean structure prediction confidence score pLDDT is higher for shorter sequences. Of the 131 AntiFam representative sequences <100 amino acids in length, AlphaFold predicts a mean pLDDT of 80 or greater for six of them. Thus, particular care should be taken when applying AlphaFold to short protein sequences. Availability and implementation The AlphaFold predictions for representative sequences can be found at the following URL: https://drive.google.com/drive/folders/1u9OocRIAabGQn56GljoG1JTDAxjkY1ro. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Vivian Monzon
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Daniel H Haft
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA
| | - Alex Bateman
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA,To whom correspondence should be addressed.
| |
Collapse
|
20
|
Cantelli G, Bateman A, Brooksbank C, Petrov AI, Malik-Sheriff R, Ide-Smith M, Hermjakob H, Flicek P, Apweiler R, Birney E, McEntyre J. The European Bioinformatics Institute (EMBL-EBI) in 2021. Nucleic Acids Res 2022; 50:D11-D19. [PMID: 34850134 PMCID: PMC8690175 DOI: 10.1093/nar/gkab1127] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2021] [Revised: 10/14/2021] [Accepted: 11/23/2021] [Indexed: 11/28/2022] Open
Abstract
The European Bioinformatics Institute (EMBL-EBI) maintains a comprehensive range of freely available and up-to-date molecular data resources, which includes over 40 resources covering every major data type in the life sciences. This year's service update for EMBL-EBI includes new resources, PGS Catalog and AlphaFold DB, and updates on existing resources, including the COVID-19 Data Platform, trRosetta and RoseTTAfold models introduced in Pfam and InterPro, and the launch of Genome Integrations with Function and Sequence by UniProt and Ensembl. Furthermore, we highlight projects through which EMBL-EBI has contributed to the development of community-driven data standards and guidelines, including the Recommended Metadata for Biological Images (REMBI), and the BioModels Reproducibility Scorecard. Training is one of EMBL-EBI's core missions and a key component of the provision of bioinformatics services to users: this year's update includes many of the improvements that have been developed to EMBL-EBI's online training offering.
Collapse
Affiliation(s)
- Gaia Cantelli
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Cath Brooksbank
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anton I Petrov
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Rahuman S Malik-Sheriff
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Michele Ide-Smith
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Henning Hermjakob
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Rolf Apweiler
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Johanna McEntyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| |
Collapse
|
21
|
Randzavola LO, Mortimer PM, Garside E, Dufficy ER, Schejtman A, Roumelioti G, Yu L, Pardo M, Spirohn K, Tolley C, Brandt C, Harcourt K, Nichols E, Nahorski M, Woods G, Williamson JC, Suresh S, Sowerby JM, Matsumoto M, Santos CXC, Kiar CS, Mukhopadhyay S, Rae WM, Dougan GJ, Grainger J, Lehner PJ, Calderwood MA, Choudhary J, Clare S, Speak A, Santilli G, Bateman A, Smith KGC, Magnani F, Thomas DC. EROS is a selective chaperone regulating the phagocyte NADPH oxidase and purinergic signalling. eLife 2022; 11:76387. [PMID: 36421765 PMCID: PMC9767466 DOI: 10.7554/elife.76387] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2021] [Accepted: 10/31/2022] [Indexed: 11/25/2022] Open
Abstract
EROS (essential for reactive oxygen species) protein is indispensable for expression of gp91phox, the catalytic core of the phagocyte NADPH oxidase. EROS deficiency in humans is a novel cause of the severe immunodeficiency, chronic granulomatous disease, but its mechanism of action was unknown until now. We elucidate the role of EROS, showing it acts at the earliest stages of gp91phox maturation. It binds the immature 58 kDa gp91phox directly, preventing gp91phox degradation and allowing glycosylation via the oligosaccharyltransferase machinery and the incorporation of the heme prosthetic groups essential for catalysis. EROS also regulates the purine receptors P2X7 and P2X1 through direct interactions, and P2X7 is almost absent in EROS-deficient mouse and human primary cells. Accordingly, lack of murine EROS results in markedly abnormal P2X7 signalling, inflammasome activation, and T cell responses. The loss of both ROS and P2X7 signalling leads to resistance to influenza infection in mice. Our work identifies EROS as a highly selective chaperone for key proteins in innate and adaptive immunity and a rheostat for immunity to infection. It has profound implications for our understanding of immune physiology, ROS dysregulation, and possibly gene therapy.
Collapse
Affiliation(s)
- Lyra O Randzavola
- Department of Immunology and Inflammation, Centre for Inflammatory Disease, Imperial College LondonLondonUnited Kingdom
| | - Paige M Mortimer
- Department of Immunology and Inflammation, Centre for Inflammatory Disease, Imperial College LondonLondonUnited Kingdom
| | - Emma Garside
- Department of Immunology and Inflammation, Centre for Inflammatory Disease, Imperial College LondonLondonUnited Kingdom
| | - Elizabeth R Dufficy
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom
| | - Andrea Schejtman
- Molecular Immunology Unit, UCL Great Ormond Street Institute of Child HealthLondonUnited Kingdom
| | - Georgia Roumelioti
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer ResearchLondonUnited Kingdom
| | - Lu Yu
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer ResearchLondonUnited Kingdom
| | - Mercedes Pardo
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer ResearchLondonUnited Kingdom
| | - Kerstin Spirohn
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer InstituteBostonUnited States,Department of Genetics, Blavatnik Institute, Harvard Medical SchoolBostonUnited States,Department of Cancer Biology, Dana-Farber Cancer InstituteBostonUnited States
| | | | | | | | - Esme Nichols
- Department of Immunology and Inflammation, Centre for Inflammatory Disease, Imperial College LondonLondonUnited Kingdom
| | - Mike Nahorski
- Cambridge Institute of Medical Research, University of CambridgeCambridgeUnited Kingdom
| | - Geoff Woods
- Cambridge Institute of Medical Research, University of CambridgeCambridgeUnited Kingdom
| | - James C Williamson
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom,Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre Cambridge Biomedical CampusCambridgeUnited Kingdom
| | - Shreehari Suresh
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom
| | - John M Sowerby
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom,Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre Cambridge Biomedical CampusCambridgeUnited Kingdom
| | - Misaki Matsumoto
- Department of Pharmacology, Kyoto Prefectural University of MedicineKyotoJapan
| | - Celio XC Santos
- School of Cardiovascular Medicine and Sciences, James Black Centre, King's College LondonLondonUnited Kingdom
| | - Cher Shen Kiar
- Peter Gorer Department of Immunobiology, School of Immunology & Microbial Sciences, King's College LondonLondonUnited Kingdom
| | - Subhankar Mukhopadhyay
- Peter Gorer Department of Immunobiology, School of Immunology & Microbial Sciences, King's College LondonLondonUnited Kingdom
| | - William M Rae
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom,Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre Cambridge Biomedical CampusCambridgeUnited Kingdom
| | - Gordon J Dougan
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom
| | - John Grainger
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer ResearchLondonUnited Kingdom,Lydia Becker Institute of Immunology and Inflammation, Faculty of Biology, Medicine and Health, University of ManchesterManchesterUnited Kingdom
| | - Paul J Lehner
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom,Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre Cambridge Biomedical CampusCambridgeUnited Kingdom
| | - Michael A Calderwood
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer InstituteBostonUnited States,Department of Genetics, Blavatnik Institute, Harvard Medical SchoolBostonUnited States,Department of Cancer Biology, Dana-Farber Cancer InstituteBostonUnited States
| | - Jyoti Choudhary
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer ResearchLondonUnited Kingdom
| | - Simon Clare
- Wellcome Trust Sanger InstituteHinxtonUnited Kingdom
| | | | - Giorgia Santilli
- Molecular Immunology Unit, UCL Great Ormond Street Institute of Child HealthLondonUnited Kingdom
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome CampusHinxtonUnited Kingdom
| | - Kenneth GC Smith
- The Department of Medicine, University of Cambridge School of Clinical MedicineCambridgeUnited Kingdom,Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre Cambridge Biomedical CampusCambridgeUnited Kingdom
| | - Francesca Magnani
- Department of Biology and Biotechnology, University of PaviaPaviaItaly
| | - David C Thomas
- Department of Immunology and Inflammation, Centre for Inflammatory Disease, Imperial College LondonLondonUnited Kingdom
| |
Collapse
|
22
|
Bhome R, Karavias D, Armstrong T, Hamady Z, Arshad A, Primrose J, Bateman A, Pearce N, Takhar A. Intraoperative radiotherapy for pancreatic cancer: implementation and initial experience. Br J Surg 2021; 108:e400-e401. [PMID: 34586375 DOI: 10.1093/bjs/znab335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2021] [Accepted: 09/01/2021] [Indexed: 11/13/2022]
Abstract
This article reports on the first series of patients to receive intraoperative radiotherapy for pancreatic cancer in the UK. The data suggest that this treatment modality is feasible and safe, laying a platform for collaborative multicentre trials to better assess efficacy.
Collapse
Affiliation(s)
- R Bhome
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK.,Cancer Sciences, University of Southampton, Southampton, UK
| | - D Karavias
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK
| | - T Armstrong
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK
| | - Z Hamady
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK
| | - A Arshad
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK
| | - J Primrose
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK.,Cancer Sciences, University of Southampton, Southampton, UK
| | - A Bateman
- Clinical Oncology, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK
| | - N Pearce
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK
| | - A Takhar
- Hepatopancreatobiliary Unit, University Hospitals Southampton NHS Trust, Southampton General Hospital, Southampton, UK
| |
Collapse
|
23
|
Abstract
Non-coding RNAs are essential for all life and carry out a wide range of functions. Information about these molecules is distributed across dozens of specialized resources. RNAcentral is a database of non-coding RNA sequences that provides a unified access point to non-coding RNA annotations from >40 member databases and helps provide insight into the function of these RNAs. This article describes different ways of accessing the data, including searching the website and retrieving the data programmatically over web APIs and a public database. We also demonstrate an example Galaxy workflow for using RNAcentral for RNA-seq differential expression analysis. RNAcentral is available at https://rnacentral.org. © 2020 The Authors. Basic Protocol 1: Viewing RNAcentral sequence reports Basic Protocol 2: Using RNAcentral text search to explore ncRNA sequences Basic Protocol 3: Using RNAcentral sequence search Basic Protocol 4: Using RNAcentral FTP archive Support Protocol 1: Using web APIs for programmatic data access Support Protocol 2: Using public Postgres database to export large datasets Support Protocol 3: Analyze non-coding RNA in RNA-seq datasets using RNAcentral and Galaxy.
Collapse
Affiliation(s)
- Blake A Sweeney
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Arina A Tagmazian
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.,Federal State Budget Scientific Institution Center of Experimental Embryology and Reproductive Biotechnologies, Moscow, Russia
| | - Carlos E Ribas
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Anton I Petrov
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| |
Collapse
|
24
|
Affiliation(s)
- Loïc Lannelongue
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, United Kingdom
| | - Jason Grealey
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- Department of Mathematics and Statistics, La Trobe University, Melbourne, Australia
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
| | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, United Kingdom
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, United Kingdom
- The Alan Turing Institute, London, United Kingdom
| |
Collapse
|
25
|
Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, Bridgland A, Cowie A, Meyer C, Laydon A, Velankar S, Kleywegt GJ, Bateman A, Evans R, Pritzel A, Figurnov M, Ronneberger O, Bates R, Kohl SAA, Potapenko A, Ballard AJ, Romera-Paredes B, Nikolov S, Jain R, Clancy E, Reiman D, Petersen S, Senior AW, Kavukcuoglu K, Birney E, Kohli P, Jumper J, Hassabis D. Highly accurate protein structure prediction for the human proteome. Nature 2021; 596:590-596. [PMID: 34293799 PMCID: PMC8387240 DOI: 10.1038/s41586-021-03828-1] [Citation(s) in RCA: 1287] [Impact Index Per Article: 429.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2021] [Accepted: 07/16/2021] [Indexed: 02/07/2023]
Abstract
Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure1. Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold2, at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective.
Collapse
Affiliation(s)
| | | | | | | | | | | | | | | | | | | | - Sameer Velankar
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | - Gerard J Kleywegt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | | | | | | |
Collapse
|
26
|
Abstract
BACKGROUND Fibrillar adhesins are long multidomain proteins that form filamentous structures at the cell surface of bacteria. They are an important yet understudied class of proteins composed of adhesive and stalk domains that mediate interactions of bacteria with their environment. This study aims to characterize fibrillar adhesins in a wide range of bacterial phyla and to identify new fibrillar adhesin-like proteins to improve our understanding of host-bacteria interactions. RESULTS Through careful literature and computational searches, we identified 82 stalk and 27 adhesive domain families in fibrillar adhesins. Based on the presence of these domains in the UniProt Reference Proteomes database, we identified and analysed 3,542 fibrillar adhesin-like proteins across species of the most common bacterial phyla. We further enumerate the adhesive and stalk domain combinations found in nature and demonstrate that fibrillar adhesins have complex and variable domain architectures, which differ across species. By analysing the domain architecture of fibrillar adhesins, we show that in Gram positive bacteria, adhesive domains are mostly positioned at the N-terminus and cell surface anchors at the C-terminus of the protein, while their positions are more variable in Gram negative bacteria. We provide an open repository of fibrillar adhesin-like proteins and domains to enable further studies of this class of bacterial surface proteins. CONCLUSION This study provides a domain-based characterization of fibrillar adhesins and demonstrates that they are widely found in species across the main bacterial phyla. We have discovered numerous novel fibrillar adhesins and improved our understanding of pathogenic adhesion and invasion mechanisms.
Collapse
Affiliation(s)
- Vivian Monzon
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK.
| | - Aleix Lafita
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| |
Collapse
|
27
|
Mukherjee S, O'Connor H, Harman R, O'Donovan M, Debiram-Beecham I, Alias B, Bailey A, Bateman A, de Caestecker J, Crosby T, Falk S, Gollins S, Hawkins M, Levy S, Radhakrishna G, Roy R, Sripadam R, Fitzgerald R. P-109 CYTOFLOC: Evaluation of a non-endoscopic immunocytological device (Cytosponge™) for post-chemo-radiotherapy surveillance in patients with oesophageal cancer – a feasibility study. Ann Oncol 2021. [DOI: 10.1016/j.annonc.2021.05.164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022] Open
|
28
|
Smyshlyaev G, Bateman A, Barabas O. Sequence analysis of tyrosine recombinases allows annotation of mobile genetic elements in prokaryotic genomes. Mol Syst Biol 2021; 17:e9880. [PMID: 34018328 PMCID: PMC8138268 DOI: 10.15252/msb.20209880] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2020] [Revised: 04/18/2021] [Accepted: 04/20/2021] [Indexed: 11/16/2022] Open
Abstract
Mobile genetic elements (MGEs) sequester and mobilize antibiotic resistance genes across bacterial genomes. Efficient and reliable identification of such elements is necessary to follow resistance spreading. However, automated tools for MGE identification are missing. Tyrosine recombinase (YR) proteins drive MGE mobilization and could provide markers for MGE detection, but they constitute a diverse family also involved in housekeeping functions. Here, we conducted a comprehensive survey of YRs from bacterial, archaeal, and phage genomes and developed a sequence‐based classification system that dissects the characteristics of MGE‐borne YRs. We revealed that MGE‐related YRs evolved from non‐mobile YRs by acquisition of a regulatory arm‐binding domain that is essential for their mobility function. Based on these results, we further identified numerous unknown MGEs. This work provides a resource for comparative analysis and functional annotation of YRs and aids the development of computational tools for MGE annotation. Additionally, we reveal how YRs adapted to drive gene transfer across species and provide a tool to better characterize antibiotic resistance dissemination.
Collapse
Affiliation(s)
- Georgy Smyshlyaev
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK.,European Molecular Biology Laboratory (EMBL), Structural and Computational Biology Unit, Heidelberg, Germany.,Department of Molecular Biology, University of Geneva, Geneva, Switzerland
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
| | - Orsolya Barabas
- European Molecular Biology Laboratory (EMBL), Structural and Computational Biology Unit, Heidelberg, Germany.,Department of Molecular Biology, University of Geneva, Geneva, Switzerland
| |
Collapse
|
29
|
Affiliation(s)
| | - Thomas Lengauer
- Editors-in-Chief, Bioinformatics Advances,Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany,Institute of Virology, Faculty of Medicine and University Hospital Cologne, Cologne, Germany,To whom correspondence should be addressed.
| |
Collapse
|
30
|
Maishman T, Sheikh H, Boger P, Kelly J, Cozens K, Bateman A, Davies S, Fay M, Sharland D, Jackson A. A Phase II Study of Biodegradable Stents Plus Palliative Radiotherapy in Oesophageal Cancer. Clin Oncol (R Coll Radiol) 2021; 33:e225-e231. [PMID: 33402268 DOI: 10.1016/j.clon.2020.12.010] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Revised: 12/02/2020] [Accepted: 12/10/2020] [Indexed: 01/21/2023]
Abstract
AIMS Self-expanding metal stents provide rapid improvement of dysphagia in oesophageal cancer but are associated with complications. The aim of the present study was to test the effectiveness of an alternative treatment of combining biodegradable stents with radiotherapy. MATERIALS AND METHODS A Simon two-stage single-arm prospective phase II trial design was used to determine the efficacy of biodegradable stents plus radiotherapy in patients with dysphagia caused by oesophagus cancer who were unsuitable for radical treatment. Fourteen patients were recruited and data from 12 were included in the final analyses. RESULTS Five of 12 patients met the primary end point: one stent-related patient death; four further interventions for dysphagia within 16 weeks of stenting (41.7%, 95% confidence interval 15.2-72.3%). The median time to a 10-point deterioration of quality of life was 2.7 weeks. Nine patients died within 52 weeks of registration. The median time to death from any cause was 15.0 weeks (95% confidence interval 9.6-not reached). CONCLUSION The high re-intervention observed, which met the pre-defined early stopping criteria, meant that the suggested alternative treatment was not sufficiently effective to be considered for a larger scale trial design. Further work is needed to define the place of biodegradable stents in the management of malignant oesophageal strictures.
Collapse
Affiliation(s)
- T Maishman
- Southampton Clinical Trials Unit, University of Southampton, Southampton, UK
| | - H Sheikh
- The Christie NHS Foundation Trust, Manchester, UK
| | - P Boger
- University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - J Kelly
- University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - K Cozens
- Southampton Clinical Trials Unit, University of Southampton, Southampton, UK
| | - A Bateman
- University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - S Davies
- University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - M Fay
- University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - D Sharland
- University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - A Jackson
- University Hospital Southampton NHS Foundation Trust, Southampton, UK.
| |
Collapse
|
31
|
Hufsky F, Lamkiewicz K, Almeida A, Aouacheria A, Arighi C, Bateman A, Baumbach J, Beerenwinkel N, Brandt C, Cacciabue M, Chuguransky S, Drechsel O, Finn RD, Fritz A, Fuchs S, Hattab G, Hauschild AC, Heider D, Hoffmann M, Hölzer M, Hoops S, Kaderali L, Kalvari I, von Kleist M, Kmiecinski R, Kühnert D, Lasso G, Libin P, List M, Löchel HF, Martin MJ, Martin R, Matschinske J, McHardy AC, Mendes P, Mistry J, Navratil V, Nawrocki EP, O’Toole ÁN, Ontiveros-Palacios N, Petrov AI, Rangel-Pineros G, Redaschi N, Reimering S, Reinert K, Reyes A, Richardson L, Robertson DL, Sadegh S, Singer JB, Theys K, Upton C, Welzel M, Williams L, Marz M. Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research. Brief Bioinform 2021; 22:642-663. [PMID: 33147627 PMCID: PMC7665365 DOI: 10.1093/bib/bbaa232] [Citation(s) in RCA: 81] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2020] [Revised: 07/28/2020] [Accepted: 08/26/2020] [Indexed: 12/16/2022] Open
Abstract
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact:evbc@unj-jena.de.
Collapse
Affiliation(s)
| | | | | | | | | | | | | | | | - Christian Brandt
- Institute of Infectious Disease and Infection Control at Jena University Hospital, Germany
| | - Marco Cacciabue
- Consejo Nacional de Investigaciones Científicas y Tócnicas (CONICET) working on FMDV virology at the Instituto de Agrobiotecnología y Biología Molecular (IABiMo, INTA-CONICET) and at the Departamento de Ciencias Básicas, Universidad Nacional de Luján (UNLu), Argentina
| | | | - Oliver Drechsel
- bioinformatics department at the Robert Koch-Institute, Germany
| | | | - Adrian Fritz
- Computational Biology of Infection Research group of Alice C. McHardy at the Helmholtz Centre for Infection Research, Germany
| | - Stephan Fuchs
- bioinformatics department at the Robert Koch-Institute, Germany
| | - Georges Hattab
- Bioinformatics Division at Philipps-University Marburg, Germany
| | | | - Dominik Heider
- Data Science in Biomedicine at the Philipps-University of Marburg, Germany
| | | | | | - Stefan Hoops
- Biocomplexity Institute and Initiative at the University of Virginia, USA
| | - Lars Kaderali
- Bioinformatics and head of the Institute of Bioinformatics at University Medicine Greifswald, Germany
| | | | - Max von Kleist
- bioinformatics department at the Robert Koch-Institute, Germany
| | - Renó Kmiecinski
- bioinformatics department at the Robert Koch-Institute, Germany
| | | | - Gorka Lasso
- Chandran Lab, Albert Einstein College of Medicine, USA
| | | | | | | | | | | | | | - Alice C McHardy
- Computational Biology of Infection Research Lab at the Helmholtz Centre for Infection Research in Braunschweig, Germany
| | - Pedro Mendes
- Center for Quantitative Medicine of the University of Connecticut School of Medicine, USA
| | | | - Vincent Navratil
- Bioinformatics and Systems Biology at the Rhône Alpes Bioinformatics core facility, Universitó de Lyon, France
| | | | | | | | | | | | - Nicole Redaschi
- Development of the Swiss-Prot group at the SIB for UniProt and SIB resources that cover viral biology (ViralZone)
| | - Susanne Reimering
- Computational Biology of Infection Research group of Alice C. McHardy at the Helmholtz Centre for Infection Research
| | | | | | | | | | - Sepideh Sadegh
- Chair of Experimental Bioinformatics at Technical University of Munich, Germany
| | - Joshua B Singer
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, UK
| | | | - Chris Upton
- Department of Biochemistry and Microbiology, University of Victoria, Canada
| | | | | | - Manja Marz
- Friedrich Schiller University Jena, Germany
| |
Collapse
|
32
|
Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, Qureshi M, Raj S, Richardson L, Salazar GA, Williams L, Bork P, Bridge A, Gough J, Haft DH, Letunic I, Marchler-Bauer A, Mi H, Natale DA, Necci M, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A, Finn RD. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res 2021; 49:D344-D354. [PMID: 33156333 PMCID: PMC7778928 DOI: 10.1093/nar/gkaa977] [Citation(s) in RCA: 1037] [Impact Index Per Article: 345.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 10/08/2020] [Accepted: 10/23/2020] [Indexed: 01/22/2023] Open
Abstract
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
Collapse
Affiliation(s)
- Matthias Blum
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Hsin-Yu Chang
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Sara Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Tiago Grego
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Swaathi Kandasaamy
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Alex Mitchell
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Gift Nuka
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Typhaine Paysan-Lafosse
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Matloob Qureshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Shriya Raj
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Lorna Richardson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Gustavo A Salazar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Lowri Williams
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Peer Bork
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Meyerhofstraße 1, 69117 Heidelberg, Germany
| | - Alan Bridge
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Julian Gough
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
| | - Daniel H Haft
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda MD 20894 USA
| | - Ivica Letunic
- Biobyte Solutions GmbH, Bothestr 142, 69126 Heidelberg, Germany
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda MD 20894 USA
| | - Huaiyu Mi
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Darren A Natale
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Marco Necci
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35131 Padua, Italy
| | - Christine A Orengo
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Arun P Pandurangan
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
| | - Catherine Rivoire
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Christian J A Sigrist
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Ian Sillitoe
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Narmada Thanki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda MD 20894 USA
| | - Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35131 Padua, Italy
| | - Cathy H Wu
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| |
Collapse
|
33
|
Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. Pfam: The protein families database in 2021. Nucleic Acids Res 2021; 49:D412-D419. [PMID: 33125078 PMCID: PMC7779014 DOI: 10.1093/nar/gkaa913] [Citation(s) in RCA: 2287] [Impact Index Per Article: 762.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2020] [Revised: 10/01/2020] [Accepted: 10/06/2020] [Indexed: 12/19/2022] Open
Abstract
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.
Collapse
Affiliation(s)
- Jaina Mistry
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Sara Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Lowri Williams
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Matloob Qureshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Gustavo A Salazar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
| | - Lisanna Paladin
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
| | - Shriya Raj
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Lorna J Richardson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| |
Collapse
|
34
|
Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, Griffiths-Jones S, Toffano-Nioche C, Gautheret D, Weinberg Z, Rivas E, Eddy SR, Finn RD, Bateman A, Petrov AI. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 2021; 49:D192-D200. [PMID: 33211869 PMCID: PMC7779021 DOI: 10.1093/nar/gkaa1047] [Citation(s) in RCA: 336] [Impact Index Per Article: 112.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Revised: 10/14/2020] [Accepted: 10/21/2020] [Indexed: 12/15/2022] Open
Abstract
Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.
Collapse
Affiliation(s)
- Ioanna Kalvari
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Eric P Nawrocki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Nancy Ontiveros-Palacios
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Joanna Argasinska
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Kevin Lamkiewicz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
| | - Manja Marz
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany.,European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
| | - Sam Griffiths-Jones
- Faculty of Biology, Medicine and Health, University of Manchester, Oxford Road, Manchester, M13 9PT, UK
| | - Claire Toffano-Nioche
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| | - Daniel Gautheret
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
| | - Zasha Weinberg
- Bioinformatics Group, Department of Computer Science and Interdisciplinary Centre for Bioinformatics, Leipzig University, 04107 Leipzig, Germany
| | - Elena Rivas
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
| | - Sean R Eddy
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA.,Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA.,John A. Paulson School of Engineering and Applied Science, Harvard University, Cambridge, MA 02138, USA
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anton I Petrov
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| |
Collapse
|
35
|
Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. Pfam: The protein families database in 2021. Nucleic Acids Res 2021. [PMID: 33125078 DOI: 10.6019/tol.pfam_fams-t.2018.00001.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/14/2023] Open
Abstract
The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.
Collapse
Affiliation(s)
- Jaina Mistry
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Sara Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Lowri Williams
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Matloob Qureshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Gustavo A Salazar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
| | - Lisanna Paladin
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
| | - Shriya Raj
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Lorna J Richardson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| |
Collapse
|
36
|
Bateman A, Martin MJ, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bursteinas B, Bye-A-Jee H, Coetzee R, Cukura A, Da Silva A, Denny P, Dogan T, Ebenezer T, Fan J, Castro LG, Garmiri P, Georghiou G, Gonzales L, Hatton-Ellis E, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Jokinen P, Joshi V, Jyothi D, Lock A, Lopez R, Luciani A, Luo J, Lussi Y, MacDougall A, Madeira F, Mahmoudy M, Menchi M, Mishra A, Moulang K, Nightingale A, Oliveira CS, Pundir S, Qi G, Raj S, Rice D, Lopez MR, Saidi R, Sampson J, Sawford T, Speretta E, Turner E, Tyagi N, Vasudev P, Volynkin V, Warner K, Watkins X, Zaru R, Zellner H, Bridge A, Poux S, Redaschi N, Aimo L, Argoud-Puy G, Auchincloss A, Axelsen K, Bansal P, Baratin D, Blatter MC, Bolleman J, Boutet E, Breuza L, Casals-Casas C, de Castro E, Echioukh KC, Coudert E, Cuche B, Doche M, Dornevil D, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gehant S, Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Hyka-Nouspikel N, Jungo F, Keller G, Kerhornou A, Lara V, Le Mercier P, Lieberherr D, Lombardot T, Martin X, Masson P, Morgat A, Neto TB, Paesano S, Pedruzzi I, Pilbout S, Pourcel L, Pozzato M, Pruess M, Rivoire C, Sigrist C, Sonesson K, Stutz A, Sundaram S, Tognolli M, Verbregue L, Wu CH, Arighi CN, Arminski L, Chen C, Chen Y, Garavelli JS, Huang H, Laiho K, McGarvey P, Natale DA, Ross K, Vinayaka CR, Wang Q, Wang Y, Yeh LS, Zhang J, Ruch P, Teodoro D. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 2021; 49:D480-D489. [PMID: 33237286 PMCID: PMC7778908 DOI: 10.1093/nar/gkaa1100] [Citation(s) in RCA: 3474] [Impact Index Per Article: 1158.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Revised: 10/21/2020] [Accepted: 11/02/2020] [Indexed: 02/07/2023] Open
Abstract
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately 190 million, despite continued work to reduce sequence redundancy at the proteome level. We have adopted new methods of assessing proteome completeness and quality. We continue to extract detailed annotations from the literature to add to reviewed entries and supplement these in unreviewed entries with annotations provided by automated systems such as the newly implemented Association-Rule-Based Annotator (ARBA). We have developed a credit-based publication submission interface to allow the community to contribute publications and annotations to UniProt entries. We describe how UniProtKB responded to the COVID-19 pandemic through expert curation of relevant entries that were rapidly made available to the research community through a dedicated portal. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
Collapse
|
37
|
Sweeney BA, Petrov AI, Ribas CE, Finn RD, Bateman A, Szymanski M, Karlowski WM, Seemann SE, Gorodkin J, Cannone JJ, Gutell RR, Kay S, Marygold S, dos Santos G, Frankish A, Mudge JM, Barshir R, Fishilevich S, Chan PP, Lowe TM, Seal R, Bruford E, Panni S, Porras P, Karagkouni D, Hatzigeorgiou AG, Ma L, Zhang Z, Volders PJ, Mestdagh P, Griffiths-Jones S, Fromm B, Peterson KJ, Kalvari I, Nawrocki EP, Petrov AS, Weng S, Bouchard-Bourelle P, Scott M, Lui LM, Hoksza D, Lovering RC, Kramarz B, Mani P, Ramachandran S, Weinberg Z. RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res 2021; 49:D212-D220. [PMID: 33106848 PMCID: PMC7779037 DOI: 10.1093/nar/gkaa921] [Citation(s) in RCA: 115] [Impact Index Per Article: 38.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Accepted: 10/05/2020] [Indexed: 12/16/2022] Open
Abstract
RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences that provides a single access point to 44 RNA resources and >18 million ncRNA sequences from a wide range of organisms and RNA types. RNAcentral now also includes secondary (2D) structure information for >13 million sequences, making RNAcentral the world's largest RNA 2D structure database. The 2D diagrams are displayed using R2DT, a new 2D structure visualization method that uses consistent, reproducible and recognizable layouts for related RNAs. The sequence similarity search has been updated with a faster interface featuring facets for filtering search results by RNA type, organism, source database or any keyword. This sequence search tool is available as a reusable web component, and has been integrated into several RNAcentral member databases, including Rfam, miRBase and snoDB. To allow for a more fine-grained assignment of RNA types and subtypes, all RNAcentral sequences have been annotated with Sequence Ontology terms. The RNAcentral database continues to grow and provide a central data resource for the RNA community. RNAcentral is freely available at https://rnacentral.org.
Collapse
|
38
|
Rawlings ND, Bateman A. How to use the MEROPS database and website to help understand peptidase specificity. Protein Sci 2020; 30:83-92. [PMID: 32920969 PMCID: PMC7737757 DOI: 10.1002/pro.3948] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2020] [Revised: 09/09/2020] [Accepted: 09/11/2020] [Indexed: 01/23/2023]
Abstract
The MEROPS website (https://www.ebi.ac.uk/merops) and database was established in 1996 to present the classification and nomenclature of proteolytic enzymes. This was expanded to include a classification of protein inhibitors of proteolytic enzymes in 2004. Each peptidase or inhibitor is assigned to a distinct identifier, based on its biochemical and biological properties, and homologous sequences are assembled into a family. Families in which the proteins share similar tertiary structures are assembled into a clan. The MEROPS classification is thus a hierarchy with at least three levels (protein-species, family, and clan) showing the evolutionary relationship. Several other data collections have been assembled, which are accessed from all levels in the hierarchy. These include, sequence homologs, selective bibliographies, substrate cleavage sites, peptidase-inhibitor interactions, alignments, and phylogenetic trees. The substrate cleavage collection has been assembled from the literature and includes physiological, pathological, and nonphysiological cleavages in proteins, peptides, and synthetic substrates. In this article, we make recommendations about how best to analyze these data and show analyses to indicate peptidase binding site preferences and exclusions. We also identify peptidases where co-operative binding occurs between adjacent binding sites.
Collapse
Affiliation(s)
- Neil D Rawlings
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
| |
Collapse
|
39
|
Walton SJ, Wang H, Quintero-Cadena P, Bateman A, Sternberg PW. Caenorhabditis elegans AF4/FMR2 Family Homolog affl-2 Regulates Heat-Shock-Induced Gene Expression. Genetics 2020; 215:1039-1054. [PMID: 32518061 PMCID: PMC7404228 DOI: 10.1534/genetics.120.302923] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2019] [Accepted: 05/27/2020] [Indexed: 02/08/2023] Open
Abstract
To mitigate the deleterious effects of temperature increases on cellular organization and proteotoxicity, organisms have developed mechanisms to respond to heat stress. In eukaryotes, HSF1 is the master regulator of the heat shock transcriptional response, but the heat shock response pathway is not yet fully understood. From a forward genetic screen for suppressors of heat-shock-induced gene expression in Caenorhabditis elegans, we found a new allele of hsf-1 that alters its DNA-binding domain, and we found three additional alleles of sup-45, a previously molecularly uncharacterized genetic locus. We identified sup-45 as one of the two hitherto unknown C. elegans orthologs of the human AF4/FMR2 family proteins, which are involved in regulation of transcriptional elongation rate. We thus renamed sup-45 as affl-2 (AF4/FMR2-Like). Through RNA-seq, we demonstrated that affl-2 mutants are deficient in heat-shock-induced transcription. Additionally, affl-2 mutants have herniated intestines, while worms lacking its sole paralog (affl-1) appear wild type. AFFL-2 is a broadly expressed nuclear protein, and nuclear localization of AFFL-2 is necessary for its role in heat shock response. affl-2 and its paralog are not essential for proper HSF-1 expression and localization after heat shock, which suggests that affl-2 may function downstream of, or parallel to, hsf-1 Our characterization of affl-2 provides insights into the regulation of heat-shock-induced gene expression to protect against heat stress.
Collapse
Affiliation(s)
- Sophie J Walton
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125
| | - Han Wang
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125
| | - Porfirio Quintero-Cadena
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Paul W Sternberg
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125
| |
Collapse
|
40
|
Lafita A, Tian P, Best RB, Bateman A. TADOSS: computational estimation of tandem domain swap stability. Bioinformatics 2020; 35:2507-2508. [PMID: 30500878 PMCID: PMC6612889 DOI: 10.1093/bioinformatics/bty974] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Revised: 10/13/2018] [Accepted: 11/28/2018] [Indexed: 11/17/2022] Open
Abstract
Summary Proteins with highly similar tandem domains have shown an increased propensity for misfolding and aggregation. Several molecular explanations have been put forward, such as swapping of adjacent domains, but there is a lack of computational tools to systematically analyze them. We present the TAndem DOmain Swap Stability predictor (TADOSS), a method to computationally estimate the stability of tandem domain-swapped conformations from the structures of single domains, based on previous coarse-grained simulation studies. The tool is able to discriminate domains susceptible to domain swapping and to identify structural regions with high propensity to form hinge loops. TADOSS is a scalable method and suitable for large scale analyses. Availability and implementation Source code and documentation are freely available under an MIT license on GitHub at https://github.com/lafita/tadoss. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Aleix Lafita
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Pengfei Tian
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
| | - Robert B Best
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| |
Collapse
|
41
|
El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, Sonnhammer ELL, Hirsh L, Paladin L, Piovesan D, Tosatto SCE, Finn RD. The Pfam protein families database in 2019. Nucleic Acids Res 2020; 47:D427-D432. [PMID: 30357350 PMCID: PMC6324024 DOI: 10.1093/nar/gky995] [Citation(s) in RCA: 2808] [Impact Index Per Article: 702.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Accepted: 10/09/2018] [Indexed: 12/11/2022] Open
Abstract
The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors’ ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.
Collapse
Affiliation(s)
- Sara El-Gebali
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jaina Mistry
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sean R Eddy
- HHMI, Harvard University, 16 Divinity Ave Cambridge, MA 02138 USA
| | - Aurélien Luciani
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Simon C Potter
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Matloob Qureshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Lorna J Richardson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Gustavo A Salazar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alfredo Smart
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Erik L L Sonnhammer
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, 17121 Solna, Sweden
| | - Layla Hirsh
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy.,Dept. of Engineering, Pontificia Universidad Católica del Perú 1801, San Miguel 15088, Lima, Perú
| | - Lisanna Paladin
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| |
Collapse
|
42
|
Xu ER, Lafita A, Bateman A, Hyvönen M. The thrombospondin module 1 domain of the matricellular protein CCN3 shows an atypical disulfide pattern and incomplete CWR layers. Acta Crystallogr D Struct Biol 2020; 76:124-134. [PMID: 32038043 DOI: 10.1107/s2059798319016747] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/26/2019] [Accepted: 12/14/2019] [Indexed: 05/04/2023]
Abstract
The members of the CCN (Cyr61/CTGF/Nov) family are a group of matricellular regulatory proteins that are essential to a wide range of functional pathways in cell signalling. Through interacting with extracellular matrix components and growth factors via one of their four domains, the CCN proteins are involved in critical biological processes such as angiogenesis, cell proliferation, bone development, fibrogenesis and tumorigenesis. Here, the crystal structure of the thrombospondin module 1 (TSP1) domain of CCN3 (previously known as Nov) is presented, which shares a similar three-stranded fold with the thrombospondin type 1 repeats of thrombospondin-1 and spondin-1, but with variations in the disulfide connectivity. Moreover, the CCN3 TSP1 domain lacks the typical π-stacked ladder of charged and aromatic residues on one side of the domain that is seen in other TSP1 domains. Using conservation analysis among orthologous domains, it is shown that a charged cluster in the centre of the domain is the most conserved site and this cluster is predicted to be a potential functional epitope for heparan sulfate binding. This variant TSP1 domain has also been used to revise the sequence determinants of TSP1 domains and to derive improved Pfam sequence profiles for the identification of novel TSP1 domains in more than 10 000 proteins across diverse phyla.
Collapse
Affiliation(s)
- Emma Ruoqi Xu
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, England
| | - Aleix Lafita
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, England
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, England
| | - Marko Hyvönen
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, England
| |
Collapse
|
43
|
Whelan F, Lafita A, Griffiths SC, Cooper REM, Whittingham JL, Turkenburg JP, Manfield IW, St. John AN, Paci E, Bateman A, Potts JR. Defining the remarkable structural malleability of a bacterial surface protein Rib domain implicated in infection. Proc Natl Acad Sci U S A 2019; 116:26540-26548. [PMID: 31818940 PMCID: PMC6936399 DOI: 10.1073/pnas.1911776116] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Streptococcus groups A and B cause serious infections, including early onset sepsis and meningitis in newborns. Rib domain-containing surface proteins are found associated with invasive strains and elicit protective immunity in animal models. Yet, despite their apparent importance in infection, the structure of the Rib domain was previously unknown. Structures of single Rib domains of differing length reveal a rare case of domain atrophy through deletion of 2 core antiparallel strands, resulting in the loss of an entire sheet of the β-sandwich from an immunoglobulin-like fold. Previously, observed variation in the number of Rib domains within these bacterial cell wall-attached proteins has been suggested as a mechanism of immune evasion. Here, the structure of tandem domains, combined with molecular dynamics simulations and small angle X-ray scattering, suggests that variability in Rib domain number would result in differential projection of an N-terminal host-colonization domain from the bacterial surface. The identification of 2 further structures where the typical B-D-E immunoglobulin β-sheet is replaced with an α-helix further confirms the extensive structural malleability of the Rib domain.
Collapse
Affiliation(s)
- Fiona Whelan
- Department of Biology, The University of York, YO10 5DD York, United Kingdom
| | - Aleix Lafita
- European Molecular Biology Laboratory, European Bioinformatics Institute, CB10 1SD Hinxton, United Kingdom
| | - Samuel C. Griffiths
- Department of Biology, The University of York, YO10 5DD York, United Kingdom
| | | | - Jean L. Whittingham
- York Structural Biology Laboratory, Department of Chemistry, The University of York, YO10 5DD York, United Kingdom
| | - Johan P. Turkenburg
- York Structural Biology Laboratory, Department of Chemistry, The University of York, YO10 5DD York, United Kingdom
| | - Iain W. Manfield
- Astbury Centre for Structural Molecular Biology, The University of Leeds, LS2 9JT Leeds, United Kingdom
| | - Alexander N. St. John
- Astbury Centre for Structural Molecular Biology, The University of Leeds, LS2 9JT Leeds, United Kingdom
| | - Emanuele Paci
- Astbury Centre for Structural Molecular Biology, The University of Leeds, LS2 9JT Leeds, United Kingdom
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, CB10 1SD Hinxton, United Kingdom
| | - Jennifer R. Potts
- Department of Biology, The University of York, YO10 5DD York, United Kingdom
| |
Collapse
|
44
|
Bortnov V, Tonelli M, Lee W, Lin Z, Annis DS, Demerdash ON, Bateman A, Mitchell JC, Ge Y, Markley JL, Mosher DF. Solution structure of human myeloid-derived growth factor suggests a conserved function in the endoplasmic reticulum. Nat Commun 2019; 10:5612. [PMID: 31819058 PMCID: PMC6901522 DOI: 10.1038/s41467-019-13577-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2019] [Accepted: 11/13/2019] [Indexed: 12/12/2022] Open
Abstract
Human myeloid-derived growth factor (hMYDGF) is a 142-residue protein with a C-terminal endoplasmic reticulum (ER) retention sequence (ERS). Extracellular MYDGF mediates cardiac repair in mice after anoxic injury. Although homologs of hMYDGF are found in eukaryotes as distant as protozoans, its structure and function are unknown. Here we present the NMR solution structure of hMYDGF, which consists of a short α-helix and ten β-strands distributed in three β-sheets. Conserved residues map to the unstructured ERS, loops on the face opposite the ERS, and the surface of a cavity underneath the conserved loops. The only protein or portion of a protein known to have a similar fold is the base domain of VNN1. We suggest, in analogy to the tethering of the VNN1 nitrilase domain to the plasma membrane via its base domain, that MYDGF complexed to the KDEL receptor binds cargo via its conserved residues for transport to the ER.
Collapse
Affiliation(s)
- Valeriu Bortnov
- Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - Marco Tonelli
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA
- National Magnetic Resonance Facility at Madison, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - Woonghee Lee
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA
- National Magnetic Resonance Facility at Madison, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - Ziqing Lin
- Departments of Cell and Regenerative Biology and Chemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Human Proteomics Program, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - Douglas S Annis
- Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - Omar N Demerdash
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
| | - Julie C Mitchell
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
| | - Ying Ge
- Departments of Cell and Regenerative Biology and Chemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Human Proteomics Program, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - John L Markley
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA
- National Magnetic Resonance Facility at Madison, University of Wisconsin-Madison, Madison, WI, 53706, USA
| | - Deane F Mosher
- Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI, 53706, USA.
- Department of Medicine, University of Wisconsin-Madison, Madison, WI, 53706, USA.
| |
Collapse
|
45
|
Tørresen OK, Star B, Mier P, Andrade-Navarro MA, Bateman A, Jarnot P, Gruca A, Grynberg M, Kajava AV, Promponas VJ, Anisimova M, Jakobsen KS, Linke D. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res 2019; 47:10994-11006. [PMID: 31584084 PMCID: PMC6868369 DOI: 10.1093/nar/gkz841] [Citation(s) in RCA: 147] [Impact Index Per Article: 29.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2019] [Revised: 09/03/2019] [Accepted: 10/01/2019] [Indexed: 12/13/2022] Open
Abstract
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
Collapse
Affiliation(s)
- Ole K Tørresen
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
| | - Bastiaan Star
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
| | - Pablo Mier
- Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Husch-Weg 15, 55128 Mainz, Germany
| | - Miguel A Andrade-Navarro
- Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Husch-Weg 15, 55128 Mainz, Germany
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton. CB10 1SD, UK
| | - Patryk Jarnot
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| | - Aleksandra Gruca
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| | - Marcin Grynberg
- Institute of Biochemistry and Biophysics PAS, Pawińskiego 5A, 02-106 Warsaw, Poland
| | - Andrey V Kajava
- Centre de Recherche en Biologie cellulaire de Montpellier, UMR 5237 CNRS, Universite Montpellier 1919 Route de Mende, CEDEX 5, 34293 Montpellier, France
- Institut de Biologie Computationnelle, 34095 Montpellier, France
| | - Vasilis J Promponas
- Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, PO Box 20537, CY 1678 Nicosia, Cyprus
| | - Maria Anisimova
- Institute of Applied Simulations, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland
- Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Kjetill S Jakobsen
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
| | - Dirk Linke
- Section for Genetics and Evolutionary Biology, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
| |
Collapse
|
46
|
Rangarajan K, Pucher PH, Armstrong T, Bateman A, Hamady ZZR. Systemic neoadjuvant chemotherapy in modern pancreatic cancer treatment: a systematic review and meta-analysis. Ann R Coll Surg Engl 2019; 101:453-462. [PMID: 31304767 PMCID: PMC6667953 DOI: 10.1308/rcsann.2019.0060] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/04/2018] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Pancreatic ductal adenocarcinoma remains a disease with a poor prognosis despite advances in surgery and systemic therapies. Neoadjuvant therapy strategies are a promising alternative to adjuvant chemotherapy. However, their role remains controversial. This meta-analysis aims to clarify the benefits of neoadjuvant therapy in resectable pancreatic ductal adenocarcinoma. METHODS Eligible studies were identified from MEDLINE, Embase, Web of Science and the Cochrane Library. Studies comparing neoadjuvant therapy with a surgery first approach (with or without adjuvant therapy) in resectable pancreatic ductal adenocarcinoma were included. The primary outcome assessed was overall survival. A random-effects meta-analysis was performed, together with pooling of unadjusted Kaplan-Meier curve data. RESULTS A total of 533 studies were identified that analysed the effect of neoadjuvant therapy in pancreatic ductal adenocarcinoma. Twenty-seven studies were included in the final data synthesis. Meta-analysis suggested beneficial effects of neoadjuvant therapy with prolonged survival compared with a surgery-first approach, (hazard ratio 0.72, 95% confidence interval 0.69-0.76). In addition, R0 resection rates were significantly higher in patients receiving neoadjuvant therapy (relative risk 0.51, 95% confidence interval 0.47-0.55). Individual patient data analysis suggested that overall survival was better for patients receiving neoadjuvant therapy (P = 0.008). CONCLUSIONS Current evidence suggests that neoadjuvant chemotherapy has a beneficial effect on overall survival in resectable pancreatic ductal adenocarcinoma in comparison with upfront surgery and adjuvant therapy. Further trials are needed to address the need for practice change.
Collapse
Affiliation(s)
- K Rangarajan
- Department of Surgery, University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - PH Pucher
- Department of Surgery, University Hospital Southampton NHS Foundation Trust, Southampton, UK
- Department of Surgery, St Mary’s Hospital, Imperial College London, Southampton, UK
| | - T Armstrong
- Department of Surgery, University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - A Bateman
- Department of Clinical Oncology, University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - ZZR Hamady
- Department of Surgery, University Hospital Southampton NHS Foundation Trust, Southampton, UK
| |
Collapse
|
47
|
Rawlings ND, Barrett AJ, Thomas PD, Huang X, Bateman A, Finn RD. The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res 2019; 46:D624-D632. [PMID: 29145643 PMCID: PMC5753285 DOI: 10.1093/nar/gkx1134] [Citation(s) in RCA: 928] [Impact Index Per Article: 185.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2017] [Accepted: 10/30/2017] [Indexed: 12/15/2022] Open
Abstract
The MEROPS database (http://www.ebi.ac.uk/merops/) is an integrated source of information about peptidases, their substrates and inhibitors. The hierarchical classification is: protein-species, family, clan, with an identifier at each level. The MEROPS website moved to the EMBL-EBI in 2017, requiring refactoring of the code-base and services provided. The interface to sequence searching has changed and the MEROPS protein sequence libraries can be searched at the EMBL-EBI with HMMER, FastA and BLASTP. Cross-references have been established between MEROPS and the PANTHER database at both the family and protein-species level, which will help to improve curation and coverage between the resources. Because of the increasing size of the MEROPS sequence collection, in future only sequences of characterized proteins, and from completely sequenced genomes of organisms of evolutionary, medical or commercial significance will be added. As an example, peptidase homologues in four proteomes from the Asgard superphylum of Archaea have been identified and compared to other archaean, bacterial and eukaryote proteomes. This has given insights into the origins and evolution of peptidase families, including an expansion in the number of proteasome components in Asgard archaeotes and as organisms increase in complexity. Novel structures for proteasome complexes in archaea are postulated.
Collapse
Affiliation(s)
- Neil D Rawlings
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Alan J Barrett
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, 1450 Biggy St, NRT 2502, Los Angeles, CA 90033, USA
| | - Xiaosong Huang
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, 1450 Biggy St, NRT 2502, Los Angeles, CA 90033, USA
| | - Alex Bateman
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Robert D Finn
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| |
Collapse
|
48
|
Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, Bateman A, Finn RD, Petrov AI. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 2019; 46:D335-D342. [PMID: 29112718 PMCID: PMC5753348 DOI: 10.1093/nar/gkx1038] [Citation(s) in RCA: 584] [Impact Index Per Article: 116.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2017] [Accepted: 10/19/2017] [Indexed: 11/13/2022] Open
Abstract
The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.
Collapse
Affiliation(s)
- Ioanna Kalvari
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Joanna Argasinska
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | | | - Eric P Nawrocki
- National Center for Biotechnology Information; National Institutes of Health; Department of Health and Human Services; Bethesda, MD 20894, USA
| | - Elena Rivas
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
| | - Sean R Eddy
- Howard Hughes Medical Institute, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anton I Petrov
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| |
Collapse
|
49
|
Abstract
The distribution of all peptidase homologues across all phyla of organisms was analysed to determine within which kingdom each of the 271 families originated. No family was found to be ubiquitous and even peptidases thought to be essential for life, such as signal peptidase and methionyl aminopeptides are missing from some clades. There are 33 peptidase families common to archaea, bacteria and eukaryotes and are assumed to have originated in the last universal common ancestor (LUCA). These include peptidases with different catalytic types, exo- and endopeptidases, peptidases with different tertiary structures and peptidases from different families but with similar structures. This implies that the different catalytic types and structures pre-date LUCA. Other families have had their origins in the ancestors of viruses, archaea, bacteria, fungi, plants and animals, and a number of families have had their origins in the ancestors of particular phyla. The evolution of peptidases is compared to recent hypotheses about the evolution of organisms. Sequences of proteolytic enzymes can be clustered into 271 families. No family is present in all organisms. Only 33 families are predicted to originate in the last universal common ancestor. Different structures and activities predate the last universal common ancestor. Other families have originated in organism kingdoms, phyla or even families.
Collapse
Affiliation(s)
- Neil D Rawlings
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| |
Collapse
|
50
|
Lafita A, Tian P, Best RB, Bateman A. Tandem domain swapping: determinants of multidomain protein misfolding. Curr Opin Struct Biol 2019; 58:97-104. [PMID: 31260947 PMCID: PMC6863430 DOI: 10.1016/j.sbi.2019.05.012] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2019] [Accepted: 05/13/2019] [Indexed: 11/25/2022]
Abstract
Domain swapping refers to the exchange of structural elements between protein domains. Experiments show that tandem homologous domains are prone to domain swapping. Recent studies establish a framework to understand the formation of tandem domain swaps. Prediction of tandem domain swaps is possible but hindered by the amount of available data.
Tandem homologous domains in proteins are susceptible to misfolding through the formation of domain swaps, non-native conformations involving the exchange of equivalent structural elements between adjacent domains. Cutting-edge biophysical experiments have recently allowed the observation of tandem domain swapping events at the single molecule level. In addition, computer simulations have shed light into the molecular mechanisms of domain swap formation and serve as the basis for methods to systematically predict them. At present, the number of studies on tandem domain swaps is still small and limited to a few domain folds, but they offer important insights into the folding and evolution of multidomain proteins with applications in the field of protein design.
Collapse
Affiliation(s)
- Aleix Lafita
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK.
| | - Pengfei Tian
- Novozymes A/S, Krogshøjvej 36, DK-2880 Bagsværd, Denmark
| | - Robert B Best
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| |
Collapse
|