1
Kawabata T, Kinoshita K. Assessing Structural Classification Using AlphaFold2 Models Through ECOD-Based Comparative Analysis. Proteins 2025. [PMID: 40251890 DOI: 10.1002/prot.26828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Revised: 03/27/2025] [Accepted: 03/30/2025] [Indexed: 04/21/2025]
Abstract
Identifying homologous proteins is a fundamental task in structural bioinformatics. While AlphaFold2 has revolutionized protein structure prediction, the extent to which structure comparison of its models can reliably detect homologs remains unclear. In this study, we evaluate the feasibility of homology detection using AlphaFold2-predicted structures through structural comparisons. We considered the classification of the ECOD database for experimental structures as the correct standard and obtained their corresponding predicted models from AlphaFoldDB. To ensure blind assessment, we divided the structures into test and train sets according to their release date. Predicted and experimental 3D structures in the test and train sets were compared using 3D structure comparisons (MATRAS, Dali, and Foldseek) and sequence comparisons (BLAST and HHsearch). The results were evaluated based on the homology annotations in the ECOD database. For top-1 accuracy, the performance of structural comparisons was comparable to that of HHsearch. However, when considering metrics that included all structural pairs, including more remote homology, structural comparisons outperformed HHsearch. No significant differences were observed between comparisons of experimental versus experimental, predicted versus experimental, and predicted versus predicted structures with pLDDT (prediction confidence) values greater than 60. We also demonstrate that predicted structures of proteins whose experimental structures were determined by NMR had lower pLDDT values and contained fewer coils than their experimental counterparts. These findings highlight the potential of AlphaFold2 models in structural classification and suggest that 3D structural searches should be conducted not only against the PDB but also against AlphaFoldDB to identify more potential homologs.
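As an illustration of the top-1 accuracy mentioned in the abstract above, the following is a minimal sketch and not the authors' evaluation code. It assumes each query has a list of scored hits and that a hit counts as correct when it shares the query's homology group; the function name `top1_accuracy`, the identifiers, the scores, and the group labels are all hypothetical.

```python
from typing import Dict, List, Tuple


def top1_accuracy(
    hits: Dict[str, List[Tuple[str, float]]],
    group_of: Dict[str, str],
) -> float:
    """Fraction of queries whose highest-scoring hit shares the query's group."""
    if not hits:
        raise ValueError("no queries given")
    correct = 0
    for query, scored in hits.items():
        if not scored:
            continue  # a query without any hit counts as incorrect
        # Pick the hit with the highest score (higher is assumed to be better).
        best_target, _ = max(scored, key=lambda pair: pair[1])
        if group_of[best_target] == group_of[query]:
            correct += 1
    return correct / len(hits)


# Toy data: every identifier, score and group label here is made up.
group_of = {"q1": "H1", "q2": "H2", "q3": "H3", "t1": "H1", "t2": "H2", "t3": "H9"}
hits = {
    "q1": [("t1", 12.0), ("t2", 3.5)],  # best hit t1 is in the same group
    "q2": [("t3", 8.0), ("t2", 7.9)],   # best hit t3 is in a different group
    "q3": [],                           # no hit found
}
accuracy = top1_accuracy(hits, group_of)
print(accuracy)
```

Only the single best hit per query is scored here, so remote homologs further down the ranking do not affect the result; metrics over all pairs, which the abstract also mentions, would need a different calculation.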
Affiliation(s)
- Takeshi Kawabata
  - Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Kengo Kinoshita
  - Graduate School of Information Sciences, Tohoku University, Sendai, Japan
2
Schaeffer RD, Pei J, Zhang J, Cong Q, Grishin NV. Refinement and curation of homologous groups facilitated by structure prediction. Protein Sci 2025; 34:e70074. [PMID: 39968854 PMCID: PMC11836899 DOI: 10.1002/pro.70074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2024] [Revised: 01/09/2025] [Accepted: 02/05/2025] [Indexed: 02/20/2025]
Abstract
Domain classification of protein predictions released in the AlphaFold Database (AFDB) has been a recent focus of the Evolutionary Classification of protein Domains (ECOD). Although a primary focus of our recent work has been the partition and assignment of domains from these predictions, we here show how these diverse predictions can be used to examine the reference domain set more closely. Using results from DPAM, our AlphaFold-specific domain parsing algorithm, we examine hierarchical groupings that share significant levels of homologous links, both between groups that were not previously assessed to be definitively homologous and between groups that were not previously observed to share significant homologous links. Combined with manual analysis, these large datasets of structural and sequence similarities allow us to merge homologous groups in multiple cases, which we detail within. These domains tend to come from families that are small, previously had few experimental representatives, or had unknown function. The exception is the chromodomains, a large homologous group that was raised from "possibly homologous" to "definitely homologous", based on its strong homologous links to the SH3 domains, to increase the consistency of ECOD.
Affiliation(s)
- Jimin Pei
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Jing Zhang
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Qian Cong
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Nick V. Grishin
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
3
Kim G, Lee S, Levy Karin E, Kim H, Moriwaki Y, Ovchinnikov S, Steinegger M, Mirdita M. Easy and accurate protein structure prediction using ColabFold. Nat Protoc 2025; 20:620-642. [PMID: 39402428 DOI: 10.1038/s41596-024-01060-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/21/2023] [Accepted: 08/07/2024] [Indexed: 03/12/2025]
Abstract
Since its public release in 2021, AlphaFold2 (AF2) has made investigating biological questions, by using predicted protein structures of single monomers or full complexes, a common practice. ColabFold-AF2 is an open-source Jupyter Notebook inside Google Colaboratory and a command-line tool that makes it easy to use AF2 while exposing its advanced options. ColabFold-AF2 shortens turnaround times of experiments because of its optimized usage of AF2's models. In this protocol, we guide the reader through ColabFold best practices by using three scenarios: (i) monomer prediction, (ii) complex prediction and (iii) conformation sampling. The first two scenarios cover classic static structure prediction and are demonstrated on the human glycosylphosphatidylinositol transamidase protein. The third scenario demonstrates an alternative use case of the AF2 models by predicting two conformations of the human alanine serine transporter 2. Users can run the protocol without computational expertise via Google Colaboratory or in a command-line environment for advanced users. Using Google Colaboratory, it takes <2 h to run each procedure. The data and code for this protocol are available at https://protocol.colabfold.com.
Affiliation(s)
- Gyuri Kim
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
- Sewon Lee
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Hyunbin Kim
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
- Yoshitaka Moriwaki
  - Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
  - Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Tokyo, Japan
  - Department of Computational Drug Discovery and Design, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
- Martin Steinegger
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
- Milot Mirdita
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
4
Wu EJ, Kandalkar AT, Ehrmann JF, Tong AB, Zhang J, Cong Q, Wu H. A structural atlas of death domain fold proteins reveals their versatile roles in biology and function. Proc Natl Acad Sci U S A 2025; 122:e2426986122. [PMID: 39977327 PMCID: PMC11874512 DOI: 10.1073/pnas.2426986122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2024] [Accepted: 01/23/2025] [Indexed: 02/22/2025] Open
Abstract
Death domain fold (DDF) superfamily proteins are critically important players in pathways of cell death and inflammation. DDFs are often essential scaffolding domains in receptors, adaptors, or effectors of these pathways by mediating homo- and hetero-oligomerization including helical filament assembly. At the downstream ends of these pathways, effector oligomerization by DDFs brings the enzyme domains into proximity for their dimerization and activation. Hundreds of structures of these domains have been solved. However, a comprehensive understanding of DDFs is lacking. In this article, we report the curation of a DDF structural atlas as a public website (deathdomain.org) and deduce the common and distinct principles of DDF-mediated oligomerization among the four families (death domain or DD, death effector domain or DED, caspase recruitment domain or CARD, and pyrin domain or PYD). We further annotate DDFs genome-wide based on AlphaFold-predicted models and protein sequences. These studies reveal mechanistic rules for this widely distributed domain superfamily.
Affiliation(s)
- Emily J. Wu
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
  - Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
  - Saratoga High School, Saratoga, CA 95070
- Ankita T. Kandalkar
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
  - Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
  - Department of Biology, College of Science, Northeastern University, Boston, MA 02115
- Julian F. Ehrmann
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
  - Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
- Alexander B. Tong
  - Jason L. Choy Laboratory of Single-Molecule Biophysics, Institute for Quantitative Biosciences, Chemistry Graduate Group, University of California, Berkeley, CA 94720
- Jing Zhang
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Qian Cong
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Hao Wu
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
  - Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
  - Department of Biology, College of Science, Northeastern University, Boston, MA 02115
5
Schaeffer R, Medvedev K, Andreeva A, Chuguransky S, Pinto B, Zhang J, Cong Q, Bateman A, Grishin N. ECOD: integrating classifications of protein domains from experimental and predicted structures. Nucleic Acids Res 2025; 53:D411-D418. [PMID: 39565196 PMCID: PMC11701565 DOI: 10.1093/nar/gkae1029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2024] [Revised: 10/11/2024] [Accepted: 10/17/2024] [Indexed: 11/21/2024] Open
Abstract
The evolutionary classification of protein domains (ECOD) classifies protein domains using a combination of sequence and structural data (http://prodata.swmed.edu/ecod). Here we present the culmination of our previous efforts at classifying domains from predicted structures, principally from the AlphaFold Database (AFDB), by integrating these domains with our existing classification of PDB structures. This combined classification includes both domains from our previous, purely experimental, classification of domains as well as domains from our provisional classification of 48 proteomes in AFDB predicted from model organisms and organisms of concern to global health. ECOD classifies over 1.8 M domains from over 1 000 000 proteins collectively deposited in the PDB and AFDB. Additionally, we have changed the F-group classification reference used for ECOD, deprecating our original ECODf library and instead relying on direct collaboration with the Pfam sequence family database to inform our classification. Pfam provides similar coverage of ECOD with family classification while being more accurate and less redundant. By eliminating duplication of effort, we can improve both classifications. Finally, we discuss the initial deployment of DrugDomain, a database of domain-ligand interactions, on ECOD and discuss future plans.
Affiliation(s)
- R Dustin Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
- Kirill E Medvedev
  - Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
- Antonina Andreeva
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Sara Rocio Chuguransky
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Beatriz Lazaro Pinto
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Jing Zhang
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8591, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390- USA
- Qian Cong
  - Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8591, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390- USA
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Nick V Grishin
  - Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-9038, USA
6
Paysan-Lafosse T, Andreeva A, Blum M, Chuguransky S, Grego T, Pinto B, Salazar G, Bileschi M, Llinares-López F, Meng-Papaxanthos L, Colwell L, Grishin N, Schaeffer RD, Clementel D, Tosatto SE, Sonnhammer E, Wood V, Bateman A. The Pfam protein families database: embracing AI/ML. Nucleic Acids Res 2025; 53:D523-D534. [PMID: 39540428 PMCID: PMC11701544 DOI: 10.1093/nar/gkae997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2024] [Revised: 10/09/2024] [Accepted: 10/16/2024] [Indexed: 11/16/2024] Open
Abstract
The Pfam protein families database is a comprehensive collection of protein domains and families used for genome annotation and protein structure and function analysis (https://www.ebi.ac.uk/interpro/). This update describes major developments in Pfam since 2020, including decommissioning the Pfam website and integration with InterPro, harmonization with the ECOD structural classification, and expanded curation of metagenomic, microprotein and repeat-containing families. We highlight how AlphaFold structure predictions are being leveraged to refine domain boundaries and identify new domains. New families discovered through large-scale sequence similarity analysis of AlphaFold models are described. We also detail the development of Pfam-N, which uses deep learning to expand family coverage, achieving an 8.8% increase in UniProtKB coverage compared to standard Pfam. We discuss plans for more frequent Pfam releases integrated with InterPro and the potential for artificial intelligence to further assist curation. Despite recent advances, many protein families remain to be classified, and Pfam continues working toward comprehensive coverage of the protein universe.
Affiliation(s)
- Typhaine Paysan-Lafosse
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Antonina Andreeva
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Matthias Blum
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Sara Rocio Chuguransky
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Tiago Grego
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Beatriz Lazaro Pinto
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Gustavo A Salazar
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Lucy J Colwell
  - Google DeepMind, 355 Main Street, Cambridge, MA 02142, USA
  - Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
- Nick V Grishin
  - Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX 75390, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX 75390, USA
- R Dustin Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX 75390, USA
- Damiano Clementel
  - Department of Biomedical Sciences, University of Padova, Via 8 Febbraio, 2, 35122 Padova, Italy
- Silvio C E Tosatto
  - Department of Biomedical Sciences, University of Padova, Via 8 Febbraio, 2, 35122 Padova, Italy
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Via Giovanni Amendola, 122/O, 70126 Bari, Italy
- Erik Sonnhammer
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Tomtebodavägen 23A, 17165 Solna, Sweden
- Valerie Wood
  - Department of Biochemistry, University of Cambridge, Hopkins Building Downing Site, Tennis Court Road, Cambridge CB2 1QW, UK
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
7
Durham J, Zhang J, Schaeffer RD, Cong Q. DPAM-AI: a domain parser for AlphaFold models powered by artificial intelligence. Bioinformatics 2024; 41:btae740. [PMID: 39672676 PMCID: PMC11723527 DOI: 10.1093/bioinformatics/btae740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 10/29/2024] [Accepted: 12/12/2024] [Indexed: 12/15/2024] Open
Abstract
MOTIVATION: Due to the breakthrough in protein structure prediction by AlphaFold, the scientific community has access to 200 million predicted protein structures with near-atomic accuracy from the AlphaFold protein structure DataBase (AFDB), covering nearly the entire protein universe. Segmenting these models into domains and classifying them into an evolutionary hierarchy hold tremendous potential for unraveling essential insights into protein function.
RESULTS: We introduce DPAM-AI, a Domain Parser for AlphaFold Models based on Artificial Intelligence. DPAM-AI utilizes a convolutional neural network trained with previously classified domains in the Evolutionary Classification Of protein Domains (ECOD) database. DPAM-AI integrates inter-residue distances, predicted aligned errors, and sequence and structural alignments to previously classified domains detected via sequence (HHsuite) and structural (Dali) similarity searches. DPAM-AI has demonstrated its power through rigorous tests, excelling in several benchmark sets compared to its predecessor, DPAM, and other recently published domain parsers, Merizo and Chainsaw. We applied DPAM-AI to representative AFDB models for proteins classified in Pfam. We obtained representative 3D structures for 18 487 (89%) of the 20 795 Pfam families. The remaining families either (i) belong to viral proteins that were excluded from AFDB or (ii) do not adopt globular 3D structures. Our structure-aware domain delineation uncovered a considerable fraction (15%) of Pfam domains containing multiple structural and evolutionary units and refined the boundaries for over half.
AVAILABILITY AND IMPLEMENTATION: Pfam and corresponding DPAM-AI domains are at http://prodata.swmed.edu/DPAM-pfam/. Our code is deposited at https://github.com/Jsauce5p/DPAM/tree/dpam_ai, and updates will be released through https://github.com/CongLabCode/DPAM.
Affiliation(s)
- Jesse Durham
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Jing Zhang
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Richard D Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Qian Cong
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
8
Pei J, Andreeva A, Chuguransky S, Lázaro Pinto B, Paysan-Lafosse T, Dustin Schaeffer R, Bateman A, Cong Q, Grishin NV. Bridging the Gap between Sequence and Structure Classifications of Proteins with AlphaFold Models. J Mol Biol 2024; 436:168764. [PMID: 39197652 DOI: 10.1016/j.jmb.2024.168764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 08/13/2024] [Accepted: 08/20/2024] [Indexed: 09/01/2024]
Abstract
Classification of protein domains based on homology and structural similarity serves as a fundamental tool to gain biological insights into protein function. Recent advancements in protein structure prediction, exemplified by AlphaFold, have revolutionized the availability of protein structural data. We focus on classifying about 9000 Pfam families into ECOD (Evolutionary Classification of Domains) by using predicted AlphaFold models and the DPAM (Domain Parser for AlphaFold Models) tool. Our results offer insights into their homologous relationships and domain boundaries. More than half of these Pfam families contain DPAM domains that can be confidently assigned to the ECOD hierarchy. Most assigned domains belong to highly populated folds such as Immunoglobulin-like (IgL), Armadillo (ARM), helix-turn-helix (HTH), and Src homology 3 (SH3). A large fraction of DPAM domains, however, cannot be confidently assigned to ECOD homologous groups. These unassigned domains exhibit statistically different characteristics, including shorter average length, fewer secondary structure elements, and more abundant transmembrane segments. They could potentially define novel families remotely related to domains with known structures or novel superfamilies and folds. Manual scrutiny of a subset of these domains revealed an abundance of internal duplications and recurring structural motifs. Exploring sequence and structural features such as disulfide bond patterns, metal-binding sites, and enzyme active sites helped uncover novel structural folds as well as remote evolutionary relationships. By bridging the gap between sequence-based Pfam and structure-based ECOD domain classifications, our study contributes to a more comprehensive understanding of the protein universe by providing structural and functional insights into previously uncharacterized proteins.
Affiliation(s)
- Jimin Pei
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Antonina Andreeva
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Sara Chuguransky
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Beatriz Lázaro Pinto
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Typhaine Paysan-Lafosse
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- R Dustin Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Qian Cong
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Nick V Grishin
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA
9
Iovino BG, Tang H, Ye Y. Protein domain embeddings for fast and accurate similarity search. Genome Res 2024; 34:1434-1444. [PMID: 39237301 PMCID: PMC11529836 DOI: 10.1101/gr.279127.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 09/03/2024] [Indexed: 09/07/2024]
Abstract
Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins with single domains but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in quadratic time to the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show such domain-level contextual vectors (termed as DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark show that the DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
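To illustrate how a discrete cosine transform can turn a variable-length matrix of residue embeddings into a fixed-dimension vector, as the abstract above describes, here is a minimal sketch. It is not the DCTdomain implementation: it assumes a plain one-dimensional DCT-II along the residue axis of each embedding dimension, keeps only the first few coefficients, and divides by the domain length. The function names, the `keep` parameter, and the toy values are hypothetical.

```python
import math
from typing import List


def dct2(signal: List[float]) -> List[float]:
    """Unnormalized DCT-II of a one-dimensional sequence."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_fingerprint(embeddings: List[List[float]], keep: int) -> List[float]:
    """Fixed-length vector (keep * D values) from an L x D residue-embedding matrix."""
    length = len(embeddings)
    if length < keep:
        raise ValueError("need at least `keep` residues")
    dim = len(embeddings[0])
    fingerprint: List[float] = []
    for d in range(dim):
        # Treat one embedding dimension as a signal along the residues.
        column = [row[d] for row in embeddings]
        coeffs = dct2(column)[:keep]  # keep only the low-frequency coefficients
        # Divide by the length so longer domains do not get larger values.
        fingerprint.extend(c / length for c in coeffs)
    return fingerprint


# Toy embeddings (made up): two domains of different length, 2 dimensions each.
domain_a = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.9], [0.7, 0.1], [0.2, 0.6]]
domain_b = [[0.2, 0.3]] * 8
fp_a = dct_fingerprint(domain_a, keep=3)
fp_b = dct_fingerprint(domain_b, keep=3)
print(len(fp_a), len(fp_b))
```

Because both fingerprints have the same length regardless of how many residues the domain contains, they can be compared directly, for example with a distance or cosine similarity.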
Affiliation(s)
- Benjamin Giovanni Iovino
  - Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
- Haixu Tang
  - Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
- Yuzhen Ye
  - Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
10
Baek K, Metivier RJ, Roy Burman SS, Bushman JW, Yoon H, Lumpkin RJ, Abeja DM, Lakshminarayan M, Yue H, Ojeda S, Verano AL, Gray NS, Donovan KA, Fischer ES. Unveiling the hidden interactome of CRBN molecular glues with chemoproteomics. bioRxiv 2024:2024.09.11.612438. [PMID: 39314457 PMCID: PMC11419069 DOI: 10.1101/2024.09.11.612438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
Targeted protein degradation and induced proximity refer to strategies that leverage the recruitment of proteins to facilitate their modification, regulation or degradation. As prospective design of glues remains challenging, unbiased discovery methods are needed to unveil hidden chemical targets. Here we establish a high throughput affinity purification mass spectrometry workflow in cell lysates for the unbiased identification of molecular glue targets. By mapping the targets of 20 CRBN-binding molecular glues, we identify 298 protein targets and demonstrate the utility of enrichment methods for identifying novel targets overlooked using established methods. We use a computational workflow to estimate target confidence and perform a biochemical screen to identify a lead compound for the new non-ZF target PPIL4. Our study provides a comprehensive inventory of targets chemically recruited to CRBN and delivers a robust and scalable workflow for identifying new drug-induced protein interactions in cell lysates.
Affiliation(s)
- Kheewoong Baek
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Rebecca J. Metivier
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Shourya S. Roy Burman
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Jonathan W. Bushman
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Hojong Yoon
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Ryan J. Lumpkin
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Dinah M. Abeja
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Megha Lakshminarayan
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Hong Yue
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Samuel Ojeda
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Alyssa L. Verano
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Nathanael S. Gray
- Department of Chemical and Systems Biology, ChEM-H and Stanford Cancer Institute, Stanford Medical School, Stanford University, Stanford, CA, 94305, USA
| | - Katherine A. Donovan
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Eric S. Fischer
- Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA
| |
|
11
|
Gao J, Tong M, Lee C, Gaertig J, Legal T, Bui KH. DomainFit: Identification of protein domains in cryo-EM maps at intermediate resolution using AlphaFold2-predicted models. Structure 2024; 32:1248-1259.e5. [PMID: 38754431 PMCID: PMC11316655 DOI: 10.1016/j.str.2024.04.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 03/18/2024] [Accepted: 04/19/2024] [Indexed: 05/18/2024]
Abstract
Cryoelectron microscopy (cryo-EM) has revolutionized the structural determination of macromolecular complexes. With the paradigm shift to structure determination of highly complex endogenous macromolecular complexes ex vivo and in situ structural biology, there are an increasing number of structures of native complexes. These complexes often contain unidentified proteins, related to different cellular states or processes. Identifying proteins at resolutions lower than 4 Å remains challenging because side chains cannot be visualized reliably. Here, we present DomainFit, a program for semi-automated domain-level protein identification from cryo-EM maps, particularly at resolutions lower than 4 Å. By fitting domains from AlphaFold2-predicted models into cryo-EM maps, the program performs statistical analyses and attempts to identify the domains and protein candidates forming the density. Using DomainFit, we identified two microtubule inner proteins, one of which contains a CCDC81 domain and is exclusively localized in the proximal region of the doublet microtubule in Tetrahymena thermophila.
Affiliation(s)
- Jerry Gao
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Montréal, QC H3A 0C7, Canada; Centre de recherche en biologie structurale, McGill University, Montréal, QC H3G 0B1, Canada
| | - Maxwell Tong
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Montréal, QC H3A 0C7, Canada; Centre de recherche en biologie structurale, McGill University, Montréal, QC H3G 0B1, Canada
| | - Chinkyu Lee
- Department of Cellular Biology, University of Georgia, Athens 30602-2607, GA, USA
| | - Jacek Gaertig
- Department of Cellular Biology, University of Georgia, Athens 30602-2607, GA, USA
| | - Thibault Legal
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Montréal, QC H3A 0C7, Canada; Centre de recherche en biologie structurale, McGill University, Montréal, QC H3G 0B1, Canada.
| | - Khanh Huy Bui
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Montréal, QC H3A 0C7, Canada; Centre de recherche en biologie structurale, McGill University, Montréal, QC H3G 0B1, Canada.
| |
|
12
|
Buchan DWA, Moffat L, Lau A, Kandathil S, Jones D. Deep learning for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res 2024; 52:W287-W293. [PMID: 38747351 PMCID: PMC11223827 DOI: 10.1093/nar/gkae328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Revised: 04/08/2024] [Accepted: 04/24/2024] [Indexed: 07/06/2024]
Abstract
The PSIPRED Workbench is a long-established and popular bioinformatics web service offering a wide range of machine learning based analyses for characterizing protein structure and function. In this paper we provide an update of the recent additions and developments to the webserver, with a focus on new Deep Learning based methods. We briefly discuss some trends in server usage since the publication of AlphaFold2 and we give an overview of some upcoming developments for the service. The PSIPRED Workbench is available at http://bioinf.cs.ucl.ac.uk/psipred.
Affiliation(s)
- Daniel W A Buchan
- UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Lewis Moffat
- UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Andy Lau
- UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Shaun M Kandathil
- UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - David T Jones
- UCL Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, UK
| |
|
13
|
Medvedev KE, Zhang J, Schaeffer RD, Kinch LN, Cong Q, Grishin NV. Structure classification of the proteins from Salmonella enterica pangenome revealed novel potential pathogenicity islands. Sci Rep 2024; 14:12260. [PMID: 38806511 PMCID: PMC11133325 DOI: 10.1038/s41598-024-60991-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 04/30/2024] [Indexed: 05/30/2024]
Abstract
Salmonella enterica is a pathogenic bacterium known for causing severe typhoid fever in humans, making it important to study due to its potential health risks and significant impact on public health. This study provides an evolutionary classification of proteins from the Salmonella enterica pangenome. We classified 17,238 domains from 13,147 proteins from 79,758 Salmonella enterica strains and studied in detail the domains of 272 proteins from 14 characterized Salmonella pathogenicity islands (SPIs). Among SPI-related proteins, 90 function in the secretion machinery. 41% of the domains of SPI proteins have no previous sequence annotation. By comparing clinical and environmental isolates, we identified 3682 proteins that are overrepresented in the clinical group and that we consider potentially pathogenic. Among the domains of potentially pathogenic proteins, only 50% were previously annotated by sequence methods. Moreover, 36% (1330 out of 3682) of potentially pathogenic proteins cannot be classified into the Evolutionary Classification of Protein Domains database (ECOD). Among classified domains of potentially pathogenic proteins, the most populated homology groups include helix-turn-helix (HTH), immunoglobulin-related, and P-loop domains-related. Functional analysis revealed overrepresentation of these proteins in biological processes related to viral entry into host cell, antibiotic biosynthesis, DNA metabolism and conformation change, and underrepresentation in translational processes. Analysis of the potentially pathogenic proteins indicates that they form 119 clusters or novel potential pathogenicity islands (NPPIs) within the Salmonella genome, suggesting their potential contribution to the bacterium's virulence. One of the NPPIs revealed significant overrepresentation of potentially pathogenic proteins. Overall, our analysis revealed that the identified potentially pathogenic proteins are poorly studied.
Affiliation(s)
- Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Lisa N Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| |
|
14
|
Wells J, Hawkins-Hooker A, Bordin N, Sillitoe I, Paige B, Orengo C. Chainsaw: protein domain segmentation with fully convolutional neural networks. Bioinformatics 2024; 40:btae296. [PMID: 38718225 PMCID: PMC11256964 DOI: 10.1093/bioinformatics/btae296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 03/23/2024] [Accepted: 05/07/2024] [Indexed: 05/23/2024]
Abstract
MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.
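To make the step from pairwise co-membership probabilities to domain labels concrete, here is a deliberately simplified stand-in. It is not Chainsaw's algorithm, which searches for the most likely assignment; instead it thresholds the matrix and groups residues with a union-find over the resulting graph. The function name `assign_domains`, the 0.5 threshold, and the toy probability matrix are assumptions made for this illustration.

```python
from typing import Dict, List, Sequence


def assign_domains(prob: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Label residues by connected components of the thresholded pair matrix.

    prob[i][j] is the (hypothetical) predicted probability that residues i and j
    belong to the same domain. Pairs at or above `threshold` are linked, and each
    connected group of residues receives one integer domain label.
    """
    n = len(prob)
    parent = list(range(n))

    def find(i: int) -> int:
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if prob[i][j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Renumber component roots as 0, 1, 2, ... in order of first appearance.
    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Toy 4-residue example: residues 0-1 and 2-3 are confidently paired.
toy = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
domains = assign_domains(toy)
```

A hard threshold is sensitive to single spurious links between domains, which is one reason a likelihood-based search over assignments, as the abstract describes, may be preferable in practice.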
Affiliation(s)
- Jude Wells
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Alex Hawkins-Hooker
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| | - Brooks Paige
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| |
|
15
|
Versini R, Sritharan S, Aykac Fas B, Tubiana T, Aimeur SZ, Henri J, Erard M, Nüsse O, Andreani J, Baaden M, Fuchs P, Galochkina T, Chatzigoulas A, Cournia Z, Santuz H, Sacquin-Mora S, Taly A. A Perspective on the Prospective Use of AI in Protein Structure Prediction. J Chem Inf Model 2024; 64:26-41. [PMID: 38124369 DOI: 10.1021/acs.jcim.3c01361] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 12/23/2023]
Abstract
AlphaFold2 (AF2) and RoseTTAFold (RF) have revolutionized structural biology, serving as highly reliable and effective methods for predicting protein structures. This article explores their impact and limitations, focusing on their integration into experimental pipelines and their application in diverse protein classes, including membrane proteins, intrinsically disordered proteins (IDPs), and oligomers. In experimental pipelines, AF2 models help X-ray crystallography in resolving the phase problem, while complementarity with mass spectrometry and NMR data enhances structure determination and protein flexibility prediction. Predicting the structure of membrane proteins remains challenging for both AF2 and RF due to difficulties in capturing conformational ensembles and interactions with the membrane. Improvements in incorporating membrane-specific features and predicting the structural effect of mutations are crucial. For intrinsically disordered proteins, AF2's confidence score (pLDDT) serves as a competitive disorder predictor, but integrative approaches including molecular dynamics (MD) simulations or hydrophobic cluster analyses are advocated for accurate dynamics representation. AF2 and RF show promising results for oligomeric models, outperforming traditional docking methods, with AlphaFold-Multimer showing improved performance. However, some caveats remain in particular for membrane proteins. Real-life examples demonstrate AF2's predictive capabilities in unknown protein structures, but models should be evaluated for their agreement with experimental data. Furthermore, AF2 models can be used complementarily with MD simulations. In this Perspective, we propose a "wish list" for improving deep-learning-based protein folding prediction models, including using experimental data as constraints and modifying models with binding partners or post-translational modifications. Additionally, a meta-tool for ranking and suggesting composite models is suggested, driving future advancements in this rapidly evolving field.
Affiliation(s)
- Raphaelle Versini
- Laboratoire de Biochimie Théorique, CNRS (UPR9080), Université Paris Cité, F-75005 Paris, France
| | - Sujith Sritharan
- Laboratoire de Biochimie Théorique, CNRS (UPR9080), Université Paris Cité, F-75005 Paris, France
| | - Burcu Aykac Fas
- Laboratoire de Biochimie Théorique, CNRS (UPR9080), Université Paris Cité, F-75005 Paris, France
| | - Thibault Tubiana
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Sana Zineb Aimeur
- Université Paris-Saclay, CNRS, Institut de Chimie Physique, 91405 Orsay, France
| | - Julien Henri
- Sorbonne Université, CNRS, Laboratoire de Biologie, Computationnelle et Quantitative UMR 7238, Institut de Biologie Paris-Seine, 4 Place Jussieu, F-75005 Paris, France
| | - Marie Erard
- Université Paris-Saclay, CNRS, Institut de Chimie Physique, 91405 Orsay, France
| | - Oliver Nüsse
- Université Paris-Saclay, CNRS, Institut de Chimie Physique, 91405 Orsay, France
| | - Jessica Andreani
- Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198 Gif-sur-Yvette, France
| | - Marc Baaden
- Laboratoire de Biochimie Théorique, CNRS (UPR9080), Université Paris Cité, F-75005 Paris, France
| | - Patrick Fuchs
- Sorbonne Université, École Normale Supérieure, PSL University, CNRS, Laboratoire des Biomolécules, LBM, 75005 Paris, France
- Université de Paris, UFR Sciences du Vivant, 75013 Paris, France
| | - Tatiana Galochkina
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75014 Paris, France
| | - Alexios Chatzigoulas
- Biomedical Research Foundation, Academy of Athens, 11527 Athens, Greece
- Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 15784 Athens, Greece
| | - Zoe Cournia
- Biomedical Research Foundation, Academy of Athens, 11527 Athens, Greece
- Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 15784 Athens, Greece
| | - Hubert Santuz
- Laboratoire de Biochimie Théorique, CNRS (UPR9080), Université Paris Cité, F-75005 Paris, France
| | - Sophie Sacquin-Mora
- Laboratoire de Biochimie Théorique, CNRS (UPR9080), Université Paris Cité, F-75005 Paris, France
| | - Antoine Taly
- Laboratoire de Biochimie Théorique, CNRS (UPR9080), Université Paris Cité, F-75005 Paris, France
| |
|
16
|
Haft DH, Badretdin A, Coulouris G, DiCuccio M, Durkin A, Jovenitti E, Li W, Mersha M, O’Neill K, Virothaisakun J, Thibaud-Nissen F. RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes. Nucleic Acids Res 2024; 52:D762-D769. [PMID: 37962425 PMCID: PMC10767926 DOI: 10.1093/nar/gkad988] [Citation(s) in RCA: 36] [Impact Index Per Article: 36.0] [Received: 09/15/2023] [Revised: 10/13/2023] [Accepted: 10/18/2023] [Indexed: 11/15/2023]
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.
Affiliation(s)
- Daniel H Haft
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Azat Badretdin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - George Coulouris
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Michael DiCuccio
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - A Scott Durkin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Eric Jovenitti
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Wenjun Li
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Megdelawit Mersha
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Kathleen R O’Neill
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Joel Virothaisakun
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
|
17
|
Lau AM, Kandathil SM, Jones DT. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nat Commun 2023; 14:8445. [PMID: 38114456 PMCID: PMC10730818 DOI: 10.1038/s41467-023-43934-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 11/24/2023] [Indexed: 12/21/2023]
Abstract
The AlphaFold Protein Structure Database, containing predictions for over 200 million proteins, has been met with enthusiasm over its potential in enriching structural biological research and beyond. Currently, access to the database is precluded by an urgent need for tools that allow the efficient traversal, discovery, and documentation of its contents. Identifying domain regions in the database is a non-trivial endeavour and doing so will aid our understanding of protein structure and function, while facilitating drug discovery and comparative genomics. Here, we describe a deep learning method for domain segmentation called Merizo, which learns to cluster residues into domains in a bottom-up manner. Merizo is trained on CATH domains and fine-tuned on AlphaFold2 models via self-distillation, enabling it to be applied to both experimental and AlphaFold2 models. As proof of concept, we apply Merizo to the human proteome, identifying 40,818 putative domains that can be matched to CATH representative domains.
Affiliation(s)
- Andy M Lau
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - David T Jones
- Department of Computer Science, University College London, London, WC1E 6BT, UK.
| |
|
18
|
Gao J, Tong M, Lee C, Gaertig J, Legal T, Bui KH. DomainFit: Identification of Protein Domains in cryo-EM maps at Intermediate Resolution using AlphaFold2-predicted Models. bioRxiv 2023:2023.11.28.569001. [PMID: 38077012 PMCID: PMC10705406 DOI: 10.1101/2023.11.28.569001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/20/2023]
Abstract
Cryo-electron microscopy (cryo-EM) has revolutionized our understanding of macromolecular complexes, enabling high-resolution structure determination. With the paradigm shift to in situ structural biology recently driven by the ground-breaking development of cryo-focused ion beam milling and cryo-electron tomography, there are an increasing number of structures at sub-nanometer resolution of complexes solved directly within their cellular environment. These cellular complexes often contain unidentified proteins, related to different cellular states or processes. Identifying proteins at resolutions lower than 4 Å remains challenging because the side chains cannot be visualized reliably. Here, we present DomainFit, a program for automated domain-level protein identification from cryo-EM maps at resolutions lower than 4 Å. By fitting domains from artificial intelligence-predicted models such as AlphaFold2-predicted models into cryo-EM maps, the program performs statistical analyses and attempts to identify the proteins forming the density. Using DomainFit, we identified two microtubule inner proteins, one of which, a CCDC81 domain-containing protein, is exclusively localized in the proximal region of the doublet microtubule from the ciliate Tetrahymena thermophila. The flexibility and capability of DomainFit make it a valuable tool for analyzing in situ structures.
Affiliation(s)
- Jerry Gao
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Québec, Canada
- Centre de recherche en biologie structurale, McGill University, Montréal, Quebec, Canada
| | - Max Tong
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Québec, Canada
- Centre de recherche en biologie structurale, McGill University, Montréal, Quebec, Canada
| | - Chinkyu Lee
- Department of Cellular Biology, University of Georgia, Athens, GA, USA
| | - Jacek Gaertig
- Department of Cellular Biology, University of Georgia, Athens, GA, USA
| | - Thibault Legal
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Québec, Canada
- Centre de recherche en biologie structurale, McGill University, Montréal, Quebec, Canada
| | - Khanh Huy Bui
- Department of Anatomy and Cell Biology, Faculty of Medicine and Health Sciences, McGill University, Québec, Canada
- Centre de recherche en biologie structurale, McGill University, Montréal, Quebec, Canada
| |
|
19
|
Wang Y, Gallagher LA, Andrade PA, Liu A, Humphreys IR, Turkarslan S, Cutler KJ, Arrieta-Ortiz ML, Li Y, Radey MC, McLean JS, Cong Q, Baker D, Baliga NS, Peterson SB, Mougous JD. Genetic manipulation of Patescibacteria provides mechanistic insights into microbial dark matter and the epibiotic lifestyle. Cell 2023; 186:4803-4817.e13. [PMID: 37683634 PMCID: PMC10633639 DOI: 10.1016/j.cell.2023.08.017] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 05/11/2023] [Revised: 07/06/2023] [Accepted: 08/16/2023] [Indexed: 09/10/2023]
Abstract
Patescibacteria, also known as the candidate phyla radiation (CPR), are a diverse group of bacteria that constitute a disproportionately large fraction of microbial dark matter. Its few cultivated members, belonging mostly to Saccharibacteria, grow as epibionts on host Actinobacteria. Due to a lack of suitable tools, the genetic basis of this lifestyle and other unique features of Patescibacteria remain unexplored. Here, we show that Saccharibacteria exhibit natural competence, and we exploit this property for their genetic manipulation. Imaging of fluorescent protein-labeled Saccharibacteria provides high spatiotemporal resolution of phenomena accompanying epibiotic growth, and a transposon-insertion sequencing (Tn-seq) genome-wide screen reveals the contribution of enigmatic Saccharibacterial genes to growth on their hosts. Finally, we leverage metagenomic data to provide cutting-edge protein structure-based bioinformatic resources that support the strain Southlakia epibionticum and its corresponding host, Actinomyces israelii, as a model system for unlocking the molecular underpinnings of the epibiotic lifestyle.
Affiliation(s)
- Yaxi Wang
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Larry A Gallagher
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Pia A Andrade
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Andi Liu
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Ian R Humphreys
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA; Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | | | - Kevin J Cutler
- Department of Physics, University of Washington, Seattle, WA 98195, USA
| | | | - Yaqiao Li
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA; Institute for Systems Biology, Seattle, WA 98109, USA
| | - Matthew C Radey
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Jeffrey S McLean
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA; Department of Periodontics, University of Washington, Seattle, WA 98195, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA; Institute for Protein Design, University of Washington, Seattle, WA 98195, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98109, USA
| | | | - S Brook Peterson
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Joseph D Mougous
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98109, USA; Microbial Interactions and Microbiome Center, University of Washington, Seattle, WA 98195, USA.
| |
Collapse
|
20
|
Pei J, Cong Q. Computational analysis of regulatory regions in human protein kinases. Protein Sci 2023; 32:e4764. [PMID: 37632170 PMCID: PMC10503413 DOI: 10.1002/pro.4764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 08/08/2023] [Accepted: 08/22/2023] [Indexed: 08/27/2023]
Abstract
Eukaryotic proteins often feature modular domain structures comprising globular domains that are connected by linker regions and intrinsically disordered regions that may contain important functional motifs. The intramolecular interactions of globular domains and nonglobular regions can play critical roles in different aspects of protein function. However, studying these interactions and their regulatory roles can be challenging due to the flexibility of nonglobular regions, the long insertions separating interacting modules, and the transient nature of some interactions. Obtaining the experimental structures of multiple domains and functional regions is more difficult than determining the structures of individual globular domains. High-quality structural models generated by AlphaFold offer a unique opportunity to study intramolecular interactions in eukaryotic proteins. In this study, we systematically explored intramolecular interactions between human protein kinase domains (KDs) and potential regulatory regions, including globular domains, N- and C-terminal tails, long insertions, and distal nonglobular regions. Our analysis identified intramolecular interactions between human KDs and 35 different types of globular domains, exhibiting a variety of interaction modes that could contribute to orthosteric or allosteric regulation of kinase activity. We also identified prevalent interactions between human KDs and their flanking regions (N- and C-terminal tails). These interactions exhibit group-specific characteristics and can vary within each specific kinase group. Although long-range interactions between KDs and nonglobular regions are relatively rare, structural details of these interactions offer new insights into the regulation mechanisms of several kinases, such as HASPIN, MAPK7, MAPK15, and SIK1B.
Affiliation(s)
- Jimin Pei
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Qian Cong
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
|
21
|
Boys IN, Johnson AG, Quinlan MR, Kranzusch PJ, Elde NC. Structural homology screens reveal host-derived poxvirus protein families impacting inflammasome activity. Cell Rep 2023; 42:112878. [PMID: 37494187 DOI: 10.1016/j.celrep.2023.112878] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 03/06/2023] [Revised: 06/20/2023] [Accepted: 07/11/2023] [Indexed: 07/28/2023] Open
Abstract
Viruses acquire host genes via horizontal transfer and can express them to manipulate host biology during infections. Some homologs retain sequence identity, but evolutionary divergence can obscure host origins. We use structural modeling to compare vaccinia virus proteins with metazoan proteomes. We identify vaccinia A47L as a homolog of gasdermins, the executioners of pyroptosis. An X-ray crystal structure of A47 confirms this homology, and cell-based assays reveal that A47 interferes with caspase function. We also identify vaccinia C1L as the product of a cryptic gene fusion event coupling a Bcl-2-related fold with a pyrin domain. C1 associates with components of the inflammasome, a cytosolic innate immune sensor involved in pyroptosis, yet paradoxically enhances inflammasome activity, suggesting differential modulation during infections. Our findings demonstrate the increasing power of structural homology screens to reveal proteins with unique combinations of domains that viruses capture from host genes and combine in unique ways.
Affiliation(s)
- Ian N Boys
  - Department of Human Genetics, University of Utah, Salt Lake City, UT 84112, USA; Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Alex G Johnson
  - Department of Microbiology, Harvard Medical School, Boston, MA 02115, USA; Department of Cancer Immunology and Virology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
- Meghan R Quinlan
  - Department of Human Genetics, University of Utah, Salt Lake City, UT 84112, USA; Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Philip J Kranzusch
  - Department of Microbiology, Harvard Medical School, Boston, MA 02115, USA; Department of Cancer Immunology and Virology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
- Nels C Elde
  - Department of Human Genetics, University of Utah, Salt Lake City, UT 84112, USA; Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
|
22
|
Wang Y, Gallagher LA, Andrade PA, Liu A, Humphreys IR, Turkarslan S, Cutler KJ, Arrieta-Ortiz ML, Li Y, Radey MC, McLean JS, Cong Q, Baker D, Baliga NS, Peterson SB, Mougous JD. Genetic manipulation of candidate phyla radiation bacteria provides functional insights into microbial dark matter. bioRxiv 2023:2023.05.02.539146. [PMID: 37205512 PMCID: PMC10187176 DOI: 10.1101/2023.05.02.539146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
The study of bacteria has yielded fundamental insights into cellular biology and physiology, biotechnological advances, and many therapeutics. Yet due to a lack of suitable tools, the significant portion of bacterial diversity held within the candidate phyla radiation (CPR) remains inaccessible to such pursuits. Here we show that CPR bacteria belonging to the phylum Saccharibacteria exhibit natural competence. We exploit this property to develop methods for their genetic manipulation, including the insertion of heterologous sequences and the construction of targeted gene deletions. Imaging of fluorescent protein-labeled Saccharibacteria provides high spatiotemporal resolution of phenomena accompanying epibiotic growth, and a transposon insertion sequencing genome-wide screen reveals the contribution of enigmatic Saccharibacterial genes to growth on their Actinobacteria hosts. Finally, we leverage metagenomic data to provide cutting-edge protein structure-based bioinformatic resources that support the strain Southlakia epibionticum and its corresponding host, Actinomyces israelii, as a model system for unlocking the molecular underpinnings of the epibiotic lifestyle.
Affiliation(s)
- Yaxi Wang
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Larry A. Gallagher
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Pia A. Andrade
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Andi Liu
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Ian R. Humphreys
  - Department of Biochemistry, University of Washington, Seattle, WA 98109, USA
  - Institute for Protein Design, Seattle, WA 98109, USA
- Kevin J. Cutler
  - Department of Physics, University of Washington, Seattle, WA 98195, USA
- Yaqiao Li
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Matthew C. Radey
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Jeffrey S. McLean
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
  - Department of Periodontics, University of Washington, Seattle, WA 98195, USA
- Qian Cong
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
  - Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- David Baker
  - Department of Biochemistry, University of Washington, Seattle, WA 98109, USA
  - Institute for Protein Design, Seattle, WA 98109, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
- S. Brook Peterson
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Joseph D. Mougous
  - Department of Microbiology, University of Washington, Seattle, WA 98109, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
  - Microbial Interactions and Microbiome Center, University of Washington, Seattle, WA 98109, USA
|
23
|
Boys IN, Johnson AG, Quinlan M, Kranzusch PJ, Elde NC. Structural homology screens reveal poxvirus-encoded proteins impacting inflammasome-mediated defenses. bioRxiv 2023:2023.02.26.529821. [PMID: 36909515 PMCID: PMC10002665 DOI: 10.1101/2023.02.26.529821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/03/2023]
Abstract
Viruses acquire host genes via horizontal gene transfer and can express them to manipulate host biology during infections. Some viral and host homologs retain sequence identity, but evolutionary divergence can obscure host origins. We used structural modeling to compare vaccinia virus proteins with metazoan proteomes. We identified vaccinia A47L as a homolog of gasdermins, the executioners of pyroptosis. An X-ray crystal structure of A47 confirmed this homology, and cell-based assays revealed that A47 inhibits pyroptosis. We also identified vaccinia C1L as the product of a cryptic gene fusion event coupling a Bcl-2-related fold with a pyrin domain. C1 associates with components of the inflammasome, a cytosolic innate immune sensor involved in pyroptosis, yet paradoxically enhances inflammasome activity, suggesting a benefit to poxvirus replication in some circumstances. Our findings demonstrate the potential of structural homology screens to reveal genes that viruses capture from hosts and repurpose to benefit viral fitness.
Affiliation(s)
- Ian N. Boys
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah, 84112, USA
  - Howard Hughes Medical Institute, Chevy Chase, Maryland, 20815, USA
- Alex G. Johnson
  - Department of Microbiology, Harvard Medical School, Boston, MA, 02115, USA
  - Department of Cancer Immunology and Virology, Dana-Farber Cancer Institute, Boston, MA, 02115, USA
- Meghan Quinlan
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah, 84112, USA
  - Howard Hughes Medical Institute, Chevy Chase, Maryland, 20815, USA
- Philip J. Kranzusch
  - Department of Microbiology, Harvard Medical School, Boston, MA, 02115, USA
  - Department of Cancer Immunology and Virology, Dana-Farber Cancer Institute, Boston, MA, 02115, USA
- Nels C. Elde
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah, 84112, USA
  - Howard Hughes Medical Institute, Chevy Chase, Maryland, 20815, USA
|
24
|
Schaeffer RD, Zhang J, Kinch LN, Pei J, Cong Q, Grishin NV. Classification of domains in predicted structures of the human proteome. Proc Natl Acad Sci U S A 2023; 120:e2214069120. [PMID: 36917664 PMCID: PMC10041065 DOI: 10.1073/pnas.2214069120] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 08/16/2022] [Accepted: 02/06/2023] [Indexed: 03/16/2023] Open
Abstract
Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. A quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AF models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website (http://prodata.swmed.edu/ecod/index_human.php).
Affiliation(s)
- R. Dustin Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Jing Zhang
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Lisa N. Kinch
  - Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - HHMI, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Jimin Pei
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Qian Cong
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Nick V. Grishin
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390
|