1
Yuan R, Zhang J, Zhou J, Cong Q. Recent progress and future challenges in structure-based protein-protein interaction prediction. Mol Ther 2025; 33:2252-2268. [PMID: 40195117 DOI: 10.1016/j.ymthe.2025.04.003] [Received: 02/07/2025] [Revised: 03/05/2025] [Accepted: 04/02/2025] [Indexed: 04/09/2025] Open
Abstract
Protein-protein interactions (PPIs) play a fundamental role in cellular processes, and understanding these interactions is crucial for advances in both basic biological science and biomedical applications. This review presents an overview of recent progress in computational methods for modeling protein complexes and predicting PPIs based on 3D structures, focusing on the transformative role of artificial intelligence-based approaches. We further discuss the expanding biomedical applications of PPI research, including the elucidation of disease mechanisms, drug discovery, and therapeutic design. Despite these advances, significant challenges remain in predicting host-pathogen interactions, interactions between intrinsically disordered regions, and interactions related to immune responses. These challenges merit future exploration and represent the frontier of research in this field.
Affiliation(s)
- Rongqing Yuan
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jing Zhang
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
2
Kawabata T, Kinoshita K. Assessing Structural Classification Using AlphaFold2 Models Through ECOD-Based Comparative Analysis. Proteins 2025. [PMID: 40251890 DOI: 10.1002/prot.26828] [Received: 12/19/2024] [Revised: 03/27/2025] [Accepted: 03/30/2025] [Indexed: 04/21/2025]
Abstract
Identifying homologous proteins is a fundamental task in structural bioinformatics. While AlphaFold2 has revolutionized protein structure prediction, the extent to which structure comparison of its models can reliably detect homologs remains unclear. In this study, we evaluate the feasibility of homology detection using AlphaFold2-predicted structures through structural comparisons. We considered the classification of the ECOD database for experimental structures as the correct standard and obtained their corresponding predicted models from AlphaFoldDB. To ensure blind assessment, we divided the structures into test and train sets according to their release date. Predicted and experimental 3D structures in the test and train sets were compared using 3D structure comparisons (MATRAS, Dali, and Foldseek) and sequence comparisons (BLAST and HHsearch). The results were evaluated based on the homology annotations in the ECOD database. For top-1 accuracy, the performance of structural comparisons was comparable to that of HHsearch. However, when considering metrics that included all structural pairs, including more remote homology, structural comparisons outperformed HHsearch. No significant differences were observed between comparisons of experimental versus experimental, predicted versus experimental, and predicted versus predicted structures with pLDDT (prediction confidence) values greater than 60. We also demonstrate that predicted structures of proteins whose experimental structures were determined by NMR had lower pLDDT values and contained fewer coils than their experimental counterparts. These findings highlight the potential of AlphaFold2 models in structural classification and suggest that 3D structural searches should be conducted not only against the PDB but also against AlphaFoldDB to identify more potential homologs.
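The top-1 accuracy mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that top-1 accuracy means the fraction of test queries whose best-scoring hit in the train set carries the same homology group label, and that higher scores mean greater similarity (as with a Z-score; an E-value would need the opposite ordering). The function name, identifiers, scores, and group labels are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def top1_accuracy(
    hits: Dict[str, List[Tuple[str, float]]],
    group_of: Dict[str, str],
) -> float:
    """Fraction of queries whose best-scoring hit shares their homology group.

    Queries with no hits count as incorrect (an assumption of this sketch).
    """
    if not hits:
        return 0.0
    correct = 0
    for query, scored_hits in hits.items():
        if not scored_hits:
            continue
        # Pick the hit with the highest similarity score.
        best_hit, _ = max(scored_hits, key=lambda pair: pair[1])
        if group_of[best_hit] == group_of[query]:
            correct += 1
    return correct / len(hits)


# Hypothetical identifiers, scores, and group labels, for illustration only.
group_of = {"q1": "G1", "q2": "G2", "q3": "G3", "t1": "G1", "t2": "G2", "t3": "G9"}
hits = {
    "q1": [("t1", 35.2), ("t2", 12.0)],  # best hit t1 is in G1: correct
    "q2": [("t3", 20.1), ("t2", 18.7)],  # best hit t3 is in G9: incorrect
    "q3": [],                            # no hit: counted as incorrect
}
accuracy = top1_accuracy(hits, group_of)
```

Metrics that include all structural pairs, as the abstract describes for remote homology, would rank every query-target pair rather than only the best hit per query.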
Affiliation(s)
- Takeshi Kawabata
- Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Kengo Kinoshita
- Graduate School of Information Sciences, Tohoku University, Sendai, Japan
3
Schaeffer RD, Pei J, Zhang J, Cong Q, Grishin NV. Refinement and curation of homologous groups facilitated by structure prediction. Protein Sci 2025; 34:e70074. [PMID: 39968854 PMCID: PMC11836899 DOI: 10.1002/pro.70074] [Received: 09/19/2024] [Revised: 01/09/2025] [Accepted: 02/05/2025] [Indexed: 02/20/2025]
Abstract
Domain classification of protein predictions released in the AlphaFold Database (AFDB) has been a recent focus of the Evolutionary Classification of protein Domains (ECOD). Although a primary focus of our recent work has been the partition and assignment of domains from these predictions, we here show how these diverse predictions can be used to examine the reference domain set more closely. Using results from DPAM, our AlphaFold-specific domain parsing algorithm, we examine hierarchical groupings that share significant levels of homologous links, both between groups that were not previously assessed to be definitively homologous and between groups that were not previously observed to share significant homologous links. Combined with manual analysis, these large datasets of structural and sequence similarities allow us to merge homologous groups in multiple cases which we detail within. These domains tend to be from families that are either small, previously had few experimental representatives, or had unknown function. The exception to this is the chromodomains, a large homologous group which was upgraded from "possibly homologous" to "definitely homologous" to increase the consistency of ECOD, based on their strong homologous links to the SH3 domains.
Affiliation(s)
- Jimin Pei
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Jing Zhang
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
4
Wu EJ, Kandalkar AT, Ehrmann JF, Tong AB, Zhang J, Cong Q, Wu H. A structural atlas of death domain fold proteins reveals their versatile roles in biology and function. Proc Natl Acad Sci U S A 2025; 122:e2426986122. [PMID: 39977327 PMCID: PMC11874512 DOI: 10.1073/pnas.2426986122] [Received: 12/24/2024] [Accepted: 01/23/2025] [Indexed: 02/22/2025] Open
Abstract
Death domain fold (DDF) superfamily proteins are critically important players in pathways of cell death and inflammation. DDFs are often essential scaffolding domains in receptors, adaptors, or effectors of these pathways by mediating homo- and hetero-oligomerization including helical filament assembly. At the downstream ends of these pathways, effector oligomerization by DDFs brings the enzyme domains into proximity for their dimerization and activation. Hundreds of structures of these domains have been solved. However, a comprehensive understanding of DDFs is lacking. In this article, we report the curation of a DDF structural atlas as a public website (deathdomain.org) and deduce the common and distinct principles of DDF-mediated oligomerization among the four families (death domain or DD, death effector domain or DED, caspase recruitment domain or CARD, and pyrin domain or PYD). We further annotate DDFs genome-wide based on AlphaFold-predicted models and protein sequences. These studies reveal mechanistic rules for this widely distributed domain superfamily.
Affiliation(s)
- Emily J. Wu
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
- Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
- Saratoga High School, Saratoga, CA 95070
- Ankita T. Kandalkar
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
- Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
- Department of Biology, College of Science, Northeastern University, Boston, MA 02115
- Julian F. Ehrmann
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
- Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
- Alexander B. Tong
- Jason L. Choy Laboratory of Single-Molecule Biophysics, Institute for Quantitative Biosciences, Chemistry Graduate Group, University of California, Berkeley, CA 94720
- Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Hao Wu
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA 02115
- Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston, MA 02115
- Department of Biology, College of Science, Northeastern University, Boston, MA 02115
5
Șulea TA, Martin EC, Bugeac CA, Bectaș FS, Iacob AL, Spiridon L, Petrescu AJ. Lessons from Deep Learning Structural Prediction of Multistate Multidomain Proteins-The Case Study of Coiled-Coil NOD-like Receptors. Int J Mol Sci 2025; 26:500. [PMID: 39859213 PMCID: PMC11765006 DOI: 10.3390/ijms26020500] [Received: 11/27/2024] [Revised: 01/03/2025] [Accepted: 01/07/2025] [Indexed: 01/27/2025] Open
Abstract
We test here the prediction capabilities of the new generation of deep learning predictors in the more challenging situation of multistate multidomain proteins by using as a case study a coiled-coil family of Nucleotide-binding Oligomerization Domain-like (NOD-like) receptors from A. thaliana and a few extra examples for reference. Results reveal a truly remarkable ability of these platforms to correctly predict the 3D structure of modules that fold in well-established topologies. A lower performance is noticed in modeling morphing regions of these proteins, such as the coiled coils. Predictors also display a good sensitivity to local sequence drifts upon the modeling solution of the overall modular configuration. In multivalued 1D to 3D mappings, the platforms display a marked tendency to model proteins in the most compact configuration and must be retrained by information filtering to drive modeling toward the sparser ones. Bias toward order and compactness is seen at the secondary structure level as well. All in all, using AI predictors for modeling multidomain multistate proteins when global templates are at hand is fruitful, but the above challenges have to be taken into account. In the absence of global templates, a piecewise modeling approach with experimentally constrained reconstruction of the global architecture might give more realistic results.
Affiliation(s)
- Andrei-Jose Petrescu
- Department of Bioinformatics and Structural Biochemistry, Institute of Biochemistry of the Romanian Academy, Splaiul Independentei 296, 060031 Bucharest, Romania; (T.A.Ș.); (E.C.M.); (C.A.B.); (F.S.B.); (A.-L.I.); (L.S.)
6
Schaeffer R, Medvedev K, Andreeva A, Chuguransky S, Pinto B, Zhang J, Cong Q, Bateman A, Grishin N. ECOD: integrating classifications of protein domains from experimental and predicted structures. Nucleic Acids Res 2025; 53:D411-D418. [PMID: 39565196 PMCID: PMC11701565 DOI: 10.1093/nar/gkae1029] [Received: 09/14/2024] [Revised: 10/11/2024] [Accepted: 10/17/2024] [Indexed: 11/21/2024] Open
Abstract
The evolutionary classification of protein domains (ECOD) classifies protein domains using a combination of sequence and structural data (http://prodata.swmed.edu/ecod). Here we present the culmination of our previous efforts at classifying domains from predicted structures, principally from the AlphaFold Database (AFDB), by integrating these domains with our existing classification of PDB structures. This combined classification includes both domains from our previous, purely experimental, classification of domains as well as domains from our provisional classification of 48 proteomes in AFDB predicted from model organisms and organisms of concern to global health. ECOD classifies over 1.8 M domains from over 1 000 000 proteins collectively deposited in the PDB and AFDB. Additionally, we have changed the F-group classification reference used for ECOD, deprecating our original ECODf library and instead relying on direct collaboration with the Pfam sequence family database to inform our classification. Pfam provides similar coverage of ECOD with family classification while being more accurate and less redundant. By eliminating duplication of effort, we can improve both classifications. Finally, we discuss the initial deployment of DrugDomain, a database of domain-ligand interactions, on ECOD and discuss future plans.
Affiliation(s)
- R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
- Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
- Antonina Andreeva
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Sara Rocio Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Beatriz Lazaro Pinto
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Jing Zhang
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8591, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390- USA
- Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8591, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390- USA
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-8816 USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd. Dallas, TX, 75390-9038, USA
7
Durham J, Zhang J, Schaeffer RD, Cong Q. DPAM-AI: a domain parser for AlphaFold models powered by artificial intelligence. Bioinformatics 2024; 41:btae740. [PMID: 39672676 PMCID: PMC11723527 DOI: 10.1093/bioinformatics/btae740] [Received: 04/23/2024] [Revised: 10/29/2024] [Accepted: 12/12/2024] [Indexed: 12/15/2024] Open
Abstract
Motivation: Due to the breakthrough in protein structure prediction by AlphaFold, the scientific community has access to 200 million predicted protein structures with near-atomic accuracy from the AlphaFold protein structure DataBase (AFDB), covering nearly the entire protein universe. Segmenting these models into domains and classifying them into an evolutionary hierarchy hold tremendous potential for unraveling essential insights into protein function.
Results: We introduce DPAM-AI, a Domain Parser for AlphaFold Models based on Artificial Intelligence. DPAM-AI utilizes a convolutional neural network trained with previously classified domains in the Evolutionary Classification Of protein Domains (ECOD) database. DPAM-AI integrates inter-residue distances, predicted aligned errors, and sequence and structural alignments to previously classified domains detected via sequence (HHsuite) and structural (Dali) similarity searches. DPAM-AI has demonstrated its power through rigorous tests, excelling in several benchmark sets compared to its predecessor, DPAM, and other recently published domain parsers, Merizo and Chainsaw. We applied DPAM-AI to representative AFDB models for proteins classified in Pfam. We obtained representative 3D structures for 18 487 (89%) of the 20 795 Pfam families. The remaining families either (i) belong to viral proteins that were excluded from AFDB or (ii) do not adopt globular 3D structures. Our structure-aware domain delineation uncovered a considerable fraction (15%) of Pfam domains containing multiple structural and evolutionary units and refined the boundaries for over half.
Availability and implementation: Pfam and corresponding DPAM-AI domains are at http://prodata.swmed.edu/DPAM-pfam/. Our code is deposited at https://github.com/Jsauce5p/DPAM/tree/dpam_ai, and updates will be released through https://github.com/CongLabCode/DPAM.
Affiliation(s)
- Jesse Durham
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Jing Zhang
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Richard D Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
8
Shimpi AA, Naegle KM. Linguistic networks uncover grammatical constraints of protein sentences comprised of domain-based words. bioRxiv 2024:2024.12.04.626803. [PMID: 39677636 PMCID: PMC11643033 DOI: 10.1101/2024.12.04.626803] [Indexed: 12/17/2024]
Abstract
Evolution has developed a set of principles that determine feasible domain combinations analogous to grammar within natural languages. Treating domains as words and proteins as sentences made up of words, we apply a linguistic approach to represent the human proteome as an n-gram network. Combining this with network theory and application, we explore the functional language and rules of the human proteome. Additionally, we explored subnetwork languages by focusing on reversible post-translational modification (PTM) systems that follow a reader-writer-eraser paradigm. We find that PTM systems appear to sample grammar rules near the onset of the system expansion, but then convergently evolve towards similar grammar rules, which stabilize during the post-metazoan switch. For example, reader and writer domains are typically tightly connected through shared n-grams, but eraser domains are almost always loosely or completely disconnected from readers and writers. Additionally, after grammar fixation, domains with verb-like properties, such as writers and erasers, never appear, consistent with the idea of natural grammar that leads to clarity and limits futile enzymatic cycles. Then, given how some cancer fusion genes represent the possibility for the emergence of novel language, we investigate how cancer fusion genes alter the human proteome n-gram network. We find most cancer fusion genes follow existing grammar rules. Collectively, these results suggest that n-gram based analysis of proteomes is a complement to the more direct protein-protein interaction networks. N-grams can capture abstract functional connections in a more fully described manner, limited only by the definition of domains within the proteome and not by the combinatorial challenge of capturing all protein interaction connections.
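The idea of treating domains as words and proteins as sentences can be illustrated with a minimal sketch. It assumes each protein is given as an ordered list of domain names and builds a directed bigram (2-gram) network in which an edge weight counts how often one domain is immediately followed by another. The protein names, domain architectures, and function names are hypothetical, and the authors' actual network construction may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple


def domain_ngrams(architecture: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of domains in N-to-C terminal order."""
    return [tuple(architecture[i:i + n]) for i in range(len(architecture) - n + 1)]


def bigram_network(proteome: Dict[str, List[str]]) -> Counter:
    """Count directed edges (left domain, right domain) across all proteins."""
    edges: Counter = Counter()
    for domains in proteome.values():
        for left, right in domain_ngrams(domains, 2):
            edges[(left, right)] += 1
    return edges


# Hypothetical proteins and architectures, for illustration only.
proteome = {
    "protA": ["SH2", "SH3", "Kinase"],
    "protB": ["SH3", "Kinase"],
    "protC": ["Bromo"],  # single-domain protein: contributes no bigram
}
edges = bigram_network(proteome)
```

Here the edge from SH3 to Kinase has weight 2 because that adjacent pair occurs in two proteins, while single-domain proteins add no edges.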
Affiliation(s)
- Adrian A. Shimpi
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, 22903
- Department of Genome Sciences, University of Virginia, Charlottesville, VA, 22903
- Kristen M. Naegle
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, 22903
- Department of Genome Sciences, University of Virginia, Charlottesville, VA, 22903
9
Pei J, Andreeva A, Chuguransky S, Lázaro Pinto B, Paysan-Lafosse T, Dustin Schaeffer R, Bateman A, Cong Q, Grishin NV. Bridging the Gap between Sequence and Structure Classifications of Proteins with AlphaFold Models. J Mol Biol 2024; 436:168764. [PMID: 39197652 DOI: 10.1016/j.jmb.2024.168764] [Received: 06/05/2024] [Revised: 08/13/2024] [Accepted: 08/20/2024] [Indexed: 09/01/2024]
Abstract
Classification of protein domains based on homology and structural similarity serves as a fundamental tool to gain biological insights into protein function. Recent advancements in protein structure prediction, exemplified by AlphaFold, have revolutionized the availability of protein structural data. We focus on classifying about 9000 Pfam families into ECOD (Evolutionary Classification of Domains) by using predicted AlphaFold models and the DPAM (Domain Parser for AlphaFold Models) tool. Our results offer insights into their homologous relationships and domain boundaries. More than half of these Pfam families contain DPAM domains that can be confidently assigned to the ECOD hierarchy. Most assigned domains belong to highly populated folds such as Immunoglobulin-like (IgL), Armadillo (ARM), helix-turn-helix (HTH), and Src homology 3 (SH3). A large fraction of DPAM domains, however, cannot be confidently assigned to ECOD homologous groups. These unassigned domains exhibit statistically different characteristics, including shorter average length, fewer secondary structure elements, and more abundant transmembrane segments. They could potentially define novel families remotely related to domains with known structures or novel superfamilies and folds. Manual scrutiny of a subset of these domains revealed an abundance of internal duplications and recurring structural motifs. Exploring sequence and structural features such as disulfide bond patterns, metal-binding sites, and enzyme active sites helped uncover novel structural folds as well as remote evolutionary relationships. By bridging the gap between sequence-based Pfam and structure-based ECOD domain classifications, our study contributes to a more comprehensive understanding of the protein universe by providing structural and functional insights into previously uncharacterized proteins.
Affiliation(s)
- Jimin Pei
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Antonina Andreeva
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Sara Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Beatriz Lázaro Pinto
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Typhaine Paysan-Lafosse
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.
- Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
- Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.
10
Waman VP, Bordin N, Alcraft R, Vickerstaff R, Rauer C, Chan Q, Sillitoe I, Yamamori H, Orengo C. CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds. J Mol Biol 2024; 436:168551. [PMID: 38548261 DOI: 10.1016/j.jmb.2024.168551] [Received: 01/31/2024] [Revised: 03/20/2024] [Accepted: 03/22/2024] [Indexed: 04/07/2024]
Abstract
CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new NextFlow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Rachel Alcraft
- Advanced Research Computing Centre, University College London, London, United Kingdom
- Robert Vickerstaff
- Advanced Research Computing Centre, University College London, London, United Kingdom
- Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Qian Chan
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Hazuki Yamamori
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom.
|
11
|
Murata H, Toko K, Chikenji G. Protein superfolds are characterised as frustration-free topologies: A case study of pure parallel β-sheet topologies. PLoS Comput Biol 2024; 20:e1012282. [PMID: 39110764 PMCID: PMC11333010 DOI: 10.1371/journal.pcbi.1012282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Revised: 08/19/2024] [Accepted: 06/26/2024] [Indexed: 08/21/2024] Open
Abstract
A protein superfold is a type of protein fold that is observed in at least three distinct, non-homologous protein families. Structural classification studies have revealed a limited number of prevalent superfolds alongside several infrequently occurring folds. In α/β type superfolds, the C-terminal β-strand tends to favor the edge of the β-sheet, while the N-terminal β-strand is often found in the middle. The reasons behind these observations, whether they are due to evolutionary sampling bias or physical interactions, remain unclear. This article offers a physics-based explanation for these observations, specifically for pure parallel β-sheet topologies. Our investigation is grounded in several established structural rules that are based on physical interactions. We have identified "frustration-free topologies", which are topologies that can satisfy all the rules simultaneously. In contrast, topologies that cannot are termed "frustrated topologies." Our findings reveal that (i) frustration-free topologies represent only a fraction of all theoretically possible patterns, (ii) these topologies strongly favor positioning the C-terminal β-strand at the edge of the β-sheet and the N-terminal β-strand in the middle, and (iii) there is significant overlap between frustration-free topologies and superfolds. We also used a lattice protein model to thoroughly investigate sequence-structure relationships. Our results show that frustration-free structures are highly designable, while frustrated structures are poorly designable. These findings suggest that superfolds are highly designable due to their lack of frustration, and the preference for positioning C-terminal β-strands at the edge of the β-sheet is a direct result of frustration-free topologies. These insights not only enhance our understanding of sequence-structure relationships but also have significant implications for de novo protein design.
Affiliation(s)
- Hiroto Murata
- Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
| | - Kazuma Toko
- Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
| | - George Chikenji
- Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
|
12
|
Medvedev KE, Schaeffer RD, Grishin NV. DrugDomain: The evolutionary context of drugs and small molecules bound to domains. Protein Sci 2024; 33:e5116. [PMID: 38979784 PMCID: PMC11231930 DOI: 10.1002/pro.5116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 06/27/2024] [Accepted: 06/29/2024] [Indexed: 07/10/2024]
Abstract
Interactions between proteins and small organic compounds play a crucial role in regulating protein functions. These interactions can modulate various aspects of protein behavior, including enzymatic activity, signaling cascades, and structural stability. By binding to specific sites on proteins, small organic compounds can induce conformational changes, alter protein-protein interactions, or directly affect catalytic activity. Therefore, many drugs available on the market today are small molecules (72% of all approved drugs in the last 5 years). Proteins are composed of one or more domains: evolutionary units that convey function or fitness either singly or in concert with others. Understanding which domain(s) of the target protein binds to a drug can lead to additional opportunities for discovering novel targets. The Evolutionary Classification of Protein Domains (ECOD) classifies domains into an evolutionary hierarchy that focuses on distant homology. Previously, no structure-based protein domain classification existed that included information about both the interaction between small molecules or drugs and the structural domains of a target protein. This data is especially important for multidomain proteins and large complexes. Here, we present the DrugDomain database that reports the interaction between ECOD domains of human target proteins and DrugBank molecules and drugs. The pilot version of DrugDomain describes the interaction of 5160 DrugBank molecules associated with 2573 human proteins. It describes domains for all experimentally determined structures of these proteins and incorporates AlphaFold models when such structures are unavailable. The DrugDomain database is available online: http://prodata.swmed.edu/DrugDomain/.
Affiliation(s)
- Kirill E. Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
|
13
|
Kandoor A, Martinez G, Hitchcock JM, Angel S, Campbell L, Rizvi S, Naegle KM. CoDIAC: A comprehensive approach for interaction analysis reveals novel insights into SH2 domain function and regulation. bioRxiv 2024:2024.07.18.604100. [PMID: 39091881 PMCID: PMC11291013 DOI: 10.1101/2024.07.18.604100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/04/2024]
Abstract
Protein domains are conserved structural and functional units and are the functional building blocks of proteins. Evolutionary expansion means that domain families are often represented by many members in a species, which are found in various configurations with other domains, which have evolved new specificity for interacting partners. Here, we develop a structure-based interface analysis to comprehensively map domain interfaces from available experimental and predicted structures, including interfaces with other macromolecules and intraprotein interfaces (such as might exist between domains in a protein). We hypothesized that a comprehensive approach to contact mapping of domains could yield new insights. Specifically, we use it to gain information about how domains selectively interact with ligands and whether domain-domain interfaces of repeated domain partnerships are conserved across diverse proteins, and to identify regions of conserved post-translational modifications, using relationship to interaction interfaces as a method to hypothesize the effect of post-translational modifications (and mutations). We applied this approach to the human SH2 domain family, an extensive modular unit that is the foundation of phosphotyrosine-mediated signaling, where we identified a novel approach to understanding the binding selectivity of SH2 domains and evidence that there is coordinated and conserved regulation of multiple SH2 domain binding interfaces by tyrosine and serine/threonine phosphorylation and acetylation, suggesting that multiple signaling systems can regulate protein activity and SH2 domain interactions in a regulated manner. We provide the extensive features of the human SH2 domain family and this modular approach as an open source Python package for COmprehensive Domain Interface Analysis of Contacts (CoDIAC).
Affiliation(s)
- Alekhya Kandoor
- Department of Biomedical Engineering and the Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
| | - Gabrielle Martinez
- Department of Biomedical Engineering and the Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
| | - Julianna M Hitchcock
- Department of Biomedical Engineering and the Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
| | - Savannah Angel
- Department of Biomedical Engineering and the Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
| | - Logan Campbell
- Department of Biomedical Engineering and the Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
| | - Saqib Rizvi
- Department of Biomedical Engineering and the Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
| | - Kristen M Naegle
- Department of Biomedical Engineering and the Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
|
14
|
Medvedev KE, Zhang J, Schaeffer RD, Kinch LN, Cong Q, Grishin NV. Structure classification of the proteins from Salmonella enterica pangenome revealed novel potential pathogenicity islands. Sci Rep 2024; 14:12260. [PMID: 38806511 PMCID: PMC11133325 DOI: 10.1038/s41598-024-60991-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 04/30/2024] [Indexed: 05/30/2024] Open
Abstract
Salmonella enterica is a pathogenic bacterium known for causing severe typhoid fever in humans, making it important to study due to its potential health risks and significant impact on public health. This study provides an evolutionary classification of proteins from the Salmonella enterica pangenome. We classified 17,238 domains from 13,147 proteins from 79,758 Salmonella enterica strains and studied in detail the domains of 272 proteins from 14 characterized Salmonella pathogenicity islands (SPIs). Among SPI-related proteins, 90 proteins function in the secretion machinery. 41% of the domains of SPI proteins have no previous sequence annotation. By comparing clinical and environmental isolates, we identified 3682 proteins that are overrepresented in the clinical group and that we consider potentially pathogenic. Among the domains of potentially pathogenic proteins, only 50% were previously annotated by sequence methods. Moreover, 36% (1330 out of 3682) of potentially pathogenic proteins cannot be classified into the Evolutionary Classification of Protein Domains database (ECOD). Among classified domains of potentially pathogenic proteins, the most populated homology groups include helix-turn-helix (HTH), Immunoglobulin-related, and P-loop domains-related. Functional analysis revealed overrepresentation of these proteins in biological processes related to viral entry into host cell, antibiotic biosynthesis, DNA metabolism and conformation change, and underrepresentation in translational processes. Analysis of the potentially pathogenic proteins indicates that they form 119 clusters or novel potential pathogenicity islands (NPPIs) within the Salmonella genome, suggesting their potential contribution to the bacterium's virulence. One of the NPPIs revealed significant overrepresentation of potentially pathogenic proteins. Overall, our analysis revealed that the identified potentially pathogenic proteins are poorly studied.
Affiliation(s)
- Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Lisa N Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
|
15
|
Gracia B, Montes P, Gutierrez AM, Arun B, Karras GI. Protein-folding chaperones predict structure-function relationships and cancer risk in BRCA1 mutation carriers. Cell Rep 2024; 43:113803. [PMID: 38368609 PMCID: PMC10941025 DOI: 10.1016/j.celrep.2024.113803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 12/28/2023] [Accepted: 02/01/2024] [Indexed: 02/20/2024] Open
Abstract
Predicting the risk of cancer mutations is critical for early detection and prevention, but differences in allelic severity of human carriers confound risk predictions. Here, we elucidate protein folding as a cellular mechanism driving differences in mutation severity of tumor suppressor BRCA1. Using a high-throughput protein-protein interaction assay, we show that protein-folding chaperone binding patterns predict the pathogenicity of variants in the BRCA1 C-terminal (BRCT) domain. HSP70 selectively binds 94% of pathogenic BRCA1-BRCT variants, most of which engage HSP70 more than HSP90. Remarkably, the magnitude of HSP70 binding linearly correlates with loss of folding and function. We identify a prevalent class of human hypomorphic BRCA1 variants that bind moderately to chaperones and retain partial folding and function. Furthermore, chaperone binding signifies greater mutation penetrance and earlier cancer onset in the clinic. Our findings demonstrate the utility of chaperones as quantitative cellular biosensors of variant folding, phenotypic severity, and cancer risk.
Affiliation(s)
- Brant Gracia
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Patricia Montes
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Angelica Maria Gutierrez
- Department of Breast Medical Oncology and Clinical Cancer Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Banu Arun
- Department of Breast Medical Oncology and Clinical Cancer Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Georgios Ioannis Karras
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA; Genetics and Epigenetics Graduate Program, The University of Texas MD Anderson Cancer Center UTHealth Houston Graduate School of Biomedical Sciences, Houston, TX, USA.
|
16
|
Baranowski B, Krysińska M, Gradowski M. KINtaro: protein kinase-like database. BMC Res Notes 2024; 17:50. [PMID: 38365785 PMCID: PMC10870513 DOI: 10.1186/s13104-024-06713-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Accepted: 02/01/2024] [Indexed: 02/18/2024] Open
Abstract
OBJECTIVE: The superfamily of protein kinases features a common Protein Kinase-like (PKL) three-dimensional fold. Proteins with PKL structure can also possess enzymatic activities other than protein phosphorylation, such as AMPylation or glutamylation. PKL proteins play a vital role in the world of living organisms, contributing to the survival of pathogenic bacteria inside host cells, as well as being involved in carcinogenesis and neurological diseases in humans. The superfamily of PKL proteins is constantly growing. Therefore, it is crucial to gather new information about PKL families. RESULTS: To this end, the KINtaro database (http://bioinfo.sggw.edu.pl/kintaro/) has been created as a resource for collecting and sharing such information. KINtaro combines protein sequence information and additional annotations for more than 70 PKL families, including 32 families not associated with the PKL superfamily in established protein domain databases. KINtaro is searchable by keywords and by protein sequence and provides family descriptions, sequences, sequence alignments, HMM models, 3D structure models, experimental structures with PKL domain annotations and sequence logos with catalytic residue annotations.
Affiliation(s)
- Bartosz Baranowski
- Laboratory of Plant Pathogenesis, Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland
| | - Marianna Krysińska
- Department of Biochemistry and Microbiology, Warsaw University of Life Sciences (SGGW), Warsaw, Poland
| | - Marcin Gradowski
- Department of Biochemistry and Microbiology, Warsaw University of Life Sciences (SGGW), Warsaw, Poland.
|
17
|
Schaeffer RD, Zhang J, Medvedev KE, Kinch LN, Cong Q, Grishin NV. ECOD domain classification of 48 whole proteomes from AlphaFold Structure Database using DPAM2. PLoS Comput Biol 2024; 20:e1011586. [PMID: 38416793 PMCID: PMC10927120 DOI: 10.1371/journal.pcbi.1011586] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 10/10/2023] [Revised: 03/11/2024] [Accepted: 02/20/2024] [Indexed: 03/01/2024] Open
Abstract
Protein structure prediction has now been deployed widely across several different large protein sets. Large-scale domain annotation of these predictions can aid in the development of biological insights. Using our Evolutionary Classification of Protein Domains (ECOD) from experimental structures as a basis for classification, we describe the detection and cataloging of domains from 48 whole proteomes deposited in the AlphaFold Database. On average, we can provide positive classification (either of domains or other identifiable non-domain regions) for 90% of residues in all proteomes. We classified 746,349 domains from 536,808 proteins comprised of over 226,424,000 amino acid residues. We examine the varying populations of homologous groups in both eukaryotes and bacteria. In addition to containing a higher fraction of disordered regions and unassigned domains, eukaryotes show a higher proportion of repeated proteins, both globular and small repeats. We enumerate those highly populated domains that are shared in both eukaryotes and bacteria, such as the Rossmann domains, TIM barrels, and P-loop domains. Additionally, we compare the sampling of homologous groups from this whole proteome set against our stable ECOD reference and discuss groups that have been enriched by structure predictions. Finally, we discuss the implication of these results for protein target selection for future classification strategies for very large protein sets.
Affiliation(s)
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Kirill E. Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
|
18
|
Kinch LN, Schaeffer RD, Zhang J, Cong Q, Orth K, Grishin N. Insights into virulence: structure classification of the Vibrio parahaemolyticus RIMD mobilome. mSystems 2023; 8:e0079623. [PMID: 38014954 PMCID: PMC10734457 DOI: 10.1128/msystems.00796-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 10/17/2023] [Indexed: 11/29/2023] Open
Abstract
IMPORTANCE: The pandemic Vpar strain RIMD causes seafood-borne illness worldwide. Previous comparative genomic studies have revealed pathogenicity islands in RIMD that contribute to the success of the strain in infection. However, not all virulence determinants have been identified, and many of the proteins encoded in known pathogenicity islands are of unknown function. Based on the ECOD database, we used evolution-based classification of structure models for the RIMD proteome to improve our functional understanding of virulence determinants acquired by the pandemic strain. We further identify and classify previously unknown mobile protein domains as well as fast-evolving residue positions in structure models that contribute to virulence and adaptation with respect to a pre-pandemic strain. Our work highlights key contributions of phage in mediating seafood-borne illness, suggesting this strain balances its avoidance of phage predators with its successful colonization of human hosts.
Affiliation(s)
- Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Kim Orth
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Nick Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
|
19
|
Lau AM, Kandathil SM, Jones DT. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nat Commun 2023; 14:8445. [PMID: 38114456 PMCID: PMC10730818 DOI: 10.1038/s41467-023-43934-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/09/2023] [Accepted: 11/24/2023] [Indexed: 12/21/2023] Open
Abstract
The AlphaFold Protein Structure Database, containing predictions for over 200 million proteins, has been met with enthusiasm over its potential in enriching structural biological research and beyond. Currently, access to the database is precluded by an urgent need for tools that allow the efficient traversal, discovery, and documentation of its contents. Identifying domain regions in the database is a non-trivial endeavour and doing so will aid our understanding of protein structure and function, while facilitating drug discovery and comparative genomics. Here, we describe a deep learning method for domain segmentation called Merizo, which learns to cluster residues into domains in a bottom-up manner. Merizo is trained on CATH domains and fine-tuned on AlphaFold2 models via self-distillation, enabling it to be applied to both experimental and AlphaFold2 models. As proof of concept, we apply Merizo to the human proteome, identifying 40,818 putative domains that can be matched to CATH representative domains.
Affiliation(s)
- Andy M Lau
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - David T Jones
- Department of Computer Science, University College London, London, WC1E 6BT, UK.
|
20
|
Pei J, Cong Q. Computational analysis of regulatory regions in human protein kinases. Protein Sci 2023; 32:e4764. [PMID: 37632170 PMCID: PMC10503413 DOI: 10.1002/pro.4764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 08/08/2023] [Accepted: 08/22/2023] [Indexed: 08/27/2023]
Abstract
Eukaryotic proteins often feature modular domain structures comprising globular domains that are connected by linker regions and intrinsically disordered regions that may contain important functional motifs. The intramolecular interactions of globular domains and nonglobular regions can play critical roles in different aspects of protein function. However, studying these interactions and their regulatory roles can be challenging due to the flexibility of nonglobular regions, the long insertions separating interacting modules, and the transient nature of some interactions. Obtaining the experimental structures of multiple domains and functional regions is more difficult than determining the structures of individual globular domains. High-quality structural models generated by AlphaFold offer a unique opportunity to study intramolecular interactions in eukaryotic proteins. In this study, we systematically explored intramolecular interactions between human protein kinase domains (KDs) and potential regulatory regions, including globular domains, N- and C-terminal tails, long insertions, and distal nonglobular regions. Our analysis identified intramolecular interactions between human KDs and 35 different types of globular domains, exhibiting a variety of interaction modes that could contribute to orthosteric or allosteric regulation of kinase activity. We also identified prevalent interactions between human KDs and their flanking regions (N- and C-terminal tails). These interactions exhibit group-specific characteristics and can vary within each specific kinase group. Although long-range interactions between KDs and nonglobular regions are relatively rare, structural details of these interactions offer new insights into the regulation mechanisms of several kinases, such as HASPIN, MAPK7, MAPK15, and SIK1B.
Affiliation(s)
- Jimin Pei
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
|
21
|
Gracia B, Montes P, Gutierrez AM, Arun B, Karras GI. Protein-Folding Chaperones Predict Structure-Function Relationships and Cancer Risk in BRCA1 Mutation Carriers. bioRxiv 2023:2023.09.14.557795. [PMID: 37745493 PMCID: PMC10515940 DOI: 10.1101/2023.09.14.557795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Identifying pathogenic mutations and predicting their impact on protein structure, function and phenotype remain major challenges in genome sciences. Protein-folding chaperones participate in structure-function relationships by facilitating the folding of protein variants encoded by mutant genes. Here, we utilize a high-throughput protein-protein interaction assay to test HSP70 and HSP90 chaperone interactions as predictors of pathogenicity for variants in the tumor suppressor BRCA1. Chaperones bind 77% of pathogenic BRCA1-BRCT variants, most of which engaged HSP70 more than HSP90. Remarkably, the magnitude of chaperone binding to variants is proportional to the degree of structural and phenotypic defect induced by BRCA1 mutation. Quantitative chaperone interactions identified BRCA1-BRCT separation-of-function variants and hypomorphic alleles missed by pathogenicity prediction algorithms. Furthermore, increased chaperone binding signified greater cancer risk in human BRCA1 carriers. Altogether, our study showcases the utility of chaperones as quantitative cellular biosensors of variant folding and phenotypic severity. HIGHLIGHTS: Chaperones detect an abundance of pathogenic folding variants of BRCA1-BRCT. Degree of chaperone binding reflects severity of structural and phenotypic defect. Chaperones identify separation-of-function and hypomorphic variants. Chaperone interactions indicate penetrance and expressivity of BRCA1 alleles.
|
22
|
Medvedev KE, Schaeffer RD, Chen KS, Grishin NV. Pan-cancer structurome reveals overrepresentation of beta sandwiches and underrepresentation of alpha helical domains. Sci Rep 2023; 13:11988. [PMID: 37491511 PMCID: PMC10368619 DOI: 10.1038/s41598-023-39273-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Accepted: 07/22/2023] [Indexed: 07/27/2023] Open
Abstract
The recent progress in the prediction of protein structures marked a historical milestone. AlphaFold predicted 200 million protein models with an accuracy comparable to experimental methods. Protein structures are widely used to understand evolution and to identify potential drug targets for the treatment of various diseases, including cancer. Thus, these recently predicted structures might convey previously unavailable information about cancer biology. Evolutionary classification of protein domains is challenging and different approaches exist. Recently our team presented a classification of domains from human protein models released by AlphaFold. Here we evaluated the pan-cancer structurome, domains from over- and underexpressed proteins in 21 cancer types, using the broadest levels of the ECOD classification: the architecture (A-groups) and possible homology (X-groups) levels. Our analysis reveals that AlphaFold has greatly increased the three-dimensional structural landscape for proteins that are differentially expressed in these 21 cancer types. We show that beta sandwich domains are significantly overrepresented and alpha helical domains are significantly underrepresented in the majority of cancer types. Our data suggest that the prevalence of the beta sandwiches is due to the high levels of immunoglobulins and immunoglobulin-like domains that arise during tumor development-related inflammation. On the other hand, proteins with exclusively alpha domains are important elements of homeostasis, apoptosis and transmembrane transport. Therefore, cancer cells tend to reduce representation of these proteins to promote successful oncogenesis.
Affiliation(s)
- Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Kenneth S Chen
- Department of Pediatrics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Children's Medical Center Research Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
|