1
|
Zhao B, Basu S, Kurgan L. DescribePROT Database of Residue-Level Protein Structure and Function Annotations. Methods Mol Biol 2025; 2867:169-184. [PMID: 39576581 DOI: 10.1007/978-1-0716-4196-5_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2024]
Abstract
DescribePROT is a freely available online database of structural and functional descriptors of proteins at the amino acid level. It provides access to 13 diverse descriptors that include sequence conservation, putative secondary structure, solvent accessibility, intrinsic disorder, and signal peptides, and putative annotations of residues that interact with proteins, peptides and nucleic acids. These data can be used to elucidate protein functions, to support efforts to develop therapeutics, and to develop and evaluate future predictors of protein structure and function. DescribePROT includes 7.8 billion predictions for 1.4 million proteins from 83 complete proteomes of popular model organisms. This information can be downloaded at multiple levels of scope (entire database, specific organisms, and individual proteins) and can be interacted with using a graphical interface that simultaneously displays data on multiple descriptors. We describe the contents of this resource, provide directions on how to use its interface, and offer instructions on how to obtain and interact with the underlying data. Moreover, we briefly discuss plans for a future expansion of this database. DescribePROT is available at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/ .
Collapse
Affiliation(s)
- Bi Zhao
- Genomics program, College of Public Health, University of South Florida, Tampa, FL, USA
| | - Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
| |
Collapse
|
2
|
Basu S, Kurgan L. Taxonomy-specific assessment of intrinsic disorder predictions at residue and region levels in higher eukaryotes, protists, archaea, bacteria and viruses. Comput Struct Biotechnol J 2024; 23:1968-1977. [PMID: 38765610 PMCID: PMC11098722 DOI: 10.1016/j.csbj.2024.04.059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2024] [Revised: 04/23/2024] [Accepted: 04/24/2024] [Indexed: 05/22/2024] Open
Abstract
Intrinsic disorder predictors were evaluated in several studies including the two large CAID experiments. However, these studies are biased towards eukaryotic proteins and focus primarily on the residue-level predictions. We provide first-of-its-kind assessment that comprehensively covers the taxonomy and evaluates predictions at the residue and disordered region levels. We curate a benchmark dataset that uniformly covers eukaryotic, archaeal, bacterial, and viral proteins. We find that predictive performance differs substantially across taxonomy, where viruses are predicted most accurately, followed by protists and higher eukaryotes, while bacterial and archaeal proteins suffer lower levels of accuracy. These trends are consistent across predictors. We also find that current tools, except for flDPnn, struggle with reproducing native distributions of the numbers and sizes of the disordered regions. Moreover, analysis of two variants of disorder predictions derived from the AlphaFold2 predicted structures reveals that they produce accurate residue-level propensities for archaea, bacteria and protists. However, they underperform for higher eukaryotes and generally struggle to accurately identify disordered regions. Our results motivate development of new predictors that target bacteria and archaea and which produce accurate results at both residue and region levels. We also stress the need to include the region-level assessments in future assessments.
Collapse
Affiliation(s)
- Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
3
|
Basu S, Yu J, Kihara D, Kurgan L. Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences. Brief Bioinform 2024; 26:bbaf016. [PMID: 39833102 PMCID: PMC11745544 DOI: 10.1093/bib/bbaf016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2024] [Revised: 12/24/2024] [Accepted: 01/06/2025] [Indexed: 01/22/2025] Open
Abstract
Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types. The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area.
Collapse
Affiliation(s)
- Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
| | - Jing Yu
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
| | - Daisuke Kihara
- Department of Biological Sciences, Purdue University, 915 Mitch Daniels Boulevard, West Lafayette, IN 47907, United States
- Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907, United States
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
| |
Collapse
|
4
|
Mouland AJ, Chau BA, Uversky VN. Methodological approaches to studying phase separation and HIV-1 replication: Current and future perspectives. Methods 2024; 229:147-155. [PMID: 39002735 DOI: 10.1016/j.ymeth.2024.07.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2023] [Revised: 06/26/2024] [Accepted: 07/11/2024] [Indexed: 07/15/2024] Open
Abstract
This article reviews tried-and-tested methodologies that have been employed in the first studies on phase separating properties of structural, RNA-binding and catalytic proteins of HIV-1. These are described here to stimulate interest for any who may want to initiate similar studies on virus-mediated liquid-liquid phase separation. Such studies serve to better understand the life cycle and pathogenesis of viruses and open the door to new therapeutics.
Collapse
Affiliation(s)
- Andrew J Mouland
- Department of Medicine, McGill University, Montreal, Quebec, Canada.
| | - Bao-An Chau
- Department of Medicine, McGill University, Montreal, Quebec, Canada
| | - Vladimir N Uversky
- Department of Molecular Medicine and Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA.
| |
Collapse
|
5
|
Basu S, Hegedűs T, Kurgan L. CoMemMoRFPred: Sequence-based Prediction of MemMoRFs by Combining Predictors of Intrinsic Disorder, MoRFs and Disordered Lipid-binding Regions. J Mol Biol 2023; 435:168272. [PMID: 37709009 DOI: 10.1016/j.jmb.2023.168272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2023] [Revised: 09/01/2023] [Accepted: 09/07/2023] [Indexed: 09/16/2023]
Abstract
Molecular recognition features (MoRFs) are a commonly occurring type of intrinsically disordered regions (IDRs) that undergo disorder-to-order transition upon binding to partner molecules. We focus on recently characterized and functionally important membrane-binding MoRFs (MemMoRFs). Motivated by the lack of computational tools that predict MemMoRFs, we use a dataset of experimentally annotated MemMoRFs to conceptualize, design, evaluate and release an accurate sequence-based predictor. We rely on state-of-the-art tools that predict residues that possess key characteristics of MemMoRFs, such as intrinsic disorder, disorder-to-order transition and lipid-binding. We identify and combine results from three tools that include flDPnn for the disorder prediction, DisoLipPred for the prediction of disordered lipid-binding regions, and MoRFCHiBiLight for the prediction of disorder-to-order transitioning protein binding regions. Our empirical analysis demonstrates that combining results produced by these three methods generates accurate predictions of MemMoRFs. We also show that use of a smoothing operator produces predictions that closely mimic the number and sizes of the native MemMoRF regions. The resulting CoMemMoRFPred method is available as an easy-to-use webserver at http://biomine.cs.vcu.edu/servers/CoMemMoRFPred. This tool will aid future studies of MemMoRFs in the context of exploring their abundance, cellular functions, and roles in pathologic phenomena.
Collapse
Affiliation(s)
- Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, USA
| | - Tamás Hegedűs
- Department of Biophysics and Radiation Biology, Semmelweis University, Budapest, Hungary; ELKH-SE Biophysical Virology Research Group, Eötvös Loránd Research Network, Budapest, Hungary
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, USA.
| |
Collapse
|
6
|
Kurgan L, Hu G, Wang K, Ghadermarzi S, Zhao B, Malhis N, Erdős G, Gsponer J, Uversky VN, Dosztányi Z. Tutorial: a guide for the selection of fast and accurate computational tools for the prediction of intrinsic disorder in proteins. Nat Protoc 2023; 18:3157-3172. [PMID: 37740110 DOI: 10.1038/s41596-023-00876-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2022] [Accepted: 06/21/2023] [Indexed: 09/24/2023]
Abstract
Intrinsic disorder is instrumental for a wide range of protein functions, and its analysis, using computational predictions from primary structures, complements secondary and tertiary structure-based approaches. In this Tutorial, we provide an overview and comparison of 23 publicly available computational tools with complementary parameters useful for intrinsic disorder prediction, partly relying on results from the Critical Assessment of protein Intrinsic Disorder prediction experiment. We consider factors such as accuracy, runtime, availability and the need for functional insights. The selected tools are available as web servers and downloadable programs, offer state-of-the-art predictions and can be used in a high-throughput manner. We provide examples and instructions for the selected tools to illustrate practical aspects related to the submission, collection and interpretation of predictions, as well as the timing and their limitations. We highlight two predictors for intrinsically disordered proteins, flDPnn as accurate and fast and IUPred as very fast and moderately accurate, while suggesting ANCHOR2 and MoRFchibi as two of the best-performing predictors for intrinsically disordered region binding. We link these tools to additional resources, including databases of predictions and web servers that integrate multiple predictive methods. Altogether, this Tutorial provides a hands-on guide to comparatively evaluating multiple predictors, submitting and collecting their own predictions, and reading and interpreting results. It is suitable for experimentalists and computational biologists interested in accurately and conveniently identifying intrinsic disorder, facilitating the functional characterization of the rapidly growing collections of protein sequences.
Collapse
Affiliation(s)
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
| | - Gang Hu
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
| | - Kui Wang
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
| | - Sina Ghadermarzi
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Nawar Malhis
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
| | - Gábor Erdős
- MTA-ELTE Momentum Bioinformatics Research Group, Department of Biochemistry, Eötvös Loránd University, Budapest, Hungary
| | - Jörg Gsponer
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada.
| | - Vladimir N Uversky
- Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL, USA.
- Byrd Alzheimer's Center and Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA.
| | - Zsuzsanna Dosztányi
- MTA-ELTE Momentum Bioinformatics Research Group, Department of Biochemistry, Eötvös Loránd University, Budapest, Hungary.
| |
Collapse
|
7
|
Zhao B, Ghadermarzi S, Kurgan L. Comparative evaluation of AlphaFold2 and disorder predictors for prediction of intrinsic disorder, disorder content and fully disordered proteins. Comput Struct Biotechnol J 2023; 21:3248-3258. [PMID: 38213902 PMCID: PMC10782001 DOI: 10.1016/j.csbj.2023.06.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2023] [Revised: 05/31/2023] [Accepted: 06/01/2023] [Indexed: 01/13/2024] Open
Abstract
We expand studies of AlphaFold2 (AF2) in the context of intrinsic disorder prediction by comparing it against a broad selection of 20 accurate, popular and recently released disorder predictors. We use 25% larger benchmark dataset with 646 proteins and cover protein-level predictions of disorder content and fully disordered proteins. AF2-based disorder predictions secure a relatively high Area Under receiver operating characteristic Curve (AUC) of 0.77 and are statistically outperformed by several modern disorder predictors that secure AUCs around 0.8 with median runtime of about 20 s compared to 1200 s for AF2. Moreover, AF2 provides modestly accurate predictions of fully disordered proteins (F1 = 0.59 vs. 0.91 for the best disorder predictor) and disorder content (mean absolute error of 0.21 vs. 0.15). AF2 also generates statistically more accurate disorder predictions for about 20% of proteins that have relatively short sequences and a few disordered regions that tend to be located at the sequence termini, and which are absent of disordered protein-binding regions. Interestingly, AF2 and the most accurate disorder predictors rely on deep neural networks, suggesting that these models are useful for protein structure and disorder predictions.
Collapse
Affiliation(s)
- Bi Zhao
- Genomics program, College of Public Health, University of South Florida, Tampa, FL, United States
| | - Sina Ghadermarzi
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| |
Collapse
|
8
|
Abstract
There are over 100 computational predictors of intrinsic disorder. These methods predict amino acid-level propensities for disorder directly from protein sequences. The propensities can be used to annotate putative disordered residues and regions. This unit provides a practical and holistic introduction to the sequence-based intrinsic disorder prediction. We define intrinsic disorder, explain the format of computational prediction of disorder, and identify and describe several accurate predictors. We also introduce recently released databases of intrinsic disorder predictions and use an illustrative example to provide insights into how predictions should be interpreted and combined. Lastly, we summarize key experimental methods that can be used to validate computational predictions. © 2023 Wiley Periodicals LLC.
Collapse
Affiliation(s)
- Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, Florida
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia
| |
Collapse
|
9
|
Computational prediction of disordered binding regions. Comput Struct Biotechnol J 2023; 21:1487-1497. [PMID: 36851914 PMCID: PMC9957716 DOI: 10.1016/j.csbj.2023.02.018] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Revised: 02/08/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023] Open
Abstract
One of the key features of intrinsically disordered regions (IDRs) is their ability to interact with a broad range of partner molecules. Multiple types of interacting IDRs were identified including molecular recognition fragments (MoRFs), short linear sequence motifs (SLiMs), and protein-, nucleic acids- and lipid-binding regions. Prediction of binding IDRs in protein sequences is gaining momentum in recent years. We survey 38 predictors of binding IDRs that target interactions with a diverse set of partners, such as peptides, proteins, RNA, DNA and lipids. We offer a historical perspective and highlight key events that fueled efforts to develop these methods. These tools rely on a diverse range of predictive architectures that include scoring functions, regular expressions, traditional and deep machine learning and meta-models. Recent efforts focus on the development of deep neural network-based architectures and extending coverage to RNA, DNA and lipid-binding IDRs. We analyze availability of these methods and show that providing implementations and webservers results in much higher rates of citations/use. We also make several recommendations to take advantage of modern deep network architectures, develop tools that bundle predictions of multiple and different types of binding IDRs, and work on algorithms that model structures of the resulting complexes.
Collapse
|
10
|
Han B, Ren C, Wang W, Li J, Gong X. Computational Prediction of Protein Intrinsically Disordered Region Related Interactions and Functions. Genes (Basel) 2023; 14:432. [PMID: 36833360 PMCID: PMC9956190 DOI: 10.3390/genes14020432] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2022] [Revised: 02/02/2023] [Accepted: 02/05/2023] [Indexed: 02/11/2023] Open
Abstract
Intrinsically Disordered Proteins (IDPs) and Regions (IDRs) exist widely. Although without well-defined structures, they participate in many important biological processes. In addition, they are also widely related to human diseases and have become potential targets in drug discovery. However, there is a big gap between the experimental annotations related to IDPs/IDRs and their actual number. In recent decades, the computational methods related to IDPs/IDRs have been developed vigorously, including predicting IDPs/IDRs, the binding modes of IDPs/IDRs, the binding sites of IDPs/IDRs, and the molecular functions of IDPs/IDRs according to different tasks. In view of the correlation between these predictors, we have reviewed these prediction methods uniformly for the first time, summarized their computational methods and predictive performance, and discussed some problems and perspectives.
Collapse
Affiliation(s)
- Bingqing Han
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
| | - Chongjiao Ren
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
| | - Wenda Wang
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
| | - Jiashan Li
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
| | - Xinqi Gong
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
- Beijing Academy of Intelligence, Beijing 100083, China
| |
Collapse
|
11
|
Zhang F, Li M, Zhang J, Shi W, Kurgan L. DeepPRObind: Modular Deep Learner that Accurately Predicts Structure and Disorder-Annotated Protein Binding Residues. J Mol Biol 2023:167945. [PMID: 36621533 DOI: 10.1016/j.jmb.2023.167945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2022] [Revised: 12/15/2022] [Accepted: 01/01/2023] [Indexed: 01/07/2023]
Abstract
Current sequence-based predictors of protein-binding residues (PBRs) belong to two distinct categories: structure-trained vs. intrinsic disorder-trained. Since disordered PBRs differ from structured PBRs in several ways, including ability to bind multiple partners by folding into different conformations and enrichment in different amino acids, the structure-trained and disorder-trained predictors were shown to provide inaccurate results for the other annotation type. A simple consensus-based solution that combines structure- and disorder-trained methods provides limited levels of predictive performance and generates relatively many cross-predictions, where residues that interact with other ligand types are predicted as PBRs. We address this unsolved problem by designing a novel and fast deep-learner, DeepPRObind, that relies on carefully designed modular convolutional architecture and uses innovative aggregate input features. Comparative empirical tests on a low-similarity test dataset reveal that DeepPRObind generates accurate predictions of structured and disordered PBRs and low amounts of cross-predictions, outperforming a comprehensive collection of 12 predictors of PBRs. Given the relatively low runtime of DeepPRObind (40 seconds per protein), we further validate its results based on an analysis of putative PBRs in the yeast proteome, confirming that interactions in disordered regions are enriched among hub proteins. We release DeepPRObind as a convenient web server at https://www.csuligroup.com/DeepPRObind/.
Collapse
Affiliation(s)
- Fuhao Zhang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China.
| | - Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China
| | - Wenbo Shi
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA.
| |
Collapse
|
12
|
Waury K, Willemse EAJ, Vanmechelen E, Zetterberg H, Teunissen CE, Abeln S. Bioinformatics tools and data resources for assay development of fluid protein biomarkers. Biomark Res 2022; 10:83. [DOI: 10.1186/s40364-022-00425-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2022] [Accepted: 10/25/2022] [Indexed: 11/16/2022] Open
Abstract
AbstractFluid protein biomarkers are important tools in clinical research and health care to support diagnosis and to monitor patients. Especially within the field of dementia, novel biomarkers could address the current challenges of providing an early diagnosis and of selecting trial participants. While the great potential of fluid biomarkers is recognized, their implementation in routine clinical use has been slow. One major obstacle is the often unsuccessful translation of biomarker candidates from explorative high-throughput techniques to sensitive antibody-based immunoassays. In this review, we propose the incorporation of bioinformatics into the workflow of novel immunoassay development to overcome this bottleneck and thus facilitate the development of novel biomarkers towards clinical laboratory practice. Due to the rapid progress within the field of bioinformatics many freely available and easy-to-use tools and data resources exist which can aid the researcher at various stages. Current prediction methods and databases can support the selection of suitable biomarker candidates, as well as the choice of appropriate commercial affinity reagents. Additionally, we examine methods that can determine or predict the epitope - an antibody’s binding region on its antigen - and can help to make an informed choice on the immunogenic peptide used for novel antibody production. Selected use cases for biomarker candidates help illustrate the application and interpretation of the introduced tools.
Collapse
|
13
|
Farooq T, Hussain MD, Shakeel MT, Riaz H, Waheed U, Siddique M, Shahzadi I, Aslam MN, Tang Y, She X, He Z. Global genetic diversity and evolutionary patterns among Potato leafroll virus populations. Front Microbiol 2022; 13:1022016. [PMID: 36590416 PMCID: PMC9801716 DOI: 10.3389/fmicb.2022.1022016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Accepted: 09/12/2022] [Indexed: 01/04/2023] Open
Abstract
Potato leafroll virus (PLRV) is a widespread and one of the most damaging viral pathogens causing significant quantitative and qualitative losses in potato worldwide. The current knowledge of the geographical distribution, standing genetic diversity and the evolutionary patterns existing among global PLRV populations is limited. Here, we employed several bioinformatics tools and comprehensively analyzed the diversity, genomic variability, and the dynamics of key evolutionary factors governing the global spread of this viral pathogen. To date, a total of 84 full-genomic sequences of PLRV isolates have been reported from 22 countries with most genomes documented from Kenya. Among all PLRV-encoded major proteins, RTD and P0 displayed the highest level of nucleotide variability. The highest percentage of mutations were associated with RTD (38.81%) and P1 (31.66%) in the coding sequences. We detected a total of 10 significantly supported recombination events while the most frequently detected ones were associated with PLRV genome sequences reported from Kenya. Notably, the distribution patterns of recombination breakpoints across different genomic regions of PLRV isolates remained variable. Further analysis revealed that with exception of a few positively selected codons, a major part of the PLRV genome is evolving under strong purifying selection. Protein disorder prediction analysis revealed that CP-RTD had the highest percentage (48%) of disordered amino acids and the majority (27%) of disordered residues were positioned at the C-terminus. These findings will extend our current knowledge of the PLRV geographical prevalence, genetic diversity, and evolutionary factors that are presumably shaping the global spread and successful adaptation of PLRV as a destructive potato pathogen to geographically isolated regions of the world.
Collapse
Affiliation(s)
- Tahir Farooq
- Guangdong Academy of Agricultural Sciences, Plant Protection Research Institute and Guangdong Provincial Key Laboratory of High Technology for Plant Protection, Guangzhou, China
| | - Muhammad Dilshad Hussain
- State Key Laboratory for Agro-Biotechnology, and Ministry of Agriculture and Rural Affairs, Key Laboratory for Pest Monitoring and Green Management, Department of Plant Pathology, China Agricultural University, Beijing, China
| | - Muhammad Taimoor Shakeel
- Department of Plant Pathology, Faculty of Agriculture & Environment, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
| | - Hasan Riaz
- Institute of Plant Protection, Muhammad Nawaz Shareef University of Agriculture, Multan, Pakistan
| | - Ummara Waheed
- Institute of Plant Breeding and Biotechnology, Muhammad Nawaz Shareef University of Agriculture, Multan, Pakistan
| | - Maria Siddique
- Department of Environmental Sciences, COMSATS University Islamabad, Abbottabad, Pakistan
| | - Irum Shahzadi
- Department of Biotechnology, COMSATS University Islamabad, Abbottabad, Pakistan
| | - Muhammad Naveed Aslam
- Department of Plant Pathology, Faculty of Agriculture & Environment, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
| | - Yafei Tang
- Guangdong Academy of Agricultural Sciences, Plant Protection Research Institute and Guangdong Provincial Key Laboratory of High Technology for Plant Protection, Guangzhou, China
| | - Xiaoman She
- Guangdong Academy of Agricultural Sciences, Plant Protection Research Institute and Guangdong Provincial Key Laboratory of High Technology for Plant Protection, Guangzhou, China,*Correspondence: Xiaoman She, ; Zifu He,
| | - Zifu He
- Guangdong Academy of Agricultural Sciences, Plant Protection Research Institute and Guangdong Provincial Key Laboratory of High Technology for Plant Protection, Guangzhou, China,*Correspondence: Xiaoman She, ; Zifu He,
| |
Collapse
|
14
|
Compositional Bias of Intrinsically Disordered Proteins and Regions and Their Predictions. Biomolecules 2022; 12:biom12070888. [PMID: 35883444 PMCID: PMC9313023 DOI: 10.3390/biom12070888] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Revised: 06/10/2022] [Accepted: 06/10/2022] [Indexed: 11/17/2022] Open
Abstract
Intrinsically disordered regions (IDRs) carry out many cellular functions and vary in length and placement in protein sequences. This diversity leads to variations in the underlying compositional biases, which were demonstrated for the short vs. long IDRs. We analyze compositional biases across four classes of disorder: fully disordered proteins; short IDRs; long IDRs; and binding IDRs. We identify three distinct biases: for the fully disordered proteins, the short IDRs and the long and binding IDRs combined. We also investigate compositional bias for putative disorder produced by leading disorder predictors and find that it is similar to the bias of the native disorder. Interestingly, the accuracy of disorder predictions across different methods is correlated with the correctness of the compositional bias of their predictions highlighting the importance of the compositional bias. The predictive quality is relatively low for the disorder classes with compositional bias that is the most different from the “generic” disorder bias, while being much higher for the classes with the most similar bias. We discover that different predictors perform best across different classes of disorder. This suggests that no single predictor is universally best and motivates the development of new architectures that combine models that target specific disorder classes.
Collapse
|
15
|
Biró B, Zhao B, Kurgan L. Complementarity of the residue-level protein function and structure predictions in human proteins. Comput Struct Biotechnol J 2022; 20:2223-2234. [PMID: 35615015 PMCID: PMC9118482 DOI: 10.1016/j.csbj.2022.05.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Revised: 05/02/2022] [Accepted: 05/02/2022] [Indexed: 11/24/2022] Open
Abstract
Sequence-based predictors of the residue-level protein function and structure cover a broad spectrum of characteristics including intrinsic disorder, secondary structure, solvent accessibility and binding to nucleic acids. They were catalogued and evaluated in numerous surveys and assessments. However, methods focusing on a given characteristic are studied separately from predictors of other characteristics, while they are typically used on the same proteins. We fill this void by studying complementarity of a representative collection of methods that target different predictions using a large, taxonomically consistent, and low similarity dataset of human proteins. First, we bridge the gap between the communities that develop structure-trained vs. disorder-trained predictors of binding residues. Motivated by a recent study of the protein-binding residue predictions, we empirically find that combining the structure-trained and disorder-trained predictors of the DNA-binding and RNA-binding residues leads to substantial improvements in predictive quality. Second, we investigate whether diverse predictors generate results that accurately reproduce relations between secondary structure, solvent accessibility, interaction sites, and intrinsic disorder that are present in the experimental data. Our empirical analysis concludes that predictions accurately reflect all combinations of these relations. Altogether, this study provides unique insights that support combining results produced by diverse residue-level predictors of protein function and structure.
Collapse
Affiliation(s)
- Bálint Biró
- Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Gödöllő, Hungary
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| |
Collapse
|
16
|
Predicting protein intrinsically disordered regions by applying natural language processing practices. Soft comput 2022. [DOI: 10.1007/s00500-022-07085-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
17
|
Micsonai A, Moussong É, Murvai N, Tantos Á, Tőke O, Réfrégiers M, Wien F, Kardos J. Disordered-Ordered Protein Binary Classification by Circular Dichroism Spectroscopy. Front Mol Biosci 2022; 9:863141. [PMID: 35591946 PMCID: PMC9110821 DOI: 10.3389/fmolb.2022.863141] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2022] [Accepted: 03/24/2022] [Indexed: 12/31/2022] Open
Abstract
Intrinsically disordered proteins lack a stable tertiary structure and form dynamic conformational ensembles due to their characteristic physicochemical properties and amino acid composition. They are abundant in nature and responsible for a large variety of cellular functions. While numerous bioinformatics tools have been developed for in silico disorder prediction in the last decades, there is a need for experimental methods to verify the disordered state. CD spectroscopy is widely used for protein secondary structure analysis. It is usable in a wide concentration range under various buffer conditions. Even without providing high-resolution information, it is especially useful when NMR, X-ray, or other techniques are problematic or one simply needs a fast technique to verify the structure of proteins. Here, we propose an automatized binary disorder-order classification method by analyzing far-UV CD spectroscopy data. The method needs CD data at only three wavelength points, making high-throughput data collection possible. The mathematical analysis applies the k-nearest neighbor algorithm with cosine distance function, which is independent of the spectral amplitude and thus free of concentration determination errors. Moreover, the method can be used even for strong absorbing samples, such as the case of crowded environmental conditions, if the spectrum can be recorded down to the wavelength of 212 nm. We believe the classification method will be useful in identifying disorder and will also facilitate the growth of experimental data in IDP databases. The method is implemented on a webserver and freely available for academic users.
Collapse
Affiliation(s)
- András Micsonai
- ELTE NAP Neuroimmunology Research Group, Department of Biochemistry, Institute of Biology, ELTE Eötvös Loránd University, Budapest, Hungary
| | - Éva Moussong
- ELTE NAP Neuroimmunology Research Group, Department of Biochemistry, Institute of Biology, ELTE Eötvös Loránd University, Budapest, Hungary
| | - Nikoletta Murvai
- Department of Biochemistry, Institute of Biology, ELTE Eötvös Loránd University, Budapest, Hungary
- Institute of Enzymology, Research Centre for Natural Sciences, Budapest, Hungary
| | - Ágnes Tantos
- Institute of Enzymology, Research Centre for Natural Sciences, Budapest, Hungary
| | - Orsolya Tőke
- Laboratory for NMR Spectroscopy, Research Centre for Natural Sciences, Budapest, Hungary
| | - Matthieu Réfrégiers
- Synchrotron SOLEIL, Gif-sur-Yvette, France
- Centre de Biophysique Moléculaire, CNRS UPR4301, Orléans, France
| | - Frank Wien
- Synchrotron SOLEIL, Gif-sur-Yvette, France
| | - József Kardos
- ELTE NAP Neuroimmunology Research Group, Department of Biochemistry, Institute of Biology, ELTE Eötvös Loránd University, Budapest, Hungary
| |
Collapse
|
18
|
Poboinev VV, Khrustalev VV, Khrustaleva TA, Kasko TE, Popkov VD. The PentUnFOLD algorithm as a tool to distinguish the dark and the light sides of the structural instability of proteins. Amino Acids 2022; 54:1155-1171. [PMID: 35294674 PMCID: PMC8924573 DOI: 10.1007/s00726-022-03153-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2021] [Accepted: 02/14/2022] [Indexed: 12/12/2022]
Abstract
Intrinsically disordered proteins are frequently involved in important regulatory processes in the cell thanks to their ability to bind several different targets performing sometimes even opposite functions. The PentUnFOLD algorithm is a physicochemical method that is based on new propensity scales for disordered, nonstable and stable elements of secondary structure and on the counting of stabilizing and destabilizing intraprotein contacts. Unlike other methods, it works with a PDB file, and it can determine not only those fragments of alpha helices, beta strands, and random coils that can turn into disordered state (the “dark” side of the disorder), but also nonstable regions of alpha helices and beta strands which are able to turn into random coils (the “light” side), and vice versa (H ↔ C, E ↔ C). The scales have been obtained from structural data on disordered regions from the middle parts of amino acid sequences only, and not on their expectedly disordered N- and C-termini. Among other tendencies we have found that regions of both alpha helices and beta strands that can turn into the disordered state are relatively enriched in residues of Ala, Met, Asp, and Lys, while regions of both alpha helices and beta strands that can turn into random coil are relatively enriched in hydrophilic residues, and Cys, Pro, and Gly. Moreover, PentUnFOLD has the option to determine the effect of secondary structure transitions on the stability of a given region of a protein. The PentUnFOLD algorithm is freely available at http://3.17.12.213/pent-un-fold and http://chemres.bsmu.by/PentUnFOLD.htm.
Collapse
Affiliation(s)
| | | | - Tatyana Aleksandrovna Khrustaleva
- Biochemical Group of the Multidisciplinary Diagnostic Laboratory, Institute of Physiology of the National Academy of Sciences of Belarus, Minsk, Belarus
| | - Tihon Evgenyevich Kasko
- Department of General Chemistry, Belarusian State Medical University, Dzerzinskogo 83, Minsk, Belarus
| | - Vadim Dmitrievich Popkov
- Department of General Chemistry, Belarusian State Medical University, Dzerzinskogo 83, Minsk, Belarus
| |
Collapse
|
19
|
Zhao B, Kurgan L. Deep learning in prediction of intrinsic disorder in proteins. Comput Struct Biotechnol J 2022; 20:1286-1294. [PMID: 35356546 PMCID: PMC8927795 DOI: 10.1016/j.csbj.2022.03.003] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2022] [Revised: 03/04/2022] [Accepted: 03/04/2022] [Indexed: 12/12/2022] Open
Abstract
Intrinsic disorder prediction is an active area that has developed over 100 predictors. We identify and investigate a recent trend towards the development of deep neural network (DNN)-based methods. The first DNN-based method was released in 2013 and since 2019 deep learners account for majority of the new disorder predictors. We find that the 13 currently available DNN-based predictors are diverse in their topologies, sizes of their networks and the inputs that they utilize. We empirically show that the deep learners are statistically more accurate than other types of disorder predictors using the blind test dataset from the recent community assessment of intrinsic disorder predictions (CAID). We also identify several well-rounded DNN-based predictors that are accurate, fast and/or conveniently available. The popularity, favorable predictive performance and architectural flexibility suggest that deep networks are likely to fuel the development of future disordered predictors. Novel hybrid designs of deep networks could be used to adequately accommodate for diversity of types and flavors of intrinsic disorder. We also discuss scarcity of the DNN-based methods for the prediction of disordered binding regions and the need to develop more accurate methods for this prediction.
Collapse
Affiliation(s)
- Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
20
|
Kurgan L. Resources for computational prediction of intrinsic disorder in proteins. Methods 2022; 204:132-141. [DOI: 10.1016/j.ymeth.2022.03.018] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2022] [Revised: 03/25/2022] [Accepted: 03/29/2022] [Indexed: 12/26/2022] Open
|
21
|
Bondos SE, Dunker AK, Uversky VN. Intrinsically disordered proteins play diverse roles in cell signaling. Cell Commun Signal 2022; 20:20. [PMID: 35177069 PMCID: PMC8851865 DOI: 10.1186/s12964-022-00821-7] [Citation(s) in RCA: 83] [Impact Index Per Article: 27.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2021] [Accepted: 12/11/2021] [Indexed: 11/29/2022] Open
Abstract
Signaling pathways allow cells to detect and respond to a wide variety of chemical (e.g. Ca2+ or chemokine proteins) and physical stimuli (e.g., sheer stress, light). Together, these pathways form an extensive communication network that regulates basic cell activities and coordinates the function of multiple cells or tissues. The process of cell signaling imposes many demands on the proteins that comprise these pathways, including the abilities to form active and inactive states, and to engage in multiple protein interactions. Furthermore, successful signaling often requires amplifying the signal, regulating or tuning the response to the signal, combining information sourced from multiple pathways, all while ensuring fidelity of the process. This sensitivity, adaptability, and tunability are possible, in part, due to the inclusion of intrinsically disordered regions in many proteins involved in cell signaling. The goal of this collection is to highlight the many roles of intrinsic disorder in cell signaling. Following an overview of resources that can be used to study intrinsically disordered proteins, this review highlights the critical role of intrinsically disordered proteins for signaling in widely diverse organisms (animals, plants, bacteria, fungi), in every category of cell signaling pathway (autocrine, juxtacrine, intracrine, paracrine, and endocrine) and at each stage (ligand, receptor, transducer, effector, terminator) in the cell signaling process. Thus, a cell signaling pathway cannot be fully described without understanding how intrinsically disordered protein regions contribute to its function. The ubiquitous presence of intrinsic disorder in different stages of diverse cell signaling pathways suggest that more mechanisms by which disorder modulates intra- and inter-cell signals remain to be discovered.
Collapse
Affiliation(s)
- Sarah E. Bondos
- Department of Molecular and Cellular Medicine, Texas A&M Health Science Center, College Station, TX 77843 USA
| | - A. Keith Dunker
- Center for Computational Biology and Bioinformatics, Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202 USA
| | - Vladimir N. Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer’s Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL 33612 USA
- Institute for Biological Instrumentation of the Russian Academy of Sciences, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, Pushchino, Moscow Region, Russia 142290
| |
Collapse
|
22
|
Capturing a Crucial ‘Disorder-to-Order Transition’ at the Heart of the Coronavirus Molecular Pathology—Triggered by Highly Persistent, Interchangeable Salt-Bridges. Vaccines (Basel) 2022; 10:vaccines10020301. [PMID: 35214759 PMCID: PMC8875383 DOI: 10.3390/vaccines10020301] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2021] [Revised: 01/27/2022] [Accepted: 02/05/2022] [Indexed: 02/05/2023] Open
Abstract
The COVID-19 origin debate has greatly been influenced by genome comparison studies of late, revealing the emergence of the Furin-like cleavage site at the S1/S2 junction of the SARS-CoV-2 Spike (FLCSSpike) containing its 681PRRAR685 motif, absent in other related respiratory viruses. Being the rate-limiting (i.e., the slowest) step, the host Furin cleavage is instrumental in the abrupt increase in transmissibility in COVID-19, compared to earlier onsets of respiratory viral diseases. In such a context, the current paper entraps a ‘disorder-to-order transition’ of the FLCSSpike (concomitant to an entropy arrest) upon binding to Furin. The interaction clearly seems to be optimized for a more efficient proteolytic cleavage in SARS-CoV-2. The study further shows the formation of dynamically interchangeable and persistent networks of salt-bridges at the Spike–Furin interface in SARS-CoV-2 involving the three arginines (R682, R683, R685) of the FLCSSpike with several anionic residues (E230, E236, D259, D264, D306) coming from Furin, strategically distributed around its catalytic triad. Multiplicity and structural degeneracy of plausible salt-bridge network archetypes seem to be the other key characteristic features of the Spike–Furin binding in SARS-CoV-2, allowing the system to breathe—a trademark of protein disorder transitions. Interestingly, with respect to the homologous interaction in SARS-CoV (2002/2003) taken as a baseline, the Spike–Furin binding events, generally, in the coronavirus lineage, seems to have preference for ionic bond formation, even with a lesser number of cationic residues at their potentially polybasic FLCSSpike patches. The interaction energies are suggestive of characteristic metastabilities attributed to Spike–Furin interactions, generally to the coronavirus lineage, which appears to be favorable for proteolytic cleavages targeted at flexible protein loops. The current findings not only offer novel mechanistic insights into the coronavirus molecular pathology and evolution, but also add substantially to the existing theories of proteolytic cleavages.
Collapse
|
23
|
Abstract
INTRODUCTION Intrinsic disorder prediction field develops, assesses, and deploys computational predictors of disorder in protein sequences and constructs and disseminates databases of these predictions. Over 40 years of research resulted in the release of numerous resources. AREAS COVERED We identify and briefly summarize the most comprehensive to date collection of over 100 disorder predictors. We focus on their predictive models, availability and predictive performance. We categorize and study them from a historical point of view to highlight informative trends. EXPERT OPINION We find a consistent trend of improvements in predictive quality as newer and more advanced predictors are developed. The original focus on machine learning methods has shifted to meta-predictors in early 2010s, followed by a recent transition to deep learning. The use of deep learners will continue in foreseeable future given recent and convincing success of these methods. Moreover, a broad range of resources that facilitate convenient collection of accurate disorder predictions is available to users. They include web servers and standalone programs for disorder prediction, servers that combine prediction of disorder and disorder functions, and large databases of pre-computed predictions. We also point to the need to address the shortage of accurate methods that predict disordered binding regions.
Collapse
Affiliation(s)
- Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA
| |
Collapse
|
24
|
Katuwawala A, Zhao B, Kurgan L. DisoLipPred: accurate prediction of disordered lipid-binding residues in protein sequences with deep recurrent networks and transfer learning. Bioinformatics 2021; 38:115-124. [PMID: 34487138 DOI: 10.1093/bioinformatics/btab640] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2021] [Revised: 08/05/2021] [Accepted: 09/02/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Intrinsically disordered protein regions interact with proteins, nucleic acids and lipids. Regions that bind lipids are implicated in a wide spectrum of cellular functions and several human diseases. Motivated by the growing amount of experimental data for these interactions and lack of tools that can predict them from the protein sequence, we develop DisoLipPred, the first predictor of the disordered lipid-binding residues (DLBRs). RESULTS DisoLipPred relies on a deep bidirectional recurrent network that implements three innovative features: transfer learning, bypass module that sidesteps predictions for putative structured residues, and expanded inputs that cover physiochemical properties associated with the protein-lipid interactions. Ablation analysis shows that these features drive predictive quality of DisoLipPred. Tests on an independent test dataset and the yeast proteome reveal that DisoLipPred generates accurate results and that none of the related existing tools can be used to indirectly identify DLBR. We also show that DisoLipPred's predictions complement the results generated by predictors of the transmembrane regions. Altogether, we conclude that DisoLipPred provides high-quality predictions of DLBRs that complement the currently available methods. AVAILABILITY AND IMPLEMENTATION DisoLipPred's webserver is available at http://biomine.cs.vcu.edu/servers/DisoLipPred/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Akila Katuwawala
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
25
|
Emenecker RJ, Griffith D, Holehouse AS. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys J 2021; 120:4312-4319. [PMID: 34480923 PMCID: PMC8553642 DOI: 10.1016/j.bpj.2021.08.039] [Citation(s) in RCA: 125] [Impact Index Per Article: 31.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2021] [Revised: 08/08/2021] [Accepted: 08/30/2021] [Indexed: 01/02/2023] Open
Abstract
Intrinsically disordered proteins and protein regions make up a substantial fraction of many proteomes in which they play a wide variety of essential roles. A critical first step in understanding the role of disordered protein regions in biological function is to identify those disordered regions correctly. Computational methods for disorder prediction have emerged as a core set of tools to guide experiments, interpret results, and develop hypotheses. Given the multiple different predictors available, consensus scores have emerged as a popular approach to mitigate biases or limitations of any single method. Consensus scores integrate the outcome of multiple independent disorder predictors and provide a per-residue value that reflects the number of tools that predict a residue to be disordered. Although consensus scores help mitigate the inherent problems of using any single disorder predictor, they are computationally expensive to generate. They also necessitate the installation of multiple different software tools, which can be prohibitively difficult. To address this challenge, we developed a deep-learning-based predictor of consensus disorder scores. Our predictor, metapredict, utilizes a bidirectional recurrent neural network trained on the consensus disorder scores from 12 proteomes. By benchmarking metapredict using two orthogonal approaches, we found that metapredict is among the most accurate disorder predictors currently available. Metapredict is also remarkably fast, enabling proteome-scale disorder prediction in minutes. Importantly, metapredict is a fully open source and is distributed as a Python package, a collection of command-line tools, and a web server, maximizing the potential practical utility of the predictor. We believe metapredict offers a convenient, accessible, accurate, and high-performance predictor for single-proteins and proteomes alike.
Collapse
Affiliation(s)
- Ryan J Emenecker
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri; Center for Engineering Mechanobiology, Washington University, St. Louis, Missouri
| | - Daniel Griffith
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri
| | - Alex S Holehouse
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri.
| |
Collapse
|
26
|
Emenecker RJ, Griffith D, Holehouse AS. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys J 2021; 120:4312-4319. [PMID: 34480923 DOI: 10.1101/2021.05.30.446349] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2021] [Revised: 08/08/2021] [Accepted: 08/30/2021] [Indexed: 05/28/2023] Open
Abstract
Intrinsically disordered proteins and protein regions make up a substantial fraction of many proteomes in which they play a wide variety of essential roles. A critical first step in understanding the role of disordered protein regions in biological function is to identify those disordered regions correctly. Computational methods for disorder prediction have emerged as a core set of tools to guide experiments, interpret results, and develop hypotheses. Given the multiple different predictors available, consensus scores have emerged as a popular approach to mitigate biases or limitations of any single method. Consensus scores integrate the outcome of multiple independent disorder predictors and provide a per-residue value that reflects the number of tools that predict a residue to be disordered. Although consensus scores help mitigate the inherent problems of using any single disorder predictor, they are computationally expensive to generate. They also necessitate the installation of multiple different software tools, which can be prohibitively difficult. To address this challenge, we developed a deep-learning-based predictor of consensus disorder scores. Our predictor, metapredict, utilizes a bidirectional recurrent neural network trained on the consensus disorder scores from 12 proteomes. By benchmarking metapredict using two orthogonal approaches, we found that metapredict is among the most accurate disorder predictors currently available. Metapredict is also remarkably fast, enabling proteome-scale disorder prediction in minutes. Importantly, metapredict is a fully open source and is distributed as a Python package, a collection of command-line tools, and a web server, maximizing the potential practical utility of the predictor. We believe metapredict offers a convenient, accessible, accurate, and high-performance predictor for single-proteins and proteomes alike.
Collapse
Affiliation(s)
- Ryan J Emenecker
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri; Center for Engineering Mechanobiology, Washington University, St. Louis, Missouri
| | - Daniel Griffith
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri
| | - Alex S Holehouse
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri.
| |
Collapse
|
27
|
Zhao B, Katuwawala A, Oldfield CJ, Hu G, Wu Z, Uversky VN, Kurgan L. Intrinsic Disorder in Human RNA-Binding Proteins. J Mol Biol 2021; 433:167229. [PMID: 34487791 DOI: 10.1016/j.jmb.2021.167229] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2021] [Revised: 08/30/2021] [Accepted: 08/31/2021] [Indexed: 12/24/2022]
Abstract
Although RNA-binding proteins (RBPs) are known to be enriched in intrinsic disorder, no previous analysis focused on RBPs interacting with specific RNA types. We fill this gap with a comprehensive analysis of the putative disorder in RBPs binding to six common RNA types: messenger RNA (mRNA), transfer RNA (tRNA), small nuclear RNA (snRNA), non-coding RNA (ncRNA), ribosomal RNA (rRNA), and internal ribosome RNA (irRNA). We also analyze the amount of putative intrinsic disorder in the RNA-binding domains (RBDs) and non-RNA-binding-domain regions (non-RBD regions). Consistent with previous studies, we show that in comparison with human proteome, RBPs are significantly enriched in disorder. However, closer examination finds significant enrichment in predicted disorder for the mRNA-, rRNA- and snRNA-binding proteins, while the proteins that interact with ncRNA and irRNA are not enriched in disorder, and the tRNA-binding proteins are significantly depleted in disorder. We show a consistent pattern of significant disorder enrichment in the non-RBD regions coupled with low levels of disorder in RBDs, which suggests that disorder is relatively rarely utilized in the RNA-binding regions. Our analysis of the non-RBD regions suggests that disorder harbors posttranslational modification sites and is involved in the putative interactions with DNA. Importantly, we utilize experimental data from DisProt and independent data from Pfam to validate the above observations that rely on the disorder predictions. This study provides new insights into the distribution of disorder across proteins that bind different RNA types and the functional role of disorder in the regions where it is enriched.
Collapse
Affiliation(s)
- Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Akila Katuwawala
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Christopher J Oldfield
- Department of Microbiology and Immunology, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Gang Hu
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
| | - Zhonghua Wu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
| | - Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA.
| |
Collapse
|
28
|
Positive selection and intrinsic disorder are associated with multifunctional C4(AC4) proteins and geminivirus diversification. Sci Rep 2021; 11:11150. [PMID: 34045539 PMCID: PMC8160170 DOI: 10.1038/s41598-021-90557-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2020] [Accepted: 05/13/2021] [Indexed: 02/06/2023] Open
Abstract
Viruses within the Geminiviridae family cause extensive agricultural losses. Members of four genera of geminiviruses contain a C4 gene (AC4 in geminiviruses with bipartite genomes). C4(AC4) genes are entirely overprinted on the C1(AC1) genes, which encode the replication-associated proteins. The C4(AC4) proteins exhibit diverse functions that may be important for geminivirus diversification. In this study, the influence of natural selection on the evolutionary diversity of 211 C4(AC4) genes relative to the C1(AC1) sequences they overlap was determined from isolates of the Begomovirus and Curtovirus genera. The ratio of nonsynonymous (dN) to synonymous (dS) nucleotide substitutions indicated that C4(AC4) genes are under positive selection, while the overlapped C1(AC1) sequences are under purifying selection. Ninety-one of 200 Begomovirus C4(AC4) genes encode elongated proteins with the extended regions being under neutral selection. C4(AC4) genes from begomoviruses isolated from tomato from native versus exotic regions were under similar levels of positive selection. Analysis of protein structure suggests that C4(AC4) proteins are entirely intrinsically disordered. Our data suggest that non-synonymous mutations and mutations that increase the length of C4(AC4) drive protein diversity that is intrinsically disordered, which could explain C4/AC4 functional variation and contribute to both geminivirus diversification and host jumping.
Collapse
|
29
|
Katuwawala A, Ghadermarzi S, Hu G, Wu Z, Kurgan L. QUARTERplus: Accurate disorder predictions integrated with interpretable residue-level quality assessment scores. Comput Struct Biotechnol J 2021; 19:2597-2606. [PMID: 34025946 PMCID: PMC8122155 DOI: 10.1016/j.csbj.2021.04.066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2021] [Revised: 04/24/2021] [Accepted: 04/24/2021] [Indexed: 12/13/2022] Open
Abstract
A recent advance in the disorder prediction field is the development of the quality assessment (QA) scores. QA scores complement the propensities produced by the disorder predictors by identifying regions where these predictions are more likely to be correct. We develop, empirically test and release a new QA tool, QUARTERplus, that addresses several key drawbacks of the current QA method, QUARTER. QUARTERplus is the first solution that utilizes QA scores and the associated input disorder predictions to produce very accurate disorder predictions with the help of a modern deep learning meta-model. The deep neural network utilizes the QA scores to identify and fix the regions where the original/input disorder predictions are poor. More importantly, the accurate QUATERplus's predictions are accompanied by easy to interpret residue-level QA scores that reliably quantify their residue-level predictive quality. We provide these interpretable QA scores for QUARTERplus and 10 other popular disorder predictors. Empirical tests on a large and independent (low similarity) test dataset show that QUARTERplus predictions secure AUC = 0.93 and are statistically more accurate than the results of twelve state-of-the-art disorder predictors. We also demonstrate that the new QA scores produced by QUARTERplus are highly correlated with the actual predictive quality and that they can be effectively used to identify regions of correct disorder predictions. This feature empowers the users to easily identify which parts of the predictions generated by the modern disorder predictors are more trustworthy. QUARTERplus is available as a convenient webserver at http://biomine.cs.vcu.edu/servers/QUARTERplus/.
Collapse
Affiliation(s)
- Akila Katuwawala
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Sina Ghadermarzi
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Gang Hu
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
| | - Zhonghua Wu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
30
|
Peng Z, Xing Q, Kurgan L. APOD: accurate sequence-based predictor of disordered flexible linkers. Bioinformatics 2021; 36:i754-i761. [PMID: 33381830 PMCID: PMC7773485 DOI: 10.1093/bioinformatics/btaa808] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/07/2020] [Indexed: 12/21/2022] Open
Abstract
Motivation Disordered flexible linkers (DFLs) are abundant and functionally important intrinsically disordered regions that connect protein domains and structural elements within domains and which facilitate disorder-based allosteric regulation. Although computational estimates suggest that thousands of proteins have DFLs, they were annotated experimentally in <200 proteins. This substantial annotation gap can be reduced with the help of accurate computational predictors. The sole predictor of DFLs, DFLpred, trade-off accuracy for shorter runtime by excluding relevant but computationally costly predictive inputs. Moreover, it relies on the local/window-based information while lacking to consider useful protein-level characteristics. Results We conceptualize, design and test APOD (Accurate Predictor Of DFLs), the first highly accurate predictor that utilizes both local- and protein-level inputs that quantify propensity for disorder, sequence composition, sequence conservation and selected putative structural properties. Consequently, APOD offers significantly more accurate predictions when compared with its faster predecessor, DFLpred, and several other alternative ways to predict DFLs. These improvements stem from the use of a more comprehensive set of inputs that cover the protein-level information and the application of a more sophisticated predictive model, a well-parametrized support vector machine. APOD achieves area under the curve = 0.82 (28% improvement over DFLpred) and Matthews correlation coefficient = 0.42 (180% increase over DFLpred) when tested on an independent/low-similarity test dataset. Consequently, APOD is a suitable choice for accurate and small-scale prediction of DFLs. Availability and implementation https://yanglab.nankai.edu.cn/APOD/.
Collapse
Affiliation(s)
- Zhenling Peng
- Center for Applied Mathematics, Tianjin University, Tianjin 300072, China.,School of Statistics and Data Science, Nankai University, Tianjin 300074, China
| | - Qian Xing
- Center for Applied Mathematics, Tianjin University, Tianjin 300072, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
31
|
Zhao B, Katuwawala A, Uversky VN, Kurgan L. IDPology of the living cell: intrinsic disorder in the subcellular compartments of the human cell. Cell Mol Life Sci 2021; 78:2371-2385. [PMID: 32997198 PMCID: PMC11071772 DOI: 10.1007/s00018-020-03654-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2020] [Revised: 09/09/2020] [Accepted: 09/22/2020] [Indexed: 12/11/2022]
Abstract
Intrinsic disorder can be found in all proteomes of all kingdoms of life and in viruses, being particularly prevalent in the eukaryotes. We conduct a comprehensive analysis of the intrinsic disorder in the human proteins while mapping them into 24 compartments of the human cell. In agreement with previous studies, we show that human proteins are significantly enriched in disorder relative to a generic protein set that represents the protein universe. In fact, the fraction of proteins with long disordered regions and the average protein-level disorder content in the human proteome are about 3 times higher than in the protein universe. Furthermore, levels of intrinsic disorder in the majority of human subcellular compartments significantly exceed the average disorder content in the protein universe. Relative to the overall amount of disorder in the human proteome, proteins localized in the nucleus and cytoskeleton have significantly increased amounts of disorder, measured by both high disorder content and presence of multiple long intrinsically disordered regions. We empirically demonstrate that, on average, human proteins are assigned to 2.3 subcellular compartments, with proteins localized to few subcellular compartments being more disordered than the proteins that are localized to many compartments. Functionally, the disordered proteins localized in the most disorder-enriched subcellular compartments are primarily responsible for interactions with nucleic acids and protein partners. This is the first-time disorder is comprehensively mapped into the human cell. Our observations add a missing piece to the puzzle of functional disorder and its organization inside the cell.
Collapse
Affiliation(s)
- Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Room E4225, Richmond, VA, 23284, USA
| | - Akila Katuwawala
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Room E4225, Richmond, VA, 23284, USA
| | - Vladimir N Uversky
- Department of Molecular Medicine, USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, 33612, USA.
- Laboratory of New Methods in Biology, Institute for Biological Instrumentation of the Russian Academy of Sciences, Federal Research Center "Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences", Pushchino, Russia.
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Room E4225, Richmond, VA, 23284, USA.
| |
Collapse
|
32
|
Abstract
Many virus-encoded proteins have intrinsically disordered regions that lack a stable, folded three-dimensional structure. These disordered proteins often play important functional roles in virus replication, such as down-regulating host defense mechanisms. With the widespread availability of next-generation sequencing, the number of new virus genomes with predicted open reading frames is rapidly outpacing our capacity for directly characterizing protein structures through crystallography. Hence, computational methods for structural prediction play an important role. A large number of predictors focus on the problem of classifying residues into ordered and disordered regions, and these methods tend to be validated on a diverse training set of proteins from eukaryotes, prokaryotes, and viruses. In this study, we investigate whether some predictors outperform others in the context of virus proteins and compared our findings with data from non-viral proteins. We evaluate the prediction accuracy of 21 methods, many of which are only available as web applications, on a curated set of 126 proteins encoded by viruses. Furthermore, we apply a random forest classifier to these predictor outputs. Based on cross-validation experiments, this ensemble approach confers a substantial improvement in accuracy, e.g., a mean 36 per cent gain in Matthews correlation coefficient. Lastly, we apply the random forest predictor to severe acute respiratory syndrome coronavirus 2 ORF6, an accessory gene that encodes a short (61 AA) and moderately disordered protein that inhibits the host innate immune response. We show that disorder prediction methods perform differently for viral and non-viral proteins, and that an ensemble approach can yield more robust and accurate predictions.
Collapse
Affiliation(s)
- Gal Almog
- Department of Pathology & Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044 London, Ontario, Canada, N6A 5C1
| | - Abayomi S Olabode
- Department of Pathology & Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044 London, Ontario, Canada, N6A 5C1
| | - Art F Y Poon
- Department of Pathology & Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044 London, Ontario, Canada, N6A 5C1.,Department of Applied Mathematics, Western University, Middlesex College Room 255, 1151 Richmond Street London, Ontario, Canada, N6A 5B7.,Department of Microbiology & Immunology, Western University, 1151 Richmond Street London, Ontario, Canada, N6A 3K
| |
Collapse
|
33
|
Zhao B, Katuwawala A, Oldfield CJ, Dunker AK, Faraggi E, Gsponer J, Kloczkowski A, Malhis N, Mirdita M, Obradovic Z, Söding J, Steinegger M, Zhou Y, Kurgan L. DescribePROT: database of amino acid-level protein structure and function predictions. Nucleic Acids Res 2021; 49:D298-D308. [PMID: 33119734 PMCID: PMC7778963 DOI: 10.1093/nar/gkaa931] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2020] [Revised: 09/11/2020] [Accepted: 10/05/2020] [Indexed: 12/30/2022] Open
Abstract
We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accesses via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can be also downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included by DescriPROT are useful for a broad range of studies, including: investigations of protein function, applied projects focusing on therapeutics and diseases, and in the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.
Collapse
Affiliation(s)
- Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Akila Katuwawala
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | | | - A Keith Dunker
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Eshel Faraggi
- Battelle Center for Mathematical Medicine at the Nationwide Children's Hospital, and Department of Pediatrics, The Ohio State University, Columbus, OH, USA
| | - Jörg Gsponer
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - Andrzej Kloczkowski
- Battelle Center for Mathematical Medicine at the Nationwide Children's Hospital, and Department of Pediatrics, The Ohio State University, Columbus, OH, USA
| | - Nawar Malhis
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - Milot Mirdita
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Zoran Obradovic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Martin Steinegger
- School of Biological Sciences and Institute of Molecular Biology & Genetics, Seoul National University, Seoul, Republic of Korea
| | - Yaoqi Zhou
- Institute for Glycomics, Griffith University, Gold Coast, Queensland, Australia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| |
Collapse
|
34
|
Zhang F, Shi W, Zhang J, Zeng M, Li M, Kurgan L. PROBselect: accurate prediction of protein-binding residues from proteins sequences via dynamic predictor selection. Bioinformatics 2020; 36:i735-i744. [DOI: 10.1093/bioinformatics/btaa806] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/07/2020] [Indexed: 12/13/2022] Open
Abstract
Abstract
Motivation
Knowledge of protein-binding residues (PBRs) improves our understanding of protein−protein interactions, contributes to the prediction of protein functions and facilitates protein−protein docking calculations. While many sequence-based predictors of PBRs were published, they offer modest levels of predictive performance and most of them cross-predict residues that interact with other partners. One unexplored option to improve the predictive quality is to design consensus predictors that combine results produced by multiple methods.
Results
We empirically investigate predictive performance of a representative set of nine predictors of PBRs. We report substantial differences in predictive quality when these methods are used to predict individual proteins, which contrast with the dataset-level benchmarks that are currently used to assess and compare these methods. Our analysis provides new insights for the cross-prediction concern, dissects complementarity between predictors and demonstrates that predictive performance of the top methods depends on unique characteristics of the input protein sequence. Using these insights, we developed PROBselect, first-of-its-kind consensus predictor of PBRs. Our design is based on the dynamic predictor selection at the protein level, where the selection relies on regression-based models that accurately estimate predictive performance of selected predictors directly from the sequence. Empirical assessment using a low-similarity test dataset shows that PROBselect provides significantly improved predictive quality when compared with the current predictors and conventional consensuses that combine residue-level predictions. Moreover, PROBselect informs the users about the expected predictive quality for the prediction generated from a given input protein.
Availability and implementation
PROBselect is available at http://bioinformatics.csu.edu.cn/PROBselect/home/index.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Fuhao Zhang
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Wenbo Shi
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Min Zeng
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Li
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
35
|
Katuwawala A, Kurgan L. Comparative Assessment of Intrinsic Disorder Predictions with a Focus on Protein and Nucleic Acid-Binding Proteins. Biomolecules 2020; 10:E1636. [PMID: 33291838 PMCID: PMC7762010 DOI: 10.3390/biom10121636] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2020] [Revised: 11/26/2020] [Accepted: 12/03/2020] [Indexed: 01/18/2023] Open
Abstract
With over 60 disorder predictors, users need help navigating the predictor selection task. We review 28 surveys of disorder predictors, showing that only 11 include assessment of predictive performance. We identify and address a few drawbacks of these past surveys. To this end, we release a novel benchmark dataset with reduced similarity to the training sets of the considered predictors. We use this dataset to perform a first-of-its-kind comparative analysis that targets two large functional families of disordered proteins that interact with proteins and with nucleic acids. We show that limiting sequence similarity between the benchmark and the training datasets has a substantial impact on predictive performance. We also demonstrate that predictive quality is sensitive to the use of the well-annotated order and inclusion of the fully structured proteins in the benchmark datasets, both of which should be considered in future assessments. We identify three predictors that provide favorable results using the new benchmark set. While we find that VSL2B offers the most accurate and robust results overall, ESpritz-DisProt and SPOT-Disorder perform particularly well for disordered proteins. Moreover, we find that predictions for the disordered protein-binding proteins suffer low predictive quality compared to generic disordered proteins and the disordered nucleic acids-binding proteins. This can be explained by the high disorder content of the disordered protein-binding proteins, which makes it difficult for the current methods to accurately identify ordered regions in these proteins. This finding motivates the development of a new generation of methods that would target these difficult-to-predict disordered proteins. We also discuss resources that support users in collecting and identifying high-quality disorder predictions.
Collapse
Affiliation(s)
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA;
| |
Collapse
|
36
|
Fan BL, Jiang Z, Sun J, Liu R. Systematic characterization and prediction of coenzyme A-associated proteins using sequence and network information. Brief Bioinform 2020; 22:6012866. [PMID: 33253385 DOI: 10.1093/bib/bbaa308] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2020] [Revised: 09/08/2020] [Accepted: 10/12/2020] [Indexed: 01/11/2023] Open
Abstract
Coenzyme A-associated proteins (CAPs) are a category of functionally important proteins involved in multiple biological processes through interactions with coenzyme A (CoA). To date, unfortunately, the specific differences between CAPs and other proteins have yet to be systemically investigated. Moreover, there are no computational methods that can be used specifically to predict these proteins. Herein, we characterized CAPs from multifaceted viewpoints and revealed their specific preferences. Compared with other proteins, CAPs were more likely to possess binding regions for CoA and its derivatives, were evolutionarily highly conserved, exhibited ordered and hydrophobic structural conformations, and tended to be densely located in protein-protein interaction networks. Based on these biological insights, we built seven classifiers using predicted CoA-binding residue distributions, word embedding vectors, remote homolog numbers, evolutionary conservation, amino acid composition, predicted structural features and network properties. These classifiers could effectively identify CAPs in Homo sapiens, Mus musculus and Arabidopsis thaliana. The complementarity among the individual classifiers prompted us to build a two-layer stacking model named CAPE for improving prediction performance. We applied CAPE to identify some high-confidence candidates in the three species, which were tightly associated with the known functions of CAPs. Finally, we extended our algorithm to cross-species prediction, thereby developing a generic CAP prediction model. In summary, this work provides a comprehensive survey and an effective predictor for CAPs, which can help uncover the interplay between CoA and functionally relevant proteins.
Collapse
Affiliation(s)
- Bing-Liang Fan
- College of Informatics, Huazhong Agricultural University
| | - Zheng Jiang
- College of Informatics, Huazhong Agricultural University
| | - Jun Sun
- College of Informatics, Huazhong Agricultural University
| | - Rong Liu
- College of Informatics, Huazhong Agricultural University
| |
Collapse
|
37
|
The Anti-Inflammatory Protein TNIP1 Is Intrinsically Disordered with Structural Flexibility Contributed by Its AHD1-UBAN Domain. Biomolecules 2020; 10:biom10111531. [PMID: 33182596 PMCID: PMC7697625 DOI: 10.3390/biom10111531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2020] [Revised: 11/04/2020] [Accepted: 11/05/2020] [Indexed: 01/02/2023] Open
Abstract
TNFAIP3 interacting protein 1 (TNIP1) interacts with numerous non-related cellular, viral, and bacterial proteins. TNIP1 is also linked with multiple chronic inflammatory disorders on the gene and protein levels, through numerous single-nucleotide polymorphisms and reduced protein amounts. Despite the importance of TNIP1 function, there is limited investigation as to how its conformation may impact its apparent multiple roles. Hub proteins like TNIP1 are often intrinsically disordered proteins. Our initial in silico assessments suggested TNIP1 is natively unstructured, featuring numerous potentials intrinsically disordered regions, including the ABIN homology domain 1-ubiquitin binding domain in ABIN proteins and NEMO (AHD1-UBAN) domain associated with its anti-inflammatory function. Using multiple biophysical approaches, we demonstrate the structural flexibility of full-length TNIP1 and the AHD1-UBAN domain. We present evidence the AHD1-UBAN domain exists primarily as a pre-molten globule with limited secondary structure in solution. Data presented here suggest the previously described coiled-coil conformation of the crystallized UBAN-only region may represent just one of possibly multiple states for the AHD1-UBAN domain in solution. These data also characterize the AHD1-UBAN domain in solution as mostly monomeric with potential to undergo oligomerization under specific environmental conditions (e.g., binding partner availability, pH-dependence). This proposed intrinsic disorder across TNIP1 and within the AHD1-UBAN region is likely to impact TNIP1 function and interaction with its multiple partners.
Collapse
|
38
|
Wang K, Hu G, Wu Z, Su H, Yang J, Kurgan L. Comprehensive Survey and Comparative Assessment of RNA-Binding Residue Predictions with Analysis by RNA Type. Int J Mol Sci 2020; 21:E6879. [PMID: 32961749 PMCID: PMC7554811 DOI: 10.3390/ijms21186879] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2020] [Revised: 09/15/2020] [Accepted: 09/17/2020] [Indexed: 02/07/2023] Open
Abstract
With close to 30 sequence-based predictors of RNA-binding residues (RBRs), this comparative survey aims to help with understanding and selection of the appropriate tools. We discuss past reviews on this topic, survey a comprehensive collection of predictors, and comparatively assess six representative methods. We provide a novel and well-designed benchmark dataset and we are the first to report and compare protein-level and datasets-level results, and to contextualize performance to specific types of RNAs. The methods considered here are well-cited and rely on machine learning algorithms on occasion combined with homology-based prediction. Empirical tests reveal that they provide relatively accurate predictions. Virtually all methods perform well for the proteins that interact with rRNAs, some generate accurate predictions for mRNAs, snRNA, SRP and IRES, while proteins that bind tRNAs are predicted poorly. Moreover, except for DRNApred, they confuse DNA and RNA-binding residues. None of the six methods consistently outperforms the others when tested on individual proteins. This variable and complementary protein-level performance suggests that users should not rely on applying just the single best dataset-level predictor. We recommend that future work should focus on the development of approaches that facilitate protein-level selection of accurate predictors and the consensus-based prediction of RBRs.
Collapse
Affiliation(s)
- Kui Wang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Gang Hu
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China;
| | - Zhonghua Wu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Hong Su
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Jianyi Yang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
39
|
Katuwawala A, Oldfield CJ, Kurgan L. DISOselect: Disorder predictor selection at the protein level. Protein Sci 2020; 29:184-200. [PMID: 31642118 PMCID: PMC6933862 DOI: 10.1002/pro.3756] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2019] [Revised: 10/16/2019] [Accepted: 10/17/2019] [Indexed: 12/27/2022]
Abstract
The intense interest in the intrinsically disordered proteins in the life science community, together with the remarkable advancements in predictive technologies, have given rise to the development of a large number of computational predictors of intrinsic disorder from protein sequence. While the growing number of predictors is a positive trend, we have observed a considerable difference in predictive quality among predictors for individual proteins. Furthermore, variable predictor performance is often inconsistent between predictors for different proteins, and the predictor that shows the best predictive performance depends on the unique properties of each protein sequence. We propose a computational approach, DISOselect, to estimate the predictive performance of 12 selected predictors for individual proteins based on their unique sequence-derived properties. This estimation informs the users about the expected predictive quality for a selected disorder predictor and can be used to recommend methods that are likely to provide the best quality predictions. Our solution does not depend on the results of any disorder predictor; the estimations are made based solely on the protein sequence. Our solution significantly improves predictive performance, as judged with a test set of 1,000 proteins, when compared to other alternatives. We have empirically shown that by using the recommended methods the overall predictive performance for a given set of proteins can be improved by a statistically significant margin. DISOselect is freely available for non-commercial users through the webserver at http://biomine.cs.vcu.edu/servers/DISOselect/.
Collapse
Affiliation(s)
- Akila Katuwawala
- Department of Computer ScienceVirginia Commonwealth UniversityRichmondVirginia
| | | | - Lukasz Kurgan
- Department of Computer ScienceVirginia Commonwealth UniversityRichmondVirginia
| |
Collapse
|