1
|
Perin C, Cretin G, Gelly JC. Hierarchical Analysis of Protein Structures: From Secondary Structures to Protein Units and Domains. Methods Mol Biol 2025; 2870:357-370. [PMID: 39543044 DOI: 10.1007/978-1-0716-4213-9_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2024]
Abstract
The three-dimensional structure of proteins is traditionally organized into hierarchical levels, specifically secondary structures and domains. However, different studies suggest the existence of intermediate levels, such as Protein Units (PUs), which provide a refined understanding of protein architecture. PUs, characterized by their compactness and independence, serve as an intermediate organizational level, bridging the gap between secondary structures and domains. This new view not only enhances our comprehension of protein structure, folding, and evolutionary mechanisms but also provides a robust methodology for identifying and categorizing protein domains. Based on the concept of PUs, alternative structural partitioning solutions can be proposed that address the structural ambiguity of proteins, leading to more meaningful domain identification.
Collapse
Affiliation(s)
- Charlotte Perin
- TBI, Université de Toulouse, CNRS, INRAE, INSA, Toulouse, France
| | | | | |
Collapse
|
2
|
Wang L, Zeng Z, Xue Z, Wang Y. DeepNeuropePred: A robust and universal tool to predict cleavage sites from neuropeptide precursors by protein language model. Comput Struct Biotechnol J 2024; 23:309-315. [PMID: 38179071 PMCID: PMC10764246 DOI: 10.1016/j.csbj.2023.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2023] [Revised: 11/30/2023] [Accepted: 12/02/2023] [Indexed: 01/06/2024] Open
Abstract
Neuropeptides play critical roles in many biological processes such as growth, learning, memory, metabolism, and neuronal differentiation. A few approaches have been reported for predicting neuropeptides that are cleaved from precursor protein sequences. However, these models for cleavage site prediction of precursors were developed using a limited number of neuropeptide precursor datasets and simple precursors representation models. In addition, a universal method for predicting neuropeptide cleavage sites that can be applied to all species is still lacking. In this paper, we proposed a novel deep learning method called DeepNeuropePred, using a combination of pre-trained language model and Convolutional Neural Networks for feature extraction and predicting the neuropeptide cleavage sites from precursors. To demonstrate the model's effectiveness and robustness, we evaluated the performance of DeepNeuropePred and four models from the NeuroPred server in the independent dataset and our model achieved the highest AUC score (0.916), which are 6.9%, 7.8%, 8.8%, and 10.9% higher than Mammalian (0.857), insects (0.850), Mollusc (0.842) and Motif (0.826), respectively. For the convenience of researchers, we provide a web server (http://isyslab.info/NeuroPepV2/deepNeuropePred.jsp).
Collapse
Affiliation(s)
- Lei Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Zilu Zeng
- Wuhan Children's Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei 430010, China
| | - Zhidong Xue
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Yan Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| |
Collapse
|
3
|
Iovino BG, Tang H, Ye Y. Protein domain embeddings for fast and accurate similarity search. Genome Res 2024; 34:1434-1444. [PMID: 39237301 PMCID: PMC11529836 DOI: 10.1101/gr.279127.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 09/03/2024] [Indexed: 09/07/2024]
Abstract
Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins with single domains but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in quadratic time to the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show such domain-level contextual vectors (termed as DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark show that the DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
Collapse
Affiliation(s)
- Benjamin Giovanni Iovino
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| | - Haixu Tang
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| | - Yuzhen Ye
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| |
Collapse
|
4
|
Wells J, Hawkins-Hooker A, Bordin N, Sillitoe I, Paige B, Orengo C. Chainsaw: protein domain segmentation with fully convolutional neural networks. Bioinformatics 2024; 40:btae296. [PMID: 38718225 PMCID: PMC11256964 DOI: 10.1093/bioinformatics/btae296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2024] [Revised: 03/23/2024] [Accepted: 05/07/2024] [Indexed: 05/23/2024] Open
Abstract
MOTIVATION Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION github.com/JudeWells/Chainsaw.
Collapse
Affiliation(s)
- Jude Wells
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Alex Hawkins-Hooker
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| | - Brooks Paige
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| |
Collapse
|
5
|
Schaduangrat N, Anuwongcharoen N, Charoenkwan P, Shoombuatong W. DeepAR: a novel deep learning-based hybrid framework for the interpretable prediction of androgen receptor antagonists. J Cheminform 2023; 15:50. [PMID: 37149650 PMCID: PMC10163717 DOI: 10.1186/s13321-023-00721-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2022] [Accepted: 04/08/2023] [Indexed: 05/08/2023] Open
Abstract
Drug resistance represents a major obstacle to therapeutic innovations and is a prevalent feature in prostate cancer (PCa). Androgen receptors (ARs) are the hallmark therapeutic target for prostate cancer modulation and AR antagonists have achieved great success. However, rapid emergence of resistance contributing to PCa progression is the ultimate burden of their long-term usage. Hence, the discovery and development of AR antagonists with capability to combat the resistance, remains an avenue for further exploration. Therefore, this study proposes a novel deep learning (DL)-based hybrid framework, named DeepAR, to accurately and rapidly identify AR antagonists by using only the SMILES notation. Specifically, DeepAR is capable of extracting and learning the key information embedded in AR antagonists. Firstly, we established a benchmark dataset by collecting active and inactive compounds against AR from the ChEMBL database. Based on this dataset, we developed and optimized a collection of baseline models by using a comprehensive set of well-known molecular descriptors and machine learning algorithms. Then, these baseline models were utilized for creating probabilistic features. Finally, these probabilistic features were combined and used for the construction of a meta-model based on a one-dimensional convolutional neural network. Experimental results indicated that DeepAR is a more accurate and stable approach for identifying AR antagonists in terms of the independent test dataset, by achieving an accuracy of 0.911 and MCC of 0.823. In addition, our proposed framework is able to provide feature importance information by leveraging a popular computational approach, named SHapley Additive exPlanations (SHAP). In the meanwhile, the characterization and analysis of potential AR antagonist candidates were achieved through the SHAP waterfall plot and molecular docking. The analysis inferred that N-heterocyclic moieties, halogenated substituents, and a cyano functional group were significant determinants of potential AR antagonists. Lastly, we implemented an online web server by using DeepAR (at http://pmlabstack.pythonanywhere.com/DeepAR ). We anticipate that DeepAR could be a useful computational tool for community-wide facilitation of AR candidates from a large number of uncharacterized compounds.
Collapse
Affiliation(s)
- Nalini Schaduangrat
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Nuttapat Anuwongcharoen
- Department of Community Medical Technology, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand.
| | - Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
| |
Collapse
|
6
|
Yu ZZ, Peng CX, Liu J, Zhang B, Zhou XG, Zhang GJ. DomBpred: Protein Domain Boundary Prediction Based on Domain-Residue Clustering Using Inter-Residue Distance. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:912-922. [PMID: 35594218 DOI: 10.1109/tcbb.2022.3175905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Domain boundary prediction is one of the most important problems in the study of protein structure and function, especially for large proteins. At present, most domain boundary prediction methods have low accuracy and limitations in dealing with multi-domain proteins. In this study, we develop a sequence-based protein domain boundary prediction, named DomBpred. In DomBpred, the input sequence is first classified as either a single-domain protein or a multi-domain protein through a designed effective sequence metric based on a constructed single-domain sequence library. For the multi-domain protein, a domain-residue clustering algorithm inspired by Ising model is proposed to cluster the spatially close residues according inter-residue distance. The unclassified residues and the residues at the edge of the cluster are then tuned by the secondary structure to form potential cut points. Finally, a domain boundary scoring function is proposed to recursively evaluate the potential cut points to generate the domain boundary. DomBpred is tested on a large-scale test set of FUpred comprising 2549 proteins. Experimental results show that DomBpred better performs than the state-of-the-art methods in classifying whether protein sequences are composed by single or multiple domains, and the Matthew's correlation coefficient is 0.882. Moreover, on 849 multi-domain proteins, the domain boundary distance and normalised domain overlap scores of DomBpred are 0.523 and 0.824, respectively, which are 5.0% and 4.2% higher than those of the best comparison method, respectively. Comparison with other methods on the given test set shows that DomBpred outperforms most state-of-the-art sequence-based methods and even achieves better results than the top-level template-based method. The executable program is freely available at https://github.com/iobio-zjut/DomBpred and the online server at http://zhanglab-bioinf.com/DomBpred/.
Collapse
|
7
|
Zhu K, Su H, Peng Z, Yang J. A unified approach to protein domain parsing with inter-residue distance matrix. Bioinformatics 2023; 39:7025502. [PMID: 36734597 PMCID: PMC9919455 DOI: 10.1093/bioinformatics/btad070] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2022] [Revised: 01/02/2023] [Accepted: 02/01/2023] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION It is fundamental to cut multi-domain proteins into individual domains, for precise domain-based structural and functional studies. In the past, sequence-based and structure-based domain parsing was carried out independently with different methodologies. The recent progress in deep learning-based protein structure prediction provides the opportunity to unify sequence-based and structure-based domain parsing. RESULTS Based on the inter-residue distance matrix, which can be either derived from the input structure or predicted by trRosettaX, we can decode the domain boundaries under a unified framework. We name the proposed method UniDoc. The principle of UniDoc is based on the well-accepted physical concept of maximizing intra-domain interaction while minimizing inter-domain interaction. Comprehensive tests on five benchmark datasets indicate that UniDoc outperforms other state-of-the-art methods in terms of both accuracy and speed, for both sequence-based and structure-based domain parsing. The major contribution of UniDoc is providing a unified framework for structure-based and sequence-based domain parsing. We hope that UniDoc would be a convenient tool for protein domain analysis. AVAILABILITY AND IMPLEMENTATION https://yanglab.nankai.edu.cn/UniDoc/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Kun Zhu
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| | - Hong Su
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| | - Zhenling Peng
- Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| | - Jianyi Yang
- Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| |
Collapse
|
8
|
Wang L, Zhong H, Xue Z, Wang Y. Res-Dom: predicting protein domain boundary from sequence using deep residual network and Bi-LSTM. BIOINFORMATICS ADVANCES 2022; 2:vbac060. [PMID: 36699417 PMCID: PMC9710680 DOI: 10.1093/bioadv/vbac060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/22/2022] [Revised: 07/01/2022] [Accepted: 08/30/2022] [Indexed: 01/28/2023]
Abstract
Motivation Protein domains are the basic units of proteins that can fold, function and evolve independently. Protein domain boundary partition plays an important role in protein structure prediction, understanding their biological functions, annotating their evolutionary mechanisms and protein design. Although there are many methods that have been developed to predict domain boundaries from protein sequence over the past two decades, there is still much room for improvement. Results In this article, a novel domain boundary prediction tool called Res-Dom was developed, which is based on a deep residual network, bidirectional long short-term memory (Bi-LSTM) and transfer learning. We used deep residual neural networks to extract higher-order residue-related information. In addition, we also used a pre-trained protein language model called ESM to extract sequence embedded features, which can summarize sequence context information more abundantly. To improve the global representation of these deep residual networks, a Bi-LSTM network was also designed to consider long-range interactions between residues. Res-Dom was then tested on an independent test set including 342 proteins and generated correct single-domain and multi-domain classifications with a Matthew's correlation coefficient of 0.668, which was 17.6% higher than the second-best compared method. For domain boundaries, the normalized domain overlapping score of Res-Dom was 0.849, which was 5% higher than the second-best compared method. Furthermore, Res-Dom required significantly less time than most of the recently developed state-of-the-art domain prediction methods. Availability and implementation All source code, datasets and model are available at http://isyslab.info/Res-Dom/.
Collapse
Affiliation(s)
- Lei Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China.,School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Haolin Zhong
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Zhidong Xue
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China.,School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Yan Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China.,School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| |
Collapse
|
9
|
Shi H, Zhang S, Li X. R5hmCFDV: computational identification of RNA 5-hydroxymethylcytosine based on deep feature fusion and deep voting. Brief Bioinform 2022; 23:6658858. [PMID: 35945157 DOI: 10.1093/bib/bbac341] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2022] [Revised: 07/17/2022] [Accepted: 07/25/2022] [Indexed: 11/13/2022] Open
Abstract
RNA 5-hydroxymethylcytosine (5hmC) is a kind of RNA modification, which is related to the life activities of many organisms. Studying its distribution is very important to reveal its biological function. Previously, high-throughput sequencing was used to identify 5hmC, but it is expensive and inefficient. Therefore, machine learning is used to identify 5hmC sites. Here, we design a model called R5hmCFDV, which is mainly divided into feature representation, feature fusion and classification. (i) Pseudo dinucleotide composition, dinucleotide binary profile and frequency, natural vector and physicochemical property are used to extract features from four aspects: nucleotide composition, coding, natural language and physical and chemical properties. (ii) To strengthen the relevance of features, we construct a novel feature fusion method. Firstly, the attention mechanism is employed to process four single features, stitch them together and feed them to the convolution layer. After that, the output data are processed by BiGRU and BiLSTM, respectively. Finally, the features of these two parts are fused by the multiply function. (iii) We design the deep voting algorithm for classification by imitating the soft voting mechanism in the Python package. The base classifiers contain deep neural network (DNN), convolutional neural network (CNN) and improved gated recurrent unit (GRU). And then using the principle of soft voting, the corresponding weights are assigned to the predicted probabilities of the three classifiers. The predicted probability values are multiplied by the corresponding weights and then summed to obtain the final prediction results. We use 10-fold cross-validation to evaluate the model, and the evaluation indicators are significantly improved. The prediction accuracy of the two datasets is as high as 95.41% and 93.50%, respectively. It demonstrates the stronger competitiveness and generalization performance of our model. In addition, all datasets and source codes can be found at https://github.com/HongyanShi026/R5hmCFDV.
Collapse
Affiliation(s)
- Hongyan Shi
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Xinjie Li
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| |
Collapse
|
10
|
Mahmud S, Guo Z, Quadir F, Liu J, Cheng J. Multi-head attention-based U-Nets for predicting protein domain boundaries using 1D sequence features and 2D distance maps. BMC Bioinformatics 2022; 23:283. [PMID: 35854211 PMCID: PMC9295499 DOI: 10.1186/s12859-022-04829-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2022] [Accepted: 07/08/2022] [Indexed: 01/25/2023] Open
Abstract
The information about the domain architecture of proteins is useful for studying protein structure and function. However, accurate prediction of protein domain boundaries (i.e., sequence regions separating two domains) from sequence remains a significant challenge. In this work, we develop a deep learning method based on multi-head U-Nets (called DistDom) to predict protein domain boundaries utilizing 1D sequence features and predicted 2D inter-residue distance map as input. The 1D features contain the evolutionary and physicochemical information of protein sequences, whereas the 2D distance map includes the structural information of proteins that was rarely used in domain boundary prediction before. The 1D and 2D features are processed by the 1D and 2D U-Nets respectively to generate hidden features. The hidden features are then used by the multi-head attention to predict the probability of each residue of a protein being in a domain boundary, leveraging both local and global information in the features. The residue-level domain boundary predictions can be used to classify proteins as single-domain or multi-domain proteins. It classifies the CASP14 single-domain and multi-domain targets at the accuracy of 75.9%, 13.28% more accurate than the state-of-the-art method. Tested on the CASP14 multi-domain protein targets with expert annotated domain boundaries, the average per-target F1 measure score of the domain boundary prediction by DistDom is 0.263, 29.56% higher than the state-of-the-art method.
Collapse
Affiliation(s)
- Sajid Mahmud
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Zhiye Guo
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Farhan Quadir
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Jian Liu
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Jianlin Cheng
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| |
Collapse
|
11
|
Cretin G, Galochkina T, Vander Meersche Y, de Brevern AG, Postic G, Gelly JC. SWORD2: hierarchical analysis of protein 3D structures. Nucleic Acids Res 2022; 50:W732-W738. [PMID: 35580056 PMCID: PMC9252838 DOI: 10.1093/nar/gkac370] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2022] [Revised: 04/19/2022] [Accepted: 04/29/2022] [Indexed: 11/27/2022] Open
Abstract
Understanding the functions and origins of proteins requires splitting these macromolecules into fragments that could be independent in terms of folding, activity, or evolution. For that purpose, structural domains are the typical level of analysis, but shorter segments, such as subdomains and supersecondary structures, are insightful as well. Here, we propose SWORD2, a web server for exploring how an input protein structure may be decomposed into ‘Protein Units’ that can be hierarchically assembled to delimit structural domains. For each partitioning solution, the relevance of the identified substructures is estimated through different measures. This multilevel analysis is achieved by integrating our previous work on domain delineation, ‘protein peeling’ and model quality assessment. We hope that SWORD2 will be useful to biologists searching for key regions in their proteins of interest and to bioinformaticians building datasets of protein structures. The web server is freely available online: https://www.dsimb.inserm.fr/SWORD2.
Collapse
Affiliation(s)
- Gabriel Cretin
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France.,Laboratoire d'Excellence GR-Ex, 75015 Paris, France
| | - Tatiana Galochkina
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France.,Laboratoire d'Excellence GR-Ex, 75015 Paris, France
| | - Yann Vander Meersche
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France.,Laboratoire d'Excellence GR-Ex, 75015 Paris, France
| | - Alexandre G de Brevern
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France.,Laboratoire d'Excellence GR-Ex, 75015 Paris, France
| | - Guillaume Postic
- Université Paris-Saclay, Univ Evry, IBISC, 91020 Evry-Courcouronnes, France
| | - Jean-Christophe Gelly
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France.,Laboratoire d'Excellence GR-Ex, 75015 Paris, France
| |
Collapse
|
12
|
Yang W, Liu Y, Xiao C. Deep metric learning for accurate protein secondary structure prediction. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
13
|
Charoenkwan P, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics 2021; 37:2556-2562. [PMID: 33638635 DOI: 10.1093/bioinformatics/btab133] [Citation(s) in RCA: 98] [Impact Index Per Article: 24.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2020] [Revised: 02/08/2021] [Accepted: 02/24/2021] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION The identification of bitter peptides through experimental approaches is an expensive and time-consuming endeavor. Due to the huge number of newly available peptide sequences in the post-genomic era, the development of automated computational models for the identification of novel bitter peptides is highly desira-ble. RESULTS In this work, we present BERT4Bitter, a bidirectional encoder representation from transformers (BERT)-based model for predicting bitter peptides directly from their amino acid sequence without using any structural information. To the best of our knowledge, this is the first time a BERT-based model has been employed to identify bitter peptides. Compared to widely used machine learning models, BERT4Bitter achieved the best performance with accuracy of 0.861 and 0.922 for cross-validation and independent tests, respectively. Furthermore, extensive empirical benchmarking experiments on the independent dataset demonstrated that BERT4Bitter clearly outperformed the existing method with improvements of > 8% accuracy and >16% Matthews coefficient correlation, highlighting the effectiveness and robustness of BERT4Bitter. We believe that the BERT4Bitter method proposed herein will be a useful tool for rapidly screening and identifying novel bitter peptides for drug development and nutritional research. AVAILABILITY The user-friendly web server of the proposed BERT4Bitter is freely accessible at: http://pmlab.pythonanywhere.com/BERT4Bitter. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
| | | | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| |
Collapse
|
14
|
Wang Y, Zhang H, Zhong H, Xue Z. Protein domain identification methods and online resources. Comput Struct Biotechnol J 2021; 19:1145-1153. [PMID: 33680357 PMCID: PMC7895673 DOI: 10.1016/j.csbj.2021.01.041] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2020] [Revised: 01/25/2021] [Accepted: 01/26/2021] [Indexed: 01/03/2023] Open
Abstract
Protein domains are the basic units of proteins that can fold, function, and evolve independently. Knowledge of protein domains is critical for protein classification, understanding their biological functions, annotating their evolutionary mechanisms and protein design. Thus, over the past two decades, a number of protein domain identification approaches have been developed, and a variety of protein domain databases have also been constructed. This review divides protein domain prediction methods into two categories, namely sequence-based and structure-based. These methods are introduced in detail, and their advantages and limitations are compared. Furthermore, this review also provides a comprehensive overview of popular online protein domain sequence and structure databases. Finally, we discuss potential improvements of these prediction methods.
Collapse
Affiliation(s)
- Yan Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical College, Yantai, Shandong 264003, China
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Hang Zhang
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Haolin Zhong
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Zhidong Xue
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| |
Collapse
|
15
|
Zheng W, Zhou X, Wuyun Q, Pearce R, Li Y, Zhang Y. FUpred: detecting protein domains through deep-learning-based contact map prediction. Bioinformatics 2020; 36:3749-3757. [PMID: 32227201 DOI: 10.1093/bioinformatics/btaa217] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2019] [Revised: 02/27/2020] [Accepted: 03/25/2020] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Protein domains are subunits that can fold and function independently. Correct domain boundary assignment is thus a critical step toward accurate protein structure and function analyses. There is, however, no efficient algorithm available for accurate domain prediction from sequence. The problem is particularly challenging for proteins with discontinuous domains, which consist of domain segments that are separated along the sequence. RESULTS We developed a new algorithm, FUpred, which predicts protein domain boundaries utilizing contact maps created by deep residual neural networks coupled with coevolutionary precision matrices. The core idea of the algorithm is to retrieve domain boundary locations by maximizing the number of intra-domain contacts, while minimizing the number of inter-domain contacts from the contact maps. FUpred was tested on a large-scale dataset consisting of 2549 proteins and generated correct single- and multi-domain classifications with a Matthew's correlation coefficient of 0.799, which was 19.1% (or 5.3%) higher than the best machine learning (or threading)-based method. For proteins with discontinuous domains, the domain boundary detection and normalized domain overlapping scores of FUpred were 0.788 and 0.521, respectively, which were 17.3% and 23.8% higher than the best control method. The results demonstrate a new avenue to accurately detect domain composition from sequence alone, especially for discontinuous, multi-domain proteins. AVAILABILITY AND IMPLEMENTATION https://zhanglab.ccmb.med.umich.edu/FUpred. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
| | - Xiaogen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
| | - Qiqige Wuyun
- Computer Science and Engineering Department, Michigan State University, East Lansing, MI 48824, USA
| | - Robin Pearce
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
| | - Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109.,School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| |
Collapse
|
16
|
Xie R, Li J, Wang J, Dai W, Leier A, Marquez-Lago TT, Akutsu T, Lithgow T, Song J, Zhang Y. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform 2020; 22:5864586. [PMID: 32599617 DOI: 10.1093/bib/bbaa125] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2019] [Revised: 05/22/2020] [Accepted: 05/22/2020] [Indexed: 12/14/2022] Open
Abstract
Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their attractive advantages and performance improvements, the existing methods have some limitations and drawbacks. Firstly, as the characteristics and mechanisms of VFs are continually evolving with the emergence of antibiotic resistance, it is more and more difficult to identify novel VFs using existing tools that were previously developed based on the outdated data sets; secondly, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performances, as the majority of tools only focused on extracting very few types of features. By addressing the aforementioned issues, the accuracy of VF predictors can likely be significantly improved. This, in turn, would be particularly useful in the context of genome wide predictions of VFs. In this work, we present a deep learning (DL)-based hybrid framework (termed DeepVF) that is utilizing the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms, including random forest, support vector machines, extreme gradient boosting and multilayer perceptron, and three DL algorithms, including convolutional neural networks, long short-term memory networks and deep neural networks are employed to train 62 baseline models using these features. In order to integrate their individual strengths, DeepVF effectively combines these baseline models to construct the final meta model using the stacking strategy. Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves a more accurate and stable performance compared with baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Using the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented. Furthermore, its utility, from the user's viewpoint, is compared with that of existing toolkits. We believe that DeepVF will be exploited as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.
Collapse
Affiliation(s)
- Ruopeng Xie
- Bioinformatics Lab at Guilin University of Electronic Technology
| | - Jiahui Li
- Bioinformatics Lab at Guilin University of Electronic Technology
| | - Jiawei Wang
- Biomedicine Discovery Institute and the Department of Microbiology at Monash University, Australia
| | - Wei Dai
- School of Computer Science and Information Security, Guilin University of Electronic Technology, China
| | - André Leier
- Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA
| | - Tatiana T Marquez-Lago
- Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA
| | | | - Trevor Lithgow
- Biomedicine Discovery Institute and the Director of the Centre to Impact AMR at Monash University, Australia
| | - Jiangning Song
- Group Leader in the Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Yanju Zhang
- Leiden Institute of Advanced Computer Science, Leiden University
| |
Collapse
|
17
|
Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020; 25:689-705. [DOI: 10.1016/j.drudis.2020.01.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2019] [Revised: 12/20/2019] [Accepted: 01/28/2020] [Indexed: 01/06/2023]
|
18
|
Shi Q, Chen W, Huang S, Wang Y, Xue Z. Deep learning for mining protein data. Brief Bioinform 2019; 22:194-218. [PMID: 31867611 DOI: 10.1093/bib/bbz156] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2019] [Revised: 10/21/2019] [Accepted: 11/07/2019] [Indexed: 01/16/2023] Open
Abstract
The recent emergence of deep learning to characterize complex patterns of protein big data reveals its potential to address the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions. In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data. The application architectures of these methods include multilayer perceptrons, stacked autoencoders, deep belief networks, two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex neural networks and are described from five perspectives: residue-level prediction, sequence-level prediction, three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some practical issues and their future directions are discussed, such as robust deep learning for protein noisy data, architecture optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives on general deep learning techniques for protein data analysis.
Collapse
Affiliation(s)
- Qiang Shi
- School of Software Engineering, Huazhong University of Science and Technology. His main interests cover machine learning especially deep learning, protein data analysis, and big data mining
| | - Weiya Chen
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, virtual reality, and data visualization
| | - Siqi Huang
- Software Engineering at Huazhong University of science and technology, focusing on Machine learning and data mining
| | - Yan Wang
- School of life, University of Science & Technology; her main interests cover protein structure and function prediction and big data mining
| | - Zhidong Xue
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, machine learning, and image processing
| |
Collapse
|