1
Zhao C, Wang S. AttCON: With better MSAs and attention mechanism for accurate protein contact map prediction. Comput Biol Med 2024; 169:107822. [PMID: 38091726 DOI: 10.1016/j.compbiomed.2023.107822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 11/19/2023] [Accepted: 12/04/2023] [Indexed: 02/08/2024]
Abstract
Protein contact map prediction is a critical step in protein structure prediction, and its accuracy depends heavily on the feature representations of protein sequence information and the efficacy of deep learning models. In this paper, we propose an algorithm, DeepMSA+, to generate protein multiple sequence alignments (MSAs) and to construct feature representations based on co-evolutionary information and sequence information derived from MSAs. We also propose an improved deep learning model, AttCON, which is trained on these input features to predict protein contact maps. The model incorporates an attention module, and by comparing different attention modules, we find a parameter-free attention module suitable for contact map prediction. Additionally, we use the Focal Loss function to better address the data imbalance issue in protein contact maps. We also developed a weighted evaluation index (W score) for model evaluation, which takes into account a wide range of metrics, with a particular focus on the precision of predictions for medium-range and long-range contacts. Experimental results show that AttCON achieves good precision on datasets from CASP11 to CASP15. Compared to some state-of-the-art methods, it achieves an average improvement of over 5% in both medium-range and long-range predictions, and the W score improves by an average of 2 points.
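The abstract names Focal Loss as its remedy for class imbalance (most residue pairs are not in contact). The sketch below shows the general binary focal loss in plain Python, not the AttCON implementation. The function names are hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are the commonly quoted values from the original focal loss paper, assumed here only for illustration.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one residue pair.

    p: predicted contact probability, y: true label (1 = contact, 0 = no contact).
    gamma and alpha are illustrative defaults, not values taken from the abstract.
    """
    p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) pairs.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a flat list of residue pairs."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # One confident correct contact, one confident correct non-contact, one missed contact.
    print(mean_focal_loss([0.9, 0.1, 0.2], [1, 0, 1]))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual binary cross-entropy, which is a convenient sanity check.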
Affiliation(s)
- Che Zhao
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
- Shunfang Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China; Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming, 650504, Yunnan, China.
2
Chen S, Yang Y, Zhou H, Sun Q, Su R. DNN-PNN: A parallel deep neural network model to improve anticancer drug sensitivity. Methods 2023; 209:1-9. [PMID: 36410694 DOI: 10.1016/j.ymeth.2022.11.002] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/17/2022] [Revised: 10/11/2022] [Accepted: 11/14/2022] [Indexed: 11/19/2022] Open
Abstract
With the rapid development of deep learning techniques and large-scale genomics databases, applying deep learning to the prediction of anticancer drug sensitivity has great potential and can effectively improve the efficiency and accuracy with which therapeutic biomarkers are identified. In this study, we propose a parallel deep learning framework, DNN-PNN, which integrates rich and heterogeneous information from gene expression and pharmaceutical chemical structure data. DNN-PNN introduces a new and more effective drug data representation strategy in which the correlation between features is represented by their product, which alleviates the limitations of high-dimensional discrete data in deep learning. Furthermore, the framework is optimized to reduce the time complexity of the model. We conducted extensive experiments on the CCLE datasets to compare DNN-PNN with its variant DNN-FM, which represents the traditional feature correlation model, with the DNN or PNN component alone, and with common machine learning models. We found that DNN-PNN not only has high prediction accuracy but also has significant advantages in stability and convergence speed.
Affiliation(s)
- Siqi Chen
  - College of Intelligence and Computing, Tianjin University, Tianjin 300072, China.
- Yang Yang
  - College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
- Haoran Zhou
  - College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
- Qisong Sun
  - College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
- Ran Su
  - College of Intelligence and Computing, Tianjin University, Tianjin 300072, China.
3
Mufassirin MMM, Newton MAH, Sattar A. Artificial intelligence for template-free protein structure prediction: a comprehensive review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10350-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
4
Roche R, Bhattacharya S, Shuvo MH, Bhattacharya D. rrQNet: Protein contact map quality estimation by deep evolutionary reconciliation. Proteins 2022; 90:2023-2034. [PMID: 35751651 PMCID: PMC9633355 DOI: 10.1002/prot.26394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 05/31/2022] [Accepted: 06/21/2022] [Indexed: 11/10/2022]
Abstract
Protein contact maps have proven to be a valuable tool in the deep learning revolution of protein structure prediction, ushering in the recent breakthrough by AlphaFold2. However, self-assessment of the quality of predicted structures is typically performed at the granularity of three-dimensional coordinates as opposed to directly exploiting the rotation- and translation-invariant two-dimensional (2D) contact maps. Here, we present rrQNet, a deep learning method for self-assessment in 2D by contact map quality estimation. Our approach is based on the intuition that for a contact map to be of high quality, the residue pairs predicted to be in contact should be mutually consistent with the evolutionary context of the protein. The deep neural network architecture of rrQNet implements this intuition by cascading two deep modules, one encoding the evolutionary context and the other performing evolutionary reconciliation. The penultimate stage of rrQNet estimates the quality scores at the interacting residue-pair level, which are then aggregated to estimate the quality of a contact map. This design choice offers versatility at varied resolutions, from individual residue pairs to full-fledged contact maps. Trained on multiple complementary sources of contact predictors, rrQNet facilitates generalizability across various contact maps. By rigorously testing on publicly available datasets and comparing against several in-house baseline approaches, we show that rrQNet accurately reproduces the true quality score of a predicted contact map and successfully distinguishes between accurate and inaccurate contact maps predicted by a wide variety of contact predictors. The open-source rrQNet software package is freely available at https://github.com/Bhattacharya-Lab/rrQNet.
Affiliation(s)
- Rahmatullah Roche
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Sutanu Bhattacharya
  - Department of Computer Science, Florida Polytechnic University, Lakeland, FL 33805, USA
- Md Hossain Shuvo
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
5
Lin Y, Wang Y, Liang Y, Yu Y, Li J, Ma Q, He F, Xu D. Sampling and ranking spatial transcriptomics data embeddings to identify tissue architecture. Front Genet 2022; 13:912813. [PMID: 36035139 PMCID: PMC9411666 DOI: 10.3389/fgene.2022.912813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 07/08/2022] [Indexed: 11/13/2022] Open
Abstract
Spatial transcriptomics is an emerging technology widely applied to the analysis of tissue architecture and the corresponding biological functions. Numerous computational methods have been developed for analyzing spatial transcriptomics data. These methods generate embeddings from gene expression and spatial locations for spot clustering or tissue architecture segmentation. Although the hyperparameters used to produce an embedding can be tuned for a given training set, a fixed embedding performs differently from case to case because data distributions differ. Therefore, selecting an effective embedding for new data in advance would be useful. For this purpose, we developed an embedding evaluation method named message passing-Moran's I with maximum filtering (MP-MIM), which combines message passing-based embedding transformation with spatial autocorrelation analysis. We applied a graph convolution to aggregate spatial transcriptomics data and employed global Moran's I to measure spatial autocorrelation and select the most effective embedding to infer tissue architecture. Sixteen spatial transcriptomics samples generated from the human brain were used to validate our method. The results show that MP-MIM can accurately identify high-quality embeddings that produce a high correlation between the predicted tissue architecture and the ground truth. Overall, our study provides a novel method to select embeddings for new test data and enhance the usability of deep learning tools for spatial transcriptome analyses.
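The abstract relies on global Moran's I as its measure of spatial autocorrelation. As a minimal sketch of that statistic only (not of MP-MIM, which also applies message passing and maximum filtering), the function below computes Moran's I for one embedding dimension. The function name, the dense weight-matrix format, and the toy data are assumptions made for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for one feature measured at n spots.

    values:  list of n numbers (e.g. one embedding dimension per spot).
    weights: n x n list of lists; weights[i][j] > 0 when spots i and j are
             spatial neighbours (hypothetical input format).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations, counted only between neighbouring spots.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * num / den


if __name__ == "__main__":
    # Four spots on a line; each spot neighbours the next one.
    chain = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(morans_i([1, 1, 0, 0], chain))  # spatially clustered values
    print(morans_i([1, 0, 1, 0], chain))  # alternating values
```

Values near +1 indicate that neighbouring spots carry similar values, values near -1 indicate alternation, and values near 0 indicate no spatial structure.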
Affiliation(s)
- Yu Lin
  - School of Artificial Intelligence, Jilin University, Changchun, China
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Yan Wang
  - School of Artificial Intelligence, Jilin University, Changchun, China
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Yanchun Liang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
  - School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, China
- Yang Yu
  - School of Information Science and Technology, Northeast Normal University, Changchun, China
- Jingyi Li
  - School of Information Science and Technology, Northeast Normal University, Changchun, China
- Qin Ma
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
- Fei He
  - School of Information Science and Technology, Northeast Normal University, Changchun, China
- Dong Xu
  - Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
6
Rezende PM, Xavier JS, Ascher DB, Fernandes GR, Pires DEV. Evaluating hierarchical machine learning approaches to classify biological databases. Brief Bioinform 2022; 23:6611916. [PMID: 35724625 PMCID: PMC9310517 DOI: 10.1093/bib/bbac216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2021] [Revised: 04/29/2022] [Accepted: 05/09/2022] [Indexed: 12/04/2022] Open
Abstract
The rate of biological data generation has increased dramatically in recent years, which has driven the importance of databases as a resource to guide innovation and the generation of biological insights. Given the complexity and scale of these databases, automatic data classification is often required. Biological data sets are often hierarchical in nature, with varying degrees of complexity, imposing different challenges to train, test and validate accurate and generalizable classification models. While some approaches to classify hierarchical data have been proposed, no guidelines regarding their utility, applicability and limitations have been explored or implemented. These include ‘Local’ approaches considering the hierarchy, building models per level or node, and ‘Global’ hierarchical classification, using a flat classification approach. To fill this gap, here we have systematically contrasted the performance of ‘Local per Level’ and ‘Local per Node’ approaches with a ‘Global’ approach applied to two different hierarchical datasets: BioLip and CATH. The results show how different components of hierarchical data sets, such as variation coefficient and prediction by depth, can guide the choice of appropriate classification schemes. Finally, we provide guidelines to support this process when embarking on a hierarchical classification task, which will help optimize computational resources and predictive performance.
Affiliation(s)
- Pâmela M Rezende
  - Universidade Federal de Minas Gerais; Instituto René Rachou, Fundação Oswaldo Cruz; Stilingue Inteligência Artificial
- Joicymara S Xavier
  - Universidade Federal de Minas Gerais; Instituto René Rachou, Fundação Oswaldo Cruz; Institute of Agricultural Sciences, Universidade Federal dos Vales do Jequitinhonha e Mucuri
- David B Ascher
  - School of Chemistry and Molecular Biosciences, University of Queensland; Systems and Computational Biology, Bio 21 Institute, University of Melbourne; Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute
- Douglas E V Pires
  - Systems and Computational Biology, Bio 21 Institute, University of Melbourne; Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute; School of Computing and Information Systems, University of Melbourne
7
Lee D, Xiong D, Wierbowski S, Li L, Liang S, Yu H. Deep learning methods for 3D structural proteome and interactome modeling. Curr Opin Struct Biol 2022; 73:102329. [PMID: 35139457 PMCID: PMC8957610 DOI: 10.1016/j.sbi.2022.102329] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/24/2021] [Revised: 12/05/2021] [Accepted: 12/31/2021] [Indexed: 12/19/2022]
Abstract
Bolstered by recent methodological and hardware advances, deep learning has increasingly been applied to biological problems and structural proteomics. Such approaches have achieved remarkable improvements over traditional machine learning methods in tasks ranging from protein contact map prediction to protein folding, prediction of protein-protein interaction interfaces, and characterization of protein-drug binding pockets. In particular, the emergence of ab initio protein structure prediction methods, including AlphaFold2, has revolutionized protein structural modeling. From a protein function perspective, numerous deep learning methods have facilitated deconvolution of the exact amino acid residues and protein surface regions responsible for binding other proteins or small-molecule drugs. In this review, we provide a comprehensive overview of recent deep learning methods applied in structural proteomics.
Affiliation(s)
- Dongjin Lee
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Dapeng Xiong
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Shayne Wierbowski
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Le Li
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Siqi Liang
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Haiyuan Yu
  - Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA; Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA.
8
Si Y, Yan C. Improved protein contact prediction using dimensional hybrid residual networks and singularity enhanced loss function. Brief Bioinform 2021; 22:6357883. [PMID: 34448830 DOI: 10.1093/bib/bbab341] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/14/2021] [Revised: 07/10/2021] [Accepted: 08/02/2021] [Indexed: 11/12/2022] Open
Abstract
Deep residual learning has shown great success in protein contact prediction. In this study, a new deep residual learning-based protein contact prediction model was developed. Compared with previous models, a new type of residual block hybridizing 1D and 2D convolutions was designed to increase the effective receptive field of the residual network, and a new loss function emphasizing the easily misclassified residue pairs was proposed to enhance model training. The developed protein contact prediction model, referred to as DRN-1D2D, was first evaluated on 105 CASP11 targets, 76 CAMEO hard targets and 398 membrane proteins, together with two in-house reference models based on either the standard 2D residual block or the traditional BCE loss function. This evaluation confirmed that both the dimensional hybrid residual block and the singularity enhanced loss function improve model performance for protein contact prediction. DRN-1D2D was further evaluated on 39 CASP13 and CASP14 free modeling targets, together with the two reference models and six state-of-the-art protein contact prediction models: DeepCov, DeepCon, DeepConPred2, SPOT-Contact, RaptorX-Contact and TripleRes. The results show that DRN-1D2D consistently achieved the best performance among all these models.
Affiliation(s)
- Yunda Si
  - School of Physics, Huazhong University of Science and Technology, China
- Chengfei Yan
  - School of Physics, Huazhong University of Science and Technology, China
9
Reza MS, Zhang H, Hossain MT, Jin L, Feng S, Wei Y. COMTOP: Protein Residue-Residue Contact Prediction through Mixed Integer Linear Optimization. Membranes 2021; 11:membranes11070503. [PMID: 34209399 PMCID: PMC8305966 DOI: 10.3390/membranes11070503] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/25/2021] [Revised: 06/24/2021] [Accepted: 06/25/2021] [Indexed: 11/17/2022]
Abstract
Protein contact prediction helps reconstruct the tertiary structure that greatly determines a protein’s function; therefore, contact prediction from the sequence is an important problem. Recently there has been exciting progress on this problem, but many of the existing methods still have low prediction accuracy. In this paper, we present a new mixed integer linear programming (MILP)-based consensus method: a Consensus scheme based On a Mixed integer linear opTimization method for prOtein contact Prediction (COMTOP). The MILP-based consensus method combines the strengths of seven selected protein contact prediction methods, namely CCMpred, EVfold, DeepCov, NNcon, PconsC4, plmDCA, and PSICOV, by optimizing the number of correctly predicted contacts, and thereby achieves better prediction accuracy. The proposed hybrid protein residue–residue contact prediction scheme was tested on four independent test sets. For 239 highly non-redundant proteins, the method showed a prediction accuracy of 59.68%, 70.79%, 78.86%, 89.04%, 94.51%, and 97.35% for top-5L, top-3L, top-2L, top-L, top-L/2, and top-L/5 contacts, respectively. When tested on the CASP13 and CASP14 test sets, the proposed method obtained accuracies of 75.91% and 77.49% for top-L/5 predictions, respectively. COMTOP was further tested on 57 non-redundant α-helical transmembrane proteins and achieved prediction accuracies of 64.34% and 73.91% for top-L/2 and top-L/5 predictions, respectively. For all test datasets, the improvement of COMTOP in accuracy over the seven individual methods increased with the number of predicted contacts. For example, COMTOP performed much better for large numbers of predicted contacts (such as top-5L and top-3L) than for small numbers (such as top-L/2 and top-L/5). The results and analysis demonstrate that COMTOP can significantly improve the performance of the individual methods; therefore, COMTOP is more robust against different types of test sets. COMTOP also showed better or comparable predictions when compared with the state-of-the-art predictors.
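The accuracies above are reported for top-5L through top-L/5 contacts, where L is the sequence length. The sketch below shows one common way to compute such a figure: rank the predicted residue pairs by score, keep the best `factor * L` of them, and report the fraction that are true contacts. It is an illustration, not the COMTOP evaluation code. The function name, the dictionary input format, and the optional `min_sep` sequence-separation filter are all assumptions.

```python
def top_l_precision(scores, contacts, length, factor=0.2, min_sep=1):
    """Precision of the top (factor * L) predicted contacts.

    scores:   dict mapping (i, j) with i < j to a predicted contact score.
    contacts: set of (i, j) pairs that are true contacts.
    length:   sequence length L.
    factor:   0.2 gives top-L/5, 0.5 gives top-L/2, 5.0 gives top-5L.
    min_sep:  minimum sequence separation j - i to consider (assumed option;
              the abstract does not state which separation was used).
    """
    ranked = sorted(
        (pair for pair in scores if pair[1] - pair[0] >= min_sep),
        key=lambda pair: scores[pair],
        reverse=True,
    )
    n_top = max(1, int(length * factor))
    top = ranked[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in contacts)
    return hits / len(top)


if __name__ == "__main__":
    toy_scores = {(0, 9): 0.9, (1, 8): 0.8, (2, 9): 0.3, (0, 7): 0.1}
    toy_contacts = {(0, 9), (2, 9)}
    print(top_l_precision(toy_scores, toy_contacts, length=10, factor=0.2))
```

Because fewer pairs are kept as the factor shrinks, top-L/5 precision is usually higher than top-5L precision for the same predictor, which matches the ordering of the percentages quoted in the abstract.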
Affiliation(s)
- Md. Selim Reza
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Huiling Zhang
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Md. Tofazzal Hossain
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Langxi Jin
  - Department of Computer Science and Technology, School of Computer Science and Technology, Harbin University of Science and Technology, 52 Xuefu Road, Nangang District, Harbin 150080, China
- Shengzhong Feng
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Yanjie Wei
  - School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  - Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
  - Correspondence:
10
Jiang L, Xie S, Wang Y, Xu X, Zhao X, Zhang Y, Wang J, Hu L. SeekDoc: Seeking eligible doctors from electronic health record. Mathematical Biosciences and Engineering: MBE 2021; 18:5347-5363. [PMID: 34517491 DOI: 10.3934/mbe.2021271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
With the development of online medical service platforms, patients can find more medical information resources and obtain better medical treatment. However, it is difficult for patients to discover the most suitable doctors from these complex information resources. Therefore, the analysis and mining of Electronic Health Records (EHRs) is very important for patients' timely and accurate treatment. Discovering the most suitable doctor amounts to predicting the doctor's performance for a specific disease. We believe that "a curative/bad treatment is likely to be caused by a good/bad doctor, and a good/bad doctor has a higher/lower evaluation by the patient(s)". In this paper, we propose a novel approach named SeekDoc, which seeks the most effective doctor for a specific disease. Specifically, we build a doctor-disease heterogeneous information network and collect patients' reviews and rating records for doctors. Then, we embed the comprehensive comment data for doctors and the constructed heterogeneous information network. Next, we use an autoencoder, an effective algorithm for constructing latent feature representations in an unsupervised manner, to learn the embedded features. After this learning, the latent features are input into the extreme gradient boosting (XGBoost) algorithm to improve their detection capabilities. Finally, extensive experiments show that our method can effectively and efficiently predict the doctor's experience score for specific diseases and performs well compared with other algorithms.
Affiliation(s)
- Lu Jiang
  - Northeast Normal University, Changchun 130117, China
- Shasha Xie
  - Northeast Normal University, Changchun 130117, China
- Yuqi Wang
  - Northeast Normal University, Changchun 130117, China
- Xin Xu
  - Northeast Normal University, Changchun 130117, China
- Xiaosa Zhao
  - Northeast Normal University, Changchun 130117, China
- Ye Zhang
  - Northeast Normal University, Changchun 130117, China
- Jianan Wang
  - Northeast Normal University, Changchun 130117, China
- Lihong Hu
  - Northeast Normal University, Changchun 130117, China
11
Sun D, Gong X. Tetramer protein complex interface residue pairs prediction with LSTM combined with graph representations. Biochimica et Biophysica Acta - Proteins and Proteomics 2020; 1868:140504. [PMID: 32717382 DOI: 10.1016/j.bbapap.2020.140504] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/28/2020] [Revised: 06/30/2020] [Accepted: 07/16/2020] [Indexed: 10/23/2022]
Abstract
MOTIVATION Protein-protein interactions are important for many biological processes. A theoretical understanding of the structural determinants of interaction sites will help to explain the underlying mechanism of protein-protein interactions, and advanced mathematical methods that correctly predict interaction sites will be useful. Although some previous studies have addressed the interaction interface of protein monomers and the interface residues between chains of protein dimers, very few have addressed interface residue prediction for protein multimers, including trimers, tetramers and larger complexes with even more monomers. A large number of proteins function as multibody protein complexes, and the complexity of multimer structures makes interface residue prediction difficult. We therefore aim to build a method for predicting protein tetramer interface residue pairs. RESULTS We developed a new deep network, based on an LSTM network combined with a graph representation, to predict interface residue pairs in protein tetramers. Protein structure data differ from image or video data, which are well-arranged matrices, namely the Euclidean structure mentioned in many studies. Because non-Euclidean data do not preserve translation invariance, and we want to extract spatial features from such data for deep learning, we developed an algorithm that combines a topological graph, which relates vertices and edges as in graph theory, with a multilayer Long Short-Term Memory network to predict the interface residue pairs of protein interactions. First, we select the training and test samples from the Protein Data Bank and extract the physicochemical and geometric features of surface residues associated with interfacial properties. Subsequently, we transform the protein multimer data into topological graphs and predict protein interaction interface residue pairs using the model. Different types of evaluation indicators verify its validity.
Affiliation(s)
- Daiwen Sun
  - Mathematics Intelligence Application LAB, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, PR China
- Xinqi Gong
  - Mathematics Intelligence Application LAB, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, PR China; Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100091, PR China.
12
Sun J, Frishman D. DeepHelicon: Accurate prediction of inter-helical residue contacts in transmembrane proteins by residual neural networks. J Struct Biol 2020; 212:107574. [PMID: 32663598 DOI: 10.1016/j.jsb.2020.107574] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/10/2020] [Revised: 07/03/2020] [Accepted: 07/07/2020] [Indexed: 01/16/2023]
Abstract
Accurate prediction of amino acid residue contacts is an important prerequisite for generating high-quality 3D models of transmembrane (TM) proteins. While a large number of compositional, evolutionary, and structural properties of proteins can be used to train contact prediction methods, recent research suggests that coevolution between residues provides the strongest indication of their spatial proximity. We have developed a deep learning approach, DeepHelicon, to predict inter-helical residue contacts in TM proteins by considering only coevolutionary features. DeepHelicon comprises a two-stage supervised learning process by residual neural networks for a gradual refinement of contact maps, followed by variance reduction by an ensemble of models. We present a benchmark study of 12 contact predictors and conclude that DeepHelicon, together with the two other state-of-the-art methods DeepMetaPSICOV and Membrain2, outperforms the 10 remaining algorithms on all datasets and at all settings. On a set of 44 TM proteins with an average length of 388 residues, DeepHelicon achieves the best performance among all benchmarked methods in predicting the top L/5 and L/2 inter-helical contacts, with mean precisions of 87.42% and 77.84%, respectively. On a set of 57 relatively small TM proteins with an average length of 298 residues, DeepHelicon ranks second after DeepMetaPSICOV. DeepHelicon produces the most accurate predictions for large proteins with more than 10 transmembrane helices. Coevolutionary features alone allow the prediction of inter-helical residue contacts with an accuracy sufficient for generating acceptable 3D models for up to 30% of proteins using a fully automated modeling method such as CONFOLD2.
Affiliation(s)
- Jianfeng Sun
  - Department of Bioinformatics, Wissenschaftzentrum Weihenstephan, Technische Universität München, 85354 Freising, Germany
- Dmitrij Frishman
  - Department of Bioinformatics, Wissenschaftzentrum Weihenstephan, Technische Universität München, 85354 Freising, Germany.
13
Jiang M, Li Z, Zhang S, Wang S, Wang X, Yuan Q, Wei Z. Drug-target affinity prediction using graph neural network and contact maps. RSC Adv 2020; 10:20701-20712. [PMID: 35517730 PMCID: PMC9054320 DOI: 10.1039/d0ra02297g] [Citation(s) in RCA: 166] [Impact Index Per Article: 33.2] [Received: 03/11/2020] [Accepted: 05/07/2020] [Indexed: 02/01/2023] Open
Abstract
Computer-aided drug design uses high-performance computers to simulate the tasks in drug design, which is a promising research area. Drug-target affinity (DTA) prediction is the most important step of computer-aided drug design, which could speed up drug development and reduce resource consumption. With the development of deep learning, the introduction of deep learning to DTA prediction and improving the accuracy have become a focus of research. In this paper, utilizing the structural information of molecules and proteins, two graphs of drug molecules and proteins are built up respectively. Graph neural networks are introduced to obtain their representations, and a method called DGraphDTA is proposed for DTA prediction. Specifically, the protein graph is constructed based on the contact map output from the prediction method, which could predict the structural characteristics of the protein according to its sequence. It can be seen from the test of various metrics on benchmark datasets that the method proposed in this paper has strong robustness and generalizability.
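The step of building a protein graph from a predicted contact map can be sketched as follows: each residue becomes a node, and an edge is added wherever the predicted contact probability reaches a threshold. This is an illustrative sketch only, not the DGraphDTA implementation; the 0.5 threshold, the function names, and the toy probability map are assumptions.

```python
def contact_map_to_edges(prob_map, threshold=0.5):
    """Turn a symmetric L x L contact-probability map into an undirected edge list.

    The threshold of 0.5 is an assumed value chosen for illustration.
    """
    n = len(prob_map)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the map is symmetric
            if prob_map[i][j] >= threshold:
                edges.append((i, j))
    return edges


def to_adjacency(edges, n):
    """Build an adjacency list, the usual input form for graph neural networks."""
    adjacency = {node: [] for node in range(n)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    return adjacency


# Hypothetical 4-residue contact-probability map.
toy_map = [
    [1.0, 0.9, 0.2, 0.7],
    [0.9, 1.0, 0.8, 0.1],
    [0.2, 0.8, 1.0, 0.6],
    [0.7, 0.1, 0.6, 1.0],
]
toy_edges = contact_map_to_edges(toy_map)
print(toy_edges)                   # [(0, 1), (0, 3), (1, 2), (2, 3)]
print(to_adjacency(toy_edges, 4))  # {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
```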
Affiliation(s)
- Mingjian Jiang
- Department of Computer Science and Technology, Ocean University of China, China
- Zhen Li
- Department of Computer Science and Technology, Ocean University of China, China
- Shugang Zhang
- Department of Computer Science and Technology, Ocean University of China, China
- Shuang Wang
- Department of Computer Science and Technology, Ocean University of China, China
- Xiaofeng Wang
- Department of Computer Science and Technology, Ocean University of China, China
- Qing Yuan
- Department of Computer Science and Technology, Ocean University of China, China
- Zhiqiang Wei
- Department of Computer Science and Technology, Ocean University of China, China
|
14
|
Small design from big alignment: engineering proteins with multiple sequence alignment as the starting point. Biotechnol Lett 2020; 42:1305-1315. [PMID: 32430802 DOI: 10.1007/s10529-020-02914-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/31/2020] [Accepted: 05/14/2020] [Indexed: 02/08/2023]
Abstract
Multiple sequence alignment (MSA) is a fundamental way to gain information that cannot be obtained from the analysis of any individual sequence included in the alignment. It provides ways to investigate the relationship between sequence and function from an evolutionary perspective. Thus, the MSA of proteins can be employed as a reference for protein engineering. In this paper, we review recent advances to highlight how protein engineering has benefited from the MSA of proteins. These methods include (1) engineering the thermostability or solubility of proteins by introducing site mutations that bring them closer to the consensus sequence of the alignment; (2) structure-based engineering of proteins with comparative modeling; (3) creating paleoenzymes featuring high thermostability and promiscuity by reconstructing ancestral sequences derived from the multiple sequence alignment; and (4) incorporating site mutations that target the evolutionarily coupled sites identified from the multiple sequence alignment.
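Approach (1) above, moving a protein toward the consensus sequence of its alignment, can be illustrated with a short sketch. It derives a majority-vote consensus from an MSA and lists the point mutations that would bring a target sequence closer to it. The function names, the toy alignment, and the simple majority rule (no sequence weighting, gaps ignored) are assumptions for illustration, not a procedure taken from the review.

```python
from collections import Counter


def consensus_sequence(msa):
    """Majority-vote consensus of equal-length aligned sequences; gaps are ignored."""
    consensus = []
    for column in zip(*msa):
        counts = Counter(residue for residue in column if residue != "-")
        # Fall back to a gap when the column contains only gaps.
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


def consensus_mutations(target, consensus):
    """List substitutions (e.g. 'R2K', 1-based positions) that move target toward consensus."""
    return [
        f"{t}{position}{c}"
        for position, (t, c) in enumerate(zip(target, consensus), start=1)
        if t != c and t != "-" and c != "-"
    ]


# Hypothetical toy alignment of four homologs.
toy_msa = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
toy_consensus = consensus_sequence(toy_msa)
print(toy_consensus)                                # MKVLA
print(consensus_mutations("MRVLG", toy_consensus))  # ['R2K', 'G5A']
```

In practice the candidate list would be filtered further, for example by conservation strength or structural context, before any mutation is tested.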
|
15
|
Getting to Know Your Neighbor: Protein Structure Prediction Comes of Age with Contextual Machine Learning. J Comput Biol 2020; 27:796-814. [DOI: 10.1089/cmb.2019.0193] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 12/12/2022] Open
|
16
|
Protein Contact Map Prediction Based on ResNet and DenseNet. Biomed Res Int 2020; 2020:7584968. [PMID: 32337273 PMCID: PMC7165324 DOI: 10.1155/2020/7584968] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 02/04/2020] [Accepted: 03/05/2020] [Indexed: 11/18/2022]
Abstract
Residue-residue contact prediction has become an increasingly important tool for modeling the three-dimensional structure of a protein when no homologous structure is available. Ultradeep residual neural network (ResNet) has become the most popular method for making contact predictions because it captures the contextual information between residues. In this paper, we propose a novel deep neural network framework for contact prediction which combines ResNet and DenseNet. This framework uses 1D ResNet to process sequential features, and besides PSSM, SS3, and solvent accessibility, we have introduced a new feature, position-specific frequency matrix (PSFM), as an input. Using ResNet's residual module and identity mapping, it can effectively process sequential features after which the outer concatenation function is used for sequential and pairwise features. Prediction accuracy is improved following a final processing step using the dense connection of DenseNet. The prediction accuracy of the protein contact map shows that our method is more effective than other popular methods due to the new network architecture and the added feature input.
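The outer concatenation mentioned in this abstract converts per-residue (1D) features into pairwise (2D) features, so that a network operating on the L x L contact map can see the sequential features of both residues in each pair. A minimal sketch of the simplest variant is shown below; it assumes plain Python lists and concatenates only the features of residues i and j. Whether the authors also include other terms (some published variants add the midpoint residue) is not stated in the abstract.

```python
def outer_concatenate(seq_features):
    """Map an L x d list of per-residue features to an L x L x 2d pairwise tensor.

    Entry [i][j] is the feature vector of residue i followed by that of residue j.
    """
    return [[f_i + f_j for f_j in seq_features] for f_i in seq_features]


# Hypothetical features for a 3-residue sequence with d = 2 features per residue.
toy_features = [[1, 2], [3, 4], [5, 6]]
pairwise = outer_concatenate(toy_features)
print(len(pairwise), len(pairwise[0]), len(pairwise[0][0]))  # 3 3 4
print(pairwise[0][2])                                        # [1, 2, 5, 6]
```

The resulting tensor is not symmetric in i and j, which is why such features are usually combined with symmetric pairwise inputs such as coevolution scores.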
|
17
|
Shi Q, Chen W, Huang S, Wang Y, Xue Z. Deep learning for mining protein data. Brief Bioinform 2019; 22:194-218. [PMID: 31867611 DOI: 10.1093/bib/bbz156] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 08/16/2019] [Revised: 10/21/2019] [Accepted: 11/07/2019] [Indexed: 01/16/2023] Open
Abstract
The recent emergence of deep learning to characterize complex patterns of protein big data reveals its potential to address the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions. In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data. The application architectures of these methods include multilayer perceptrons, stacked autoencoders, deep belief networks, two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex neural networks and are described from five perspectives: residue-level prediction, sequence-level prediction, three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some practical issues and their future directions are discussed, such as robust deep learning for protein noisy data, architecture optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives on general deep learning techniques for protein data analysis.
Affiliation(s)
- Qiang Shi
- School of Software Engineering, Huazhong University of Science and Technology. His main interests cover machine learning, especially deep learning, protein data analysis, and big data mining
- Weiya Chen
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, virtual reality, and data visualization
- Siqi Huang
- Software Engineering at Huazhong University of Science and Technology, focusing on machine learning and data mining
- Yan Wang
- School of life, University of Science & Technology; her main interests cover protein structure and function prediction and big data mining
- Zhidong Xue
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, machine learning, and image processing
|
18
|
Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y. Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics 2019; 34:4039-4045. [PMID: 29931279 DOI: 10.1093/bioinformatics/bty481] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 02/09/2018] [Accepted: 06/13/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation: Accurate prediction of a protein contact map depends greatly on capturing as much contextual information as possible from surrounding residues for a target residue pair. Recently, ultra-deep residual convolutional networks were found to be state-of-the-art in the latest Critical Assessment of Structure Prediction techniques (CASP12) for protein contact map prediction by attempting to provide a protein-wide context at each residue pair. Recurrent neural networks have seen great success in recent protein residue classification problems due to their ability to propagate information through long protein sequences, especially Long Short-Term Memory (LSTM) cells. Here, we propose a novel protein contact map prediction method by stacking residual convolutional networks with two-dimensional residual bidirectional recurrent LSTM networks, and using both one-dimensional sequence-based and two-dimensional evolutionary coupling-based information. Results: We show that the proposed method achieves a robust performance over validation and independent test sets with the Area Under the receiver operating characteristic Curve (AUC) > 0.95 in all tests. When compared to several state-of-the-art methods for independent testing of 228 proteins, the method yields an AUC value of 0.958, whereas the next-best method obtains an AUC of 0.909. More importantly, the improvement is over contacts at all sequence-position separations. Specifically, increases in precision of 8.95%, 5.65% and 2.84% were observed for the top L/10 predictions over the next best for short, medium and long-range contacts, respectively. This confirms the usefulness of ResNets to congregate the short-range relations and 2D-BRLSTM to propagate the long-range dependencies throughout the entire protein contact map 'image'. Availability and implementation: SPOT-Contact server url: http://sparks-lab.org/jack/server/SPOT-Contact/. Supplementary information: Supplementary data are available at Bioinformatics online.
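The short, medium and long-range categories used in this abstract refer to the sequence separation between the two residues of a pair. The abstract does not state the cut-offs, so the sketch below assumes the convention commonly used in CASP-style evaluations (short: 6 to 11, medium: 12 to 23, long: 24 or more residues apart); the function names and toy scores are likewise illustrative.

```python
def contact_range(i, j):
    """Classify a residue pair by sequence separation (assumed CASP-style cut-offs)."""
    separation = abs(i - j)
    if separation >= 24:
        return "long"
    if separation >= 12:
        return "medium"
    if separation >= 6:
        return "short"
    return "local"  # pairs closer than 6 residues are usually not evaluated


def top_l10_by_range(scores, length):
    """Select the top L/10 highest-scoring pairs separately within each range."""
    n_top = max(1, length // 10)
    selected = {"short": [], "medium": [], "long": []}
    for pair in sorted(scores, key=scores.get, reverse=True):
        label = contact_range(*pair)
        if label in selected and len(selected[label]) < n_top:
            selected[label].append(pair)
    return selected


# Hypothetical scores for a protein of length 30 (top L/10 = 3 pairs per range).
toy_scores = {
    (0, 29): 0.90, (1, 27): 0.60, (2, 28): 0.50, (0, 25): 0.40,  # long range
    (0, 15): 0.80,                                               # medium range
    (0, 8): 0.70,                                                # short range
    (0, 2): 0.95,                                                # local, ignored
}
print(top_l10_by_range(toy_scores, length=30))
```

Precision per range is then the fraction of each selected list that corresponds to true contacts.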
Affiliation(s)
- Jack Hanson
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Kuldip Paliwal
- Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Thomas Litfin
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Australia
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Australia
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong, China
- Yaoqi Zhou
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Australia
|
19
|
Zeng H, Wang S, Zhou T, Zhao F, Li X, Wu Q, Xu J. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Res 2019; 46:W432-W437. [PMID: 29790960 PMCID: PMC6030867 DOI: 10.1093/nar/gky420] [Citation(s) in RCA: 86] [Impact Index Per Article: 14.3] [Received: 02/13/2018] [Accepted: 05/20/2018] [Indexed: 12/15/2022] Open
Abstract
ComplexContact (http://raptorx2.uchicago.edu/ComplexContact/) is a web server for sequence-based interfacial residue-residue contact prediction of a putative protein complex. Interfacial residue-residue contacts are critical for understanding how proteins form complexes and interact at the residue level. When receiving a pair of protein sequences, ComplexContact first searches for their sequence homologs and builds two paired multiple sequence alignments (MSAs); it then applies co-evolution analysis and a CASP-winning deep learning (DL) method to predict interfacial contacts from the paired MSAs and visualizes the prediction as an image. The DL method was originally developed for intra-protein contact prediction and performed the best in CASP12. Our large-scale experimental test further shows that ComplexContact greatly outperforms pure co-evolution methods for inter-protein contact prediction, regardless of the species.
Affiliation(s)
- Hong Zeng
- School of Computer Science and Technology, Hangzhou Dianzi University, China
- Sheng Wang
- King Abdullah University of Science and Technology (KAUST), Saudi Arabia; Toyota Technological Institute at Chicago, USA
- Tianming Zhou
- Toyota Technological Institute at Chicago, USA; Institute for Interdisciplinary Information Sciences, Tsinghua University, China
- Feifeng Zhao
- School of Computer Science and Technology, Hangzhou Dianzi University, China
- Xiufeng Li
- School of Computer Science and Technology, Hangzhou Dianzi University, China
- Qing Wu
- School of Computer Science and Technology, Hangzhou Dianzi University, China
- Jinbo Xu
- Toyota Technological Institute at Chicago, USA
|
20
|
Hockenberry AJ, Wilke CO. Evolutionary couplings detect side-chain interactions. PeerJ 2019; 7:e7280. [PMID: 31328041 PMCID: PMC6622159 DOI: 10.7717/peerj.7280] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/14/2019] [Accepted: 06/09/2019] [Indexed: 12/19/2022] Open
Abstract
Patterns of amino acid covariation in large protein sequence alignments can inform the prediction of de novo protein structures, binding interfaces, and mutational effects. While algorithms that detect these so-called evolutionary couplings between residues have proven useful for practical applications, less is known about how and why these methods perform so well, and what insights into biological processes can be gained from their application. Evolutionary coupling algorithms are commonly benchmarked by comparison to true structural contacts derived from solved protein structures. However, the methods used to determine true structural contacts are not standardized and different definitions of structural contacts may have important consequences for interpreting the results from evolutionary coupling analyses and understanding their overall utility. Here, we show that evolutionary coupling analyses are significantly more likely to identify structural contacts between side-chain atoms than between backbone atoms. We use both simulations and empirical analyses to highlight that purely backbone-based definitions of true residue–residue contacts (i.e., based on the distance between Cα atoms) may underestimate the accuracy of evolutionary coupling algorithms by as much as 40% and that a commonly used reference point (Cβ atoms) underestimates the accuracy by 10–15%. These findings show that co-evolutionary outcomes differ according to which atoms participate in residue–residue interactions and suggest that accounting for different interaction types may lead to further improvements to contact-prediction methods.
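To make the point about contact definitions concrete, the sketch below labels the same residue pair under three definitions: Cα distance, Cβ distance, and minimum distance between side-chain atoms. It is an illustration under stated assumptions rather than the authors' protocol: the 8 Å and 4.5 Å cut-offs, the dictionary-based residue format, and the toy coordinates are all chosen for the example.

```python
import math

BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def ca_contact(res_a, res_b, cutoff=8.0):
    """Contact if the C-alpha atoms are within cutoff angstroms (assumed 8 A)."""
    return math.dist(res_a["CA"], res_b["CA"]) <= cutoff


def cb_contact(res_a, res_b, cutoff=8.0):
    """Contact if the C-beta atoms are within cutoff; glycine falls back to C-alpha."""
    atom_a = res_a.get("CB", res_a["CA"])
    atom_b = res_b.get("CB", res_b["CA"])
    return math.dist(atom_a, atom_b) <= cutoff


def side_chain_contact(res_a, res_b, cutoff=4.5):
    """Contact if any pair of side-chain atoms is within cutoff (assumed 4.5 A)."""
    side_a = [xyz for name, xyz in res_a.items() if name not in BACKBONE_ATOMS]
    side_b = [xyz for name, xyz in res_b.items() if name not in BACKBONE_ATOMS]
    if not side_a or not side_b:  # e.g. glycine has no side-chain heavy atoms
        return False
    return min(math.dist(p, q) for p in side_a for q in side_b) <= cutoff


# Hypothetical residues whose backbones are far apart but whose side chains touch.
res_1 = {"CA": (0.0, 0.0, 0.0), "CB": (1.5, 0.0, 0.0), "CG": (3.0, 0.0, 0.0)}
res_2 = {"CA": (9.0, 0.0, 0.0), "CB": (7.5, 0.0, 0.0), "CG": (6.5, 0.0, 0.0)}
print(ca_contact(res_1, res_2))          # False (9.0 A apart)
print(cb_contact(res_1, res_2))          # True  (6.0 A apart)
print(side_chain_contact(res_1, res_2))  # True  (3.5 A apart)
```

A pair like this would count as a false positive under a purely Cα-based benchmark even though the side chains interact, which is the kind of underestimation the abstract describes.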
Affiliation(s)
- Adam J Hockenberry
- Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA
- Claus O Wilke
- Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA
|
21
|
Wu Q, Peng Z, Anishchenko I, Cong Q, Baker D, Yang J. Protein contact prediction using metagenome sequence data and residual neural networks. Bioinformatics 2019; 36:41-48. [PMID: 31173061 PMCID: PMC8792440 DOI: 10.1093/bioinformatics/btz477] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 11/20/2018] [Revised: 05/30/2019] [Accepted: 06/04/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION: Almost all protein residue contact prediction methods rely on the availability of deep multiple sequence alignments (MSAs). However, many proteins from poorly populated families do not have a sufficient number of homologs in the conventional UniProt database. Here we aim to solve this issue by exploring the rich sequence data from metagenome sequencing projects. RESULTS: Based on the improved MSA constructed from the metagenome sequence data, we developed MapPred, a new deep learning-based contact prediction method. MapPred consists of two component methods, DeepMSA and DeepMeta, both trained with residual neural networks. DeepMSA was inspired by the recent method DeepCov, which was trained on 441 matrices of covariance features. By considering the symmetry of the contact map, we reduced the number of matrices to 231, which makes the training more efficient in DeepMSA. Experiments show that DeepMSA outperforms DeepCov by 10-13% in precision. DeepMeta works by combining predicted contacts and other sequence profile features. Experiments on three benchmark datasets suggest that the contribution from the metagenome sequence data is significant, with P-values less than 4.04E-17. MapPred is shown to be complementary and comparable to the state-of-the-art methods. The success of MapPred is attributed to three factors: the deeper MSA from the metagenome sequence data, improved feature design in DeepMSA and optimized training by the residual neural networks. AVAILABILITY AND IMPLEMENTATION: http://yanglab.nankai.edu.cn/mappred/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
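The reduction from 441 to 231 matrices is consistent with simple counting, assuming an alphabet of 21 symbols (20 amino acids plus a gap character; the abstract does not spell this out). There are 21 × 21 = 441 ordered symbol pairs, but only 21 × 22 / 2 = 231 unordered pairs once (a, b) and (b, a) are treated as equivalent. The short check below enumerates both sets; the alphabet string is an assumption.

```python
from itertools import combinations_with_replacement, product

# Assumed alphabet: 20 standard amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

ordered_pairs = list(product(ALPHABET, repeat=2))                     # (a, b) and (b, a) distinct
unordered_pairs = list(combinations_with_replacement(ALPHABET, 2))    # (a, b) same as (b, a)

print(len(ALPHABET), len(ordered_pairs), len(unordered_pairs))  # 21 441 231
```

Under this reading, each of the 231 feature matrices corresponds to one unordered symbol pair, so roughly half of the input channels are dropped without losing information.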
Affiliation(s)
- Qi Wu
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
- Zhenling Peng
- To whom correspondence should be addressed. E-mail: or
- Ivan Anishchenko
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Qian Cong
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- David Baker
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Jianyi Yang
- To whom correspondence should be addressed. E-mail: or
|
22
|
Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods 2019; 166:4-21. [PMID: 31022451 DOI: 10.1016/j.ymeth.2019.04.008] [Citation(s) in RCA: 152] [Impact Index Per Article: 25.3] [Received: 11/14/2018] [Revised: 03/23/2019] [Accepted: 04/15/2019] [Indexed: 12/13/2022] Open
Abstract
Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated into the vast majority of analysis pipelines. In this review, we provide both an accessible introduction to deep learning and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems for which deep learning is suitable. We then introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. Next, we provide eight examples, covering five bioinformatics research directions and all four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples.
|
23
|
Kalkatawi M, Magana-Mora A, Jankovic B, Bajic VB. DeepGSR: an optimized deep-learning structure for the recognition of genomic signals and regions. Bioinformatics 2019; 35:1125-1132. [PMID: 30184052 PMCID: PMC6449759 DOI: 10.1093/bioinformatics/bty752] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 12/21/2017] [Revised: 07/15/2018] [Accepted: 08/31/2018] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION: Recognition of different genomic signals and regions (GSRs) in DNA is crucial for understanding genome organization, gene regulation, and gene function, which in turn generate better genome and gene annotations. Although many methods have been developed to recognize GSRs, their pure computational identification remains challenging. Moreover, various GSRs usually require a specialized set of features for developing robust recognition models. Recently, deep-learning (DL) methods have been shown to generate more accurate prediction models than 'shallow' methods without the need to develop specialized features for the problems in question. Here, we explore the potential use of DL for the recognition of GSRs. RESULTS: We developed DeepGSR, an optimized DL architecture for the prediction of different types of GSRs. The performance of the DeepGSR structure is evaluated on the recognition of polyadenylation signals (PAS) and translation initiation sites (TIS) of different organisms: human, mouse, bovine and fruit fly. The results show that DeepGSR outperformed the state-of-the-art methods, reducing the classification error rate of the PAS and TIS prediction in the human genome by up to 29% and 86%, respectively. Moreover, the cross-organism and genome-wide analyses we performed confirmed the robustness of DeepGSR and provided new insights into the conservation of the examined GSRs across species. AVAILABILITY AND IMPLEMENTATION: DeepGSR is implemented in Python using Keras API; it is available as open-source software and can be obtained at https://doi.org/10.5281/zenodo.1117159. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Manal Kalkatawi
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Arturo Magana-Mora
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Drilling Technology Team, EXPEC-ARC, Saudi Aramco, Dhahran, Saudi Arabia
- Boris Jankovic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Vladimir B Bajic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
|
24
|
Zhao X, Zhang Y, Ning Q, Zhang H, Ji J, Yin M. Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer. J Theor Biol 2019; 467:39-47. [DOI: 10.1016/j.jtbi.2019.01.035] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/13/2018] [Revised: 01/04/2019] [Accepted: 01/30/2019] [Indexed: 01/15/2023]
|
25
|
Dehghani T, Naghibzadeh M, Eghdami M. BetaDL: A protein beta-sheet predictor utilizing a deep learning model and independent set solution. Comput Biol Med 2019; 104:241-249. [PMID: 30530227 DOI: 10.1016/j.compbiomed.2018.11.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/27/2018] [Revised: 11/23/2018] [Accepted: 11/27/2018] [Indexed: 10/27/2022]
Abstract
The sequence-based prediction of beta-residue contacts and beta-sheet structures contains key information for protein structure prediction. However, the determination of beta-sheet structures poses numerous challenges due to long-range beta-residue interactions and the huge number of possible beta-sheet structures. The prediction of residue contacts based on deep learning models has recently gained attention, and its results have led to improvements in protein structure prediction. In addition, to reduce the computational complexity of determining beta-sheet structures, it has been suggested that this problem be transformed into graph-based solutions. Consequently, the current work proposes BetaDL, a beta-sheet structure predictor that combines deep learning with a graph-based approach. BetaDL adopts deep learning models to capture beta-residue contacts and improve beta-sheet structure predictions. In addition, a graph-based approach is presented to model the conformational space of beta-sheets, and a new score function is introduced to evaluate beta-sheets. Furthermore, the present study demonstrates that the beta-sheet structure can be predicted within an acceptable computational time by using a heuristic maximum weight independent set solution. When compared to state-of-the-art methods, experimental results from the BetaSheet916 and BetaSheet1452 datasets indicate that BetaDL improves the accuracy of beta-residue contact and beta-sheet structure prediction. Using BetaDL, beta-sheet structures are predicted with a 4% and 6% improvement in the F1-score at the residue and strand levels, respectively. BetaDL's source code and data are available at http://kerg.um.ac.ir/index.php/datasets/#BetaDL.
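The abstract does not describe which heuristic BetaDL uses for the maximum weight independent set problem, so the sketch below shows only a generic greedy heuristic for that problem class: repeatedly pick the highest-weight remaining node and discard everything that conflicts with it. In a beta-sheet setting the nodes could stand for candidate strand pairings and the edges for mutually incompatible pairings; that mapping, the function name, and the toy data are assumptions for illustration.

```python
def greedy_independent_set(weights, conflicts):
    """Greedy heuristic for maximum weight independent set (not guaranteed optimal).

    weights   -- dict mapping node -> weight (e.g. score of a candidate pairing)
    conflicts -- dict mapping node -> set of nodes that cannot be chosen with it
    """
    chosen = []
    blocked = set()
    for node in sorted(weights, key=weights.get, reverse=True):
        if node in blocked:
            continue
        chosen.append(node)
        blocked.add(node)
        blocked.update(conflicts.get(node, ()))  # neighbours can no longer be picked
    return chosen


# Hypothetical candidates: B conflicts with both A and C.
toy_weights = {"A": 0.9, "B": 0.8, "C": 0.7}
toy_conflicts = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
selection = greedy_independent_set(toy_weights, toy_conflicts)
print(selection)                                # ['A', 'C']
print(sum(toy_weights[n] for n in selection))   # total weight of the selection
```

The greedy rule runs in O(n log n) time plus the size of the conflict lists, which is why heuristics of this kind are attractive when the exact problem is NP-hard.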
Affiliation(s)
- Toktam Dehghani
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
- Mahmoud Naghibzadeh
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
- Mahdie Eghdami
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
|
26
|
Ding W, Mao W, Shao D, Zhang W, Gong H. DeepConPred2: An Improved Method for the Prediction of Protein Residue Contacts. Comput Struct Biotechnol J 2018; 16:503-510. [PMID: 30505403 PMCID: PMC6247404 DOI: 10.1016/j.csbj.2018.10.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 09/14/2018] [Revised: 10/16/2018] [Accepted: 10/18/2018] [Indexed: 12/18/2022] Open
Abstract
Information on residue-residue contacts is essential for understanding the mechanism of protein folding, and has been successfully applied as special topological restraints to simplify the conformational sampling in de novo protein structure prediction. Prediction of protein residue contacts has made remarkably rapid progress recently, with prediction accuracy approaching impressively high levels in the past two years. In this work, we introduce a second version of our residue contact predictor, DeepConPred2, which exhibits substantially improved performance and sufficiently reduced running time after model re-optimization and feature updates. When tested on the CASP12 free modeling targets, our program reaches at least the same level of prediction accuracy as the best contact predictors so far and provides information complementary to other state-of-the-art methods in contact-assisted folding.
Affiliation(s)
- Wenze Ding
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Wenzhi Mao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Di Shao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Wenxuan Zhang
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
|
27
|
Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 2018; 34:3308-3315. [PMID: 29718112 PMCID: PMC6157083 DOI: 10.1093/bioinformatics/bty341] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Received: 10/12/2017] [Revised: 03/06/2018] [Accepted: 04/25/2018] [Indexed: 12/22/2022] Open
Abstract
Motivation: In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation.
Results: Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions.
Availability and implementation: DeepCov is freely available at https://github.com/psipred/DeepCov.
Supplementary information: Supplementary data are available at Bioinformatics online.
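The "pair frequency or covariance data derived directly from sequence alignments" mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the DeepCov code base: the function names, the alphabet ordering, and the absence of sequence weighting or regularisation are all assumptions made for illustration. It computes, for every column pair (i, j) and residue pair (a, b), the covariance f_ij(a, b) - f_i(a) f_j(b).

```python
import numpy as np

# Assumed alphabet: 20 amino acids plus a gap symbol (ordering is arbitrary here).
ALPHABET = "ARNDCQEGHILKMFPSTWYV-"


def one_hot_msa(seqs):
    """Encode an aligned list of equal-length sequences as (n_seqs, L, 21)."""
    index = {c: k for k, c in enumerate(ALPHABET)}
    gap = index["-"]
    x = np.zeros((len(seqs), len(seqs[0]), len(ALPHABET)))
    for s, seq in enumerate(seqs):
        for i, c in enumerate(seq):
            # Unknown characters are treated as gaps (an assumption).
            x[s, i, index.get(c, gap)] = 1.0
    return x


def pair_covariance(seqs):
    """Return an (L, L, 21, 21) array of raw, unweighted pair covariances."""
    x = one_hot_msa(seqs)
    n = x.shape[0]
    f_i = x.mean(axis=0)                          # single-column frequencies
    f_ij = np.einsum("sia,sjb->ijab", x, x) / n   # pair frequencies
    return f_ij - f_i[:, None, :, None] * f_i[None, :, None, :]
```

For an alignment of length L, the result can be reshaped to L x L x 441 and passed to a 2D convolutional network as a multi-channel image; the reshaping and the network itself are left out of this sketch.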
Affiliation(s)
- David T Jones
- Department of Computer Science, University College London, London, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, UK
- Shaun M Kandathil
- Department of Computer Science, University College London, London, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, UK
28
Mao W, Wang T, Zhang W, Gong H. Identification of residue pairing in interacting β-strands from a predicted residue contact map. BMC Bioinformatics 2018; 19:146. [PMID: 29673311 PMCID: PMC5907701 DOI: 10.1186/s12859-018-2150-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 12/04/2017] [Accepted: 04/09/2018] [Indexed: 12/04/2022]
Abstract
Background: Despite the rapid progress of protein residue contact prediction, predicted residue contact maps frequently contain many errors. However, information of residue pairing in β strands could be extracted from a noisy contact map, due to the presence of characteristic contact patterns in β-β interactions. This information may benefit the tertiary structure prediction of mainly β proteins. In this work, we propose a novel ridge-detection-based β-β contact predictor to identify residue pairing in β strands from any predicted residue contact map.
Results: Our algorithm RDb2C adopts ridge detection, a well-developed technique in computer image processing, to capture consecutive residue contacts, and then utilizes a novel multi-stage random forest framework to integrate the ridge information and additional features for prediction. Starting from the predicted contact map of CCMpred, RDb2C remarkably outperforms all state-of-the-art methods on two conventional test sets of β proteins (BetaSheet916 and BetaSheet1452), and achieves F1-scores of ~62% and ~76% at the residue level and strand level, respectively. Taking the prediction of the more advanced RaptorX-Contact as input, RDb2C achieves impressively higher performance, with F1-scores reaching ~76% and ~86% at the residue level and strand level, respectively. In a test of structural modeling using the top 1L predicted contacts as constraints, for 61 mainly β proteins, the average TM-score achieves 0.442 when using the raw RaptorX-Contact prediction, but increases to 0.506 when using the improved prediction by RDb2C.
Conclusion: Our method can significantly improve the prediction of β-β contacts from any predicted residue contact maps. Prediction results of our algorithm could be directly applied to effectively facilitate the practical structure prediction of mainly β proteins.
Availability: All source data and codes are available at http://166.111.152.91/Downloads.html or the GitHub address of https://github.com/wzmao/RDb2C.
Electronic supplementary material: The online version of this article (10.1186/s12859-018-2150-1) contains supplementary material, which is available to authorized users.
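To make the idea of ridge detection on a contact map concrete, here is a minimal sketch of one generic ridge measure from image processing: the negated smaller eigenvalue of the local Hessian. This is a swapped-in textbook technique, not the RDb2C implementation; the function name `ridge_strength` and the use of plain finite differences without smoothing are assumptions made for illustration.

```python
import numpy as np


def ridge_strength(prob_map):
    """Hessian-based ridge measure for a 2D contact-probability map.

    Returns max(-lambda_min, 0) per cell, where lambda_min is the smaller
    eigenvalue of the finite-difference Hessian. Large values mark cells
    lying on a narrow, elongated band of high probability.
    """
    p = np.asarray(prob_map, dtype=float)
    gy, gx = np.gradient(p)        # first derivatives along rows, columns
    gyy, gyx = np.gradient(gy)     # second derivatives
    gxy, gxx = np.gradient(gx)
    mixed = 0.5 * (gxy + gyx)      # symmetrised mixed derivative
    half_trace = 0.5 * (gxx + gyy)
    root = np.sqrt((0.5 * (gxx - gyy)) ** 2 + mixed ** 2)
    lam_min = half_trace - root    # smaller eigenvalue of the 2x2 Hessian
    return np.maximum(-lam_min, 0.0)
```

An anti-diagonal band of high contact probability, the pattern expected for antiparallel strand pairing, scores high under this measure, while flat background regions score zero. In practice the map would usually be smoothed first, and the ridge scores would be only one of several features fed to a downstream classifier.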
Affiliation(s)
- Wenzhi Mao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China; Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China
- Tong Wang
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China; Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China
- Wenxuan Zhang
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China; Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China
- Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China; Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China