1
|
Zhang S, Ma A, Xie X, Lian Z, Wang Y. CacPred: a cascaded convolutional neural network for TF-DNA binding prediction. BMC Genomics 2025; 26:264. [PMID: 40102719 PMCID: PMC11916463 DOI: 10.1186/s12864-025-11399-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 02/21/2025] [Indexed: 03/20/2025]
Abstract
BACKGROUND Transcription factors (TFs) regulate gene expression by binding to DNA sequences. Aligned TF binding sites (TFBSs) of the same TF are seen as cis-regulatory motifs, and substantial computational effort has been invested in finding motifs. In recent years, convolutional neural networks (CNNs) have succeeded in TF-DNA binding prediction, but the accuracy of existing deep learning (DL) methods needs to be improved, and the role of convolution in TF-DNA binding prediction should be explored further. RESULTS We develop a cascaded convolutional neural network model named CacPred to predict TF-DNA binding on 790 chromatin immunoprecipitation-sequencing (ChIP-seq) datasets and seven ChIP-nexus (chromatin immunoprecipitation experiments with nucleotide resolution through exonuclease, unique barcode, and single ligation) datasets. We compare CacPred to six existing DL models across nine standard evaluation metrics. Our results indicate that CacPred outperforms all comparison models for TF-DNA binding prediction: the average accuracy (ACC), Matthews correlation coefficient (MCC), and area of eight metrics radar (AEMR) improve by 3.3%, 9.2%, and 6.4% on the 790 ChIP-seq datasets, and by 5.5%, 16.8%, and 12.9% on the seven ChIP-nexus datasets. To explain the proposed method, motifs are used to show the features CacPred learned. The results show that CacPred can find some significant motifs in input sequences. CONCLUSIONS This paper indicates that CacPred performs better than existing models on ChIP-seq data. The seven ChIP-nexus datasets are also analyzed, and the results agree with those on ChIP-seq data, where our proposed method performs best. CacPred relies only on convolution, suggesting that the pooling used in existing models loses some sequence information. Some significant motifs are found, showing that CacPred can learn features from input sequences.
In this study, we demonstrate that CacPred is an effective and feasible model for predicting TF-DNA binding. CacPred is freely available at https://github.com/zhangsq06/CacPred .
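Two of the metrics named in this abstract, ACC and MCC, have standard definitions in terms of confusion-matrix counts. The following is a minimal sketch of those textbook formulas, not code from the CacPred repository; the function names and the example counts are illustrative assumptions.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal sum is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for one binary bound/unbound test set.
print(accuracy(40, 45, 5, 10))
print(round(mcc(40, 45, 5, 10), 4))
```

MCC uses all four cells of the confusion matrix, so it is less flattering than ACC when the classes are imbalanced. That is one reason the two metrics can move by different amounts between models.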
Affiliation(s)
- Shuangquan Zhang
- School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Anjun Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Xuping Xie
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Zhichao Lian
- School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China.
| |
|
2
|
Han B, Zhang Y, Li L, Gong X, Xia K. TopoQA: a topological deep learning-based approach for protein complex structure interface quality assessment. Brief Bioinform 2025; 26:bbaf083. [PMID: 40062613 PMCID: PMC11891663 DOI: 10.1093/bib/bbaf083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2024] [Revised: 01/11/2025] [Accepted: 02/17/2025] [Indexed: 05/13/2025]
Abstract
Even with the significant advances of AlphaFold-Multimer (AF-Multimer) and AlphaFold3 (AF3) in protein complex structure prediction, their accuracy is still not comparable with that of monomer structure prediction. Efficient and effective quality assessment (QA) methods, also called estimation of model accuracy methods, that can evaluate the quality of predicted protein complexes without knowing their native structures are of key importance for protein structure generation and model selection. In this paper, we leverage persistent homology (PH) to capture the atomic-level topological information around residues and design a topological deep learning-based QA method, TopoQA, to assess the accuracy of protein complex interfaces. We integrate PH from topological data analysis into graph neural networks (GNNs) to characterize complex higher-order structures that GNNs might overlook, enhancing the learning of the relationship between the topological structure of complex interfaces and quality scores. Our TopoQA model is extensively validated on the two most widely used benchmark datasets, Docking Benchmark5.5 AF2 (DBM55-AF2) and Heterodimer-AF2 (HAF2), along with our newly constructed ABAG-AF3 dataset to facilitate comparisons with AF3. For all three datasets, TopoQA outperforms AF-Multimer-based AF2Rank and shows an advantage over AF3 in nearly half of the targets. In particular, on the DBM55-AF2 dataset, TopoQA obtains a ranking loss 73.6% lower than that of AF-Multimer-based AF2Rank. Beyond AF-Multimer and AF3, we have also compared extensively with nearly all state-of-the-art models (as far as we know) and found that TopoQA achieves the highest Top 10 Hit-rate on the DBM55-AF2 dataset and the lowest ranking loss on the HAF2 dataset. Ablation experiments show that our topological features significantly improve the model's performance. At the same time, our method also provides a new paradigm for protein structure representation learning.
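The ranking loss reported above is commonly defined as the gap between the true quality of the best model in a pool and the true quality of the model that the QA method ranks first. The sketch below assumes that common definition (the paper's exact formulation may differ), and the function name and scores are hypothetical.

```python
from typing import Sequence


def ranking_loss(true_scores: Sequence[float],
                 predicted_scores: Sequence[float]) -> float:
    """True quality of the best model minus true quality of the
    model the predictor ranks first (0.0 means a perfect pick)."""
    if not true_scores or len(true_scores) != len(predicted_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    # Index of the model with the highest predicted score.
    top_idx = max(range(len(predicted_scores)),
                  key=predicted_scores.__getitem__)
    return max(true_scores) - true_scores[top_idx]


# Hypothetical pool of three decoys for one target.
true_quality = [0.2, 0.8, 0.5]   # e.g. DockQ against the native structure
predicted = [0.1, 0.4, 0.9]      # scores from a QA method
print(ranking_loss(true_quality, predicted))
```

A per-dataset figure is then typically the mean of this value over all targets, which is why a lower ranking loss indicates better model selection.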
Affiliation(s)
- Bingqing Han
- Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| | - Yipeng Zhang
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| | - Longlong Li
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
- School of Mathematics, Shandong University, Jinan 250100, China
- Data Science Institute, Shandong University, Jinan 250100, China
| | - Xinqi Gong
- Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
| | - Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| |
|
3
|
Liang F, Sun M, Xie L, Zhao X, Liu D, Zhao K, Zhang G. Recent advances and challenges in protein complex model accuracy estimation. Comput Struct Biotechnol J 2024; 23:1824-1832. [PMID: 38707538 PMCID: PMC11066466 DOI: 10.1016/j.csbj.2024.04.049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Revised: 04/18/2024] [Accepted: 04/18/2024] [Indexed: 05/07/2024]
Abstract
Estimation of model accuracy (EMA) plays a crucial role in protein structure prediction, aiming to evaluate the quality of predicted protein structure models accurately and objectively. This process is not only key to screening candidate models that are close to the real structure, but also provides guidance for further optimization of protein structures. With the significant advancements made by AlphaFold2 in monomer structure prediction, the problem of single-domain protein structure prediction has been largely solved. Correspondingly, the importance of assessing the quality of single-domain protein models has decreased, and the research focus has shifted to EMA of protein complexes. In this review, our goal is to provide a comprehensive overview of the reference and statistical metrics, representative methods, and current challenges within four distinct facets of complex EMA: Topology Global Score, Interface Total Score, Interface Residue-Wise Score, and Tertiary Residue-Wise Score.
Affiliation(s)
| | | | - Lei Xie
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Xuanfeng Zhao
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Dong Liu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Kailong Zhao
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Guijun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| |
|
4
|
Srivastava V, Naik B, Godara P, Das D, Mattaparthi VSK, Prusty D. Identification of FDA-approved drugs with triple targeting mode of action for the treatment of monkeypox: a high throughput virtual screening study. Mol Divers 2024; 28:1093-1107. [PMID: 37079243 PMCID: PMC10116100 DOI: 10.1007/s11030-023-10636-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/19/2022] [Accepted: 03/17/2023] [Indexed: 04/21/2023]
Abstract
According to the Centers for Disease Control and Prevention, as of August 23, 94 countries had confirmed 42,954 monkeypox virus cases. As specific monkeypox drugs have not yet been developed, treatment depends on repurposed FDA-approved drugs. According to a recent study, the monkeypox outbreak is caused by a strain with a unique mutation, raising the likelihood that the virus will develop resistance to current drugs by acquiring mutations in their targets. The probability of simultaneous mutations in two or more drug targets is always lower than that of a mutation in a single drug target. Therefore, we identified 15 triple-targeting FDA-approved drugs that can inhibit three viral targets (topoisomerase 1, p37, and thymidylate kinase) using a high-throughput virtual screening approach. Further, molecular dynamics simulation of top hits such as Naldemedine and Saquinavir with their respective targets indicates that the ligand-protein complexes remain conformationally stable in a dynamic biological environment. We suggest further research on these triple-targeting molecules to develop an effective therapy for the currently spreading monkeypox.
Affiliation(s)
- Varshita Srivastava
- Department of Biochemistry, School of Life Sciences, Central University of Rajasthan, NH-8, Bandarsindri, Kishangarh, Ajmer, Rajasthan, 305817, India
| | - Biswajit Naik
- Department of Biochemistry, School of Life Sciences, Central University of Rajasthan, NH-8, Bandarsindri, Kishangarh, Ajmer, Rajasthan, 305817, India
| | - Priya Godara
- Department of Biochemistry, School of Life Sciences, Central University of Rajasthan, NH-8, Bandarsindri, Kishangarh, Ajmer, Rajasthan, 305817, India
| | - Dorothy Das
- Molecular Modelling and Simulation Laboratory, Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam, 784028, India
| | - Venkata Satish Kumar Mattaparthi
- Molecular Modelling and Simulation Laboratory, Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam, 784028, India
| | - Dhaneswar Prusty
- Department of Biochemistry, School of Life Sciences, Central University of Rajasthan, NH-8, Bandarsindri, Kishangarh, Ajmer, Rajasthan, 305817, India.
| |
|
5
|
Cerk K, Ugalde‐Salas P, Nedjad CG, Lecomte M, Muller C, Sherman DJ, Hildebrand F, Labarthe S, Frioux C. Community-scale models of microbiomes: Articulating metabolic modelling and metagenome sequencing. Microb Biotechnol 2024; 17:e14396. [PMID: 38243750 PMCID: PMC10832553 DOI: 10.1111/1751-7915.14396] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/09/2023] [Revised: 11/27/2023] [Accepted: 12/20/2023] [Indexed: 01/21/2024]
Abstract
Building models is essential for understanding the functions and dynamics of microbial communities. Metabolic models built on genome-scale metabolic network reconstructions (GENREs) are especially relevant as a means to decipher the complex interactions occurring among species. Model reconstruction increasingly relies on metagenomics, which permits direct characterisation of naturally occurring communities that may contain organisms that cannot be isolated or cultured. In this review, we provide an overview of the field of metabolic modelling and its increasing reliance on and synergy with metagenomics and bioinformatics. We survey the means of assigning functions and reconstructing metabolic networks from (meta-)genomes, and present the variety and mathematical fundamentals of metabolic models that foster the understanding of microbial dynamics. We emphasise the characterisation of interactions and the scaling of model construction to large communities, two important bottlenecks in the applicability of these models. We give an overview of the current state of the art in metagenome sequencing and bioinformatics analysis, focusing on the reconstruction of genomes in microbial communities. Metagenomics benefits tremendously from third-generation sequencing, and we discuss the opportunities of long-read sequencing, strain-level characterisation and eukaryotic metagenomics. We aim at providing algorithmic and mathematical support, together with tool and application resources, that permit bridging the gap between metagenomics and metabolic modelling.
Affiliation(s)
- Klara Cerk
- Quadram Institute Bioscience, Norwich, UK
- Earlham Institute, Norwich, UK
| | | | - Chabname Ghassemi Nedjad
- Inria, University of Bordeaux, INRAE, Talence, France
- University of Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, Talence, France
| | - Maxime Lecomte
- Inria, University of Bordeaux, INRAE, Talence, France
- INRAE STLO, University of Rennes, Rennes, France
| | | | | | - Falk Hildebrand
- Quadram Institute Bioscience, Norwich, UK
- Earlham Institute, Norwich, UK
| | - Simon Labarthe
- Inria, University of Bordeaux, INRAE, Talence, France
- INRAE, University of Bordeaux, BIOGECO, UMR 1202, Cestas, France
| | | |
|
6
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023]
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics poses unique challenges to deep learning, since we expect from it a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
| |
|
7
|
Zhang L, Wang S, Hou J, Si D, Zhu J, Cao R. ComplexQA: a deep graph learning approach for protein complex structure assessment. Brief Bioinform 2023; 24:bbad287. [PMID: 37930021 DOI: 10.1093/bib/bbad287] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/24/2023] [Revised: 05/09/2023] [Accepted: 07/24/2023] [Indexed: 11/07/2023]
Abstract
MOTIVATION In recent years, end-to-end deep learning methods for single-chain protein structure prediction have achieved high accuracy. For example, the state-of-the-art method AlphaFold, developed by Google, has raised the accuracy of protein structure predictions to near-experimental accuracy in some cases. At the same time, there are few methods that can evaluate the quality of protein complexes at the residue level. In particular, evaluating the quality of residues at the interface of protein complexes can lead to a wide range of applications, such as protein function analysis and drug design. In this paper, we introduce a new deep graph neural network-based method, ComplexQA, to evaluate the local quality of interfaces for protein complexes by utilizing residue-level structural information in 3D space and sequence-level constraints. RESULTS We benchmark our method against other state-of-the-art quality assessment approaches on the HAF2 and DBM55-AF2 datasets (high-quality structural models predicted by AlphaFold-Multimer) and the BM5 docking dataset. The experimental results show that our proposed method achieves better or similar performance compared with other state-of-the-art methods, especially on difficult targets that contain only a few acceptable models. Our method is able to suggest a score for each interface residue, making it a powerful assessment tool for the ever-increasing number of protein complexes. AVAILABILITY https://github.com/Cao-Labs/ComplexQA.git. Contact: caora@plu.edu.
Affiliation(s)
- Lei Zhang
- Department of Computer Science and Technology, AnHui University, Hefei, 230601, Anhui, China
| | - Sheng Wang
- Department of Computer Science and Technology, AnHui University, Hefei, 230601, Anhui, China
| | - Jie Hou
- Department of Computer Science, Saint Louis University, Saint Louis, 63103, MO, USA
| | - Dong Si
- Division of Computing and Software Systems, University of Washington Bothell, Bothell, 98011, WA, USA
| | - Junyong Zhu
- Department of Computer Science and Technology, AnHui University, Hefei, 230601, Anhui, China
| | - Renzhi Cao
- Department of Humanities, Pacific Lutheran University, Tacoma, 98447, WA, USA
| |
|
8
|
Zhang P, Xia C, Shen HB. High-accuracy protein model quality assessment using attention graph neural networks. Brief Bioinform 2023; 24:7025462. [PMID: 36736352 DOI: 10.1093/bib/bbac614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Revised: 11/23/2022] [Accepted: 12/12/2022] [Indexed: 02/05/2023]
Abstract
Deep learning has brought great improvement to protein tertiary structure prediction. It is important but very challenging to accurately rank and score decoy structures predicted by different models. CASP14 results show that existing quality assessment (QA) approaches lag behind the development of protein structure prediction methods: almost all existing QA models degrade in accuracy when the target is a decoy of high quality. Accurately assessing high-accuracy decoys is particularly useful given the availability of accurate structure prediction methods. Here we propose a fast and effective single-model QA method, QATEN, which can evaluate decoys using only their topological characteristics and atomic types. Our model uses graph neural networks and attention mechanisms to evaluate global and amino acid level scores, and uses specific loss functions to constrain the network to focus more on high-precision decoys and protein domains. On the CASP14 evaluation decoys, QATEN performs better than other QA models under all correlation coefficients when targeting average LDDT. QATEN shows promising performance when considering only high-accuracy decoys. Compared to the embedded evaluation modules of predicted $C_{\alpha}$-RMSD (pRMSD) in RoseTTAFold and predicted LDDT (pLDDT) in AlphaFold2, QATEN is complementary and capable of achieving better evaluation on some decoy structures generated by AlphaFold2 and RoseTTAFold. These results suggest that the new QATEN approach can be used as a reliable independent assessment algorithm for high-accuracy protein structure decoys.
Affiliation(s)
- Peidong Zhang
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| | - Chunqiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| |
|
9
|
Chen C, Chen X, Morehead A, Wu T, Cheng J. 3D-equivariant graph neural networks for protein model quality assessment. Bioinformatics 2023; 39:6986970. [PMID: 36637199 PMCID: PMC10089647 DOI: 10.1093/bioinformatics/btad030] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 04/12/2022] [Revised: 11/28/2022] [Accepted: 01/12/2023] [Indexed: 01/14/2023]
Abstract
MOTIVATION Quality assessment (QA) of predicted protein tertiary structure models plays an important role in ranking and using them. Recently developed end-to-end deep learning protein structure prediction techniques generate highly confident tertiary structures for most proteins. These models have better quality and different properties than the models predicted by traditional tertiary structure prediction methods, so it is important to explore corresponding QA strategies to evaluate and select them. RESULTS We develop EnQA, a novel graph-based 3D-equivariant neural network method that is equivariant to rotation and translation of 3D objects, to estimate the accuracy of protein structural models by leveraging the structural features acquired from the state-of-the-art tertiary structure prediction method, AlphaFold2. We train and test the method on both traditional model datasets (e.g. the datasets of the Critical Assessment of Techniques for Protein Structure Prediction) and a new dataset of high-quality structural models predicted only by AlphaFold2 for proteins whose experimental structures were released recently. Our approach achieves state-of-the-art performance on protein structural models predicted by both traditional protein structure prediction methods and the latest end-to-end deep learning method, AlphaFold2. It performs even better than the model QA scores provided by AlphaFold2 itself. The results illustrate that the 3D-equivariant graph neural network is a promising approach to the evaluation of protein structural models. Integrating AlphaFold2 features with other complementary sequence and structural features is important for improving protein model QA. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/BioinfoMachineLearning/EnQA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chen Chen
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Xiao Chen
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Alex Morehead
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Tianqi Wu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| |
|
10
|
Zhao C, Liu T, Wang Z. Predicting residue-specific qualities of individual protein models using residual neural networks and graph neural networks. Proteins 2022; 90:2091-2102. [PMID: 35842895 PMCID: PMC9796650 DOI: 10.1002/prot.26400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Revised: 06/24/2022] [Accepted: 07/08/2022] [Indexed: 01/02/2023]
Abstract
The estimation of protein model accuracy (EMA) or model quality assessment (QA) is important for protein structure prediction. An accurate EMA algorithm can guide the refinement of models or pick the best model or best parts of models from a pool of predicted tertiary structures. We developed two novel methods: MASS2 and LAW, for predicting residue-specific or local qualities of individual models, which incorporate residual neural networks and graph neural networks, respectively. These two methods use similar features extracted from protein models but different architectures of neural networks to predict the local accuracies of single models. MASS2 and LAW participated in the QA category of CASP14, and according to our evaluations based on CASP14 official criteria, MASS2 and LAW are the best and second-best methods based on the Z-scores of ASE/100, AUC, and ULR-1.F1. We also evaluated MASS2, LAW, and the residue-specific predicted deviations (between model and native structure) generated by AlphaFold2 on CASP14 AlphaFold2 tertiary structure (TS) models. LAW achieved comparable or better performances compared to the predicted deviations generated by AlphaFold2 on AlphaFold2 TS models, even though LAW was not trained on any AlphaFold2 TS models. Specifically, LAW performed better on AUC and ULR scores, and AlphaFold2 performed better on ASE scores. This means that AlphaFold2 is better at predicting deviations, but LAW is better at classifying accurate and inaccurate residues and detecting unreliable local regions. MASS2 and LAW can be freely accessed from http://dna.cs.miami.edu/MASS2-CASP14/ and http://dna.cs.miami.edu/LAW-CASP14/, respectively.
Affiliation(s)
- Chenguang Zhao
- Department of Computer Science, University of Miami, Coral Gables, Florida, USA
| | - Tong Liu
- Department of Computer Science, University of Miami, Coral Gables, Florida, USA
| | - Zheng Wang
- Department of Computer Science, University of Miami, Coral Gables, Florida, USA
| |
|
11
|
Kaushik R, Zhang KY. An Integrated Protein Structure Fitness Scoring Approach for Identifying Native-Like Model Structures. Comput Struct Biotechnol J 2022; 20:6467-6472. [DOI: 10.1016/j.csbj.2022.11.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Revised: 11/14/2022] [Accepted: 11/14/2022] [Indexed: 11/18/2022]
|
12
|
DeepRHD: An efficient Hybrid feature Extraction technique for protein remote homology detection using Deep learning strategies. Comput Biol Chem 2022; 100:107749. [DOI: 10.1016/j.compbiolchem.2022.107749] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Revised: 07/28/2022] [Accepted: 07/30/2022] [Indexed: 11/19/2022]
|
13
|
Kurniawan J, Ishida T. Protein Model Quality Estimation Using Molecular Dynamics Simulation. ACS Omega 2022; 7:24274-24281. [PMID: 35874260 PMCID: PMC9301944 DOI: 10.1021/acsomega.2c01475] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/15/2023]
Abstract
The estimation of protein model quality remains a challenging task and is important for the use of protein structural models. In the last decade, methods ranging from machine learning to deep learning have been developed and have shown progressive improvement. Despite utilizing more sophisticated techniques and introducing new features, none of these methods employs explicit protein structure stability information. Hypothetically, the quality of a protein model might be indicated by its structural stability in an in silico system, as revealed by how far the structure moves from its initial conformation. One way to exploit such information is molecular dynamics simulation, which has been applied successfully in many research fields. We present a novel approach that introduces explicit protein structure stability information using molecular dynamics simulation. Despite using only simple features, small data with no training process required, and a short molecular dynamics simulation time, our method shows performance comparable to that of a state-of-the-art deep learning-based method.
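One simple way to quantify "structural difference from the initial structure" is the root-mean-square deviation (RMSD) between the starting coordinates and each simulated frame. The sketch below illustrates only that idea and is not the authors' scoring function: it assumes the frames have already been superimposed on the initial structure, and the function names and coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """RMSD between two equally sized, pre-aligned coordinate sets."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must be non-empty and of equal size")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def mean_drift(initial: Sequence[Coord],
               frames: Sequence[Sequence[Coord]]) -> float:
    """Average RMSD of trajectory frames from the initial structure.

    Under the stability hypothesis, a smaller value would suggest
    a more stable and therefore possibly better model.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    return sum(rmsd(initial, frame) for frame in frames) / len(frames)


# Two-atom toy structure and a two-frame toy trajectory.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(mean_drift(start, [start, shifted]))
```

In a real pipeline the frames would first be superimposed on the initial structure (for example with the Kabsch algorithm), so that rigid-body motion is not counted as instability.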
|
14
|
Akhter N, Kabir KL, Chennupati G, Vangara R, Alexandrov BS, Djidjev H, Shehu A. Improved Protein Decoy Selection via Non-Negative Matrix Factorization. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1670-1682. [PMID: 33400654 DOI: 10.1109/tcbb.2020.3049088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
A central challenge in protein modeling research and protein structure prediction in particular is known as decoy selection. The problem refers to selecting biologically-active/native tertiary structures among a multitude of physically-realistic structures generated by template-free protein structure prediction methods. Research on decoy selection is active. Clustering-based methods are popular, but they fail to identify good/near-native decoys on datasets where near-native decoys are severely under-sampled by a protein structure prediction method. Reasonable progress is reported by methods that additionally take into account the internal energy of a structure and employ it to identify basins in the energy landscape organizing the multitude of decoys. These methods, however, incur significant time costs for extracting basins from the landscape. In this paper, we propose a novel decoy selection method based on non-negative matrix factorization. We demonstrate that our method outperforms energy landscape-based methods. In particular, the proposed method addresses both the time cost issue and the challenge of identifying good decoys in a sparse dataset, successfully recognizing near-native decoys for both easy and hard protein targets.
|
15
|
Chen X, Cheng J. DISTEMA: distance map-based estimation of single protein model accuracy with attentive 2D convolutional neural network. BMC Bioinformatics 2022; 23:141. [PMID: 35439931 PMCID: PMC9019949 DOI: 10.1186/s12859-022-04683-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2022] [Accepted: 04/11/2022] [Indexed: 01/25/2023] Open
Abstract
BACKGROUND Estimation of the accuracy (quality) of protein structural models is important for both prediction and use of protein structural models. Deep learning methods have been used to integrate protein structure features to predict the quality of protein models. Inter-residue distances are key information for predicting proteins' tertiary structures and therefore have good potential to predict the quality of protein structural models. However, few methods have been developed to take full advantage of predicted inter-residue distance maps to estimate the accuracy of a single protein structural model. RESULT We developed an attentive 2D convolutional neural network (CNN) with channel-wise attention that takes as its only input a raw difference map between the inter-residue distance map calculated from a single protein model and the distance map predicted from the protein sequence, and predicts the quality of the model. The network comprises multiple convolutional layers, batch normalization layers, dense layers, and Squeeze-and-Excitation blocks with attention to automatically extract features relevant to protein model quality from the raw input without using any expert-curated features. We evaluated DISTEMA's capability of selecting the best models for CASP13 targets in terms of ranking loss of GDT-TS score. The ranking loss of DISTEMA is 0.079, lower than several state-of-the-art single-model quality assessment methods. CONCLUSION This work demonstrates that using raw inter-residue distance information with deep learning can predict the quality of protein structural models reasonably well. DISTEMA is freely available at https://github.com/jianlin-cheng/DISTEMA.
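The raw difference map described in this abstract can be illustrated with a short sketch. It assumes one representative coordinate per residue and a predicted distance map that is already available as a matrix; the names `distance_map` and `difference_map` and the toy values are hypothetical, and the actual DISTEMA preprocessing may differ.

```python
import numpy as np

def distance_map(coords: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances for an (L, 3) array of residue coordinates."""
    delta = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((delta ** 2).sum(axis=-1))

def difference_map(model_coords: np.ndarray, predicted_map: np.ndarray) -> np.ndarray:
    """Model-derived distance map minus the sequence-predicted distance map."""
    return distance_map(model_coords) - predicted_map

# Toy 3-residue model and a made-up predicted distance map (values in angstroms).
model_coords = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 5.0]])
predicted_map = np.array([[0.0, 4.0, 5.0],
                          [4.0, 0.0, 7.0],
                          [5.0, 7.0, 0.0]])
diff = difference_map(model_coords, predicted_map)
```

The resulting L x L matrix is symmetric with a zero diagonal, which is the kind of single-channel, image-like input a 2D CNN can consume directly.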
Affiliation(s)
- Xiao Chen
- grid.134936.a, 0000 0001 2162 3504, Department of Electrical Engineering and Computer Science, University of Missouri Columbia, Columbia, MO 65211 USA
| | - Jianlin Cheng
- grid.134936.a, 0000 0001 2162 3504, Department of Electrical Engineering and Computer Science, University of Missouri Columbia, Columbia, MO 65211 USA
|
16
|
Kabir MN, Wong L. EnsembleFam: towards more accurate protein family prediction in the twilight zone. BMC Bioinformatics 2022; 23:90. [PMID: 35287576 PMCID: PMC8919565 DOI: 10.1186/s12859-022-04626-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/11/2021] [Accepted: 03/02/2022] [Indexed: 11/30/2022] Open
Abstract
Background Current protein family modeling methods, such as profile Hidden Markov Models (pHMMs), k-mer based methods, and deep learning-based methods, do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with known functions. Results We present a novel method, EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides a much more accurate prediction for twilight zone proteins. Conclusions EnsembleFam, a machine learning method to model protein families, can be used to better identify members with very low sequence homology. Using EnsembleFam, protein functions can be predicted from sequence information alone with better accuracy than state-of-the-art methods.
Affiliation(s)
- Mohammad Neamul Kabir
- Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore.
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
|
17
|
Hippe K, Lilley C, William Berkenpas J, Chandana Pocha C, Kishaba K, Ding H, Hou J, Si D, Cao R. ZoomQA: residue-level protein model accuracy estimation with machine learning on sequential and 3D structural features. Brief Bioinform 2022; 23:bbab384. [PMID: 34553747 PMCID: PMC8499977 DOI: 10.1093/bib/bbab384] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/17/2021] [Revised: 08/02/2021] [Accepted: 08/28/2021] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION The Estimation of Model Accuracy problem is a cornerstone problem in the field of Bioinformatics. As of CASP14, there are 79 global QA methods, and a minority of 39 residue-level QA methods, with very few of them working on protein complexes. Here, we introduce ZoomQA, a novel, single-model method for assessing the accuracy of a tertiary protein structure/complex prediction at residue level, which has many applications such as drug discovery. ZoomQA differs from others by considering the change in chemical and physical features of a fragment structure (a portion of a protein within a radius $r$ of the target amino acid) as the radius of contact increases. Fourteen physical and chemical properties of amino acids are used to build a comprehensive representation of every residue within a protein and grade their placement within the protein as a whole. Moreover, we have shown the potential of ZoomQA to identify problematic regions of the SARS-CoV-2 protein complex. RESULTS We benchmark ZoomQA on CASP14, and it outperforms other state-of-the-art local QA methods and rivals state-of-the-art QA methods in global prediction metrics. Our experiment shows the efficacy of these new features and shows that our method is able to match the performance of other state-of-the-art methods without the use of homology searching against databases or PSSM matrices. AVAILABILITY http://zoomQA.renzhitech.com.
Affiliation(s)
- Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, USA
| | - Cade Lilley
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, USA
| | | | | | - Kiyomi Kishaba
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, USA
| | - Hui Ding
- Center for Informational Biology at University of Electronic Science and Technology of China
| | | | - Dong Si
- University of Washington Bothell, USA
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA 98447, USA
|
18
|
Kaushik R, Zhang KYJ. ProFitFun: a protein tertiary structure fitness function for quantifying the accuracies of model structures. Bioinformatics 2022; 38:369-376. [PMID: 34542606 DOI: 10.1093/bioinformatics/btab666] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/23/2021] [Revised: 09/06/2021] [Accepted: 09/16/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION An accurate estimation of the quality of protein model structures is a cornerstone of protein structure prediction. Despite the recent groundbreaking success in the field of protein structure prediction, there is room to improve model quality estimation at multiple stages of protein structure prediction and thus to further push the prediction accuracy. Here, a novel approach, named ProFitFun, for assessing the quality of protein models is proposed by harnessing the sequence and structural features of experimental protein structures in terms of the preferences of backbone dihedral angles and relative surface accessibility of their amino acid residues at the tripeptide level. The proposed approach leverages the backbone dihedral angle and surface accessibility preferences of each residue by accounting for its N-terminal and C-terminal neighbors in the protein structure. These preferences are used to evaluate protein structures through a machine learning approach and tested on an extensive dataset of diverse proteins. RESULTS The approach was extensively validated on a large test dataset (n = 25 005) of protein structures, comprising 23 661 models of 82 non-homologous proteins and 1344 non-homologous experimental structures. In addition, an external dataset of 40 000 models of 200 non-homologous proteins was also used for the validation of the proposed method. Both datasets were further used for benchmarking the proposed method against four different state-of-the-art methods for protein structure quality assessment. In the benchmarking, the proposed method outperformed some state-of-the-art methods in terms of Spearman's and Pearson's correlation coefficients, average GDT-TS loss, sum of z-scores and average absolute difference of predictions over corresponding observed values. The high accuracy of the proposed approach promises a potential use of the sequence and structural features in computational protein design. AVAILABILITY AND IMPLEMENTATION http://github.com/KYZ-LSB/ProTerS-FitFun. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rahul Kaushik
- Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, Yokohama, Kanagawa 230-0045, Japan
| | - Kam Y J Zhang
- Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, Yokohama, Kanagawa 230-0045, Japan
|
19
|
Wang W, Wang J, Li Z, Xu D, Shang Y. MUfoldQA_G: High-accuracy protein model QA via retraining and transformation. Comput Struct Biotechnol J 2021; 19:6282-6290. [PMID: 34900138 PMCID: PMC8636996 DOI: 10.1016/j.csbj.2021.11.021] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/07/2021] [Revised: 11/10/2021] [Accepted: 11/14/2021] [Indexed: 11/21/2022] Open
Abstract
Protein tertiary structure prediction is an active research area and has attracted significant attention recently due to the success of AlphaFold from DeepMind. Methods capable of accurately evaluating the quality of predicted models are of great importance. In the past, although many model quality assessment (QA) methods have been developed, their accuracies are not consistently high across different QA performance metrics for diverse target proteins. In this paper, we propose MUfoldQA_G, a new multi-model QA method that aims at simultaneously optimizing Pearson correlation and average GDT-TS difference, two commonly used QA performance metrics. This method is based on two new algorithms MUfoldQA_Gp and MUfoldQA_Gr. MUfoldQA_Gp uses a new technique to combine information from protein templates and reference protein models to maximize the Pearson correlation QA metric. MUfoldQA_Gr employs a new machine learning technique that resamples training data and retrains adaptively to learn a consensus model that is better than naïve consensus while minimizing average GDT-TS difference. MUfoldQA_G uses a new method to combine the results of MUfoldQA_Gr and MUfoldQA_Gp so that the final QA prediction results achieve low average GDT-TS difference that is close to the results from MUfoldQA_Gr, while maintaining high Pearson correlation that is the same as the results from MUfoldQA_Gp. In CASP14 QA categories, MUfoldQA_G ranked No. 1 in Pearson correlation and No. 2 in average GDT-TS difference.
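The two QA metrics named in this abstract measure different things, which a small sketch can make concrete. It assumes that "average GDT-TS difference" means the mean absolute difference between predicted and true GDT-TS over the models of a target; that reading, the function names, and the toy scores are assumptions rather than definitions taken from the paper.

```python
import numpy as np

def pearson(predicted: np.ndarray, true: np.ndarray) -> float:
    """Pearson correlation between predicted and true quality scores."""
    return float(np.corrcoef(predicted, true)[0, 1])

def average_difference(predicted: np.ndarray, true: np.ndarray) -> float:
    """Mean absolute difference between predicted and true scores (assumed definition)."""
    return float(np.mean(np.abs(predicted - true)))

# Toy target with four models: predictions are uniformly 0.1 too high.
true_gdt = np.array([0.2, 0.4, 0.6, 0.8])
predicted_gdt = true_gdt + 0.1

r = pearson(predicted_gdt, true_gdt)             # ordering is perfectly preserved
d = average_difference(predicted_gdt, true_gdt)  # but every score is offset
```

A constant offset leaves the correlation perfect while the average difference is nonzero, which is why a method can rank well on one metric and poorly on the other, and why optimizing both at once is a distinct goal.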
Affiliation(s)
- Wenbo Wang
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Junlin Wang
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Zhaoyu Li
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211, USA
| | - Yi Shang
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
|
20
|
Ye L, Wu P, Peng Z, Gao J, Liu J, Yang J. Improved estimation of model quality using predicted inter-residue distance. Bioinformatics 2021; 37:3752-3759. [PMID: 34473228 DOI: 10.1093/bioinformatics/btab632] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/27/2021] [Revised: 08/27/2021] [Accepted: 08/31/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Protein model quality assessment (QA) is an essential component in protein structure prediction, which aims to estimate the quality of a structure model and/or select the most accurate model from a pool of structure models, without knowing the native structure. QA remains a challenging task in protein structure prediction. RESULTS Based on the inter-residue distance predicted by the recent deep learning-based structure prediction algorithm trRosetta, we developed QDistance, a new approach to the estimation of both global and local qualities. QDistance works for both single-model and multi-model inputs. We designed several distance-based features to assess the agreement between the predicted and model-derived inter-residue distances. Together with a few widely used features, they are fed into a simple yet powerful linear regression model to infer the global QA scores. The local QA scores for each structure model are predicted based on a comparative analysis with a set of selected reference models. For multi-model input, the reference models are selected from the input based on the predicted global QA scores. For single-model input, the reference models are predicted by trRosetta. With the informative distance-based features, QDistance can predict the global quality with satisfactory accuracy. Benchmark tests on the CASP13 and the CAMEO structure models suggested that QDistance was competitive with other methods. Blind tests in the CASP14 experiments showed that QDistance was robust and ranked among the top predictors. In particular, QDistance was among the top 3 local QA methods and made the most accurate local QA predictions for unreliable local regions. Analysis showed that this superior performance can be attributed to the inclusion of the predicted inter-residue distance. AVAILABILITY AND IMPLEMENTATION http://yanglab.nankai.edu.cn/QDistance. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lisha Ye
- School of Mathematical Sciences, Nankai University, Tianjin, 300071, China
| | - Peikun Wu
- School of Mathematical Sciences, Nankai University, Tianjin, 300071, China
| | - Zhenling Peng
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
| | - Jianzhao Gao
- School of Mathematical Sciences, Nankai University, Tianjin, 300071, China
| | - Jian Liu
- College of Computer Science, Nankai University, Tianjin, 300071, China
| | - Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, 300071, China
|
21
|
Antoniak A, Biskupek I, Bojarski KK, Czaplewski C, Giełdoń A, Kogut M, Kogut MM, Krupa P, Lipska AG, Liwo A, Lubecka EA, Marcisz M, Maszota-Zieleniak M, Samsonov SA, Sieradzan AK, Ślusarz MJ, Ślusarz R, Wesołowski PA, Ziȩba K. Modeling protein structures with the coarse-grained UNRES force field in the CASP14 experiment. J Mol Graph Model 2021; 108:108008. [PMID: 34419932 DOI: 10.1016/j.jmgm.2021.108008] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 05/16/2021] [Revised: 08/12/2021] [Accepted: 08/13/2021] [Indexed: 12/31/2022]
Abstract
The UNited RESidue (UNRES) force field was tested in the 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14), in which larger oligomeric and multimeric targets were present compared to previous editions. Three prediction modes were tested (i) ab initio (the UNRES group), (ii) contact-assisted (the UNRES-contact group), and (iii) template-assisted (the UNRES-template group). For most of the targets, the contact restraints were derived from the server models top-ranked by the DeepQA method, while the DNCON2 method was used for 11 targets. Our consensus-fragment procedure was used to run template-assisted predictions. Each group also processed the Nuclear Magnetic Resonance (NMR)- and Small Angle X-Ray Scattering (SAXS)-data assisted targets. The average Global Distance Test Total Score (GDT_TS) of the 'Model 1' predictions were 29.17, 39.32, and 56.37 for the UNRES, UNRES-contact, and UNRES-template predictions, respectively, increasing by 0.53, 2.24, and 3.76, respectively, compared to CASP13. It was also found that the GDT_TS of the UNRES models obtained in ab initio mode and in the contact-assisted mode decreases with the square root of chain length, while the exponent in this relationship is 0.20 for the UNRES-template group models and 0.11 for the best performing AlphaFold2 models, which suggests that incorporation of database information, which stems from protein evolution, brings in long-range correlations, thus enabling the correction of force-field inaccuracies.
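One way to read the chain-length dependence reported in this abstract is as an inverse power law. The form below is an interpretation rather than a formula quoted from the paper: the symbol $N$ for chain length and the exponent symbol $\alpha$ are introduced here for illustration, and "decreases with the square root of chain length" is taken to mean $\alpha \approx 0.5$.

```latex
\[
\mathrm{GDT\_TS}(N) \propto N^{-\alpha},
\qquad
\alpha \approx
\begin{cases}
0.5  & \text{UNRES (ab initio) and UNRES-contact models} \\
0.20 & \text{UNRES-template models} \\
0.11 & \text{best-performing AlphaFold2 models}
\end{cases}
\]
```

Under this reading, a smaller exponent means accuracy degrades more slowly as chains get longer, consistent with the abstract's point that database-derived information brings in long-range correlations.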
Affiliation(s)
- Anna Antoniak
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Iga Biskupek
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Krzysztof K Bojarski
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Cezary Czaplewski
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Artur Giełdoń
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Mateusz Kogut
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Małgorzata M Kogut
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Paweł Krupa
- Institute of Physics, Polish Academy of Sciences, Aleja Lotników 32/46, Warsaw, PL-02668, Poland
| | - Agnieszka G Lipska
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Adam Liwo
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland; School of Computational Sciences, Korea Institute for Advanced Study, 87 Hoegiro, Dongdaemun-gu, 130-722, Seoul, Republic of Korea.
| | - Emilia A Lubecka
- Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, G. Narutowicza 11/12, 80-233, Gdańsk, Poland
| | - Mateusz Marcisz
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland; Intercollegiate Faculty of Biotechnology, University of Gdańsk and Medical University of Gdańsk, ul. Abrahama 58, 80-307, Gdańsk, Poland
| | | | - Sergey A Samsonov
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Adam K Sieradzan
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Magdalena J Ślusarz
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Rafał Ślusarz
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
| | - Patryk A Wesołowski
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland; Intercollegiate Faculty of Biotechnology, University of Gdańsk and Medical University of Gdańsk, ul. Abrahama 58, 80-307, Gdańsk, Poland
| | - Karolina Ziȩba
- Faculty of Chemistry, University of Gdańsk, Wita Stwosza 63, 80-308, Gdańsk, Poland
|
22
|
Routray M, Vipsita S. Protein remote homology detection combining PCA and multiobjective optimization tools. Evolutionary Intelligence 2021. [DOI: 10.1007/s12065-021-00642-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
23
|
Liu J, Wu T, Guo Z, Hou J, Cheng J. Improving protein tertiary structure prediction by deep learning and distance prediction in CASP14. Proteins 2021; 90:58-72. [PMID: 34291486 PMCID: PMC8671168 DOI: 10.1002/prot.26186] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 03/23/2021] [Revised: 06/21/2021] [Accepted: 07/12/2021] [Indexed: 12/15/2022]
Abstract
Substantial progress in protein structure prediction has been made by utilizing deep‐learning and residue‐residue distance prediction since CASP13. Inspired by the advances, we improve our CASP14 MULTICOM protein structure prediction system by incorporating three new components: (a) a new deep learning‐based protein inter‐residue distance predictor to improve template‐free (ab initio) tertiary structure prediction, (b) an enhanced template‐based tertiary structure prediction method, and (c) distance‐based model quality assessment methods empowered by deep learning. In the 2020 CASP14 experiment, the MULTICOM predictor was ranked seventh out of 146 predictors in tertiary structure prediction and third out of 136 predictors in inter‐domain structure prediction. The results demonstrate that the template‐free modeling based on deep learning and residue‐residue distance prediction can predict the correct topology for almost all template‐based modeling targets and a majority of hard targets (template‐free targets or targets whose templates cannot be recognized), which is a significant improvement over the CASP13 MULTICOM predictor. Moreover, the template‐free modeling performs better than the template‐based modeling on not only hard targets but also the targets that have homologous templates. The performance of the template‐free modeling largely depends on the accuracy of distance prediction, which is closely related to the quality of multiple sequence alignments. The structural model quality assessment works well on targets for which enough good models can be predicted, but it may perform poorly when only a few good models are predicted for a hard target and the distribution of model quality scores is highly skewed. MULTICOM is available at https://github.com/jianlin-cheng/MULTICOM_Human_CASP14/tree/CASP14_DeepRank3 and https://github.com/multicom-toolbox/multicom/tree/multicom_v2.0.
Affiliation(s)
- Jian Liu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Tianqi Wu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
|
24
|
Min X, Lu F, Li C. Sequence-Based Deep Learning Frameworks on Enhancer-Promoter Interactions Prediction. Curr Pharm Des 2021; 27:1847-1855. [PMID: 33234095 DOI: 10.2174/1381612826666201124112710] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/15/2020] [Revised: 07/29/2020] [Accepted: 08/06/2020] [Indexed: 11/22/2022]
Abstract
Enhancer-promoter interactions (EPIs) in the human genome are of great significance to transcriptional regulation, which tightly controls gene expression. Identification of EPIs can help us better decipher gene regulation and understand disease mechanisms. However, experimental methods to identify EPIs are constrained by funds, time, and manpower, while computational methods using DNA sequences and genomic features are viable alternatives. Deep learning methods have shown promise in classification and have been applied to identify EPIs. In this survey, we specifically focus on sequence-based deep learning methods and conduct a comprehensive review of the literature. First, we briefly introduce existing sequence-based frameworks for EPI prediction and their technical details. After that, we elaborate on the datasets, pre-processing methods, and evaluation strategies. Finally, we conclude with the challenges these methods face and suggest several future opportunities. We hope this review will provide a useful reference for further studies on enhancer-promoter interactions.
Affiliation(s)
- Xiaoping Min
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Fengqing Lu
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chunyan Li
- Graduate School, Yunnan Minzu University, Kunming 650504, China
|
25
|
Protein model accuracy estimation empowered by deep learning and inter-residue distance prediction in CASP14. Sci Rep 2021; 11:10943. [PMID: 34035363 PMCID: PMC8149836 DOI: 10.1038/s41598-021-90303-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/09/2021] [Accepted: 05/10/2021] [Indexed: 11/28/2022] Open
Abstract
Inter-residue contact prediction and deep learning showed promise for improving the estimation of protein model accuracy (EMA) in the 13th Critical Assessment of Protein Structure Prediction (CASP13). To further leverage the improved inter-residue distance predictions to enhance EMA, during the 2020 CASP14 experiment, we integrated several new inter-residue distance features with the existing model quality assessment features in several deep learning methods to predict the quality of protein structural models. According to the evaluation of performance in selecting the best model from the models of CASP14 targets, our three multi-model EMA predictors (MULTICOM-CONSTRUCT, MULTICOM-AI, and MULTICOM-CLUSTER) achieve average losses of 0.073, 0.079, and 0.081, respectively, in terms of the global distance test score (GDT-TS). The three methods are ranked first, second, and third out of all 68 CASP14 predictors. MULTICOM-DEEP, the single-model EMA predictor, is ranked within the top 10 among all the single-model EMA methods according to GDT-TS score loss. The results demonstrate that inter-residue distance features are valuable inputs for deep learning to predict the quality of protein structural models. However, larger training datasets and better ways of leveraging inter-residue distance information are needed to fully explore its potential.
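The GDT-TS loss used to rank these predictors can be sketched as follows, assuming the common definition: for each target, the true GDT-TS of the best available model minus the true GDT-TS of the model the predictor ranks first, averaged over targets. The function name `ranking_loss`, the target labels, and all scores below are made up for illustration.

```python
import numpy as np

def ranking_loss(true_scores, predicted_scores) -> float:
    """True score of the best model minus true score of the top-ranked model."""
    true_scores = np.asarray(true_scores, dtype=float)
    predicted_scores = np.asarray(predicted_scores, dtype=float)
    top_ranked = int(np.argmax(predicted_scores))  # model the predictor would select
    return float(true_scores.max() - true_scores[top_ranked])

# Hypothetical targets: (true GDT-TS per model, predicted quality per model).
targets = {
    "target_a": ([0.80, 0.70, 0.55], [0.60, 0.75, 0.40]),  # picks the second-best model
    "target_b": ([0.50, 0.65, 0.30], [0.20, 0.90, 0.10]),  # picks the best model
}
losses = {name: ranking_loss(true, pred) for name, (true, pred) in targets.items()}
mean_loss = float(np.mean(list(losses.values())))
```

A loss of zero on a target means the predictor selected the best model available; the averaged value is the kind of figure reported in the abstract, on a 0 to 1 GDT-TS scale.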
|
26
|
Zhang ZM, Guan ZX, Wang F, Zhang D, Ding H. Application of Machine Learning Methods in Predicting Nuclear Receptors and their Families. Med Chem 2021; 16:594-604. [PMID: 31584374 DOI: 10.2174/1573406415666191004125551] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/20/2019] [Revised: 06/18/2019] [Accepted: 08/23/2019] [Indexed: 11/22/2022]
Abstract
Nuclear receptors (NRs) are a superfamily of ligand-dependent transcription factors that are closely related to cell development, differentiation, reproduction, homeostasis, and metabolism. According to the alignments of the conserved domains, NRs are classified into the following seven or eight subfamilies: (1) NR1: thyroid hormone like (thyroid hormone, retinoic acid, RAR-related orphan receptor, peroxisome proliferator activated, vitamin D3-like), (2) NR2: HNF4-like (hepatocyte nuclear factor 4, retinoic acid X, tailless-like, COUP-TF-like, USP), (3) NR3: estrogen-like (estrogen, estrogen-related, glucocorticoid-like), (4) NR4: nerve growth factor IB-like (NGFI-B-like), (5) NR5: fushi tarazu-F1 like (fushi tarazu-F1 like), (6) NR6: germ cell nuclear factor like (germ cell nuclear factor), and (7) NR0: knirps like (knirps, knirps-related, embryonic gonad protein, ODR7, trithorax) and DAX like (DAX, SHP), or dividing NR0 into (7) NR7: knirps like and (8) NR8: DAX like. Different NR families have different structural features and functions. Since the function of an NR is closely correlated with the subfamily it belongs to, it is highly desirable to identify NRs and their subfamilies rapidly and effectively. The knowledge acquired is essential for a proper understanding of normal and abnormal cellular mechanisms. With the advent of the post-genomics era, the number of proteins with known sequences has increased explosively. Conventional methods for accurately classifying the family of NRs are experimental, with high cost and low efficiency. This has created a greater need for bioinformatics tools to effectively recognize NRs and their subfamilies for the purpose of understanding their biological function. In this review, we summarize the application of machine learning methods in the prediction of NRs from different aspects. We hope that this review will provide a reference for further research on the classification of NRs and their families.
Affiliation(s)
- Zi-Mei Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zheng-Xing Guan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fang Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Dan Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
27
|
Dao FY, Lv H, Su W, Sun ZJ, Huang QL, Lin H. iDHS-Deep: an integrated tool for predicting DNase I hypersensitive sites by deep neural network. Brief Bioinform 2021; 22:6158360. [PMID: 33751027 DOI: 10.1093/bib/bbab047] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 12/25/2020] [Revised: 01/28/2021] [Accepted: 01/29/2021] [Indexed: 01/09/2023] Open
Abstract
DNase I hypersensitive sites (DHSs) are chromatin regions that are hypersensitive to cleavage by the DNase I enzyme. They are an important part of the noncoding genome and contain a variety of regulatory elements, such as promoters, enhancers, and transcription factor-binding sites. Moreover, disease- or trait-associated loci are usually enriched in DHS regions. Therefore, the detection of DHS regions is of great significance. In this study, we develop a deep learning-based algorithm to identify whether an unknown sequence region is a potential DHS. The proposed method showed high prediction performance on both training datasets and independent datasets in different cell types and developmental stages, demonstrating that the method performs well in the identification of DHSs. Furthermore, for the convenience of wet-lab researchers, the user-friendly web server iDHS-Deep was established at http://lin-group.cn/server/iDHS-Deep/, by which users can easily distinguish DHSs from non-DHSs and obtain the corresponding developmental stage of a DHS.
Affiliation(s)
- Fu-Ying Dao
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Hao Lv
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Wei Su
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Zi-Jie Sun
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Qin-Lai Huang
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Hao Lin
- Informational Biology at University of Electronic Science and Technology of China, China
|
28
|
Alam FF, Shehu A. Unsupervised multi-instance learning for protein structure determination. J Bioinform Comput Biol 2021; 19:2140002. [PMID: 33568002 DOI: 10.1142/s0219720021400023] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]
Abstract
Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages, first organizing given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as prediction. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation utilizes two datasets, one benchmark dataset of ensembles of decoys for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods reveals that the proposed approach outperforms existing methods, thus warranting further investigation of multi-instance learning to advance our treatment of decoy selection.
Affiliation(s)
- Fardina Fathmiul Alam
- Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA
|
29
|
|
30
|
Neural Network Analysis. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
|
31
|
Akhter N, Chennupati G, Djidjev H, Shehu A. Decoy selection for protein structure prediction via extreme gradient boosting and ranking. BMC Bioinformatics 2020; 21:189. [PMID: 33297949 PMCID: PMC7724862 DOI: 10.1186/s12859-020-3523-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/15/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022] Open
Abstract
Background Identifying one or more biologically-active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme lack of balance in positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, lack of generalization over varied test cases remains a bottleneck for these methods. Results We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through a template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, all the while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for decoy sets consisting of mostly low-quality decoys. Conclusions ML-Select is a useful method for decoy selection. This work suggests further research in finding more effective ways to adopt machine learning frameworks in achieving robust performance for decoy selection in template-free protein structure prediction.
Affiliation(s)
- Nasrin Akhter
- Department of Computer Science, George Mason University, Fairfax, 22030, VA, USA
| | - Gopinath Chennupati
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Bikini Atoll Rd., Los Alamos, 87545, USA
| | - Hristo Djidjev
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Bikini Atoll Rd., Los Alamos, 87545, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, 22030, VA, USA.,Department of Bioengineering, George Mason University, Fairfax, 22030, VA, USA.,School of Systems Biology, George Mason University, Manassas, 20110, VA, USA
|
32
|
Ouyang J, Huang N, Jiang Y. A single-model quality assessment method for poor quality protein structure. BMC Bioinformatics 2020; 21:157. [PMID: 32334508 PMCID: PMC7183596 DOI: 10.1186/s12859-020-3499-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/08/2020] [Accepted: 04/15/2020] [Indexed: 11/13/2022] Open
Abstract
Background Quality assessment of protein tertiary structure prediction models, in which structures of the best quality are selected from decoys, is a major challenge in protein structure prediction, and is crucial to determine a model’s utility and potential applications. Estimating the quality of a single model predicts the model’s quality based on the single model itself. In general, the Pearson correlation value of the quality assessment method increases in tandem with an increase in the quality of the model pool. However, there is no consensus regarding the best method to select a few good models from the poor quality model pool. Results We introduce a novel single-model quality assessment method for poor quality models that uses simple linear combinations of six features. We perform weighted search and linear regression on a large dataset of models from the 12th Critical Assessment of Protein Structure Prediction (CASP12) and benchmark the results on CASP13 models. We demonstrate that our method achieves outstanding performance on poor quality models. Conclusions According to results of poor protein structure assessment based on six features, contact prediction and relying on fewer prediction features can improve selection accuracy.
|
33
|
Manavalan B, Hasan MM, Basith S, Gosu V, Shin TH, Lee G. Empirical Comparison and Analysis of Web-Based DNA N4-Methylcytosine Site Prediction Tools. Mol Ther Nucleic Acids 2020; 22:406-420. [PMID: 33230445 PMCID: PMC7533314 DOI: 10.1016/j.omtn.2020.09.010] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 06/16/2020] [Accepted: 09/11/2020] [Indexed: 12/12/2022]
Abstract
DNA N4-methylcytosine (4mC) is a crucial epigenetic modification involved in various biological processes. Accurate genome-wide identification of these sites is critical for improving our understanding of their biological functions and mechanisms. As experimental methods for 4mC identification are tedious, expensive, and labor-intensive, several machine learning-based approaches have been developed for genome-wide detection of such sites in multiple species. However, the predictions projected by these tools are difficult to quantify and compare. To date, no systematic performance comparison of 4mC tools has been reported. The aim of this study was to compare and critically evaluate 12 publicly available 4mC site prediction tools according to species specificity, based on a huge independent validation dataset. The tools 4mCCNN (Escherichia coli), DNA4mC-LIP (Arabidopsis thaliana), iDNA-MS (Fragaria vesca), DNA4mC-LIP and 4mCCNN (Drosophila melanogaster), and four tools for Caenorhabditis elegans achieved excellent overall performance compared with their counterparts. However, none of the existing methods was suitable for Geoalkalibacter subterraneus, Geobacter pickeringii, and Mus musculus, thereby limiting their practical applicability. Model transferability to five species and non-transferability to three species are also discussed. The presented evaluation will assist researchers in selecting appropriate prediction tools that best suit their purpose and provide useful guidelines for the development of improved 4mC predictors in the future.
Affiliation(s)
- Balachandran Manavalan
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea
| | - Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka 820-8502, Japan.,Japan Society for the Promotion of Science, Chiyoda-ku, Tokyo 102-0083, Japan
| | - Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea
| | - Vijayakumar Gosu
- Department of Animal Biotechnology, Jeonbuk National University, Jeonju 54896, Republic of Korea
| | - Tae-Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea.,Department of Molecular Science and Technology, Ajou University, Suwon 16499, Republic of Korea
|
34
|
Grigas AT, Mei Z, Treado JD, Levine ZA, Regan L, O'Hern CS. Using physical features of protein core packing to distinguish real proteins from decoys. Protein Sci 2020; 29:1931-1944. [PMID: 32710566 PMCID: PMC7454528 DOI: 10.1002/pro.3914] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/05/2020] [Revised: 07/10/2020] [Accepted: 07/20/2020] [Indexed: 01/06/2023]
Abstract
The ability to consistently distinguish real protein structures from computationally generated model decoys is not yet a solved problem. One route to distinguish real protein structures from decoys is to delineate the important physical features that specify a real protein. For example, it has long been appreciated that the hydrophobic cores of proteins contribute significantly to their stability. We used two sources to obtain datasets of decoys to compare with real protein structures: submissions to the biennial Critical Assessment of protein Structure Prediction competition, in which researchers attempt to predict the structure of a protein only knowing its amino acid sequence, and also decoys generated by 3DRobot, which have user-specified global root-mean-squared deviations from experimentally determined structures. Our analysis revealed that both sets of decoys possess cores that do not recapitulate the key features that define real protein cores. In particular, the model structures appear more densely packed (because of energetically unfavorable atomic overlaps), contain too few residues in the core, and have improper distributions of hydrophobic residues throughout the structure. Based on these observations, we developed a feed-forward neural network, which incorporates key physical features of protein cores, to predict how well a computational model recapitulates the real protein structure without knowledge of the structure of the target sequence. By identifying the important features of protein structure, our method is able to rank decoy structures with similar accuracy to that obtained by state-of-the-art methods that incorporate many additional features. The small number of physical features makes our model interpretable, emphasizing the importance of protein packing and hydrophobicity in protein structure prediction.
Affiliation(s)
- Alex T. Grigas
- Graduate Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
| | - Zhe Mei
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
- Department of Chemistry, Yale University, New Haven, Connecticut, USA
| | - John D. Treado
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
- Department of Mechanical Engineering and Materials Science, Yale University, New Haven, Connecticut, USA
| | - Zachary A. Levine
- Department of Pathology, Yale University, New Haven, Connecticut, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA
| | - Lynne Regan
- Institute of Quantitative Biology, Biochemistry and Biotechnology, Centre for Synthetic and Systems Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Corey S. O'Hern
- Graduate Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
- Department of Mechanical Engineering and Materials Science, Yale University, New Haven, Connecticut, USA
- Department of Physics, Yale University, New Haven, Connecticut, USA
- Department of Applied Physics, Yale University, New Haven, Connecticut, USA
|
35
|
Postic G, Janel N, Tufféry P, Moroy G. An information gain-based approach for evaluating protein structure models. Comput Struct Biotechnol J 2020; 18:2228-2236. [PMID: 32837711 PMCID: PMC7431362 DOI: 10.1016/j.csbj.2020.08.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/01/2020] [Revised: 08/06/2020] [Accepted: 08/07/2020] [Indexed: 12/23/2022] Open
Abstract
For three decades now, knowledge-based scoring functions that operate through the "potential of mean force" (PMF) approach have continuously proven useful for studying protein structures. Although these statistical potentials are not to be confused with their physics-based counterparts of the same name (i.e., PMFs obtained by molecular dynamics simulations), their particular success in assessing the native-like character of protein structure predictions has led authors to consider the computed scores as approximations of the free energy. However, this physical justification has been a matter of controversy since the beginning. Alternative interpretations based on Bayes' theorem have been proposed, but the misleading formalism that invokes the inverse Boltzmann law remains recurrent in the literature. In this article, we present a conceptually new method for ranking protein structure models by quality, which is (i) independent of any physics-based explanation and (ii) relevant to statistics and to a general definition of information gain. The theoretical development described in this study provides new insights into how statistical PMFs work, in comparison with our approach. To prove the concept, we have built interatomic distance-dependent scoring functions, based on the former and new equations, and compared their performance on an independent benchmark of 60,000 protein structures. The results demonstrate that our new formalism outperforms statistical PMFs in evaluating the quality of protein structural decoys. Therefore, this original type of score offers a possibility to improve the success of statistical PMFs in the various fields of structural biology where they are applied. The open-source code is available for download at https://gitlab.rpbs.univ-paris-diderot.fr/src/ig-score.
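For readers unfamiliar with the "inverse Boltzmann law" that the abstract refers to, the conventional textbook form of a distance-dependent statistical PMF is sketched below. This is general background, not the equations of the cited article; the symbols $P_{\mathrm{obs}}$, $P_{\mathrm{ref}}$, and $S$ are notation assumed here for illustration.

```latex
% Pseudo-energy of one atom-type pair (a, b) at distance r:
E_{ab}(r) = -k_B T \,\ln\!\left( \frac{P_{\mathrm{obs}}(r \mid a, b)}{P_{\mathrm{ref}}(r)} \right)

% Score of a structural model: sum over all atom pairs (i, j),
% where a_i and a_j are the atom types and r_{ij} their distance.
S = \sum_{i < j} E_{a_i a_j}(r_{ij})
```

Here $P_{\mathrm{obs}}$ is the frequency with which the pair is observed at distance $r$ in a set of known structures, and $P_{\mathrm{ref}}$ is the frequency expected in a chosen reference state. Lower $S$ is read as more native-like; the physical interpretation of $S$ as a free energy is the point the abstract describes as controversial.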
Affiliation(s)
- Guillaume Postic
- Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France.,Université de Paris, BFA, UMR 8251, CNRS, F-75013 Paris, France.,Institut Français de Bioinformatique (IFB), UMS 3601-CNRS, Université Paris-Saclay, Orsay, France.,Ressource Parisienne en Bioinformatique Structurale (RPBS), Paris, France
| | - Nathalie Janel
- Université de Paris, BFA, UMR 8251, CNRS, F-75013 Paris, France
| | - Pierre Tufféry
- Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France.,Ressource Parisienne en Bioinformatique Structurale (RPBS), Paris, France
| | - Gautier Moroy
- Université de Paris, BFA, UMR 8251, CNRS, ERL U1133, Inserm, F-75013 Paris, France
|
36
|
Abstract
Atom pairwise potential functions make up an essential part of many scoring functions for protein decoy detection. With the development of machine learning (ML) tools, there are multiple ways to combine potential functions to create novel ML models and methods. Potential function parameters can be easily extracted; however, it is usually hard to directly obtain the calculated atom pairwise energies from scoring functions. Amber, as one of the most popular suites of modeling programs, has an extensive history and library of force field potential functions. In this work, we directly used the force field parameters in ff94 and ff14SB from Amber and encoded them to calculate atom pairwise energies for different interactions. Two sets of structures (a single amino acid set and a dipeptide set) were used to evaluate the performance of our encoded Amber potentials. From the comparison results between energy terms obtained from our encoding and Amber, we find energy differences within ±0.06 kcal/mol for all tested structures. Previously we have shown that the Random Forest (RF) model can help to emphasize more important atom pairwise interactions and ignore insignificant ones [Pei, J.; Zheng, Z.; Merz, K. M. J. Chem. Inf. Model. 2019, 59, 1919-1929]. Here, as an example of combining ML methods with traditional potential functions, we followed the same workflow to combine the RF models with force field potential functions from Amber. To determine the performance of our RF models with force field potential functions, 224 different protein native-decoy systems were used as our training and testing sets. We find that the RF models with ff94 and ff14SB force field parameters outperformed all other scoring functions (RF models with KECSA2, RWplus, DFIRE, dDFIRE, and GOAP) considered in this work for native structure detection, and they performed similarly in detecting the best decoy.
Through inclusion of best decoy to decoy comparisons in building our RF models, we were able to generate models that outperformed the score functions tested herein both on accuracy and best decoy detection, again showing the performance and flexibility of our RF models to tackle this problem. Finally, the importance of the RF algorithm and force field parameters were also tested and the comparison results suggest that both the RF algorithm and force field potentials are important with the ML scoring function achieving its best performance only by combining them together. All code and data used in this work are available at https://github.com/JunPei000/FFENCODER_for_Protein_Folding_Pose_Selection.
Affiliation(s)
- Jun Pei
- Department of Chemistry and the Department of Biochemistry and Molecular Biology, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Lin Frank Song
- Department of Chemistry and the Department of Biochemistry and Molecular Biology, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
| | - Kenneth M Merz
- Department of Chemistry and the Department of Biochemistry and Molecular Biology, Michigan State University, 578 South Shaw Lane, East Lansing, Michigan 48824, United States
|
37
|
Chen W, Feng P, Nie F. iATP: A Sequence Based Method for Identifying Anti-tubercular Peptides. Med Chem 2020; 16:620-625. [DOI: 10.2174/1573406415666191002152441] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 04/09/2019] [Revised: 05/15/2019] [Accepted: 08/23/2019] [Indexed: 11/22/2022]
Abstract
Background: Tuberculosis is one of the biggest threats to human health. Recent studies have demonstrated that anti-tubercular peptides are promising candidates for the discovery of new anti-tubercular drugs. Since experimental methods are still labor-intensive, it is highly desirable to develop automatic computational methods to identify anti-tubercular peptides from the huge number of natural and synthetic peptides. Hence, accurate and fast computational methods are needed.
Methods and Results: In this study, a support vector machine-based method was proposed to identify anti-tubercular peptides, in which the peptides were encoded using the optimal g-gap dipeptide compositions. Comparative results demonstrated that our method outperforms existing methods on the same benchmark dataset. For the convenience of the scientific community, a freely accessible web server was built, which is available at http://lin-group.cn/server/iATP.
Conclusion: It is anticipated that the proposed method will become a useful tool for identifying anti-tubercular peptides.
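To make the g-gap dipeptide encoding named in this abstract concrete, here is a minimal Python sketch. It assumes the commonly used definition (the frequency of each of the 400 ordered residue pairs separated by g intervening positions); the function name, the normalization by the number of valid pairs, and the example peptide are illustrative assumptions and are not taken from the iATP paper. Choosing the "optimal" g and training the support vector machine are not shown.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature indices are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides: "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, g=0):
    """Return a 400-dimensional frequency vector of g-gap dipeptides.

    A g-gap dipeptide pairs residue i with residue i + g + 1, so g = 0
    gives the ordinary adjacent dipeptide composition.
    Pairs containing non-standard residues are skipped (an assumption).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short for this gap, or no valid pairs.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    # Hypothetical peptide, used only to show the call.
    features = g_gap_dipeptide_composition("ACAC", g=1)
    nonzero = {d: f for d, f in zip(DIPEPTIDES, features) if f > 0}
    print(nonzero)
```

Each peptide becomes a fixed-length vector regardless of its length, which is what allows a standard classifier such as a support vector machine to be trained on peptides of varying size.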
Affiliation(s)
- Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Pengmian Feng
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Fulei Nie
- Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China
|
38
|
Guan ZX, Li SH, Zhang ZM, Zhang D, Yang H, Ding H. A Brief Survey for MicroRNA Precursor Identification Using Machine Learning Methods. Curr Genomics 2020; 21:11-25. [PMID: 32655294 PMCID: PMC7324890 DOI: 10.2174/1389202921666200214125102] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/24/2019] [Revised: 01/24/2020] [Accepted: 01/30/2020] [Indexed: 11/22/2022] Open
Abstract
MicroRNAs (miRNAs), a group of short non-coding RNA molecules, can regulate gene expression. Many diseases are associated with abnormal expression of miRNAs. Therefore, accurate identification of miRNA precursors (pre-miRNAs) is necessary. In the past 10 years, experimental methods, comparative genomics methods, and artificial intelligence methods have been used to identify pre-miRNAs. However, experimental methods and comparative genomics methods have disadvantages, such as being time-consuming. In contrast, machine learning-based methods are a better choice. Therefore, this review summarizes the current advances in pre-miRNA recognition based on computational methods, including the construction of benchmark datasets, feature extraction methods, prediction algorithms, and the results of the models. We also provide information about the predictors currently available. Finally, we give future perspectives on the identification of pre-miRNAs. The review provides scholars with the background of pre-miRNA identification using machine learning methods, which can help researchers gain a clear understanding of research progress in this field.
Affiliation(s)
- Zheng-Xing Guan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Hao Li
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Mei Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Dan Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
39
|
Liu T, Wang Z. MASS: predict the global qualities of individual protein models using random forests and novel statistical potentials. BMC Bioinformatics 2020; 21:246. [PMID: 32631256 PMCID: PMC7336608 DOI: 10.1186/s12859-020-3383-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/21/2020] [Accepted: 01/22/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein model quality assessment (QA) is an essential procedure in protein structure prediction. QA methods can predict the qualities of protein models and identify good models from decoys. Clustering-based methods need a certain number of models as input. However, if a pool of models is not available, methods that only need a single model as input are indispensable. RESULTS We developed MASS, a QA method to predict the global qualities of individual protein models using random forests and various novel energy functions. We designed six novel energy functions or statistical potentials that can capture the structural characteristics of a protein model, which can also be used in other protein-related bioinformatics research. MASS potentials demonstrated higher importance than the energy functions of RWplus, GOAP, DFIRE and Rosetta when the scores they generated are used as machine learning features. MASS outperforms almost all of the four CASP11 top-performing single-model methods for global quality assessment in terms of all of the four evaluation criteria officially used by CASP, which measure the abilities to assign relative and absolute scores, identify the best model from decoys, and distinguish between good and bad models. MASS has also achieved comparable performance with the leading QA methods in CASP12 and CASP13. CONCLUSIONS MASS and the source code for all MASS potentials are publicly available at http://dna.cs.miami.edu/MASS/.
Affiliation(s)
- Tong Liu
- Department of Computer Science, University of Miami, 1365 Memorial Drive, P.O. Box 248154, Coral Gables, FL, 33124, USA
| | - Zheng Wang
- Department of Computer Science, University of Miami, 1365 Memorial Drive, P.O. Box 248154, Coral Gables, FL, 33124, USA.
|
40
|
Wang W, Wang J, Xu D, Shang Y. Two New Heuristic Methods for Protein Model Quality Assessment. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:1430-1439. [PMID: 30418914 PMCID: PMC8988942 DOI: 10.1109/tcbb.2018.2880202] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/09/2023]
Abstract
Protein tertiary structure prediction is an important open challenge in bioinformatics and requires effective methods to accurately evaluate the quality of protein 3-D models generated computationally. Many quality assessment (QA) methods have been proposed over the past three decades. However, the accuracy or robustness is unsatisfactory for practical applications. In this paper, two new heuristic QA methods are proposed: MUfoldQA_S and MUfoldQA_C. The MUfoldQA_S is a quasi-single-model QA method that assesses the model quality based on the known protein structures with similar sequences. This algorithm can be directly applied to protein fragments without the necessity of building a full structural model. A BLOSUM-based heuristic is also introduced to help differentiate accurate templates from poor ones. In MUfoldQA_C, the ideas from MUfoldQA_S were combined with the consensus approach to create a multi-model QA method that could also utilize information from existing reference models and have demonstrated improved performance. Extensive experimental results of these two methods have shown significant improvement over existing methods. In addition, both methods have been blindly tested in the CASP12 world-wide competition in the protein structure prediction field and ranked as top performers in their respective categories.
41
Gao J, Miao Z, Zhang Z, Wei H, Kurgan L. Prediction of Ion Channels and their Types from Protein Sequences: Comprehensive Review and Comparative Assessment. Curr Drug Targets 2020; 20:579-592. [PMID: 30360734 DOI: 10.2174/1389450119666181022153942] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/25/2018] [Revised: 10/03/2018] [Accepted: 10/04/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND Ion channels are a large and growing protein family. Many of them are associated with diseases, and consequently, they are targets for over 700 drugs. Discovery of new ion channels is facilitated by computational methods that predict ion channels and their types from protein sequences. However, these methods have never been comprehensively compared and evaluated. OBJECTIVE We offer a first-of-its-kind comprehensive survey of the sequence-based predictors of ion channels. We describe eight predictors that include five methods that predict ion channels, their types, and four classes of the voltage-gated channels. We also develop and use a new benchmark dataset to perform a comparative empirical analysis of the three currently available predictors. RESULTS While several methods that rely on different designs have been published, only a few of them are currently available and offer a broad scope of predictions. Support and availability after publication should be required when new methods are considered for publication. Empirical analysis shows strong performance for the prediction of ion channels and modest performance for the prediction of ion channel types and voltage-gated channel classes. We identify a substantial weakness of current methods: they cannot accurately predict ion channels that are categorized into multiple classes/types. CONCLUSION Several predictors of ion channels are available to end users. They offer practical levels of predictive quality. Methods that rely on a larger and more diverse set of predictive inputs (such as PSIONplus) are more accurate. New tools that address multi-label prediction of ion channels should be developed.
Affiliation(s)
- Jianzhao Gao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Zhen Miao
- College of Life Sciences, Nankai University, Tianjin, China
- Zhaopeng Zhang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Hong Wei
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, United States
42
Feng P, Wang Z. Recent Advances in Computational Methods for Identifying Anticancer Peptides. Curr Drug Targets 2020; 20:481-487. [PMID: 30068270 DOI: 10.2174/1389450119666180801121548] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/10/2018] [Revised: 05/28/2018] [Accepted: 05/28/2018] [Indexed: 01/10/2023]
Abstract
Anticancer peptides (ACPs) are small peptides that can kill cancer cells without damaging normal cells. In recent years, ACPs have been used pre-clinically for cancer treatment. Therefore, accurate identification of ACPs will promote their clinical applications. As an alternative to labor-intensive experimental techniques, a series of computational methods have been proposed for identifying ACPs. In this review, we briefly summarize the current progress in computational identification of ACPs. The challenges and future perspectives in developing reliable methods for identification of ACPs are also discussed. We anticipate that this review could provide novel insights into future research on anticancer peptides.
Affiliation(s)
- Pengmian Feng
- School of Public Health, North China University of Science and Technology, Tangshan, 063000, China
- Zhenyi Wang
- Center for Genomics and Computational Biology, School of Life Science, North China University of Science and Technology, Tangshan, 063000, China
43
Tan JX, Lv H, Wang F, Dao FY, Chen W, Ding H. A Survey for Predicting Enzyme Family Classes Using Machine Learning Methods. Curr Drug Targets 2020; 20:540-550. [PMID: 30277150 DOI: 10.2174/1389450119666181002143355] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 07/21/2018] [Revised: 08/17/2018] [Accepted: 09/04/2018] [Indexed: 12/13/2022]
Abstract
Enzymes are proteins that act as biological catalysts to speed up cellular biochemical processes. According to their main Enzyme Commission (EC) numbers, enzymes are divided into six categories: EC-1: oxidoreductase; EC-2: transferase; EC-3: hydrolase; EC-4: lyase; EC-5: isomerase and EC-6: synthetase. Different enzymes have different biological functions and acting objects. Therefore, knowing which family an enzyme belongs to can help infer its catalytic mechanism and provide information about the relevant biological function. With the large number of protein sequences flowing into databanks in the post-genomics age, annotating the family of an enzyme is very important. Since experimental methods are cost-ineffective, bioinformatics tools can be a great help for accurately classifying the family of an enzyme. In this review, we summarize the application of machine learning methods to the prediction of enzyme family from different aspects. We hope that this review will provide insights and inspiration for research on enzyme family classification.
Affiliation(s)
- Jiu-Xin Tan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Fang Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China; Gordon Life Science Institute, Boston, MA 02478, United States
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
44
Abstract
Background: Pseudouridine (Ψ) is the most abundant RNA modification and has important functions in a series of biological and cellular processes. Although experimental techniques have made great contributions to identifying Ψ sites, they are still labor-intensive and cost-ineffective. In the past few years, a series of computational approaches have been developed that provide rapid and efficient ways to identify Ψ sites. Results: To give readers a clear picture of recent developments in this important area, in this review we summarize and compare the representative computational approaches developed for identifying Ψ sites. Future directions in computationally identifying Ψ sites are discussed as well. Conclusion: We anticipate that this review will provide novel insights into research on pseudouridine modification.
Affiliation(s)
- Wei Chen
- School of Life Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063210, China
- Kewei Liu
- School of Life Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063210, China
45
Miranda MRA, Uchôa AF, Ferreira SR, Ventury KE, Costa EP, Carmo PRL, Machado OLT, Fernandes KVS, Amancio Oliveira AE. Chemical Modifications of Vicilins Interfere with Chitin-Binding Affinity and Toxicity to Callosobruchus maculatus (Coleoptera: Chrysomelidae) Insect: A Combined In Vitro and In Silico Analysis. Journal of Agricultural and Food Chemistry 2020; 68:5596-5605. [PMID: 32343573 DOI: 10.1021/acs.jafc.9b08034] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Vicilins are related to cowpea seed resistance toward Callosobruchus maculatus due to their ability to bind to chitinous structures lining the larval midgut. However, this binding mechanism is not fully understood. Here, we identified chitin binding sites and investigated how in vitro and in silico chemical modifications interfere with vicilin chitin binding and insect toxicity. In vitro assays showed that unmodified vicilin binds strongly to chitin matrices, mainly to acetylated chitin. Chemical modifications of specific amino acids (tryptophan, lysine, tyrosine), as well as glutaraldehyde cross-linking, decreased the evaluated parameters. In silico analyses identified at least one chitin binding site in the vicilin monomer, the region between Arg208 and Lys216, which bears the sequence REGIRELMK and forms an α helix exposed in the 3D structure. In silico modifications of Lys223 (acetylated at its terminal nitrogen) and Trp316 (iodinated to 7-iodine-L-tryptophan or oxidized to β-oxy-indolylalanine) decreased vicilin chitin binding affinity. Glucose, sucrose, and N-acetylglucosamine also interfered with vicilin chitin binding affinity.
Affiliation(s)
- Maria Raquel A Miranda
- Departamento de Bioquímica, Centro de Ciências, Universidade Federal do Ceará (UFC), Fortaleza Ceará 60440554, Brazil
- Adriana F Uchôa
- Departamento de Biologia Celular e Genética, Centro de Biociências, Universidade Federal do Rio Grande do Norte, Natal, Rio Grande do Norte 59072970, Brazil
- Sarah R Ferreira
- Centro de Biociências e Biotecnologia, Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), Campos dos Goytacazes, Rio de Janeiro 28013-602, Brazil
- Kayan E Ventury
- Centro de Biociências e Biotecnologia, Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), Campos dos Goytacazes, Rio de Janeiro 28013-602, Brazil
- Evenilton P Costa
- Centro de Biociências e Biotecnologia, Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), Campos dos Goytacazes, Rio de Janeiro 28013-602, Brazil
- Paulo R Leitão Carmo
- NUPEN, Universidade Federal do Rio de Janeiro (UFRJ) Macaé, Rio de Janeiro 27965-045, Brazil
- Olga L T Machado
- Centro de Biociências e Biotecnologia, Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), Campos dos Goytacazes, Rio de Janeiro 28013-602, Brazil
- Katia V S Fernandes
- Centro de Biociências e Biotecnologia, Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), Campos dos Goytacazes, Rio de Janeiro 28013-602, Brazil
- Antonia Elenir Amancio Oliveira
- Centro de Biociências e Biotecnologia, Universidade Estadual do Norte Fluminense Darcy Ribeiro (UENF), Campos dos Goytacazes, Rio de Janeiro 28013-602, Brazil
46
Feng CQ, Zhang ZY, Zhu XJ, Lin Y, Chen W, Tang H, Lin H. iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators. Bioinformatics 2020; 35:1469-1477. [PMID: 30247625 DOI: 10.1093/bioinformatics/bty827] [Citation(s) in RCA: 156] [Impact Index Per Article: 31.2] [Received: 08/12/2018] [Revised: 09/13/2018] [Accepted: 09/20/2018] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Transcription termination is an important regulatory step of gene expression. If a gene has no terminator, transcription cannot stop, which results in abnormal gene expression. Detecting such terminators can determine the operon structure in bacterial organisms and improve genome annotation. Thus, accurate identification of transcriptional terminators is essential in research on transcriptional regulation. RESULTS In this study, we developed a new predictor called 'iTerm-PseKNC', based on a support vector machine, to identify transcription terminators. The binomial distribution approach was used to pick out the optimal feature subset derived from pseudo k-tuple nucleotide composition (PseKNC). The 5-fold cross-validation test results showed that our proposed method achieved an accuracy of 95%. To further evaluate the generalization ability of 'iTerm-PseKNC', the model was examined on independent datasets of experimentally confirmed Rho-independent terminators in the Escherichia coli and Bacillus subtilis genomes. As a result, all the terminators in E. coli and 87.5% of the terminators in B. subtilis were correctly identified, suggesting that the proposed model could become a powerful tool for bacterial terminator recognition. AVAILABILITY AND IMPLEMENTATION For the convenience of wet-experimental researchers, the web-server for 'iTerm-PseKNC' was established at http://lin-group.cn/server/iTerm-PseKNC/, by which users can easily obtain their desired result without the need to go through the detailed mathematical equations involved.
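As a rough illustration of the kind of sequence feature the abstract above refers to, the following sketch computes a plain k-tuple nucleotide composition (normalized k-mer frequencies). It is not the authors' implementation: the function name `ktuple_composition` is hypothetical, and the pseudo components of PseKNC (the physicochemical correlation terms), the binomial feature selection, and the support vector machine are all omitted.

```python
from itertools import product


def ktuple_composition(seq, k=3):
    """Return normalized k-mer frequencies of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    This is only the composition part of a PseKNC-style feature vector.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


features = ktuple_composition("ACGTACGT", k=2)
```

A vector of this kind could then be passed to any standard classifier; which features are retained and how the classifier is tuned are details described in the paper itself, not here.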
Affiliation(s)
- Chao-Qin Feng
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zhao-Yue Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Xiao-Juan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Yan Lin
- Key Laboratory for Animal Disease Resistance Nutrition of the Ministry of Education, Animal Nutrition Institute, Sichuan Agricultural University, Chengdu, China
- Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China; Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan, China
- Hua Tang
- Department of Pathophysiology, Southwest Medical University, Luzhou, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
47
Makigaki S, Ishida T. Sequence Alignment Using Machine Learning for Accurate Template-based Protein Structure Prediction. Bio Protoc 2020; 10:e3600. [PMID: 33659566 DOI: 10.21769/bioprotoc.3600] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/15/2019] [Revised: 02/26/2020] [Accepted: 03/22/2020] [Indexed: 11/02/2022] Open
Abstract
Template-based modeling, the process of predicting the tertiary structure of a protein by using homologous protein structures, is useful when good templates are available. Indeed, modern homology detection methods can find remote homologs with high sensitivity. However, the accuracy of template-based models generated from homology-detection-based alignments is often lower than that of models built from ideal alignments. In this study, we propose a new method that generates pairwise sequence alignments for more accurate template-based modeling. Our method trains a machine learning model using the structural alignments of known homologs. When calculating sequence alignments, instead of using a fixed substitution matrix, this method dynamically predicts a substitution score from the trained model.
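The idea of replacing a fixed substitution matrix with a predicted score can be sketched as a global (Needleman-Wunsch style) alignment whose substitution score comes from a callable. This is a minimal sketch under stated assumptions, not the authors' method: `align` and `toy_score` are hypothetical names, the linear gap penalty is an assumption, and `toy_score` is a stand-in for a trained model, which the abstract does not specify.

```python
def align(a, b, score_fn, gap=-2.0):
    """Globally align sequences a and b with a pluggable substitution score.

    score_fn(a, i, b, j) returns the score for pairing a[i] with b[j]; a
    trained model could inspect the context around i and j instead of
    looking up a fixed matrix. Gaps use a linear penalty (an assumption).
    """
    n, m = len(a), len(b)
    # dp[i][j] holds the best score for aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score_fn(a, i - 1, b, j - 1),  # pair residues
                dp[i - 1][j] + gap,                               # gap in b
                dp[i][j - 1] + gap,                               # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score_fn(a, i - 1, b, j - 1):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def toy_score(a, i, b, j):
    """Placeholder for a model prediction: +1 for identity, -1 otherwise."""
    return 1.0 if a[i] == b[j] else -1.0


score, aligned_a, aligned_b = align("ACGT", "AGT", toy_score)
```

Passing the whole sequences and positions to `score_fn`, rather than just two residues, is a design choice that leaves room for a context-dependent predictor.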
Affiliation(s)
- Takashi Ishida
- School of Computing, Tokyo Institute of Technology, Tokyo, Japan
48
Tadepalli S, Akhter N, Barbara D, Shehu A. Anomaly Detection-Based Recognition of Near-Native Protein Structures. IEEE Trans Nanobioscience 2020; 19:562-570. [PMID: 32340957 DOI: 10.1109/tnb.2020.2990642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
The three-dimensional structures populated by a protein molecule determine to a great extent its biological activities. The rich information encoded by protein structure on protein function continues to motivate the development of computational approaches for determining functionally-relevant structures. The majority of structures generated in silico are not relevant. Discriminating relevant/native protein structures from non-native ones is an outstanding challenge in computational structural biology. Inherently, this is a recognition problem that can be addressed under the umbrella of machine learning. In this paper, based on the premise that near-native structures are effectively anomalies, we build on the concept of anomaly detection in machine learning. We propose methods that automatically select relevant subsets, as well as methods that select a single structure to offer as prediction. Evaluations are carried out on benchmark datasets and demonstrate that the proposed methods advance the state of the art. The presented results motivate further building on and adapting concepts and techniques from machine learning to improve recognition of near-native structures in protein structure prediction.
49
Chen J, Siu SWI. Machine Learning Approaches for Quality Assessment of Protein Structures. Biomolecules 2020; 10:biom10040626. [PMID: 32316682 PMCID: PMC7226485 DOI: 10.3390/biom10040626] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/03/2020] [Revised: 04/07/2020] [Accepted: 04/09/2020] [Indexed: 11/16/2022] Open
Abstract
Protein structures play a very important role in biomedical research, especially in drug discovery and design, which require accurate protein structures in advance. However, experimental determination of protein structure is prohibitively costly and time-consuming, and computational prediction of protein structures has not been perfected. Methods that assess the quality of protein models can help in selecting the most accurate candidates for further work. Driven by this demand, many structural bioinformatics laboratories have developed methods for estimating model accuracy (EMA). In recent years, EMA methods based on machine learning (ML) have consistently ranked among the top-performing methods in the community-wide CASP challenge. Accordingly, we systematically review all the major ML-based EMA methods developed within the past ten years. The methods are grouped by the ML approach they employ (support vector machine, artificial neural networks, ensemble learning, or Bayesian learning), and their significance is discussed from a methodology viewpoint. To orient the reader, we also briefly describe the background of EMA, including the CASP challenge and its evaluation metrics, and introduce the major ML/DL techniques. Overall, this review provides an introductory guide to modern research on protein quality assessment and directions for future research in this area.
50
Cao Y, Shen Y. Energy-based graph convolutional networks for scoring protein docking models. Proteins 2020; 88:1091-1099. [PMID: 32144844 DOI: 10.1002/prot.25888] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 10/29/2019] [Revised: 01/15/2020] [Accepted: 02/26/2020] [Indexed: 12/18/2022]
Abstract
Structural information about protein-protein interactions, often missing at the interactome scale, is important for mechanistic understanding of cells and rational discovery of therapeutics. Protein docking provides a computational alternative for such information. However, ranking near-native docked models high among a large number of candidates, often known as the scoring problem, remains a critical challenge. Moreover, estimating model quality, also known as the quality assessment problem, is rarely addressed in protein docking. In this study, the two challenging problems in protein docking are regarded as relative and absolute scoring, respectively, and addressed in one physics-inspired deep learning framework. We represent protein and complex structures as intra- and inter-molecular residue contact graphs with atom-resolution node and edge features. We propose a novel graph convolutional kernel that aggregates interacting nodes' features through edges so that generalized interaction energies can be learned directly from 3D data. The resulting energy-based graph convolutional networks (EGCN) with multihead attention are trained to predict intra- and inter-molecular energies, binding affinities, and quality measures (interface RMSD) for encounter complexes. Compared to a state-of-the-art scoring function for model ranking, EGCN significantly improves ranking for a critical assessment of predicted interactions (CAPRI) test set involving homology docking, and is comparable or slightly better for Score_set, a CAPRI benchmark set generated by diverse community-wide docking protocols not known to training data. For Score_set quality assessment, EGCN shows about 27% improvement over our previous efforts. Directly learning from 3D structure data in graph representation, EGCN represents the first successful development of graph convolutional networks for protein docking.
Affiliation(s)
- Yue Cao
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas
- Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas; TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, Texas