1
|
Ambreen S, Umar M, Noor A, Jain H, Ali R. Advanced AI and ML frameworks for transforming drug discovery and optimization: With innovative insights in polypharmacology, drug repurposing, combination therapy and nanomedicine. Eur J Med Chem 2025; 284:117164. [PMID: 39721292 DOI: 10.1016/j.ejmech.2024.117164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2024] [Revised: 11/24/2024] [Accepted: 11/27/2024] [Indexed: 12/28/2024]
Abstract
Artificial Intelligence (AI) and Machine Learning (ML) are transforming drug discovery by overcoming traditional challenges like high costs, time-consuming, and frequent failures. AI-driven approaches streamline key phases, including target identification, lead optimization, de novo drug design, and drug repurposing. Frameworks such as deep neural networks (DNNs), convolutional neural networks (CNNs), and deep reinforcement learning (DRL) models have shown promise in identifying drug targets, optimizing delivery systems, and accelerating drug repurposing. Generative adversarial networks (GANs) and variational autoencoders (VAEs) aid de novo drug design by creating novel drug-like compounds with desired properties. Case studies, such as DDR1 kinase inhibitors designed using generative models and CDK20 inhibitors developed via structure-based methods, highlight AI's ability to produce highly specific therapeutics. Models like SNF-CVAE and DeepDR further advance drug repurposing by uncovering new therapeutic applications for existing drugs. Advanced ML algorithms enhance precision in predicting drug efficacy, toxicity, and ADME-Tox properties, reducing development costs and improving drug-target interactions. AI also supports polypharmacology by optimizing multi-target drug interactions and enhances combination therapy through predictions of drug synergies and antagonisms. In nanomedicine, AI models like CURATE.AI and the Hartung algorithm optimize personalized treatments by predicting toxicological risks and real-time dosing adjustments with high accuracy. Despite its potential, challenges like data quality, model interpretability, and ethical concerns must be addressed. High-quality datasets, transparent models, and unbiased algorithms are essential for reliable AI applications. As AI continues to evolve, it is poised to revolutionize drug discovery and personalized medicine, advancing therapeutic development and patient care.
Collapse
Affiliation(s)
- Subiya Ambreen
- Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
| | - Mohammad Umar
- Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
| | - Aaisha Noor
- Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
| | - Himangini Jain
- Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
| | - Ruhi Ali
- Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India.
| |
Collapse
|
2
|
McGuffin LJ, Alharbi SMA. ModFOLD9: A Web Server for Independent Estimates of 3D Protein Model Quality. J Mol Biol 2024; 436:168531. [PMID: 39237204 DOI: 10.1016/j.jmb.2024.168531] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2024] [Revised: 02/19/2024] [Accepted: 03/06/2024] [Indexed: 09/07/2024]
Abstract
Accurate models of protein tertiary structures are now available from numerous advanced prediction methods, although the accuracy of each method often varies depending on the specific protein target. Additionally, many models may still contain significant local errors. Therefore, reliable, independent model quality estimates are essential both for identifying errors and selecting the very best models for further biological investigations. ModFOLD9 is a leading independent server for detecting the local errors in models produced by any method, and it can accurately discriminate between high-quality models from multiple alternative approaches. ModFOLD9 incorporates several new scores from deep learning-based approaches, leading to greatly improved prediction accuracy compared with earlier versions of the server. ModFOLD9 is continuously independently benchmarked, and it is shown to be highly competitive with other public servers. ModFOLD9 is freely available at https://www.reading.ac.uk/bioinf/ModFOLD/.
Collapse
|
3
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Collapse
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA;
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA;
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA;
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China;
| |
Collapse
|
4
|
Lin P, Yan Y, Tao H, Huang SY. Deep transfer learning for inter-chain contact predictions of transmembrane protein complexes. Nat Commun 2023; 14:4935. [PMID: 37582780 PMCID: PMC10427616 DOI: 10.1038/s41467-023-40426-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2023] [Accepted: 07/21/2023] [Indexed: 08/17/2023] Open
Abstract
Membrane proteins are encoded by approximately a quarter of human genes. Inter-chain residue-residue contact information is important for structure prediction of membrane protein complexes and valuable for understanding their molecular mechanism. Although many deep learning methods have been proposed to predict the intra-protein contacts or helix-helix interactions in membrane proteins, it is still challenging to accurately predict their inter-chain contacts due to the limited number of transmembrane proteins. Addressing the challenge, here we develop a deep transfer learning method for predicting inter-chain contacts of transmembrane protein complexes, named DeepTMP, by taking advantage of the knowledge pre-trained from a large data set of non-transmembrane proteins. DeepTMP utilizes a geometric triangle-aware module to capture the correct inter-chain interaction from the coevolution information generated by protein language models. DeepTMP is extensively evaluated on a test set of 52 self-associated transmembrane protein complexes, and compared with state-of-the-art methods including DeepHomo2.0, CDPred, GLINTER, DeepHomo, and DNCON2_Inter. It is shown that DeepTMP considerably improves the precision of inter-chain contact prediction and outperforms the existing approaches in both accuracy and robustness.
Collapse
Affiliation(s)
- Peicong Lin
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
| | - Yumeng Yan
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
| | - Huanyu Tao
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
| | - Sheng-You Huang
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China.
| |
Collapse
|
5
|
Chen X, Morehead A, Liu J, Cheng J. A gated graph transformer for protein complex structure quality assessment and its performance in CASP15. Bioinformatics 2023; 39:i308-i317. [PMID: 37387159 PMCID: PMC10311325 DOI: 10.1093/bioinformatics/btad203] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Proteins interact to form complexes to carry out essential biological functions. Computational methods such as AlphaFold-multimer have been developed to predict the quaternary structures of protein complexes. An important yet largely unsolved challenge in protein complex structure prediction is to accurately estimate the quality of predicted protein complex structures without any knowledge of the corresponding native structures. Such estimations can then be used to select high-quality predicted complex structures to facilitate biomedical research such as protein function analysis and drug discovery. RESULTS In this work, we introduce a new gated neighborhood-modulating graph transformer to predict the quality of 3D protein complex structures. It incorporates node and edge gates within a graph transformer framework to control information flow during graph message passing. We trained, evaluated and tested the method (called DProQA) on newly-curated protein complex datasets before the 15th Critical Assessment of Techniques for Protein Structure Prediction (CASP15) and then blindly tested it in the 2022 CASP15 experiment. The method was ranked 3rd among the single-model quality assessment methods in CASP15 in terms of the ranking loss of TM-score on 36 complex targets. The rigorous internal and external experiments demonstrate that DProQA is effective in ranking protein complex structures. AVAILABILITY AND IMPLEMENTATION The source code, data, and pre-trained models are available at https://github.com/jianlin-cheng/DProQA.
Collapse
Affiliation(s)
- Xiao Chen
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65201, United States
| | - Alex Morehead
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65201, United States
| | - Jian Liu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65201, United States
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65201, United States
| |
Collapse
|
6
|
Guo Z, Liu J, Skolnick J, Cheng J. Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks. Nat Commun 2022; 13:6963. [PMID: 36379943 PMCID: PMC9666547 DOI: 10.1038/s41467-022-34600-2] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2022] [Accepted: 10/24/2022] [Indexed: 11/16/2022] Open
Abstract
Residue-residue distance information is useful for predicting tertiary structures of protein monomers or quaternary structures of protein complexes. Many deep learning methods have been developed to predict intra-chain residue-residue distances of monomers accurately, but few methods can accurately predict inter-chain residue-residue distances of complexes. We develop a deep learning method CDPred (i.e., Complex Distance Prediction) based on the 2D attention-powered residual network to address the gap. Tested on two homodimer datasets, CDPred achieves the precision of 60.94% and 42.93% for top L/5 inter-chain contact predictions (L: length of the monomer in homodimer), respectively, substantially higher than DeepHomo's 37.40% and 23.08% and GLINTER's 48.09% and 36.74%. Tested on the two heterodimer datasets, the top Ls/5 inter-chain contact prediction precision (Ls: length of the shorter monomer in heterodimer) of CDPred is 47.59% and 22.87% respectively, surpassing GLINTER's 23.24% and 13.49%. Moreover, the prediction of CDPred is complementary with that of AlphaFold2-multimer.
Collapse
Affiliation(s)
- Zhiye Guo
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211 USA
| | - Jian Liu
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211 USA
| | - Jeffrey Skolnick
- grid.213917.f0000 0001 2097 4943School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332-200 USA
| | - Jianlin Cheng
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211 USA
| |
Collapse
|
7
|
Mahmud S, Guo Z, Quadir F, Liu J, Cheng J. Multi-head attention-based U-Nets for predicting protein domain boundaries using 1D sequence features and 2D distance maps. BMC Bioinformatics 2022; 23:283. [PMID: 35854211 PMCID: PMC9295499 DOI: 10.1186/s12859-022-04829-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2022] [Accepted: 07/08/2022] [Indexed: 01/25/2023] Open
Abstract
The information about the domain architecture of proteins is useful for studying protein structure and function. However, accurate prediction of protein domain boundaries (i.e., sequence regions separating two domains) from sequence remains a significant challenge. In this work, we develop a deep learning method based on multi-head U-Nets (called DistDom) to predict protein domain boundaries utilizing 1D sequence features and predicted 2D inter-residue distance map as input. The 1D features contain the evolutionary and physicochemical information of protein sequences, whereas the 2D distance map includes the structural information of proteins that was rarely used in domain boundary prediction before. The 1D and 2D features are processed by the 1D and 2D U-Nets respectively to generate hidden features. The hidden features are then used by the multi-head attention to predict the probability of each residue of a protein being in a domain boundary, leveraging both local and global information in the features. The residue-level domain boundary predictions can be used to classify proteins as single-domain or multi-domain proteins. It classifies the CASP14 single-domain and multi-domain targets at the accuracy of 75.9%, 13.28% more accurate than the state-of-the-art method. Tested on the CASP14 multi-domain protein targets with expert annotated domain boundaries, the average per-target F1 measure score of the domain boundary prediction by DistDom is 0.263, 29.56% higher than the state-of-the-art method.
Collapse
Affiliation(s)
- Sajid Mahmud
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Zhiye Guo
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Farhan Quadir
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Jian Liu
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| | - Jianlin Cheng
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO USA
| |
Collapse
|
8
|
Chen X, Cheng J. DISTEMA: distance map-based estimation of single protein model accuracy with attentive 2D convolutional neural network. BMC Bioinformatics 2022; 23:141. [PMID: 35439931 PMCID: PMC9019949 DOI: 10.1186/s12859-022-04683-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2022] [Accepted: 04/11/2022] [Indexed: 01/25/2023] Open
Abstract
BACKGROUND Estimation of the accuracy (quality) of protein structural models is important for both prediction and use of protein structural models. Deep learning methods have been used to integrate protein structure features to predict the quality of protein models. Inter-residue distances are key information for predicting protein's tertiary structures and therefore have good potentials to predict the quality of protein structural models. However, few methods have been developed to fully take advantage of predicted inter-residue distance maps to estimate the accuracy of a single protein structural model. RESULT We developed an attentive 2D convolutional neural network (CNN) with channel-wise attention to take only a raw difference map between the inter-residue distance map calculated from a single protein model and the distance map predicted from the protein sequence as input to predict the quality of the model. The network comprises multiple convolutional layers, batch normalization layers, dense layers, and Squeeze-and-Excitation blocks with attention to automatically extract features relevant to protein model quality from the raw input without using any expert-curated features. We evaluated DISTEMA's capability of selecting the best models for CASP13 targets in terms of ranking loss of GDT-TS score. The ranking loss of DISTEMA is 0.079, lower than several state-of-the-art single-model quality assessment methods. CONCLUSION This work demonstrates that using raw inter-residue distance information with deep learning can predict the quality of protein structural models reasonably well. DISTEMA is freely at https://github.com/jianlin-cheng/DISTEMA.
Collapse
Affiliation(s)
- Xiao Chen
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri Columbia, Columbia, MO 65211 USA
| | - Jianlin Cheng
- grid.134936.a0000 0001 2162 3504Department of Electrical Engineering and Computer Science, University of Missouri Columbia, Columbia, MO 65211 USA
| |
Collapse
|
9
|
Hong Y, Lee J, Ko J. A-Prot: protein structure modeling using MSA transformer. BMC Bioinformatics 2022; 23:93. [PMID: 35296230 PMCID: PMC8925138 DOI: 10.1186/s12859-022-04628-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2021] [Accepted: 03/03/2022] [Indexed: 11/18/2022] Open
Abstract
Background The accuracy of protein 3D structure prediction has been dramatically improved with the help of advances in deep learning. In the recent CASP14, Deepmind demonstrated that their new version of AlphaFold (AF) produces highly accurate 3D models almost close to experimental structures. The success of AF shows that the multiple sequence alignment of a sequence contains rich evolutionary information, leading to accurate 3D models. Despite the success of AF, only the prediction code is open, and training a similar model requires a vast amount of computational resources. Thus, developing a lighter prediction model is still necessary. Results In this study, we propose a new protein 3D structure modeling method, A-Prot, using MSA Transformer, one of the state-of-the-art protein language models. An MSA feature tensor and row attention maps are extracted and converted into 2D residue-residue distance and dihedral angle predictions for a given MSA. We demonstrated that A-Prot predicts long-range contacts better than the existing methods. Additionally, we modeled the 3D structures of the free modeling and hard template-based modeling targets of CASP14. The assessment shows that the A-Prot models are more accurate than most top server groups of CASP14. Conclusion These results imply that A-Prot accurately captures the evolutionary and structural information of proteins with relatively low computational cost. Thus, A-Prot can provide a clue for the development of other protein property prediction methods. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04628-8.
Collapse
Affiliation(s)
- Yiyu Hong
- Arontier Co, Seoul, Republic of Korea
| | - Juyong Lee
- Arontier Co, Seoul, Republic of Korea. .,Department of Chemistry, Division of Chemistry and Biochemistry, Kangwon National University, Chuncheon, Republic of Korea.
| | - Junsu Ko
- Arontier Co, Seoul, Republic of Korea
| |
Collapse
|
10
|
Du Z, Peng Z, Yang J. Toward the assessment of predicted inter-residue distance. Bioinformatics 2022; 38:962-969. [PMID: 34791040 DOI: 10.1093/bioinformatics/btab781] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2021] [Revised: 10/07/2021] [Accepted: 11/10/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Significant progress has been achieved in distance-based protein folding, due to improved prediction of inter-residue distance by deep learning. Many efforts are thus made to improve distance prediction in recent years. However, it remains unknown what is the best way of objectively assessing the accuracy of predicted distance. RESULTS A total of 19 metrics were proposed to measure the accuracy of predicted distance. These metrics were discussed and compared quantitatively on three benchmark datasets, with distance and structure models predicted by the trRosetta pipeline. The experiments show that a few metrics, such as distance precision, have a high correlation with the model accuracy measure TM-score (Pearson's correlation coefficient >0.7). In addition, the metrics are applied to rank the distance prediction groups in CASP14. The ranking by our metrics coincides largely with the official version. These data suggest that the proposed metrics are effective for measuring distance prediction. We anticipate that this study paves the way for objectively monitoring the progress of inter-residue distance prediction. A web server and a standalone package are provided to implement the proposed metrics. AVAILABILITY AND IMPLEMENTATION http://yanglab.nankai.edu.cn/APD. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Zongyang Du
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| | - Zhenling Peng
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| | - Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| |
Collapse
|
11
|
Kandathil SM, Greener JG, Lau AM, Jones DT. Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins. Proc Natl Acad Sci U S A 2022; 119:e2113348119. [PMID: 35074909 PMCID: PMC8795500 DOI: 10.1073/pnas.2113348119] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2021] [Accepted: 12/07/2021] [Indexed: 12/12/2022] Open
Abstract
Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper.
Collapse
Affiliation(s)
- Shaun M Kandathil
- Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
| | - Joe G Greener
- Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
| | - Andy M Lau
- Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
| | - David T Jones
- Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
| |
Collapse
|
12
|
Rahman J, Newton MAH, Islam MKB, Sattar A. Enhancing protein inter-residue real distance prediction by scrutinising deep learning models. Sci Rep 2022; 12:787. [PMID: 35039537 PMCID: PMC8764118 DOI: 10.1038/s41598-021-04441-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2021] [Accepted: 12/17/2021] [Indexed: 12/29/2022] Open
Abstract
Protein structure prediction (PSP) has achieved significant progress lately via prediction of inter-residue distances using deep learning models and exploitation of the predictions during conformational search. In this context, prediction of large inter-residue distances and also prediction of distances between residues separated largely in the protein sequence remain challenging. To deal with these challenges, state-of-the-art inter-residue distance prediction algorithms have used large sets of coevolutionary and non-coevolutionary features. In this paper, we argue that the more the types of features used, the more the kinds of noises introduced and then the deep learning model has to overcome the noises to improve the accuracy of the predictions. Also, multiple features capturing similar underlying characteristics might not necessarily have significantly better cumulative effect. So we scrutinise the feature space to reduce the types of features to be used, but at the same time, we strive to improve the prediction accuracy. Consequently, for inter-residue real distance prediction, in this paper, we propose a deep learning model named scrutinised distance predictor (SDP), which uses only 2 coevolutionary and 3 non-coevolutionary features. On several sets of benchmark proteins, our proposed SDP method improves mean Local Distance Different Test (LDDT) scores at least by 10% over existing state-of-the-art methods. The SDP program along with its data is available from the website https://gitlab.com/mahnewton/sdp .
Collapse
Affiliation(s)
- Julia Rahman
- School of Information and Communication Technology, Griffith University, Southport, Australia.
| | - M A Hakim Newton
- Institute of Integrated and Intelligent Systems, Griffith University, Southport, Australia.
| | - Md Khaled Ben Islam
- School of Information and Communication Technology, Griffith University, Southport, Australia
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Southport, Australia
- Institute of Integrated and Intelligent Systems, Griffith University, Southport, Australia
| |
Collapse
|
13
|
Tran NH, Xu J, Li M. A tale of solving two computational challenges in protein science: neoantigen prediction and protein structure prediction. Brief Bioinform 2022; 23:bbab493. [PMID: 34891158 PMCID: PMC8769896 DOI: 10.1093/bib/bbab493] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2021] [Revised: 10/11/2021] [Accepted: 10/26/2021] [Indexed: 12/30/2022] Open
Abstract
In this article, we review two challenging computational questions in protein science: neoantigen prediction and protein structure prediction. Both topics have seen significant leaps forward by deep learning within the past five years, which immediately unlocked new developments of drugs and immunotherapies. We show that deep learning models offer unique advantages, such as representation learning and multi-layer architecture, which make them an ideal choice to leverage a huge amount of protein sequence and structure data to address those two problems. We also discuss the impact and future possibilities enabled by those two applications, especially how the data-driven approach by deep learning shall accelerate the progress towards personalized biomedicine.
Collapse
Affiliation(s)
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, USA
| | - Ming Li
- University of Waterloo, Canada
| |
Collapse
|
14
|
Roy RS, Quadir F, Soltanikazemi E, Cheng J. OUP accepted manuscript. Bioinformatics 2022; 38:1904-1910. [PMID: 35134816 PMCID: PMC8963319 DOI: 10.1093/bioinformatics/btac063] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2021] [Revised: 01/17/2022] [Accepted: 01/31/2022] [Indexed: 11/23/2022] Open
Abstract
Motivation Deep learning has revolutionized protein tertiary structure prediction recently. The cutting-edge deep learning methods such as AlphaFold can predict high-accuracy tertiary structures for most individual protein chains. However, the accuracy of predicting quaternary structures of protein complexes consisting of multiple chains is still relatively low due to lack of advanced deep learning methods in the field. Because interchain residue–residue contacts can be used as distance restraints to guide quaternary structure modeling, here we develop a deep dilated convolutional residual network method (DRCon) to predict interchain residue–residue contacts in homodimers from residue–residue co-evolutionary signals derived from multiple sequence alignments of monomers, intrachain residue–residue contacts of monomers extracted from true/predicted tertiary structures or predicted by deep learning, and other sequence and structural features. Results Tested on three homodimer test datasets (Homo_std dataset, DeepHomo dataset and CASP-CAPRI dataset), the precision of DRCon for top L/5 interchain contact predictions (L: length of monomer in a homodimer) is 43.46%, 47.10% and 33.50% respectively at 6 Å contact threshold, which is substantially better than DeepHomo and DNCON2_inter and similar to Glinter. Moreover, our experiments demonstrate that using predicted tertiary structure or intrachain contacts of monomers in the unbound state as input, DRCon still performs well, even though its accuracy is lower than using true tertiary structures in the bound state are used as input. Finally, our case study shows that good interchain contact predictions can be used to build high-accuracy quaternary structure models of homodimers. Availability and implementation The source code of DRCon is available at https://github.com/jianlin-cheng/DRCon. The datasets are available at https://zenodo.org/record/5998532#.YgF70vXMKsB. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Raj S Roy
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Farhan Quadir
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | - Elham Soltanikazemi
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| | | |
Collapse
|
15
|
Zhang F, Zhao B, Shi W, Li M, Kurgan L. DeepDISOBind: accurate prediction of RNA-, DNA- and protein-binding intrinsically disordered residues with deep multi-task learning. Brief Bioinform 2021; 23:6461158. [PMID: 34905768 DOI: 10.1093/bib/bbab521] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2021] [Revised: 10/30/2021] [Accepted: 11/14/2021] [Indexed: 12/14/2022] Open
Abstract
Proteins with intrinsically disordered regions (IDRs) are common among eukaryotes. Many IDRs interact with nucleic acids and proteins. Annotation of these interactions is supported by computational predictors, but to date, only one tool that predicts interactions with nucleic acids was released, and recent assessments demonstrate that current predictors offer modest levels of accuracy. We have developed DeepDISOBind, an innovative deep multi-task architecture that accurately predicts deoxyribonucleic acid (DNA)-, ribonucleic acid (RNA)- and protein-binding IDRs from protein sequences. DeepDISOBind relies on an information-rich sequence profile that is processed by an innovative multi-task deep neural network, where subsequent layers are gradually specialized to predict interactions with specific partner types. The common input layer links to a layer that differentiates protein- and nucleic acid-binding, which further links to layers that discriminate between DNA and RNA interactions. Empirical tests show that this multi-task design provides statistically significant gains in predictive quality across the three partner types when compared to a single-task design and a representative selection of the existing methods that cover both disorder- and structure-trained tools. Analysis of the predictions on the human proteome reveals that DeepDISOBind predictions can be encoded into protein-level propensities that accurately predict DNA- and RNA-binding proteins and protein hubs. DeepDISOBind is available at https://www.csuligroup.com/DeepDISOBind/.
Collapse
Affiliation(s)
- Fuhao Zhang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Bi Zhao
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
| | - Wenbo Shi
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Min Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
| |
Collapse
|
16
|
Gao M, Lund-Andersen P, Morehead A, Mahmud S, Chen C, Chen X, Giri N, Roy RS, Quadir F, Effler TC, Prout R, Abraham S, Elwasif W, Haas NQ, Skolnick J, Cheng J, Sedova A. High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function. WORKSHOP ON MACHINE LEARNING IN HPC ENVIRONMENTS. WORKSHOP ON MACHINE LEARNING IN HPC ENVIRONMENTS 2021; 2021:46-57. [PMID: 35112110 PMCID: PMC8802329 DOI: 10.1109/mlhpc54614.2021.00010] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
Computational biology is one of many scientific disciplines ripe for innovation and acceleration with the advent of high-performance computing (HPC). In recent years, the field of machine learning has also seen significant benefits from adopting HPC practices. In this work, we present a novel HPC pipeline that incorporates various machine-learning approaches for structure-based functional annotation of proteins on the scale of whole genomes. Our pipeline makes extensive use of deep learning and provides computational insights into best practices for training advanced deep-learning models for high-throughput data such as proteomics data. We showcase methodologies our pipeline currently supports and detail future tasks for our pipeline to envelop, including large-scale sequence comparison using SAdLSA and prediction of protein tertiary structures using AlphaFold2.
Collapse
Affiliation(s)
- Mu Gao
- Georgia Institute of Technology, Atlanta, GA
| | | | | | | | - Chen Chen
- University of Missouri, Columbia, MO
| | - Xiao Chen
- University of Missouri, Columbia, MO
| | | | | | | | | | - Ryan Prout
- Oak Ridge National Laboratory, Oak Ridge, TN
| | | | | | | | | | | | - Ada Sedova
- Oak Ridge National Laboratory, Oak Ridge, TN
| |
Collapse
|
17
|
Skolnick J, Gao M, Zhou H, Singh S. AlphaFold 2: Why It Works and Its Implications for Understanding the Relationships of Protein Sequence, Structure, and Function. J Chem Inf Model 2021; 61:4827-4831. [PMID: 34586808 DOI: 10.1021/acs.jcim.1c01114] [Citation(s) in RCA: 121] [Impact Index Per Article: 30.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
AlphaFold 2 (AF2) was the star of CASP14, the last biannual structure prediction experiment. Using novel deep learning, AF2 predicted the structures of many difficult protein targets at or near experimental resolution. Here, we present our perspective of why AF2 works and show that it is a very sophisticated fold recognition algorithm that exploits the completeness of the library of single domain PDB structures. It has also learned local side chain packing rearrangements that enable it to refine proteins to high resolution. The benefits and limitations of its ability to predict the structures of many more proteins at or close to atomic detail are discussed.
Collapse
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
| | - Mu Gao
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
| | - Hongyi Zhou
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
| | - Suresh Singh
- Twilight Design, 4 Adams Road, Kendall Park, New Jersey 08824, United States
| |
Collapse
|
18
|
Xu G, Wang Q, Ma J. OPUS-X: an open-source toolkit for protein torsion angles, secondary structure, solvent accessibility, contact map predictions and 3D folding. Bioinformatics 2021; 38:108-114. [PMID: 34478500 PMCID: PMC8696105 DOI: 10.1093/bioinformatics/btab633] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2021] [Revised: 07/09/2021] [Accepted: 09/01/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The development of an open-source platform to predict protein 1D features and 3D structure is an important task. In this paper, we report an open-source toolkit for protein 3D structure modeling, named OPUS-X. It contains three modules: OPUS-TASS2, which predicts protein torsion angles, secondary structure and solvent accessibility; OPUS-Contact, which measures the distance and orientation information between different residue pairs; and OPUS-Fold2, which uses the constraints derived from the first two modules to guide folding. RESULTS OPUS-TASS2 is an upgraded version of our previous method OPUS-TASS. OPUS-TASS2 integrates protein global structure information and significantly outperforms OPUS-TASS. OPUS-Contact combines multiple raw co-evolutionary features with protein 1D features predicted by OPUS-TASS2, and delivers better results than the open-source state-of-the-art method trRosetta. OPUS-Fold2 is a complementary version of our previous method OPUS-Fold. OPUS-Fold2 is a gradient-based protein folding framework based on the differentiable energy terms in opposed to OPUS-Fold that is a sampling-based method used to deal with the non-differentiable terms. OPUS-Fold2 exhibits comparable performance to the Rosetta folding protocol in trRosetta when using identical inputs. OPUS-Fold2 is written in Python and TensorFlow2.4, which is user-friendly to any source-code-level modification. AVAILABILITYAND IMPLEMENTATION The code and pre-trained models of OPUS-X can be downloaded from https://github.com/OPUS-MaLab/opus_x. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Gang Xu
- Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China,Zhangjiang Fudan International Innovation Center, Fudan University, Shanghai 201210, China,Shanghai AI Laboratory, Shanghai 200030, China
| | - Qinghua Wang
- Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, TX 77030, USA
| | | |
Collapse
|
19
|
Wu T, Guo Z, Hou J, Cheng J. Correction to: DeepDist: real‑value inter‑residue distance prediction with deep residual convolutional network. BMC Bioinformatics 2021; 22:354. [PMID: 34187363 PMCID: PMC8243437 DOI: 10.1186/s12859-021-04269-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Affiliation(s)
- Tianqi Wu
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA
| | - Zhiye Guo
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, MO, 63103, USA
| | - Jianlin Cheng
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA.
| |
Collapse
|
20
|
Wu T, Liu J, Guo Z, Hou J, Cheng J. MULTICOM2 open-source protein structure prediction system powered by deep learning and distance prediction. Sci Rep 2021; 11:13155. [PMID: 34162922 PMCID: PMC8222248 DOI: 10.1038/s41598-021-92395-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2021] [Accepted: 06/09/2021] [Indexed: 11/09/2022] Open
Abstract
Protein structure prediction is an important problem in bioinformatics and has been studied for decades. However, there are still few open-source comprehensive protein structure prediction packages publicly available in the field. In this paper, we present our latest open-source protein tertiary structure prediction system—MULTICOM2, an integration of template-based modeling (TBM) and template-free modeling (FM) methods. The template-based modeling uses sequence alignment tools with deep multiple sequence alignments to search for structural templates, which are much faster and more accurate than MULTICOM1. The template-free (ab initio or de novo) modeling uses the inter-residue distances predicted by DeepDist to reconstruct tertiary structure models without using any known structure as template. In the blind CASP14 experiment, the average TM-score of the models predicted by our server predictor based on the MULTICOM2 system is 0.720 for 58 TBM (regular) domains and 0.514 for 38 FM and FM/TBM (hard) domains, indicating that MULTICOM2 is capable of predicting good tertiary structures across the board. It can predict the correct fold for 76 CASP14 domains (95% regular domains and 55% hard domains) if only one prediction is made for a domain. The success rate is increased to 3% for both regular and hard domains if five predictions are made per domain. Moreover, the prediction accuracy of the pure template-free structure modeling method on both TBM and FM targets is very close to the combination of template-based and template-free modeling methods. This demonstrates that the distance-based template-free modeling method powered by deep learning can largely replace the traditional template-based modeling method even on TBM targets that TBM methods used to dominate and therefore provides a uniform structure modeling approach to any protein. Finally, on the 38 CASP14 FM and FM/TBM hard domains, MULTICOM2 server predictors (MULTICOM-HYBRID, MULTICOM-DEEP, MULTICOM-DIST) were ranked among the top 20 automated server predictors in the CASP14 experiment. After combining multiple predictors from the same research group as one entry, MULTICOM-HYBRID was ranked no. 5. The source code of MULTICOM2 is freely available at https://github.com/multicom-toolbox/multicom/tree/multicom_v2.0.
Collapse
Affiliation(s)
- Tianqi Wu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
| | - Jian Liu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
| | - Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, MO, 63103, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA.
| |
Collapse
|
21
|
Protein model accuracy estimation empowered by deep learning and inter-residue distance prediction in CASP14. Sci Rep 2021; 11:10943. [PMID: 34035363 PMCID: PMC8149836 DOI: 10.1038/s41598-021-90303-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2021] [Accepted: 05/10/2021] [Indexed: 11/28/2022] Open
Abstract
The inter-residue contact prediction and deep learning showed the promise to improve the estimation of protein model accuracy (EMA) in the 13th Critical Assessment of Protein Structure Prediction (CASP13). To further leverage the improved inter-residue distance predictions to enhance EMA, during the 2020 CASP14 experiment, we integrated several new inter-residue distance features with the existing model quality assessment features in several deep learning methods to predict the quality of protein structural models. According to the evaluation of performance in selecting the best model from the models of CASP14 targets, our three multi-model predictors of estimating model accuracy (MULTICOM-CONSTRUCT, MULTICOM-AI, and MULTICOM-CLUSTER) achieve the averaged loss of 0.073, 0.079, and 0.081, respectively, in terms of the global distance test score (GDT-TS). The three methods are ranked first, second, and third out of all 68 CASP14 predictors. MULTICOM-DEEP, the single-model predictor of estimating model accuracy (EMA), is ranked within top 10 among all the single-model EMA methods according to GDT-TS score loss. The results demonstrate that inter-residue distance features are valuable inputs for deep learning to predict the quality of protein structural models. However, larger training datasets and better ways of leveraging inter-residue distance information are needed to fully explore its potentials.
Collapse
|
22
|
Pakhrin SC, Shrestha B, Adhikari B, KC DB. Deep Learning-Based Advances in Protein Structure Prediction. Int J Mol Sci 2021; 22:5553. [PMID: 34074028 PMCID: PMC8197379 DOI: 10.3390/ijms22115553] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2021] [Revised: 05/12/2021] [Accepted: 05/18/2021] [Indexed: 12/29/2022] Open
Abstract
Obtaining an accurate description of protein structure is a fundamental step toward understanding the underpinning of biology. Although recent advances in experimental approaches have greatly enhanced our capabilities to experimentally determine protein structures, the gap between the number of protein sequences and known protein structures is ever increasing. Computational protein structure prediction is one of the ways to fill this gap. Recently, the protein structure prediction field has witnessed a lot of advances due to Deep Learning (DL)-based approaches as evidenced by the success of AlphaFold2 in the most recent Critical Assessment of protein Structure Prediction (CASP14). In this article, we highlight important milestones and progresses in the field of protein structure prediction due to DL-based methods as observed in CASP experiments. We describe advances in various steps of protein structure prediction pipeline viz. protein contact map prediction, protein distogram prediction, protein real-valued distance prediction, and Quality Assessment/refinement. We also highlight some end-to-end DL-based approaches for protein structure prediction approaches. Additionally, as there have been some recent DL-based advances in protein structure determination using Cryo-Electron (Cryo-EM) microscopy based, we also highlight some of the important progress in the field. Finally, we provide an outlook and possible future research directions for DL-based approaches in the protein structure prediction arena.
Collapse
Affiliation(s)
- Subash C. Pakhrin
- Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260, USA;
| | - Bikash Shrestha
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63121, USA;
| | - Badri Adhikari
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63121, USA;
| | - Dukka B. KC
- Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260, USA;
| |
Collapse
|
23
|
Bhattacharya S, Roche R, Shuvo MH, Bhattacharya D. Recent Advances in Protein Homology Detection Propelled by Inter-Residue Interaction Map Threading. Front Mol Biosci 2021; 8:643752. [PMID: 34046429 PMCID: PMC8148041 DOI: 10.3389/fmolb.2021.643752] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2020] [Accepted: 04/21/2021] [Indexed: 11/13/2022] Open
Abstract
Sequence-based protein homology detection has emerged as one of the most sensitive and accurate approaches to protein structure prediction. Despite the success, homology detection remains very challenging for weakly homologous proteins with divergent evolutionary profile. Very recently, deep neural network architectures have shown promising progress in mining the coevolutionary signal encoded in multiple sequence alignments, leading to reasonably accurate estimation of inter-residue interaction maps, which serve as a rich source of additional information for improved homology detection. Here, we summarize the latest developments in protein homology detection driven by inter-residue interaction map threading. We highlight the emerging trends in distant-homology protein threading through the alignment of predicted interaction maps at various granularities ranging from binary contact maps to finer-grained distance and orientation maps as well as their combination. We also discuss some of the current limitations and possible future avenues to further enhance the sensitivity of protein homology detection.
Collapse
Affiliation(s)
- Sutanu Bhattacharya
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, United States
| | - Rahmatullah Roche
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, United States
| | - Md Hossain Shuvo
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, United States
| | - Debswapna Bhattacharya
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, United States
- Department of Biological Sciences, Auburn University, Auburn, AL, United States
| |
Collapse
|
24
|
Li J, Xu J. Study of Real-Valued Distance Prediction for Protein Structure Prediction with Deep Learning. Bioinformatics 2021; 37:3197-3203. [PMID: 33961022 PMCID: PMC8504618 DOI: 10.1093/bioinformatics/btab333] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2021] [Revised: 03/07/2021] [Accepted: 04/28/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Inter-residue distance prediction by deep ResNet (convolutional residual neural network) has greatly advanced protein structure prediction. Currently the most successful structure prediction methods predict distance by discretizing it into dozens of bins. Here we study how well real-valued distance can be predicted and how useful it is for 3D structure modeling by comparing it with discrete-valued prediction based upon the same deep ResNet. RESULTS Different from the recent methods that predict only a single real value for the distance of an atom pair, we predict both the mean and standard deviation of a distance and then fold a protein by the predicted mean and deviation. Our findings include: 1) tested on the CASP13 FM (free-modeling) targets, our real-valued distance prediction obtains 81% precision on top L/5 long-range contact prediction, much better than the best CASP13 results (70%); 2) our real-valued prediction can predict correct folds for the same number of CASP13 FM targets as the best CASP13 group, despite generating only 20 decoys for each target; 3) our method greatly outperforms a very new real-valued prediction method DeepDist in both contact prediction and 3D structure modeling; and 4) when the same deep ResNet is used, our real-valued distance prediction has 1-6% higher contact and distance accuracy than our own discrete-valued prediction, but less accurate 3D structure models. AVAILABILITY AND IMPLEMENTATION https://github.com/j3xugit/RaptorX-3DModeling. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jin Li
- Toyota Technological Institute at Chicago, USA.,Department of Computer Science, University of Chicago, USA
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, USA
| |
Collapse
|
25
|
Protein Structure Refinement Using Multi-Objective Particle Swarm Optimization with Decomposition Strategy. Int J Mol Sci 2021; 22:ijms22094408. [PMID: 33922489 PMCID: PMC8122964 DOI: 10.3390/ijms22094408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 12/02/2022] Open
Abstract
Protein structure refinement is a crucial step for more accurate protein structure predictions. Most existing approaches treat it as an energy minimization problem to intuitively improve the quality of initial models by searching for structures with lower energy. Considering that a single energy function could not reflect the accurate energy landscape of all the proteins, our previous AIR 1.0 pipeline uses multiple energy functions to realize a multi-objectives particle swarm optimization-based model refinement. It is expected to provide a general balanced conformation search protocol guided from different energy evaluations. However, AIR 1.0 solves the multi-objective optimization problem as a whole, which could not result in good solution diversity and convergence on some targets. In this study, we report a decomposition-based method AIR 2.0, which is an updated version of AIR, for protein structure refinement. AIR 2.0 decomposes a multi-objective optimization problem into a number of subproblems and optimizes them simultaneously using particle swarm optimization algorithm. The solutions yielded by AIR 2.0 show better convergence and diversity compared to its previous version, which increases the possibilities of digging out better structure conformations. The experimental results on CASP13 refinement benchmark targets and blind tests in CASP 14 demonstrate the efficacy of AIR 2.0.
Collapse
|
26
|
Chen C, Wu T, Guo Z, Cheng J. Combination of deep neural network with attention mechanism enhances the explainability of protein contact prediction. Proteins 2021; 89:697-707. [PMID: 33538038 PMCID: PMC8089057 DOI: 10.1002/prot.26052] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2020] [Revised: 12/22/2020] [Accepted: 01/31/2021] [Indexed: 12/17/2022]
Abstract
Deep learning has emerged as a revolutionary technology for protein residue‐residue contact prediction since the 2012 CASP10 competition. Considerable advancements in the predictive power of the deep learning‐based contact predictions have been achieved since then. However, little effort has been put into interpreting the black‐box deep learning methods. Algorithms that can interpret the relationship between predicted contact maps and the internal mechanism of the deep learning architectures are needed to explore the essential components of contact inference and improve their explainability. In this study, we present an attention‐based convolutional neural network for protein contact prediction, which consists of two attention mechanism‐based modules: sequence attention and regional attention. Our benchmark results on the CASP13 free‐modeling targets demonstrate that the two attention modules added on top of existing typical deep learning models exhibit a complementary effect that contributes to prediction improvements. More importantly, the inclusion of the attention mechanism provides interpretable patterns that contain useful insights into the key fold‐determining residues in proteins. We expect the attention‐based model can provide a reliable and practically interpretable technique that helps break the current bottlenecks in explaining deep neural networks for contact prediction. The source code of our method is available at https://github.com/jianlin-cheng/InterpretContactMap.
Collapse
Affiliation(s)
- Chen Chen
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Tianqi Wu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| |
Collapse
|
27
|
Susanty M, Rajab TE, Hertadi R. A Review of Protein Structure Prediction using Deep Learning. BIO WEB OF CONFERENCES 2021. [DOI: 10.1051/bioconf/20214104003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Proteins are macromolecules composed of 20 types of amino acids in a specific order. Understanding how proteins fold is vital because its 3-dimensional structure determines the function of a protein. Prediction of protein structure based on amino acid strands and evolutionary information becomes the basis for other studies such as predicting the function, property or behaviour of a protein and modifying or designing new proteins to perform certain desired functions. Machine learning advances, particularly deep learning, are igniting a paradigm shift in scientific study. In this review, we summarize recent work in applying deep learning techniques to tackle problems in protein structural prediction. We discuss various deep learning approaches used to predict protein structure and future achievements and challenges. This review is expected to help provide perspectives on problems in biochemistry that can take advantage of the deep learning approach. Some of the unanswered challenges with current computational approaches are predicting the location and precision orientation of protein side chains, predicting protein interactions with DNA, RNA and other small molecules and predicting the structure of protein complexes.
Collapse
|