1
|
Thomas DPG, Garcia Fernandez CM, Haydarlou R, Feenstra KA. PIPENN-EMB ensemble net and protein embeddings generalise protein interface prediction beyond homology. Sci Rep 2025; 15:4391. [PMID: 39910126 PMCID: PMC11799512 DOI: 10.1038/s41598-025-88445-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2024] [Accepted: 01/28/2025] [Indexed: 02/07/2025] Open
Abstract
Protein interactions are crucial for understanding biological functions and disease mechanisms, but predicting these remains a complex task in computational biology. Increasingly, Deep Learning models are having success in interface prediction. This study presents PIPENN-EMB which explores the added value of using embeddings from the ProtT5-XL protein language model. Our results show substantial improvement over the previously published PIPENN model for protein interaction interface prediction, reaching an MCC of 0.313 vs. 0.249, and AUROC 0.800 vs. 0.755 on the BIO_DL_TE test set. We furthermore show that these embeddings cover a broad range of 'hand-crafted' protein features in ablation studies. PIPENN-EMB reaches state-of-the-art performance on the ZK448 dataset for protein-protein interface prediction. We showcase predictions on 25 resistance-related proteins from Mycobacterium tuberculosis. Furthermore, whereas other state-of-the-art sequence-based methods perform worse for proteins that have little recognisable homology in their training data, PIPENN-EMB generalises to remote homologs, yielding stable AUROC across all three test sets with less than 30% sequence identity to the training dataset, and even to proteins with less than 15% sequence identity.
Collapse
Affiliation(s)
- David P G Thomas
- Department of Computer Science, Vrije Universiteit Amsterdam, 1081HV, Amsterdam, The Netherlands
| | | | - Reza Haydarlou
- Department of Computer Science, Vrije Universiteit Amsterdam, 1081HV, Amsterdam, The Netherlands.
| | - K Anton Feenstra
- Department of Computer Science, Vrije Universiteit Amsterdam, 1081HV, Amsterdam, The Netherlands.
- AIMMS - Amsterdam Institute of Molecular and Life Sciences, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.
| |
Collapse
|
2
|
Vural O, Jololian L. Machine learning approaches for predicting protein-ligand binding sites from sequence data. FRONTIERS IN BIOINFORMATICS 2025; 5:1520382. [PMID: 39963299 PMCID: PMC11830693 DOI: 10.3389/fbinf.2025.1520382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2024] [Accepted: 01/10/2025] [Indexed: 02/20/2025] Open
Abstract
Proteins, composed of amino acids, are crucial for a wide range of biological functions. Proteins have various interaction sites, one of which is the protein-ligand binding site, essential for molecular interactions and biochemical reactions. These sites enable proteins to bind with other molecules, facilitating key biological functions. Accurate prediction of these binding sites is pivotal in computational drug discovery, helping to identify therapeutic targets and facilitate treatment development. Machine learning has made significant contributions to this field by improving the prediction of protein-ligand interactions. This paper reviews studies that use machine learning to predict protein-ligand binding sites from sequence data, focusing on recent advancements. The review examines various embedding methods and machine learning architectures, addressing current challenges and the ongoing debates in the field. Additionally, research gaps in the existing literature are highlighted, and potential future directions for advancing the field are discussed. This study provides a thorough overview of sequence-based approaches for predicting protein-ligand binding sites, offering insights into the current state of research and future possibilities.
Collapse
Affiliation(s)
- Orhun Vural
- Department of Electrical and Computer Engineering, The University of Alabama at Birmingham, Birmingham, AL, United States
| | | |
Collapse
|
3
|
Hao S, Li CY, Hu X, Feng Z, Zhang G, Yang C, Hu H. S-DCNN: prediction of ATP binding residues by deep convolutional neural network based on SMOTE. Front Genet 2025; 15:1513201. [PMID: 39834546 PMCID: PMC11744016 DOI: 10.3389/fgene.2024.1513201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2024] [Accepted: 12/11/2024] [Indexed: 01/22/2025] Open
Abstract
Background The realization of many protein functions requires binding with ligands. As a significant protein-binding ligand, ATP plays a crucial role in various biological processes. Currently, the precise prediction of ATP binding residues remains challenging. Methods Based on the sequence information, this paper introduces a method called S-DCNN for predicting ATP binding residues, utilizing a deep convolutional neural network (DCNN) enhanced with the synthetic minority over-sampling technique (SMOTE). Results The incorporation of additional feature parameters such as dihedral angles, energy, and propensity factors into the standard parameter set resulted in a significant enhancement in prediction accuracy on the ATP-289 dataset. The S-DCNN achieved the highest Matthews correlation coefficient value of 0.5031 and an accuracy rate of 97.06% on an independent test set. Furthermore, when applied to the ATP-221 and ATP-388 datasets for validation, the S-DCNN outperformed existing methods on ATP-221 and performed comparably to other methods on ATP-388 during independent testing. Conclusion Our experimental results underscore the efficacy of the S-DCNN in accurately predicting ATP binding residues, establishing it as a potent tool in the prediction of ATP binding residues.
Collapse
Affiliation(s)
- Sixi Hao
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- School of Mathematics and Statistics, Xinyang College, Xinyang, China
| | - Cai-Yan Li
- School of Computer Science and Technology/Baotou Medical College, Baotou, China
| | - Xiuzhen Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Gaimei Zhang
- Department of Obstetrics and Gynecology, Hohhot First Hospital, Hohhot, China
| | - Caiyun Yang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Huimin Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| |
Collapse
|
4
|
Carpenter KA, Altman RB. Databases of ligand-binding pockets and protein-ligand interactions. Comput Struct Biotechnol J 2024; 23:1320-1338. [PMID: 38585646 PMCID: PMC10997877 DOI: 10.1016/j.csbj.2024.03.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2024] [Revised: 03/16/2024] [Accepted: 03/17/2024] [Indexed: 04/09/2024] Open
Abstract
Many research groups and institutions have created a variety of databases curating experimental and predicted data related to protein-ligand binding. The landscape of available databases is dynamic, with new databases emerging and established databases becoming defunct. Here, we review the current state of databases that contain binding pockets and protein-ligand binding interactions. We have compiled a list of such databases, fifty-three of which are currently available for use. We discuss variation in how binding pockets are defined and summarize pocket-finding methods. We organize the fifty-three databases into subgroups based on goals and contents, and describe standard use cases. We also illustrate that pockets within the same protein are characterized differently across different databases. Finally, we assess critical issues of sustainability, accessibility and redundancy.
Collapse
Affiliation(s)
- Kristy A. Carpenter
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Russ B. Altman
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Department of Medicine, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
5
|
Ishitani R, Takemoto M, Tomii K. Protein ligand binding site prediction using graph transformer neural network. PLoS One 2024; 19:e0308425. [PMID: 39106255 DOI: 10.1371/journal.pone.0308425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2024] [Accepted: 07/23/2024] [Indexed: 08/09/2024] Open
Abstract
Ligand binding site prediction is a crucial initial step in structure-based drug discovery. Although several methods have been proposed previously, including those using geometry based and machine learning techniques, their accuracy is considered to be still insufficient. In this study, we introduce an approach that leverages a graph transformer neural network to rank the results of a geometry-based pocket detection method. We also created a larger training dataset compared to the conventionally used sc-PDB and investigated the correlation between the dataset size and prediction performance. Our findings indicate that utilizing a graph transformer-based method alongside a larger training dataset could enhance the performance of ligand binding site prediction.
Collapse
Affiliation(s)
- Ryuichiro Ishitani
- Division of Computational Drug Discovery and Design, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo, Japan
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
- Preferred Networks, Inc., Chiyoda-ku, Tokyo, Japan
| | - Mizuki Takemoto
- Division of Computational Drug Discovery and Design, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo, Japan
| | - Kentaro Tomii
- Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo, Japan
| |
Collapse
|
6
|
Jeevan K, Palistha S, Tayara H, Chong KT. PUResNetV2.0: a deep learning model leveraging sparse representation for improved ligand binding site prediction. J Cheminform 2024; 16:66. [PMID: 38849917 PMCID: PMC11157904 DOI: 10.1186/s13321-024-00865-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/29/2024] [Accepted: 05/27/2024] [Indexed: 06/09/2024] Open
Abstract
Accurate ligand binding site prediction (LBSP) within proteins is essential for drug discovery. We developed ProteinUNetResNetV2.0 (PUResNetV2.0), leveraging sparse representation of protein structures to improve LBSP accuracy. Our training dataset included protein complexes from 4729 protein families. Evaluations on benchmark datasets showed that PUResNetV2.0 achieved an 85.4% Distance Center Atom (DCA) success rate and a 74.7% F1 Score on the Holo801 dataset, outperforming existing methods. However, its performance in specific cases, such as RNA, DNA, peptide-like ligand, and ion binding site prediction, was limited due to constraints in our training data. Our findings underscore the potential of sparse representation in LBSP, especially for oligomeric structures, suggesting PUResNetV2.0 as a promising tool for computational drug discovery.
Collapse
Affiliation(s)
- Kandel Jeevan
- Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju, 54896, South Korea
| | - Shrestha Palistha
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju, 54896, South Korea.
| | - Kil T Chong
- Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju, 54896, South Korea.
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea.
- School of International Engineering and Science, Jeonbuk National University, Jeonju, 54896, South Korea.
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju, 54896, South Korea.
| |
Collapse
|
7
|
Xia Y, Pan X, Shen HB. A comprehensive survey on protein-ligand binding site prediction. Curr Opin Struct Biol 2024; 86:102793. [PMID: 38447285 DOI: 10.1016/j.sbi.2024.102793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Revised: 02/18/2024] [Accepted: 02/18/2024] [Indexed: 03/08/2024]
Abstract
Protein-ligand binding site prediction is critical for protein function annotation and drug discovery. Biological experiments are time-consuming and require significant equipment, materials, and labor resources. Developing accurate and efficient computational methods for protein-ligand interaction prediction is essential. Here, we summarize the key challenges associated with ligand binding site (LBS) prediction and introduce recently published methods from their input features, computational algorithms, and ligand types. Furthermore, we investigate the specificity of allosteric site identification as a particular LBS type. Finally, we discuss the prospective directions for machine learning-based LBS prediction in the near future.
Collapse
Affiliation(s)
- Ying Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
| |
Collapse
|
8
|
Zhang Y, Li S, Meng K, Sun S. Machine Learning for Sequence and Structure-Based Protein-Ligand Interaction Prediction. J Chem Inf Model 2024; 64:1456-1472. [PMID: 38385768 DOI: 10.1021/acs.jcim.3c01841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/23/2024]
Abstract
Developing new drugs is too expensive and time -consuming. Accurately predicting the interaction between drugs and targets will likely change how the drug is discovered. Machine learning-based protein-ligand interaction prediction has demonstrated significant potential. In this paper, computational methods, focusing on sequence and structure to study protein-ligand interactions, are examined. Therefore, this paper starts by presenting an overview of the data sets applied in this area, as well as the various approaches applied for representing proteins and ligands. Then, sequence-based and structure-based classification criteria are subsequently utilized to categorize and summarize both the classical machine learning models and deep learning models employed in protein-ligand interaction studies. Moreover, the evaluation methods and interpretability of these models are proposed. Furthermore, delving into the diverse applications of protein-ligand interaction models in drug research is presented. Lastly, the current challenges and future directions in this field are addressed.
Collapse
Affiliation(s)
- Yunjiang Zhang
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| | - Shuyuan Li
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| | - Kong Meng
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| | - Shaorui Sun
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| |
Collapse
|
9
|
Qi X, Zhao Y, Qi Z, Hou S, Chen J. Machine Learning Empowering Drug Discovery: Applications, Opportunities and Challenges. Molecules 2024; 29:903. [PMID: 38398653 PMCID: PMC10892089 DOI: 10.3390/molecules29040903] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2024] [Revised: 02/08/2024] [Accepted: 02/14/2024] [Indexed: 02/25/2024] Open
Abstract
Drug discovery plays a critical role in advancing human health by developing new medications and treatments to combat diseases. How to accelerate the pace and reduce the costs of new drug discovery has long been a key concern for the pharmaceutical industry. Fortunately, by leveraging advanced algorithms, computational power and biological big data, artificial intelligence (AI) technology, especially machine learning (ML), holds the promise of making the hunt for new drugs more efficient. Recently, the Transformer-based models that have achieved revolutionary breakthroughs in natural language processing have sparked a new era of their applications in drug discovery. Herein, we introduce the latest applications of ML in drug discovery, highlight the potential of advanced Transformer-based ML models, and discuss the future prospects and challenges in the field.
Collapse
Affiliation(s)
- Xin Qi
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| | - Yuanchun Zhao
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| | - Zhuang Qi
- School of Software, Shandong University, Jinan 250101, China;
| | - Siyu Hou
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| | - Jiajia Chen
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China; (Y.Z.); (S.H.); (J.C.)
| |
Collapse
|
10
|
Habeeb M, You HW, Umapathi M, Ravikumar KK, Hariyadi, Mishra S. Strategies of Artificial intelligence tools in the domain of nanomedicine. J Drug Deliv Sci Technol 2024; 91:105157. [DOI: 10.1016/j.jddst.2023.105157] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2025]
|
11
|
Yu L, He X, Fang X, Liu L, Liu J. Deep Learning with Geometry-Enhanced Molecular Representation for Augmentation of Large-Scale Docking-Based Virtual Screening. J Chem Inf Model 2023; 63:6501-6514. [PMID: 37882338 DOI: 10.1021/acs.jcim.3c01371] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2023]
Abstract
Structure-based virtual screening has been a crucial tool in drug discovery for decades. However, as the chemical space expands, the existing structure-based virtual screening techniques based on molecular docking and scoring struggle to handle billion-entry ultralarge libraries due to the high computational cost. To address this challenge, people have resorted to machine learning techniques to enhance structure-based virtual screening for efficiently exploring the vast chemical space. In those cases, compounds are usually treated as sequential strings or two-dimensional topology graphs, limiting their ability to incorporate three-dimensional structural information for downstream tasks. We herein propose a novel deep learning protocol, GEM-Screen, which utilizes the geometry-enhanced molecular representation of the compounds docking to a specific target and is trained on docking scores of a small fraction of a library through an active learning strategy to approximate the docking outcome for yet nontraining entries. This protocol is applied to virtual screening campaigns against the AmpC and D4 targets, demonstrating that GEM-Screen enriches more than 90% of the hit scaffolds for AmpC in the top 4% of model predictions and more than 80% of the hit scaffolds for D4 in the same top-ranking size of library. GEM-Screen can be used in conjunction with traditional docking programs for docking of only the top-ranked compounds to avoid the exhaustive docking of the whole library, thus allowing for discovering top-scoring compounds from billion-entry libraries in a rapid yet accurate fashion.
Collapse
Affiliation(s)
- Lan Yu
- School of Science, China Pharmaceutical University, Nanjing 210009, China
| | - Xiao He
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
- New York University-East China Normal University Center for Computational Chemistry, New York University Shanghai, Shanghai 200062, China
| | - Xiaomin Fang
- Baidu International Technology (Shenzhen) Co., Ltd., Shenzhen 518063, China
| | - Lihang Liu
- Baidu International Technology (Shenzhen) Co., Ltd., Shenzhen 518063, China
| | - Jinfeng Liu
- School of Science, China Pharmaceutical University, Nanjing 210009, China
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 210009, China
| |
Collapse
|
12
|
Guan S, Zou Q, Wu H, Ding Y. Protein-DNA Binding Residues Prediction Using a Deep Learning Model With Hierarchical Feature Extraction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2619-2628. [PMID: 35834447 DOI: 10.1109/tcbb.2022.3190933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Biologically important effects occur when proteins bind to other substances, of which binding to DNA is a crucial one. Therefore, accurate identification of protein-DNA binding residues is important for further understanding of the protein-DNA interaction mechanism. Although wet-lab methods can accurately obtain the location of bound residues, it requires significant human, financial and time costs. There is thus an urgent need to develop efficient computational-based methods. Most current state-of-the-art methods are two-step approaches: the first step uses a sliding window technique to extract residue features; the second step uses each residue as an input to the model for prediction. This has a negative impact on the efficiency of prediction and ease of use. In this study, we propose a sequence-to-sequence (seq2seq) model that can input the entire protein sequence of variable length and use two modules, Transformer Encoder Block and Feature Extracting Block, for hierarchical feature extraction, where Transformer Encoder Block is used to extract global features, and then Feature Extracting Block is used to extract local features to further improve the recognition capability of the model. The comparison results on two benchmark datasets, namely PDNA-543 and PDNA-41, prove the effectiveness of our method in identifying protein-DNA binding residues.
Collapse
|
13
|
Tang M, Li B, Chen H. Application of message passing neural networks for molecular property prediction. Curr Opin Struct Biol 2023; 81:102616. [PMID: 37267824 DOI: 10.1016/j.sbi.2023.102616] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2022] [Revised: 04/28/2023] [Accepted: 05/04/2023] [Indexed: 06/04/2023]
Abstract
Accurate molecular property prediction, as one of the classical cheminformatics topics, plays a prominent role in the fields of computer-aided drug design. For instance, property prediction models can be used to quickly screen large molecular libraries to find lead compounds. Message-passing neural networks (MPNNs), a sub-class of Graph neural networks (GNNs), have recently been demonstrated to outperform other deep learning methods on a variety of tasks, including the prediction of molecular characteristics. In this survey, we provide a brief review of the MPNN models and their applications on molecular property prediction.
Collapse
Affiliation(s)
- Miru Tang
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong Province, China; Bioland Laboratory (Guangzhou Regenerative Medicine and Health-Guangdong Laboratory), Guangzhou, 510530, China; State Key Laboratory of Respiratory Disease, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, China
| | - Baiqing Li
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong Province, China
| | - Hongming Chen
- Guangzhou Laboratory, Guangzhou, 510005, Guangdong Province, China.
| |
Collapse
|
14
|
Mohanty M, Mohanty PS. Molecular docking in organic, inorganic, and hybrid systems: a tutorial review. MONATSHEFTE FUR CHEMIE 2023; 154:1-25. [PMID: 37361694 PMCID: PMC10243279 DOI: 10.1007/s00706-023-03076-1] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/27/2022] [Accepted: 05/08/2023] [Indexed: 06/28/2023]
Abstract
Molecular docking simulation is a very popular and well-established computational approach and has been extensively used to understand molecular interactions between a natural organic molecule (ideally taken as a receptor) such as an enzyme, protein, DNA, RNA and a natural or synthetic organic/inorganic molecule (considered as a ligand). But the implementation of docking ideas to synthetic organic, inorganic, or hybrid systems is very limited with respect to their use as a receptor despite their huge popularity in different experimental systems. In this context, molecular docking can be an efficient computational tool for understanding the role of intermolecular interactions in hybrid systems that can help in designing materials on mesoscale for different applications. The current review focuses on the implementation of the docking method in organic, inorganic, and hybrid systems along with examples from different case studies. We describe different resources, including databases and tools required in the docking study and applications. The concept of docking techniques, types of docking models, and the role of different intermolecular interactions involved in the docking process to understand the binding mechanisms are explained. Finally, the challenges and limitations of dockings are also discussed in this review. Graphical abstract
Collapse
Affiliation(s)
- Madhuchhanda Mohanty
- School of Biotechnology, Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, Bhubaneswar, 751024 India
| | - Priti S. Mohanty
- School of Biotechnology, Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, Bhubaneswar, 751024 India
- School of Chemical Technology, Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, Bhubaneswar, 751024 India
| |
Collapse
|
15
|
Yousefi N, Yazdani-Jahromi M, Tayebi A, Kolanthai E, Neal CJ, Banerjee T, Gosai A, Balasubramanian G, Seal S, Ozmen Garibay O. BindingSite-AugmentedDTA: enabling a next-generation pipeline for interpretable prediction models in drug repurposing. Brief Bioinform 2023; 24:7140297. [PMID: 37096593 DOI: 10.1093/bib/bbad136] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2022] [Revised: 03/02/2022] [Accepted: 03/16/2023] [Indexed: 04/26/2023] Open
Abstract
While research into drug-target interaction (DTI) prediction is fairly mature, generalizability and interpretability are not always addressed in the existing works in this field. In this paper, we propose a deep learning (DL)-based framework, called BindingSite-AugmentedDTA, which improves drug-target affinity (DTA) predictions by reducing the search space of potential-binding sites of the protein, thus making the binding affinity prediction more efficient and accurate. Our BindingSite-AugmentedDTA is highly generalizable as it can be integrated with any DL-based regression model, while it significantly improves their prediction performance. Also, unlike many existing models, our model is highly interpretable due to its architecture and self-attention mechanism, which can provide a deeper understanding of its underlying prediction mechanism by mapping attention weights back to protein-binding sites. The computational results confirm that our framework can enhance the prediction performance of seven state-of-the-art DTA prediction algorithms in terms of four widely used evaluation metrics, including concordance index, mean squared error, modified squared correlation coefficient ($r^2_m$) and the area under the precision curve. We also contribute to three benchmark drug-traget interaction datasets by including additional information on 3D structure of all proteins contained in those datasets, which include the two most commonly used datasets, namely Kiba and Davis, as well as the data from IDG-DREAM drug-kinase binding prediction challenge. Furthermore, we experimentally validate the practical potential of our proposed framework through in-lab experiments. The relatively high agreement between computationally predicted and experimentally observed binding interactions supports the potential of our framework as the next-generation pipeline for prediction models in drug repurposing.
Collapse
Affiliation(s)
- Niloofar Yousefi
- Industrial Engineering and Management Systems, University of Central Florida, 32816, 4000 Central Florida Blvd., Orlando, FL, USA
| | - Mehdi Yazdani-Jahromi
- Computer Science, University of Central Florida, 32816, 4000 Central Florida Blvd., Orlando, FL, USA
| | - Aida Tayebi
- Industrial Engineering and Management Systems, University of Central Florida, 32816, 4000 Central Florida Blvd., Orlando, FL, USA
| | - Elayaraja Kolanthai
- College of Medicine, Bionix Cluster, University of Central Florida, 4000 Central Florida Blvd., Orlando 32816, FL, USA
| | - Craig J Neal
- College of Medicine, Bionix Cluster, University of Central Florida, 4000 Central Florida Blvd., Orlando 32816, FL, USA
| | - Tanumoy Banerjee
- Department of Mechanical Engineering and Mechanics, Lehigh University, Bethlehem 18015, PA, USA
| | | | - Ganesh Balasubramanian
- Department of Mechanical Engineering and Mechanics, Lehigh University, Bethlehem 18015, PA, USA
| | - Sudipta Seal
- College of Medicine, Bionix Cluster, University of Central Florida, 4000 Central Florida Blvd., Orlando 32816, FL, USA
- Advanced Materials Processing and Analysis Center, Department of Materials Science and Engineering, University of Central Florida, 4000 Central Florida Blvd., Orlando 32816, FL, USA
| | - Ozlem Ozmen Garibay
- Industrial Engineering and Management Systems, University of Central Florida, 32816, 4000 Central Florida Blvd., Orlando, FL, USA
| |
Collapse
|
16
|
Bordin N, Dallago C, Heinzinger M, Kim S, Littmann M, Rauer C, Steinegger M, Rost B, Orengo C. Novel machine learning approaches revolutionize protein knowledge. Trends Biochem Sci 2023; 48:345-359. [PMID: 36504138 PMCID: PMC10570143 DOI: 10.1016/j.tibs.2022.11.001] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Revised: 10/24/2022] [Accepted: 11/17/2022] [Indexed: 12/10/2022]
Abstract
Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.
Collapse
Affiliation(s)
- Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
| | - Christian Dallago
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; VantAI, 151 W 42nd Street, New York, NY 10036, USA
| | - Michael Heinzinger
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
| | - Stephanie Kim
- School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Maria Littmann
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
| | - Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Burkhard Rost
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK.
| |
Collapse
|
17
|
Rashid S, Sundaram S, Kwoh CK. Empirical Study of Protein Feature Representation on Deep Belief Networks Trained With Small Data for Secondary Structure Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:955-966. [PMID: 35439138 DOI: 10.1109/tcbb.2022.3168676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Protein secondary structure (SS) prediction is a classic problem of computational biology and is widely used in structural characterization and to infer homology. While most SS predictors have been trained on thousands of sequences, a previous approach had developed a compact model of training proteins that used a C-Alpha, C-Beta Side Chain (CABS)-algorithm derived energy based feature representation. Here, the previous approach is extended to Deep Belief Networks (DBN). Deep learning methods are notorious for requiring large datasets and there is a wide consensus that training deep models from scratch on small datasets, works poorly. By contrast, we demonstrate a simple DBN architecture containing a single hidden layer, trained only on the CB513 dataset. Testing on an independent set of G Switch proteins improved the Q 3 score of the previous compact model by almost 3%. The findings are further confirmed by comparison to several deep learning models which are trained on thousands of proteins. Finally, the DBN performance is also compared with Position Specific Scoring Matrix (PSSM)-profile based feature representation. The importance of (i) structural information in protein feature representation and (ii) complementary small dataset learning approaches for detection of structural fold switching are demonstrated.
Collapse
|
18
|
Yu T, Boob AG, Volk MJ, Liu X, Cui H, Zhao H. Machine learning-enabled retrobiosynthesis of molecules. Nat Catal 2023. [DOI: 10.1038/s41929-022-00909-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/18/2023]
|
19
|
Giri N, Cheng J. Improving Protein-Ligand Interaction Modeling with cryo-EM Data, Templates, and Deep Learning in 2021 Ligand Model Challenge. Biomolecules 2023; 13:132. [PMID: 36671518 PMCID: PMC9855343 DOI: 10.3390/biom13010132] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2022] [Revised: 01/04/2023] [Accepted: 01/06/2023] [Indexed: 01/11/2023] Open
Abstract
Elucidating protein-ligand interaction is crucial for studying the function of proteins and compounds in an organism and critical for drug discovery and design. The problem of protein-ligand interaction is traditionally tackled by molecular docking and simulation, which is based on physical forces and statistical potentials and cannot effectively leverage cryo-EM data and existing protein structural information in the protein-ligand modeling process. In this work, we developed a deep learning bioinformatics pipeline (DeepProLigand) to predict protein-ligand interactions from cryo-EM density maps of proteins and ligands. DeepProLigand first uses a deep learning method to predict the structure of proteins from cryo-EM maps, which is averaged with a reference (template) structure of the proteins to produce a combined structure to add ligands. The ligands are then identified and added into the structure to generate a protein-ligand complex structure, which is further refined. The method based on the deep learning prediction and template-based modeling was blindly tested in the 2021 EMDataResource Ligand Challenge and was ranked first in fitting ligands to cryo-EM density maps. These results demonstrate that the deep learning bioinformatics approach is a promising direction for modeling protein-ligand interactions on cryo-EM data using prior structural information.
Collapse
Affiliation(s)
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
| |
Collapse
|
20
|
Gutiérrez-Mondragón MA, König C, Vellido A. Layer-Wise Relevance Analysis for Motif Recognition in the Activation Pathway of the β2- Adrenergic GPCR Receptor. Int J Mol Sci 2023; 24:ijms24021155. [PMID: 36674669 PMCID: PMC9865744 DOI: 10.3390/ijms24021155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2022] [Revised: 12/22/2022] [Accepted: 12/30/2022] [Indexed: 01/11/2023] Open
Abstract
G-protein-coupled receptors (GPCRs) are cell membrane proteins of relevance as therapeutic targets, and are associated to the development of treatments for illnesses such as diabetes, Alzheimer's, or even cancer. Therefore, comprehending the underlying mechanisms of the receptor functional properties is of particular interest in pharmacoproteomics and in disease therapy at large. Their interaction with ligands elicits multiple molecular rearrangements all along their structure, inducing activation pathways that distinctly influence the cell response. In this work, we studied GPCR signaling pathways from molecular dynamics simulations as they provide rich information about the dynamic nature of the receptors. We focused on studying the molecular properties of the receptors using deep-learning-based methods. In particular, we designed and trained a one-dimensional convolution neural network and illustrated its use in a classification of conformational states: active, intermediate, or inactive, of the β2-adrenergic receptor when bound to the full agonist BI-167107. Through a novel explainability-oriented investigation of the prediction results, we were able to identify and assess the contribution of individual motifs (residues) influencing a particular activation pathway. Consequently, we contribute a methodology that assists in the elucidation of the underlying mechanisms of receptor activation-deactivation.
Collapse
Affiliation(s)
- Mario A. Gutiérrez-Mondragón
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
| | - Caroline König
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Correspondence:
| | - Alfredo Vellido
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
| |
Collapse
|
21
|
Chang Y, Hawkins BA, Du JJ, Groundwater PW, Hibbs DE, Lai F. A Guide to In Silico Drug Design. Pharmaceutics 2022; 15:pharmaceutics15010049. [PMID: 36678678 PMCID: PMC9867171 DOI: 10.3390/pharmaceutics15010049] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2022] [Revised: 12/16/2022] [Accepted: 12/17/2022] [Indexed: 12/28/2022] Open
Abstract
The drug discovery process is a rocky path that is full of challenges, with the result that very few candidates progress from hit compound to a commercially available product, often due to factors, such as poor binding affinity, off-target effects, or physicochemical properties, such as solubility or stability. This process is further complicated by high research and development costs and time requirements. It is thus important to optimise every step of the process in order to maximise the chances of success. As a result of the recent advancements in computer power and technology, computer-aided drug design (CADD) has become an integral part of modern drug discovery to guide and accelerate the process. In this review, we present an overview of the important CADD methods and applications, such as in silico structure prediction, refinement, modelling and target validation, that are commonly used in this area.
Collapse
Affiliation(s)
- Yiqun Chang
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Bryson A. Hawkins
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Jonathan J. Du
- Department of Biochemistry, Emory University School of Medicine, Atlanta, GA 30322, USA
| | - Paul W. Groundwater
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
| | - David E. Hibbs
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
| | - Felcia Lai
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
- Correspondence:
| |
Collapse
|
22
|
Liao J, Wang Q, Wu F, Huang Z. In Silico Methods for Identification of Potential Active Sites of Therapeutic Targets. Molecules 2022; 27:7103. [PMID: 36296697 PMCID: PMC9609013 DOI: 10.3390/molecules27207103] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 08/12/2022] [Accepted: 08/25/2022] [Indexed: 07/30/2023] Open
Abstract
Target identification is an important step in drug discovery, and computer-aided drug target identification methods are attracting more attention compared with traditional drug target identification methods, which are time-consuming and costly. Computer-aided drug target identification methods can greatly reduce the searching scope of experimental targets and associated costs by identifying the diseases-related targets and their binding sites and evaluating the druggability of the predicted active sites for clinical trials. In this review, we introduce the principles of computer-based active site identification methods, including the identification of binding sites and assessment of druggability. We provide some guidelines for selecting methods for the identification of binding sites and assessment of druggability. In addition, we list the databases and tools commonly used with these methods, present examples of individual and combined applications, and compare the methods and tools. Finally, we discuss the challenges and limitations of binding site identification and druggability assessment at the current stage and provide some recommendations and future perspectives.
Collapse
Affiliation(s)
- Jianbo Liao
- Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
- The Second School of Clinical Medicine, Guangdong Medical University, Dongguan 523808, China
| | - Qinyu Wang
- Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
| | - Fengxu Wu
- Hubei Key Laboratory of Wudang Local Chinese Medicine Research, School of Pharmaceutical Sciences, Hubei University of Medicine, Shiyan 442000, China
| | - Zunnan Huang
- Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
- Marine Biomedical Research Institute of Guangdong Zhanjiang, Zhanjiang 524023, China
| |
Collapse
|
23
|
Hao S, Hu X, Feng Z, Sun K, You X, Wang Z, Yang C. Prediction of metal ion ligand binding residues by adding disorder value and propensity factors based on deep learning algorithm. Front Genet 2022; 13:969412. [PMID: 36035120 PMCID: PMC9402973 DOI: 10.3389/fgene.2022.969412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2022] [Accepted: 07/04/2022] [Indexed: 11/13/2022] Open
Abstract
Proteins need to interact with different ligands to perform their functions. Among the ligands, the metal ion is a major ligand. At present, the prediction of protein metal ion ligand binding residues is a challenge. In this study, we selected Zn2+, Cu2+, Fe2+, Fe3+, Co2+, Mn2+, Ca2+ and Mg2+ metal ion ligands from the BioLip database as the research objects. Based on the amino acids, the physicochemical properties and predicted structural information, we introduced the disorder value as the feature parameter. In addition, based on the component information, position weight matrix and information entropy, we introduced the propensity factor as prediction parameters. Then, we used the deep neural network algorithm for the prediction. Furtherly, we made an optimization for the hyper-parameters of the deep learning algorithm and obtained improved results than the previous IonSeq method.
Collapse
Affiliation(s)
- Sixi Hao
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
| | - Xiuzhen Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- *Correspondence: Xiuzhen Hu, ; Zhenxing Feng,
| | - Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
- *Correspondence: Xiuzhen Hu, ; Zhenxing Feng,
| | - Kai Sun
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
| | - Xiaoxiao You
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
| | - Ziyang Wang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
| | - Caiyun Yang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, China
| |
Collapse
|
24
|
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front Bioeng Biotechnol 2022; 10:788300. [PMID: 35875501 PMCID: PMC9301016 DOI: 10.3389/fbioe.2022.788300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2021] [Accepted: 05/25/2022] [Indexed: 11/23/2022] Open
Abstract
Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. There are many tasks such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks that are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call binomial artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including What kind of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions such as Is it possible to discover what the rules of protein evolution are with the binomial AI-PS? How do protein folding pathways evolve? What are the rules that dictate the folds? What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniprotKB/Swissprot were the most common ones (21 and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect, integrate multidimensional data features, and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies. Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the "state of the art" on research in the AI-PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways are currently non-existent in nature, nor are they used in the chemical industry or biomedical field.
Collapse
Affiliation(s)
- Jalil Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Luis Ochoa-Toledo
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
| | - Mario Javier Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Atocha Aliseda
- Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
| | - Fernando Pérez-Escamirosa
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
| | | | - Francine Ochoa-Fernández
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Ricardo Zamora-Solís
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Sebastián Villalobos-Alva
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Cristina Revilla-Monsalve
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| | - Nicolás Kemper-Valverde
- Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
| | - Myriam M. Altamirano-Bustamante
- Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
| |
Collapse
|
25
|
Yan X, Lu Y, Li Z, Wei Q, Gao X, Wang S, Wu S, Cui S. PointSite: A Point Cloud Segmentation Tool for Identification of Protein Ligand Binding Atoms. J Chem Inf Model 2022; 62:2835-2845. [PMID: 35621730 DOI: 10.1021/acs.jcim.1c01512] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Accurate identification of ligand binding sites (LBS) on a protein structure is critical for understanding protein function and designing structure-based drugs. As the previous pocket-centric methods are usually based on the investigation of pseudo-surface-points outside the protein structure, they cannot fully take advantage of the local connectivity of atoms within the protein, as well as the global 3D geometrical information from all the protein atoms. In this paper, we propose a novel point clouds segmentation method, PointSite, for accurate identification of protein ligand binding atoms, which performs protein LBS identification at the atom-level in a protein-centric manner. Specifically, we first transfer the original 3D protein structure to point clouds and then conduct segmentation through Submanifold Sparse Convolution based U-Net. With the fine-grained atom-level binding atoms representation and enhanced feature learning, PointSite can outperform previous methods in atom Intersection over Union (atom-IoU) by a large margin. Furthermore, our segmented binding atoms, that is, atoms with high probability predicted by our model can work as a filter on predictions achieved by previous pocket-centric approaches, which significantly decreases the false-positive of LBS candidates. Besides, we further directly extend PointSite trained on bound proteins for LBS identification on unbound proteins, which demonstrates the superior generalization capacity of PointSite. Through cascaded filter and reranking aided by the segmented atoms, state-of-the-art performance can be achieved over various canonical benchmarks, CAMEO hard targets, and unbound proteins in terms of the commonly used DCA criteria.
Collapse
Affiliation(s)
- Xu Yan
- The Chinese University of Hongkong (Shenzhen) & Future Network of Intelligence Institute, Shenzhen 518172, China
| | - Yingfeng Lu
- The Chinese University of Hongkong (Shenzhen) & Future Network of Intelligence Institute, Shenzhen 518172, China
| | - Zhen Li
- The Chinese University of Hongkong (Shenzhen) & Future Network of Intelligence Institute, Shenzhen 518172, China
| | - Qing Wei
- The Chinese University of Hongkong (Shenzhen) & Future Network of Intelligence Institute, Shenzhen 518172, China
| | - Xin Gao
- King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
| | - Sheng Wang
- Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China.,CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Song Wu
- Shenzhen University, Shenzhen 518060, China
| | - Shuguang Cui
- The Chinese University of Hongkong (Shenzhen) & Future Network of Intelligence Institute, Shenzhen 518172, China
| |
Collapse
|
26
|
Scafuri B, Verdino A, D'Arminio N, Marabotti A. Computational methods to assist in the discovery of pharmacological chaperones for rare diseases. Brief Bioinform 2022; 23:6590149. [PMID: 35595532 DOI: 10.1093/bib/bbac198] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2022] [Revised: 04/13/2022] [Accepted: 04/28/2022] [Indexed: 12/21/2022] Open
Abstract
Pharmacological chaperones are chemical compounds able to bind proteins and stabilize them against denaturation and following degradation. Some pharmacological chaperones have been approved, or are under investigation, for the treatment of rare inborn errors of metabolism, caused by genetic mutations that often can destabilize the structure of the wild-type proteins expressed by that gene. Given that, for rare diseases, there is a general lack of pharmacological treatments, many expectations are poured out on this type of compounds. However, their discovery is not straightforward. In this review, we would like to focus on the computational methods that can assist and accelerate the search for these compounds, showing also examples in which these methods were successfully applied for the discovery of promising molecules belonging to this new category of pharmacologically active compounds.
Collapse
Affiliation(s)
- Bernardina Scafuri
- Department of Chemistry and Biology "A. Zambelli", University of Salerno, via Giovanni Paolo II, 132, 84084 Fisciano (SA), Italy
| | - Anna Verdino
- Department of Chemistry and Biology "A. Zambelli", University of Salerno, via Giovanni Paolo II, 132, 84084 Fisciano (SA), Italy
| | - Nancy D'Arminio
- Department of Chemistry and Biology "A. Zambelli", University of Salerno, via Giovanni Paolo II, 132, 84084 Fisciano (SA), Italy
| | - Anna Marabotti
- Department of Chemistry and Biology "A. Zambelli", University of Salerno, via Giovanni Paolo II, 132, 84084 Fisciano (SA), Italy
| |
Collapse
|
27
|
Taneishi K, Tsuchiya Y. Structure-based analyses of gut microbiome-related proteins by neural networks and molecular dynamics simulations. Curr Opin Struct Biol 2022; 73:102336. [DOI: 10.1016/j.sbi.2022.102336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2021] [Revised: 11/18/2021] [Accepted: 01/14/2022] [Indexed: 11/03/2022]
|
28
|
Stringer B, de Ferrante H, Abeln S, Heringa J, Feenstra KA, Haydarlou R. PIPENN: protein interface prediction from sequence with an ensemble of neural nets. Bioinformatics 2022; 38:2111-2118. [PMID: 35150231 PMCID: PMC9004643 DOI: 10.1093/bioinformatics/btac071] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2021] [Revised: 01/16/2022] [Accepted: 02/04/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is a time-consuming, costly and challenging task, while protein sequence data are ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different Deep Learning (DL) architectures and learning strategies for protein-protein, protein-nucleotide and protein-small molecule interface prediction has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six DL architectures and various learning strategies with sequence-derived input features. RESULTS We constructed a large dataset dubbed BioDL, comprising protein-protein interactions from the PDB, and DNA/RNA and small molecule interactions from the BioLip database. We also constructed six DL architectures, and evaluated them on the BioDL benchmarks. This shows that no single architecture performs best on all instances. An ensemble architecture, which combines all six architectures, does consistently achieve peak prediction accuracy. We confirmed these results on the published benchmark set by Zhang and Kurgan (ZK448), and on our own existing curated homo- and heteromeric protein interaction dataset. Our PIPENN sequence-based ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on ZK448 on all interaction types, achieving an AUC-ROC of 0.718 for protein-protein, 0.823 for protein-nucleotide and 0.842 for protein-small molecule. AVAILABILITY AND IMPLEMENTATION Source code and datasets are available at https://github.com/ibivu/pipenn/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Hans de Ferrante
- Department of Computer Science, IBIVU—Center for Integrative Bioinformatics, Vrije Universiteit, 1081HV Amsterdam, The Netherlands
| | - Sanne Abeln
- Department of Computer Science, IBIVU—Center for Integrative Bioinformatics, Vrije Universiteit, 1081HV Amsterdam, The Netherlands
| | - Jaap Heringa
- Department of Computer Science, IBIVU—Center for Integrative Bioinformatics, Vrije Universiteit, 1081HV Amsterdam, The Netherlands
| | - K Anton Feenstra
- Department of Computer Science, IBIVU—Center for Integrative Bioinformatics, Vrije Universiteit, 1081HV Amsterdam, The Netherlands
| | | |
Collapse
|
29
|
Lee I, Nam H. Sequence-based prediction of protein binding regions and drug-target interactions. J Cheminform 2022; 14:5. [PMID: 35135622 PMCID: PMC8822694 DOI: 10.1186/s13321-022-00584-w] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2021] [Accepted: 01/20/2022] [Indexed: 12/19/2022] Open
Abstract
Identifying drug-target interactions (DTIs) is important for drug discovery. However, searching all drug-target spaces poses a major bottleneck. Therefore, recently many deep learning models have been proposed to address this problem. However, the developers of these deep learning models have neglected interpretability in model construction, which is closely related to a model's performance. We hypothesized that training a model to predict important regions on a protein sequence would increase DTI prediction performance and provide a more interpretable model. Consequently, we constructed a deep learning model, named Highlights on Target Sequences (HoTS), which predicts binding regions (BRs) between a protein sequence and a drug ligand, as well as DTIs between them. To train the model, we collected complexes of protein-ligand interactions and protein sequences of binding sites and pretrained the model to predict BRs for a given protein sequence-ligand pair via object detection employing transformers. After pretraining the BR prediction, we trained the model to predict DTIs from a compound token designed to assign attention to BRs. We confirmed that training the BRs prediction model indeed improved the DTI prediction performance. The proposed HoTS model showed good performance in BR prediction on independent test datasets even though it does not use 3D structure information in its prediction. Furthermore, the HoTS model achieved the best performance in DTI prediction on test datasets. Additional analysis confirmed the appropriate attention for BRs and the importance of transformers in BR and DTI prediction. The source code is available on GitHub ( https://github.com/GIST-CSBL/HoTS ).
Collapse
Affiliation(s)
- Ingoo Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-ku, Gwangju, 61005 Republic of Korea
| | - Hojung Nam
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-ku, Gwangju, 61005 Republic of Korea
| |
Collapse
|
30
|
Sun K, Hu X, Feng Z, Wang H, Lv H, Wang Z, Zhang G, Xu S, You X. Predicting Ca 2+ and Mg 2+ ligand binding sites by deep neural network algorithm. BMC Bioinformatics 2022; 22:324. [PMID: 35045825 PMCID: PMC8772041 DOI: 10.1186/s12859-021-04250-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2021] [Accepted: 06/09/2021] [Indexed: 11/25/2022] Open
Abstract
Background Alkaline earth metal ions are important protein binding ligands in human body, and it is of great significance to predict their binding residues. Results In this paper, Mg2+ and Ca2+ ligands are taken as the research objects. Based on the characteristic parameters of protein sequences, amino acids, physicochemical characteristics of amino acids and predicted structural information, deep neural network algorithm is used to predict the binding sites of proteins. By optimizing the hyper-parameters of the deep learning algorithm, the prediction results by the fivefold cross-validation are better than those of the Ionseq method. In addition, to further verify the performance of the proposed model, the undersampling data processing method is adopted, and the prediction results on independent test are better than those obtained by the support vector machine algorithm. Conclusions An efficient method for predicting Mg2+ and Ca2+ ligand binding sites was presented.
Collapse
Affiliation(s)
- Kai Sun
- College of Sciences, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China.,Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, People's Republic of China
| | - Xiuzhen Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China. .,Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, People's Republic of China.
| | - Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China.,Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, People's Republic of China
| | - Hongbin Wang
- College of Data Science and Application, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China
| | - Haotian Lv
- College of Data Science and Application, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China
| | - Ziyang Wang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China.,Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, People's Republic of China
| | - Gaimei Zhang
- Hohhot First Hospital, Hohhot, 010051, People's Republic of China
| | - Shuang Xu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China.,Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, People's Republic of China
| | - Xiaoxiao You
- College of Sciences, Inner Mongolia University of Technology, Hohhot, 010051, People's Republic of China.,Inner Mongolia Key Laboratory of Statistical Analysis Theory for Life Data and Neural Network Modeling, Hohhot, People's Republic of China
| |
Collapse
|
31
|
Pazos F. Computational prediction of protein functional sites-Applications in biotechnology and biomedicine. ADVANCES IN PROTEIN CHEMISTRY AND STRUCTURAL BIOLOGY 2022; 130:39-57. [PMID: 35534114 DOI: 10.1016/bs.apcsb.2021.12.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
There are many computational approaches for predicting protein functional sites based on different sequence and structural features. These methods are essential to cope with the sequence deluge that is filling databases with uncharacterized protein sequences. They complement the more expensive and time-consuming experimental approaches by pointing them to possible candidate positions. In many cases they are jointly used to characterize the functional sites in proteins of biotechnological and biomedical interest and eventually modify them for different purposes. There is a clear trend towards approaches based on machine learning and those using structural information, due to the recent developments in these areas. Nevertheless, "classic" methods based on sequence and evolutionary features are still playing an important role as these features are strongly related to functionality. In this review, the main approaches for predicting general functional sites in a protein are discussed, with a focus on sequence-based approaches.
Collapse
Affiliation(s)
- Florencio Pazos
- Computational Systems Biology Group, National Center for Biotechnology (CNB-CSIC), Madrid, Spain.
| |
Collapse
|
32
|
Dhakal A, McKay C, Tanner JJ, Cheng J. Artificial intelligence in the prediction of protein-ligand interactions: recent advances and future directions. Brief Bioinform 2022; 23:bbab476. [PMID: 34849575 PMCID: PMC8690157 DOI: 10.1093/bib/bbab476] [Citation(s) in RCA: 112] [Impact Index Per Article: 37.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2021] [Revised: 09/28/2021] [Accepted: 10/15/2021] [Indexed: 12/13/2022] Open
Abstract
New drug production, from target identification to marketing approval, takes over 12 years and can cost around $2.6 billion. Furthermore, the COVID-19 pandemic has unveiled the urgent need for more powerful computational methods for drug discovery. Here, we review the computational approaches to predicting protein-ligand interactions in the context of drug discovery, focusing on methods using artificial intelligence (AI). We begin with a brief introduction to proteins (targets), ligands (e.g. drugs) and their interactions for nonexperts. Next, we review databases that are commonly used in the domain of protein-ligand interactions. Finally, we survey and analyze the machine learning (ML) approaches implemented to predict protein-ligand binding sites, ligand-binding affinity and binding pose (conformation) including both classical ML algorithms and recent deep learning methods. After exploring the correlation between these three aspects of protein-ligand interaction, it has been proposed that they should be studied in unison. We anticipate that our review will aid exploration and development of more accurate ML-based prediction strategies for studying protein-ligand interactions.
Collapse
Affiliation(s)
- Ashwin Dhakal
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
| | - Cole McKay
- Department of Biochemistry, University of Missouri, Columbia, MO, 65211, USA
| | - John J Tanner
- Department of Biochemistry, University of Missouri, Columbia, MO, 65211, USA
- Department of Chemistry, University of Missouri, Columbia, MO, 65211, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
| |
Collapse
|
33
|
Born J, Huynh T, Stroobants A, Cornell WD, Manica M. Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model. J Chem Inf Model 2021; 62:240-257. [PMID: 34905358 DOI: 10.1021/acs.jcim.1c00889] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Recent advances in deep learning have enabled the development of large-scale multimodal models for virtual screening and de novo molecular design. The human kinome with its abundant sequence and inhibitor data presents an attractive opportunity to develop proteochemometric models that exploit the size and internal diversity of this family of targets. Here, we challenge a standard practice in sequence-based affinity prediction models: instead of leveraging the full primary structure of proteins, each target is represented by a sequence of 29 discontiguous residues defining the ATP binding site. In kinase-ligand binding affinity prediction, our results show that the reduced active site sequence representation is not only computationally more efficient but consistently yields significantly higher performance than the full primary structure. This trend persists across different models, data sets, and performance metrics and holds true when predicting pIC50 for both unseen ligands and kinases. Our interpretability analysis reveals a potential explanation for the superiority of the active site models: whereas only mild statistical effects about the extraction of three-dimensional (3D) interaction sites take place in the full sequence models, the active site models are equipped with an implicit but strong inductive bias about the 3D structure stemming from the discontiguity of the active sites. Moreover, in direct comparisons, our models perform similarly or better than previous state-of-the-art approaches in affinity prediction. We then investigate a de novo molecular design task and find that the active site provides benefits in the computational efficiency, but otherwise, both kinase representations yield similar optimized affinities (for both SMILES- and SELFIES-based molecular generators). Our work challenges the assumption that the full primary structure is indispensable for modeling human kinases.
Collapse
Affiliation(s)
- Jannis Born
- IBM Research Europe, 8804 Rüschlikon, Switzerland.,Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland
| | - Tien Huynh
- IBM Research, Yorktown Heights, New York 10598, United States
| | - Astrid Stroobants
- Department of Chemistry, Imperial College London, SW7 2AZ London, United Kingdom
| | - Wendy D Cornell
- IBM Research, Yorktown Heights, New York 10598, United States
| | | |
Collapse
|
34
|
Littmann M, Heinzinger M, Dallago C, Weissenow K, Rost B. Protein embeddings and deep learning predict binding residues for various ligand classes. Sci Rep 2021; 11:23916. [PMID: 34903827 PMCID: PMC8668950 DOI: 10.1038/s41598-021-03431-4] [Citation(s) in RCA: 62] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2021] [Accepted: 12/02/2021] [Indexed: 01/27/2023] Open
Abstract
One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress many binding sites remain obscure. Here, we proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1 = 48 ± 3% (95% CI) and MCC = 0.46 ± 0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: For the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable-neither using structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.
Collapse
Affiliation(s)
- Maria Littmann
- Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany.
| | - Michael Heinzinger
- Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
| | - Christian Dallago
- Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
| | - Konstantin Weissenow
- Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Boltzmannstr. 11, 85748, Garching, Germany
| | - Burkhard Rost
- Department of Informatics, Bioinformatics and Computational Biology, I12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748, Munich, Germany
- TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
- Department of Biochemistry and Molecular Biophysics, Columbia University, 701 West, 168th Street, New York, NY, 10032, USA
| |
Collapse
|
35
|
Aggarwal R, Gupta A, Chelur V, Jawahar CV, Priyakumar UD. DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks. J Chem Inf Model 2021; 62:5069-5079. [PMID: 34374539 DOI: 10.1021/acs.jcim.1c00799] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
A structure-based drug design pipeline involves the development of potential drug molecules or ligands that form stable complexes with a given receptor at its binding site. A prerequisite to this is finding druggable and functionally relevant binding sites on the 3D structure of the protein. Although several methods for detecting binding sites have been developed beforehand, a majority of them surprisingly fail in the identification and ranking of binding sites accurately. The rapid adoption and success of deep learning algorithms in various sections of structural biology beckons the usage of such algorithms for accurate binding site detection. As a combination of geometry based software and deep learning, we report a novel framework, DeepPocket that utilizes 3D convolutional neural networks for the rescoring of pockets identified by Fpocket and further segments these identified cavities on the protein surface. Apart from this, we also propose another data set SC6K containing protein structures submitted in the Protein Data Bank (PDB) from January 1st, 2018, until February 28th, 2020, for ligand binding site (LBS) detection. DeepPocket's results on various binding site data sets and SC6K highlight its better performance over current state-of-the-art methods and good generalization ability over novel structures.
Collapse
Affiliation(s)
- Rishal Aggarwal
- International Institute of Information Technology, Hyderabad 500 032, India
| | - Akash Gupta
- International Institute of Information Technology, Hyderabad 500 032, India
| | - Vineeth Chelur
- International Institute of Information Technology, Hyderabad 500 032, India
| | - C V Jawahar
- International Institute of Information Technology, Hyderabad 500 032, India
| | - U Deva Priyakumar
- International Institute of Information Technology, Hyderabad 500 032, India
| |
Collapse
|
36
|
Rauer C, Sen N, Waman VP, Abbasian M, Orengo CA. Computational approaches to predict protein functional families and functional sites. Curr Opin Struct Biol 2021; 70:108-122. [PMID: 34225010 DOI: 10.1016/j.sbi.2021.05.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 05/13/2021] [Accepted: 05/25/2021] [Indexed: 01/06/2023]
Abstract
Understanding the mechanisms of protein function is indispensable for many biological applications, such as protein engineering and drug design. However, experimental annotations are sparse, and therefore, theoretical strategies are needed to fill the gap. Here, we present the latest developments in building functional subclassifications of protein superfamilies and using evolutionary conservation to detect functional determinants, for example, catalytic-, binding- and specificity-determining residues important for delineating the functional families. We also briefly review other features exploited for functional site detection and new machine learning strategies for combining multiple features.
Collapse
Affiliation(s)
- Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Mahnaz Abbasian
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
| |
Collapse
|
37
|
Revolutionizing enzyme engineering through artificial intelligence and machine learning. Emerg Top Life Sci 2021; 5:113-125. [PMID: 33835131 DOI: 10.1042/etls20200257] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2020] [Revised: 03/17/2021] [Accepted: 03/22/2021] [Indexed: 12/20/2022]
Abstract
The combinatorial space of an enzyme sequence has astronomical possibilities and exploring it with contemporary experimental techniques is arduous and often ineffective. Multi-target objectives such as concomitantly achieving improved selectivity, solubility and activity of an enzyme have narrow plausibility under approaches of restricted mutagenesis and combinatorial search. Traditional enzyme engineering approaches have a limited scope for complex optimization due to the requirement of a priori knowledge or experimental burden of screening huge protein libraries. The recent surge in high-throughput experimental methods including Next Generation Sequencing and automated screening has flooded the field of molecular biology with big-data, which requires us to re-think our concurrent approaches towards enzyme engineering. Artificial Intelligence (AI) and Machine Learning (ML) have great potential to revolutionize smart enzyme engineering without the explicit need for a complete understanding of the underlying molecular system. Here, we portray the role and position of AI techniques in the field of enzyme engineering along with their scope and limitations. In addition, we explain how the traditional approaches of directed evolution and rational design can be extended through AI tools. Recent successful examples of AI-assisted enzyme engineering projects and their deviation from traditional approaches are highlighted. A comprehensive picture of current challenges and future avenues for AI in enzyme engineering are also discussed.
Collapse
|
38
|
Hu J, Zheng LL, Bai YS, Zhang KW, Yu DJ, Zhang GJ. Accurate prediction of protein-ATP binding residues using position-specific frequency matrix. Anal Biochem 2021; 626:114241. [PMID: 33971164 DOI: 10.1016/j.ab.2021.114241] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2021] [Revised: 04/27/2021] [Accepted: 05/01/2021] [Indexed: 10/21/2022]
Abstract
Knowledge of protein-ATP interaction can help for protein functional annotation and drug discovery. Accurately identifying protein-ATP binding residues is an important but challenging task to gain the knowledge of protein-ATP interactions, especially for the case where only protein sequence information is given. In this study, we propose a novel method, named DeepATPseq, to predict protein-ATP binding residues without using any information about protein three-dimension structure or sequence-derived structural information. In DeepATPseq, the HHBlits-generated position-specific frequency matrix (PSFM) profile is first employed to extract the feature information of each residue. Then, for each residue, the PSFM-based feature is fed into two prediction models, which are generated by the algorithms of deep convolutional neural network (DCNN) and support vector machine (SVM) separately. The final ATP-binding probability of the corresponding residue is calculated by the weighted sum of the outputted values of DCNN-based and SVM-based models. Experimental results on the independent validation data set demonstrate that DeepATPseq could achieve an accuracy of 77.71%, covering 57.42% of all ATP-binding residues, while achieving a Matthew's correlation coefficient value (0.655) that is significantly higher than that of existing sequence-based methods and comparable to that of the state-of-the-art structure-based predictors. Detailed data analysis show that the major advantage of DeepATPseq lies at the combination utilization of DCNN and SVM that helps dig out more discriminative information from the PSFM profiles. The online server and standalone package of DeepATPseq are freely available at: https://jun-csbio.github.io/DeepATPseq/for academic use.
Collapse
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China.
| | - Lin-Lin Zheng
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Yan-Song Bai
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Ke-Wen Zhang
- College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology,Xiaolingwei 200, Nanjing, 210094, China.
| | - Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China.
| |
Collapse
|
39
|
Nazarova AL, Yang L, Liu K, Mishra A, Kalia RK, Nomura KI, Nakano A, Vashishta P, Rajak P. Dielectric Polymer Property Prediction Using Recurrent Neural Networks with Optimizations. J Chem Inf Model 2021; 61:2175-2186. [PMID: 33871989 DOI: 10.1021/acs.jcim.0c01366] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
Abstract
Despite the growing success of machine learning for predicting structure-property relationships in molecules and materials, such as predicting the dielectric properties of polymers, it is still in its infancy. We report on the effectiveness of solving structure-property relationships for a computer-generated database of dielectric polymers using recurrent neural network (RNN) models. The implementation of a series of optimization strategies was crucial to achieving high learning speeds and sufficient accuracy: (1) binary and nonbinary representations of SMILES (Simplified Molecular Input Line System) fingerprints and (2) backpropagation with affine transformation of the input sequence (ATransformedBP) and resilient backpropagation with initial weight update parameter optimizations (iRPROP- optimized). For the investigated database of polymers, the binary SMILES representation was found to be superior to the decimal representation with respect to the training and prediction performance. All developed and optimized Elman-type RNN algorithms outperformed nonoptimized RNN models in the efficient prediction of nonlinear structure-activity relationships. The average relative standard deviation (RSD) remained well below 5%, and the maximum RSD did not exceed 30%. Moreover, we provide a C++ codebase as a testbed for a new generation of open programming languages that target increasingly diverse computer architectures.
Collapse
Affiliation(s)
- Antonina L Nazarova
- Department of Chemistry, Loker Hydrocarbon Research Institute, and USC Bridge Institue, University of Southern California, Los Angeles, California 90089, United States
| | - Liqiu Yang
- Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, and Department of Biological Sciences, University of Southern California, Los Angeles, California 90089, United States
| | - Kuang Liu
- Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, and Department of Biological Sciences, University of Southern California, Los Angeles, California 90089, United States
| | - Ankit Mishra
- Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, and Department of Biological Sciences, University of Southern California, Los Angeles, California 90089, United States
| | - Rajiv K Kalia
- Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, and Department of Biological Sciences, University of Southern California, Los Angeles, California 90089, United States
| | - Ken-Ichi Nomura
- Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, and Department of Biological Sciences, University of Southern California, Los Angeles, California 90089, United States
| | - Aiichiro Nakano
- Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, and Department of Biological Sciences, University of Southern California, Los Angeles, California 90089, United States
| | - Priya Vashishta
- Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, and Department of Biological Sciences, University of Southern California, Los Angeles, California 90089, United States
| | - Pankaj Rajak
- Argonne Leadership Computing Facility, Argonne National Laboratory, Lemont, Illinois 60439, United States
| |
Collapse
|
40
|
Wu F, Yang R, Zhang C, Zhang L. A deep learning framework combined with word embedding to identify DNA replication origins. Sci Rep 2021; 11:844. [PMID: 33436981 PMCID: PMC7804333 DOI: 10.1038/s41598-020-80670-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2020] [Accepted: 12/24/2020] [Indexed: 01/29/2023] Open
Abstract
The DNA replication influences the inheritance of genetic information in the DNA life cycle. As the distribution of replication origins (ORIs) is the major determinant to precisely regulate the replication process, the correct identification of ORIs is significant in giving an insightful understanding of DNA replication mechanisms and the regulatory mechanisms of genetic expressions. For eukaryotes in particular, multiple ORIs exist in each of their gene sequences to complete the replication in a reasonable period of time. To simplify the identification process of eukaryote's ORIs, most of existing methods are developed by traditional machine learning algorithms, and target to the gene sequences with a fixed length. Consequently, the identification results are not satisfying, i.e. there is still great room for improvement. To break through the limitations in previous studies, this paper develops sequence segmentation methods, and employs the word embedding technique, 'Word2vec', to convert gene sequences into word vectors, thereby grasping the inner correlations of gene sequences with different lengths. Then, a deep learning framework to perform the ORI identification task is constructed by a convolutional neural network with an embedding layer. On the basis of the analysis of similarity reduction dimensionality diagram, Word2vec can effectively transform the inner relationship among words into numerical feature. For four species in this study, the best models are obtained with the overall accuracy of 0.975, 0.765, 0.885, 0.967, the Matthew's correlation coefficient of 0.940, 0.530, 0.771, 0.934, and the AUC of 0.975, 0.800, 0.888, 0.981, which indicate that the proposed predictor has a stable ability and provide a high confidence coefficient to classify both of ORIs and non-ORIs. Compared with state-of-the-art methods, the proposed predictor can achieve ORI identification with significant improvement. It is therefore reasonable to anticipate that the proposed method will make a useful high throughput tool for genome analysis.
Collapse
Affiliation(s)
- Feng Wu
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| | - Runtao Yang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China.
| | - Chengjin Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| | - Lina Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| |
Collapse
|
41
|
Feinstein J, Shi W, Ramanujam J, Brylinski M. Bionoi: A Voronoi Diagram-Based Representation of Ligand-Binding Sites in Proteins for Machine Learning Applications. Methods Mol Biol 2021; 2266:299-312. [PMID: 33759134 DOI: 10.1007/978-1-0716-1209-5_17] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Bionoi is a new software to generate Voronoi representations of ligand-binding sites in proteins for machine learning applications. Unlike many other deep learning models in biomedicine, Bionoi utilizes off-the-shelf convolutional neural network architectures, reducing the development work without sacrificing the performance. When initially generating images of binding sites, users have the option to color the Voronoi cells based on either one of six structural, physicochemical, and evolutionary properties, or a blend of all six individual properties. Encouragingly, after inputting images generated by Bionoi into the convolutional autoencoder, the network was able to effectively learn the most salient features of binding pockets. The accuracy of the generated model is evaluated both visually and numerically through the reconstruction of binding site images from the latent feature space. The generated feature vectors capture well various properties of binding sites and thus can be applied in a multitude of machine learning projects. As a demonstration, we trained the ResNet-18 architecture from Microsoft on Bionoi images to show that it is capable to effectively classify nucleotide- and heme-binding pockets against a large dataset of control pockets binding a variety of small molecules. Bionoi is freely available to the research community at https://github.com/CSBG-LSU/BionoiNet.
Collapse
Affiliation(s)
- Joseph Feinstein
- Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, USA
| | - Wentao Shi
- Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, USA
| | - J Ramanujam
- Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, USA.,Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, USA
| | - Michal Brylinski
- Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, USA. .,Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA.
| |
Collapse
|
42
|
Issa NT, Stathias V, Schürer S, Dakshanamurthy S. Machine and deep learning approaches for cancer drug repurposing. Semin Cancer Biol 2021; 68:132-142. [PMID: 31904426 PMCID: PMC7723306 DOI: 10.1016/j.semcancer.2019.12.011] [Citation(s) in RCA: 135] [Impact Index Per Article: 33.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2019] [Revised: 10/31/2019] [Accepted: 12/15/2019] [Indexed: 02/07/2023]
Abstract
Knowledge of the underpinnings of cancer initiation, progression and metastasis has increased exponentially in recent years. Advanced "omics" coupled with machine learning and artificial intelligence (deep learning) methods have helped elucidate targets and pathways critical to those processes that may be amenable to pharmacologic modulation. However, the current anti-cancer therapeutic armamentarium continues to lag behind. As the cost of developing a new drug remains prohibitively expensive, repurposing of existing approved and investigational drugs is sought after given known safety profiles and reduction in the cost barrier. Notably, successes in oncologic drug repurposing have been infrequent. Computational in-silico strategies have been developed to aid in modeling biological processes to find new disease-relevant targets and discovering novel drug-target and drug-phenotype associations. Machine and deep learning methods have especially enabled leaps in those successes. This review will discuss these methods as they pertain to cancer biology as well as immunomodulation for drug repurposing opportunities in oncologic diseases.
Collapse
Affiliation(s)
- Naiem T Issa
- Dr. Phillip Frost Department of Dermatology and Cutaneous Surgery, University of Miami School of Medicine, Miami, FL, USA
| | - Vasileios Stathias
- Department of Molecular and Cellular Pharmacology, University of Miami School of Medicine, Miami, FL, USA
| | - Stephan Schürer
- Department of Molecular and Cellular Pharmacology, University of Miami School of Medicine, Miami, FL, USA
| | - Sivanesan Dakshanamurthy
- Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC, USA.
| |
Collapse
|
43
|
Predicting binding sites from unbound versus bound protein structures. Sci Rep 2020; 10:15856. [PMID: 32985584 PMCID: PMC7522209 DOI: 10.1038/s41598-020-72906-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2020] [Accepted: 07/27/2020] [Indexed: 11/30/2022] Open
Abstract
We present the application of seven binding-site prediction algorithms to a meticulously curated dataset of ligand-bound and ligand-free crystal structures for 304 unique protein sequences (2528 crystal structures). We probe the influence of starting protein structures on the results of binding-site prediction, so the dataset contains a minimum of two ligand-bound and two ligand-free structures for each protein. We use this dataset in a brief survey of five geometry-based, one energy-based, and one machine-learning-based methods: Surfnet, Ghecom, LIGSITEcsc, Fpocket, Depth, AutoSite, and Kalasanty. Distributions of the F scores and Matthew’s correlation coefficients for ligand-bound versus ligand-free structure performance show no statistically significant difference in structure type versus performance for most methods. Only Fpocket showed a statistically significant but low magnitude enhancement in performance for holo structures. Lastly, we found that most methods will succeed on some crystal structures and fail on others within the same protein family, despite all structures being relatively high-quality structures with low structural variation. We expected better consistency across varying protein conformations of the same sequence. Interestingly, the success or failure of a given structure cannot be predicted by quality metrics such as resolution, Cruickshank Diffraction Precision index, or unresolved residues. Cryptic sites were also examined.
Collapse
|
44
|
Liu L, Hu X, Feng Z, Wang S, Sun K, Xu S. Recognizing Ion Ligand-Binding Residues by Random Forest Algorithm Based on Optimized Dihedral Angle. Front Bioeng Biotechnol 2020; 8:493. [PMID: 32596216 PMCID: PMC7303464 DOI: 10.3389/fbioe.2020.00493] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2019] [Accepted: 04/28/2020] [Indexed: 11/26/2022] Open
Abstract
The prediction of ion ligand–binding residues in protein sequences is a challenging work that contributes to understand the specific functions of proteins in life processes. In this article, we selected binding residues of 14 ion ligands as research objects, including four acid radical ion ligands and 10 metal ion ligands. Based on the amino acid sequence information, we selected the composition and position conservation information of amino acids, the predicted structural information, and physicochemical properties of amino acids as basic feature parameters. We then performed a statistical analysis and reclassification for dihedral angle and proposed new methods on the extraction of feature parameters. The methods mainly included applying information entropy on the extraction of polarization charge and hydrophilic–hydrophobic information of amino acids and using position weight matrices on the extraction of position conservation information. In the prediction model, we used the random forest algorithm and obtained better prediction results than previous works. With the independent test, the Matthew's correlation coefficient and accuracy of 10 metal ion ligand–binding residues were larger than 0.07 and 52%, respectively; the corresponding evaluation values of four acid radical ion ligand–binding residues were larger than 0.15 and 86%, respectively. Further, we classified and combined the phi and psi angles and optimized prediction model for each ion ligand–binding residue.
Collapse
Affiliation(s)
- Liu Liu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Xiuzhen Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Shan Wang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Kai Sun
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Shuang Xu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| |
Collapse
|
45
|
Zhao J, Cao Y, Zhang L. Exploring the computational methods for protein-ligand binding site prediction. Comput Struct Biotechnol J 2020; 18:417-426. [PMID: 32140203 PMCID: PMC7049599 DOI: 10.1016/j.csbj.2020.02.008] [Citation(s) in RCA: 103] [Impact Index Per Article: 20.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2019] [Revised: 01/23/2020] [Accepted: 02/11/2020] [Indexed: 12/21/2022] Open
Abstract
Proteins participate in various essential processes in vivo via interactions with other molecules. Identifying the residues participating in these interactions not only provides biological insights for protein function studies but also has great significance for drug discoveries. Therefore, predicting protein-ligand binding sites has long been under intense research in the fields of bioinformatics and computer aided drug discovery. In this review, we first introduce the research background of predicting protein-ligand binding sites and then classify the methods into four categories, namely, 3D structure-based, template similarity-based, traditional machine learning-based and deep learning-based methods. We describe representative algorithms in each category and elaborate on machine learning and deep learning-based prediction methods in more detail. Finally, we discuss the trends and challenges of the current research such as molecular dynamics simulation based cryptic binding sites prediction, and highlight prospective directions for the near future.
Collapse
Affiliation(s)
- Jingtian Zhao
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Yang Cao
- Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| | - Le Zhang
- College of Computer Science, Sichuan University, Chengdu 610065, China
| |
Collapse
|
46
|
Shi Q, Chen W, Huang S, Wang Y, Xue Z. Deep learning for mining protein data. Brief Bioinform 2019; 22:194-218. [PMID: 31867611 DOI: 10.1093/bib/bbz156] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2019] [Revised: 10/21/2019] [Accepted: 11/07/2019] [Indexed: 01/16/2023] Open
Abstract
The recent emergence of deep learning to characterize complex patterns of protein big data reveals its potential to address the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions. In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data. The application architectures of these methods include multilayer perceptrons, stacked autoencoders, deep belief networks, two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex neural networks and are described from five perspectives: residue-level prediction, sequence-level prediction, three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some practical issues and their future directions are discussed, such as robust deep learning for protein noisy data, architecture optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives on general deep learning techniques for protein data analysis.
Collapse
Affiliation(s)
- Qiang Shi
- School of Software Engineering, Huazhong University of Science and Technology. His main interests cover machine learning especially deep learning, protein data analysis, and big data mining
| | - Weiya Chen
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, virtual reality, and data visualization
| | - Siqi Huang
- Software Engineering at Huazhong University of science and technology, focusing on Machine learning and data mining
| | - Yan Wang
- School of life, University of Science & Technology; her main interests cover protein structure and function prediction and big data mining
| | - Zhidong Xue
- School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, machine learning, and image processing
| |
Collapse
|
47
|
Matsuzaka Y, Uesawa Y. Prediction Model with High-Performance Constitutive Androstane Receptor (CAR) Using DeepSnap-Deep Learning Approach from the Tox21 10K Compound Library. Int J Mol Sci 2019; 20:ijms20194855. [PMID: 31574921 PMCID: PMC6801383 DOI: 10.3390/ijms20194855] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2019] [Revised: 09/23/2019] [Accepted: 09/27/2019] [Indexed: 12/30/2022] Open
Abstract
The constitutive androstane receptor (CAR) plays pivotal roles in drug-induced liver injury through the transcriptional regulation of drug-metabolizing enzymes and transporters. Thus, identifying regulatory factors for CAR activation is important for understanding its mechanisms. Numerous studies conducted previously on CAR activation and its toxicity focused on in vivo or in vitro analyses, which are expensive, time consuming, and require many animals. We developed a computational model that predicts agonists for the CAR using the Toxicology in the 21st Century 10k library. Additionally, we evaluate the prediction performance of novel deep learning (DL)-based quantitative structure-activity relationship analysis called the DeepSnap-DL approach, which is a procedure of generating an omnidirectional snapshot portraying three-dimensional (3D) structures of chemical compounds. The CAR prediction model, which applies a 3D structure generator tool, called CORINA-generated and -optimized chemical structures, in the DeepSnap-DL demonstrated better performance than the existing methods using molecular descriptors. These results indicate that high performance in the prediction model using the DeepSnap-DL approach may be important to prepare suitable 3D chemical structures as input data and to enable the identification of modulators of the CAR.
Collapse
Affiliation(s)
- Yasunari Matsuzaka
- Department of Medical Molecular Informatics, Meiji Pharmaceutical University, Tokyo 204-8588, Japan.
| | - Yoshihiro Uesawa
- Department of Medical Molecular Informatics, Meiji Pharmaceutical University, Tokyo 204-8588, Japan.
| |
Collapse
|