1
Weckbecker M, Anžel A, Yang Z, Hattab G. Interpretable molecular encodings and representations for machine learning tasks. Comput Struct Biotechnol J 2024; 23:2326-2336. [PMID: 38867722 PMCID: PMC11167246 DOI: 10.1016/j.csbj.2024.05.035] [Received: 03/27/2024] [Revised: 05/13/2024] [Accepted: 05/19/2024] [Indexed: 06/14/2024] [Open access]
Abstract
Molecular encodings and their usage in machine learning models have demonstrated significant breakthroughs in biomedical applications, particularly in the classification of peptides and proteins. To this end, we propose a new encoding method: Interpretable Carbon-based Array of Neighborhoods (iCAN). Designed to address machine learning models' need for more structured and less flexible input, it captures the neighborhoods of carbon atoms in a counting array and improves the utility of the resulting encodings for machine learning models. The iCAN method provides interpretable molecular encodings and representations, enabling the comparison of molecular neighborhoods, identification of repeating patterns, and visualization of relevance heat maps for a given data set. When reproducing a large biomedical peptide classification study, it outperforms its predecessor encoding. When extended to proteins, it outperforms a lead structure-based encoding on 71% of the data sets. Our method offers interpretable encodings that can be applied to all organic molecules, including exotic amino acids, cyclic peptides, and larger proteins, making it highly versatile across various domains and data sets. This work establishes a promising new direction for machine learning in peptide and protein classification in biomedicine and healthcare, potentially accelerating advances in drug discovery and disease diagnosis.
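As a rough illustration of what "capturing the neighborhoods of carbon atoms in a counting array" could look like, the sketch below counts, for each carbon atom in a toy molecular graph, the multiset of elements bonded to it. This is not the published iCAN algorithm, whose details the abstract does not give; the graph format, the `carbon_neighborhood_counts` helper, and the choice of direct neighbors only are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy molecule (ethanol, CH3-CH2-OH) as an element map plus bond list.
# Atom ids and this graph format are illustrative assumptions.
ELEMENTS = {0: "C", 1: "C", 2: "O", 3: "H", 4: "H", 5: "H", 6: "H", 7: "H", 8: "H"}
BONDS = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]


def carbon_neighborhood_counts(elements, bonds):
    """Count how often each carbon neighborhood (sorted neighbor elements) occurs."""
    neighbors = {atom: [] for atom in elements}
    for a, b in bonds:
        neighbors[a].append(elements[b])
        neighbors[b].append(elements[a])

    counts = Counter()
    for atom, element in elements.items():
        if element == "C":
            # A neighborhood is identified by the sorted elements bonded to the carbon.
            counts[tuple(sorted(neighbors[atom]))] += 1
    return counts


counts = carbon_neighborhood_counts(ELEMENTS, BONDS)
```

Because each key names a concrete chemical environment, the resulting counts can be compared across molecules or mapped back to atoms, which is the kind of interpretability the abstract emphasizes.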
Affiliation(s)
- Moritz Weckbecker
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
- Aleksandar Anžel
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
- Zewen Yang
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
- Georges Hattab
  - Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, Berlin, 13353, Berlin, Germany
  - Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, Berlin, 14195, Berlin, Germany
2
Kumar N, Acharya V. Advances in machine intelligence-driven virtual screening approaches for big-data. Med Res Rev 2024; 44:939-974. [PMID: 38129992 DOI: 10.1002/med.21995] [Received: 09/12/2022] [Revised: 07/15/2023] [Accepted: 10/29/2023] [Indexed: 12/23/2023]
Abstract
Virtual screening (VS) is an integral and ever-evolving part of the drug discovery framework. VS is traditionally classified into ligand-based (LB) and structure-based (SB) approaches. Machine intelligence, or artificial intelligence, is widely applied in drug discovery to reduce time and resource consumption. Combined with machine intelligence algorithms, VS has become a rapidly advancing technology that learns within robust decision orders for data curation and hit molecule screening from large VS libraries in minutes or hours. The exponential growth of chemical and biological data into public-domain "big-data" demands modern and advanced machine intelligence-driven VS approaches to screen hit molecules from ultra-large VS libraries. VS has evolved from individual approaches (LB or SB) to integrated LB and SB techniques that explore various aspects of ligands and target proteins to improve the rate of appropriate hit molecule prediction. Current trends demand advanced and intelligent solutions to handle the enormous data in the drug discovery domain for screening and optimizing hits or leads with few or no false positive hits. This review follows the big-data drift and the tremendous growth in computational architecture. It categorizes and emphasizes individual VS techniques, details the literature on machine learning implementations and modern machine intelligence approaches, and discusses limitations and future prospects.
Affiliation(s)
- Neeraj Kumar
  - Artificial Intelligence for Computational Biology Lab (AICoB), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research, Ghaziabad, India
- Vishal Acharya
  - Artificial Intelligence for Computational Biology Lab (AICoB), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research, Ghaziabad, India
3
Wang R, Liu Z, Gong J, Zhou Q, Guan X, Ge G. An Uncertainty-Guided Deep Learning Method Facilitates Rapid Screening of CYP3A4 Inhibitors. J Chem Inf Model 2023; 63:7699-7710. [PMID: 38055780 DOI: 10.1021/acs.jcim.3c01241] [Indexed: 12/08/2023]
Abstract
Cytochrome P450 3A4 (CYP3A4), a prominent member of the P450 enzyme superfamily, plays a crucial role in metabolizing various xenobiotics, including over 50% of clinically significant drugs. Evaluating CYP3A4 inhibition before drug approval is essential to avoiding potentially harmful pharmacokinetic drug-drug interactions (DDIs) and adverse drug reactions (ADRs). Despite the development of several CYP inhibitor prediction models, the primary approach for screening CYP inhibitors still relies on experimental methods. This might stem from the limitations of existing models, which only provide deterministic classification outcomes instead of precise inhibition intensity (e.g., IC50) and often suffer from inadequate prediction reliability. To address this challenge, we propose an uncertainty-guided regression model to accurately predict the IC50 values of anti-CYP3A4 activities. First, a comprehensive data set of CYP3A4 inhibitors was compiled, consisting of 27,045 compounds with classification labels, including 4395 compounds with explicit IC50 values. Second, by integrating the predictions of the classification model trained on a larger data set and introducing an evidential uncertainty method to rank prediction confidence, we obtained a high-precision and reliable regression model. Finally, we use the evidential uncertainty values as a trustworthy indicator to perform a virtual screening of an in-house compound set. The in vitro experiment results revealed that this new indicator significantly improved the hit ratio and reduced false positives among the top-ranked compounds. Specifically, among the top 20 compounds ranked with uncertainty, 15 compounds were identified as novel CYP3A4 inhibitors, and three of them exhibited activities less than 1 μM. In summary, our findings highlight the effectiveness of incorporating uncertainty in compound screening, providing a promising strategy for drug discovery and development.
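The screening step described above, ranking compounds by an uncertainty value and then checking how many top-ranked compounds turn out to be hits, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Prediction` record, its field names, and the assumption that lower uncertainty ranks first are all hypothetical. Only the final arithmetic (15 confirmed inhibitors among the top 20) comes from the abstract.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    """Hypothetical record for one screened compound (field names are assumptions)."""
    compound_id: str
    predicted_ic50_um: float  # predicted IC50 in micromolar
    uncertainty: float        # model-reported uncertainty; lower is assumed more reliable


def rank_by_uncertainty(predictions: List[Prediction], top_k: int) -> List[Prediction]:
    """Return the top_k predictions with the lowest uncertainty."""
    return sorted(predictions, key=lambda p: p.uncertainty)[:top_k]


def hit_ratio(n_hits: int, n_tested: int) -> float:
    """Fraction of experimentally tested compounds confirmed as hits."""
    return n_hits / n_tested


# Made-up example values, for illustration only.
example = [
    Prediction("cmpd-A", 0.8, 0.30),
    Prediction("cmpd-B", 2.5, 0.05),
    Prediction("cmpd-C", 5.0, 0.12),
]
top_two = rank_by_uncertainty(example, top_k=2)

# From the abstract: 15 of the top 20 ranked compounds were confirmed inhibitors.
reported_hit_ratio = hit_ratio(15, 20)
```

With the numbers reported in the abstract, the hit ratio among the top 20 compounds works out to 15/20 = 0.75.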
Affiliation(s)
- Ruixuan Wang
  - Shanghai Frontiers Science Center of TCM Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
- Zhikang Liu
  - School of Mathematics and Statistics, Central South University, Changsha 410083, China
- Jiahao Gong
  - Shanghai Frontiers Science Center of TCM Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
- Qingping Zhou
  - School of Mathematics and Statistics, Central South University, Changsha 410083, China
- Xiaoqing Guan
  - Shanghai Frontiers Science Center of TCM Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
- Guangbo Ge
  - Shanghai Frontiers Science Center of TCM Chemical Biology, Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
4
Yin Z, Song W, Li B, Wang F, Xie L, Xu X. Neural networks prediction of the protein-ligand binding affinity with circular fingerprints. Technol Health Care 2023; 31:487-495. [PMID: 37066944 PMCID: PMC10200229 DOI: 10.3233/thc-236042] [Indexed: 04/18/2023]
Abstract
BACKGROUND: Protein-ligand binding affinity is of significant importance in structure-based drug design. Recently, the development of machine learning techniques has provided an efficient and accurate way to predict binding affinity. However, the prediction performance largely depends on how molecules are represented. OBJECTIVE: Different molecular descriptors are designed to capture different features. The study aims to identify the optimal circular fingerprints for predicting protein-ligand binding affinity with matched neural network architectures. METHODS: Extended-connectivity fingerprints (ECFP) and protein-ligand extended connectivity fingerprints (PLEC) encode circular atomic and bonding connectivity environments with a preference for intra- and inter-molecular features, respectively. Densely-connected neural networks are employed to map the circular fingerprints of protein-ligand complexes to binding affinities. RESULTS: The performance of neural networks is sensitive to the parameters used for ECFP and PLEC fingerprints. The R2_score of the evaluated ECFP and PLEC fingerprints reaches 0.52 and 0.49, higher than those of the improperly set ECFP and PLEC fingerprints, with R2_score of 0.45 and 0.38, respectively. Additionally, compared to the predictions from the standalone fingerprints, the ECFP+PLEC conjoint ones slightly improve the prediction accuracy, with an R2_score of approximately 0.55. CONCLUSION: Both intra- and inter-molecular structural features encoded in the circular fingerprints contribute to the protein-ligand binding affinity. Optimizing the parameters of ECFP and PLEC can enhance performance. The conjoint fingerprint scheme can be generally extended to other molecular descriptors for enhanced feature engineering and improved predictive performance.
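The two ideas in this abstract that lend themselves to a short example are the conjoint fingerprint and the R2_score metric. The sketch below assumes a conjoint fingerprint is formed by concatenating the two bit vectors (the abstract does not spell out the combination rule) and computes the coefficient of determination from its standard definition. The helper names `conjoin` and `r2_score` and the toy vectors are illustrative; generating real ECFP or PLEC fingerprints would require a cheminformatics toolkit and is outside this sketch.

```python
from typing import List, Sequence


def conjoin(ecfp_bits: Sequence[int], plec_bits: Sequence[int]) -> List[int]:
    """Concatenate two fixed-length fingerprints into one feature vector (assumed scheme)."""
    return list(ecfp_bits) + list(plec_bits)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy bit vectors standing in for real ECFP and PLEC fingerprints.
features = conjoin([1, 0], [0, 1, 1])

# A perfect prediction scores 1.0; always predicting the mean scores 0.0.
perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Read against these reference points, the reported scores of roughly 0.5 to 0.55 mean the networks explain about half of the variance in binding affinity on the evaluated data.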
Affiliation(s)
- Zuode Yin
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, Jiangsu, China
- Wei Song
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, Jiangsu, China
  - School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou, Jiangsu, China
- Baiyi Li
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, Jiangsu, China
- Fengfei Wang
  - School of Mathematics and Physics, Jiangsu University of Technology, Changzhou, Jiangsu, China
- Liangxu Xie
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, Jiangsu, China
- Xiaojun Xu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, Jiangsu, China
5
Muegge I, Hu Y. How do we further enhance 2D fingerprint similarity searching for novel drug discovery? Expert Opin Drug Discov 2022; 17:1173-1176. [PMID: 36150044 DOI: 10.1080/17460441.2022.2128332] [Indexed: 01/11/2023]
Affiliation(s)
- Yuan Hu
  - Alkermes, Inc, Waltham, Massachusetts, USA
6
KUALA: a machine learning-driven framework for kinase inhibitors repositioning. Sci Rep 2022; 12:17877. [PMID: 36284125 PMCID: PMC9595087 DOI: 10.1038/s41598-022-22324-8] [Received: 06/01/2022] [Accepted: 10/12/2022] [Indexed: 01/20/2023] [Open access]
Abstract
The family of protein kinases comprises more than 500 genes involved in numerous functions. Hence, their physiological dysfunction has paved the way toward drug discovery for cancer, cardiovascular, and inflammatory diseases. The high similarity of kinase binding sites plays a double role. On the one hand, it is a critical issue for selectivity; on the other hand, according to poly-pharmacology, a synergistic, controlled effect on more than one target could be of great pharmacological interest. Another important aspect of binding similarity is the possibility of exploiting it to reposition drugs on targets of the same family. In this study, we propose an approach called Kinase drUgs mAchine Learning frAmework (KUALA) to automatically identify kinase-active ligands using specific sets of molecular descriptors, and to provide a multi-target priority score and a repurposing threshold that suggest the best repurposable and non-repurposable molecules. The comprehensive list of all kinase-ligand pairs and their scores can be found at https://github.com/molinfrimed/multi-kinases.