1
|
Hu J, Zhang Y, Xie J, Yuan Z, Yin Z, Shi S, Li H, Li S. Learning motif features and topological structure of molecules for metabolic pathway prediction. J Cheminform 2025; 17:56. [PMID: 40259421 PMCID: PMC12013036 DOI: 10.1186/s13321-025-00994-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 03/21/2025] [Indexed: 04/23/2025] Open
Abstract
Metabolites serve as crucial biomarkers for assessing disease progression and understanding underlying pathogenic mechanisms. However, when the metabolic pathway category of a metabolite is unknown, researchers face challenges in conducting metabolomic analyses. Because wet-laboratory experimentation for pathway identification is complex, there is a growing demand for predictive methods. Various computational approaches, including machine learning and graph neural networks, have been proposed; however, interpretability remains a challenge. We have developed a neural network framework called MotifMol3D, which is designed for predicting molecular metabolic pathway categories. The framework introduces motif information to mine local features of small-sample molecules and combines it with a graph neural network and 3D information to complete the prediction task. Using a dataset of 5,698 molecules that participate in 11 metabolic pathway categories in the KEGG database, MotifMol3D outperformed state-of-the-art methods in precision, recall, and F1 score. In addition, an ablation study and motif analysis demonstrated the effectiveness and usefulness of the model. Motif analysis, in particular, showed that motif information can characterize the main features of pathway-specific molecules to a certain extent and enhance the interpretability of the model. An external validation further corroborates this observation. MotifMol3D is an open-source tool that is available at https://github.com/Irena-Zhang/MotifMol3D.git.
Scientific contribution: MotifMol3D integrates motif information, graph neural networks, and 3D structural data to enhance feature extraction for small-sample molecules, improving the precision and interpretability of metabolic pathway predictions. The model outperforms state-of-the-art approaches in precision, recall, and F1 score. This work reveals how motif information characterizes pathway-specific molecules, offering novel insights into molecular properties within metabolic pathways.
Affiliation(s)
- Jianguo Hu
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Yiqing Zhang
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Jinxin Xie
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Zhen Yuan
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Zhangxiang Yin
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Shanshan Shi
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Honglin Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
- Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
- Lingang Laboratory, Shanghai, 200031, China.
| | - Shiliang Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
- Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
| |
|
2
|
Racki A, Paduszyński K. Recent Advances in the Modeling of Ionic Liquids Using Artificial Neural Networks. J Chem Inf Model 2025; 65:3161-3175. [PMID: 40143756 PMCID: PMC12004508 DOI: 10.1021/acs.jcim.4c02364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Revised: 02/07/2025] [Accepted: 03/14/2025] [Indexed: 03/28/2025]
Abstract
This paper reviews the recent and most impactful advancements in the application of artificial neural networks in modeling the properties of ionic liquids. As salts that are liquid at temperatures below 100 °C, ionic liquids possess unique properties beneficial for various industrial applications such as carbon capture, catalytic solvents, and lubricant additives. The study emphasizes the challenges in selecting appropriate ILs due to the vast variability in their properties, which depend significantly on their cation and anion structures. The review discusses the advantages of using ANNs, including feed-forward, cascade-forward, convolutional, recurrent, and graph neural networks, over traditional machine learning algorithms for predicting the thermodynamic and physical properties of ILs. The paper also highlights the importance of data preparation, including data collection, feature engineering, and data cleaning, in developing accurate predictive models. Additionally, the review covers the interpretability of these models using techniques such as SHapley Additive exPlanations to understand feature importance. The authors conclude by discussing future opportunities and the potential of combining ANNs with other computational methods to design new ILs with targeted properties.
Affiliation(s)
- Adrian Racki
- Department of Physical Chemistry, Faculty of Chemistry, Warsaw University of Technology, Noakowskiego 3, 00-664 Warsaw, Poland
| | - Kamil Paduszyński
- Department of Physical Chemistry, Faculty of Chemistry, Warsaw University of Technology, Noakowskiego 3, 00-664 Warsaw, Poland
| |
|
3
|
Lv Q, Chen G, Yang Z, Zhong W, Chen CYC. Meta-MolNet: A Cross-Domain Benchmark for Few Examples Drug Discovery. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4849-4863. [PMID: 40038923 DOI: 10.1109/tnnls.2024.3359657] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 03/06/2025]
Abstract
Predicting the pharmacological activity, toxicity, and pharmacokinetic properties of molecules is a central task in drug discovery. Existing machine learning methods transfer knowledge from a resource-rich molecular property to a data-scarce property within the same scaffold dataset. However, existing models may produce fragile and highly uncertain predictions for molecules with new scaffolds. Moreover, these models were tested on different benchmarks, which seriously affected the quality of their evaluation results. In this article, we introduce Meta-MolNet, a collection of benchmark data and algorithms that serves as a standard benchmark platform for measuring model generalization and uncertainty quantification capabilities. Meta-MolNet manages a wide range of molecular datasets with a high ratio of molecules/scaffolds, which often leads to more difficult data shift and generalization problems. Furthermore, we propose a graph attention network based on cross-domain meta-learning, Meta-GAT, which uses bilevel optimization to learn meta-knowledge from the scaffold family molecular dataset in the source domain. Meta-GAT benefits from meta-knowledge that reduces the requirement of sample complexity to enable reliable predictions of new scaffold molecules in the target domain through internal iteration of a few examples. We evaluate existing methods as baselines for the community, and the Meta-MolNet benchmark demonstrates the effectiveness of measuring the proposed algorithm in domain generalization and uncertainty quantification. Extensive experiments demonstrate that the Meta-GAT model has state-of-the-art domain generalization performance and robustly estimates uncertainty under few-example constraints. By publishing AI-ready data, evaluation frameworks, and baseline results, we hope to see the Meta-MolNet suite become a comprehensive resource for the AI-assisted drug discovery community. Meta-MolNet is freely accessible at https://github.com/lol88/Meta-MolNet.
|
4
|
Zheng S, Zhang C, Chen Y, Chen M. Graph and Multi-Level Sequence Fusion Learning for Predicting the Molecular Activity of BACE-1 Inhibitors. Int J Mol Sci 2025; 26:1681. [PMID: 40004143 PMCID: PMC11855840 DOI: 10.3390/ijms26041681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/12/2025] [Accepted: 02/14/2025] [Indexed: 02/27/2025] Open
Abstract
The development of BACE-1 (β-site amyloid precursor protein cleaving enzyme 1) inhibitors is a crucial focus in exploring early treatments for Alzheimer's disease (AD). Recently, graph neural networks (GNNs) have demonstrated significant advantages in predicting molecular activity. However, their reliance on graph structures alone often neglects explicit sequence-level semantic information. To address this limitation, we proposed a Graph and multi-level Sequence Fusion Learning (GSFL) model for predicting the molecular activity of BACE-1 inhibitors. First, molecular graph structures generated from SMILES strings were encoded using GNNs with an atomic-level characteristic attention mechanism. Next, substrings at the functional group, ion, and atomic levels were extracted from SMILES strings and encoded using a BiLSTM-Transformer framework equipped with a hierarchical attention mechanism. Finally, these features were fused to predict the activity of BACE-1 inhibitors. A dataset of 1548 compounds with BACE-1 activity measurements was curated from the ChEMBL database. In the classification experiment, the model achieved an accuracy of 0.941 on the training set and 0.877 on the test set. For the test set, it delivered a sensitivity of 0.852, a specificity of 0.894, an MCC of 0.744, an F1-score of 0.872, a PRC of 0.869, and an AUC of 0.915. Compared to traditional computer-aided drug design methods and other machine learning algorithms, the proposed model can effectively improve the accuracy of molecular activity prediction for BACE-1 inhibitors and has potential application value.
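The test-set metrics quoted in this abstract (sensitivity, specificity, MCC, F1-score) all derive from a binary confusion matrix. The following is a minimal sketch of those standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not the paper's confusion matrix).
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the active and inactive classes are imbalanced.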
Affiliation(s)
- Shaohua Zheng
- College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
| | - Changwang Zhang
- College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
| | - Youjia Chen
- College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
| | - Meimei Chen
- College of Traditional Chinese Medicine, Fujian University of Traditional Chinese Medicine, Fuzhou 350122, China
| |
|
5
|
Pei Y, Ma Y, Xiang Y, Zhang G, Feng Y, Li W, Zhou Y, Li S. Stress hyperglycemia ratio and machine learning model for prediction of all-cause mortality in patients undergoing cardiac surgery. Cardiovasc Diabetol 2025; 24:77. [PMID: 39955587 PMCID: PMC11829518 DOI: 10.1186/s12933-025-02644-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2024] [Accepted: 02/11/2025] [Indexed: 02/17/2025] Open
Abstract
BACKGROUND The stress hyperglycemia ratio (SHR) was developed to reduce the effects of long-term chronic glycemic factors on stress hyperglycemia levels, which are associated with adverse clinical outcomes. This study aims to evaluate the relationship between the postoperative SHR index and all-cause mortality in patients undergoing cardiac surgery. METHODS Data for this study were extracted from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database. Patients were categorized into four groups based on postoperative SHR index quartiles. The primary outcome was 30-day all-cause mortality, while the secondary outcomes included in-hospital, 90-day and 360-day all-cause mortality. The SHR index was analyzed using quartiles, and Kaplan-Meier curves were generated to compare outcomes across groups. Cox proportional hazards regression and restricted cubic splines (RCS) were employed to assess the relationship between the SHR index and the outcomes. LASSO regression was used for feature selection. Six machine learning algorithms were used to predict in-hospital all-cause mortality and were further extended to predict 360-day all-cause mortality. The SHapley Additive exPlanations method was used for visualizing model characteristics and individual case predictions. RESULTS A total of 3,848 participants were included in the study, with a mean age of 68 ± 12 years; 30.6% (1,179) were female. Higher postoperative SHR index levels were associated with an increased risk of in-hospital, 90-day and 360-day all-cause mortality as shown by Kaplan-Meier curves (log-rank P < 0.05). Cox regression analysis revealed that the highest postoperative SHR quartile was associated with a significantly higher risk of mortality at these time points (P < 0.05). RCS analysis demonstrated nonlinear relationships between the postoperative SHR index and all-cause mortality (P for nonlinear < 0.05). The Naive Bayes model achieved the highest area under the curve (AUC) for predicting both in-hospital mortality (0.7936) and 360-day all-cause mortality (0.7410). CONCLUSION In patients undergoing cardiac surgery, higher postoperative SHR index levels were significantly associated with increased risk of in-hospital, 90-day and 360-day all-cause mortality. The SHR index may serve as a valid tool for assessing the severity after cardiac surgery and guiding treatment decisions.
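The Kaplan-Meier curves mentioned in the methods are product-limit estimates of survival over follow-up time. The sketch below shows the estimator in plain Python, assuming right-censored data; the function name `kaplan_meier` and the toy follow-up values are hypothetical and unrelated to the MIMIC-IV cohort.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= removed
    return curve


# Toy follow-up data in days (hypothetical values for illustration).
times = [5, 8, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"day {t}: S(t) = {s:.3f}")
```

In a quartile analysis such as the one described, one curve would be estimated per SHR quartile and the curves compared with a log-rank test; that comparison is not shown here.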
Affiliation(s)
- Yingjian Pei
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China
| | - Yajun Ma
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China
| | - Ying Xiang
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China
| | - Guitao Zhang
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China
| | - Yao Feng
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China
| | - Wenbo Li
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China
| | - Yinghua Zhou
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China.
| | - Shujuan Li
- Department of Neurology, National Clinical Research Center for Cardiovascular Diseases, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, A 167, Beilishi Road, Xicheng District, Beijing, 100037, China.
| |
|
6
|
Ye W, Li J, Cai X. Mfgnn: Multi-Scale Feature-Attentive Graph Neural Networks for Molecular Property Prediction. J Comput Chem 2025; 46:e70011. [PMID: 39840745 DOI: 10.1002/jcc.70011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 12/03/2024] [Accepted: 12/09/2024] [Indexed: 01/23/2025]
Abstract
In the realm of artificial intelligence-driven drug discovery (AIDD), accurately predicting the influence of molecular structures on their properties is a critical research focus. While deep learning models based on graph neural networks (GNNs) have made significant advancements in this area, prior studies have primarily concentrated on molecule-level representations, often neglecting the impact of functional group structures and the potential relationships between fragments on molecular property predictions. To address this gap, we introduce the multi-scale feature attention graph neural network (MfGNN), which enhances traditional atom-based molecular graph representations by incorporating fragment-level representations derived from chemically synthesizable BRICS fragments. MfGNN not only effectively captures both the structural information of molecules and the features of functional groups but also pays special attention to the potential relationships between fragments, exploring how they collectively influence molecular properties. This model integrates two core mechanisms: a graph attention mechanism that captures embeddings of molecules and functional groups, and a feature extraction module that systematically processes BRICS fragment-level features to uncover relationships among the fragments. Our comprehensive experiments demonstrate that MfGNN outperforms leading machine learning and deep learning models, achieving state-of-the-art performance in 8 out of 11 learning tasks across various domains, including physical chemistry, biophysics, physiology, and toxicology. Furthermore, ablation studies reveal that the integration of multi-scale feature information and the feature extraction module enhances the richness of molecular features, thereby improving the model's predictive capabilities.
Affiliation(s)
- Weiting Ye
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, Guangdong, China
| | - Jingcheng Li
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, Guangdong, China
| | - Xianfa Cai
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, Guangdong, China
| |
|
7
|
Peng J, Fu L, Yang G, Cao D. Advanced AI-Driven Prediction of Pregnancy-Related Adverse Drug Reactions. J Chem Inf Model 2024; 64:9286-9298. [PMID: 39611337 DOI: 10.1021/acs.jcim.4c01657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2024]
Abstract
Ensuring drug safety during pregnancy is critical due to the potential risks to both the mother and fetus. However, the exclusion of pregnant women from clinical trials complicates the assessment of adverse drug reactions (ADRs) in this population. This study aimed to develop and validate risk prediction models for pregnancy-related ADRs of drugs using advanced Machine Learning (ML) and Deep Learning (DL) techniques, leveraging real-world data from the FDA Adverse Event Reporting System. We explored three methods (Information Component, Reporting Odds Ratio (ROR), and the 95% confidence interval of the ROR) for classifying drugs into high-risk and low-risk categories. DL models, including Directed Message Passing Neural Networks (DMPNN), Graph Neural Networks, and Graph Convolutional Networks, were developed and compared to traditional ML models like Random Forest, Support Vector Machines, and XGBoost. Among these, the DMPNN model, which integrated molecular graph information and molecular descriptors, exhibited the highest predictive performance, particularly at the preferred term level. The model was validated against external data sets from SIDER and DailyMed, demonstrating strong generalizability. Additionally, the model was applied to assess the risk of 22 oral hypoglycemic drugs, and potential substructure alerts for pregnancy-related ADRs were identified. These findings suggest that the DMPNN model is a valuable tool for predicting ADRs in pregnant women, offering significant advancement in drug safety assessment and providing crucial insights for safer medication use during pregnancy.
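The Reporting Odds Ratio and its 95% confidence interval named in this abstract are standard disproportionality measures computed from a 2×2 table of spontaneous reports. The sketch below uses the usual log-normal approximation; the function name, the example counts, and the labeling rule (lower confidence bound above 1 means high risk) are assumptions for illustration, since the abstract does not state the thresholds the authors used.

```python
import math


def reporting_odds_ratio(a, b, c, d, z=1.96):
    """ROR with an approximate 95% confidence interval.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with other drugs and the event of interest
    d: reports with other drugs and any other event
    """
    ror = (a * d) / (b * c)
    # Standard error of ln(ROR) under the log-normal approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - z * se)
    upper = math.exp(math.log(ror) + z * se)
    return ror, lower, upper


# Hypothetical report counts (not taken from FAERS or from the paper).
ror, lower, upper = reporting_odds_ratio(a=20, b=80, c=100, d=1800)

# Assumed labeling rule: flag a drug as high-risk when the lower bound exceeds 1.
high_risk = lower > 1
print(f"ROR = {ror:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), high risk: {high_risk}")
```

Labeling on the lower confidence bound rather than the point estimate is more conservative when a drug has few reports, because sparse counts widen the interval.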
Affiliation(s)
- Jinfu Peng
- Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Changsha 410031, Hunan, China
| | - Li Fu
- Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Changsha 410031, Hunan, China
| | - Guoping Yang
- Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Changsha 410031, Hunan, China
- The Third Xiangya Hospital, Central South University, No. 138 Tongzipo Road, Changsha 410031, Hunan, China
| | - Dongshen Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Changsha 410031, Hunan, China
| |
|
8
|
Xu Y, Liu X, Xia W, Ge J, Ju CW, Zhang H, Zhang JZ. ChemXTree: A Feature-Enhanced Graph Neural Network-Neural Decision Tree Framework for ADMET Prediction. J Chem Inf Model 2024; 64:8440-8452. [PMID: 39497657 PMCID: PMC11600499 DOI: 10.1021/acs.jcim.4c01186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 10/18/2024] [Accepted: 10/29/2024] [Indexed: 11/07/2024]
Abstract
The rapid progression of machine learning, especially deep learning (DL), has catalyzed a new era in drug discovery, introducing innovative approaches for predicting molecular properties. Despite the many methods available for feature representation, efficiently utilizing rich, high-dimensional information remains a significant challenge. Our work introduces ChemXTree, a novel graph-based model that integrates a Gate Modulation Feature Unit (GMFU) and neural decision tree (NDT) in the output layer to address this challenge. Extensive evaluations on benchmark data sets, including MoleculeNet and eight additional drug databases, have demonstrated ChemXTree's superior performance, surpassing or matching the current state-of-the-art models. Visualization techniques clearly demonstrate that ChemXTree significantly improves the separation between substrates and nonsubstrates in the latent space. In summary, ChemXTree demonstrates a promising approach for integrating advanced feature extraction with neural decision trees, offering significant improvements in predictive accuracy for drug discovery tasks and opening new avenues for optimizing molecular properties.
Affiliation(s)
- Yuzhi Xu
- Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning and NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Xinxin Liu
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States
- Department of Materials Science and Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States
| | - Wei Xia
- Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning and NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Jiankai Ge
- Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Cheng-Wei Ju
- Pritzker School of Molecular Engineering, The University of Chicago, Chicago, Illinois 60615, United States
| | - Haiping Zhang
- Faculty of Synthetic Biology, Shenzhen Institute of Advanced Technology, Shenzhen 518055, China
| | - John Z.H. Zhang
- Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning and NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
- Department of Chemistry, New York University, New York, New York 10003, United States
- Faculty of Synthetic Biology, Shenzhen Institute of Advanced Technology, Shenzhen 518055, China
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, 200062 Shanghai, China
| |
|
9
|
He G, Liu S, Liu Z, Wang C, Zhang K, Li H. Prototype-based contrastive substructure identification for molecular property prediction. Brief Bioinform 2024; 25:bbae565. [PMID: 39494969 PMCID: PMC11533112 DOI: 10.1093/bib/bbae565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 08/11/2024] [Accepted: 10/22/2024] [Indexed: 11/05/2024] Open
Abstract
Substructure-based representation learning has emerged as a powerful approach to featurize complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. It remains an open challenge to adaptively identify meaningful substructures from numerous molecular graphs to accommodate MPP tasks. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework to autonomously discover substructural prototypes across graphs so as to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification: firstly, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures; secondly, it aligns resultant substructures with derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism is designed to integrate substructure-level information to enhance molecular representations. The effectiveness of the POSIT framework is demonstrated by experimental results from diverse real-world datasets, covering both classification and regression tasks. Moreover, visualization analysis validates the consistency of chemical priors with identified substructures. The source code is publicly available at https://github.com/VRPharmer/POSIT.
Affiliation(s)
- Gaoqi He
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Shun Liu
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Zhuoran Liu
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Changbo Wang
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Kai Zhang
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Honglin Li
- Innovation Center for AI and Drug Discovery, East China Normal University, 200062 Shanghai, China
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, 200237 Shanghai, China
| |
|
10
|
Niu X, Zhang Q, Dang Y, Hu W, Sun Y. MolPackL: Quantification and Interpretation of Intermolecular Interactions Driven by Molecular Packing. J Am Chem Soc 2024; 146:24075-24084. [PMID: 39141522 DOI: 10.1021/jacs.4c08132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/16/2024]
Abstract
In organic optoelectronic devices, the properties of the aggregated organic materials depend not only on individual molecules or monomers but also significantly on their packing modes. Unlike their inorganic counterparts, which are linked by explicit covalent bonds, organic solids exhibit intricate and numerous intermolecular interactions (IMIs). Due to the intrinsic complexity and disorder of IMIs, identifying and understanding them is a formidable challenge in experimental, theoretical, and data-driven approaches. In this work, we constructed an innovative algorithm framework, Molecular Packing Learning (MolPackL), which can accurately quantify elusive IMIs using contact density histograms (CDHs) and efficiently extract intermolecular features for further property prediction of organic solids. It performs satisfactorily in training predictive models of IMI-related properties in molecular crystals. In particular, the band gap predictive model based on MolPackL achieved the best-reported performance, with an MAE of 0.20 eV and an R² of 0.92. Class activation mapping (CAM) visually demonstrates MolPackL's accurate identification of effective interaction sites as the molecular packing changes. Moreover, the elemental importance analysis verified that the superior score benefits from MolPackL's ability to comprehensively consider multiple influencing factors of IMIs. In summary, MolPackL provides a new framework for quantitative assessment and understanding of the effect of IMIs. The development of MolPackL marks a significant advancement in establishing predictive models of molecular aggregates, deepening the understanding of how IMIs affect material properties. Given the superior performance, we believe that MolPackL will also become a powerful tool in the design of high-performance organic optoelectronic materials.
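The two regression scores quoted for the band gap model, mean absolute error (MAE) and the coefficient of determination (R²), follow standard definitions. The sketch below shows both; the function names and the band gap values are hypothetical placeholders, not data from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical band gaps in eV (illustrative values only).
gaps_true = [1.0, 2.0, 3.0, 4.0]
gaps_pred = [1.2, 1.9, 3.1, 3.6]

mae = mean_absolute_error(gaps_true, gaps_pred)
r2 = r_squared(gaps_true, gaps_pred)
print(f"MAE = {mae:.2f} eV, R^2 = {r2:.3f}")
```

MAE is reported in the units of the target (eV here), while R² is unitless and measures the fraction of variance in the reference values that the predictions explain.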
Affiliation(s)
- Xinxin Niu
- Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University, Tianjin 300072, P.R. China
| | - Qian Zhang
- Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University, Tianjin 300072, P.R. China
| | - Yanfeng Dang
- Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University, Tianjin 300072, P.R. China
| | - Wenping Hu
- Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University, Tianjin 300072, P.R. China
- Joint School of National University of Singapore and Tianjin University, Fuzhou 350207, P.R. China
| | - Yajing Sun
- Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University, Tianjin 300072, P.R. China
|
11
|
Li B, Chen H, Lin X, Duan H. Multimodal learning system integrating electronic medical records and hysteroscopic images for reproductive outcome prediction and risk stratification of endometrial injury: a multicenter diagnostic study. Int J Surg 2024; 110:3237-3248. [PMID: 38935827 PMCID: PMC11175765 DOI: 10.1097/js9.0000000000001241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2023] [Accepted: 02/19/2024] [Indexed: 06/29/2024]
Abstract
OBJECTIVE To develop a multimodal learning application system that integrates electronic medical records (EMR) and hysteroscopic images for reproductive outcome prediction and risk stratification of patients with intrauterine adhesions (IUAs) resulting from endometrial injuries.
MATERIALS AND METHODS EMR and 5014 revisited hysteroscopic images of 753 post hysteroscopic adhesiolysis patients from the multicenter IUA database we established were randomly allocated to training, validation, and test datasets. The respective datasets were used for model development, tuning, and testing of the multimodal learning application. MobilenetV3 was employed for image feature extraction, and XGBoost for EMR and image feature ensemble learning. The performance of the application was compared against the single-modal approaches (EMR or hysteroscopic images), DeepSurv and ElasticNet models, along with the clinical scoring systems. The primary outcome was the 1-year conception prediction accuracy, and the secondary outcome was the assisted reproductive technology (ART) benefit ratio after risk stratification.
RESULTS The multimodal learning system exhibited superior performance in predicting conception within 1 year, achieving areas under the curve of 0.967 (95% CI: 0.950-0.985), 0.936 (95% CI: 0.883-0.989), and 0.965 (95% CI: 0.935-0.994) in the training, validation, and test datasets, respectively, surpassing single-modal approaches, other models and clinical scoring systems (all P<0.05). The application ran seamlessly on the hysteroscopic platform, with an average analysis time of 3.7±0.8 s per patient. By employing the application's conception probability-based risk stratification, mid-high-risk patients demonstrated a significant ART benefit (odds ratio=6, 95% CI: 1.27-27.8, P=0.02), while low-risk patients exhibited good natural conception potential, with no significant increase in conception rates from ART treatment (P=1).
CONCLUSIONS The multimodal learning system using hysteroscopic images and EMR demonstrates promise in accurately predicting the natural conception of patients with IUAs and providing effective postoperative stratification, potentially contributing to ART triage after IUA procedures.
Affiliation(s)
- Bohan Li
- Department of Minimally Invasive Gynecologic Center, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing Maternal and Child Health Care Hospital
| | - Hui Chen
- School of Biomedical Engineering
- Beijing Advanced Innovation Center for Big Data-based Precision Medicine, Capital Medical University, Beijing
| | - Xiaona Lin
- Assisted Reproduction Unit, Department of Obstetrics and Gynecology, Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University, Key Laboratory of Reproductive Dysfunction Management of Zhejiang Province, Hangzhou, People’s Republic of China
| | - Hua Duan
- Department of Minimally Invasive Gynecologic Center, Beijing Obstetrics and Gynecology Hospital, Capital Medical University, Beijing Maternal and Child Health Care Hospital
|
12
|
Li B, Wang Z, Liu Z, Tao Y, Sha C, He M, Li X. DrugMetric: quantitative drug-likeness scoring based on chemical space distance. Brief Bioinform 2024; 25:bbae321. [PMID: 38975893 PMCID: PMC11229036 DOI: 10.1093/bib/bbae321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 05/20/2024] [Accepted: 06/27/2024] [Indexed: 07/09/2024] Open
Abstract
The process of drug discovery is widely known to be lengthy and resource-intensive. Artificial Intelligence approaches bring hope for accelerating the identification of molecules with the necessary properties for drug development. Drug-likeness assessment is crucial for the virtual screening of candidate drugs. However, traditional methods like Quantitative Estimation of Drug-likeness (QED) struggle to distinguish between drug and non-drug molecules accurately. Additionally, some deep learning-based binary classification models rely heavily on the selection of negative training sets. To address these challenges, we introduce DrugMetric, a novel unsupervised learning framework for quantitatively assessing drug-likeness based on chemical space distance. DrugMetric blends the powerful learning ability of variational autoencoders with the discriminative ability of the Gaussian Mixture Model. This synergy enables DrugMetric to identify significant differences in drug-likeness across different datasets effectively. Moreover, DrugMetric incorporates principles of ensemble learning to enhance its predictive capabilities. Upon testing over a variety of tasks and datasets, DrugMetric consistently showcases superior scoring and classification performance. It excels in quantifying drug-likeness and accurately distinguishing candidate drugs from non-drugs, surpassing traditional methods including QED. This work highlights DrugMetric as a practical tool for drug-likeness scoring, facilitating the acceleration of virtual drug screening, and has potential applications in other biochemical fields.
Affiliation(s)
- Bowen Li
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018 Zhejiang, China
| | - Zhen Wang
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018 Zhejiang, China
- College of Electrical and Information Engineering, Hunan University, Changsha, 410082 Hunan, China
| | - Ziqi Liu
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018 Zhejiang, China
- Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024 Zhejiang, China
| | - Yanxin Tao
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018 Zhejiang, China
| | - Chulin Sha
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018 Zhejiang, China
| | - Min He
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018 Zhejiang, China
- College of Electrical and Information Engineering, Hunan University, Changsha, 410082 Hunan, China
| | - Xiaolin Li
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018 Zhejiang, China
- ElasticMind Inc, Hangzhou, 310018 Zhejiang, China
|
13
|
Du W, Zhao L, Wu R, Huang B, Liu S, Liu Y, Huang H, Shi G. Predicting drug-Protein interaction with deep learning framework for molecular graphs and sequences: Potential candidates against SAR-CoV-2. PLoS One 2024; 19:e0299696. [PMID: 38728335 PMCID: PMC11086825 DOI: 10.1371/journal.pone.0299696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 02/14/2024] [Indexed: 05/12/2024] Open
Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused the COVID-19 disease, which represents a new life-threatening disaster. Many therapeutics against the viral infection, such as vaccines and receptor decoys, have been investigated to alleviate the epidemic. However, the continuously mutating coronavirus, especially the Delta and Omicron variants, tends to invalidate therapeutic biological products. Thus, it is necessary to develop molecular entities as broad-spectrum antiviral drugs. Coronavirus replication is controlled by the viral 3-chymotrypsin-like cysteine protease (3CLpro) enzyme, which is required for the virus's life cycle. In the cases of severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), 3CLpro has been shown to be a promising therapeutic development target. Here we proposed an attention-based deep learning framework for molecular graphs and sequences, trained on the BindingDB 3CLpro dataset (114,555 compounds). After constructing this model, we conducted large-scale screening of the in vivo/in vitro dataset (276,003 compounds) from the ZINC database and visualized the candidate compounds with attention scores. Geometric-based affinity prediction was employed for validation. Finally, we established a 3CLpro-specific deep learning framework, namely GraphDPI-3CL (AUROC: 0.958), which achieved superior performance beyond the existing state-of-the-art model, and discovered 10 molecules with a high binding affinity for 3CLpro and a superior binding mode.
Affiliation(s)
- Weian Du
- Department of Dermatology, Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Liang Zhao
- Shenzhen Health Development Research and Data Management Center, Shenzhen, China
| | - Rong Wu
- Department of Dermatology, Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Boning Huang
- School of Finance, Shanghai University of Finance and Economics, Shanghai, China
| | - Si Liu
- Department of Cosmetic and Plastic Surgery, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| | - Yufeng Liu
- Department of Cosmetic and Plastic Surgery, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
| | - Huaiqiu Huang
- Department of Dermatology, Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Ge Shi
- Department of Cosmetic and Plastic Surgery, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
|
14
|
Yang H, Liu J, Chen K, Cong S, Cai S, Li Y, Jia Z, Wu H, Lou T, Wei Z, Yang X, Xiao H. D-CyPre: a machine learning-based tool for accurate prediction of human CYP450 enzyme metabolic sites. PeerJ Comput Sci 2024; 10:e2040. [PMID: 38855237 PMCID: PMC11157575 DOI: 10.7717/peerj-cs.2040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 04/15/2024] [Indexed: 06/11/2024]
Abstract
The advancement of graph neural networks (GNNs) has made it possible to accurately predict metabolic sites. Despite the combination of GNNs with XGBOOST showing impressive performance, this technology has not yet been applied in the realm of metabolic site prediction. Previous metabolic site prediction tools focused on bonds and atoms, regardless of the overall molecular skeleton. This study introduces a novel tool, named D-CyPre, that amalgamates atom, bond, and molecular skeleton information via two directed message-passing neural networks (D-MPNN) to predict the metabolic sites of the nine cytochrome P450 enzymes using XGBOOST. In D-CyPre Precision Mode, the model produces fewer, but more accurate results (Jaccard score: 0.497, F1: 0.660, and precision: 0.737 in the test set). In D-CyPre Recall Mode, the model produces less accurate, but more comprehensive results (Jaccard score: 0.506, F1: 0.669, and recall: 0.720 in the test set). In the test set of 68 reactants, D-CyPre outperformed BioTransformer on all isoenzymes and CyProduct on most isoenzymes (5/9). For the subtypes where D-CyPre outperformed CyProduct, the Jaccard and F1 scores increased by 24% and 16% in Precision Mode (4/9) and 19% and 12% in Recall Mode (5/9), respectively, relative to the second-best CyProduct. Overall, D-CyPre provides more accurate prediction results for human CYP450 enzyme metabolic sites.
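For readers unfamiliar with the set-based scores quoted in this abstract, the following is a minimal sketch of how Jaccard score, precision, recall, and F1 can be computed for one molecule when predicted and true metabolic sites are treated as sets of atom indices. The function name `site_metrics` and the example indices are hypothetical, and the way the authors aggregate scores across molecules and isoenzymes is not stated here, so this is an illustration of the standard definitions rather than the paper's evaluation code.

```python
def site_metrics(predicted, actual):
    """Set-based scores for one molecule (hypothetical helper, standard definitions)."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)   # sites both predicted and observed
    union = len(predicted | actual)      # sites in either set

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    jaccard = true_pos / union if union else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"jaccard": jaccard, "precision": precision, "recall": recall, "f1": f1}


# Illustrative atom indices only; not data from the paper.
scores = site_metrics(predicted=[1, 2, 3], actual=[2, 3, 4, 5])
print(scores)
```

Predicting fewer sites tends to raise precision at the cost of recall, which is the trade-off the two modes in the abstract describe.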
Affiliation(s)
- Haolan Yang
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Jie Liu
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Kui Chen
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Shiyu Cong
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Shengnan Cai
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Yueting Li
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Zhixin Jia
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Hao Wu
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Tianyu Lou
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Zuying Wei
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Xiaoqin Yang
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
| | - Hongbin Xiao
- Beijing University of Chinese Medicine, Research Center of Chinese Medicine Analysis and Transformation, Beijing, China
|
15
|
Nada H, Kim S, Lee K. PT-Finder: A multi-modal neural network approach to target identification. Comput Biol Med 2024; 174:108444. [PMID: 38636325 DOI: 10.1016/j.compbiomed.2024.108444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 04/04/2024] [Accepted: 04/07/2024] [Indexed: 04/20/2024]
Abstract
Efficient target identification for bioactive compounds, including novel synthetic analogs, is crucial for accelerating the drug discovery pipeline. However, the process of target identification presents significant challenges and is often expensive, which in turn can hinder drug discovery efforts. To address these challenges, machine learning applications have arisen as a promising approach for predicting the targets of novel chemical compounds. These methods allow the exploration of ligand-target interactions, the uncovering of biochemical mechanisms, and the investigation of drug repurposing. Typically, the current target identification tools rely on assessing ligand structural similarities. Herein, a multi-modal neural network model was built using a library of proteins, their respective sequences, and active inhibitors. Subsequent validations showed the model to possess an accuracy of 82% and an MPRAUC of 0.80. Leveraging the trained model, we developed PT-Finder (Protein Target Finder), a user-friendly offline application that is capable of predicting the target proteins for hundreds of compounds within a few seconds. This combination of offline operation, speed, and accuracy positions PT-Finder as a powerful tool to accelerate drug discovery workflows. PT-Finder and its source code have been made freely accessible for download at https://github.com/PT-Finder/PT-Finder.
Affiliation(s)
- Hossam Nada
- BK21 FOUR Team and Integrated Research Institute for Drug Development, College of Pharmacy, Dongguk University-Seoul, Goyang, 10326, Republic of Korea
| | - Sungdo Kim
- BK21 FOUR Team and Integrated Research Institute for Drug Development, College of Pharmacy, Dongguk University-Seoul, Goyang, 10326, Republic of Korea
| | - Kyeong Lee
- BK21 FOUR Team and Integrated Research Institute for Drug Development, College of Pharmacy, Dongguk University-Seoul, Goyang, 10326, Republic of Korea.
|
16
|
Zhang Q, Cai L, Liao N, Lu Y, Zhang J, Zhang C, Zeng K. Work Function Prediction by Graph Neural Networks for Configurationally Hybridized Boron-Doped Graphene. Langmuir 2024; 40:7087-7094. [PMID: 38511875 DOI: 10.1021/acs.langmuir.4c00228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/22/2024]
Abstract
Graphene, serving as electrodes, is widely applied in electronic and optoelectronic devices. The work function, one of the fundamental intrinsic characteristics of graphene, directly affects the interfacial properties of the electrodes, thereby affecting the performance of the devices. Much work has been done to regulate the work function of graphene to expand its application fields, and doping has been demonstrated as an effective method. However, the numerous types of doped graphene make the investigation of its work function time-consuming and labor-intensive. In order to quickly obtain the relationship between structure and property, a deep learning method is employed to predict the work function in this study. Specifically, a data set of over 30,000 compositions with the work function of boron-doped graphene at different concentrations and doping positions was established through ab initio density functional theory calculations. Then, a novel fusion model (GT-Net) combining transformers and graph neural networks (GNNs) was proposed. After that, improved GNN-based descriptors were developed. Finally, three different GNN methods were compared, and the results show that the proposed method could accurately predict the work function with R² = 0.975 and RMSE = 0.027. This study not only provides the possibility of designing materials with specific properties at the atomic level but also demonstrates the performance of GNNs on graph-level tasks with the same graph structure and atomic number.
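As a reading aid for the regression scores quoted in this abstract, here is a minimal sketch of the standard definitions of the coefficient of determination, RMSE, and MAE. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the paper; the abstract does not state the units of its RMSE, so none are assumed.

```python
import math


def regression_metrics(y_true, y_pred):
    """Standard R^2, RMSE, and MAE for paired values (hypothetical helper)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares

    r2 = 1.0 - ss_res / ss_tot  # undefined if all true values are identical
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "rmse": rmse, "mae": mae}


# Illustrative values only; not data from the paper.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]))
```

RMSE and MAE are in the units of the predicted quantity, while R² is dimensionless, so the two kinds of score are complementary rather than interchangeable.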
Affiliation(s)
- Qingwei Zhang
- Chongqing University of Technology, Chongqing 401120, China
| | - Lin Cai
- Chongqing University of Technology, Chongqing 401120, China
| | - Ningsheng Liao
- Chongqing University of Technology, Chongqing 401120, China
| | - Yunhua Lu
- Chongqing University of Technology, Chongqing 401120, China
| | - Junan Zhang
- Chongqing University of Technology, Chongqing 401120, China
| | - Chao Zhang
- School of Chemical Engineering, Qinghai University, Xining 810016, China
| | - Kangli Zeng
- Chongqing University of Technology, Chongqing 401120, China
|
17
|
Qi X, Zhao Y, Qi Z, Hou S, Chen J. Machine Learning Empowering Drug Discovery: Applications, Opportunities and Challenges. Molecules 2024; 29:903. [PMID: 38398653 PMCID: PMC10892089 DOI: 10.3390/molecules29040903] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 01/15/2024] [Revised: 02/08/2024] [Accepted: 02/14/2024] [Indexed: 02/25/2024] Open
Abstract
Drug discovery plays a critical role in advancing human health by developing new medications and treatments to combat diseases. How to accelerate the pace and reduce the costs of new drug discovery has long been a key concern for the pharmaceutical industry. Fortunately, by leveraging advanced algorithms, computational power and biological big data, artificial intelligence (AI) technology, especially machine learning (ML), holds the promise of making the hunt for new drugs more efficient. Recently, the Transformer-based models that have achieved revolutionary breakthroughs in natural language processing have sparked a new era of their applications in drug discovery. Herein, we introduce the latest applications of ML in drug discovery, highlight the potential of advanced Transformer-based ML models, and discuss the future prospects and challenges in the field.
Affiliation(s)
- Xin Qi
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
| | - Yuanchun Zhao
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
| | - Zhuang Qi
- School of Software, Shandong University, Jinan 250101, China
| | - Siyu Hou
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
| | - Jiajia Chen
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
|
18
|
Pandey M, Shah SK, Gromiha MM. Computational approaches for identifying disease-causing mutations in proteins. Adv Protein Chem Struct Biol 2023; 139:141-171. [PMID: 38448134 DOI: 10.1016/bs.apcsb.2023.11.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/08/2024]
Abstract
Advancements in genome sequencing have expanded the scope of investigating mutations in proteins across different diseases. Amino acid mutations in a protein alter its structure, stability and function and some of them lead to diseases. Identification of disease-causing mutations is a challenging task and it will be helpful for designing therapeutic strategies. Hence, mutation data available in the literature have been curated and stored in several databases, which have been effectively utilized for developing computational methods to identify deleterious mutations (drivers), using sequence and structure-based properties of proteins. In this chapter, we describe the contents of specific databases that have information on disease-causing and neutral mutations followed by sequence and structure-based properties. Further, characteristic features of disease-causing mutations will be discussed along with computational methods for identifying cancer hotspot residues and disease-causing mutations in proteins.
Affiliation(s)
- Medha Pandey
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
| | - Suraj Kumar Shah
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
| | - M Michael Gromiha
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India; International Research Frontiers Initiative, School of Computing, Tokyo Institute of Technology, Yokohama, Japan.
|
19
|
Wu Y, Li K, Li M, Pu X, Guo Y. Attention Mechanism-Based Graph Neural Network Model for Effective Activity Prediction of SARS-CoV-2 Main Protease Inhibitors: Application to Drug Repurposing as Potential COVID-19 Therapy. J Chem Inf Model 2023; 63:7011-7031. [PMID: 37960886 DOI: 10.1021/acs.jcim.3c01280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Compared to de novo drug discovery, drug repurposing provides a time-efficient way to treat coronavirus disease 2019 (COVID-19), which is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). SARS-CoV-2 main protease (Mpro) has proved to be an attractive drug target due to its pivotal involvement in viral replication and transcription. Here, we present a graph neural network-based deep-learning (DL) strategy to prioritize existing drugs for their potential therapeutic effects against SARS-CoV-2 Mpro. Mpro inhibitors were represented as molecular graphs ready for graph attention network (GAT) and graph isomorphism network (GIN) modeling for predicting the inhibitory activities. The results show that the GAT model outperforms the GIN and other competitive models and yields satisfactory predictions for unseen Mpro inhibitors, confirming its robustness and generalization. The attention mechanism of GAT makes it possible to capture the dominant substructures and thus makes the model interpretable. Finally, we applied the optimal GAT model in conjunction with molecular docking simulations to screen the Drug Repurposing Hub (DRH) database. As a result, 18 drug hits with the best consensus prediction scores and binding affinity values were identified as potential therapeutics against COVID-19. Both the extensive literature search and the evaluations of absorption, distribution, metabolism, excretion, and toxicity (ADMET) illustrate the favorable drug-likeness and pharmacokinetic properties of the drug candidates. Overall, our work not only provides an effective GAT-based DL prediction tool for the inhibitory activity of SARS-CoV-2 Mpro inhibitors but also provides theoretical guidelines for drug discovery in COVID-19 treatment.
Affiliation(s)
- Yanling Wu
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Kun Li
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Menglong Li
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Xuemei Pu
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu 610064, China
|
20
|
Hu J, Li Z, Lin J, Zhang L. Prediction and Interpretability of Glass Transition Temperature of Homopolymers by Data-Augmented Graph Convolutional Neural Networks. ACS Appl Mater Interfaces 2023; 15:54006-54017. [PMID: 37934171 DOI: 10.1021/acsami.3c13698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2023]
Abstract
Establishing the structure-property relationship by machine learning (ML) models is extremely valuable for accelerating the molecular design of polymers. However, existing ML models for polymers suffer from scarce training data and limited variation in the graph structures of molecules. In addition, few works have explored the interpretability of ML models to infer latent knowledge in the field of polymer science that could inspire ML-assisted molecular design. In this contribution, we integrate graph convolutional neural networks (GCNs) with a data augmentation strategy to predict the glass transition temperature Tg of polymers. It is demonstrated that the data-augmented GCN model outperforms the conventional models and achieves a higher accuracy for the prediction of Tg despite a small amount of training data. Furthermore, taking advantage of molecular graph representations, the data-augmented GCN model can infer the importance of atoms or substructures for Tg, which generally agrees with the experimental findings in the field of polymer science. The inferred knowledge of the GCN model is used to advise on the design of functional polymers with specific Tg. The data-augmented GCN model offers clear advantages in establishing structure-property relationships and also provides an efficient way of accelerating the rational design of polymer molecules.
Affiliation(s)
- Junyang Hu
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Zean Li
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Jiaping Lin
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Liangshun Zhang
- Shanghai Key Laboratory of Advanced Polymeric Materials, School of Materials Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
|
21
|
Shilpa S, Kashyap G, Sunoj RB. Recent Applications of Machine Learning in Molecular Property and Chemical Reaction Outcome Predictions. J Phys Chem A 2023; 127:8253-8271. [PMID: 37769193 DOI: 10.1021/acs.jpca.3c04779] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 09/30/2023]
Abstract
Burgeoning developments in machine learning (ML) and its rapidly growing adaptations in chemistry are noteworthy. Motivated by the successful deployments of ML in the realm of molecular property prediction (MPP) and chemical reaction prediction (CRP), herein we highlight some of its most recent applications in predictive chemistry. We present a nonmathematical and concise overview of the progression of ML implementations, ranging from an ensemble-based random forest model to advanced graph neural network algorithms. Similarly, the prospects of various feature engineering and feature learning approaches that work in conjunction with ML models are described. Highly accurate predictions reported in MPP tasks (e.g., lipophilicity, solubility, distribution coefficient), using methods such as D-MPNN, MolCLR, SMILES-BERT, and MolBERT, offer promising avenues in molecular design and drug discovery. Whereas MPP pertains to a given molecule, ML applications in chemical reactions present a different level of challenge, primarily arising from the simultaneous involvement of multiple molecules and their diverse roles in a reaction setting. The reported RMSEs in MPP tasks range from 0.287 to 2.20, while those for yield predictions are well over 4.9 in the lower end, reaching thresholds of >10.0 in several examples. Our Review concludes with a set of persisting challenges in dealing with reaction data sets and an overall optimistic outlook on benefits of ML-driven workflows for various MPP as well as CRP tasks.
Affiliation(s)
- Shilpa Shilpa
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| | - Gargee Kashyap
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| | - Raghavan B Sunoj
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
|
22
|
Dou B, Zhu Z, Merkurjev E, Ke L, Chen L, Jiang J, Zhu Y, Liu J, Zhang B, Wei GW. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem Rev 2023; 123:8736-8780. [PMID: 37384816 PMCID: PMC10999174 DOI: 10.1021/acs.chemrev.3c00189] [Citation(s) in RCA: 79] [Impact Index Per Article: 39.5] [Indexed: 07/01/2023]
Abstract
Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, while big data have been the focus for the past decade, small data and their challenges have received little attention, even though they are technically more severe in machine learning (ML) and deep learning (DL) studies. Overall, the small data challenge is often compounded by issues such as data diversity, imputation, noise, imbalance, and high-dimensionality. Fortunately, the current big data era is characterized by technological breakthroughs in ML, DL, and artificial intelligence (AI), which enable data-driven scientific discovery, and many advanced ML and DL technologies developed for big data have inadvertently provided solutions for small data problems. As a result, significant progress has been made in ML and DL for small data challenges in the past decade. In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences. We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), Generative Adversarial Network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation. We also briefly discuss the latest advances in these methods. Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.
Affiliation(s)
- Bozheng Dou
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Zailiang Zhu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Ekaterina Merkurjev
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Lu Ke
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Long Chen
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jian Jiang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yueying Zhu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jie Liu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Bengong Zhang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
23
Molecular Property Prediction by Combining LSTM and GAT. Biomolecules 2023; 13:biom13030503. [PMID: 36979438 PMCID: PMC10046625 DOI: 10.3390/biom13030503] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 01/06/2023] [Revised: 02/10/2023] [Accepted: 03/06/2023] [Indexed: 03/12/2023] Open
Abstract
Molecular property prediction is an important direction in computer-aided drug design. In this paper, to fully explore the information from SMILES strings and graph data of molecules, we combined the SALSTM and GAT methods in order to mine the feature information of molecules from sequences and graphs. Atom embeddings are first obtained from SMILES strings through SALSTM; they are then combined with graph node features and fed into the GAT to extract the global molecular representation. At the same time, data augmentation is added to enlarge the training dataset and improve the performance of the model. Finally, to enhance the interpretability of the model, the attention layers of both models are fused together to highlight the key atoms. Comparison with other graph-based and sequence-based methods on multiple datasets shows that our method can achieve high prediction accuracy with good generalizability.
24
Song Y, Chen J, Wang W, Chen G, Ma Z. Double-head transformer neural network for molecular property prediction. J Cheminform 2023; 15:27. [PMID: 36823530 PMCID: PMC9951429 DOI: 10.1186/s13321-023-00700-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2022] [Accepted: 02/16/2023] [Indexed: 02/25/2023] Open
Abstract
Existing molecular property prediction methods based on deep learning ignore the generalization ability of the nonlinear representation of molecular features and the reasonable assignment of weights of molecular features, making it difficult to further improve the accuracy of molecular property prediction. To solve the above problems, an end-to-end double-head transformer neural network (DHTNN) is proposed in this paper for high-precision molecular property prediction. For the data distribution characteristics of the molecular dataset, DHTNN specially designs a new activation function, beaf, which can greatly improve the generalization ability of the nonlinear representation of molecular features. A residual network is introduced in the molecular encoding part to solve the gradient explosion problem and ensure that the model can converge quickly. The transformer based on double-head attention is used to extract molecular intrinsic detail features, and the weights are reasonably assigned for predicting molecular properties with high accuracy. Our model, which was tested on the MoleculeNet [1] benchmark dataset, showed significant performance improvements over other state-of-the-art methods.
Affiliation(s)
- Yuanbing Song
  - College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
- Jinghua Chen
  - College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
- Wenju Wang
  - College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
- Gang Chen
  - College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
- Zhichong Ma
  - College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
25
Tian Y, Wang X, Yao X, Liu H, Yang Y. Predicting molecular properties based on the interpretable graph neural network with multistep focus mechanism. Brief Bioinform 2023; 24:6918752. [PMID: 36526280 DOI: 10.1093/bib/bbac534] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/19/2022] [Revised: 10/24/2022] [Accepted: 11/07/2022] [Indexed: 12/23/2022] Open
Abstract
Graph neural networks based on deep learning methods have been extensively applied to molecular property prediction because of their powerful feature learning ability and good performance. However, most of them are black boxes and cannot give a reasonable explanation of the underlying prediction mechanisms, which seriously reduces people's trust in neural network-based prediction models. Here we propose a novel graph neural network named iteratively focused graph network (IFGN), which can gradually identify the key atoms/groups in the molecule that are closely related to the predicted properties through a multistep focus mechanism. At the same time, the combination of the multistep focus mechanism with visualization can also generate multistep interpretations, thus allowing us to gain a deep understanding of the predictive behaviors of the model. For all eight datasets studied, the IFGN model achieved good prediction performance, indicating that the proposed multistep focus mechanism can also clearly improve the performance of the model besides increasing its interpretability. For the convenience of researchers, a corresponding website (http://graphadmet.cn/works/IFGN) was also developed and can be used free of charge.
Affiliation(s)
- Yanan Tian
  - Faculty of Applied Science, Macao Polytechnic University, Macao, China
- Xiaorui Wang
  - State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macao, China
- Xiaojun Yao
  - State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macao, China
- Huanxiang Liu
  - Faculty of Applied Science, Macao Polytechnic University, Macao, China
- Ying Yang
  - Department of Quality Management, Guangdong Provincial Center for Disease Prevention and Control, Guangzhou, China
26
Zhu W, Zhang Y, Zhao D, Xu J, Wang L. HiGNN: A Hierarchical Informative Graph Neural Network for Molecular Property Prediction Equipped with Feature-Wise Attention. J Chem Inf Model 2023; 63:43-55. [PMID: 36519623 DOI: 10.1021/acs.jcim.2c01099] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Indexed: 12/23/2022]
Abstract
Elucidating and accurately predicting the druggability and bioactivities of molecules plays a pivotal role in drug design and discovery and remains an open challenge. Recently, graph neural networks (GNNs) have made remarkable advancements in graph-based molecular property prediction. However, current graph-based deep learning methods neglect the hierarchical information of molecules and the relationships between feature channels. In this study, we propose a well-designed hierarchical informative graph neural network (termed HiGNN) framework for predicting molecular property by utilizing a corepresentation learning of molecular graphs and chemically synthesizable breaking of retrosynthetically interesting chemical substructure (BRICS) fragments. Furthermore, a plug-and-play feature-wise attention block is first designed in HiGNN architecture to adaptively recalibrate atomic features after the message passing phase. Extensive experiments demonstrate that HiGNN achieves state-of-the-art predictive performance on many challenging drug discovery-associated benchmark data sets. In addition, we devise a molecule-fragment similarity mechanism to comprehensively investigate the interpretability of the HiGNN model at the subgraph level, indicating that HiGNN as a powerful deep learning tool can help chemists and pharmacists identify the key components of molecules for designing better molecules with desired properties or functions. The source code is publicly available at https://github.com/idruglab/hignn.
Affiliation(s)
- Weimin Zhu
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Yi Zhang
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Duancheng Zhao
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Jianrong Xu
  - Department of Pharmacology and Chemical Biology, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
  - Academy of Integrative Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
- Ling Wang
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
27
Tian H, Ketkar R, Tao P. ADMETboost: a web server for accurate ADMET prediction. J Mol Model 2022; 28:408. [PMID: 36454321 PMCID: PMC9903341 DOI: 10.1007/s00894-022-05373-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 09/13/2022] [Accepted: 10/31/2022] [Indexed: 12/03/2022]
Abstract
The absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are important in drug discovery as they define efficacy and safety. In this work, we applied an ensemble of features, including fingerprints and descriptors, and a tree-based machine learning model, extreme gradient boosting, for accurate ADMET prediction. Our model performs well in the Therapeutics Data Commons ADMET benchmark group. For 22 tasks, our model is ranked first in 18 tasks and top 3 in 21 tasks. The trained machine learning models are integrated in ADMETboost, a web server that is publicly available at https://ai-druglab.smu.edu/admet .
Affiliation(s)
- Hao Tian
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, 75205, TX, USA
- Peng Tao
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, 75205, TX, USA
28
Hormazabal RS, Kang JW, Park K, Yang DR. Not from Scratch: Predicting Thermophysical Properties through Model-Based Transfer Learning Using Graph Convolutional Networks. J Chem Inf Model 2022; 62:5411-5424. [PMID: 36315416 DOI: 10.1021/acs.jcim.2c00846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
In this study, a framework for the prediction of thermophysical properties based on transfer learning from existing estimation models is explored. The predictive capabilities of conventional group-contribution methods and traditional machine-learning approaches rely heavily on the availability of experimental datasets and their uncertainty. Through the use of a pretraining scheme, which leverages the knowledge established by other estimation methods, improved prediction models for thermophysical properties can be obtained after fine-tuning networks with more accurate experimental data. As our experiments show, for the case of critical properties of compounds, this pipeline not only improves the performance of the models on commonly found organic structures but can also help these models generalize to less explored areas of chemical space, where experimental data is scarce, such as inorganics and heavier organic compounds. Transfer learning from estimation models data also allows for graph-based deep learning models to create more flexible molecular features over a bigger chemical space, which leads to improved predictive capabilities and can give insights into the relationship between molecular structures and thermophysical properties. The generated molecular features can discriminate behavior discrepancy between isomers without the need of additional parameters. Also, this approach shows better robustness to outliers in experimental datasets.
Affiliation(s)
- Rodrigo S Hormazabal
  - Department of Chemical and Biological Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea
- Jeong Won Kang
  - Department of Chemical and Biological Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea
- Kiho Park
  - School of Chemical Engineering, Chonnam National University, 77 Yongbong-ro, Buk-gu, Gwangju 61186, Republic of Korea
- Dae Ryook Yang
  - Department of Chemical and Biological Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea
29
Reiser P, Neubert M, Eberhard A, Torresi L, Zhou C, Shao C, Metni H, van Hoesel C, Schopmans H, Sommer T, Friederich P. Graph neural networks for materials science and chemistry. Communications Materials 2022; 3:93. [PMID: 36468086 PMCID: PMC9702700 DOI: 10.1038/s43246-022-00315-6] [Citation(s) in RCA: 132] [Impact Index Per Article: 44.0] [Received: 05/10/2022] [Accepted: 11/07/2022] [Indexed: 05/14/2023]
Abstract
Machine learning plays an increasingly important role in many areas of chemistry and materials science, being used to predict materials properties, accelerate simulations, design new structures, and predict synthesis routes of new materials. Graph neural networks (GNNs) are one of the fastest growing classes of machine learning models. They are of particular relevance for chemistry and materials science, as they directly work on a graph or structural representation of molecules and materials and therefore have full access to all relevant information required to characterize materials. In this Review, we provide an overview of the basic principles of GNNs, widely used datasets, and state-of-the-art architectures, followed by a discussion of a wide range of recent applications of GNNs in chemistry and materials science, and concluding with a road-map for the further development and application of GNNs.
Affiliation(s)
- Patrick Reiser
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
  - Institute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
- Marlen Neubert
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- André Eberhard
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Luca Torresi
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Chen Zhou
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
- Chen Shao
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
  - Present Address: Institute for Applied Informatics and Formal Description Systems, Karlsruhe Institute of Technology, Kaiserstr. 89, 76133 Karlsruhe, Germany
- Houssam Metni
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
  - ECPM, Université de Strasbourg, 25 Rue Becquerel, 67087 Strasbourg, France
- Clint van Hoesel
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
  - Department of Applied Physics, Eindhoven University of Technology, Groene Loper 19, 5612 AP Eindhoven, The Netherlands
- Henrik Schopmans
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
  - Institute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
- Timo Sommer
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
  - Institute for Theory of Condensed Matter, Karlsruhe Institute of Technology, Wolfgang-Gaede-Str. 1, 76131 Karlsruhe, Germany
  - Present Address: School of Chemistry, Trinity College Dublin, College Green, Dublin 2, Ireland
- Pascal Friederich
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Am Fasanengarten 5, 76131 Karlsruhe, Germany
  - Institute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
30
Yang L, Chen P, He K, Wang R, Chen G, Shan G, Zhu L. Predicting bioconcentration factor and estrogen receptor bioactivity of bisphenol A and its analogues in adult zebrafish by directed message passing neural networks. Environment International 2022; 169:107536. [PMID: 36152365 DOI: 10.1016/j.envint.2022.107536] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/03/2022] [Revised: 08/23/2022] [Accepted: 09/19/2022] [Indexed: 06/16/2023]
Abstract
The bioconcentration factor (BCF) is a key parameter for bioavailability assessment of environmental pollutants in regulatory frameworks. The comparative toxicology and mechanism of action of congeners are also of concern. However, acquiring these data through field and laboratory experiments has limitations, while machine learning is emerging as a promising predictive tool to fill the gap. In this study, the Directed Message Passing Neural Network (DMPNN) was applied to predict logBCFs of bisphenol A (BPA) and its four analogues (bisphenol AF (BPAF), bisphenol B (BPB), bisphenol F (BPF) and bisphenol S (BPS)). For the test set, the Pearson correlation coefficient (PCC) and mean square error (MSE) were 0.85 and 0.52, respectively, suggesting good predictive performance. The logBCF values predicted by the DMPNN, ranging from 0.35 (BPS) to 2.14 (BPAF), coincided well with those from the classical EPI Suite (BCFBAF model). In addition, estrogen receptor α (ERα) bioactivity of these bisphenols was also predicted well by the DMPNN, with a probability of 97.0% (BPB) to 99.7% (BPAF), which was validated by the extent of vitellogenin (VTG) induction in male zebrafish as a biomarker, except for BPS. Thus, with little need for expert knowledge, DMPNN is confirmed to be a useful tool to accurately predict logBCF and screen for estrogenic activity from molecular structures. Moreover, a gender difference was noted in the changes of three endpoints (logBCF, ER binding affinity and VTG levels), the rank order of which was BPAF > BPB > BPA > BPF > BPS consistently, and abnormal amino acid metabolism is featured as an omics signature of abnormal hormone protein expression.
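For readers unfamiliar with the two test-set metrics quoted in this abstract, the following is a minimal sketch of how a Pearson correlation coefficient and a mean square error are typically computed from paired observed and predicted values. The function names and the toy numbers are illustrative assumptions and are not taken from the cited study.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_correlation(y_true, y_pred):
    """Covariance of the two series divided by the product of their spreads."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Hypothetical observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
pcc = pearson_correlation(observed, predicted)
mse = mean_squared_error(observed, predicted)
```

The toy data show why both metrics are usually reported together: a prediction can be perfectly correlated with the observations and still carry a large squared error when it is systematically scaled or offset.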
Affiliation(s)
- Liping Yang
  - Key Laboratory of Pollution Processes and Environmental Criteria, Ministry of Education, Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Pengyu Chen
  - Key Laboratory of Pollution Processes and Environmental Criteria, Ministry of Education, Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China; College of Oceanography, Hohai University, Nanjing 210098, China
- Keyan He
  - Key Laboratory of Pollution Processes and Environmental Criteria, Ministry of Education, Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Ruihan Wang
  - College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, China
- Geng Chen
  - School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 330106, China
- Guoqiang Shan
  - Key Laboratory of Pollution Processes and Environmental Criteria, Ministry of Education, Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
- Lingyan Zhu
  - Key Laboratory of Pollution Processes and Environmental Criteria, Ministry of Education, Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
31
Clipman SJ, Mehta SH, Mohapatra S, Srikrishnan AK, Zook KJC, Duggal P, Saravanan S, Nandagopal P, Kumar MS, Lucas GM, Latkin CA, Solomon SS. Deep learning and social network analysis elucidate drivers of HIV transmission in a high-incidence cohort of people who inject drugs. Science Advances 2022; 8:eabf0158. [PMID: 36260674 PMCID: PMC9581475 DOI: 10.1126/sciadv.abf0158] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/21/2022] [Accepted: 08/31/2022] [Indexed: 06/16/2023]
Abstract
Globally, people who inject drugs (PWID) experience some of the fastest-growing HIV epidemics. Network-based approaches represent a powerful tool for understanding and combating these epidemics; however, detailed social network studies are limited and pose analytical challenges. We collected longitudinal social (injection partners) and spatial (injection venues) network information from 2512 PWID in New Delhi, India. We leveraged network analysis and graph neural networks (GNNs) to uncover factors associated with HIV transmission and identify optimal intervention delivery points. Longitudinal HIV incidence was 21.3 per 100 person-years. Overlapping community detection using GNNs revealed seven communities, with HIV incidence concentrated within one community. The injection venue most strongly associated with incidence was found to overlap six of the seven communities, suggesting that an intervention deployed at this one location could reach the majority of the sample. These findings highlight the utility of network analysis and deep learning in HIV program design.
Affiliation(s)
- Steven J. Clipman
  - Division of Infectious Diseases, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Shruti H. Mehta
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Shobha Mohapatra
  - YR Gaitonde Centre for AIDS Research and Education (YRGCARE), Chennai, India
- Katie J. C. Zook
  - Division of Infectious Diseases, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Priya Duggal
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Shanmugam Saravanan
  - YR Gaitonde Centre for AIDS Research and Education (YRGCARE), Chennai, India
- Gregory M. Lucas
  - Division of Infectious Diseases, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Carl A. Latkin
  - Department of Health, Behavior and Society, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Sunil S. Solomon
  - Division of Infectious Diseases, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
32
A pocket-based 3D molecule generative model fueled by experimental electron density. Sci Rep 2022; 12:15100. [PMID: 36068257 PMCID: PMC9448726 DOI: 10.1038/s41598-022-19363-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/17/2022] [Accepted: 08/29/2022] [Indexed: 11/08/2022] Open
Abstract
We report for the first time the use of experimental electron density (ED) as training data for the generation of drug-like three-dimensional molecules based on the structure of a target protein pocket. Similar to a structural biologist building molecules based on their ED, our model functions with two main components: a generative adversarial network (GAN) to generate the ligand ED in the input pocket and an ED interpretation module for molecule generation. The model was tested on three targets: a kinase (hematopoietic progenitor kinase 1), protease (SARS-CoV-2 main protease), and nuclear receptor (vitamin D receptor), and evaluated with a reference dataset composed of over 8000 compounds that have their activities reported in the literature. The evaluation considered the chemical validity, chemical space distribution-based diversity, and similarity with reference active compounds concerning the molecular structure and pocket-binding mode. Our model can generate molecules with similar structures to classical active compounds and novel compounds sharing similar binding modes with active compounds, making it a promising tool for library generation supporting high-throughput virtual screening. The ligand ED generated can also be used to support fragment-based drug design. Our model is available as an online service to academic users via https://edmg.stonewise.cn/#/create .
33
Design, synthesis, and biological evaluation of pyrrolopyrimidine derivatives as novel Bruton's tyrosine kinase (BTK) inhibitors. Eur J Med Chem 2022; 241:114611. [DOI: 10.1016/j.ejmech.2022.114611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 07/09/2022] [Accepted: 07/10/2022] [Indexed: 11/22/2022]
34
Mayr F, Wieder M, Wieder O, Langer T. Improving Small Molecule pKa Prediction Using Transfer Learning With Graph Neural Networks. Front Chem 2022; 10:866585. [PMID: 35721000 PMCID: PMC9204323 DOI: 10.3389/fchem.2022.866585] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/31/2022] [Accepted: 04/04/2022] [Indexed: 11/13/2022] Open
Abstract
Enumerating protonation states and calculating microstate pKa values of small molecules is an important yet challenging task for lead optimization and molecular modeling. Commercial and non-commercial solutions have notable limitations such as restrictive and expensive licenses, high CPU/GPU hour requirements, or the need for expert knowledge to set up and use. We present a graph neural network model that is trained on 714,906 calculated microstate pKa predictions from molecules obtained from the ChEMBL database. The model is fine-tuned on a set of 5,994 experimental pKa values, significantly improving its performance on two challenging test sets. Combining the graph neural network model with Dimorphite-DL, an open-source program for enumerating ionization states, we have developed the open-source Python package pkasolver, which is able to generate and enumerate protonation states and calculate pKa values with high accuracy.
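The pretrain-then-fine-tune pattern described in this abstract can be illustrated with a deliberately small sketch. It assumes a one-variable linear model trained by gradient descent as a stand-in for the graph neural network, and synthetic data as a stand-in for the calculated and experimental sets; none of the names or numbers below come from pkasolver or the cited paper.

```python
def fit_linear(xs, ys, w=0.0, b=0.0, lr=0.05, steps=500):
    """Gradient descent on mean squared error for the model y = w * x + b."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the linear model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical large "calculated" set: plentiful labels with a systematic offset.
calc_x = [i / 50 for i in range(100)]
calc_y = [2.0 * x + 1.0 for x in calc_x]

# Hypothetical small "experimental" set: few labels, but the ones we trust.
exp_x = [0.0, 1.0, 2.0]
exp_y = [0.5, 2.5, 4.5]

# Stage 1: pretrain on the large calculated set.
w_pre, b_pre = fit_linear(calc_x, calc_y)
# Stage 2: fine-tune on the small experimental set, starting from stage 1.
w_ft, b_ft = fit_linear(exp_x, exp_y, w=w_pre, b=b_pre, steps=50)

mse_pre = mse(exp_x, exp_y, w_pre, b_pre)
mse_ft = mse(exp_x, exp_y, w_ft, b_ft)
```

In this toy setting the pretrained parameters already capture the slope shared by both datasets, so a short fine-tuning run mainly corrects the offset and lowers the error on the experimental points.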
|
35
|
Lou C, Yang H, Wang J, Huang M, Li W, Liu G, Lee PW, Tang Y. IDL-PPBopt: A Strategy for Prediction and Optimization of Human Plasma Protein Binding of Compounds via an Interpretable Deep Learning Method. J Chem Inf Model 2022; 62:2788-2799. [PMID: 35607907 DOI: 10.1021/acs.jcim.2c00297] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 01/29/2023]
Abstract
The prediction and optimization of pharmacokinetic properties are essential in lead optimization. Traditional strategies mainly depend on the empirical chemical rules from medicinal chemists. However, with the rising amount of data, it is getting more difficult to manually extract useful medicinal chemistry knowledge. To this end, we introduced IDL-PPBopt, a computational strategy for predicting and optimizing the plasma protein binding (PPB) property based on an interpretable deep learning method. At first, a curated PPB data set was used to construct an interpretable deep learning model, which showed excellent predictive performance with a root mean squared error of 0.112 for the entire test set. Then, we designed a detection protocol based on the model and Wilcoxon test to identify the PPB-related substructures (named privileged substructures, PSubs) for each molecule. In total, 22 general privileged substructures (GPSubs) were identified, which shared some common features such as nitrogen-containing groups, diamines with two carbon units, and azetidine. Furthermore, a series of second-level chemical rules for each GPSub were derived through a statistical test and then summarized into substructure pairs. We demonstrated that these substructure pairs were equally applicable outside the training set and accordingly customized the structural modification schemes for each GPSub, which provided alternatives for the optimization of the PPB property. Therefore, IDL-PPBopt provides a promising scheme for the prediction and optimization of the PPB property and would be helpful for lead optimization of other pharmacokinetic properties.
Affiliation(s)
- Chaofeng Lou
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Hongbin Yang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Jiye Wang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Mengting Huang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Weihua Li
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Guixia Liu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Philip W Lee
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Yun Tang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| |
|
36
|
Rodríguez-Pérez R, Miljković F, Bajorath J. Machine Learning in Chemoinformatics and Medicinal Chemistry. Annu Rev Biomed Data Sci 2022; 5:43-65. [PMID: 35440144 DOI: 10.1146/annurev-biodatasci-122120-124216] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/09/2022]
Abstract
In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have put machine learning onto a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany. Current affiliation: Novartis Institutes for Biomedical Research, Novartis Campus, Basel, Switzerland
| | - Filip Miljković
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany. Current affiliation: Data Science and AI, Imaging and Data Analytics, Clinical Pharmacology and Safety Sciences, R&D AstraZeneca, Gothenburg, Sweden
| | - Jürgen Bajorath
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
| |
|
37
|
Gu Y, Zheng S, Xu Z, Yin Q, Li L, Li J. An efficient curriculum learning-based strategy for molecular graph learning. Brief Bioinform 2022; 23:6562682. [DOI: 10.1093/bib/bbac099] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/06/2021] [Revised: 01/18/2022] [Accepted: 02/27/2022] [Indexed: 12/14/2022] Open
Abstract
Computational methods have been widely applied to resolve various core issues in drug discovery, such as molecular property prediction. In recent years, a data-driven computational method, deep learning, has achieved a number of impressive successes in various domains. In drug discovery, graph neural networks (GNNs) take molecular graph data as input and learn graph-level representations in non-Euclidean space. A large number of well-performing GNNs have been proposed for molecular graph learning. However, the efficient use of molecular data during the training process has not received enough attention. Curriculum learning (CL) has been proposed as a training strategy that rearranges the training queue based on the calculated difficulty of each sample, yet the effectiveness of CL has not been determined in molecular graph learning. In this study, inspired by chemical domain knowledge and task prior information, we proposed a novel CL-based training strategy, called CurrMG, to improve the training efficiency of molecular graph learning. Consisting of a difficulty measurer and a training scheduler, CurrMG is designed as a plug-and-play module, which is model-independent and easy to use on molecular data. Extensive experiments demonstrated that molecular graph learning models could benefit from CurrMG and gain noticeable improvement on five GNN models and eight molecular property prediction tasks (overall improvement is 4.08%). We further observed CurrMG’s encouraging potential in resource-constrained molecular property prediction. These results indicate that CurrMG can be used as a reliable and efficient training strategy for molecular graph learning.
Availability: The source code is available at https://github.com/gu-yaowen/CurrMG.
Affiliation(s)
- Yaowen Gu
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
| | - Si Zheng
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
- Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
| | - Zidu Xu
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
| | - Qijin Yin
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Liang Li
- Key Laboratory of Antibiotic Bioengineering of National Health and Family Planning Commission (NHFPC), Institute of Medicinal Biotechnology (IMB), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
| | - Jiao Li
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
| |
|
38
|
Keshavarzi Arshadi A, Salem M, Firouzbakht A, Yuan JS. MolData, a molecular benchmark for disease and target based machine learning. J Cheminform 2022; 14:10. [PMID: 35255958 PMCID: PMC8899453 DOI: 10.1186/s13321-022-00590-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/26/2021] [Accepted: 02/13/2022] [Indexed: 12/25/2022] Open
Abstract
Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, infusing both the capabilities of learning abstract features and discovering complex molecular patterns via learning from molecular data. Since biological and chemical knowledge are necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. The existing depositories such as PubChem or ChEMBL offer the screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions which can hinder their usage by the machine learning community. In this work, a comprehensive disease and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files.
|
39
|
Nakarin F, Boonpalit K, Kinchagawat J, Wachiraphan P, Rungrotmongkol T, Nutanong S. Assisting Multitargeted Ligand Affinity Prediction of Receptor Tyrosine Kinases Associated Nonsmall Cell Lung Cancer Treatment with Multitasking Principal Neighborhood Aggregation. Molecules 2022; 27:molecules27041226. [PMID: 35209011 PMCID: PMC8878292 DOI: 10.3390/molecules27041226] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/24/2021] [Revised: 01/30/2022] [Accepted: 01/31/2022] [Indexed: 11/16/2022] Open
Abstract
A multitargeted therapeutic approach with hybrid drugs is a promising strategy to enhance anticancer efficiency and overcome drug resistance in nonsmall cell lung cancer (NSCLC) treatment. Estimating the affinities of small molecules against targets of interest is typically a preliminary step in drug discovery in the pharmaceutical industry. In this investigation, we employed machine learning models to provide a computationally affordable means for computer-aided screening to accelerate the discovery of potential drug compounds. In particular, we introduced a quantitative structure–activity-relationship (QSAR)-based multitask learning model to facilitate an in silico screening system of multitargeted drug development. Our method combines a recently developed graph-based neural network architecture, principal neighborhood aggregation (PNA), with a descriptor-based deep neural network supporting synergistic utilization of molecular graph and fingerprint features. The model was built from more than ten thousand ligands with reported affinities for seven crucial receptor tyrosine kinases in NSCLC, drawn from two public data sources. As a result, our multitask model demonstrated better performance than all other benchmark models, as well as achieving satisfying predictive ability regarding applicable QSAR criteria for most tasks within the model’s applicability. Since our model could potentially be a screening tool for practical use, we have provided a freely accessible model implementation platform with a tutorial, offering a first move in the long journey of cancer drug development.
Affiliation(s)
- Fahsai Nakarin
- School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand
- Correspondence: Tel.: +66-33-014-444
| | - Kajjana Boonpalit
- School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand
| | - Jiramet Kinchagawat
- School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand
| | - Patcharapol Wachiraphan
- School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand
| | - Thanyada Rungrotmongkol
- Center of Excellence in Biocatalyst and Sustainable Biotechnology, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
- Program in Bioinformatics and Computational Biology, Graduate School, Chulalongkorn University, Bangkok 10330, Thailand
| | - Sarana Nutanong
- School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand
| |
|
40
|
Alves LA, Ferreira NCDS, Maricato V, Alberto AVP, Dias EA, Jose Aguiar Coelho N. Graph Neural Networks as a Potential Tool in Improving Virtual Screening Programs. Front Chem 2022; 9:787194. [PMID: 35127645 PMCID: PMC8811035 DOI: 10.3389/fchem.2021.787194] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/18/2021] [Accepted: 12/10/2021] [Indexed: 11/23/2022] Open
Abstract
Despite the increasing number of pharmaceutical companies, university laboratories and funding, less than one percent of initially researched drugs enter the commercial market. In this context, virtual screening (VS) has gained much attention due to several advantages, including time savings, reduced reagent and consumable costs and the performance of selective analyses regarding the affinity between test molecules and pharmacological targets. Currently, VS is based mainly on algorithms that apply physical and chemical principles and quantum mechanics to estimate molecule affinities and conformations, among others. Nevertheless, VS has not reached the expected results concerning the improvement of market-approved drugs, with fewer than twenty drugs having reached this goal to date. In this context, graph neural networks (GNN), a recent deep-learning subtype, may be a powerful tool to improve VS results concerning natural products, used either together with standard algorithms or in isolation. This review discusses the pros and cons of GNN applied to VS and the future perspectives of this learnable algorithm, which may revolutionize drug discovery if certain obstacles concerning spatial coordinates and adequate datasets, among others, can be overcome.
Affiliation(s)
- Luiz Anastacio Alves
- Laboratory of Cellular Communication, Oswaldo Cruz Institute – Fiocruz, Rio de Janeiro, Brazil
| | | | - Victor Maricato
- Laboratory of Cellular Communication, Oswaldo Cruz Institute – Fiocruz, Rio de Janeiro, Brazil
| | | | - Evellyn Araujo Dias
- Laboratory of Cellular Communication, Oswaldo Cruz Institute – Fiocruz, Rio de Janeiro, Brazil
| | - Nt Jose Aguiar Coelho
- National Institute of Industrial Property - INPI and Veiga de Almeida University - UVA, Rio de Janeiro, Brazil
| |
|