1
Chen L, Li B, Chen Y, Lin M, Zhang S, Li C, Pang Y, Wang L. ADCNet: a unified framework for predicting the activity of antibody-drug conjugates. Brief Bioinform 2025; 26:bbaf228. [PMID: 40421657 DOI: 10.1093/bib/bbaf228] [Received: 12/10/2024] [Revised: 02/25/2025] [Accepted: 04/28/2025] [Indexed: 05/28/2025]
Abstract
Antibody-drug conjugates (ADCs) have revolutionized cancer treatment in the era of precision medicine because they can precisely target cancer cells and release highly effective drugs. Nevertheless, the rational design and discovery of ADCs remain challenging because the relationship between their quintuple structures and activities is difficult to explore and understand. To address this issue, we introduce a unified deep learning framework called ADCNet to explore this relationship and help design potential ADCs. ADCNet integrates the protein language model ESM-2 and the small-molecule language model FG-BERT (functional group-based bidirectional encoder representations from transformers) to predict activity by learning meaningful features from the antigen and antibody protein sequences of an ADC, the SMILES strings of its linker and payload, and its drug-antibody ratio (DAR) value. Based on a carefully designed and manually tailored ADC data set, extensive evaluation results reveal that ADCNet performs best on the test set compared to baseline machine learning models across all evaluation metrics. For example, it achieves an average prediction accuracy of 87.12%, a balanced accuracy of 0.8689, and an area under the receiver operating characteristic curve of 0.9293 on the test set. In addition, cross-validation, ablation experiments, and external independent testing results further demonstrate the stability, advantages, and robustness of the ADCNet architecture. For the convenience of the community, we develop the first online platform (https://ADCNet.idruglab.cn) for the prediction of ADC activity based on the optimal ADCNet model, and the source code is publicly available at https://github.com/idrugLab/ADCNet.
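For readers unfamiliar with the balanced accuracy reported above: for a binary task it is the mean of the recall on the active class and the recall on the inactive class, so it is not inflated by class imbalance. The following is a minimal sketch with made-up labels; the function name and data are illustrative assumptions and are not taken from ADCNet.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical labels: 4 actives and 6 inactives (not real ADC data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)
```

Here the plain accuracy is 8/10, while the balanced accuracy averages 3/4 and 5/6, which is slightly lower because the smaller active class is weighted equally.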
Affiliation(s)
- Liye Chen
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Biaoshun Li
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Yihao Chen
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Mujie Lin
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Shipeng Zhang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Chenxin Li
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Yu Pang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Ling Wang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
2
Yadalam PK, Ardila CM. Enhanced hierarchical attention networks for predictive interactome analysis of LncRNA and CircRNA in oral herpes virus. J Oral Biol Craniofac Res 2025; 15:445-453. [PMID: 40144645 PMCID: PMC11938150 DOI: 10.1016/j.jobcr.2025.02.012] [Received: 01/20/2025] [Revised: 02/18/2025] [Accepted: 02/22/2025] [Indexed: 03/28/2025]
Abstract
BACKGROUND: Non-coding RNAs, including lncRNAs, circRNAs, and microRNAs, constitute 98% of the human transcriptome and are vital regulators of gene expression, cellular processes, and host-pathogen interactions, particularly in viral infections. This study explores lncRNA-circRNA interactions and their biological significance in oral viral infections.
METHODS: ViRBase, a database with over 820,000 interactions involving 50,000 RNAs from 116 viruses and 36 host organisms, was used to analyze herpesvirus datasets. The study employed hierarchical attention and knowledge graph embeddings to represent nodes and edges in the knowledge graph. These served as input features for a hierarchical attention model trained over 100 epochs. Model performance was evaluated based on loss calculation, optimization, and attention weight stability.
RESULTS: The model achieved a final loss of 0.000180 at Epoch 100, with stable attention weights confirming reliability. Node embedding statistics showed a mean of 0.005110 and a standard deviation of 0.013370, while attention weights had a high mean of 0.997178, emphasizing model robustness.
CONCLUSION: This study provides insights into lncRNA-circRNA interactions in herpes viral infections, enhancing therapeutic development, disease progression monitoring, and understanding host-pathogen interactions, paving the way for targeted interventions and improved outcomes.
Affiliation(s)
- Pradeep Kumar Yadalam
- Department of Periodontics, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu, India
- Carlos M. Ardila
- Department of Periodontics, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu, India
- Department of Basic Sciences, Biomedical Stomatology Research Group, Faculty of Dentistry, Universidad de Antioquia U de A, Medellín, Colombia
3
Pala MA. Graph-Aware AURALSTM: An Attentive Unified Representation Architecture with BiLSTM for Enhanced Molecular Property Prediction. Mol Divers 2025:10.1007/s11030-025-11197-4. [PMID: 40279083 DOI: 10.1007/s11030-025-11197-4] [Received: 03/15/2025] [Accepted: 04/12/2025] [Indexed: 04/26/2025]
Abstract
Predicting molecular properties with high accuracy is essential across scientific fields, from drug discovery and biotechnology to materials science and environmental research. In biomedical sciences, accurate molecular property prediction is crucial for elucidating disease mechanisms, identifying potential drug candidates, and optimising various processes. However, existing approaches, often based on low-dimensional representations, fail to capture the intricate spatial and structural complexities of molecular data. This study introduces a novel hybrid deep learning model, the Graph-Aware AURA-LSTM (Attentive Unified Representation Architecture-Long Short-Term Memory), designed to determine molecular properties with unprecedented accuracy using advanced graph representations. AURA-LSTM combines multiple Graph Neural Network (GNN) architectures, specifically Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Isomorphism Networks (GINs), in a parallel structure to comprehensively capture the multidimensional structural features of molecules. Within this architecture, GCNs incorporate local structural relationships, GATs apply attention mechanisms to highlight critical structural elements, and GINs capture intricate molecular details through isomorphic distinction, resulting in a richly detailed feature matrix. A BiLSTM layer then processes this feature matrix, evaluating temporal relationships to enhance molecular feature classification. Evaluated on eight benchmark datasets, AURA-LSTM demonstrated superior performance, consistently achieving over 90% accuracy and outperforming state-of-the-art methods. These results position AURA-LSTM as a robust tool for molecular feature classification, uniquely capable of integrating temporally aware insights from distinct GNN architectures.
Affiliation(s)
- Muhammed Ali Pala
- Department of Electrical and Electronics Engineering, Faculty of Technology, Sakarya University of Applied Sciences, 54050, Sakarya, Turkey.
- Biomedical Technologies Application and Research Center (BIYOTAM), Sakarya University of Applied Sciences, Sakarya, Turkey.
4
Gong X, Liu Q, Han R, Guo Y, Wang G. MIFS: An adaptive multipath information fused self-supervised framework for drug discovery. Neural Netw 2025; 184:107088. [PMID: 39778297 DOI: 10.1016/j.neunet.2024.107088] [Received: 07/09/2024] [Revised: 12/13/2024] [Accepted: 12/21/2024] [Indexed: 01/11/2025]
Abstract
The production of expressive molecular representations with scarce labeled data is challenging for AI-driven drug discovery. Mainstream studies often follow a pipeline that pre-trains a specific molecular encoder and then fine-tunes it. However, the significant challenges of these methods are (1) neglecting the propagation of diverse information within molecules and (2) the absence of knowledge and chemical constraints in the pre-training strategy. In this study, we propose an adaptive multipath information fused self-supervised framework (MIFS) that explores molecular representations from large-scale unlabeled data to aid drug discovery. In MIFS, we innovatively design a dedicated molecular graph encoder called Mol-EN, which implements three pathways of information propagation: atom-to-atom, chemical bond-to-atom, and group-to-atom, to comprehensively perceive and capture abundant semantic information. Furthermore, a novel adaptive pre-training strategy based on molecular scaffolds is devised to pre-train Mol-EN on 11 million unlabeled molecules. It optimizes Mol-EN by constructing a topological contrastive loss to provide additional chemical insights into molecular structures. Subsequently, the pre-trained Mol-EN is fine-tuned on 14 widespread drug discovery benchmark datasets, including molecular properties prediction, drug-target interactions, and drug-drug interactions. Notably, to further enhance chemical knowledge, we introduce an elemental knowledge graph (ElementKG) in the fine-tuning phase. Extensive experiments show that MIFS achieves competitive performance while providing plausible explanations for predictions from a chemical perspective.
Affiliation(s)
- Xu Gong
- Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
- Qun Liu
- Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
- Rui Han
- Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
- Yike Guo
- Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, 999077, Hong Kong, China.
- Guoyin Wang
- Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China; College of Computer and Information Science, Chongqing Normal University, Chongqing, 401331, China.
5
Zhang Y, Huang J, Li X, Sun W, Zhang N, Zhang J, Chen T, Wang L. Self-awareness of retrosynthesis via chemically inspired contrastive learning for reinforced molecule generation. Brief Bioinform 2025; 26:bbaf185. [PMID: 40254835 PMCID: PMC12009711 DOI: 10.1093/bib/bbaf185] [Received: 11/29/2024] [Revised: 03/19/2025] [Accepted: 03/30/2025] [Indexed: 04/22/2025]
Abstract
The recent progress of deep generative models in modeling complex real-world data distributions has enabled the generation of novel compounds with potential therapeutic applications for various diseases. However, most studies fail to optimize the properties of generated molecules from the perspective of the intrinsic nature of chemical reactions. In this work, we propose a novel molecule generation model that overcomes this limitation through deep reinforcement learning, in which an agent learns to optimize the properties of molecules and is initialized with a chemically inspired contrastive pretrained model. We assess the generation model by evaluating its ability to generate inhibitors against two prominent therapeutic targets in cancer treatment. Experimental results show that our model generates 100% valid and novel structures and outperforms several baselines in generating molecules with fewer structural alerts. More importantly, the molecules generated by our proposed model show potent biological activities against ataxia telangiectasia and Rad3-related (ATR) and cyclin-dependent kinase 9 (CDK9) targets in wet-lab experiments.
Affiliation(s)
- Yi Zhang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Jindi Huang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Xinze Li
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
- Wenqi Sun
- Guizhou Provincial Engineering Technology Research Center for Chemical Drug R&D, College of Pharmacy, Guizhou Medical University, No. 6 Ankang Avenue, Guian New District, Guiyang 561113, China
- Nana Zhang
- Guizhou Provincial Engineering Technology Research Center for Chemical Drug R&D, College of Pharmacy, Guizhou Medical University, No. 6 Ankang Avenue, Guian New District, Guiyang 561113, China
- Jiquan Zhang
- Guizhou Provincial Engineering Technology Research Center for Chemical Drug R&D, College of Pharmacy, Guizhou Medical University, No. 6 Ankang Avenue, Guian New District, Guiyang 561113, China
- Tiegen Chen
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan Life Science Park, No. 10 Heqing Road, Tsui Hang New District, Zhongshan 528400, China
- Ling Wang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, No. 382 Waihuan East Road, Higher Education Mega Center, Guangzhou 510006, China
6
Pereira TO, Abbasi M, Arrais JP. ABIET: An explainable transformer for identifying functional groups in biological active molecules. Comput Biol Med 2025; 187:109740. [PMID: 39894011 DOI: 10.1016/j.compbiomed.2025.109740] [Received: 09/27/2024] [Revised: 12/18/2024] [Accepted: 01/21/2025] [Indexed: 02/04/2025]
Abstract
Recent advancements in deep learning have revolutionized the field of drug discovery, with Transformer-based models emerging as powerful tools for molecular design and property prediction. However, the lack of explainability in such models remains a significant challenge. In this study, we introduce ABIET (Attention-Based Importance Estimation Tool), an explainable Transformer model designed to identify the most critical regions for drug-target interactions - functional groups (FGs) - in biologically active molecules. Functional groups play a pivotal role in determining chemical behavior and biological interactions. Our approach leverages attention weights from Transformer-encoder architectures trained on SMILES representations to assess the relative importance of molecular subregions. By processing attention scores using a specific strategy - considering bidirectional interactions, layer-based extraction, and activation transformations - we effectively distinguish FGs from non-FG atoms. Experimental validation on diverse datasets targeting pharmacological receptors, including VEGFR2, AA2A, GSK3, JNK3, and DRD2, demonstrates the model's robustness and interpretability. Comparative analysis with state-of-the-art gradient-based and perturbation-based methods confirms ABIET's superior performance, with functional groups receiving statistically higher importance scores. This work enhances the transparency of Transformer predictions, providing critical insights for molecular design, structure-activity analysis, and targeted drug development.
Affiliation(s)
- Tiago O Pereira
- Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Univ Coimbra, Coimbra, Portugal.
- Maryam Abbasi
- Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Univ Coimbra, Coimbra, Portugal; Applied Research Institute, Polytechnic Institute of Coimbra, Coimbra, Portugal; Research Centre for Natural Resources Environment and Society, Polytechnic Institute of Coimbra, Coimbra, Portugal
- Joel P Arrais
- Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Univ Coimbra, Coimbra, Portugal
7
Kianfar A, Razzaghi P, Asgari Z. Integrating convolutional layers and biformer network with forward-forward and backpropagation training. Sci Rep 2025; 15:7230. [PMID: 40021838 PMCID: PMC11871031 DOI: 10.1038/s41598-025-92218-y] [Received: 01/05/2025] [Accepted: 02/26/2025] [Indexed: 03/03/2025]
Abstract
Accurate molecular property prediction is crucial for drug discovery and computational chemistry, facilitating the identification of promising compounds and accelerating therapeutic development. Traditional machine learning falters with high-dimensional data and manual feature engineering, while existing deep learning approaches may not capture complex molecular structures, leaving a research gap. We introduce Deep-CBN, a novel framework designed to enhance molecular property prediction by capturing intricate molecular representations directly from raw data, thus improving accuracy and efficiency. Our methodology combines convolutional neural networks (CNNs) with a BiFormer attention mechanism, employing both the forward-forward algorithm and backpropagation. The model operates in three stages: (1) feature learning, extracting local features from SMILES strings using CNNs; (2) attention refinement, capturing global context with a BiFormer module enhanced by the forward-forward algorithm; and (3) prediction subnetwork tuning, fine-tuning via backpropagation. Evaluations on benchmark datasets, including Tox21, BBBP, SIDER, ClinTox, BACE, HIV, and MUV, show that Deep-CBN achieves near-perfect ROC-AUC scores, significantly outperforming state-of-the-art methods. These findings demonstrate its effectiveness in capturing complex molecular patterns, offering a robust tool to accelerate drug discovery processes.
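The ROC-AUC score used to evaluate such classifiers can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. A minimal sketch of that pairwise definition follows; the function name, labels, and scores are illustrative assumptions and are not Deep-CBN outputs.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for six molecules.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, scores)
```

In this toy case 8 of the 9 positive-negative pairs are ordered correctly. The quadratic pair loop is chosen for clarity; library implementations typically use a sort-based computation instead.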
Affiliation(s)
- Ali Kianfar
- Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
- Parvin Razzaghi
- Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran.
- Zahra Asgari
- School of Life Science Engineering, College of Interdisciplinary Science and Technology, University of Tehran, Tehran, Iran
8
Lin B, Yan S, Zhen B. A machine learning method for predicting molecular antimicrobial activity. Sci Rep 2025; 15:6559. [PMID: 39994442 PMCID: PMC11850884 DOI: 10.1038/s41598-025-91190-x] [Received: 05/09/2024] [Accepted: 02/18/2025] [Indexed: 02/26/2025]
Abstract
In response to the increasing concern over antibiotic resistance and the limitations of traditional methods in antibiotic discovery, we introduce a machine learning-based method named MFAGCN. This method predicts the antimicrobial efficacy of molecules by integrating three types of molecular fingerprints (MACCS, PubChem, and ECFP) along with molecular graph representations as input features, with a specific focus on molecular functional groups. MFAGCN incorporates an attention mechanism that assigns different weights to the information from different neighboring nodes. Comparative experiments with baseline models on two public datasets demonstrate MFAGCN's superior performance. Additionally, we analyzed the functional group distribution in both the training and test sets to validate the model's predictions. Furthermore, we performed structural similarity analyses with known antibiotics to prevent the rediscovery of established antibiotics. This approach enables researchers to rapidly screen molecules with potent antimicrobial properties and facilitates the identification of functional groups that influence antimicrobial performance, providing valuable insights for further antibiotic development.
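The idea of weighting neighboring nodes by attention can be illustrated with a small sketch: each neighbor gets a score, the scores are normalized with a softmax, and the neighbor features are summed with those weights. The dot-product scoring and all names below are assumptions made for illustration; the abstract does not specify MFAGCN's actual scoring function.

```python
import math


def attention_aggregate(node, neighbors):
    """Weight each neighbor by softmax(dot(node, neighbor)) and sum the features."""
    scores = [sum(a * b for a, b in zip(node, nb)) for nb in neighbors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node)
    aggregated = [
        sum(w * nb[i] for w, nb in zip(weights, neighbors)) for i in range(dim)
    ]
    return weights, aggregated


# Hypothetical 2-D atom features: one central node and three neighbors.
node = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, aggregated = attention_aggregate(node, neighbors)
```

The neighbor most similar to the central node receives the largest weight, and the weights always sum to one, so the aggregated vector is a convex combination of the neighbor features.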
Affiliation(s)
- Bangjiang Lin
- Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou, 362216, China.
- College of Electrical Engineering and Automation, Fuzhou University, Fuzhou, 350108, China.
- Shujie Yan
- Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou, 362216, China
- College of Electrical Engineering and Automation, Fuzhou University, Fuzhou, 350108, China
- Bowen Zhen
- Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou, 362216, China
- College of Electrical Engineering and Automation, Fuzhou University, Fuzhou, 350108, China
9
Park J, Han M, Lee K, Park S. Hierarchical Graph Attention Network with Positive and Negative Attentions for Improved Interpretability: ISA-PN. J Chem Inf Model 2025; 65:1115-1127. [PMID: 39654089 DOI: 10.1021/acs.jcim.4c01035] [Indexed: 02/11/2025]
Abstract
With the advancement of deep learning (DL) methods in chemistry and materials science, the interpretability of DL models has become a critical issue in elucidating quantitative (molecular) structure-property relationships. Although attention mechanisms have been generally employed to explain the importance of molecular substructures that contribute to molecular properties, their interpretability remains limited. In this work, we introduce a versatile segmentation method and develop an interpretable subgraph attention (ISA) network with positive and negative streams (ISA-PN) to enhance the understanding of molecular structure-property relationships. The predictive performance of the ISA models was validated using data sets for aqueous solubility, lipophilicity, and melting temperature, with a particular focus on evaluating interpretability for the aqueous solubility data set. The ISA-PN model enables the quantification of the contributions of molecular substructures through positive and negative attention scores. Comparative analyses of the ISA, ISA-PN, and GC-Net (group contribution network) models demonstrate that the ISA-PN model significantly improves interpretability while maintaining similar accuracy levels. This study highlights the efficacy of the ISA-PN model in providing meaningful insights into the contributions of molecular substructures to molecular properties, thereby enhancing the interpretability of DL models in chemical applications.
Affiliation(s)
- Jinyong Park
- Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea
- Minhi Han
- Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea
- Kiwoong Lee
- Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea
- Sungnam Park
- Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea
10
Ye W, Li J, Cai X. Mfgnn: Multi-Scale Feature-Attentive Graph Neural Networks for Molecular Property Prediction. J Comput Chem 2025; 46:e70011. [PMID: 39840745 DOI: 10.1002/jcc.70011] [Received: 09/26/2024] [Revised: 12/03/2024] [Accepted: 12/09/2024] [Indexed: 01/23/2025]
Abstract
In the realm of artificial intelligence-driven drug discovery (AIDD), accurately predicting the influence of molecular structures on their properties is a critical research focus. While deep learning models based on graph neural networks (GNNs) have made significant advancements in this area, prior studies have primarily concentrated on molecule-level representations, often neglecting the impact of functional group structures and the potential relationships between fragments on molecular property predictions. To address this gap, we introduce the multi-scale feature attention graph neural network (MfGNN), which enhances traditional atom-based molecular graph representations by incorporating fragment-level representations derived from chemically synthesizable BRICS fragments. MfGNN not only effectively captures both the structural information of molecules and the features of functional groups but also pays special attention to the potential relationships between fragments, exploring how they collectively influence molecular properties. This model integrates two core mechanisms: a graph attention mechanism that captures embeddings of molecules and functional groups, and a feature extraction module that systematically processes BRICS fragment-level features to uncover relationships among the fragments. Our comprehensive experiments demonstrate that MfGNN outperforms leading machine learning and deep learning models, achieving state-of-the-art performance in 8 out of 11 learning tasks across various domains, including physical chemistry, biophysics, physiology, and toxicology. Furthermore, ablation studies reveal that the integration of multi-scale feature information and the feature extraction module enhances the richness of molecular features, thereby improving the model's predictive capabilities.
Affiliation(s)
- Weiting Ye
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, Guangdong, China
- Jingcheng Li
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, Guangdong, China
- Xianfa Cai
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, Guangdong, China
11
Nguyen LD, Nguyen QH, Trinh QH, Nguyen BP. From SMILES to Enhanced Molecular Property Prediction: A Unified Multimodal Framework with Predicted 3D Conformers and Contrastive Learning Techniques. J Chem Inf Model 2024; 64:9173-9195. [PMID: 39641280 DOI: 10.1021/acs.jcim.4c01240] [Indexed: 12/07/2024]
Abstract
We present a novel molecular property prediction framework that requires only the SMILES format as input but is designed to be multimodal by incorporating predicted 3D conformer representations. Our model captures comprehensive molecular features by leveraging both the sequential character structure of SMILES and the three-dimensional spatial structure of conformers. The framework employs contrastive learning techniques, utilizing InfoNCE loss to align SMILES and conformer embeddings, along with task-specific loss functions, such as ConR for regression and SupCon for classification. To address data imbalance, a common challenge in drug discovery, we incorporate feature distribution smoothing (FDS). We evaluated the framework through multiple case studies, including SARS-CoV-2 drug docking score prediction, molecular property prediction using MoleculeNet data sets, and kinase inhibitor prediction for JAK-1, JAK-2, and MAPK-14 using custom data sets curated from PubChem. The framework consistently outperformed state-of-the-art methods, with ConR and FDS significantly improving regression tasks and SupCon enhancing classification performance. These findings highlight the flexibility and robustness of our multimodal model, demonstrating its effectiveness across diverse molecular property prediction tasks, with promising applications in drug discovery and molecular analysis.
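To make the alignment objective concrete, here is a minimal one-directional InfoNCE sketch: for each molecule, the matching SMILES and conformer embeddings form the positive pair and every other conformer in the batch acts as a negative. The cosine similarity, the temperature of 0.1, the function names, and the toy embeddings are all assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(smiles_emb, conf_emb, temperature=0.1):
    """Mean over rows of -log softmax of the matching pair's similarity."""
    n = len(smiles_emb)
    loss = 0.0
    for i in range(n):
        logits = [cosine(smiles_emb[i], conf_emb[j]) / temperature for j in range(n)]
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denom - logits[i]  # positive pair sits at index i
    return loss / n


# Hypothetical 2-D embeddings for a batch of three molecules.
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]
loss_aligned = info_nce(aligned, aligned)
loss_shuffled = info_nce(aligned, shuffled)
```

When the two views agree, the loss is close to zero. When the conformer embeddings are paired with the wrong molecules, the loss is large, and that penalty is the signal that pulls matching embeddings together during training.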
Affiliation(s)
- Long D Nguyen
- School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi 100000, Vietnam
| | - Quang H Nguyen
- School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi 100000, Vietnam
| | - Quang H Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi 100000, Vietnam
| | - Binh P Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Kelburn Parade, Wellington 6012, New Zealand
| |
|
12
|
Zhao D, Zhang Y, Chen Y, Li B, Zhou W, Wang L. Highly Accurate and Explainable Predictions of Small-Molecule Antioxidants for Eight In Vitro Assays Simultaneously through an Alternating Multitask Learning Strategy. J Chem Inf Model 2024; 64:9098-9110. [PMID: 38888465 DOI: 10.1021/acs.jcim.4c00748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2024]
Abstract
Small-molecule antioxidants can inhibit or retard oxidation reactions and protect against free radical damage to cells, thus playing a key role in food, cosmetics, pharmaceuticals, the environment, and materials. Experimentally driven antioxidant discovery is a major paradigm, and computationally assisted antioxidant discovery is rarely reported. In this study, a functional-group-based alternating multitask self-supervised molecular representation learning method is proposed to simultaneously predict the antioxidant activities of small molecules for eight commonly used in vitro antioxidant assays. Extensive evaluation results reveal that compared with the baseline models, the multitask FG-BERT model achieves the best overall predictive performance, with the highest average F1, BA, ROC-AUC, and PRC-AUC values of 0.860, 0.880, 0.954, and 0.937 for the test sets, respectively. The Y-scrambling testing results further demonstrate that such a deep learning model was not constructed by accident and that it has reliable predictive capabilities. Additionally, the excellent interpretability of the multitask FG-BERT model makes it easy to identify key structural fragments/groups that contribute significantly to the antioxidant effect of a given molecule. Finally, an online antioxidant activity prediction platform called AOP (freely available at https://aop.idruglab.cn/) and its local version were developed based on the high-quality multitask FG-BERT model for experts and nonexperts in the field. We anticipate that it will contribute to the discovery of novel small-molecule antioxidants.
Affiliation(s)
- Duancheng Zhao
- Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Yanhong Zhang
- Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Yihao Chen
- Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Biaoshun Li
- Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Wenguang Zhou
- Central Laboratory of The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Foshan 528200, China
| | - Ling Wang
- Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| |
|
13
|
Si Z, Liu D, Nie W, Hu J, Wang C, Jiang T, Yu H, Fu Y. Data-Based Prediction of Redox Potentials via Introducing Chemical Features into the Transformer Architecture. J Chem Inf Model 2024; 64:8453-8463. [PMID: 39513760 DOI: 10.1021/acs.jcim.4c01299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2024]
Abstract
Rapid and accurate prediction of basic physicochemical parameters of molecules will greatly accelerate the target-oriented design of novel reactions and materials but has long been challenging. Herein, a chemical language model-based deep learning method, TransChem, has been developed for the prediction of redox potentials of organic molecules. By embedding an effective molecular characterization (combining spatial and electronic features), a nonlinear molecular messaging approach (Mol-Attention), and a perturbation learning method, TransChem shows high accuracy in predicting the redox potential of organic radicals in a data set of over 100,000 data points (R² > 0.97, MAE < 0.09 V) and generalizes to the smaller 2,1,3-benzothiadiazole data set (<3000 data points) and electron affinity data set (660 data points) with low MAE of 0.07 V and 0.18 eV, respectively. In this context, a self-developed data set, i.e., the oxidation potential (OP) of a full-space disubstituted phenol data set (OPP-data set, total set: 74,529), has been predicted by TransChem with a high-throughput active learning strategy. The rapid and reliable prediction of OP could hopefully accelerate the screening of plausible reagents in highly selective cross-coupling of phenol derivatives. This study presents an important attempt to guide language modeling with chemical knowledge, while TransChem demonstrates state-of-the-art (SOTA) predictive performance on redox potential prediction benchmark data sets for its better understanding of molecular design and conformational relationships.
Affiliation(s)
- Zhan Si
- Department of Chemistry and Centre for Atomic Engineering of Advanced Materials, Anhui Province Key Laboratory of Chemistry for Inorganic/Organic Hybrid Functionalized Materials, Anhui University, Hefei 230601, China
| | - Deguang Liu
- Key Laboratory of Precision and Intelligent Chemistry, CAS Key Laboratory of Urban Pollutant Conversion, Anhui Province Key Laboratory of Biomass Clean Energy, University of Science and Technology of China, Hefei 230026, China
| | - Wan Nie
- Department of Computer Science, City University of Hong Kong, Hong Kong 999077, China
| | - Jingjing Hu
- Department of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
| | - Chen Wang
- Department of Chemistry and Centre for Atomic Engineering of Advanced Materials, Anhui Province Key Laboratory of Chemistry for Inorganic/Organic Hybrid Functionalized Materials, Anhui University, Hefei 230601, China
| | - Tingting Jiang
- Department of Chemistry and Centre for Atomic Engineering of Advanced Materials, Anhui Province Key Laboratory of Chemistry for Inorganic/Organic Hybrid Functionalized Materials, Anhui University, Hefei 230601, China
| | - Haizhu Yu
- Department of Chemistry and Centre for Atomic Engineering of Advanced Materials, Anhui Province Key Laboratory of Chemistry for Inorganic/Organic Hybrid Functionalized Materials, Anhui University, Hefei 230601, China
| | - Yao Fu
- Key Laboratory of Precision and Intelligent Chemistry, CAS Key Laboratory of Urban Pollutant Conversion, Anhui Province Key Laboratory of Biomass Clean Energy, University of Science and Technology of China, Hefei 230026, China
| |
|
14
|
Kang Y, Xia Q, Jiang Y, Li Z. MVGNet: Prediction of PI3K Inhibitors Using Multitask Learning and Multiview Frameworks. ACS Omega 2024; 9:45159-45168. [PMID: 39554430 PMCID: PMC11561616 DOI: 10.1021/acsomega.4c06224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2024] [Revised: 10/09/2024] [Accepted: 10/15/2024] [Indexed: 11/19/2024]
Abstract
PI3K (phosphatidylinositol 3-kinase) is an intracellular phosphatidylinositol kinase composed of a regulatory subunit, p85, and a catalytic subunit, p110. Based on the different structures of the p110 catalytic subunit, PI3K can be divided into four isoforms: PI3Kα, PI3Kβ, PI3Kγ, and PI3Kδ. As molecularly targeted drugs, PI3K inhibitors have demonstrated antiproliferative effects on tumor cells and can also induce cancer cell death. In this study, a multiview deep learning framework (MVGNet) is proposed, which integrates fragment-based pharmacophore information and utilizes multitask learning to capture correlation information between subtasks. This framework predicts the inhibitory activity of molecules against the four PI3K isoforms (PI3Kα, PI3Kβ, PI3Kγ, and PI3Kδ). Compared to baseline prediction models based on three traditional machine learning methods (RF, SVM, and XGBoost) and four deep learning algorithms (GAT, D-MPNN, CMPNN, and KANO), our model demonstrates superior performance. The evaluation results show that our model achieves the highest average AUC-ROC and AUC-PR values on the test set, which are 0.927 ± 0.006 and 0.980 ± 0.002, respectively. This study provides a reference for exploring the structure-activity relationship of PI3K inhibitors.
Affiliation(s)
- Yanlei Kang
- Zhejiang Province Key Laboratory of Smart Management & Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou 313000, Zhejiang Province, China
| | - Qiwei Xia
- Zhejiang Province Key Laboratory of Smart Management & Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou 313000, Zhejiang Province, China
| | - Yunliang Jiang
- Zhejiang Province Key Laboratory of Smart Management & Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou 313000, Zhejiang Province, China
- School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, Zhejiang Province, China
| | - Zhong Li
- Zhejiang Province Key Laboratory of Smart Management & Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou 313000, Zhejiang Province, China
| |
|
15
|
Lin M, Cai J, Wei Y, Peng X, Luo Q, Li B, Chen Y, Wang L. MalariaFlow: A comprehensive deep learning platform for multistage phenotypic antimalarial drug discovery. Eur J Med Chem 2024; 277:116776. [PMID: 39173285 DOI: 10.1016/j.ejmech.2024.116776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2024] [Revised: 07/31/2024] [Accepted: 08/01/2024] [Indexed: 08/24/2024]
Abstract
Malaria remains a significant global health challenge due to the growing drug resistance of Plasmodium parasites and the failure to block transmission within the human host. While machine learning (ML) and deep learning (DL) methods have shown promise in accelerating antimalarial drug discovery, the performance of deep learning models based on molecular graph and other co-representation approaches warrants further exploration. Current research has overlooked mutant strains of the malaria parasite with varying degrees of sensitivity or resistance, and has not covered the prediction of inhibitory activities across the three major life cycle stages (liver, asexual blood, and gametocyte) within the human host, which is crucial for both treatment and transmission blocking. In this study, we manually curated a benchmark antimalarial activity dataset comprising 407,404 unique compounds and 410,654 bioactivity data points across ten Plasmodium phenotypes and three stages. The performance was systematically compared among two fingerprint-based ML models (RF::Morgan and XGBoost::Morgan), four graph-based DL models (GCN, GAT, MPNN, and Attentive FP), and three co-representation DL models (FP-GNN, HiGNN, and FG-BERT), which revealed that: 1) The FP-GNN model achieved the best predictive performance, outperforming the other methods in distinguishing active and inactive compounds across balanced, more positive, and more negative datasets, with an overall AUROC of 0.900; 2) Fingerprint-based ML models outperformed graph-based DL models on large datasets (>1000 compounds), but the three co-representation DL models were able to incorporate domain-specific chemical knowledge to bridge this gap, achieving better predictive performance. These findings provide valuable guidance for selecting appropriate ML and DL methods for antimalarial activity prediction tasks.
The interpretability analysis of the FP-GNN model revealed its ability to accurately capture the key structural features responsible for the liver- and blood-stage activities of the known antimalarial drug atovaquone. Finally, we developed a web server, MalariaFlow, incorporating these high-quality models for antimalarial activity prediction, virtual screening, and similarity search, successfully predicting novel triple-stage antimalarial hits validated through experimental testing, demonstrating its effectiveness and value in discovering potential multistage antimalarial drug candidates.
Affiliation(s)
- Mujie Lin
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Junxi Cai
- School of Civil Engineering and Transportation, South China University of Technology, Guangzhou, 510006, China
| | - Yuancheng Wei
- School of Software Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Xinru Peng
- School of Software Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Qianhui Luo
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Biaoshun Li
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Yihao Chen
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
| | - Ling Wang
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China.
| |
|
16
|
He G, Liu S, Liu Z, Wang C, Zhang K, Li H. Prototype-based contrastive substructure identification for molecular property prediction. Brief Bioinform 2024; 25:bbae565. [PMID: 39494969 PMCID: PMC11533112 DOI: 10.1093/bib/bbae565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 08/11/2024] [Accepted: 10/22/2024] [Indexed: 11/05/2024] Open
Abstract
Substructure-based representation learning has emerged as a powerful approach to featurize complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. It remains an open challenge to adaptively identify meaningful substructures from numerous molecular graphs to accommodate MPP tasks. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework to autonomously discover substructural prototypes across graphs so as to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification: firstly, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures; secondly, it aligns resultant substructures with derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism is designed to integrate substructure-level information to enhance molecular representations. The effectiveness of the POSIT framework is demonstrated by experimental results from diverse real-world datasets, covering both classification and regression tasks. Moreover, visualization analysis validates the consistency of chemical priors with identified substructures. The source code is publicly available at https://github.com/VRPharmer/POSIT.
Affiliation(s)
- Gaoqi He
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Shun Liu
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Zhuoran Liu
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Changbo Wang
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Kai Zhang
- School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China
| | - Honglin Li
- Innovation Center for AI and Drug Discovery, East China Normal University, 200062 Shanghai, China
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, 200237 Shanghai, China
| |
|
17
|
Zhu Y, Zhang Y, Li X, Wang L. 3MTox: A motif-level graph-based multi-view chemical language model for toxicity identification with deep interpretation. J Hazard Mater 2024; 476:135114. [PMID: 38986414 DOI: 10.1016/j.jhazmat.2024.135114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2024] [Revised: 06/24/2024] [Accepted: 07/04/2024] [Indexed: 07/12/2024]
Abstract
Toxicity identification plays a key role in maintaining human health, as it can alert humans to the potential hazards caused by long-term exposure to a wide variety of chemical compounds. Experimental methods for determining toxicity are time-consuming and costly, while computational methods offer an alternative for the early identification of toxicity. For example, some classical ML and DL methods demonstrate excellent performance in toxicity prediction. However, these methods also have some defects, such as over-reliance on artificial features and a tendency to overfit. Proposing novel models with superior prediction performance is still an urgent task. In this study, we propose a motif-level graph-based multi-view pretraining language model, called 3MTox, for toxicity identification. The 3MTox model uses Bidirectional Encoder Representations from Transformers (BERT) as the backbone framework, and a motif graph as input. The results of extensive experiments showed that our 3MTox model achieved state-of-the-art performance on toxicity benchmark datasets and outperformed the baseline models considered. In addition, the interpretability of the model ensures that it can quickly and accurately identify toxicity sites in a given molecule, thereby contributing to the determination of the status of toxicity and associated analyses. We think that the 3MTox model is among the most promising tools that are currently available for toxicity identification.
Affiliation(s)
- Yingying Zhu
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Yanhong Zhang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Xinze Li
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Ling Wang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China.
| |
|
18
|
Liu Y, Yu Y, Wu B, Qian J, Mu H, Gu L, Zhou R, Zhang H, Wu H, Bu Y. A comprehensive prediction system for silkworm acute toxicity assessment of environmental and in-silico pesticides. Ecotoxicol Environ Saf 2024; 282:116759. [PMID: 39029220 DOI: 10.1016/j.ecoenv.2024.116759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Revised: 07/03/2024] [Accepted: 07/16/2024] [Indexed: 07/21/2024]
Abstract
The excessive application and loss of pesticides pose a great risk to the ecosystem, and the environmental safety assessment of pesticides is time-consuming and expensive using traditional animal toxicity tests. In this work, a pesticide acute toxicity dataset was created for silkworm by integrating extensive experiments and various common pesticide formulations, considering the sensitivity of silkworm to adverse environments, its economic value in China, and a gap in machine learning (ML) research on the toxicity prediction of this species. The dataset addresses the previous limitation of only being able to predict toxicity classification without specific toxicity values. A new comprehensive voting model (CVR) was developed based on ML, combining three regression algorithms, namely, Bayesian Ridge (BR), K Neighbors Regressor (KNN), and Random Forest Regressor (RF), to accurately calculate the lethal concentration 50% (LC50). Three conformal models were successfully constructed, marking the first combination of conformal models with confidence intervals to predict silkworm toxicity. Further, the toxicity mechanism was summarized by analyzing structural alerts, which identified 25 warning structures, 24 positive compounds and 14 negative compounds. Importantly, a novel comprehensive prediction system was constructed that can provide LC50 and confidence intervals, structural alerts analysis, lipid-water partition coefficient (LogP) and similarity analysis, which can comprehensively evaluate the ecological toxicity risk of substances to make up for the incomplete toxicity data of new pesticides. The validity and generalization of the CVR model were verified by an external validation set. In addition, five new, low-toxic and green pesticide alternatives were designed through 50,000 cycles.
Moreover, our software and ST Profiler can provide low-cost information access to accelerate environmental risk assessment, which can predict not only a single chemical, but also batches of chemicals, simply by inputting the SMILES / CAS / (Chinese / English) name of chemicals.
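The voting idea behind a model such as CVR can be sketched as averaging the predictions of several base regressors. The example below is only an illustration under stated assumptions: `voting_predict` and the three stub functions are hypothetical stand-ins, not the Bayesian Ridge, K Neighbors, or Random Forest models from the paper, and the optional weights are an addition for the example.

```python
from statistics import mean

def voting_predict(models, x, weights=None):
    """Average the predictions of several regressors for one input.

    `models` is a list of callables mapping an input to a float.
    With `weights`, a weighted average is returned instead of a plain mean.
    """
    predictions = [model(x) for model in models]
    if weights is None:
        return mean(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

# Hypothetical stand-ins for three fitted base regressors.
def linear_stub(x):
    return 2.0 * x + 1.0

def neighbor_stub(x):
    return 2.0 * round(x) + 1.0

def tree_stub(x):
    return 3.0 if x < 1.5 else 7.0

base_models = [linear_stub, neighbor_stub, tree_stub]
```

Calling `voting_predict(base_models, 2.0)` averages the three stub outputs; passing `weights=[1, 1, 2]` gives the third model twice the influence of the others.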
Affiliation(s)
- Yutong Liu
- Research Center of Solid Waste Pollution and Prevention, Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, Nanjing 210042, PR China; Department of Chemistry, College of Sciences, Nanjing Agricultural University, Nanjing 210095, PR China
| | - Yue Yu
- Research Center of Solid Waste Pollution and Prevention, Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, Nanjing 210042, PR China
| | - Bing Wu
- State Key Laboratory of Pollution Control and Resource Reuse, School of Environment, Nanjing University, Nanjing 210023, PR China
| | - Jieshu Qian
- School of Environmental Engineering, Wuxi University, Jiangsu 214105, PR China
| | - Hongxin Mu
- Research Center of Solid Waste Pollution and Prevention, Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, Nanjing 210042, PR China
| | - Luyao Gu
- Research Center of Solid Waste Pollution and Prevention, Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, Nanjing 210042, PR China
| | - Rong Zhou
- Research Center of Solid Waste Pollution and Prevention, Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, Nanjing 210042, PR China
| | - Houhu Zhang
- Research Center of Solid Waste Pollution and Prevention, Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, Nanjing 210042, PR China
| | - Hua Wu
- Department of Chemistry, College of Sciences, Nanjing Agricultural University, Nanjing 210095, PR China.
| | - Yuanqing Bu
- Research Center of Solid Waste Pollution and Prevention, Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, Nanjing 210042, PR China; Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science & Technology, Nanjing 210044, PR China.
| |
|
19
|
Zhang Q, Mao D, Tu Y, Wu YY. A New Fingerprint and Graph Hybrid Neural Network for Predicting Molecular Properties. J Chem Inf Model 2024; 64:5853-5866. [PMID: 39052623 DOI: 10.1021/acs.jcim.4c00586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/27/2024]
Abstract
Machine learning plays a role in accelerating drug discovery, and the design of effective machine learning models is crucial for accurately predicting molecular properties. Characterizing molecules typically involves the use of molecular fingerprints and molecular graphs. These are input into a multilayer perceptron (MLP) and variants of graph neural networks, such as graph attention networks (GATs). Due to the diverse types and large dimension of fingerprints, models may contain many features that are relatively irrelevant or redundant; meanwhile, although the GAT excels in handling heterogeneous graph tasks, it lacks the ability to extract collaborative information from neighboring nodes, so it cannot capture the joint influence of adjacent groups on atoms. To overcome these challenges, we introduce a hybrid model, combining improved GAT and MLP. In GAT, the recurrent neural network is employed to capture collaborative information. To address the dimensionality issue, we propose a feature selection algorithm, which is based on the principle of maximizing relevance while minimizing redundancy. Through experiments on 13 public data sets and 14 breast cell lines, our model demonstrates superior performance compared to state-of-the-art deep learning and traditional machine learning algorithms. Additionally, a series of ablation experiments were conducted to demonstrate the advantages of our improved version, as well as its antinoise capability and interpretability. These results indicate that our model holds promising prospects for practical applications.
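The principle of maximizing relevance while minimizing redundancy can be illustrated with a greedy selection loop. This is a minimal sketch, not the algorithm from the paper: using absolute Pearson correlation for both relevance and redundancy, the difference-based score, and the names `pearson` and `mrmr_select` are all assumptions made for the example.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def mrmr_select(features, target, k):
    """Greedily pick k feature names: high relevance to target, low redundancy.

    `features` maps a feature name to its column of values (hypothetical layout).
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        # Mean absolute correlation with the features already chosen.
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a feature, a near-duplicate of it, and a weaker but independent feature, the loop picks the first feature and then prefers the independent one over the near-duplicate, because the duplicate's redundancy penalty outweighs its relevance.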
Affiliation(s)
- Qingtian Zhang
- College of Physics Science and Technology, Yangzhou University, Jiangsu 225009, China
| | - Dangxin Mao
- College of Physics Science and Technology, Yangzhou University, Jiangsu 225009, China
| | - Yusong Tu
- College of Physics Science and Technology, Yangzhou University, Jiangsu 225009, China
| | - Yuan-Yan Wu
- College of Physics Science and Technology, Yangzhou University, Jiangsu 225009, China
| |
|
20
|
Lavecchia A. Advancing drug discovery with deep attention neural networks. Drug Discov Today 2024; 29:104067. [PMID: 38925473 DOI: 10.1016/j.drudis.2024.104067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 06/10/2024] [Accepted: 06/19/2024] [Indexed: 06/28/2024]
Abstract
In the dynamic field of drug discovery, deep attention neural networks are revolutionizing our approach to complex data. This review explores the attention mechanism and its extended architectures, including graph attention networks (GATs), transformers, bidirectional encoder representations from transformers (BERT), generative pre-trained transformers (GPTs) and bidirectional and auto-regressive transformers (BART). Delving into their core principles and multifaceted applications, we uncover their pivotal roles in catalyzing de novo drug design, predicting intricate molecular properties and deciphering elusive drug-target interactions. Despite challenges, these attention-based architectures hold unparalleled promise to drive transformative breakthroughs and accelerate progress in pharmaceutical research.
Affiliation(s)
- Antonio Lavecchia
- Drug Discovery Laboratory, Department of Pharmacy, University of Napoli Federico II, I-80131 Naples, Italy.
| |
|
21
|
Samanipour S, Barron LP, van Herwerden D, Praetorius A, Thomas KV, O’Brien JW. Exploring the Chemical Space of the Exposome: How Far Have We Gone? JACS Au 2024; 4:2412-2425. [PMID: 39055136 PMCID: PMC11267556 DOI: 10.1021/jacsau.4c00220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2024] [Revised: 05/29/2024] [Accepted: 05/31/2024] [Indexed: 07/27/2024]
Abstract
Around two-thirds of chronic human disease cannot be explained by genetics alone. The Lancet Commission on Pollution and Health estimates that 16% of global premature deaths are linked to pollution. Additionally, it is now thought that humankind has surpassed the safe planetary operating space for introducing human-made chemicals into the Earth System. Direct and indirect exposure to a myriad of chemicals, known and unknown, poses a significant threat to biodiversity and human health, from vaccine efficacy to the rise of antimicrobial resistance as well as autoimmune diseases and mental health disorders. The exposome chemical space remains largely uncharted due to the sheer number of possible chemical structures, estimated at over 10^60 unique forms. Conventional methods have cataloged only a fraction of the exposome, overlooking transformation products and often yielding uncertain results. In this Perspective, we have reviewed the latest efforts in mapping the exposome chemical space and its subspaces. We also provide our view on how the integration of data-driven approaches might be able to bridge the identified gaps.
Affiliation(s)
- Saer Samanipour
  - Van’t Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
  - UvA Data Science Center, University of Amsterdam, Amsterdam 1090 GD, The Netherlands
  - Queensland Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Cornwall Street, Woolloongabba, Queensland 4102, Australia
- Leon Patrick Barron
  - Van’t Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
  - MRC Centre for Environment and Health, Environmental Research Group, School of Public Health, Faculty of Medicine, Imperial College London, London W12 0BZ, United Kingdom
- Denice van Herwerden
  - Van’t Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
- Antonia Praetorius
  - Institute for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
- Kevin V. Thomas
  - Queensland Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Cornwall Street, Woolloongabba, Queensland 4102, Australia
- Jake William O’Brien
  - Van’t Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
  - Queensland Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Cornwall Street, Woolloongabba, Queensland 4102, Australia
22
Zhang R, Lin Y, Wu Y, Deng L, Zhang H, Liao M, Peng Y. MvMRL: a multi-view molecular representation learning method for molecular property prediction. Brief Bioinform 2024; 25:bbae298. [PMID: 38920342 PMCID: PMC11200189 DOI: 10.1093/bib/bbae298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 05/09/2024] [Accepted: 06/07/2024] [Indexed: 06/27/2024] Open
Abstract
Effective molecular representation learning is very important for Artificial Intelligence-driven Drug Design because it affects the accuracy and efficiency of molecular property prediction and other molecular modeling relevant tasks. However, previous molecular representation learning studies often suffer from limitations, such as over-reliance on a single molecular representation, failure to fully capture both local and global information in molecular structure, and ineffective integration of multiscale features from different molecular representations. These limitations restrict the complete and accurate representation of molecular structure and properties, ultimately impacting the accuracy of predicting molecular properties. To this end, we propose a novel multi-view molecular representation learning method called MvMRL, which can incorporate feature information from multiple molecular representations and capture both local and global information from different views well, thus improving molecular property prediction. Specifically, MvMRL consists of four parts: a multiscale CNN-SE Simplified Molecular Input Line Entry System (SMILES) learning component and a multiscale Graph Neural Network encoder to extract local feature information and global feature information from the SMILES view and the molecular graph view, respectively; a Multi-Layer Perceptron network to capture complex non-linear relationship features from the molecular fingerprint view; and a dual cross-attention component to fuse feature information on the multi-views deeply for predicting molecular properties. We evaluate the performance of MvMRL on 11 benchmark datasets, and experimental results show that MvMRL outperforms state-of-the-art methods, indicating its rationality and effectiveness in molecular property prediction. The source code of MvMRL was released in https://github.com/jedison-github/MvMRL.
Affiliation(s)
- Ru Zhang
  - Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, No. 175, Mingxiu East Road, Xixiang Tang District, Nanning 530001, China
- Yanmei Lin
  - Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, No. 175, Mingxiu East Road, Xixiang Tang District, Nanning 530001, China
  - Center for Applied Mathematics of Guangxi, Nanning Normal University, 508 Xinning Road, Wuming District, Nanning 530100, China
- Yijia Wu
  - Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, No. 175, Mingxiu East Road, Xixiang Tang District, Nanning 530001, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, 932 Lushan South Road, Changsha 410083, China
- Hao Zhang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen 518000, China
- Mingzhi Liao
  - Center of Bioinformatics, College of Life Sciences, Northwest A&F University, 3 Taicheng Road, Yangling, Shaanxi 712100, China
- Yuzhong Peng
  - Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, No. 175, Mingxiu East Road, Xixiang Tang District, Nanning 530001, China
  - Guangxi Academy of Sciences, 174 East University Road, Nanning 530007, China
23
Kengkanna A, Ohue M. Enhancing property and activity prediction and interpretation using multiple molecular graph representations with MMGX. Commun Chem 2024; 7:74. [PMID: 38580841 PMCID: PMC10997661 DOI: 10.1038/s42004-024-01155-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Accepted: 03/18/2024] [Indexed: 04/07/2024] Open
Abstract
Graph Neural Networks (GNNs) excel in compound property and activity prediction, but the choice of molecular graph representations significantly influences model learning and interpretation. While atom-level molecular graphs resemble natural topology, they overlook key substructures or functional groups and their interpretation partially aligns with chemical intuition. Recent research suggests alternative representations using reduced molecular graphs to integrate higher-level chemical information and leverages both representations for model. However, there is a lack of studies about applicability and impact of different molecular graphs on model learning and interpretation. Here, we introduce MMGX (Multiple Molecular Graph eXplainable discovery), investigating the effects of multiple molecular graphs, including Atom, Pharmacophore, JunctionTree, and FunctionalGroup, on model learning and interpretation with various perspectives. Our findings indicate that multiple graphs relatively improve model performance, but in varying degrees depending on datasets. Interpretation from multiple graphs in different views provides more comprehensive features and potential substructures consistent with background knowledge. These results help to understand model decisions and offer valuable insights for subsequent tasks. The concept of multiple molecular graph representations and diverse interpretation perspectives has broad applicability across tasks, architectures, and explanation techniques, enhancing model learning and interpretation for relevant applications in drug discovery.
Affiliation(s)
- Apakorn Kengkanna
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Kanagawa, 226-8501, Japan
- Masahito Ohue
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Kanagawa, 226-8501, Japan.
24
Song Z, Chen J, Cheng J, Chen G, Qi Z. Computer-Aided Molecular Design of Ionic Liquids as Advanced Process Media: A Review from Fundamentals to Applications. Chem Rev 2024; 124:248-317. [PMID: 38108629 DOI: 10.1021/acs.chemrev.3c00223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
The unique physicochemical properties, flexible structural tunability, and giant chemical space of ionic liquids (ILs) provide them a great opportunity to match different target properties to work as advanced process media. The crux of the matter is how to efficiently and reliably tailor suitable ILs toward a specific application. In this regard, the computer-aided molecular design (CAMD) approach has been widely adapted to cover this family of high-profile chemicals, that is, to perform computer-aided IL design (CAILD). This review discusses the past developments that have contributed to the state-of-the-art of CAILD and provides a perspective about how future works could pursue the acceleration of the practical application of ILs. In a broad context of CAILD, key aspects related to the forward structure-property modeling and reverse molecular design of ILs are overviewed. For the former forward task, diverse IL molecular representations, modeling algorithms, as well as representative models on physical properties, thermodynamic properties, among others of ILs are introduced. For the latter reverse task, representative works formulating different molecular design scenarios are summarized. Beyond the substantial progress made, some future perspectives to move CAILD a step forward are finally provided.
Affiliation(s)
- Zhen Song
  - State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Jiahui Chen
  - State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Jie Cheng
  - State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Guzhong Chen
  - State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Zhiwen Qi
  - State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
25
Maier JC, Wang CI, Jackson NE. Distilling coarse-grained representations of molecular electronic structure with continuously gated message passing. J Chem Phys 2024; 160:024109. [PMID: 38193551 DOI: 10.1063/5.0179253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2023] [Accepted: 12/14/2023] [Indexed: 01/10/2024] Open
Abstract
Bottom-up methods for coarse-grained (CG) molecular modeling are critically needed to establish rigorous links between atomistic reference data and reduced molecular representations. For a target molecule, the ideal reduced CG representation is a function of both the conformational ensemble of the system and the target physical observable(s) to be reproduced at the CG resolution. However, there is an absence of algorithms for selecting CG representations of molecules from which complex properties, including molecular electronic structure, can be accurately modeled. We introduce continuously gated message passing (CGMP), a graph neural network (GNN) method for atomically decomposing molecular electronic structure sampled over conformational ensembles. CGMP integrates 3D-invariant GNNs and a novel gated message passing system to continuously reduce the atomic degrees of freedom accessible for electronic predictions, resulting in a one-shot importance ranking of atoms contributing to a target molecular property. Moreover, CGMP provides the first approach by which to quantify the degeneracy of "good" CG representations conditioned on specific prediction targets, facilitating the development of more transferable CG representations. We further show how CGMP can be used to highlight multiatom correlations, illuminating a path to developing CG electronic Hamiltonians in terms of interpretable collective variables for arbitrarily complex molecules.
Affiliation(s)
- J Charlie Maier
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Chun-I Wang
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Nicholas E Jackson
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
26
Zhang Y, Liu C, Liu M, Liu T, Lin H, Huang CB, Ning L. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief Bioinform 2023; 25:bbad467. [PMID: 38189543 PMCID: PMC10772984 DOI: 10.1093/bib/bbad467] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 09/30/2023] [Revised: 11/03/2023] [Accepted: 11/25/2023] [Indexed: 01/09/2024] Open
Abstract
Recently, attention mechanism and derived models have gained significant traction in drug development due to their outstanding performance and interpretability in handling complex data structures. This review offers an in-depth exploration of the principles underlying attention-based models and their advantages in drug discovery. We further elaborate on their applications in various aspects of drug development, from molecular screening and target binding to property prediction and molecule generation. Finally, we discuss the current challenges faced in the application of attention mechanisms and Artificial Intelligence technologies, including data quality, model interpretability and computational resource constraints, along with future directions for research. Given the accelerating pace of technological advancement, we believe that attention-based models will have an increasingly prominent role in future drug discovery. We anticipate that these models will usher in revolutionary breakthroughs in the pharmaceutical domain, significantly accelerating the pace of drug development.
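For readers unfamiliar with the mechanism this review surveys, the following is a minimal sketch of scaled dot-product attention in plain Python. It is not taken from the review: the function names `softmax` and `attention` and the toy two-dimensional token embeddings are assumptions made purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Each argument is a list of equal-length float vectors.
    Returns the attended outputs and the attention weights per query.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weight-averaged combination of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention over three hypothetical token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

The returned weights are the per-query distributions over tokens (or atoms, residues, and so on) that attention-based models commonly inspect when interpretability is discussed.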
Affiliation(s)
- Yang Zhang
  - Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Caiqi Liu
  - Department of Gastrointestinal Medical Oncology, Harbin Medical University Cancer Hospital, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
  - Key Laboratory of Molecular Oncology of Heilongjiang Province, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
- Mujiexin Liu
  - Chongqing Key Laboratory of Sichuan-Chongqing Co-construction for Diagnosis and Treatment of Infectious Diseases Integrated Traditional Chinese and Western Medicine, College of Medical Technology, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Tianyuan Liu
  - Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan
- Hao Lin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Cheng-Bing Huang
  - School of Computer Science and Technology, Aba Teachers University, Aba, China
- Lin Ning
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
  - School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
27
Li B, Lin M, Chen T, Wang L. FG-BERT: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction. Brief Bioinform 2023; 24:bbad398. [PMID: 37930026 DOI: 10.1093/bib/bbad398] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 06/30/2023] [Revised: 09/25/2023] [Accepted: 10/14/2023] [Indexed: 11/07/2023] Open
Abstract
Artificial intelligence-based molecular property prediction plays a key role in molecular design such as bioactive molecules and functional materials. In this study, we propose a self-supervised pretraining deep learning (DL) framework, called functional group bidirectional encoder representations from transformers (FG-BERT), pretrained on ~1.45 million unlabeled drug-like molecules, to learn meaningful representation of molecules from functional groups. The pretrained FG-BERT framework can be fine-tuned to predict molecular properties. Compared to state-of-the-art (SOTA) machine learning and DL methods, we demonstrate the high performance of FG-BERT in evaluating molecular properties in tasks involving physical chemistry, biophysics and physiology across 44 benchmark datasets. In addition, FG-BERT utilizes attention mechanisms to focus on FG features that are critical to the target properties, thereby providing excellent interpretability for downstream training tasks. Collectively, FG-BERT does not require any artificially crafted features as input and has excellent interpretability, providing an out-of-the-box framework for developing SOTA models for a variety of molecule (especially for drug) discovery tasks.
Affiliation(s)
- Biaoshun Li
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Mujie Lin
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Tiegen Chen
  - Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Room 109, Building C, SSIP Healthcare and Medicine Demonstration Zone, Zhongshan Tsuihang New District, Zhongshan, Guangdong, 528400, China
- Ling Wang
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
28
Han S, Fu H, Wu Y, Zhao G, Song Z, Huang F, Zhang Z, Liu S, Zhang W. HimGNN: a novel hierarchical molecular graph representation learning framework for property prediction. Brief Bioinform 2023; 24:bbad305. [PMID: 37594313 DOI: 10.1093/bib/bbad305] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 05/30/2023] [Revised: 07/18/2023] [Accepted: 08/04/2023] [Indexed: 08/19/2023] Open
Abstract
Accurate prediction of molecular properties is an important topic in drug discovery. Recent works have developed various representation schemes for molecular structures to capture different chemical information in molecules. The atom and motif can be viewed as hierarchical molecular structures that are widely used for learning molecular representations to predict chemical properties. Previous works have attempted to exploit both atom and motif to address the problem of information loss in single representation learning for various tasks. To further fuse such hierarchical information, the correspondence between learned chemical features from different molecular structures should be considered. Herein, we propose a novel framework for molecular property prediction, called hierarchical molecular graph neural networks (HimGNN). HimGNN learns hierarchical topology representations by applying graph neural networks on atom- and motif-based graphs. In order to boost the representational power of the motif feature, we design a Transformer-based local augmentation module to enrich motif features by introducing heterogeneous atom information in motif representation learning. Besides, we focus on the molecular hierarchical relationship and propose a simple yet effective rescaling module, called contextual self-rescaling, that adaptively recalibrates molecular representations by explicitly modelling interdependencies between atom and motif features. Extensive computational experiments demonstrate that HimGNN can achieve promising performances over state-of-the-art baselines on both classification and regression tasks in molecular property prediction.
Affiliation(s)
- Shen Han
  - College of Informatics, Huazhong Agricultural University, People's Republic of China
- Haitao Fu
  - College of Informatics, Huazhong Agricultural University, People's Republic of China
- Yuyang Wu
  - College of Plant Science and Technology, Huazhong Agricultural University, People's Republic of China
- Ganglan Zhao
  - College of Informatics, Huazhong Agricultural University, People's Republic of China
- Zhenyu Song
  - College of Informatics, Huazhong Agricultural University, People's Republic of China
- Feng Huang
  - College of Informatics, Huazhong Agricultural University, People's Republic of China
- Zhongfei Zhang
  - Computer Science Department, Binghamton University, Binghamton, NY, USA
- Shichao Liu
  - College of Informatics, Huazhong Agricultural University, People's Republic of China and Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, Key Laboratory of Smart Animal Farming Technology, Ministry of Agriculture, Huazhong Agricultural University
- Wen Zhang
  - College of Informatics, Huazhong Agricultural University, People's Republic of China and Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, Key Laboratory of Smart Animal Farming Technology, Ministry of Agriculture, Huazhong Agricultural University
29
Hagg A, Kirschner KN. Open-Source Machine Learning in Computational Chemistry. J Chem Inf Model 2023; 63:4505-4532. [PMID: 37466636 PMCID: PMC10430767 DOI: 10.1021/acs.jcim.3c00643] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 05/04/2023] [Indexed: 07/20/2023]
Abstract
The field of computational chemistry has seen a significant increase in the integration of machine learning concepts and algorithms. In this Perspective, we surveyed 179 open-source software projects, with corresponding peer-reviewed papers published within the last 5 years, to better understand the topics within the field being investigated by machine learning approaches. For each project, we provide a short description, the link to the code, the accompanying license type, and whether the training data and resulting models are made publicly available. Based on those deposited in GitHub repositories, the most popular employed Python libraries are identified. We hope that this survey will serve as a resource to learn about machine learning or specific architectures thereof by identifying accessible codes with accompanying papers on a topic basis. To this end, we also include computational chemistry open-source software for generating training data and fundamental Python libraries for machine learning. Based on our observations and considering the three pillars of collaborative machine learning work, open data, open source (code), and open models, we provide some suggestions to the community.
Affiliation(s)
- Alexander Hagg
  - Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
  - Department of Electrical Engineering, Mechanical Engineering and Technical Journalism, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Karl N. Kirschner
  - Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
  - Department of Computer Science, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany