1
Feng L, Fu X, Du Z, Guo Y, Zhuo L, Yang Y, Cao D, Yao X. MultiCTox: Empowering Accurate Cardiotoxicity Prediction through Adaptive Multimodal Learning. J Chem Inf Model 2025; 65:3517-3528. [PMID: 40145660 DOI: 10.1021/acs.jcim.5c00022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2025]
Abstract
Cardiotoxicity refers to the inhibitory effects of drugs on cardiac ion channels. Accurate prediction of cardiotoxicity is crucial yet challenging, as it directly impacts the evaluation of cardiac drug efficacy and safety. Numerous methods have been developed to predict cardiotoxicity, yet their performance remains limited. A key limitation is that these methods often rely solely on single-modal data, making multimodal data integration challenging. As a result, we present a multimodal method integrating molecular SMILES, structure, and fingerprint to enhance cardiotoxicity prediction. First, we designed a fusion layer to unify representations from different modalities. During training, the model maximizes intramodal similarity for the same molecule while minimizing intermolecular similarity, ensuring consistent cross-modal representations. This study evaluates the inhibitory effects of candidate drugs on voltage-gated potassium (hERG), sodium (Nav1.5), and calcium (Cav1.2) channels. Experimental results demonstrate that the proposed model significantly outperforms existing state-of-the-art methods in cardiotoxicity prediction. We anticipate that this model will contribute significantly to the development and safety evaluation of cardiac drugs, reducing cardiotoxicity-related risks.
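The training objective described above (pulling together the representations of one molecule across modalities while pushing apart those of different molecules) resembles an InfoNCE-style contrastive loss. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the use of cosine similarity, and the temperature of 0.1 are all illustrative choices that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss over a batch of paired embeddings.

    Row i of view_a and row i of view_b are assumed to embed the same
    molecule in two modalities (e.g. SMILES and fingerprint); every other
    row of view_b acts as a negative for row i of view_a.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching molecule as the correct class.
        total += log_sum - logits[i]
    return total / n
```

With this loss, a batch whose rows are correctly paired across modalities scores lower than the same batch with the pairing shuffled, which is the behaviour the abstract describes as consistent cross-modal representations.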
Affiliation(s)
- Lin Feng
  - School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325027, China
- Xiangzheng Fu
  - College of Information Science and Engineering, Hunan University, Changsha 410000, China
- Zhenya Du
  - School of Nursing, Teaching and Research Department of Public Medical Courses, Guangzhou Xinhua University, Guangzhou 510520, China
- Yuting Guo
  - School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325027, China
- Linlin Zhuo
  - School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325027, China
- Yan Yang
  - School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325027, China
- Dongsheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410003, China
- Xiaojun Yao
  - Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
2
Wei Z, Shen Y, Tang X, Wen J, Song Y, Wei M, Cheng J, Zhu X. AVPpred-BWR: antiviral peptides prediction via biological words representation. Bioinformatics 2025; 41:btaf126. [PMID: 40152250 PMCID: PMC11968319 DOI: 10.1093/bioinformatics/btaf126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Revised: 02/17/2025] [Accepted: 03/26/2025] [Indexed: 03/29/2025] [Open Access]
Abstract
MOTIVATION: Antiviral peptides (AVPs) are short chains of amino acids that show great potential as antiviral drugs. Traditional approaches to identifying AVPs (e.g. wet-lab experiments) are time-consuming and laborious, while current computational methods predict them with limited accuracy. RESULTS: In this article, we propose an AVP prediction model based on biological words representation, dubbed AVPpred-BWR. Because the secondary structures of AVPs mainly consist of α-helices and loops, we explore the biological words of 1mer (corresponding to loops) and 4mer (4 continuous residues, corresponding to α-helices). That is, the peptide sequences are decomposed into biological words, and the sequential information they conceal is then represented by training Word2Vec models. Moreover, to extract multi-scale features, we use a CNN-Transformer framework to process the 1mer and 4mer embeddings generated by the Word2Vec models. To the best of our knowledge, this is the first word segmentation of protein primary structure sequences based on the regularity of protein secondary structure. AVPpred-BWR shows clear improvements over its competitors on the independent test set (e.g. improvements of 4.6% and 11.0% in AUROC and MCC, respectively, compared to UniDL4BioPep). AVAILABILITY AND IMPLEMENTATION: AVPpred-BWR is publicly available at https://github.com/zyweizm/AVPpred-BWR or https://zenodo.org/records/14880447 (doi: 10.5281/zenodo.14880447).
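The decomposition of a peptide into 1mer and 4mer "biological words" can be sketched as follows. This is an illustration only: the abstract does not say whether the 4mers overlap, so an overlapping sliding window with stride 1 is assumed here, and both the function name `to_words` and the example sequence are hypothetical.

```python
def to_words(sequence, k):
    """Split a peptide sequence into overlapping k-mers (stride 1).

    Assumption: overlapping windows. A sequence shorter than k yields
    an empty list.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


peptide = "GLFDIVKKV"  # hypothetical example sequence, not taken from the paper
one_mers = to_words(peptide, 1)   # single residues
four_mers = to_words(peptide, 4)  # runs of 4 continuous residues
```

Each resulting token list plays the role of a "sentence", so a collection of them could serve as the training corpus for a word-embedding model such as Word2Vec.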
Affiliation(s)
- Zhuoyu Wei
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Yongqi Shen
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiang Tang
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Jian Wen
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Youyi Song
  - School of Science, China Pharmaceutical University, Nanjing 210009, China
- Mingqiang Wei
  - School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
- Jing Cheng
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiaolei Zhu
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
3
Liu Z, Gu A, Bao Y, Lin GN. Epigenetic Impacts of Non-Coding Mutations Deciphered Through Pre-Trained DNA Language Model at Single-Cell Resolution. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025; 12:e2413571. [PMID: 39888214 PMCID: PMC11924033 DOI: 10.1002/advs.202413571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2024] [Revised: 01/20/2025] [Indexed: 02/01/2025]
Abstract
DNA methylation plays a critical role in gene regulation, affecting cellular differentiation and disease progression, particularly in non-coding regions. However, predicting the epigenetic consequences of non-coding mutations at single-cell resolution remains a challenge. Existing tools have limited prediction capacity and struggle to capture dynamic, cell-type-specific regulatory changes that are crucial for understanding disease mechanisms. Here, Methven, a deep learning framework designed to predict the effects of non-coding mutations on DNA methylation at single-cell resolution, is presented. Methven integrates DNA sequence with single-cell ATAC-seq data and models SNP-CpG interactions over 100 kbp genomic distances. By using a divide-and-conquer approach, Methven accurately predicts both short- and long-range regulatory interactions and leverages the pre-trained DNA language model for enhanced precision in classification and regression tasks. Methven outperforms existing methods and demonstrates robust generalizability to monocyte datasets. Importantly, it identifies CpG sites associated with rheumatoid arthritis, revealing key pathways involved in immune regulation and disease progression. Methven's ability to detect progressive epigenetic changes provides crucial insights into gene regulation in complex diseases. These findings demonstrate Methven's potential as a powerful tool for basic research and clinical applications, advancing the understanding of non-coding mutations and their role in disease, while offering new opportunities for personalized medicine.
Affiliation(s)
- Zhe Liu
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200230, China
- An Gu
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200230, China
- Yihang Bao
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200230, China
- Guan Ning Lin
  - Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200230, China
  - Shanghai Key Laboratory of Psychotic Disorders, Shanghai 200230, China
  - Engineering Research Center of Digital Medicine of the Ministry of Education, Shanghai 200230, China
4
Yu J, Peng Z, Gan L, Liu J, Bai Y, Wan S. Impact Localization System of CFRP Structure Based on EFPI Sensors. SENSORS (BASEL, SWITZERLAND) 2025; 25:1091. [PMID: 40006320 PMCID: PMC11859712 DOI: 10.3390/s25041091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2024] [Revised: 02/06/2025] [Accepted: 02/10/2025] [Indexed: 02/27/2025]
Abstract
Carbon fiber composites (CFRPs) are prone to impact loads during their production, transportation, and service life. These impacts can induce microscopic damage that is always undetectable to the naked eye, thereby posing a significant safety risk to the structural integrity of CFRP structures. In this study, we developed an impact localization system for CFRP structures using extrinsic Fabry-Perot interferometric (EFPI) sensors. The impact signals detected by EFPI sensors are demodulated at high speeds using an intensity modulation method. An impact localization method for the CFRP structure based on the energy-entropy ratio endpoint detection and CNN-BIGRU-Attention is proposed. The time difference of arrival (TDOA) between signals from different EFPI sensors is collected to characterize the impact location. The attention mechanism is integrated into the CNN-BIGRU model to enhance the significance of the TDOA of impact signals detected by proximal EFPI sensors. The model is trained using the training set, with its parameters optimized using the sand cat swarm optimization algorithm and validation set. The localization performance of different models is then evaluated and compared using the test set. The impact localization system based on the CNN-BIGRU-Attention model using EFPI sensors was validated on a CFRP plate with an experimental area of 400 mm × 400 mm. The average error in impact localization is 8.14 mm, and the experimental results demonstrate the effectiveness and satisfactory performance of the proposed method.
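The time difference of arrival (TDOA) between two sensor signals can be illustrated with a small sketch. Note the substitution: the paper detects signal endpoints with an energy-entropy ratio method, whereas the code below uses plain cross-correlation, a simpler standard technique swapped in purely for illustration. The function names and the sampling rate used in the usage note are hypothetical.

```python
def cross_correlation_lag(sig_a, sig_b):
    """Return the lag (in samples) at which sig_a best matches sig_b.

    A positive lag means the event appears later in sig_a than in sig_b.
    Brute-force O(N^2) search, adequate for short illustrative signals.
    """
    n_a, n_b = len(sig_a), len(sig_b)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-(n_b - 1), n_a):
        score = 0.0
        for j in range(n_b):
            i = j + lag
            if 0 <= i < n_a:
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def tdoa_seconds(sig_a, sig_b, sample_rate_hz):
    """Convert the best-matching lag into a time difference in seconds."""
    return cross_correlation_lag(sig_a, sig_b) / sample_rate_hz
```

For example, if the same pulse appears 7 samples later in one sensor's signal than in another's and both are sampled at an assumed 1000 Hz, `tdoa_seconds` returns 0.007 s. A vector of such pairwise differences is the kind of feature the abstract describes as characterizing the impact location.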
Affiliation(s)
- Junsong Yu
  - Key Laboratory of Nondestructive Testing, Ministry of Education, Nanchang Hangkong University, Nanchang 330063, China
  - Key Laboratory of Opto-Electronic Information Science and Technology of Jiangxi Province, Nanchang Hangkong University, Nanchang 330063, China
- Zipeng Peng
  - Key Laboratory of Nondestructive Testing, Ministry of Education, Nanchang Hangkong University, Nanchang 330063, China
- Linghui Gan
  - Key Laboratory of Nondestructive Testing, Ministry of Education, Nanchang Hangkong University, Nanchang 330063, China
- Jun Liu
  - Key Laboratory of Nondestructive Testing, Ministry of Education, Nanchang Hangkong University, Nanchang 330063, China
- Yufang Bai
  - School of Electrical Engineering, Shanghai Dianji University, Shanghai 201306, China
- Shengpeng Wan
  - Key Laboratory of Nondestructive Testing, Ministry of Education, Nanchang Hangkong University, Nanchang 330063, China
  - Key Laboratory of Opto-Electronic Information Science and Technology of Jiangxi Province, Nanchang Hangkong University, Nanchang 330063, China
5
Wang H, Zhuang L, Ding Y, Tiwari P, Liang C. EDDINet: Enhancing drug-drug interaction prediction via information flow and consensus constrained multi-graph contrastive learning. Artif Intell Med 2025; 159:103029. [PMID: 39608043 DOI: 10.1016/j.artmed.2024.103029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Revised: 08/20/2024] [Accepted: 11/18/2024] [Indexed: 11/30/2024]
Abstract
Predicting drug-drug interactions (DDIs) is crucial for understanding and preventing adverse drug reactions (ADRs). However, most existing methods inadequately explore the interactive information between drugs in a self-supervised manner, limiting our comprehension of drug-drug associations. This paper introduces EDDINet: Enhancing Drug-Drug Interaction Prediction via Information Flow and Consensus-Constrained Multi-Graph Contrastive Learning for precise DDI prediction. We first present a cross-modal information-flow mechanism to integrate diverse drug features, enriching the structural insights conveyed by the drug feature vector. Next, we employ contrastive learning to filter various biological networks, enhancing the model's robustness. Additionally, we propose a consensus regularization framework that collaboratively trains multi-view models, producing high-quality drug representations. To unify drug representations derived from different biological information, we utilize an attention mechanism for DDI prediction. Extensive experiments demonstrate that EDDINet surpasses state-of-the-art unsupervised models and outperforms some supervised baseline models in DDI prediction tasks. Our approach shows significant advantages and holds promising potential for advancing DDI research and improving drug safety assessments. Our codes are available at: https://github.com/95LY/EDDINet_code.
Affiliation(s)
- Hong Wang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Luhe Zhuang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
- Prayag Tiwari
  - School of Information Technology, Halmstad University, Halmstad 301 18, Sweden
- Cheng Liang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
6
Lu X, Xie L, Xu L, Mao R, Xu X, Chang S. Multimodal fused deep learning for drug property prediction: Integrating chemical language and molecular graph. Comput Struct Biotechnol J 2024; 23:1666-1679. [PMID: 38680871 PMCID: PMC11046066 DOI: 10.1016/j.csbj.2024.04.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 04/01/2024] [Accepted: 04/10/2024] [Indexed: 05/01/2024] [Open Access]
Abstract
Accurately predicting molecular properties is a challenging but essential task in drug discovery. Recently, many mono-modal deep learning methods have been successfully applied to molecular property prediction. However, mono-modal learning is inherently limited as it relies solely on a single modality of molecular representation, which restricts a comprehensive understanding of drug molecules. To overcome the limitations, we propose a multimodal fused deep learning (MMFDL) model to leverage information from different molecular representations. Specifically, we construct a triple-modal learning model by employing Transformer-Encoder, Bidirectional Gated Recurrent Unit (BiGRU), and graph convolutional network (GCN) to process three modalities of information from chemical language and molecular graph: SMILES-encoded vectors, ECFP fingerprints, and molecular graphs, respectively. We evaluate the proposed triple-modal model using five fusion approaches on six molecule datasets, including Delaney, Llinas2020, Lipophilicity, SAMPL, BACE, and pKa from DataWarrior. The results show that the MMFDL model achieves the highest Pearson coefficients, and stable distribution of Pearson coefficients in the random splitting test, outperforming mono-modal models in accuracy and reliability. Furthermore, we validate the generalization ability of our model in the prediction of binding constants for protein-ligand complex molecules, and assess the resilience capability against noise. Through analysis of feature distributions in chemical space and the assigned contribution of each modal model, we demonstrate that the MMFDL model shows the ability to acquire complementary information by using proper models and suitable fusion approaches. By leveraging diverse sources of bioinformatics information, multimodal deep learning models hold the potential for successful drug discovery.
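The Pearson coefficient used above to compare models on regression datasets can be computed as in the sketch below. This shows the standard formula only; the function name is hypothetical and no values from the paper are reproduced.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances, each left unnormalised (the 1/n factors cancel).
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

A value near 1 means the predictions track the measurements linearly. Repeating the calculation over several random train/test splits gives the distribution of coefficients whose stability the abstract refers to.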
Affiliation(s)
- Xiaohua Lu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Liangxu Xie
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Lei Xu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Rongzhi Mao
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Xiaojun Xu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Shan Chang
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
7
Song J, Wei M, Zhao S, Zhai H, Dai Q, Duan X. Drug Sensitivity Prediction Based on Multi-stage Multi-modal Drug Representation Learning. Interdiscip Sci 2024:10.1007/s12539-024-00668-1. [PMID: 39528873 DOI: 10.1007/s12539-024-00668-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2024] [Revised: 10/05/2024] [Accepted: 10/09/2024] [Indexed: 11/16/2024]
Abstract
Accurate prediction of anticancer drug responses is essential for developing personalized treatment plans in order to improve cancer patient survival rates and reduce healthcare costs. To this end, we propose a drug sensitivity prediction model based on multi-stage multi-modal drug representations (ModDRDSP) to reflect the properties of drugs more comprehensively, and to better model the complex interactions between cells and drugs. Specifically, we adopt the SMILES representation learning method based on the deep hierarchical bi-directional GRU network (DSBiGRU) and the molecular graph representation learning method based on the deep message-crossing network (DMCN) for the multi-modal information of drugs. Additionally, we integrate the multi-omics information of cell lines based on a convolutional neural network (CNN). Finally, we use an ensemble deep forest algorithm for the prediction of drug sensitivity. After validation, the ModDRDSP shows impressive performance which outperforms the four current industry-leading models. More importantly, ablation experiments demonstrate the validity of each module of the proposed model, and case studies show the good results of ModDRDSP for predicting drug sensitivity, further establishing the superiority of ModDRDSP in terms of performance.
Affiliation(s)
- Jinmiao Song
  - School of Software, Xinjiang University, Urumqi, 830046, China
- Mingjie Wei
  - School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116650, China
  - State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
  - Dalian Key Laboratory of Digital Technology for Minzu Culture, Dalian Minzu University, Dalian, 116650, China
- Shuang Zhao
  - School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116650, China
  - State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
  - Dalian Key Laboratory of Digital Technology for Minzu Culture, Dalian Minzu University, Dalian, 116650, China
- Hui Zhai
  - The First Affiliated Hospital, Xinjiang Medical University, Urumqi, 830011, China
- Qiguo Dai
  - School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116650, China
  - State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
  - Dalian Key Laboratory of Digital Technology for Minzu Culture, Dalian Minzu University, Dalian, 116650, China
- Xiaodong Duan
  - School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116650, China
  - State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
  - Dalian Key Laboratory of Digital Technology for Minzu Culture, Dalian Minzu University, Dalian, 116650, China
8
Tang C, Todo Y, Kodera S, Sun R, Shimada A, Hirata A. A novel multivariate time series forecasting dendritic neuron model for COVID-19 pandemic transmission tendency. Neural Netw 2024; 179:106527. [PMID: 39029298 DOI: 10.1016/j.neunet.2024.106527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Revised: 02/21/2024] [Accepted: 07/07/2024] [Indexed: 07/21/2024]
Abstract
A novel coronavirus discovered in late 2019 (COVID-19) quickly spread into a global epidemic and, thankfully, was brought under control by 2022. Because of the virus's unknown mutations and the vaccine's waning potency, forecasting is still essential for resurgence prevention and medical resource management. Computational efficiency and long-term accuracy are two bottlenecks for national-level forecasting. This study develops a novel multivariate time series forecasting model, the densely connected highly flexible dendritic neuron model (DFDNM), to predict daily and weekly positive COVID-19 cases. DFDNM's high flexibility mechanism improves its capacity to deal with nonlinear challenges. The dense introduction of shortcut connections alleviates the vanishing and exploding gradient problems, encourages feature reuse, and improves feature extraction. To deal with the rapidly growing parameters, an improved variation of the adaptive moment estimation (AdamW) algorithm is employed as the learning algorithm for the DFDNM because of its strong optimization ability. The experimental results and statistical analysis conducted across three Japanese prefectures confirm the efficacy and feasibility of the DFDNM while outperforming various state-of-the-art machine learning models. To the best of our knowledge, the proposed DFDNM is the first to restructure the dendritic neuron model's neural architecture, demonstrating promising use in multivariate time series prediction. Because of its optimal performance, the DFDNM may serve as an important reference for national and regional government decision-makers aiming to optimize pandemic prevention and medical resource management. We also verify that DFDNM is efficiently applicable not only to COVID-19 transmission prediction, but also to more general multivariate prediction tasks. It leads us to believe that it might be applied as a promising prediction model in other fields.
Affiliation(s)
- Cheng Tang
  - Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, 819-0395, Japan
  - Department of Electrical and Mechanical Engineering, Nagoya Institute of Technology, Nagoya-shi, 466-8555, Japan
- Yuki Todo
  - Faculty of Electrical and Computer Engineering, Kanazawa University, Kanazawa-shi, 920-1192, Japan
- Sachiko Kodera
  - Department of Electrical and Mechanical Engineering, Nagoya Institute of Technology, Nagoya-shi, 466-8555, Japan
- Rong Sun
  - Faculty of Electrical and Computer Engineering, Kanazawa University, Kanazawa-shi, 920-1192, Japan
  - Division of Medical Oncology & Respiratory Medicine, Department of Internal Medicine, Faculty of Medicine, Shimane University, Izumo, Japan
- Atsushi Shimada
  - Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, 819-0395, Japan
- Akimasa Hirata
  - Department of Electrical and Mechanical Engineering, Nagoya Institute of Technology, Nagoya-shi, 466-8555, Japan
9
Lin X, Yin Z, Zhang X, Hu J. KGRLFF: Detecting Drug-Drug Interactions Based on Knowledge Graph Representation Learning and Feature Fusion. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:2035-2049. [PMID: 39074014 DOI: 10.1109/tcbb.2024.3434992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/31/2024]
Abstract
Accurate prediction of drug-drug interactions (DDIs) plays an important role in improving the efficiency of drug development and ensuring the safety of combination therapy. Most existing models rely on a single source of information to predict DDIs, and few models can perform tasks on biomedical knowledge graphs. This paper proposes a new hybrid method, namely Knowledge Graph Representation Learning and Feature Fusion (KGRLFF), to fully exploit the information from the biomedical knowledge graph and molecular structure of drugs to better predict DDIs. KGRLFF first uses a Bidirectional Random Walk sampling method based on the PageRank algorithm (BRWP) to obtain higher-order neighborhood information of drugs in the knowledge graph, including neighboring nodes, semantic relations, and higher-order information associated with triple facts. Then, an embedded representation learning model named Knowledge Graph-based Cyclic Recursive Aggregation (KGCRA) is used to learn the embedded representations of drugs by recursively propagating and aggregating messages with drugs as both the source and destination. In addition, the model learns the molecular structures of the drugs to obtain the structured features. Finally, a Feature Representation Fusion Strategy (FRFS) was developed to integrate embedded representations and structured feature representations. Experimental results showed that KGRLFF is feasible for predicting potential DDIs.
10
Abdollahi S, Schaub DP, Barroso M, Laubach NC, Hutwelker W, Panzer U, Gersting SW, Bonn S. A comprehensive comparison of deep learning-based compound-target interaction prediction models to unveil guiding design principles. J Cheminform 2024; 16:118. [PMID: 39468635 PMCID: PMC11520803 DOI: 10.1186/s13321-024-00913-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Accepted: 10/10/2024] [Indexed: 10/30/2024] [Open Access]
Abstract
The evaluation of compound-target interactions (CTIs) is at the heart of drug discovery efforts. Given the substantial time and monetary costs of classical experimental screening, significant efforts have been dedicated to develop deep learning-based models that can accurately predict CTIs. A comprehensive comparison of these models on a large, curated CTI dataset is, however, still lacking. Here, we perform an in-depth comparison of 12 state-of-the-art deep learning architectures that use different protein and compound representations. The models were selected for their reported performance and architectures. To reliably compare model performance, we curated over 300 thousand binding and non-binding CTIs and established several gold-standard datasets of varying size and information. Based on our findings, DeepConv-DTI consistently outperforms other models in CTI prediction performance across the majority of datasets. It achieves an MCC of 0.6 or higher for most of the datasets and is one of the fastest models in training and inference. These results indicate that utilizing convolutional-based windows as in DeepConv-DTI to traverse trainable embeddings is a highly effective approach for capturing informative protein features. We also observed that physicochemical embeddings of targets increased model performance. We therefore modified DeepConv-DTI to include normalized physicochemical properties, which resulted in the overall best performing model Phys-DeepConv-DTI. This work highlights how the systematic evaluation of input features of compounds and targets, as well as their corresponding neural network architectures, can serve as a roadmap for the future development of improved CTI models.
Scientific contribution: This work features comprehensive CTI datasets to allow for the objective comparison and benchmarking of CTI prediction algorithms. Based on this dataset, we gained insights into which embeddings of compounds and targets and which deep learning-based algorithms perform best, providing a blueprint for the future development of CTI algorithms. Using the insights gained from this screen, we provide a novel CTI algorithm with state-of-the-art performance.
Affiliation(s)
- Sina Abdollahi
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Darius P Schaub
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Madalena Barroso
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Nora C Laubach
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Wiebke Hutwelker
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Ulf Panzer
  - III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - Hamburg Center for Translational Immunology (HCTI), University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Søren W Gersting
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Stefan Bonn
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - Hamburg Center for Translational Immunology (HCTI), University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
11
Shoombuatong W, Meewan I, Mookdarsanit L, Schaduangrat N. Stack-HDAC3i: A high-precision identification of HDAC3 inhibitors by exploiting a stacked ensemble-learning framework. Methods 2024; 230:147-157. [PMID: 39191338 DOI: 10.1016/j.ymeth.2024.08.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2024] [Revised: 08/07/2024] [Accepted: 08/17/2024] [Indexed: 08/29/2024] [Open Access]
Abstract
Epigenetics involves reversible modifications in gene expression without altering the genetic code itself. Among these modifications, histone deacetylases (HDACs) play a key role by removing acetyl groups from lysine residues on histones. Overexpression of HDACs is linked to the proliferation and survival of tumor cells. To combat this, HDAC inhibitors (HDACi) are commonly used in cancer treatments. However, pan-HDAC inhibition can lead to numerous side effects. Therefore, isoform-selective HDAC inhibitors, such as HDAC3i, could be advantageous for treating various medical conditions while minimizing off-target effects. To date, computational approaches that use only the SMILES notation without any experimental evidence have become increasingly popular and necessary for the initial discovery of novel potential therapeutic drugs. In this study, we develop an innovative and high-precision stacked-ensemble framework, called Stack-HDAC3i, which can directly identify HDAC3i using only the SMILES notation. Using an up-to-date benchmark dataset, we first employed both molecular descriptors and Mol2Vec embeddings to generate feature representations that cover multi-view information embedded in HDAC3i, such as structural and contextual information. Subsequently, these feature representations were used to train baseline models using nine popular ML algorithms. Finally, the probabilistic features derived from the selected baseline models were fused to construct the final stacked model. Both cross-validation and independent tests showed that Stack-HDAC3i is a high-accuracy prediction model with great generalization ability for identifying HDAC3i. Furthermore, in the independent test, Stack-HDAC3i achieved an accuracy of 0.926 and Matthews correlation coefficient of 0.850, which are 0.44-6.11% and 0.83-11.90% higher than its constituent baseline models, respectively.
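The two metrics reported above, accuracy and the Matthews correlation coefficient (MCC), are both derived from a binary confusion matrix. The sketch below shows the standard formulas; the function names are hypothetical, and returning 0 when the MCC denominator vanishes is a common convention assumed here rather than something stated in the abstract.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when any marginal total is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when classes may be imbalanced.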
Affiliation(s)
- Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
| | - Ittipat Meewan
- Center for Advanced Therapeutics, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom 73170, Thailand
| | - Lawankorn Mookdarsanit
- Business Information System, Faculty of Management Science, Chandrakasem Rajabhat University, Bangkok 10900, Thailand
| | - Nalini Schaduangrat
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
|
12
|
Dabouei A, Mishra I, Kapur K, Cao C, Bridges AA, Xu M. Deep Video Analysis for Bacteria Genotype Prediction. bioRxiv 2024:2024.09.16.613253. [PMID: 39345538 PMCID: PMC11429917 DOI: 10.1101/2024.09.16.613253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/01/2024]
Abstract
Genetic modification of microbes is central to many biotechnology fields, such as industrial microbiology, bioproduction, and drug discovery. Understanding how specific genetic modifications influence observable bacterial behaviors is crucial for advancing these fields. In this study, we propose a supervised model to classify bacteria harboring single gene modifications to draw connections between phenotype and genotype. In particular, we demonstrate that the spatiotemporal patterns of Vibrio cholerae growth, recorded in terms of low-resolution bright-field microscopy videos, are highly predictive of the genotype class. Additionally, we introduce a weakly supervised approach to identify key moments in culture growth that significantly contribute to prediction accuracy. By focusing on the temporal expressions of bacterial behavior, our findings offer valuable insights into the underlying mechanisms and developmental stages by which specific genes control observable phenotypes. This research opens new avenues for automating the analysis of phenotypes, with potential applications for drug discovery, disease management, etc. Furthermore, this work highlights the potential of using machine learning techniques to explore the functional roles of specific genes using a low-resolution light microscope.
|
13
|
Wang T, Du Z, Zhuo L, Fu X, Zou Q, Yao X. MultiCBlo: Enhancing predictions of compound-induced inhibition of cardiac ion channels with advanced multimodal learning. Int J Biol Macromol 2024; 276:133825. [PMID: 39002900 DOI: 10.1016/j.ijbiomac.2024.133825] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 07/09/2024] [Accepted: 07/10/2024] [Indexed: 07/15/2024]
Abstract
Predicting compound-induced inhibition of cardiac ion channels is crucial and challenging, significantly impacting cardiac drug efficacy and safety assessments. Despite the development of various computational methods for compound-induced inhibition prediction in cardiac ion channels, their performance remains limited. Most methods struggle to fuse multi-source data, relying solely on specific dataset training, leading to poor accuracy and generalization. We introduce MultiCBlo, a model that fuses multimodal information through a progressive learning approach, designed to predict compound-induced inhibition of cardiac ion channels with high accuracy. MultiCBlo employs progressive multimodal information fusion technology to integrate the compound's SMILES sequence, graph structure, and fingerprint, enhancing its representation. This is the first application of progressive multimodal learning for predicting compound-induced inhibition of cardiac ion channels, to our knowledge. The objective of this study was to predict the compound-induced inhibition of three major cardiac ion channels: hERG, Cav1.2, and Nav1.5. The results indicate that MultiCBlo significantly outperforms current models in predicting compound-induced inhibition of cardiac ion channels. We hope that MultiCBlo will facilitate cardiac drug development and reduce compound toxicity risks. Code and data are accessible at: https://github.com/taowang11/MultiCBlo. The online prediction platform is freely accessible at: https://huggingface.co/spaces/wtttt/PCICB.
Affiliation(s)
- Tao Wang
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325027 Wenzhou, China
| | - Zhenya Du
- Guangzhou Xinhua University, 510520 Guangzhou, China
| | - Linlin Zhuo
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325027 Wenzhou, China.
| | - Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, 410012 Changsha, China.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 611730 Chengdu, China
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, 999078 Macao, China.
|
14
|
Lin RH, Lin P, Wang CC, Tung CW. A novel multitask learning algorithm for tasks with distinct chemical space: zebrafish toxicity prediction as an example. J Cheminform 2024; 16:91. [PMID: 39095893 PMCID: PMC11297603 DOI: 10.1186/s13321-024-00891-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Accepted: 07/27/2024] [Indexed: 08/04/2024] Open
Abstract
Data scarcity is one of the most critical issues impeding the development of prediction models for chemical effects. Multitask learning algorithms leveraging knowledge from relevant tasks showed potential for dealing with tasks with limited data. However, current multitask methods mainly focus on learning from datasets whose task labels are available for most of the training samples. Since datasets were generated for different purposes with distinct chemical spaces, the conventional multitask learning methods may not be suitable. This study presents a novel multitask learning method MTForestNet that can deal with data scarcity problems and learn from tasks with distinct chemical space. The MTForestNet consists of nodes of random forest classifiers organized in the form of a progressive network, where each node represents a random forest model learned from a specific task. To demonstrate the effectiveness of the MTForestNet, 48 zebrafish toxicity datasets were collected and utilized as an example. Among them, two tasks are very different from other tasks with only 1.3% common chemicals shared with other tasks. In an independent test, MTForestNet with a high area under the receiver operating characteristic curve (AUC) value of 0.911 provided superior performance over compared single-task and multitask methods. The overall toxicity derived from the developed models of zebrafish toxicity is well correlated with the experimentally determined overall toxicity. In addition, the outputs from the developed models of zebrafish toxicity can be utilized as features to boost the prediction of developmental toxicity. 
The developed models are effective for predicting zebrafish toxicity and the proposed MTForestNet is expected to be useful for tasks with distinct chemical space that can be applied in other tasks. Scientific contribution: A novel multitask learning algorithm, MTForestNet, was proposed to address the challenges of developing models using datasets with distinct chemical spaces, a common issue in cheminformatics tasks. As an example, zebrafish toxicity prediction models were developed using the proposed MTForestNet, which provides superior performance over conventional single-task and multitask learning methods. In addition, the developed zebrafish toxicity prediction models can reduce animal testing.
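The AUC reported in this abstract is the area under the receiver operating characteristic curve. It equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative assumptions, not data from the MTForestNet study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels.
    scores: iterable of predicted scores, higher meaning "more positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

This pairwise form is quadratic in the number of samples; it is meant to make the definition concrete rather than to be efficient.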
Affiliation(s)
- Run-Hsin Lin
- Institute of Biotechnology and Pharmaceutical Research, National Health Research Institutes, Miaoli County, 35053, Taiwan
- Graduate Institute of Data Science, College of Management, Taipei Medical University, Taipei, 10675, Taiwan
| | - Pinpin Lin
- National Institute of Environmental Health Sciences, National Health Research Institutes, Miaoli County, 35053, Taiwan
| | - Chia-Chi Wang
- Department and Graduate Institute of Veterinary Medicine, School of Veterinary Medicine, National Taiwan University, Taipei, 10617, Taiwan
| | - Chun-Wei Tung
- Institute of Biotechnology and Pharmaceutical Research, National Health Research Institutes, Miaoli County, 35053, Taiwan.
- Graduate Institute of Data Science, College of Management, Taipei Medical University, Taipei, 10675, Taiwan.
|
15
|
Yuan L, Zhao L, Lai J, Jiang Y, Zhang Q, Shen Z, Zheng CH, Huang DS. iCRBP-LKHA: Large convolutional kernel and hybrid channel-spatial attention for identifying circRNA-RBP interaction sites. PLoS Comput Biol 2024; 20:e1012399. [PMID: 39173070 PMCID: PMC11373821 DOI: 10.1371/journal.pcbi.1012399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 09/04/2024] [Accepted: 08/08/2024] [Indexed: 08/24/2024] Open
Abstract
Circular RNAs (circRNAs) play vital roles in transcription and translation. Identification of circRNA-RBP (RNA-binding protein) interaction sites has become a fundamental step in molecular and cell biology. Deep learning (DL)-based methods have been proposed to predict circRNA-RBP interaction sites and have achieved impressive identification performance. However, those methods cannot effectively capture long-distance dependencies or utilize the interaction information of multiple features. To overcome those limitations, we propose a DL-based model, iCRBP-LKHA, which uses deep hybrid networks to identify circRNA-RBP interaction sites. iCRBP-LKHA adopts five encoding schemes. Its neural network architecture, which consists of a large kernel convolutional neural network (LKCNN), a convolutional block attention module with one-dimensional convolution (CBAM-1D) and a bidirectional gated recurrent unit (BiGRU), can automatically explore local information, global context information and multi-feature interaction information. To verify the effectiveness of iCRBP-LKHA, we compared its performance with shallow learning algorithms on 37 circRNA datasets and 37 stringent circRNA datasets. We also compared its performance with state-of-the-art DL-based methods on 37 circRNA datasets, 37 stringent circRNA datasets and 31 linear RNA datasets. The experimental results not only show that iCRBP-LKHA outperforms other competing methods, but also demonstrate the potential of this model in identifying other RNA-RBP interaction sites.
Affiliation(s)
- Lin Yuan
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China
| | - Ling Zhao
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China
| | - Jinling Lai
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China
| | - Yufeng Jiang
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China
| | - Qinhu Zhang
- Eastern Institute for Advanced Study, Eastern Institute of Technology, Ningbo, China
| | - Zhen Shen
- School of Computer and Software, Nanyang Institute of Technology, Nanyang, China
| | - Chun-Hou Zheng
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, China
| | - De-Shuang Huang
- Eastern Institute for Advanced Study, Eastern Institute of Technology, Ningbo, China
|
16
|
Yang X, Jin J, Wang R, Li Z, Wang Y, Wei L. CACPP: A Contrastive Learning-Based Siamese Network to Identify Anticancer Peptides Based on Sequence Only. J Chem Inf Model 2024; 64:2807-2816. [PMID: 37252890 DOI: 10.1021/acs.jcim.3c00297] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Indexed: 06/01/2023]
Abstract
Anticancer peptides (ACPs) recently have been receiving increasing attention in cancer therapy due to their low consumption, few adverse side effects, and easy accessibility. However, it remains a great challenge to identify anticancer peptides via experimental approaches, requiring expensive and time-consuming experimental studies. In addition, traditional machine-learning-based methods are proposed for ACP prediction mainly depending on hand-crafted feature engineering, which normally achieves low prediction performance. In this study, we propose CACPP (Contrastive ACP Predictor), a deep learning framework based on the convolutional neural network (CNN) and contrastive learning for accurately predicting anticancer peptides. In particular, we introduce the TextCNN model to extract the high-latent features based on the peptide sequences only and exploit the contrastive learning module to learn more distinguishable feature representations to make better predictions. Comparative results on the benchmark data sets indicate that CACPP outperforms all the state-of-the-art methods in the prediction of anticancer peptides. Moreover, to intuitively show that our model has good classification ability, we visualize the dimension reduction of the features from our model and explore the relationship between ACP sequences and anticancer functions. Furthermore, we also discuss the influence of data set construction on model prediction and explore our model performance on the data sets with verified negative samples.
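The abstract does not state which contrastive objective CACPP uses. As a generic illustration of how a Siamese network can learn "more distinguishable" representations, the sketch below implements the classic pairwise margin-based contrastive loss: embeddings of same-class pairs are pulled together, and embeddings of different-class pairs are pushed apart until they are at least a margin away. The function names, the margin value and the toy vectors are assumptions for illustration and should not be read as the paper's implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_class, margin=1.0):
    """Pairwise margin-based contrastive loss for one pair of embeddings.

    same_class=True:  loss grows with the distance (pull together).
    same_class=False: loss is positive only while the pair is closer
                      than `margin` (push apart), and zero otherwise.
    """
    d = euclidean(a, b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical 2-D embeddings of two peptides.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=True))   # 12.5
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False))  # 0.0
```

In a Siamese setup both embeddings come from the same encoder with shared weights, so minimizing this loss over many pairs shapes a single embedding space.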
Affiliation(s)
- Xuetong Yang
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Junru Jin
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Ruheng Wang
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Zhongshen Li
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Yu Wang
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
|
17
|
Zhang C, Xie L, Lu X, Mao R, Xu L, Xu X. Developing an Improved Cycle Architecture for AI-Based Generation of New Structures Aimed at Drug Discovery. Molecules 2024; 29:1499. [PMID: 38611779 PMCID: PMC11013495 DOI: 10.3390/molecules29071499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Revised: 03/18/2024] [Accepted: 03/21/2024] [Indexed: 04/14/2024] Open
Abstract
Drug discovery involves a crucial step of optimizing molecules with the desired structural groups. In the domain of computer-aided drug discovery, deep learning has emerged as a prominent technique in molecular modeling. Deep generative models, based on deep learning, play a crucial role in generating novel molecules when optimizing molecules. However, many existing molecular generative models are limited because they process input information only in the forward direction. To overcome this limitation, we propose an improved generative model called BD-CycleGAN, which incorporates BiLSTM (bidirectional long short-term memory) and Mol-CycleGAN (molecular cycle generative adversarial network) to preserve the information of molecular input. To evaluate the proposed model, we assess its performance by analyzing the structural distribution and evaluation metrics of generated molecules in the process of structural transformation. The results demonstrate that the BD-CycleGAN model achieves a higher success rate and exhibits increased diversity in molecular generation. Furthermore, we demonstrate its application in molecular docking, where it successfully increases the docking score for the generated molecules. The proposed BD-CycleGAN architecture harnesses the power of deep learning to facilitate the generation of molecules with desired structural features, thus offering promising advancements in the field of drug discovery processes.
Affiliation(s)
- Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China; (C.Z.); (L.X.); (X.L.); (R.M.)
| | - Xiaojun Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China; (C.Z.); (L.X.); (X.L.); (R.M.)
|
18
|
Shi W, Lin K, Zhao Y, Li Z, Zhou T. Toward a comprehensive understanding of alicyclic compounds: Bio-effects perspective and deep learning approach. Sci Total Environ 2024; 912:168927. [PMID: 38042202 DOI: 10.1016/j.scitotenv.2023.168927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 11/17/2023] [Accepted: 11/25/2023] [Indexed: 12/04/2023]
Abstract
The escalating use of alicyclic compounds in modern industrial production has led to a rapid increase of these substances in the environment, posing significant health hazards. Addressing this challenge necessitates a comprehensive understanding of these compounds, which can be achieved through a deep learning approach. Graph neural networks (GNNs), known for their extraordinary ability to process graph data with rich relationships, have been employed in various molecular prediction tasks. In this study, alicyclic molecules screened from PCBA, ToxCast and Tox21 are compiled into two prediction datasets, one for general bioactivity and one for activity against biological targets. GNN-based models are trained on the two datasets, with Attentive FP and PAGTN achieving the best performance, respectively. In addition, alicyclic carbon atoms make the greatest contribution to biological activity, which indicates that the alicyclic structures have a significant impact on the carbon atoms' contribution. Moreover, there is a large number of active molecules in other public datasets, indicating that alicyclic compounds deserve more attention in POPs control. This study uncovered deeper structure-activity relationships within these compounds, offering new perspectives and methodologies for academic research in the field.
Affiliation(s)
- Wenjie Shi
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China.
| | - Kunsen Lin
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China.
| | - Youcai Zhao
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China; Shanghai Institute of Pollution Control and Ecological Security, 1515 North Zhongshan Rd. (No. 2), Shanghai 200092, PR China
| | - Zongsheng Li
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
| | - Tao Zhou
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China; Shanghai Institute of Pollution Control and Ecological Security, 1515 North Zhongshan Rd. (No. 2), Shanghai 200092, PR China.
|
19
|
Song T, Yang Q, Qu P, Qiao L, Wang X. Attenphos: General Phosphorylation Site Prediction Model Based on Attention Mechanism. Int J Mol Sci 2024; 25:1526. [PMID: 38338804 PMCID: PMC10855885 DOI: 10.3390/ijms25031526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 01/18/2024] [Accepted: 01/23/2024] [Indexed: 02/12/2024] Open
Abstract
Phosphorylation site prediction has important application value in the field of bioinformatics. It can act as an important reference and help with protein function research, protein structure research, and drug discovery. So, it is of great significance to propose scientific and effective calculation methods to accurately predict phosphorylation sites. In this study, we propose a new method, Attenphos, based on the self-attention mechanism for predicting general phosphorylation sites in proteins. The method not only captures the long-range dependence information of proteins but also better represents the correlation between amino acids through feature vector encoding transformation. Attenphos takes advantage of the one-dimensional convolutional layer to reduce the number of model parameters, improve model efficiency and prediction accuracy, and enhance model generalization. Comparisons between our method and existing state-of-the-art prediction tools were made using balanced datasets from human proteins and unbalanced datasets from mouse proteins. We performed prediction comparisons using independent test sets. The results showed that Attenphos demonstrated the best overall performance in the prediction of Serine (S), Threonine (T), and Tyrosine (Y) sites on both balanced and unbalanced datasets. Compared to current state-of-the-art methods, Attenphos has significantly higher prediction accuracy. This proves the potential of Attenphos in accelerating the identification and functional analysis of protein phosphorylation sites and provides new tools and ideas for biological research and drug discovery.
Affiliation(s)
- Xun Wang
- Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao 266555, China; (T.S.); (Q.Y.); (P.Q.); (L.Q.)
|
20
|
Wu J, Su Y, Yang A, Ren J, Xiang Y. An improved multi-modal representation-learning model based on fusion networks for property prediction in drug discovery. Comput Biol Med 2023; 165:107452. [PMID: 37690287 DOI: 10.1016/j.compbiomed.2023.107452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2023] [Revised: 08/12/2023] [Accepted: 09/04/2023] [Indexed: 09/12/2023]
Abstract
Accurate characterization of molecular representations plays an important role in deep learning (DL)-based property prediction for drug discovery. However, most previous studies considered only one type of molecular representation, making it difficult to capture the full molecular feature information. In this study, a novel DL framework called the multi-modal molecular representation learning fusion network (MMRLFN) is developed, which can simultaneously learn and integrate drug molecular features from molecular graphs and SMILES sequences. The MMRLFN method is composed of three complementary deep neural networks that learn various features from different molecular representations, such as molecular topology, local chemical background information, and substructures at varying scales. Eight public datasets involving various molecular properties used in drug discovery were employed to train and evaluate the MMRLFN. The resulting models performed better than existing models based on mono-modal molecular representations. Additionally, a thorough analysis of the noise resistance and interpretability of the MMRLFN has been carried out. The generalization ability and effectiveness of the MMRLFN have also been verified by case studies. Overall, the MMRLFN can accurately predict molecular properties and provide potentially valuable information from large datasets, thereby maximizing the possibility of successful drug discovery.
Affiliation(s)
- Jinzhou Wu
- School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing, 401331, China
| | - Yang Su
- School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing, 401331, China.
| | - Ao Yang
- School of Safety Engineering (School of Emergency Management), Chongqing University of Science and Technology, Chongqing, 401331, China
| | - Jingzheng Ren
- Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, China
| | - Yi Xiang
- School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing, 401331, China
|
21
|
Li Z, Tu X, Chen Y, Lin W. HetDDI: a pre-trained heterogeneous graph neural network model for drug-drug interaction prediction. Brief Bioinform 2023; 24:bbad385. [PMID: 37903412 DOI: 10.1093/bib/bbad385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2023] [Revised: 08/12/2023] [Accepted: 09/13/2023] [Indexed: 11/01/2023] Open
Abstract
The simultaneous use of two or more drugs due to multi-disease comorbidity continues to increase, which may cause adverse reactions between drugs that seriously threaten public health. Therefore, the prediction of drug-drug interaction (DDI) has become a hot topic not only in clinics but also in bioinformatics. In this study, we propose a novel pre-trained heterogeneous graph neural network (HGNN) model named HetDDI, which aggregates the structural information in drug molecule graphs and rich semantic information in biomedical knowledge graph to predict DDIs. In HetDDI, we first initialize the parameters of the model with different pre-training methods. Then we apply the pre-trained HGNN to learn the feature representation of drugs from multi-source heterogeneous information, which can more effectively utilize drugs' internal structure and abundant external biomedical knowledge, thus leading to better DDI prediction. We evaluate our model on three DDI prediction tasks (binary-class, multi-class and multi-label) with three datasets and further assess its performance on three scenarios (S1, S2 and S3). The results show that the accuracy of HetDDI can achieve 98.82% in the binary-class task, 98.13% in the multi-class task and 96.66% in the multi-label one on S1, which outperforms the state-of-the-art methods by at least 2%. On S2 and S3, our method also achieves exciting performance. Furthermore, the case studies confirm that our model performs well in predicting unknown DDIs. Source codes are available at https://github.com/LinsLab/HetDDI.
Affiliation(s)
- Zhe Li
- School of Computer Science, University of South China, Hengyang, 421001 Hunan, China
| | - Xinyi Tu
- School of Computer Science, University of South China, Hengyang, 421001 Hunan, China
| | - Yuping Chen
- School of Pharmacy, University of South China, Hengyang 421001, China
| | - Wenbin Lin
- School of Mathematics and Physics, University of South China, Hengyang 421001, China
|
22
|
Teng S, Yin C, Wang Y, Chen X, Yan Z, Cui L, Wei L. MolFPG: Multi-level fingerprint-based Graph Transformer for accurate and robust drug toxicity prediction. Comput Biol Med 2023; 164:106904. [PMID: 37453376 DOI: 10.1016/j.compbiomed.2023.106904] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/23/2023] [Revised: 03/20/2023] [Accepted: 04/10/2023] [Indexed: 07/18/2023]
Abstract
Drug toxicity prediction is essential to drug development, as it can help screen compounds with potential toxicity and reduce the cost and risk of animal experiments and clinical trials. However, traditional handcrafted feature-based and molecular-graph-based approaches are insufficient for molecular representation learning. To address this problem, we developed an innovative molecular fingerprint Graph Transformer framework (MolFPG) with a global-aware module for interpretable toxicity prediction. Our approach encodes compounds using multiple molecular fingerprinting techniques and integrates Graph Transformer-based molecular representation for feature learning and toxicity prediction. Experimental results show that our proposed approach has high accuracy and reliability in predicting drug toxicity. In addition, we explored the relationship between drug features and toxicity through an interpretive analysis, which improved the interpretability of the approach. Our results highlight the potential of Graph Transformers and multi-level fingerprints for accelerating the drug discovery process by reliably and effectively flagging drug safety concerns. We believe that our study will provide vital support and reference for further development in the field of drug development and toxicity assessment.
Affiliation(s)
- Saisai Teng
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Chenglin Yin
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Yu Wang
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | | | - Zhongmin Yan
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China.
| | - Lizhen Cui
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China.
| | - Leyi Wei
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China.
| |
|
23
|
Du BX, Long Y, Li X, Wu M, Shi JY. CMMS-GCL: cross-modality metabolic stability prediction with graph contrastive learning. Bioinformatics 2023; 39:btad503. [PMID: 37572298 PMCID: PMC10457661 DOI: 10.1093/bioinformatics/btad503] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/10/2023] [Revised: 07/26/2023] [Accepted: 08/11/2023] [Indexed: 08/14/2023] Open
Abstract
MOTIVATION Metabolic stability plays a crucial role in the early stages of drug discovery and development. Accurately modeling and predicting molecular metabolic stability has great potential for the efficient screening of drug candidates as well as the optimization of lead compounds. Because wet-lab experiments are time-consuming, laborious, and expensive, in silico prediction of metabolic stability is an alternative choice. However, few computational methods have been developed to address this task. In addition, it remains a significant challenge to explain the key functional groups that determine metabolic stability. RESULTS To address these issues, we develop a novel cross-modality graph contrastive learning model named CMMS-GCL for predicting the metabolic stability of drug candidates. In our framework, we design deep learning methods to extract features for molecules from two data modalities, i.e. the SMILES sequence and the molecular graph. In particular, for the sequence data, we design a multihead attention BiGRU-based encoder that preserves the context of symbols to learn sequence representations of molecules. For the graph data, we propose a graph contrastive learning-based encoder to learn structure representations by effectively capturing the consistencies between local and global structures. We further exploit fully connected neural networks to combine the sequence and structure representations for model training. Extensive experimental results on two datasets demonstrate that our CMMS-GCL consistently outperforms seven state-of-the-art methods. Furthermore, a collection of case studies on sequence data and statistical analyses of the graph structure module strengthens the validation of the interpretability of crucial functional groups recognized by CMMS-GCL. Overall, CMMS-GCL can serve as an effective and interpretable tool for predicting metabolic stability, identifying critical functional groups, and thus facilitating the drug discovery process and lead compound optimization. AVAILABILITY AND IMPLEMENTATION The code and data underlying this article are freely available at https://github.com/dubingxue/CMMS-GCL.
Affiliation(s)
- Bing-Xue Du
- School of Life Sciences, Northwestern Polytechnical University, Xi’an 710072, China
- Institute for Infocomm Research (IR), Agency for Science, Technology and Research (A*STAR), Singapore 138632, Singapore
| | - Yahui Long
- Singapore Immunology Network (SIgN), Agency for Science, Technology and Research (A*STAR), Singapore 138648, Singapore
| | - Xiaoli Li
- Institute for Infocomm Research (IR), Agency for Science, Technology and Research (A*STAR), Singapore 138632, Singapore
| | - Min Wu
- Institute for Infocomm Research (IR), Agency for Science, Technology and Research (A*STAR), Singapore 138632, Singapore
| | - Jian-Yu Shi
- School of Life Sciences, Northwestern Polytechnical University, Xi’an 710072, China
| |
|
24
|
Sun Y, Zhang J, Yu Z, Zhang Y, Liu Z. The Bidirectional Gated Recurrent Unit Network Based on the Inception Module (Inception-BiGRU) Predicts the Missing Data by Well Logging Data. ACS OMEGA 2023; 8:27710-27724. [PMID: 37546590 PMCID: PMC10399194 DOI: 10.1021/acsomega.3c03677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 06/23/2023] [Indexed: 08/08/2023]
Abstract
As a key bridge between logging and seismic data, acoustic (AC) logging data is of great significance for reservoir lithology, physical property analysis, and quantitative evaluation, and completing AC logging data can help to obtain high-resolution inversion profiles, which can provide a reliable basis for reservoir geological interpretation. However, in the actual mining process, the AC logging data is always missing due to instrument failure and borehole collapse in many areas, and re-logging is not only expensive but also difficult to achieve. However, the AC data can be completed by other obtained logging parameters. In this paper, a bidirectional gated recurrent unit network based on the Inception module is developed to complete the AC logging data. The Inception module extracts the logging data features and inputs the extracted logging data features into the bidirectional gated recurrent unit network, which can fully consider the characteristics of the current data and the data before and after the logging sequence to complete the missing AC logging data. Experimental results show that the hybrid model (Inception-BiGRU) has higher accuracy than traditional and widely used series forecasting models (gated recurrent unit network and long short-term memory network), and this method also provides a new idea for the completion of AC logging data.
Affiliation(s)
- Youzhuang Sun
- College of Earth Science and Technology, China University of Petroleum, Qingdao 266555, China
| | - Junhua Zhang
- College of Earth Science and Technology, China University of Petroleum, Qingdao 266555, China
| | - Zhengjun Yu
- Shengli Oilfield Geophysical Research Institute, Dongying 257000, China
| | - Yongan Zhang
- College of Computer Science, China University of Petroleum, Qingdao 266555, China
| | - Zhen Liu
- College of Earth Science and Technology, China University of Petroleum, Qingdao 266555, China
| |
|
25
|
Jiang J, Zhang R, Yuan Y, Li T, Li G, Zhao Z, Yu Z. NoiseMol: A noise-robusted data augmentation via perturbing noise for molecular property prediction. J Mol Graph Model 2023; 121:108454. [PMID: 36963306 DOI: 10.1016/j.jmgm.2023.108454] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/10/2023] [Revised: 03/05/2023] [Accepted: 03/13/2023] [Indexed: 03/17/2023]
Abstract
Simplified Molecular-Input Line-Entry System (SMILES) is one of the widely used molecular representation methods for molecular property prediction. We conjecture that all the characters in the SMILES string of a molecule are essential for making up the molecule, but most of them contribute little to determining a particular property of the molecule. We verified this conjecture in a preliminary experiment. Motivated by the result, we propose to inject proper noisy information into the SMILES to augment the training data by increasing the diversity of the labeled molecules. To this end, we explore injecting perturbing noise into the original labeled SMILES strings to construct augmented data, alleviating the scarcity of labeled compound data and helping the model extract more useful molecular representations for molecular property prediction. Specifically, we directly adopt mask, swap, deletion, and fusion operations on SMILES strings to randomly mask, swap, and delete atoms in SMILES strings. The augmented data is then used with one of two strategies: each epoch alternately feeds the original and the noise-perturbed molecules, or each batch alternately feeds the original and the noise-perturbed molecules. We conduct experiments on both Transformer and BiGRU models to validate the effectiveness of the method, using widely used datasets from MoleculeNet and ZINC. Experimental results demonstrate that the proposed method outperforms strong baselines on all the datasets. NoiseMol obtains the best performance on BBBP and FDA when compared with state-of-the-art methods. In addition, NoiseMol achieves the best accuracy on LogP. Therefore, injecting perturbing noise into the labeled SMILES strings is an effective and efficient method that improves the prediction performance, generalization, and robustness of deep learning models.
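The mask, swap, and deletion operations described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regex tokenizer, the `[MASK]` placeholder, and every function name below are assumptions made for illustration, and the perturbed strings are not guaranteed to be valid SMILES.

```python
import random
import re

# Hypothetical tokenizer: bracket atoms and two-letter halogens are kept
# whole; every other character becomes its own token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")
# Tokens treated as atoms (assumption: organic-subset and bracket atoms only).
ATOM_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]")


def tokenize(smiles):
    return TOKEN_RE.findall(smiles)


def atom_positions(tokens):
    return [i for i, tok in enumerate(tokens) if ATOM_RE.fullmatch(tok)]


def mask_atom(tokens, rng, mask_token="[MASK]"):
    """Replace one randomly chosen atom token with a mask placeholder."""
    out = list(tokens)
    positions = atom_positions(out)
    if positions:
        out[rng.choice(positions)] = mask_token
    return out


def swap_atoms(tokens, rng):
    """Exchange two randomly chosen atom tokens."""
    out = list(tokens)
    positions = atom_positions(out)
    if len(positions) >= 2:
        i, j = rng.sample(positions, 2)
        out[i], out[j] = out[j], out[i]
    return out


def delete_atom(tokens, rng):
    """Drop one randomly chosen atom token, keeping at least one atom."""
    out = list(tokens)
    positions = atom_positions(out)
    if len(positions) >= 2:
        del out[rng.choice(positions)]
    return out


def perturb(smiles, rng, op):
    """Apply one named perturbation and return the noisy string."""
    ops = {"mask": mask_atom, "swap": swap_atoms, "delete": delete_atom}
    return "".join(ops[op](tokenize(smiles), rng))
```

Calling `perturb("CC(=O)Oc1ccccc1C(=O)O", random.Random(0), "mask")` returns the aspirin SMILES with one atom replaced by `[MASK]`. The fusion operation and the epoch-wise or batch-wise feeding strategies are not sketched because the abstract does not describe them in enough detail.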
Affiliation(s)
- Jing Jiang
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China; Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou, Gansu, China.
| | - Ruisheng Zhang
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
| | - Yongna Yuan
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
| | - Tongfeng Li
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China; Computer College, Qinghai Normal University, Xining, Qinghai, China.
| | - Gaili Li
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
| | - Zhili Zhao
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
| | - Zhixuan Yu
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu, China.
| |
|
26
|
Bao W, Gu Y, Chen B, Yu H. Golgi_DF: Golgi proteins classification with deep forest. Front Neurosci 2023; 17:1197824. [PMID: 37250391 PMCID: PMC10213405 DOI: 10.3389/fnins.2023.1197824] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 03/31/2023] [Accepted: 04/19/2023] [Indexed: 05/31/2023] Open
Abstract
Introduction The Golgi is one of the components of the inner membrane system in eukaryotic cells. Its main function is to send proteins synthesized by the endoplasmic reticulum to specific parts of the cell or to secrete them outside the cell. The Golgi is therefore an important organelle for protein synthesis in eukaryotic cells. Golgi disorders can cause various neurodegenerative and genetic diseases, and the accurate classification of Golgi proteins helps in developing corresponding therapeutic drugs. Methods This paper proposes a novel Golgi protein classification method, Golgi_DF, based on the deep forest algorithm. First, the proteins to be classified are converted into feature vectors containing various information. Second, the synthetic minority oversampling technique (SMOTE) is used to balance the classified samples. Next, the LightGBM method is used for feature reduction. Meanwhile, the features are used in the penultimate dense layer. The reconstructed features are then classified with the deep forest algorithm. Results Golgi_DF can be used to select the important features and identify Golgi proteins. Experiments show that it performs better than other state-of-the-art methods. Golgi_DF is a standalone tool, and all its source code is publicly available at https://github.com/baowz12345/golgiDF. Discussion Golgi_DF employs reconstructed features to classify Golgi proteins. Such a method may extract more useful features from the UniRep features.
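The abstract names SMOTE without describing it. As a rough illustration of the general technique (not the Golgi_DF code), the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name, the parameters `n_new`, `k`, and `seed`, and the toy data are all hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Assumes at least two minority samples. Each synthetic point lies on the
    segment between a sampled point and one of its k nearest minority
    neighbours (squared Euclidean distance).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class with three 2-D samples (illustrative values only).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority_points, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already spanned by the minority class.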
Affiliation(s)
- Wenzheng Bao
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Yujian Gu
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Baitong Chen
- Department of Stomatology, Xuzhou First People’s Hospital, Xuzhou, China
- The Affiliated Hospital of China University of Mining and Technology, Xuzhou, China
| | - Huiping Yu
- Department of Neurosurgery, The Hospital of Joint Logistic, Quanzhou, China
| |
|
27
|
Nemoto S, Mizuno T, Kusuhara H. Investigation of chemical structure recognition by encoder-decoder models in learning progress. J Cheminform 2023; 15:45. [PMID: 37046349 PMCID: PMC10100163 DOI: 10.1186/s13321-023-00713-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/22/2022] [Accepted: 03/18/2023] [Indexed: 04/14/2023] Open
Abstract
Descriptor generation methods using latent representations of encoder-decoder (ED) models with SMILES as input are useful because of the continuity of descriptor and restorability to the structure. However, it is not clear how the structure is recognized in the learning progress of ED models. In this work, we created ED models of various learning progress and investigated the relationship between structural information and learning progress. We showed that compound substructures were learned early in ED models by monitoring the accuracy of downstream tasks and input-output substructure similarity using substructure-based descriptors, which suggests that existing evaluation methods based on the accuracy of downstream tasks may not be sensitive enough to evaluate the performance of ED models with SMILES as descriptor generation methods. On the other hand, we showed that structure restoration was time-consuming, and in particular, insufficient learning led to the estimation of a larger structure than the actual one. It can be inferred that determining the endpoint of the structure is a difficult task for the model. To our knowledge, this is the first study to link the learning progress of SMILES by ED model to chemical structures for a wide range of chemicals.
Affiliation(s)
- Shumpei Nemoto
- Department of Pharmaceutical Sciences, The University of Tokyo, Bunkyo, Tokyo, Japan
| | - Tadahaya Mizuno
- Department of Pharmaceutical Sciences, The University of Tokyo, Bunkyo, Tokyo, Japan.
| | - Hiroyuki Kusuhara
- Department of Pharmaceutical Sciences, The University of Tokyo, Bunkyo, Tokyo, Japan
| |
|
28
|
SuHAN: Substructural hierarchical attention network for molecular representation. J Mol Graph Model 2023; 119:108401. [PMID: 36584590 DOI: 10.1016/j.jmgm.2022.108401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2022] [Revised: 12/16/2022] [Accepted: 12/23/2022] [Indexed: 12/26/2022]
Abstract
Recently, molecular representation and property exploration, combined with neural networks, have played a critical role in drug design and discovery by assisting drug-related research. However, previous research in molecular representation relies heavily on manual feature extraction based on biological experiments, which may introduce noise into the molecular information at a high cost in time and money. In this paper, a novel method named Substructural Hierarchical Attention Network (SuHAN) is proposed to discover inherent characteristics of molecules for representation learning. Specifically, SuHAN is composed of two cascaded layers: an atom-level layer and a substructure-level layer. A molecule in SMILES format is divided into several substructural fragments by predefined partition rules, and the fragments are then fed into the atom-level layer and the substructure-level layer successively to obtain feature representations from different perspectives: an atomic view and a substructural view. In this way, the prominent structural features that may be omitted in global extraction are excavated from a fine-grained viewpoint and fused to reconstruct a representative pattern in an overall view. Experiments on biophysics and physiology datasets demonstrate that our model is competitive, with a significant improvement in both accuracy and stability. We confirmed that the substructural segments and progressive hierarchical networks lead to an effective molecular representation for downstream tasks. These results provide a novel perspective on reconstructing the overall pattern from locally prominent structures.
|
29
|
Liu J, Lei X, Zhang Y, Pan Y. The prediction of molecular toxicity based on BiGRU and GraphSAGE. Comput Biol Med 2023; 153:106524. [PMID: 36623439 DOI: 10.1016/j.compbiomed.2022.106524] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/21/2022] [Revised: 12/10/2022] [Accepted: 12/31/2022] [Indexed: 01/04/2023]
Abstract
The prediction of molecular toxicity plays a crucial role in drug discovery, since it can swiftly screen out the expected drug molecules. The conventional method for predicting toxicity is to use in vivo or in vitro biological experiments in the laboratory, which can incur significant time and financial costs and even raise ethical issues. Therefore, using computational approaches to predict molecular toxicity has become a common strategy in modern drug discovery. In this article, we propose a novel model named MTBG, which makes use of both SMILES (Simplified Molecular Input Line Entry System) strings and graph structures of molecules to extract drug molecular features for drug molecular toxicity prediction. To verify the performance of the MTBG model, we use the Tox21 dataset and several widely used baseline models. Experimental results demonstrate that our model performs better than these baseline models.
Affiliation(s)
- Jianping Liu
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China
| | - Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China.
| | - Yuchen Zhang
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China
| | - Yi Pan
- Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
| |
|
30
|
Ma M, Lei X. A dual graph neural network for drug-drug interactions prediction based on molecular structure and interactions. PLoS Comput Biol 2023; 19:e1010812. [PMID: 36701288 PMCID: PMC9879511 DOI: 10.1371/journal.pcbi.1010812] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 09/26/2022] [Accepted: 12/12/2022] [Indexed: 01/27/2023] Open
Abstract
Expressive molecular representation plays a critical role in drug design research, and effective methods are beneficial for learning molecular representations and solving related problems in drug discovery, especially drug-drug interaction (DDI) prediction. Recently, much work has used graph neural networks (GNNs) to forecast DDIs and learn molecular representations. However, under the current GNN structure, the majority of approaches learn drug molecular representations from a one-dimensional string or a two-dimensional molecular graph structure, while the interaction information between chemical substructures remains rarely explored, and the identification of key substructures that contribute significantly to DDI prediction is neglected. Therefore, we proposed a dual graph neural network named DGNN-DDI to learn drug molecular features by using molecular structure and interactions. Specifically, we first designed a directed message passing neural network with a substructure attention mechanism (SA-DMPNN) to adaptively extract substructures. Second, in order to improve the final features, we separated the drug-drug interactions into pairwise interactions between each drug's unique substructures. Then, the features are used to predict the interaction probability of a DDI tuple. We evaluated DGNN-DDI on a real-world dataset. Compared to state-of-the-art methods, the model improved DDI prediction performance. We also conducted a case study on existing drugs, aiming to predict drug combinations that may be effective for the novel coronavirus disease 2019 (COVID-19). Moreover, the visual interpretation results showed that DGNN-DDI was sensitive to the structural information of drugs and able to detect the key substructures for DDIs. These advantages demonstrate that the proposed method enhances the performance and interpretation capability of DDI prediction modeling.
Affiliation(s)
- Mei Ma
- School of Computer Science, Shaanxi Normal University, Xi’an, China
- School of Mathematics and Statistics, Qinghai Normal University, Qinghai, China
| | - Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xi’an, China
| |
|
31
|
Jiang J, Zhang R, Ma J, Liu Y, Yang E, Du S, Zhao Z, Yuan Y. TranGRU: focusing on both the local and global information of molecules for molecular property prediction. APPL INTELL 2022; 53:15246-15260. [PMID: 36405344 PMCID: PMC9662124 DOI: 10.1007/s10489-022-04280-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 10/17/2022] [Indexed: 11/16/2022]
Abstract
Molecular property prediction is an essential but challenging task in drug discovery. The recurrent neural network (RNN) and Transformer are the mainstream methods for sequence modeling, and both have been successfully applied independently for molecular property prediction. As the local information and global information of molecules are very important for molecular properties, we aim to integrate the bi-directional gated recurrent unit (BiGRU) into the original Transformer encoder, together with self-attention to better capture local and global molecular information simultaneously. To this end, we propose the TranGRU approach, which encodes the local and global information of molecules by using the BiGRU and self-attention, respectively. Then, we use a gate mechanism to reasonably fuse the two molecular representations. In this way, we enhance the ability of the proposed model to encode both local and global molecular information. Compared to the baselines and state-of-the-art methods when treating each task as a single-task classification on Tox21, the proposed approach outperforms the baselines on 9 out of 12 tasks and state-of-the-art methods on 5 out of 12 tasks. TranGRU also obtains the best ROC-AUC scores on BBBP, FDA, LogP, and Tox21 (multitask classification) and has a comparable performance on ToxCast, BACE, and ecoli. On the whole, TranGRU achieves better performance for molecular property prediction. The source code is available in GitHub: https://github.com/Jiangjing0122/TranGRU.
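The abstract says a gate mechanism fuses the BiGRU (local) and self-attention (global) representations but does not give its form. The sketch below shows one common gated-fusion formulation, assumed here purely for illustration: a sigmoid gate computed from the concatenated vectors weights the two inputs element-wise. The function names, the weight layout, and the example values are hypothetical and may differ from TranGRU's actual gate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_local, h_global, weights, bias):
    """Fuse two d-dimensional vectors with an element-wise sigmoid gate.

    weights: d rows, each of length 2*d, applied to [h_local; h_global].
    bias:    d values.
    Returns g * h_local + (1 - g) * h_global, computed per dimension.
    """
    concat = list(h_local) + list(h_global)
    fused = []
    for i, (row, b) in enumerate(zip(weights, bias)):
        gate = sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        fused.append(gate * h_local[i] + (1.0 - gate) * h_global[i])
    return fused


# With all-zero parameters the gate is 0.5, so the result is the plain average.
zero_weights = [[0.0] * 4 for _ in range(2)]
averaged = gated_fusion([1.0, 2.0], [3.0, 6.0], zero_weights, [0.0, 0.0])
```

In a trained model the weights and bias would be learned, letting the gate decide per dimension how much local versus global information to keep.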
Affiliation(s)
- Jing Jiang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
- Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Baiyin Road, Lanzhou, 730030 Gansu China
| | - Ruisheng Zhang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Jun Ma
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Yunwu Liu
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Enjie Yang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Shikang Du
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Zhili Zhao
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Yongna Yuan
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| |
|
32
|
Yan X, Liu Y. Graph-sequence attention and transformer for predicting drug-target affinity. RSC Adv 2022; 12:29525-29534. [PMID: 36320763 PMCID: PMC9562047 DOI: 10.1039/d2ra05566j] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/04/2022] [Accepted: 10/04/2022] [Indexed: 11/30/2022] Open
Abstract
Drug-target binding affinity (DTA) prediction has drawn increasing interest due to its substantial position in the drug discovery process. The development of new drugs is costly, time-consuming, and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. Therefore, it is of great significance to develop effective computational methods to predict DTAs. The attention mechanisms allow the computational method to focus on the most relevant parts of the input and have been proven to be useful for various tasks. In this study, we proposed a novel model based on self-attention, called GSATDTA, to predict the binding affinity between drugs and targets. For the representation of drugs, we use Bi-directional Gated Recurrent Units (BiGRU) to extract the SMILES representation from SMILES sequences, and graph neural networks to extract the graph representation of the molecular graphs. Then we utilize an attention mechanism to fuse the two representations of the drug. For the target/protein, we utilized an efficient transformer to learn the representation of the protein, which can capture the long-distance relationships in the sequence of amino acids. We conduct extensive experiments to compare our model with state-of-the-art models. Experimental results show that our model outperforms the current state-of-the-art methods on two independent datasets.
Affiliation(s)
- Xiangfeng Yan
- School of Computer Science and Technology, Heilongjiang University Harbin China
| | - Yong Liu
- School of Computer Science and Technology, Heilongjiang University Harbin China
| |
|
33
|
Rational Discovery of Antimicrobial Peptides by Means of Artificial Intelligence. MEMBRANES 2022; 12:membranes12070708. [PMID: 35877911 PMCID: PMC9320227 DOI: 10.3390/membranes12070708] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/14/2022] [Revised: 07/05/2022] [Accepted: 07/06/2022] [Indexed: 11/16/2022]
Abstract
Antibiotic resistance is a worldwide public health problem due to the costs and mortality rates it generates. However, the large pharmaceutical industries have stopped searching for new antibiotics because of their low profitability, given the rapid replacement rates imposed by the increasingly observed resistance acquired by microorganisms. Alternatively, antimicrobial peptides (AMPs) have emerged as potent molecules with a much lower rate of resistance generation. The discovery of these peptides is carried out through extensive in vitro screenings of either rational or non-rational libraries. These processes are tedious and expensive and generate only a few AMP candidates, most of which fail to show the required activity and physicochemical properties for practical applications. This work proposes implementing an artificial intelligence algorithm to reduce the required experimentation and increase the efficiency of high-activity AMP discovery. Our deep learning (DL) model, called AMPs-Net, outperforms the state-of-the-art method by 8.8% in average precision. Furthermore, it is highly accurate to predict the antibacterial and antiviral capacity of a large number of AMPs. Our search led to identifying two unreported antimicrobial motifs and two novel antimicrobial peptides related to them. Moreover, by coupling DL with molecular dynamics (MD) simulations, we were able to find a multifunctional peptide with promising therapeutic effects. Our work validates our previously proposed pipeline for a more efficient rational discovery of novel AMPs.
|
34
|
Parakkal S, Datta R, Das D. DeepBBBP: High accuracy Blood-Brain-Barrier Permeability Prediction with a Mixed Deep Learning Model. Mol Inform 2022; 41:e2100315. [PMID: 35393777 DOI: 10.1002/minf.202100315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Accepted: 04/07/2022] [Indexed: 11/05/2022]
Abstract
Blood-brain-barrier permeability (BBBP) is an important property that is used to establish the drug-likeness of a molecule, as it establishes whether the molecule can cross the BBB when desired. It also eliminates those molecules which are not supposed to cross the barrier, as doing so would lead to toxicity. BBBP can be measured in vivo, in vitro or in silico. With the advent and subsequent rise of in silico methods for virtual drug screening, quite a bit of work has been done to predict this feature using statistical machine learning (ML) and deep learning (DL) based methods. In this work a mixed DL-based model, consisting of a Multi-layer Perceptron (MLP) and Convolutional Neural Network layers, has been paired with Mol2vec. Mol2vec is a convenient and unsupervised machine learning technique which produces high-dimensional vector representations of molecules and its molecular substructures. These succinct vector representations are utilized as inputs to the mixed DL model that is used for BBBP predictions. Several well-known benchmarks incorporating BBBP data have been used for supervised training and prediction by our mixed DL model which demonstrates superior results when compared to existing ML and DL techniques used for predicting BBBP.
|
35
|
Zhao S, Pan Q, Zou Q, Ju Y, Shi L, Su X. Identifying and Classifying Enhancers by Dinucleotide-Based Auto-Cross Covariance and Attention-Based Bi-LSTM. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:7518779. [PMID: 35422876 PMCID: PMC9005296 DOI: 10.1155/2022/7518779] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2021] [Accepted: 03/12/2022] [Indexed: 11/17/2022]
Abstract
Enhancers are a class of noncoding DNA elements located near structural genes. In recent years, their identification and classification have been the focus of research in the field of bioinformatics. However, due to their high free scattering and position variability, although the performance of the prediction model has been continuously improved, there is still a lot of room for progress. In this paper, density-based spatial clustering of applications with noise (DBSCAN) was used to screen the physicochemical properties of dinucleotides to extract dinucleotide-based auto-cross covariance (DACC) features; then, the features are reduced by feature selection Python toolkit MRMD 2.0. The reduced features are input into the random forest to identify enhancers. The enhancer classification model was built by word2vec and attention-based Bi-LSTM. Finally, the accuracies of our enhancer identification and classification models were 77.25% and 73.50%, respectively, and the Matthews' correlation coefficients (MCCs) were 0.5470 and 0.4881, respectively, which were better than the performance of most predictors.
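The abstract reports accuracy and the Matthews correlation coefficient (MCC) side by side. As a reminder of how the two metrics relate, the sketch below computes both from a binary confusion matrix using the standard definitions. The counts in the example are made up for illustration and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if any margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion matrix: 40 TP, 30 TN, 20 FP, 10 FN.
example_accuracy = accuracy(40, 30, 20, 10)
example_mcc = mcc(40, 30, 20, 10)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when classes may be imbalanced.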
Affiliation(s)
- Shulin Zhao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Qingfeng Pan
- General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
| | - Xi Su
- Foshan Maternal and Child Health Hospital, Foshan, Guangdong, China
| |
|
36
|
Nallasamy V, Seshiah M. Protein Structure Prediction Using Quantile Dragonfly and Structural Class-Based Deep Learning. Int J Pattern Recogn 2022. [DOI: 10.1142/s021800142250015x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Predicting the three-dimensional structure of a protein has received growing attention in computational molecular biology. Most recent research has aimed at exploring the search space; however, with the increasing variety and size of data, protein structure identification and prediction are still at a preliminary stage. This work aims to explore the search space to tackle protein structure prediction with minimum execution time and maximum accuracy by means of quantile regressive dragonfly and structural class homolog-based deep learning (QRD-SCHDL). The proposed QRD-SCHDL method consists of two distinct steps: protein structure identification and prediction. In the first step, protein structure identification is performed by means of a QRD optimization model to identify protein structure with minimum error. Identification is performed first because the raw database contains sequence information but no structural information, so an optimization model is designed to obtain the structural information from the database. Protein structure, however, gives much more insight than sequence alone, so the second step performs the actual prediction of protein structure from sequence via structural class and homolog-based deep learning. For each protein structure prediction, a scoring matrix is obtained by utilizing the structural class maximum correlation coefficient. Finally, the proposed method is tested on protein data sets of different sizes and compared with state-of-the-art methods. The results show the potential of the proposed method in terms of error rate, protein structure prediction time, protein structure prediction accuracy, precision, specificity, recall, ROC, Kappa coefficient and F-measure. They also show that the proposed QRD-SCHDL method attains comparable results and outperforms existing methods in certain cases, thereby indicating the efficiency of the proposed work.
Affiliation(s)
- Varanavasi Nallasamy
- Department of Computer Science, Periyar University, Salem-636011, Tamil Nadu, India
| | - Malarvizhi Seshiah
- Department of Computer Science, Thiruvalluvar Government Arts College, Rasipuram-637401, Namakkal, Tamil Nadu, India
| |
|
37
|
A Novel Molecular Representation Learning for Molecular Property Prediction with a Multiple SMILES-Based Augmentation. Computational Intelligence and Neuroscience 2022; 2022:8464452. [PMID: 35178082 PMCID: PMC8843876 DOI: 10.1155/2022/8464452] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/08/2021] [Revised: 12/15/2021] [Accepted: 12/27/2021] [Indexed: 11/17/2022]
Abstract
Deep learning has brought rapid development in molecular representation for various tasks, such as molecular property prediction. The prediction of molecular properties is a crucial task in drug discovery for finding specific drugs with good pharmacological activity and pharmacokinetic properties. Inspired by natural language processing techniques, SMILES strings are often used as character-level input to deep neural network models. However, deep learning models are hindered by the nonunique nature of the SMILES string. To efficiently learn molecular features along all message paths, in this paper we encode multiple SMILES for every molecule as an automated data augmentation for the prediction of molecular properties, which alleviates the overfitting caused by the small size of molecular property prediction datasets. By using the multiple SMILES-based augmentation, we obtained better molecular representations and showed superior performance in molecular property prediction tasks.
|
38
|
Mouse4mC-BGRU: deep learning for predicting DNA N4-methylcytosine sites in mouse genome. Methods 2022; 204:258-262. [DOI: 10.1016/j.ymeth.2022.01.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/14/2021] [Revised: 01/14/2022] [Accepted: 01/24/2022] [Indexed: 12/12/2022] Open
|
39
|
Machine learning & deep learning in data-driven decision making of drug discovery and challenges in high-quality data acquisition in the pharmaceutical industry. Future Med Chem 2021; 14:245-270. [PMID: 34939433 DOI: 10.4155/fmc-2021-0243] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 01/10/2023] Open
Abstract
Predicting the bioactivities of novel small molecules for target deconvolution and hit-to-lead optimization in drug discovery research requires molecular representation. Previous reports have demonstrated that machine learning (ML) and deep learning (DL) have substantial implications in virtual screening, peptide synthesis, drug ADMET screening and biomarker discovery. Given high-quality data acquisition, these strategies can increase positive outcomes in the drug discovery process while reducing false-positive rates, cost-effectively and in minimal time. This review discusses recent updates in AI tools as cheminformatics applications in medicinal chemistry for data-driven decision making in drug discovery, along with the challenges of high-quality data acquisition in the pharmaceutical industry, with a view to improving small-molecule bioactivities and properties.
|
40
|
|
41
|
Dou L, Zhou W, Zhang L, Xu L, Han K. Accurate identification of RNA D modification using multiple features. RNA Biol 2021; 18:2236-2246. [PMID: 33729104 PMCID: PMC8632091 DOI: 10.1080/15476286.2021.1898160] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 09/21/2020] [Revised: 02/13/2021] [Accepted: 02/23/2021] [Indexed: 10/21/2022] Open
Abstract
As one of the common post-transcriptional modifications in tRNAs, dihydrouridine (D) has prominent effects on regulating the flexibility of tRNA as well as on cancerous diseases. Because the sequencing techniques used to detect D modification are expensive and time-consuming, precise computational tools can greatly advance the study of molecular mechanisms and medical development. We proposed a novel predictor, called iRNAD_XGBoost, to identify potential D sites using multiple RNA sequence representations. In this method, the class imbalance problem is addressed with the hybrid sampling method SMOTEENN, and the top 30 features selected by XGBoost are used to construct the model. The optimized model showed high Sn and Sp values of 97.13% and 97.38%, respectively, in the jackknife test. On the independent test, these two metrics reached 91.67% and 94.74%, respectively. Compared with the iRNAD method, this model showed high generalizability and consistent prediction efficiency for positive and negative samples, yielding satisfactory MCC scores of 0.94 and 0.86, respectively. It is inferred that the chemical property and nucleotide density features (CPND), electron-ion interaction pseudopotential (EIIP and PseEIIP) and dinucleotide composition (DNC) are crucial to the recognition of D modification. The proposed predictor is a promising tool to help experimental biologists investigate molecular functions.
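For readers unfamiliar with the metrics quoted above, the following minimal sketch computes sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC) from the four confusion-matrix counts using their standard definitions. The function name and the example counts are illustrative only and are not taken from the paper.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Return (Sn, Sp, MCC) for a binary classifier's confusion-matrix counts."""
    sn = tp / (tp + fn)  # sensitivity: fraction of true positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of true negatives recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc


# Illustrative counts only (hypothetical, not results from the study).
sn, sp, mcc = confusion_metrics(tp=90, fn=10, tn=80, fp=20)
```

Unlike Sn or Sp alone, MCC uses all four counts, which is why it is often reported alongside them for imbalanced data.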
Affiliation(s)
- Lijun Dou
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Wenyang Zhou
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, Guangdong, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
| | - Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin, Heilongjiang, China
| |
|
42
|
Ao C, Zou Q, Yu L. NmRF: identification of multispecies RNA 2'-O-methylation modification sites from RNA sequences. Brief Bioinform 2021; 23:6446272. [PMID: 34850821 DOI: 10.1093/bib/bbab480] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 08/09/2021] [Revised: 10/05/2021] [Accepted: 10/18/2021] [Indexed: 12/12/2022] Open
Abstract
2'-O-methylation (Nm) is a post-transcriptional modification of RNA that is catalyzed by 2'-O-methyltransferase and involves replacing the H on the 2'-hydroxyl group with a methyl group. The 2'-O-methylation modification site is detected in a variety of RNA types (miRNA, tRNA, mRNA, etc.), plays an important role in biological processes and is associated with different diseases. Few functional mechanisms have been elucidated at present, and traditional high-throughput experiments for exploring them are time-consuming and expensive. For a deeper understanding of the relevant biological mechanisms, it is necessary to develop efficient and accurate recognition tools based on machine learning. We therefore constructed a predictor called NmRF, based on optimal mixed features and a random forest classifier, to identify 2'-O-methylation modification sites. The predictor can identify modification sites of multiple species at the same time. To obtain a better prediction model, a two-step strategy is adopted; that is, the optimal hybrid feature set is obtained by combining the light gradient boosting algorithm and an incremental feature selection strategy. In 10-fold cross-validation, the accuracies for Homo sapiens and Saccharomyces cerevisiae were 89.069% and 93.885%, and the AUCs were 0.9498 and 0.9832, respectively. The rigorous 10-fold cross-validation and independent tests confirm that the proposed method is significantly better than existing tools. A user-friendly web server is accessible at http://lab.malab.cn/~acy/NmRF.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
|
43
|
Tng SS, Le NQK, Yeh HY, Chua MCH. Improved Prediction Model of Protein Lysine Crotonylation Sites Using Bidirectional Recurrent Neural Networks. J Proteome Res 2021; 21:265-273. [PMID: 34812044 DOI: 10.1021/acs.jproteome.1c00848] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 01/23/2023]
Abstract
Histone lysine crotonylation (Kcr) is a post-translational modification of histone proteins that is involved in the regulation of gene transcription, acute and chronic kidney injury, spermatogenesis, depression, cancer, and so forth. The identification of Kcr sites in proteins is important for characterizing and regulating primary biological mechanisms. The use of computational approaches such as machine learning and deep learning algorithms has emerged in recent years because traditional wet-lab experiments are time-consuming and costly. In this study, we propose a deep learning model based on a recurrent neural network (RNN), termed Sohoko-Kcr, for the prediction of Kcr sites. Through the embedded encoding of the peptide sequences, we investigate the efficiency of RNN-based models such as long short-term memory (LSTM), bidirectional LSTM (BiLSTM), and bidirectional gated recurrent unit (BiGRU) networks using cross-validation and independent tests. We also compared Sohoko-Kcr with other published tools to verify the efficiency of our model based on 3-fold, 5-fold, and 10-fold cross-validations using independent set tests. The results show that the BiGRU model consistently displayed outstanding performance and computational efficiency. Based on the proposed model, a webserver called Sohoko-Kcr was deployed for free use and is accessible at https://sohoko-research-9uu23.ondigitalocean.app.
Affiliation(s)
- Sian Soo Tng
- Institute of Systems Science, National University of Singapore, 29 Heng Mui Keng Terrace, Singapore 119620, Singapore
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Avenue, Singapore 639818, Singapore
| | - Matthew Chin Heng Chua
- Institute of Systems Science, National University of Singapore, 29 Heng Mui Keng Terrace, Singapore 119620, Singapore
| |
|
44
|
Lv Q, Chen G, Zhao L, Zhong W, Yu-Chian Chen C. Mol2Context-vec: learning molecular representation from context awareness for drug discovery. Brief Bioinform 2021; 22:6357185. [PMID: 34428290 DOI: 10.1093/bib/bbab317] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 03/24/2021] [Revised: 07/15/2021] [Accepted: 07/21/2021] [Indexed: 11/14/2022] Open
Abstract
With the rapid development of proteomics and the rapid increase of target molecules for drug action, computer-aided drug design (CADD) has become a basic task in drug discovery. One of the key challenges in CADD is molecular representation. High-quality molecular representation with chemical intuition helps to advance many frontier problems of drug discovery. At present, molecular representation still faces several urgent problems, such as the polysemy of substructures and unsmooth information flow between atomic groups. In this research, we propose a deep contextualized Bi-LSTM architecture, Mol2Context-vec, which can integrate different levels of internal states to produce dynamic representations of molecular substructures. The resulting molecular context representation can capture the interactions between any atomic groups, especially pairs of atomic groups that are topologically distant. Experiments show that Mol2Context-vec achieves state-of-the-art performance on multiple benchmark datasets. In addition, the visual interpretation of Mol2Context-vec is very close to the structural properties of chemical molecules as understood by humans. These advantages indicate that Mol2Context-vec can be used as a reliable and effective tool for molecular representation. Availability: The source code is available for download at https://github.com/lol88/Mol2Context-vec.
Affiliation(s)
- Qiujie Lv
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, 510275, China
| | - Guanxing Chen
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, 510275, China
| | - Lu Zhao
- The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510655, China
| | - Weihe Zhong
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, 510275, China
| | - Calvin Yu-Chian Chen
- School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, 510275, China; Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan; Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan
| |
|
45
|
Li J, He S, Guo F, Zou Q. HSM6AP: a high-precision predictor for the Homo sapiens N6-methyladenosine (m6A) based on multiple weights and feature stitching. RNA Biol 2021; 18:1882-1892. [PMID: 33446014 PMCID: PMC8583144 DOI: 10.1080/15476286.2021.1875180] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/24/2020] [Revised: 12/02/2020] [Accepted: 01/08/2021] [Indexed: 01/21/2023] Open
Abstract
Recent studies have shown that RNA methylation modification can affect RNA transcription, metabolism, splicing and stability. In addition, RNA methylation modification has been associated with cancer, obesity and other diseases. Based on information about the human genome and machine learning, this paper discusses the effect of fusing sequence and gene-level feature extraction on the accuracy of methylation site recognition. The discovery of new features exposed significant limitations of existing computing tools: (1) most prediction models are based solely on sequence features and use SVM or random forest as classification methods; (2) limited by the number of samples, the model may not achieve good performance. To establish a better prediction model for methylation sites, we must set specific weighting strategies for training samples and find more powerful and informative feature matrices to establish a comprehensive model. In this paper, we present HSM6AP, a high-precision predictor for the Homo sapiens N6-methyladenosine (m6A) based on multiple weights and feature stitching. In contrast to existing methods, HSM6AP weights samples during training and explores a wide range of features. Max-Relevance-Max-Distance (MRMD) is employed for feature selection, and the feature matrix is generated by fusing single features. Extreme gradient boosting (XGBoost), an ensemble machine learning algorithm based on decision trees, is used for model training, and model performance is improved through parameter adjustment. Two rigorous independent data sets demonstrated the superiority of HSM6AP in identifying methylation sites. HSM6AP is an advanced predictor that can be directly employed by users (especially non-professional users) to predict methylation sites. Users can access our related tools and data sets at the following website: http://lab.malab.cn/~lijing/HSM6AP.html. The code of our tool is publicly accessible at https://github.com/lijingtju/HSm6AP.git.
Affiliation(s)
- Jing Li
- Institute of computational biology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Shida He
- Institute of computational biology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- Institute of computational biology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Bioinformatics Laboratory, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
|
46
|
Yu Y, He W, Jin J, Cui L, Zeng R, Wei L. iDNA-ABT: advanced deep learning model for detecting DNA methylation with adaptive features and transductive information maximization. Bioinformatics 2021; 37:4603-4610. [PMID: 34601568 DOI: 10.1093/bioinformatics/btab677] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 05/14/2021] [Revised: 09/07/2021] [Accepted: 09/29/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: DNA methylation plays an important role in epigenetic modification and in the occurrence and development of diseases. Therefore, the identification of DNA methylation sites is critical for better understanding and revealing their functional mechanisms. To date, several machine learning and deep learning methods have been developed for the prediction of different methylation types. However, they still rely heavily on manual features, which can largely limit the extraction of high-latent information. Moreover, most of them are designed for one specific methylation type, and therefore cannot predict multiple methylation sites in multiple species simultaneously. In this study, we propose iDNA-ABT, an advanced deep learning model that utilizes adaptive embedding based on bidirectional transformers for language understanding together with a novel transductive information maximization (TIM) loss. RESULTS: Benchmark results show that our proposed iDNA-ABT can automatically and adaptively learn the distinguishing features of biological sequences from multiple species, and thus performs significantly better than state-of-the-art methods in predicting three different DNA methylation types. In addition, the TIM loss is shown to be effective in dichotomous tasks via a comparison experiment. Furthermore, we verify that our features have strong adaptability and robustness to different species through comparison of adaptive embedding and six handcrafted feature encodings. Importantly, our model shows great generalization ability in different species, demonstrating that it can adaptively capture cross-species differences and improve predictive performance. For convenient use of our method, we further established an online webserver as the implementation of the proposed iDNA-ABT. AVAILABILITY: Our proposed iDNA-ABT is freely accessible via http://server.wei-group.net/iDNA_ABT and our source code is available in the GitHub repository (https://github.com/YUYING07/iDNA_ABT). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yingying Yu
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Wenjia He
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Junru Jin
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Lizhen Cui
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Rao Zeng
- Department of Software Engineering, Xiamen University, Xiamen, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| |
|
47
|
Feng C, Wei H, Yang D, Feng B, Ma Z, Han S, Zou Q, Shi H. ORS-Pred: An optimized reduced scheme-based identifier for antioxidant proteins. Proteomics 2021; 21:e2100017. [PMID: 34009737 DOI: 10.1002/pmic.202100017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/20/2021] [Revised: 04/22/2021] [Accepted: 05/12/2021] [Indexed: 12/30/2022]
Abstract
Antioxidant proteins can terminate a chain of reactions caused by free radicals and protect cells from damage. To identify antioxidant proteins rapidly, a computational model was proposed based on the optimized recoding scheme, sequence information and machine learning methods. First, over 600 recoding schemes were collected to build a scheme set. Then, the original sequence was recoded as a reduced expression whose g-gap dipeptides (g = 0, 1, 2) were used as the features of proteins. Furthermore, a random forest method was used to evaluate the classification ability of the obtained dipeptide features. After going through all schemes, the best predictive performance scheme was chosen as the optimized reduction scheme. Finally, for the RF method, a grid search strategy was used to select a better parameter combination to identify antioxidant proteins. In the experiment, the present method correctly recognized 90.13-99.87% of the antioxidant samples. Other experimental results also proved that the present method was efficient to identify antioxidant proteins. Finally, we also developed a web server that was freely accessible to researchers.
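The recoding and g-gap dipeptide steps described above can be sketched as follows. The five-group reduced alphabet in `GROUPS` is a hypothetical placeholder chosen only so the code runs; it is not one of the schemes evaluated in the paper, and the function names are likewise illustrative.

```python
from collections import Counter
from itertools import product

# Hypothetical reduction scheme: 20 amino acids mapped onto 5 group labels.
GROUPS = {"a": "AVLIMC", "b": "FWYH", "c": "STNQ", "d": "KR", "e": "DEGP"}


def recode(seq, groups):
    """Rewrite a protein sequence in the reduced alphabet."""
    lookup = {aa: label for label, members in groups.items() for aa in members}
    return "".join(lookup[aa] for aa in seq)


def g_gap_dipeptide_freqs(seq, alphabet, g):
    """Frequencies of residue pairs separated by g positions (g = 0 is adjacent)."""
    pairs = Counter(seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1))
    total = sum(pairs.values())
    # Fixed ordering over all alphabet pairs gives a vector of constant length.
    return [
        pairs[x + y] / total if total else 0.0
        for x, y in product(alphabet, repeat=2)
    ]


alphabet = "".join(GROUPS)          # "abcde"
reduced = recode("AKDFSTKRAV", GROUPS)
# Concatenate the g = 0, 1, 2 vectors into one feature vector (3 * 25 values).
features = [f for g in (0, 1, 2) for f in g_gap_dipeptide_freqs(reduced, alphabet, g)]
```

With a 5-letter reduced alphabet each gap contributes 25 features instead of the 400 needed for the full 20-letter alphabet, which is the main appeal of recoding before feature extraction.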
Affiliation(s)
- Changli Feng
- Department of Information Science and Technology, Taishan University, Taian, China
| | - Haiyan Wei
- Department of Teachers and Education, Taishan University, Taian, China
| | - Deyun Yang
- Department of Information Science and Technology, Taishan University, Taian, China
| | - Bin Feng
- Department of Information Science and Technology, Taishan University, Taian, China
| | - Zhaogui Ma
- Department of Information Science and Technology, Taishan University, Taian, China
| | - Shuguang Han
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; China and Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou, China
| | - Hua Shi
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China
| |
|
48
|
Zulfiqar H, Yuan SS, Huang QL, Sun ZJ, Dao FY, Yu XL, Lin H. Identification of cyclin protein using gradient boost decision tree algorithm. Comput Struct Biotechnol J 2021; 19:4123-4131. [PMID: 34527186 PMCID: PMC8346528 DOI: 10.1016/j.csbj.2021.07.013] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 03/31/2021] [Revised: 07/15/2021] [Accepted: 07/15/2021] [Indexed: 12/12/2022] Open
Abstract
Cyclin proteins regulate the cell cycle by forming a complex with cyclin-dependent kinases. Correct recognition of cyclin proteins could provide key clues for studying their functions. However, their sequences share low similarity, which results in poor prediction by sequence similarity-based methods. Thus, a machine learning model for identifying cyclin proteins is urgently needed. This study aimed to develop a computational model to discriminate cyclin proteins from non-cyclin proteins. In our model, protein sequences were encoded by seven kinds of features: amino acid composition, composition of k-spaced amino acid pairs, tripeptide composition, pseudo amino acid composition, Geary correlation, normalized Moreau-Broto autocorrelation and composition/transition/distribution. Afterward, these features were optimized by using analysis of variance (ANOVA) and minimum redundancy maximum relevance (mRMR) with the incremental feature selection (IFS) technique. A gradient boost decision tree (GBDT) classifier was trained on the optimal features. Five-fold cross-validation showed that our model identifies cyclins with an accuracy of 93.06% and an AUC value of 0.971, both higher than those of two recent studies on the same data.
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiao-Long Yu
- School of Materials Science and Engineering, Hainan University, Haikou 570228, China
| | - Hao Lin
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
49
|
Xu L, Ru X, Song R. Application of Machine Learning for Drug-Target Interaction Prediction. Front Genet 2021; 12:680117. [PMID: 34234813 PMCID: PMC8255962 DOI: 10.3389/fgene.2021.680117] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 03/13/2021] [Accepted: 05/28/2021] [Indexed: 11/13/2022] Open
Abstract
Exploring drug–target interactions by biomedical experiments requires a lot of human, financial, and material resources. To save time and cost to meet the needs of the present generation, machine learning methods have been introduced into the prediction of drug–target interactions. The large amount of available drug and target data in existing databases, the evolving and innovative computer technologies, and the inherent characteristics of various types of machine learning have made machine learning techniques the mainstream method for drug–target interaction prediction research. In this review, details of the specific applications of machine learning in drug–target interaction prediction are summarized, the characteristics of each algorithm are analyzed, and the issues that need to be further addressed and explored for future research are discussed. The aim of this review is to provide a sound basis for the construction of high-performance models.
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Xiaoqing Ru
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Rong Song
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
|
50
|
Song B, Li Z, Lin X, Wang J, Wang T, Fu X. Pretraining model for biological sequence data. Brief Funct Genomics 2021; 20:181-195. [PMID: 34050350 PMCID: PMC8194843 DOI: 10.1093/bfgp/elab025] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 03/23/2021] [Revised: 04/13/2021] [Accepted: 04/21/2021] [Indexed: 12/26/2022] Open
Abstract
With the development of high-throughput sequencing technology, biological sequence data reflecting life information have become increasingly accessible. Particularly against the background of the COVID-19 pandemic, biological sequence data play an important role in detecting diseases, analyzing mechanisms and discovering specific drugs. In recent years, pretraining models that emerged in natural language processing have attracted widespread attention in many research fields, both for decreasing training cost and for improving performance on downstream tasks. Pretraining models are used to embed biological sequences and extract features from large biological sequence corpora in order to comprehensively understand the biological sequence data. In this survey, we provide a broad review of pretraining models for biological sequence data. We first introduce biological sequences and the corresponding datasets, including brief descriptions and accessible links. Subsequently, we systematically summarize popular pretraining models for biological sequences in four categories: CNN, word2vec, LSTM and Transformer. Then, we present some applications of the proposed pretraining models on downstream tasks to explain the role of pretraining models. Next, we provide a novel pretraining scheme for protein sequences and a multitask benchmark for protein pretraining models. Finally, we discuss the challenges and future directions in pretraining models for biological sequences.
Affiliation(s)
- Xiangzheng Fu
- Corresponding author: Xiangzheng Fu, College of Information Science and Engineering, Hunan University, Changsha, Hunan, China. Tel: 86-0731-88821907; E-mail:
|