51
Li W, Zhang H, Li M, Han M, Yin Y. MGEGFP: a multi-view graph embedding method for gene function prediction based on adaptive estimation with GCN. Brief Bioinform 2022; 23:6659744. [PMID: 35947989 DOI: 10.1093/bib/bbac333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 07/02/2022] [Accepted: 07/21/2022] [Indexed: 11/14/2022] Open
Abstract
In recent years, a number of computational approaches have been proposed to effectively integrate multiple heterogeneous biological networks, and have shown impressive performance for inferring gene function. However, the previous methods do not fully represent the critical neighborhood relationship between genes during the feature learning process. Furthermore, it is difficult to accurately estimate the contributions of different views for multi-view integration. In this paper, we propose MGEGFP, a multi-view graph embedding method based on adaptive estimation with Graph Convolutional Network (GCN), to learn high-quality gene representations among multiple interaction networks for function prediction. First, we design a dual-channel GCN encoder to disentangle the view-specific information and the consensus pattern across diverse networks. With the aid of the disentangled representations, we develop a multi-gate module to adaptively estimate the contributions of different views during each reconstruction process and make full use of the multiplexity advantages, where a diversity preservation constraint is designed to prevent over-fitting. To validate the effectiveness of our model, we conduct experiments on networks from the STRING database for both yeast and human datasets, and compare the performance with seven state-of-the-art methods on five evaluation metrics. Moreover, the ablation study demonstrates the important contribution of the designed dual-channel encoder, multi-gate module and diversity preservation constraint in MGEGFP. The experimental results confirm the superiority of our proposed method and suggest that MGEGFP can be a useful tool for gene function prediction.
Affiliation(s)
- Wei Li
  - College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Han Zhang
  - College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Minghe Li
  - College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Mingjing Han
  - College of Artificial Intelligence, Nankai University, Tongyan Road, 300350, Tianjin, China
- Yanbin Yin
  - Department of Food Science and Technology, University of Nebraska - Lincoln, 1400 R Street, 68588, Nebraska, USA
52
Sharma N, Naorem LD, Jain S, Raghava GPS. ToxinPred2: an improved method for predicting toxicity of proteins. Brief Bioinform 2022; 23:6590152. [PMID: 35595541 DOI: 10.1093/bib/bbac174] [Citation(s) in RCA: 108] [Impact Index Per Article: 36.0] [Received: 01/05/2022] [Revised: 03/31/2022] [Accepted: 04/18/2022] [Indexed: 12/13/2022] Open
Abstract
Proteins/peptides have been shown to be promising therapeutic agents for a variety of diseases. However, toxicity is one of the obstacles in protein/peptide-based therapy. The current study describes a web-based tool, ToxinPred2, developed for predicting the toxicity of proteins. This is an update of ToxinPred, which was developed mainly for predicting the toxicity of peptides and small proteins. The method has been trained, tested and evaluated on three datasets curated from a recent release of SwissProt. To provide unbiased evaluation, we performed internal validation on 80% of the data and external validation on the remaining 20% of the data. We have implemented the following techniques for predicting protein toxicity: (i) Basic Local Alignment Search Tool-based similarity, (ii) Motif-EmeRging and with Classes-Identification-based motif search and (iii) prediction models. Similarity and motif-based techniques achieved a high probability of correct prediction with poor sensitivity/coverage, whereas models based on machine-learning techniques achieved balanced sensitivity and specificity with reasonably high accuracy. Finally, we developed a hybrid method that combined all three approaches and achieved a maximum area under the receiver operating characteristic curve of around 0.99 with a Matthews correlation coefficient of 0.91 on the validation dataset. In addition, we developed models on alternate and realistic datasets. The best machine learning models have been implemented in the web server named 'ToxinPred2', which is available at https://webs.iiitd.edu.in/raghava/toxinpred2/ and as a standalone version at https://github.com/raghavagps/toxinpred2. This is a general method developed for predicting the toxicity of proteins regardless of their source of origin.
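The 80%/20% internal/external validation split mentioned in this abstract can be illustrated with a minimal sketch. The function name, the fixed seed, and the toy record names below are assumptions made for illustration; they are not taken from the ToxinPred2 code base.

```python
import random


def split_internal_external(records, external_fraction=0.2, seed=0):
    """Shuffle records and hold out a fraction for external validation.

    Hypothetical helper: the abstract states only the 80/20 proportion,
    not how the split was actually drawn.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_external = round(len(shuffled) * external_fraction)
    external = shuffled[:n_external]
    internal = shuffled[n_external:]
    return internal, external


# Toy stand-ins for protein records.
sequences = [f"protein_{i}" for i in range(10)]
internal, external = split_internal_external(sequences)
print(len(internal), len(external))
```

Holding the external set out before any model selection is what keeps the reported external performance from being influenced by tuning on the internal set.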
Affiliation(s)
- Neelam Sharma
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Leimarembi Devi Naorem
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Shipra Jain
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
53
Chen Q, Yang C, Xie Y, Wang Y, Li X, Wang K, Huang J, Yan W. GM-Pep: A High Efficiency Strategy to De Novo Design Functional Peptide Sequences. J Chem Inf Model 2022; 62:2617-2629. [PMID: 35533298 DOI: 10.1021/acs.jcim.2c00089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
Abstract
Although peptides are regarded as ideal therapeutic agents, only a small proportion of the marketed drugs are peptides. In the past decade, pharmacists have paid great attention to the development of peptide therapeutics. Except for a few approved chemically/rationally designed peptides, most attempts failed due to unsatisfactory efficacy or safety. Luckily, computational methods, such as artificial intelligence, have been utilized to accelerate the discovery of therapeutic peptides by predicting the activity, toxicity, and absorption, distribution, metabolism, and excretion of polypeptides. Usually, a specific biological activity of a peptide could be accurately determined by an interest-oriented binary classification constructed of a positive set and another un-experimentally validated negative set regardless of other characteristics, which suggests that it could be challenging to realize the comprehensive evaluation of the research object in the early stage of drug research and development. Herein, we proposed an integrated method (GM-Pep) that contained a conditional variational autoencoder model (CVAE) and a positive sample training multiclassifier (Deep-Multiclassifier) to effectively generate a single bioactive peptide sequence without toxicity and referential side effects. The results showed that our Deep-Multiclassifier model gave a sequence accuracy of up to 96.41% [toxicity (94.48%), antifungal (96.58%), antihypertensive (97.18%), and antibacterial (96.91%), respectively]. The properties of Deep-Multiclassifier and CVAE were validated through 12 first synthesized antibacterial peptides or compared to random peptides. The source code and data sets are available at https://github.com/TimothyChen225/GM-Pep.
Affiliation(s)
- Qushuo Chen
  - The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu 730000, China
- Changyan Yang
  - The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu 730000, China
- Yihao Xie
  - The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu 730000, China
- Yuqiang Wang
  - School of Stomatology, Lanzhou University, Lanzhou, Gansu 730000, China
- Xiaoxu Li
  - School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu 730050, China
- Kairong Wang
  - The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu 730000, China
- Jinqi Huang
  - Department of Hematology, Affiliated Hospital of Guangdong Medical University, Zhanjiang, Guangdong 524000, China
- Wenjin Yan
  - The Institute of Pharmacology, Key Laboratory of Preclinical Study for New Drugs of Gansu Province, School of Basic Medical Sciences, Lanzhou University, Lanzhou, Gansu 730000, China
54
Malcertificate: Research and Implementation of a Malicious Certificate Detection Algorithm Based on GCN. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12094440] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Encryption is widely used to ensure the security and confidentiality of information. Because people trust encryption technology, a series of certificate-based attack methods has emerged. Malicious certificates shield many malicious behaviors and threaten data security. To counter this threat, machine learning algorithms are widely used in malicious certificate detection. However, the detection efficiency of such algorithms largely depends on whether the extracted features can effectively represent the data. In contrast, graph convolutional networks (GCNs) can automatically extract useful features. GCNs are powerful at fitting graph data, which can improve the effectiveness of learning systems by efficiently embedding prior knowledge in an end-to-end manner. In this paper, we propose an algorithm for detecting malicious digital certificates with GCNs. First, we transform the digital certificate dataset, which has a PEM document structure, into a graph-structured corpus based on attribute co-occurrence and document-attribute relations. Then, we feed the graph-structured certificate dataset into a GCN for training. The experimental results show that the GCN is very effective in certificate classification and outperforms traditional machine learning algorithms and existing neural network algorithms. The accuracy of our algorithm in detecting malicious certificates is 97.41%, which shows that our algorithm is very effective.
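For readers unfamiliar with how a GCN propagates information over a graph, the following is a minimal sketch of one graph convolution layer using the widely used symmetric normalization, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). The three-node graph, the feature matrix, and the weights are hypothetical toy values; the paper's actual certificate graph and architecture are not reproduced here.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One layer: aggregate neighbor features, apply weights, then ReLU."""
    propagated = matmul(normalize_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, value) for value in row] for row in transformed]


# Toy path graph 0 - 1 - 2 (assumed for illustration only).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0]]  # identity weights, so the output shows pure aggregation
hidden = gcn_layer(adj, features, weights)
print(hidden)
```

Each output row mixes a node's own features with those of its neighbors, which is why no hand-crafted neighborhood features are needed once the graph is built.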
55
Liu M, Sun ZL, Zeng Z, Lam KM. MGF6mARice: prediction of DNA N6-methyladenine sites in rice by exploiting molecular graph feature and residual block. Brief Bioinform 2022; 23:6553606. [PMID: 35325050 DOI: 10.1093/bib/bbac082] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/22/2021] [Revised: 02/13/2022] [Accepted: 02/16/2022] [Indexed: 11/12/2022] Open
Abstract
DNA N6-methyladenine (6mA) is produced by methylation at the N6 position of adenine, which occurs at the molecular level, and is involved in numerous vital biological processes in the rice genome. Given the shortcomings of biological experiments, researchers have developed many computational methods to predict 6mA sites and achieved good performance. However, the existing methods do not consider the occurrence mechanism of 6mA when extracting features, and so do not use the molecular structure. In this paper, a novel deep learning method, named MGF6mARice, is proposed for 6mA site prediction in rice by devising a DNA molecular graph feature and a residual block structure. Firstly, the DNA sequence is converted into simplified molecular input line entry system (SMILES) format, which reflects the chemical molecular structure. Secondly, for the molecular structure data, we construct the DNA molecular graph feature based on the principle of the graph convolutional network. Then, the residual block is designed to extract higher-level, distinguishable features from the molecular graph features. Finally, the prediction module is used to determine whether a site is a 6mA site. In 10-fold cross-validation, MGF6mARice outperforms the state-of-the-art approaches. Multiple experiments have shown that the molecular graph feature and residual block improve the performance of MGF6mARice in 6mA prediction. To the best of our knowledge, this is the first time a DNA sequence feature has been derived by considering the chemical molecular structure. We hope that MGF6mARice will help researchers analyze 6mA sites in rice.
Affiliation(s)
- Mengya Liu
  - School of Computer Science and Technology, Anhui University, Hefei, 230601, China
- Zhan-Li Sun
  - School of Artificial Intelligence, Anhui University, Hefei, 230601, China
- Zhigang Zeng
  - School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
- Kin-Man Lam
  - Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
56
Robles-Loaiza AA, Pinos-Tamayo EA, Mendes B, Ortega-Pila JA, Proaño-Bolaños C, Plisson F, Teixeira C, Gomes P, Almeida JR. Traditional and Computational Screening of Non-Toxic Peptides and Approaches to Improving Selectivity. Pharmaceuticals (Basel) 2022; 15:323. [PMID: 35337121 PMCID: PMC8953747 DOI: 10.3390/ph15030323] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 01/26/2022] [Revised: 03/01/2022] [Accepted: 03/04/2022] [Indexed: 12/27/2022] Open
Abstract
Peptides have positively impacted the pharmaceutical industry as drugs, biomarkers, or diagnostic tools of high therapeutic value. However, only a handful have progressed to the market. Toxicity is one of the main obstacles to translating peptides into the clinic. Hemolysis or hemotoxicity, the principal source of toxicity, is a natural or disease-induced event leading to the death of vital red blood cells. Initial screenings for toxicity have been widely evaluated using erythrocytes as the gold standard. More recently, many online databases filled with peptide sequences and their biological meta-data have paved the way toward hemolysis prediction using user-friendly, fast-access machine learning-driven programs. This review details the growing contributions of in silico approaches developed in the last decade for the large-scale prediction of erythrocyte lysis induced by peptides. After an overview of the pharmaceutical landscape of peptide therapeutics, we highlighted the relevance of early hemolysis studies in drug development. We emphasized the computational models and algorithms used to this end in light of historical and recent findings in this promising field. We benchmarked seven predictors using peptides from different data sets, 7-35 amino acids in length. According to our predictions, the models scored accuracies above 50.42% and Matthews correlation coefficients above 0.11. The maximum values for these statistical parameters reached 100.0% and 1.00, respectively. Finally, strategies for optimizing peptide selectivity were described, as well as prospects for future investigations. The development of in silico predictive approaches to peptide toxicity has just started, but their important contributions clearly demonstrate their potential for peptide science and computer-aided drug design. Methodology refinement and increasing use will motivate the timely and accurate in silico identification of selective, non-toxic peptide therapeutics.
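The two benchmark statistics quoted in this abstract, accuracy and the Matthews correlation coefficient (MCC), can both be computed from a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration and do not come from the review.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a hemolysis predictor on 100 peptides.
tp, tn, fp, fn = 40, 30, 20, 10
print(round(accuracy(tp, tn, fp, fn), 3))
print(round(matthews_corrcoef(tp, tn, fp, fn), 3))
```

Unlike accuracy, MCC stays near zero for a predictor that merely follows the class imbalance, which is why an accuracy just above 50% can coincide with an MCC close to 0.1.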
Affiliation(s)
- Alberto A. Robles-Loaiza
  - Biomolecules Discovery Group, Universidad Regional Amazónica Ikiam, Tena 150150, Ecuador
- Edgar A. Pinos-Tamayo
  - Escuela Superior Politécnica del Litoral, ESPOL, Centro Nacional de Acuicultura e Investigaciones Marinas (CENAIM), Campus Gustavo Galindo Km. 30, 5 Vía Perimetral, Guayaquil 09-01-5863, Ecuador
- Bruno Mendes
  - Biomolecules Discovery Group, Universidad Regional Amazónica Ikiam, Tena 150150, Ecuador
- Josselyn A. Ortega-Pila
  - Biomolecules Discovery Group, Universidad Regional Amazónica Ikiam, Tena 150150, Ecuador
- Carolina Proaño-Bolaños
  - Biomolecules Discovery Group, Universidad Regional Amazónica Ikiam, Tena 150150, Ecuador
- Fabien Plisson
  - Consejo Nacional de Ciencia y Tecnología, Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación Y de Estudios Avanzados del IPN, Irapuato 36824, Mexico
- Cátia Teixeira
  - Laboratório Associado para a Química Verde-REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, 4169-007 Porto, Portugal
- Paula Gomes
  - Laboratório Associado para a Química Verde-REQUIMTE, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto, 4169-007 Porto, Portugal
- José R. Almeida
  - Biomolecules Discovery Group, Universidad Regional Amazónica Ikiam, Tena 150150, Ecuador
57
Ahmad S, Charoenkwan P, Quinn JMW, Moni MA, Hasan MM, Lio' P, Shoombuatong W. SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins. Sci Rep 2022; 12:4106. [PMID: 35260777 PMCID: PMC8904530 DOI: 10.1038/s41598-022-08173-5] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 01/24/2022] [Accepted: 03/03/2022] [Indexed: 12/30/2022] Open
Abstract
Fast and accurate identification of phage virion proteins (PVPs) would greatly facilitate antibacterial drug discovery and development. Although several research efforts based on machine learning (ML) methods have been made for in silico identification of PVPs, these methods have certain limitations. Therefore, in this study, we propose a new computational approach, termed SCORPION (StaCking-based Predictor fOR Phage VIrion PrOteiNs), to accurately identify PVPs using only protein primary sequences. Specifically, we explored 13 different feature descriptors from different aspects (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) with 10 popular ML algorithms to construct a pool of optimal baseline models. These optimal baseline models were then used to generate probabilistic features (PFs), which were treated as a new feature vector. Finally, we utilized a two-step feature selection strategy to determine the optimal PF feature vector and used this feature vector to develop a stacked model (SCORPION). Both tenfold cross-validation and independent test results indicate that SCORPION achieves better predictive performance than its constituent baseline models and existing methods. We anticipate SCORPION will serve as a useful tool for the cost-effective and large-scale screening of new PVPs. The source codes and datasets for this work are available for downloading in the GitHub repository (https://github.com/saeed344/SCORPION).
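The stacking idea described here (baseline models emit probabilities, which become the input features of a meta-model) can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the two baseline "models" are hypothetical composition scores rather than trained classifiers, the toy sequences and labels are invented, and the meta-model is a plain logistic regression fitted by gradient descent. In practice the probabilistic features are usually produced by cross-validated predictions to avoid leaking training labels, and the feature selection step is omitted here.

```python
import math


# Hypothetical baseline "models": each maps a sequence to a score in [0, 1].
def charged_fraction(seq):
    return sum(seq.count(residue) for residue in "KRDE") / len(seq)


def small_polar_fraction(seq):
    return sum(seq.count(residue) for residue in "GST") / len(seq)


BASELINE_MODELS = [charged_fraction, small_polar_fraction]


def probabilistic_features(seq):
    """Stack the baseline outputs into one probabilistic feature (PF) vector."""
    return [model(seq) for model in BASELINE_MODELS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_model(pf_vectors, labels, lr=1.0, epochs=500):
    """Fit logistic regression on PF vectors with batch gradient descent."""
    n_features = len(pf_vectors[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(pf_vectors, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(labels) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(labels)
    return weights, bias


def predict(seq, weights, bias):
    """Meta-model probability for one sequence."""
    x = probabilistic_features(seq)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Invented toy data: label 1 = positive class, 0 = negative class.
train_seqs = ["KKRDEKRA", "KRKRGDEE", "AAGSTLLV", "GSTAVLIK"]
train_labels = [1, 1, 0, 0]
pf_matrix = [probabilistic_features(s) for s in train_seqs]
weights, bias = train_meta_model(pf_matrix, train_labels)
print(predict("KKRDEKRA", weights, bias) > predict("AAGSTLLV", weights, bias))
```

The meta-model only ever sees the baseline outputs, so adding a new descriptor or algorithm amounts to appending one more entry to the PF vector.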
Affiliation(s)
- Saeed Ahmad
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
- Julian M W Quinn
  - Bone Biology Division, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, NSW, 2010, Australia
- Mohammad Ali Moni
  - Faculty of Health and Behavioural Sciences, School of Health and Rehabilitation Sciences, The University of Queensland, St Lucia, QLD, 4072, Australia
- Md Mehedi Hasan
  - Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
- Pietro Lio'
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
58
Wei L, Ye X, Sakurai T, Mu Z, Wei L. ToxIBTL: prediction of peptide toxicity based on information bottleneck and transfer learning. Bioinformatics 2022; 38:1514-1524. [PMID: 34999757 DOI: 10.1093/bioinformatics/btac006] [Citation(s) in RCA: 79] [Impact Index Per Article: 26.3] [Received: 10/29/2021] [Revised: 11/29/2021] [Accepted: 01/04/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: Recently, peptides have emerged as a promising class of pharmaceuticals for the treatment of various diseases, poised between traditional small molecule drugs and therapeutic proteins. However, one of the key bottlenecks preventing their use as therapeutics is their toxicity toward human cells, and few of the available algorithms for predicting toxicity are specially designed for short-length peptides. RESULTS: We present ToxIBTL, a novel deep learning framework that utilizes the information bottleneck principle and transfer learning to predict the toxicity of peptides as well as proteins. Specifically, we use evolutionary information and physicochemical properties of peptide sequences and integrate the information bottleneck principle into a feature representation learning scheme, by which relevant information is retained and redundant information is minimized in the obtained features. Moreover, transfer learning is introduced to transfer the common knowledge contained in proteins to peptides, which aims to improve the feature representation capability. Extensive experimental results demonstrate that ToxIBTL not only achieves a higher prediction performance than state-of-the-art methods on the peptide dataset, but also has a competitive performance on the protein dataset. Furthermore, a user-friendly online web server is established as the implementation of the proposed ToxIBTL. AVAILABILITY AND IMPLEMENTATION: The proposed ToxIBTL and data are freely accessible at http://server.wei-group.net/ToxIBTL. Our source code is available at https://github.com/WLYLab/ToxIBTL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lesong Wei
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Zengchao Mu
  - School of Mathematics and Statistics, Shandong University, Weihai, China
- Leyi Wei
  - School of Software, Shandong University, Jinan, China
59
Shoombuatong W, Basith S, Pitti T, Lee G, Manavalan B. THRONE: a new approach for accurate prediction of human RNA N7-methylguanosine sites. J Mol Biol 2022; 434:167549. [DOI: 10.1016/j.jmb.2022.167549] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/16/2021] [Revised: 03/08/2022] [Accepted: 03/10/2022] [Indexed: 12/30/2022]
60
Wei L, Long W, Wei L. MDL-CPI: multi-view deep learning model for compound-protein interaction prediction. Methods 2022; 204:418-427. [PMID: 35114401 DOI: 10.1016/j.ymeth.2022.01.008] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/15/2021] [Revised: 01/17/2022] [Accepted: 01/24/2022] [Indexed: 10/19/2022] Open
Abstract
Elucidating the mechanisms of Compound-Protein Interactions (CPIs) plays an essential role in drug discovery and development. Many computational efforts have been made to accelerate the development of this field. However, the current predictive performance is still not satisfactory, and existing methods consider only protein and compound features, ignoring their interactive information. In this study, we propose a multi-view deep learning method named MDL-CPI for CPI prediction. To sufficiently extract discriminative information, we introduce a hybrid architecture that leverages BERT (Bidirectional Encoder Representations from Transformers) and a CNN (Convolutional Neural Network) to extract protein features from a sequential perspective, uses a GNN (Graph Neural Network) to extract compound features from a structural perspective, and generates a unified feature space by using an AE2 (Autoencoder in Autoencoder Networks) network to learn the interactive information between the BERT-CNN and graph embeddings. Comparative results on benchmark datasets show that our proposed method exhibits better performance compared to existing CPI prediction methods, demonstrating the strong predictive ability of our model. Importantly, we demonstrate that the learned interactive information between compounds and proteins is critical to improving predictive performance. We release our source code and dataset at: https://github.com/Longwt123/MDL-CPI.
61
Charoenkwan P, Nantasenamat C, Hasan MM, Moni MA, Lio' P, Manavalan B, Shoombuatong W. StackDPPIV: A novel computational approach for accurate prediction of dipeptidyl peptidase IV (DPP-IV) inhibitory peptides. Methods 2021; 204:189-198. [PMID: 34883239 DOI: 10.1016/j.ymeth.2021.12.001] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 11/09/2021] [Revised: 11/30/2021] [Accepted: 12/01/2021] [Indexed: 12/12/2022] Open
Abstract
The development of efficient and effective bioinformatics tools and pipelines for identifying peptides with dipeptidyl peptidase IV (DPP-IV) inhibitory activities from large-scale protein datasets is of great importance for the discovery and development of potential and promising antidiabetic drugs. In this study, we present a novel stacking-based ensemble learning predictor (termed StackDPPIV) designed for identification of DPP-IV inhibitory peptides. Unlike the existing method, which is a single-feature-based approach, we combined five popular machine learning algorithms in conjunction with ten different feature encodings from multiple perspectives to generate a pool of various baseline models. Subsequently, the probabilistic features derived from these baseline models were systematically integrated and treated as new feature representations. Finally, in order to improve the predictive performance, a genetic algorithm based on the self-assessment-report approach was used to determine a set of informative probabilistic features, and the optimal set was then used to develop the final meta-predictor (StackDPPIV). Experimental results demonstrated that StackDPPIV could outperform its constituent baseline models on both the training and independent datasets. Furthermore, StackDPPIV achieved an accuracy of 0.891, MCC of 0.784 and AUC of 0.961, which were 9.4%, 19.0% and 11.4% higher, respectively, than those of the existing method on the independent test. Feature analysis demonstrated that our feature representations had more discriminative ability than conventional feature descriptors, which highlights that combining different features was essential for the performance improvement. To make the proposed predictor available, we built a user-friendly online web server at http://pmlabstack.pythonanywhere.com/StackDPPIV.
Affiliation(s)
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
- Chanin Nantasenamat
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Md Mehedi Hasan
  - Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Mohammad Ali Moni
  - School of Health and Rehabilitation Sciences, Faculty of Health and Behavioural Sciences, the University of Queensland St Lucia, QLD 4072, Australia
- Pietro Lio'
  - Department of Computer Science and Technology, University of Cambridge, Cambridge CB3 0FD, UK
- Balachandran Manavalan
  - Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
62
Malik AA, Chotpatiwetchkul W, Phanus-Umporn C, Nantasenamat C, Charoenkwan P, Shoombuatong W. StackHCV: a web-based integrative machine-learning framework for large-scale identification of hepatitis C virus NS5B inhibitors. J Comput Aided Mol Des 2021; 35:1037-1053. [PMID: 34622387 DOI: 10.1007/s10822-021-00418-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/07/2021] [Accepted: 09/17/2021] [Indexed: 01/07/2023]
Abstract
Fast and accurate identification of inhibitors with potency against HCV NS5B polymerase is currently a challenging task. Although conventional experimental methods are the gold standard for the design and development of new HCV inhibitors, they often require a costly investment of time and resources. In this study, we develop a novel machine learning-based meta-predictor (termed StackHCV) for accurate and large-scale identification of HCV inhibitors. Unlike the existing method, which is a single-feature-based approach, we first constructed a pool of various baseline models by employing a wide range of heterogeneous molecular fingerprints with five popular machine learning algorithms (k-nearest neighbor, multi-layer perceptron, partial least squares, random forest and support vector machine). Secondly, we integrated these baseline models in order to develop the final meta-based model by means of the stacking strategy. Extensive benchmarking experiments showed that StackHCV achieved a more accurate and stable performance than its constituent baseline models on the training dataset and also outperformed the existing predictor on the independent test dataset. To facilitate the high-throughput identification of HCV inhibitors, we built a web server that can be freely accessed at http://camt.pythonanywhere.com/StackHCV. It is expected that StackHCV could be a useful tool for fast and precise identification of potential drugs against HCV NS5B, particularly for liver cancer therapy and other clinical applications.
Affiliation(s)
- Aijaz Ahmad Malik
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Warot Chotpatiwetchkul
  - Applied Computational Chemistry Research Unit, Department of Chemistry, School of Science, King Mongkut's Institute of Technology Ladkrabang, Bangkok, 10520, Thailand
- Chuleeporn Phanus-Umporn
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Chanin Nantasenamat
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
63
Li M, Zhang W. PHIAF: prediction of phage-host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinform 2021; 23:6362109. [PMID: 34472593 DOI: 10.1093/bib/bbab348] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 01/02/2021] [Revised: 07/05/2021] [Accepted: 07/18/2021] [Indexed: 01/01/2023] Open
Abstract
Phage therapy has become one of the most promising alternatives to antibiotics in the treatment of bacterial diseases, and identifying phage-host interactions (PHIs) helps to understand the possible mechanism through which a phage infects bacteria to guide the development of phage therapy. Compared with wet experiments, computational methods of identifying PHIs reduce costs, save time, and are more effective and economical. In this paper, we propose a PHI prediction method with generative adversarial network (GAN)-based data augmentation and sequence-based feature fusion (PHIAF). First, PHIAF applies a GAN-based data augmentation module, which generates pseudo PHIs to alleviate data scarcity. Second, PHIAF fuses the features derived from DNA and protein sequences for better performance. Third, PHIAF utilizes an attention mechanism to weigh the different contributions of DNA/protein sequence-derived features, which also provides interpretability of the prediction model. In computational experiments, PHIAF outperforms other state-of-the-art PHI prediction methods when evaluated via 5-fold cross-validation (AUC and AUPR are 0.88 and 0.86, respectively). An ablation study shows that data augmentation, feature fusion and the attention mechanism all improve the prediction performance of PHIAF. Additionally, four new PHIs with the highest PHIAF scores in the case study were verified by recent literature. In conclusion, PHIAF is a promising tool to accelerate the exploration of phage therapy.
Affiliation(s)
- Menglu Li
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Wen Zhang
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China