1
|
Shahid, Hayat M, Alghamdi W, Akbar S, Raza A, Kadir RA, Sarker MR. pACP-HybDeep: predicting anticancer peptides using binary tree growth based transformer and structural feature encoding with deep-hybrid learning. Sci Rep 2025; 15:565. [PMID: 39747941 PMCID: PMC11695694 DOI: 10.1038/s41598-024-84146-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2024] [Accepted: 12/20/2024] [Indexed: 01/04/2025] Open
Abstract
Worldwide, Cancer remains a significant health concern due to its high mortality rates. Despite numerous traditional therapies and wet-laboratory methods for treating cancer-affected cells, these approaches often face limitations, including high costs and substantial side effects. Recently the high selectivity of peptides has garnered significant attention from scientists due to their reliable targeted actions and minimal adverse effects. Furthermore, keeping the significant outcomes of the existing computational models, we propose a highly reliable and effective model namely, pACP-HybDeep for the accurate prediction of anticancer peptides. In this model, training peptides are numerically encoded using an attention-based ProtBERT-BFD encoder to extract semantic features along with CTDT-based structural information. Furthermore, a k-nearest neighbor-based binary tree growth (BTG) algorithm is employed to select an optimal feature set from the multi-perspective vector. The selected feature vector is subsequently trained using a CNN + RNN-based deep learning model. Our proposed pACP-HybDeep model demonstrated a high training accuracy of 95.33%, and an AUC of 0.97. To validate the generalization capabilities of the model, our pACP-HybDeep model achieved accuracies of 94.92%, 92.26%, and 91.16% on independent datasets Ind-S1, Ind-S2, and Ind-S3, respectively. The demonstrated efficacy, and reliability of the pACP-HybDeep model using test datasets establish it as a valuable tool for researchers in academia and pharmaceutical drug design.
Collapse
Affiliation(s)
- Shahid
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan.
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Shahid Akbar
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan.
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China.
| | - Ali Raza
- Department of Computer Science, MY University, Islamabad, 45750, Pakistan
| | - Rabiah Abdul Kadir
- Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia.
| | - Mahidur R Sarker
- Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
- Universidad de Dise˜no, Innovaci´on y Tecnología, UDIT, Av. Alfonso XIII, 97, 28016, Madrid, Spain
| |
Collapse
|
2
|
Bhattarai S, Tayara H, Chong KT. Advancing Peptide-Based Cancer Therapy with AI: In-Depth Analysis of State-of-the-Art AI Models. J Chem Inf Model 2024; 64:4941-4957. [PMID: 38874445 DOI: 10.1021/acs.jcim.4c00295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2024]
Abstract
Anticancer peptides (ACPs) play a vital role in selectively targeting and eliminating cancer cells. Evaluating and comparing predictions from various machine learning (ML) and deep learning (DL) techniques is challenging but crucial for anticancer drug research. We conducted a comprehensive analysis of 15 ML and 10 DL models, including the models released after 2022, and found that support vector machines (SVMs) with feature combination and selection significantly enhance overall performance. DL models, especially convolutional neural networks (CNNs) with light gradient boosting machine (LGBM) based feature selection approaches, demonstrate improved characterization. Assessment using a new test data set (ACP10) identifies ACPred, MLACP 2.0, AI4ACP, mACPred, and AntiCP2.0_AAC as successive optimal predictors, showcasing robust performance. Our review underscores current prediction tool limitations and advocates for an omnidirectional ACP prediction framework to propel ongoing research.
Collapse
Affiliation(s)
- Sadik Bhattarai
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896 Jeollabuk-do, South Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896 Jeollabuk-do, South Korea
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896 Jeollabuk-do, South Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896 Jeollabuk-do, South Korea
| |
Collapse
|
3
|
Ghafoor H, Asim MN, Ibrahim MA, Ahmed S, Dengel A. CAPTURE: Comprehensive anti-cancer peptide predictor with a unique amino acid sequence encoder. Comput Biol Med 2024; 176:108538. [PMID: 38759585 DOI: 10.1016/j.compbiomed.2024.108538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Revised: 04/26/2024] [Accepted: 04/28/2024] [Indexed: 05/19/2024]
Abstract
Anticancer peptides (ACPs) key properties including bioactivity, high efficacy, low toxicity, and lack of drug resistance make them ideal candidates for cancer therapies. To deeply explore the potential of ACPs and accelerate development of cancer therapies, although 53 Artificial Intelligence supported computational predictors have been developed for ACPs and non ACPs classification but only one predictor has been developed for ACPs functional types annotations. Moreover, these predictors extract amino acids distribution patterns to transform peptides sequences into statistical vectors that are further fed to classifiers for discriminating peptides sequences and annotating peptides functional classes. Overall, these predictors remain fail in extracting diverse types of amino acids distribution patterns from peptide sequences. The paper in hand presents a unique CARE encoder that transforms peptides sequences into statistical vectors by extracting 4 different types of distribution patterns including correlation, distribution, composition, and transition. Across public benchmark dataset, proposed encoder potential is explored under two different evaluation settings namely; intrinsic and extrinsic. Extrinsic evaluation indicates that 12 different machine learning classifiers achieve superior performance with the proposed encoder as compared to 55 existing encoders. Furthermore, an intrinsic evaluation reveals that, unlike existing encoders, the proposed encoder generates more discriminative clusters for ACPs and non-ACPs classes. Across 8 public benchmark ACPs and non-ACPs classification datasets, proposed encoder and Adaboost classifier based CAPTURE predictor outperforms existing predictors with an average accuracy, recall and MCC score of 1%, 4%, and 2% respectively. In generalizeability evaluation case study, across 7 benchmark anti-microbial peptides classification datasets, CAPTURE surpasses existing predictors by an average AU-ROC of 2%. CAPTURE predictive pipeline along with label powerset method outperforms state-of-the-art ACPs functional types predictor by 5%, 5%, 5%, 6%, and 3% in terms of average accuracy, subset accuracy, precision, recall, and F1 respectively. CAPTURE web application is available at https://sds_genetic_analysis.opendfki.de/CAPTURE.
Collapse
Affiliation(s)
- Hina Ghafoor
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany.
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| |
Collapse
|
4
|
Zhang S, Zhao Y, Liang Y. AACFlow: an end-to-end model based on attention augmented convolutional neural network and flow-attention mechanism for identification of anticancer peptides. Bioinformatics 2024; 40:btae142. [PMID: 38452348 PMCID: PMC10973939 DOI: 10.1093/bioinformatics/btae142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2023] [Revised: 03/01/2024] [Accepted: 03/06/2024] [Indexed: 03/09/2024] Open
Abstract
MOTIVATION Anticancer peptides (ACPs) have natural cationic properties and can act on the anionic cell membrane of cancer cells to kill cancer cells. Therefore, ACPs have become a potential anticancer drug with good research value and prospect. RESULTS In this article, we propose AACFlow, an end-to-end model for identification of ACPs based on deep learning. End-to-end models have more room to automatically adjust according to the data, making the overall fit better and reducing error propagation. The combination of attention augmented convolutional neural network (AAConv) and multi-layer convolutional neural network (CNN) forms a deep representation learning module, which is used to obtain global and local information on the sequence. Based on the concept of flow network, multi-head flow-attention mechanism is introduced to mine the deep features of the sequence to improve the efficiency of the model. On the independent test dataset, the ACC, Sn, Sp, and AUC values of AACFlow are 83.9%, 83.0%, 84.8%, and 0.892, respectively, which are 4.9%, 1.5%, 8.0%, and 0.016 higher than those of the baseline model. The MCC value is 67.85%. In addition, we visualize the features extracted by each module to enhance the interpretability of the model. Various experiments show that our model is more competitive in predicting ACPs.
Collapse
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Ya Zhao
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Yunyun Liang
- School of Science, Xi’an Polytechnic University, Xi'an 710048, China
| |
Collapse
|
5
|
Sun M, Hu H, Pang W, Zhou Y. ACP-BC: A Model for Accurate Identification of Anticancer Peptides Based on Fusion Features of Bidirectional Long Short-Term Memory and Chemically Derived Information. Int J Mol Sci 2023; 24:15447. [PMID: 37895128 PMCID: PMC10607064 DOI: 10.3390/ijms242015447] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2023] [Revised: 09/10/2023] [Accepted: 10/20/2023] [Indexed: 10/29/2023] Open
Abstract
Anticancer peptides (ACPs) have been proven to possess potent anticancer activities. Although computational methods have emerged for rapid ACPs identification, their accuracy still needs improvement. In this study, we propose a model called ACP-BC, a three-channel end-to-end model that utilizes various combinations of data augmentation techniques. In the first channel, features are extracted from the raw sequence using a bidirectional long short-term memory network. In the second channel, the entire sequence is converted into a chemical molecular formula, which is further simplified using Simplified Molecular Input Line Entry System notation to obtain deep abstract features through a bidirectional encoder representation transformer (BERT). In the third channel, we manually selected four effective features according to dipeptide composition, binary profile feature, k-mer sparse matrix, and pseudo amino acid composition. Notably, the application of chemical BERT in predicting ACPs is novel and successfully integrated into our model. To validate the performance of our model, we selected two benchmark datasets, ACPs740 and ACPs240. ACP-BC achieved prediction accuracy with 87% and 90% on these two datasets, respectively, representing improvements of 1.3% and 7% compared to existing state-of-the-art methods on these datasets. Therefore, systematic comparative experiments have shown that the ACP-BC can effectively identify anticancer peptides.
Collapse
Affiliation(s)
- Mingwei Sun
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (M.S.); (H.H.)
| | - Haoyuan Hu
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (M.S.); (H.H.)
| | - Wei Pang
- School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK;
| | - You Zhou
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (M.S.); (H.H.)
- College of Software, Jilin University, Changchun 130012, China
| |
Collapse
|
6
|
Liu D, Lin Z, Jia C. NeuroCNN_GNB: an ensemble model to predict neuropeptides based on a convolution neural network and Gaussian naive Bayes. Front Genet 2023; 14:1226905. [PMID: 37576553 PMCID: PMC10414792 DOI: 10.3389/fgene.2023.1226905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2023] [Accepted: 06/30/2023] [Indexed: 08/15/2023] Open
Abstract
Neuropeptides contain more chemical information than other classical neurotransmitters and have multiple receptor recognition sites. These characteristics allow neuropeptides to have a correspondingly higher selectivity for nerve receptors and fewer side effects. Traditional experimental methods, such as mass spectrometry and liquid chromatography technology, still need the support of a complete neuropeptide precursor database and the basic characteristics of neuropeptides. Incomplete neuropeptide precursor and information databases will lead to false-positives or reduce the sensitivity of recognition. In recent years, studies have proven that machine learning methods can rapidly and effectively predict neuropeptides. In this work, we have made a systematic attempt to create an ensemble tool based on four convolution neural network models. These baseline models were separately trained on one-hot encoding, AAIndex, G-gap dipeptide encoding and word2vec and integrated using Gaussian Naive Bayes (NB) to construct our predictor designated NeuroCNN_GNB. Both 5-fold cross-validation tests using benchmark datasets and independent tests showed that NeuroCNN_GNB outperformed other state-of-the-art methods. Furthermore, this novel framework provides essential interpretations that aid the understanding of model success by leveraging the powerful Shapley Additive exPlanation (SHAP) algorithm, thereby highlighting the most important features relevant for predicting neuropeptides.
Collapse
Affiliation(s)
- Di Liu
- Information Science and Technology College, Dalian Maritime University, Dalian, China
| | - Zhengkui Lin
- Information Science and Technology College, Dalian Maritime University, Dalian, China
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian, China
| |
Collapse
|
7
|
Liang Y, Ma X. iACP-GE: accurate identification of anticancer peptides by using gradient boosting decision tree and extra tree. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2023; 34:1-19. [PMID: 36562289 DOI: 10.1080/1062936x.2022.2160011] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/24/2022] [Accepted: 12/12/2022] [Indexed: 06/17/2023]
Abstract
Cancer is one of the main diseases threatening human life, accounting for millions of deaths around the world each year. Traditional physical and chemical methods for cancer treatment are extremely time-consuming, lab-intensive, expensive, inefficient and difficult to be applied in a high-throughput way. Hence, it is an urgent task to develop automated computational methods to enable fast and accurate identification of anticancer peptides (ACPs). In this paper, we develop a novel model named iACP-GE to identify ACPs. Multi-features are extracted by using binary encoding, enhanced grouped amino acid composition and BLOSUM62 encoding based on the N5C5 sequence, as well as detrended forward moving-average auto-cross correlation analysis based on physicochemical properties of 20 natural amino acids. Thus, 835 features are obtained for each sample, in order to avoid information redundancy, gradient boosting decision tree was adopted as the feature selection strategy. Then, the optimal feature subset is input to the extra tree classifier. The accuracies of ACP740 and ACP240 datasets with the 5-fold cross-validation were 90.54% and 91.25%, respectively. Experimental results indicate that iACP-GE significantly outperforms several existing models on ACP740 and ACP240 datasets and can be used as an effective tool for the identification of ACPs. The datasets and source codes for iACP-GE are available at https://github.com/yunyunliang88/iACP-GE.
Collapse
Affiliation(s)
- Y Liang
- School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
| | - X Ma
- School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
| |
Collapse
|
8
|
Zhou C, Peng D, Liao B, Jia R, Wu F. ACP_MS: prediction of anticancer peptides based on feature extraction. Brief Bioinform 2022; 23:6793775. [PMID: 36326080 DOI: 10.1093/bib/bbac462] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 09/10/2022] [Accepted: 09/27/2022] [Indexed: 11/06/2022] Open
Abstract
Anticancer peptides (ACPs) are bioactive peptides with antitumor activity and have become the most promising drugs in the treatment of cancer. Therefore, the accurate prediction of ACPs is of great significance to the research of cancer diseases. In the paper, we developed a more efficient prediction model called ACP_MS. Firstly, the monoMonoKGap method is used to extract the characteristic of anticancer peptide sequences and form the digital features. Then, the AdaBoost model is used to select the most discriminating features from the digital features. Finally, a stochastic gradient descent algorithm is introduced to identify anticancer peptide sequences. We adopt 7-fold cross-validation and independent test set validation, and the final accuracy of the main dataset reached 92.653% and 91.597%, respectively. The accuracy of the alternate dataset reached 98.678% and 98.317%, respectively. Compared with other advanced prediction models, the ACP_MS model improves the identification ability of anticancer peptide sequences. The data of this model can be downloaded from the public website for free https://github.com/Zhoucaimao1998/Zc.
Collapse
Affiliation(s)
- Caimao Zhou
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Dejun Peng
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Ranran Jia
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Fangxiang Wu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
9
|
ACP-ADA: A Boosting Method with Data Augmentation for Improved Prediction of Anticancer Peptides. Int J Mol Sci 2022; 23:ijms232012194. [PMID: 36293050 PMCID: PMC9603247 DOI: 10.3390/ijms232012194] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2022] [Revised: 10/08/2022] [Accepted: 10/11/2022] [Indexed: 11/30/2022] Open
Abstract
Cancer is the second-leading cause of death worldwide, and therapeutic peptides that target and destroy cancer cells have received a great deal of interest in recent years. Traditional wet experiments are expensive and inefficient for identifying novel anticancer peptides; therefore, the development of an effective computational approach is essential to recognize ACP candidates before experimental methods are used. In this study, we proposed an Ada-boosting algorithm with the base learner random forest called ACP-ADA, which integrates binary profile feature, amino acid index, and amino acid composition with a 210-dimensional feature space vector to represent the peptides. Training samples in the feature space were augmented to increase the sample size and further improve the performance of the model in the case of insufficient samples. Furthermore, we used five-fold cross-validation to find model parameters, and the cross-validation results showed that ACP-ADA outperforms existing methods for this feature combination with data augmentation in terms of performance metrics. Specifically, ACP-ADA recorded an average accuracy of 86.4% and a Mathew’s correlation coefficient of 74.01% for dataset ACP740 and 90.83% and 81.65% for dataset ACP240; consequently, it can be a very useful tool in drug development and biomedical research.
Collapse
|
10
|
Zakharova E, Orsi M, Capecchi A, Reymond J. Machine Learning Guided Discovery of Non-Hemolytic Membrane Disruptive Anticancer Peptides. ChemMedChem 2022; 17:e202200291. [PMID: 35880810 PMCID: PMC9541320 DOI: 10.1002/cmdc.202200291] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2022] [Revised: 06/29/2022] [Indexed: 12/05/2022]
Abstract
Most antimicrobial peptides (AMPs) and anticancer peptides (ACPs) fold into membrane disruptive cationic amphiphilic α-helices, many of which are however also unpredictably hemolytic and toxic. Here we exploited the ability of recurrent neural networks (RNN) to distinguish active from inactive and non-hemolytic from hemolytic AMPs and ACPs to discover new non-hemolytic ACPs. Our discovery pipeline involved: 1) sequence generation using either a generative RNN or a genetic algorithm, 2) RNN classification for activity and hemolysis, 3) selection for sequence novelty, helicity and amphiphilicity, and 4) synthesis and testing. Experimental evaluation of thirty-three peptides resulted in eleven active ACPs, four of which were non-hemolytic, with properties resembling those of the natural ACP lasioglossin III. These experiments show the first example of direct machine learning guided discovery of non-hemolytic ACPs.
Collapse
Affiliation(s)
- Elena Zakharova
- Department of ChemistryBiochemistry and Pharmaceutical SciencesUniversity of BernFreiestrasse 33012BernSwitzerland
| | - Markus Orsi
- Department of ChemistryBiochemistry and Pharmaceutical SciencesUniversity of BernFreiestrasse 33012BernSwitzerland
| | - Alice Capecchi
- Department of ChemistryBiochemistry and Pharmaceutical SciencesUniversity of BernFreiestrasse 33012BernSwitzerland
| | - Jean‐Louis Reymond
- Department of ChemistryBiochemistry and Pharmaceutical SciencesUniversity of BernFreiestrasse 33012BernSwitzerland
| |
Collapse
|
11
|
Akbar S, Hayat M, Tahir M, Khan S, Alarfaj FK. cACP-DeepGram: Classification of anticancer peptides via deep neural network and skip-gram-based word embedding model. Artif Intell Med 2022; 131:102349. [DOI: 10.1016/j.artmed.2022.102349] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2021] [Revised: 05/24/2022] [Accepted: 07/04/2022] [Indexed: 12/28/2022]
|
12
|
Feng C, Wu J, Wei H, Xu L, Zou Q. CRCF: A Method of Identifying Secretory Proteins of Malaria Parasites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2149-2157. [PMID: 34061749 DOI: 10.1109/tcbb.2021.3085589] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Malaria is a mosquito-borne disease that results in millions of cases and deaths annually. The development of a fast computational method that identifies secretory proteins of the malaria parasite is important for research on antimalarial drugs and vaccines. Thus, a method was developed to identify the secretory proteins of malaria parasites. In this method, a reduced alphabet was selected to recode the original protein sequence. A feature synthesis method was used to synthesise three different types of feature information. Finally, the random forest method was used as a classifier to identify the secretory proteins. In addition, a web server was developed to share the proposed algorithm. Experiments using the benchmark dataset demonstrated that the overall accuracy achieved by the proposed method was greater than 97.8 percent using the 10-fold cross-validation method. Furthermore, the reduced schemes and characteristic performance analyses are discussed.
Collapse
|
13
|
To Assist Oncologists: An Efficient Machine Learning-Based Approach for Anti-Cancer Peptides Classification. SENSORS 2022; 22:s22114005. [PMID: 35684624 PMCID: PMC9185351 DOI: 10.3390/s22114005] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/04/2022] [Revised: 05/19/2022] [Accepted: 05/20/2022] [Indexed: 12/10/2022]
Abstract
In the modern technological era, Anti-cancer peptides (ACPs) have been considered a promising cancer treatment. It’s critical to find new ACPs to ensure a better knowledge of their functioning processes and vaccine development. Thus, timely and efficient ACPs using a computational technique are highly needed because of the enormous peptide sequences generated in the post-genomic era. Recently, numerous adaptive statistical algorithms have been developed for separating ACPs and NACPs. Despite great advancements, existing approaches still have insufficient feature descriptors and learning methods, limiting predictive performance. To address this, a trustworthy framework is developed for the precise identification of ACPs. Particularly, the presented approach incorporates four hypothetical feature encoding mechanisms namely: amino acid, dipeptide, tripeptide, and an improved version of pseudo amino acid composition are applied to indicate the motif of the target class. Moreover, principal component analysis (PCA) is employed for feature pruning, while selecting optimal, deep, and highly variated features. Due to the diverse nature of learning, experiments are performed over numerous algorithms to select the optimum operating method. After investigating the empirical outcomes, the support vector machine with hybrid feature space shows better performance. The proposed framework achieved an accuracy of 97.09% and 98.25% over the benchmark and independent datasets, respectively. The comparative analysis demonstrates that our proposed model outperforms as compared to the existing methods and is beneficial in drug development, and oncology.
Collapse
|
14
|
Multi-channel CNN based anticancer peptides identification. Anal Biochem 2022; 650:114707. [PMID: 35568159 DOI: 10.1016/j.ab.2022.114707] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2021] [Revised: 01/27/2022] [Accepted: 04/27/2022] [Indexed: 11/20/2022]
Abstract
Cancer is one of the most dangerous diseases in the world that often leads to misery and death. Current treatments include different kinds of anticancer therapy which exhibit different types of side effects. Because of certain physicochemical properties, anticancer peptides (ACPs) have opened a new path of treatments for this deadly disease. That is why a well-performed methodology for identifying novel anticancer peptides has great importance in the fight against cancer. In addition to the laboratory techniques, various machine learning and deep learning methodologies have developed in recent years for this task. Although these models have shown reasonable predictive ability, there's still room for improvement in terms of performance and exploring new types of algorithms. In this work, we have proposed a novel multi-channel convolutional neural network (CNN) for identifying anticancer peptides from protein sequences. We have collected data from the existing state-of-the-art methodologies and applied binary encoding for data preprocessing. We have also employed k-fold cross-validation to train our models on benchmark datasets and compared our models' performance on the independent datasets. The comparison has indicated our models' superiority on various evaluation metrics. We think our work can be a valuable asset in finding novel anticancer peptides. We have provided a user-friendly web server for academic purposes and it is publicly available at: \texttt{http://103.99.176.239/iacp-cnn/}.
Collapse
|
15
|
Development of Anticancer Peptides Using Artificial Intelligence and Combinational Therapy for Cancer Therapeutics. Pharmaceutics 2022; 14:pharmaceutics14050997. [PMID: 35631583 PMCID: PMC9147327 DOI: 10.3390/pharmaceutics14050997] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2022] [Revised: 04/28/2022] [Accepted: 05/04/2022] [Indexed: 01/27/2023] Open
Abstract
Cancer is a group of diseases causing abnormal cell growth, altering the genome, and invading or spreading to other parts of the body. Among therapeutic peptide drugs, anticancer peptides (ACPs) have been considered to target and kill cancer cells because cancer cells have unique characteristics such as a high negative charge and abundance of microvilli in the cell membrane when compared to a normal cell. ACPs have several advantages, such as high specificity, cost-effectiveness, low immunogenicity, minimal toxicity, and high tolerance under normal physiological conditions. However, the development and identification of ACPs are time-consuming and expensive in traditional wet-lab-based approaches. Thus, the application of artificial intelligence on the approaches can save time and reduce the cost to identify candidate ACPs. Recently, machine learning (ML), deep learning (DL), and hybrid learning (ML combined DL) have emerged into the development of ACPs without experimental analysis, owing to advances in computer power and big data from the power system. Additionally, we suggest that combination therapy with classical approaches and ACPs might be one of the impactful approaches to increase the efficiency of cancer therapy.
Collapse
|
16
|
Breast and Lung Anticancer Peptides Classification Using N-Grams and Ensemble Learning Techniques. BIG DATA AND COGNITIVE COMPUTING 2022. [DOI: 10.3390/bdcc6020040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Abstract
Anticancer peptides (ACPs) are short protein sequences; they perform functions like some hormones and enzymes inside the body. The role of any protein or peptide is related to its structure and the sequence of amino acids that make up it. There are 20 types of amino acids in humans, and each of them has a particular characteristic according to its chemical structure. Current machine and deep learning models have been used to classify ACPs problems. However, these models have neglected Amino Acid Repeats (AARs) that play an essential role in the function and structure of peptides. Therefore, in this paper, ACPs offer a promising route for novel anticancer peptides by extracting AARs based on N-Grams and k-mers using two peptides’ datasets. These datasets pointed to breast and lung cancer cells assembled and curated manually from the Cancer Peptide and Protein Database (CancerPPD). Every dataset consists of a sequence of peptides and their synthesis and anticancer activity on breast and lung cancer cell lines. Five different feature selection methods were used in this paper to improve classification performance and reduce the experimental costs. After that, ACPs were classified using four classifiers, namely AdaBoost, Random Forest Tree (RFT), Multi-class Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). These classifiers were evaluated by applying five well-known evaluation metrics. Experimental results showed that the breast and lung ACPs classification process provided an accurate performance that reached 89.25% and 92.56%, respectively. In terms of AUC, it reached 95.35% and 96.92% for both breast and lung ACPs, respectively. The proposed classifiers performed competently somewhat equally in AUC, accuracy, precision, F-measures, and recall, except for Multi-class SVM-based feature selection, which showed superior performance. As a result, this paper significantly improved the predictive performance that can effectively distinguish ACPs as virtual inactive, experimental inactive, moderately active, and very active.
Collapse
|
17
|
Zhang H, Zou Q, Ju Y, Song C, Chen D. Distance-based support vector machine to predict DNA N6-methyladenine modification. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220404145517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
DNA N6-methyladenine plays an important role in the restriction-modification system to isolate invasion from adventive DNA. The shortcomings of the high time-consumption and high costs of experimental methods have been exposed, and some computational methods have emerged. The support vector machine theory has received extensive attention in the bioinformatics field due to its solid theoretical foundation and many good characteristics.
Objective:
General machine learning methods include an important step of extracting features. The research has omitted this step and replaced with easy-to-obtain sequence distances matrix to obtain better results
Method:
First sequence alignment technology was used to achieve the similarity matrix. Then a novel transformation turned the similarity matrix into a distance matrix. Next, the similarity-distance matrix is made positive semi-definite so that it can be used in the kernel matrix. Finally, the LIBSVM software was applied to solve the support vector machine.
Results:
The five-fold cross-validation of this model on rice and mouse data has achieved excellent accuracy rates of 92.04% and 96.51%, respectively. This shows that the DB-SVM method has obvious advantages compared with traditional machine learning methods. Meanwhile this model achieved 0.943,0.982 and 0.818 accuracy,0.944, 0.982, and 0.838 Matthews correlation coefficient and 0.942, 0.982 and 0.840 F1 scores for the rice, M. musculus and cross-species genome datasets, respectively.
Conclusion:
These outcomes show that this model outperforms the iIM-CNN and csDMA in the prediction of DNA 6mA modification, which are the lastest research on DNA 6mA.
Collapse
Affiliation(s)
- Haoyu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chenggang Song
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
| |
Collapse
|
18
|
Delaunay M, Ha-Duong T. Computational Tools and Strategies to Develop Peptide-Based Inhibitors of Protein-Protein Interactions. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2022; 2405:205-230. [PMID: 35298816 DOI: 10.1007/978-1-0716-1855-4_11] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Protein-protein interactions play crucial and subtle roles in many biological processes and modifications of their fine mechanisms generally result in severe diseases. Peptide derivatives are very promising therapeutic agents for modulating protein-protein associations with sizes and specificities between those of small compounds and antibodies. For the same reasons, rational design of peptide-based inhibitors naturally borrows and combines computational methods from both protein-ligand and protein-protein research fields. In this chapter, we aim to provide an overview of computational tools and approaches used for identifying and optimizing peptides that target protein-protein interfaces with high affinity and specificity. We hope that this review will help to implement appropriate in silico strategies for peptide-based drug design that builds on available information for the systems of interest.
Collapse
Affiliation(s)
| | - Tâp Ha-Duong
- Université Paris-Saclay, CNRS, BioCIS, Châtenay-Malabry, France.
| |
Collapse
|
19
|
Lin X. Genomic Variation Prediction: A Summary From Different Views. Front Cell Dev Biol 2021; 9:795883. [PMID: 34901036 PMCID: PMC8656232 DOI: 10.3389/fcell.2021.795883] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2021] [Accepted: 11/11/2021] [Indexed: 12/02/2022] Open
Abstract
Structural variations in the genome are closely related to human health and the occurrence and development of various diseases. To understand the mechanisms of diseases, find pathogenic targets, and carry out personalized precision medicine, it is critical to detect such variations. The rapid development of high-throughput sequencing technologies has accelerated the accumulation of large amounts of genomic mutation data, including synonymous mutations. Identifying pathogenic synonymous mutations that play important roles in the occurrence and development of diseases from all the available mutation data is of great importance. In this paper, machine learning theories and methods are reviewed, efficient and accurate pathogenic synonymous mutation prediction methods are developed, and a standardized three-level variant analysis framework is constructed. In addition, multiple variation tolerance prediction models are studied and integrated, and new ideas for structural variation detection based on deep information mining are explored.
Collapse
Affiliation(s)
- Xiuchun Lin
- College of Information and Electrical Engineering, China Agricultural University, Beijing, China
| |
Collapse
|
20
|
Ahmed S, Muhammod R, Khan ZH, Adilina S, Sharma A, Shatabda S, Dehzangi A. ACP-MHCNN: an accurate multi-headed deep-convolutional neural network to predict anticancer peptides. Sci Rep 2021; 11:23676. [PMID: 34880291 PMCID: PMC8654959 DOI: 10.1038/s41598-021-02703-3] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2021] [Accepted: 11/17/2021] [Indexed: 01/10/2023] Open
Abstract
Although advancing the therapeutic alternatives for treating deadly cancers has gained much attention globally, still the primary methods such as chemotherapy have significant downsides and low specificity. Most recently, Anticancer peptides (ACPs) have emerged as a potential alternative to therapeutic alternatives with much fewer negative side-effects. However, the identification of ACPs through wet-lab experiments is expensive and time-consuming. Hence, computational methods have emerged as viable alternatives. During the past few years, several computational ACP identification techniques using hand-engineered features have been proposed to solve this problem. In this study, we propose a new multi headed deep convolutional neural network model called ACP-MHCNN, for extracting and combining discriminative features from different information sources in an interactive way. Our model extracts sequence, physicochemical, and evolutionary based features for ACP identification using different numerical peptide representations while restraining parameter overhead. It is evident through rigorous experiments using cross-validation and independent-dataset that ACP-MHCNN outperforms other models for anticancer peptide identification by a substantial margin on our employed benchmarks. ACP-MHCNN outperforms state-of-the-art model by 6.3%, 8.6%, 3.7%, 4.0%, and 0.20 in terms of accuracy, sensitivity, specificity, precision, and MCC respectively. ACP-MHCNN and its relevant codes and datasets are publicly available at: https://github.com/mrzResearchArena/Anticancer-Peptides-CNN . ACP-MHCNN is also publicly available as an online predictor at: https://anticancer.pythonanywhere.com/ .
Collapse
Affiliation(s)
- Sajid Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Rafsanjani Muhammod
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Zahid Hossain Khan
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Sheikh Adilina
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, 4111, Australia
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
| | - Abdollah Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ, 08102, USA.
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, 08102, USA.
| |
Collapse
|
21
|
Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med 2021; 19:449. [PMID: 34706730 PMCID: PMC8554859 DOI: 10.1186/s12967-021-03084-x] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2021] [Accepted: 09/16/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Cancer is one of the most serious diseases threatening human health. Cancer immunotherapy represents the most promising treatment strategy due to its high efficacy and selectivity and lower side effects compared with traditional treatment. The identification of tumor T cell antigens is one of the most important tasks for antitumor vaccines development and molecular function investigation. Although several machine learning predictors have been developed to identify tumor T cell antigen, more accurate tumor T cell antigen identification by existing methodology is still challenging. METHODS In this study, we used a non-redundant dataset of 592 tumor T cell antigens (positive samples) and 393 tumor T cell antigens (negative samples). Four types feature encoding methods have been studied to build an efficient predictor, including amino acid composition, global protein sequence descriptors and grouped amino acid and peptide composition. To improve the feature representation ability of the hybrid features, we further employed a two-step feature selection technique to search for the optimal feature subset. The final prediction model was constructed using random forest algorithm. RESULTS Finally, the top 263 informative features were selected to train the random forest classifier for detecting tumor T cell antigen peptides. iTTCA-RF provides satisfactory performance, with balanced accuracy, specificity and sensitivity values of 83.71%, 78.73% and 88.69% over tenfold cross-validation as well as 73.14%, 62.67% and 83.61% over independent tests, respectively. The online prediction server was freely accessible at http://lab.malab.cn/~acy/iTTCA . CONCLUSIONS We have proven that the proposed predictor iTTCA-RF is superior to the other latest models, and will hopefully become an effective and useful tool for identifying tumor T cell antigens presented in the context of major histocompatibility complex class I.
Collapse
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Huannan Guo
- Department of Oncology, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
| |
Collapse
|
22
|
Niu M, Wu J, Zou Q, Liu Z, Xu L. rBPDL:Predicting RNA-Binding Proteins Using Deep Learning. IEEE J Biomed Health Inform 2021; 25:3668-3676. [PMID: 33780344 DOI: 10.1109/jbhi.2021.3069259] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
RNA-binding protein (RBP) is a powerful and wide-ranging regulator that plays an important role in cell development, differentiation, metabolism, health and disease. The prediction of RBPs provides valuable guidance for biologists. Although experimental methods have made great progress in predicting RBP, they are time-consuming and not flexible. Therefore, we developed a network model, rBPDL, by combining a convolutional neural network and long short-term memory for multilabel classification of RBPs. Moreover, to achieve better prediction results, we used a voting algorithm for ensemble learning of the model. We compared rBPDL with state-of-the-art methods and found that rBPDL significantly improved identification performance for the RBP68 dataset, with a macro-Area Under Curve (AUC), micro-AUC, and weighted AUC of 0.936, 0.962, and 0.946, respectively. Furthermore, through AUC statistical analysis of the RBP domain, we analyzed the performance of rBPDL and found that the RBP identification performance in the same domain was similar. In addition, we analyzed the performance preferences and physicochemical properties of the binding protein amino acids and explored the characteristics that affect the binding by using the RBP86 dataset.
Collapse
|
23
|
Liang X, Li F, Chen J, Li J, Wu H, Li S, Song J, Liu Q. Large-scale comparative review and assessment of computational methods for anti-cancer peptide identification. Brief Bioinform 2021; 22:bbaa312. [PMID: 33316035 PMCID: PMC8294543 DOI: 10.1093/bib/bbaa312] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2020] [Revised: 09/30/2020] [Accepted: 08/25/2020] [Indexed: 12/13/2022] Open
Abstract
Anti-cancer peptides (ACPs) are known as potential therapeutics for cancer. Due to their unique ability to target cancer cells without affecting healthy cells directly, they have been extensively studied. Many peptide-based drugs are currently evaluated in the preclinical and clinical trials. Accurate identification of ACPs has received considerable attention in recent years; as such, a number of machine learning-based methods for in silico identification of ACPs have been developed. These methods promote the research on the mechanism of ACPs therapeutics against cancer to some extent. There is a vast difference in these methods in terms of their training/testing datasets, machine learning algorithms, feature encoding schemes, feature selection methods and evaluation strategies used. Therefore, it is desirable to summarize the advantages and disadvantages of the existing methods, provide useful insights and suggestions for the development and improvement of novel computational tools to characterize and identify ACPs. With this in mind, we firstly comprehensively investigate 16 state-of-the-art predictors for ACPs in terms of their core algorithms, feature encoding schemes, performance evaluation metrics and webserver/software usability. Then, comprehensive performance assessment is conducted to evaluate the robustness and scalability of the existing predictors using a well-prepared benchmark dataset. We provide potential strategies for the model performance improvement. Moreover, we propose a novel ensemble learning framework, termed ACPredStackL, for the accurate identification of ACPs. ACPredStackL is developed based on the stacking ensemble strategy combined with SVM, Naïve Bayesian, lightGBM and KNN. Empirical benchmarking experiments against the state-of-the-art methods demonstrate that ACPredStackL achieves a comparative performance for predicting ACPs. The webserver and source code of ACPredStackL is freely available at http://bigdata.biocie.cn/ACPredStackL/ and https://github.com/liangxiaoq/ACPredStackL, respectively.
Collapse
Affiliation(s)
- Xiao Liang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling, Shaanxi 712100, China
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC 3800, Australia
- Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Victoria, Australia
| | - Jinxiang Chen
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Junlong Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Hao Wu
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Shuqin Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling, Shaanxi 712100, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Yangling, Shaanxi 712100, China
| |
Collapse
|
24
|
Chen XG, Zhang W, Yang X, Li C, Chen H. ACP-DA: Improving the Prediction of Anticancer Peptides Using Data Augmentation. Front Genet 2021; 12:698477. [PMID: 34276801 PMCID: PMC8279753 DOI: 10.3389/fgene.2021.698477] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2021] [Accepted: 06/07/2021] [Indexed: 12/09/2022] Open
Abstract
Anticancer peptides (ACPs) have provided a promising perspective for cancer treatment, and the prediction of ACPs is very important for the discovery of new cancer treatment drugs. It is time consuming and expensive to use experimental methods to identify ACPs, so computational methods for ACP identification are urgently needed. There have been many effective computational methods, especially machine learning-based methods, proposed for such predictions. Most of the current machine learning methods try to find suitable features or design effective feature learning techniques to accurately represent ACPs. However, the performance of these methods can be further improved for cases with insufficient numbers of samples. In this article, we propose an ACP prediction model called ACP-DA (Data Augmentation), which uses data augmentation for insufficient samples to improve the prediction performance. In our method, to better exploit the information of peptide sequences, peptide sequences are represented by integrating binary profile features and AAindex features, and then the samples in the training set are augmented in the feature space. After data augmentation, the samples are used to train the machine learning model, which is used to predict ACPs. The performance of ACP-DA exceeds that of existing methods, and ACP-DA achieves better performance in the prediction of ACPs compared with a method without data augmentation. The proposed method is available at http://github.com/chenxgscuec/ACPDA.
Collapse
Affiliation(s)
- Xian-Gan Chen
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, China.,Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China
| | - Xiaofei Yang
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Chenhong Li
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Hengling Chen
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| |
Collapse
|
25
|
Zhu W, Guo Y, Zou Q. Prediction of presynaptic and postsynaptic neurotoxins based on feature extraction. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:5943-5958. [PMID: 34517517 DOI: 10.3934/mbe.2021297] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
A neurotoxin is essentially a protein that mainly acts on the nervous system; it has a selective toxic effect on the central nervous system and neuromuscular nodes, can cause muscle paralysis and respiratory paralysis, and has strong lethality. According to their principle of action, neurotoxins are divided into presynaptic neurotoxins and postsynaptic neurotoxins. Correctly identifying presynaptic and postsynaptic nerve toxins provides important clues for future drug development and the discovery of drug targets. Therefore, a predictive model, Neu_LR, was constructed in this paper. The monoMonokGap method was used to extract the frequency characteristics of presynaptic and postsynaptic neurotoxin sequences and carry out feature selection, then, based on the important features obtained after dimensionality reduction, the prediction model Neu_LR was constructed using a logistic regression algorithm, and ten-fold cross-validation and independent test set validation were used. The final accuracy rates were 99.6078 and 94.1176%, respectively, which proved that the Neu_LR model had good predictive performance and robustness, and could meet the prediction requirements of presynaptic and postsynaptic neurotoxins. The data and source code of the model can be freely download from https://github.com/gyx123681/.
Collapse
Affiliation(s)
- Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Yuxin Guo
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| |
Collapse
|
26
|
Huang KY, Tseng YJ, Kao HJ, Chen CH, Yang HH, Weng SL. Identification of subtypes of anticancer peptides based on sequential features and physicochemical properties. Sci Rep 2021; 11:13594. [PMID: 34193950 PMCID: PMC8245499 DOI: 10.1038/s41598-021-93124-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2020] [Accepted: 06/08/2021] [Indexed: 11/25/2022] Open
Abstract
Anticancer peptides (ACPs) are a kind of bioactive peptides which could be used as a novel type of anticancer drug that has several advantages over chemistry-based drug, including high specificity, strong tumor penetration capacity, and low toxicity to normal cells. As the number of experimentally verified bioactive peptides has increased significantly, various of in silico approaches are imperative for investigating the characteristics of ACPs. However, the lack of methods for investigating the differences in physicochemical properties of ACPs. In this study, we compared the N- and C-terminal amino acid composition for each peptide, there are three major subtypes of ACPs that are defined based on the distribution of positively charged residues. For the first time, we were motivated to develop a two-step machine learning model for identification of the subtypes of ACPs, which classify the input data into the corresponding group before applying the classifier. Further, to improve the predictive power, the hybrid feature sets were considered for prediction. Evaluation by five-fold cross-validation showed that the two-step model trained with sequence-based features and physicochemical properties was most effective in discriminating between ACPs and non-ACPs. The two-step model trained with the hybrid features performed well, with a sensitivity of 86.75%, a specificity of 85.75%, an accuracy of 86.08%, and a Matthews Correlation Coefficient value of 0.703. Furthermore, the model also consistently provides the effective performance in independent testing set, with sensitivity of 77.6%, specificity of 94.74%, accuracy of 88.99% and the MCC value reached 0.75. Finally, the two-step model has been implemented as a web-based tool, namely iDACP, which is now freely available at http://mer.hc.mmh.org.tw/iDACP/ .
Collapse
Affiliation(s)
- Kai-Yao Huang
- Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu City, 300, Taiwan
- Department of Medicine, Mackay Medical College, New Taipei City, 252, Taiwan
| | - Yi-Jhan Tseng
- Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu City, 300, Taiwan
| | - Hui-Ju Kao
- Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu City, 300, Taiwan
| | - Chia-Hung Chen
- Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu City, 300, Taiwan
| | - Hsiao-Hsiang Yang
- Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu City, 300, Taiwan
| | - Shun-Long Weng
- Department of Medicine, Mackay Medical College, New Taipei City, 252, Taiwan.
- Department of Obstetrics and Gynecology, Hsinchu Mackay Memorial Hospital, Hsinchu City, 300, Taiwan.
- Mackay Junior College of Medicine, Medicine, Nursing and Management College, Taipei City, 112, Taiwan.
| |
Collapse
|
27
|
CWLy-RF: A novel approach for identifying cell wall lyases based on random forest classifier. Genomics 2021; 113:2919-2924. [PMID: 34186189 DOI: 10.1016/j.ygeno.2021.06.038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2021] [Revised: 06/20/2021] [Accepted: 06/25/2021] [Indexed: 02/05/2023]
Abstract
Drug resistance of pathogenic bacteria has become increasingly serious due to the abuse of antibiotics in recent years. Researchers have found that cell wall lyases are effective antibacterial agents that can specifically recognize target bacteria and degrade bacterial peptidoglycan. Traditional wet experiments are usually expensive, time-consuming and laborious for the identification of lyases. Therefore, there is an urgent need to develop prediction tools based on computer methods to identify lyases quickly and accurately. In this paper, a new predictor, CWLy-RF, is proposed based on the random forest (RF) algorithm to identify cell wall lyases. In this method, we combined three features, namely, 400D, 188D and the composition of k-spaced amino acid group pairs, using mixed-feature representation methods. Afterward, we improved the feature representation ability with the selected top 100 features by using the information gain method and trained a predictive model using RF. The constructed prediction model is evaluated by using 10-fold cross-validation. The accuracy obtained was 96.09%, the AUC was 0.993, the MCC was 0.922, the sensitivity was 94.92%, and the specificity was 97.32%. We have proved that the proposed predictor CWLy-RF is superior to other latest models, and it will hopefully become an effective and useful tool for identifying lyases.
Collapse
|
28
|
Zhao Y, Wang S, Fei W, Feng Y, Shen L, Yang X, Wang M, Wu M. Prediction of Anticancer Peptides with High Efficacy and Low Toxicity by Hybrid Model Based on 3D Structure of Peptides. Int J Mol Sci 2021; 22:5630. [PMID: 34073203 PMCID: PMC8198792 DOI: 10.3390/ijms22115630] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2021] [Revised: 04/30/2021] [Accepted: 05/19/2021] [Indexed: 02/07/2023] Open
Abstract
Recently, anticancer peptides (ACPs) have emerged as unique and promising therapeutic agents for cancer treatment compared with antibody and small molecule drugs. In addition to experimental methods of ACPs discovery, it is also necessary to develop accurate machine learning models for ACP prediction. In this study, features were extracted from the three-dimensional (3D) structure of peptides to develop the model, compared to most of the previous computational models, which are based on sequence information. In order to develop ACPs with more potency, more selectivity and less toxicity, the model for predicting ACPs, hemolytic peptides and toxic peptides were established by peptides 3D structure separately. Multiple datasets were collected according to whether the peptide sequence was chemically modified. After feature extraction and screening, diverse algorithms were used to build the model. Twelve models with excellent performance (Acc > 90%) in the ACPs mixed datasets were used to form a hybrid model to predict the candidate ACPs, and then the optimal model of hemolytic peptides (Acc = 73.68%) and toxic peptides (Acc = 85.5%) was used for safety prediction. Novel ACPs were found by using those models, and five peptides were randomly selected to determine their anticancer activity and toxic side effects in vitro experiments.
Collapse
Affiliation(s)
| | | | | | | | | | | | - Min Wang
- State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China; (Y.Z.); (S.W.); (W.F.); (Y.F.); (L.S.); (X.Y.)
| | - Min Wu
- State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China; (Y.Z.); (S.W.); (W.F.); (Y.F.); (L.S.); (X.Y.)
| |
Collapse
|
29
|
Xu L, Jiang S, Wu J, Zou Q. An in silico approach to identification, categorization and prediction of nucleic acid binding proteins. Brief Bioinform 2021. [PMID: 32793956 DOI: 10.1101/2020.05.05.078741] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/12/2023] Open
Abstract
The interaction between proteins and nucleic acid plays an important role in many processes, such as transcription, translation and DNA repair. The mechanisms of related biological events can be understood by exploring the function of proteins in these interactions. The number of known protein sequences has increased rapidly in recent years, but the databases for describing the structure and function of protein have unfortunately grown quite slowly. Thus, improving such databases is meaningful for predicting protein-nucleic acid interactions. Furthermore, the mechanism of related biological events, such as viral infection or designing novel drug targets, can be further understood by understanding the function of proteins in these interactions. The information for each sequence, including its function and interaction sites, were collected and identified, and a database called PNIDB was built. The proteins in PNIDB were grouped into 27 classes, such as transcription, immune system, and structural protein, etc. The function of each protein was then predicted using a machine learning method. Using our method, the predictor was trained on labeled sequences, and then the function of a protein was predicted based on the trained classifier. The prediction accuracy achieved a score of 77.43% by 10-fold cross validation.
Collapse
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic
| | | | - Jin Wu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
| | - Quan Zou
- School of Management, Shenzhen Polytechnic
| |
Collapse
|
30
|
|
31
|
Abstract
Background:
Bioluminescence is a unique and significant phenomenon in nature.
Bioluminescence is important for the lifecycle of some organisms and is valuable in biomedical
research, including for gene expression analysis and bioluminescence imaging technology. In recent
years, researchers have identified a number of methods for predicting bioluminescent proteins
(BLPs), which have increased in accuracy, but could be further improved.
Method:
In this study, a new bioluminescent proteins prediction method, based on a voting
algorithm, is proposed. Four methods of feature extraction based on the amino acid sequence were
used. 314 dimensional features in total were extracted from amino acid composition,
physicochemical properties and k-spacer amino acid pair composition. In order to obtain the highest
MCC value to establish the optimal prediction model, a voting algorithm was then used to build the
model. To create the best performing model, the selection of base classifiers and vote counting rules
are discussed.
Results:
The proposed model achieved 93.4% accuracy, 93.4% sensitivity and
91.7% specificity in the test set, which was better than any other method. A previous prediction of
bioluminescent proteins in three lineages was also improved using the model building method,
resulting in greatly improved accuracy.
Collapse
Affiliation(s)
- Shulin Zhao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba Science City, Japan
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Shuguang Han
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
32
|
Convolutional neural networks with image representation of amino acid sequences for protein function prediction. Comput Biol Chem 2021; 92:107494. [PMID: 33930742 DOI: 10.1016/j.compbiolchem.2021.107494] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2021] [Accepted: 04/21/2021] [Indexed: 01/11/2023]
Abstract
Proteins are one of the most important molecules that govern the cellular processes in most of the living organisms. Various functions of the proteins are of paramount importance to understand the basics of life. Several supervised learning approaches are applied in this field to predict the functionality of proteins. In this paper, we propose a convolutional neural network based approach ProtConv to predict the functionality of proteins by converting the amino-acid sequences to a two dimensional image. We have used a protein embedding technique using transfer learning to generate the feature vector. Feature vector is then converted into a square sized single channel image to be fed into a convolutional network. The neural network architecture used here is a combination of convolutional filters and average pooling layers followed by dense fully connected layers to predict a binary function. We have performed experiments on standard benchmark datasets taken from two very important protein function prediction task: proinflammatory cytokines and anticancer peptides. Our experiments show that the proposed method, ProtConv achieves state-of-the-art performances on both of the datasets. All necessary details about implementation with source code and datasets are made available at: https://github.com/swakkhar/ProtConv.
Collapse
|
33
|
Niu M, Lin Y, Zou Q. sgRNACNN: identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks. PLANT MOLECULAR BIOLOGY 2021; 105:483-495. [PMID: 33385273 DOI: 10.1007/s11103-020-01102-y] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/17/2020] [Accepted: 12/01/2020] [Indexed: 06/12/2023]
Abstract
KEY MESSAGE We proposed an ensemble convolutional neural network model to identify sgRNA high on-target activity in four crops and we used one-hot encoding and k-mers for sequence encoding. As an important component of the CRISPR/Cas9 system, single-guide RNA (sgRNA) plays an important role in gene redirection and editing. sgRNA has played an important role in the improvement of agronomic species, but there is a lack of effective bioinformatics tools to identify the activity of sgRNA in agronomic species. Therefore, it is necessary to develop a method based on machine learning to identify sgRNA high on-target activity. In this work, we proposed a simple convolutional neural network method to identify sgRNA high on-target activity. Our study used one-hot encoding and k-mers for sequence data conversion and a voting algorithm for constructing the convolutional neural network ensemble model sgRNACNN for the prediction of sgRNA activity. The ensemble model sgRNACNN was used for predictions in four crops: Glycine max, Zea mays, Sorghum bicolor and Triticum aestivum. The accuracy rates of the four crops in the sgRNACNN model were 82.43%, 80.33%, 78.25% and 87.49%, respectively. The experimental results showed that sgRNACNN realizes the identification of high on-target activity sgRNA of agronomic data and can meet the demands of sgRNA activity prediction in agronomy to a certain extent. These results have certain significance for guiding crop gene editing and academic research. The source code and relevant dataset can be found in the following link: https://github.com/nmt315320/sgRNACNN.git .
Collapse
Affiliation(s)
- Mengting Niu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Yuan Lin
- Department of System Integration, Sparebanken Vest, Bergen, Norway.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
| |
Collapse
|
34
|
Huang Q, Zhou W, Guo F, Xu L, Zhang L. 6mA-Pred: identifying DNA N6-methyladenine sites based on deep learning. PeerJ 2021; 9:e10813. [PMID: 33604189 PMCID: PMC7866889 DOI: 10.7717/peerj.10813] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2020] [Accepted: 12/30/2020] [Indexed: 01/03/2023] Open
Abstract
With the accumulation of data on 6mA modification sites, an increasing number of scholars have begun to focus on the identification of 6mA sites. Despite the recognized importance of 6mA sites, methods for their identification remain lacking, with most existing methods being aimed at their identification in individual species. In the present study, we aimed to develop an identification method suitable for multiple species. Based on previous research, we propose a method for 6mA site recognition. Our experiments prove that the proposed 6mA-Pred method is effective for identifying 6mA sites in genes from taxa such as rice, Mus musculus, and human. A series of experimental results show that 6mA-Pred is an excellent method. We provide the source code used in the study, which can be obtained from http://39.100.246.211:5004/6mA_Pred/.
Collapse
Affiliation(s)
- Qianfei Huang
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Wenyang Zhou
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| |
Collapse
|
35
|
He S, Guo F, Zou Q, HuiDing. MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200503030350] [Citation(s) in RCA: 101] [Impact Index Per Article: 25.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Aims:
The study aims to find a way to reduce the dimensionality of the dataset.
Background:
Dimensionality reduction is the key issue of the machine learning process. It does
not only improve the prediction performance but also could recommend the intrinsic features and
help to explore the biological expression of the machine learning “black box”.
Objective:
A variety of feature selection algorithms are used to select data features to achieve
dimensionality reduction.
Methods:
First, MRMD2.0 integrated 7 different popular feature ranking algorithms with
PageRank strategy. Second, optimized dimensionality was detected with forward adding strategy.
Result:
We have achieved good results in our experiments.
Conclusion:
Several works have been tested with MRMD2.0. It showed well performance.
Otherwise, it also can draw the performance curves according to the feature dimensionality. If
users want to sacrifice accuracy for fewer features, they can select the dimensionality from the
performance curves.
Other:
We developed friendly python tools together with the web server. The users could upload
their csv, arff or libsvm format files. Then the webserver would help to rank features and find the
optimized dimensionality.
Collapse
Affiliation(s)
- Shida He
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - HuiDing
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
36
|
Charoenkwan P, Chiangjong W, Lee VS, Nantasenamat C, Hasan MM, Shoombuatong W. Improved prediction and characterization of anticancer activities of peptides using a novel flexible scoring card method. Sci Rep 2021; 11:3017. [PMID: 33542286 PMCID: PMC7862624 DOI: 10.1038/s41598-021-82513-9] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2020] [Accepted: 01/18/2021] [Indexed: 01/30/2023] Open
Abstract
As anticancer peptides (ACPs) have attracted great interest for cancer treatment, several approaches based on machine learning have been proposed for ACP identification. Although existing methods have afforded high prediction accuracies, however such models are using a large number of descriptors together with complex ensemble approaches that consequently leads to low interpretability and thus poses a challenge for biologists and biochemists. Therefore, it is desirable to develop a simple, interpretable and efficient predictor for accurate ACP identification as well as providing the means for the rational design of new anticancer peptides with promising potential for clinical application. Herein, we propose a novel flexible scoring card method (FSCM) making use of propensity scores of local and global sequential information for the development of a sequence-based ACP predictor (named iACP-FSCM) for improving the prediction accuracy and model interpretability. To the best of our knowledge, iACP-FSCM represents the first sequence-based ACP predictor for rationalizing an in-depth understanding into the molecular basis for the enhancement of anticancer activities of peptides via the use of FSCM-derived propensity scores. The independent testing results showed that the iACP-FSCM provided accuracies of 0.825 and 0.910 as evaluated on the main and alternative datasets, respectively. Results from comparative benchmarking demonstrated that iACP-FSCM could outperform seven other existing ACP predictors with marked improvements of 7% and 17% for accuracy and MCC, respectively, on the main dataset. Furthermore, the iACP-FSCM (0.910) achieved very comparable results to that of the state-of-the-art ensemble model AntiCP2.0 (0.920) as evaluated on the alternative dataset. Comparative results demonstrated that iACP-FSCM was the most suitable choice for ACP identification and characterization considering its simplicity, interpretability and generalizability. It is highly anticipated that the iACP-FSCM may be a robust tool for the rapid screening and identification of promising ACPs for clinical use.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
| | - Wararat Chiangjong
- Pediatric Translational Research Unit, Department of Pediatrics, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok, 10400, Thailand
| | - Vannajan Sanghiran Lee
- Department of Chemistry, Centre of Theoretical and Computational Physics, Faculty of Science, University of Malaya, 50603, Kuala Lumpur, Malaysia
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
| |
Collapse
|
37
|
Liu T, Chen JM, Zhang D, Zhang Q, Peng B, Xu L, Tang H. ApoPred: Identification of Apolipoproteins and Their Subfamilies With Multifarious Features. Front Cell Dev Biol 2021; 8:621144. [PMID: 33490085 PMCID: PMC7820372 DOI: 10.3389/fcell.2020.621144] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2020] [Accepted: 11/24/2020] [Indexed: 01/24/2023] Open
Abstract
Apolipoprotein is a group of plasma proteins that are associated with a variety of diseases, such as hyperlipidemia, atherosclerosis, Alzheimer's disease, and diabetes. In order to investigate the function of apolipoproteins and to develop effective targets for related diseases, it is necessary to accurately identify and classify apolipoproteins. Although it is possible to identify apolipoproteins accurately through biochemical experiments, they are expensive and time-consuming. This work aims to establish a high-efficiency and high-accuracy prediction model for recognition of apolipoproteins and their subfamilies. We firstly constructed a high-quality benchmark dataset including 270 apolipoproteins and 535 non-apolipoproteins. Based on the dataset, pseudo-amino acid composition (PseAAC) and composition of k-spaced amino acid pairs (CKSAAP) were used as input vectors. To improve the prediction accuracy and eliminate redundant information, analysis of variance (ANOVA) was used to rank the features. And the incremental feature selection was utilized to obtain the best feature subset. Support vector machine (SVM) was proposed to construct the classification model, which could produce the accuracy of 97.27%, sensitivity of 96.30%, and specificity of 97.76% for discriminating apolipoprotein from non-apolipoprotein in 10-fold cross-validation. In addition, the same process was repeated to generate a new model for predicting apolipoprotein subfamilies. The new model could achieve an overall accuracy of 95.93% in 10-fold cross-validation. According to our proposed model, a convenient webserver called ApoPred was established, which can be freely accessed at http://tang-biolab.com/server/ApoPred/service.html. We expect that this work will contribute to apolipoprotein function research and drug development in relevant diseases.
Collapse
Affiliation(s)
- Ting Liu
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
| | - Jia-Mao Chen
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
| | - Dan Zhang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Qian Zhang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
| | - Bowen Peng
- Division of international Cooperation, Health Commission of Sichuan Province, Chengdu, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Hua Tang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Central Nervous System Drug Key Laboratory of Sichuan Province, Luzhou, China
| |
Collapse
|
38
|
Li J, Zhang L, He S, Guo F, Zou Q. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021; 22:6059770. [PMID: 33388743 DOI: 10.1093/bib/bbaa401] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2020] [Revised: 11/28/2020] [Accepted: 12/08/2020] [Indexed: 01/23/2023] Open
Abstract
MOTIVATION mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of the protein function. However, current assignment of subcellular localization of eukaryotic mRNA reveals important limitations: (1) turning multiple classifications into multiple dichotomies makes the training process tedious; (2) the majority of the models trained by classical algorithm are based on the extraction of single sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed. RESULTS In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generated single feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based: physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets is sufficient to indicate that SubLocEP is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains experimental data and can maximize the user convenience for estimation of subcellular localization of eukaryotic mRNA.
Collapse
Affiliation(s)
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
| | | | | | | |
Collapse
|
39
|
Lv Z, Ding H, Wang L, Zou Q. A Convolutional Neural Network Using Dinucleotide One-hot Encoder for identifying DNA N6-Methyladenine Sites in the Rice Genome. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.09.056] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
|
40
|
Timmons PB, Hewage CM. ENNAACT is a novel tool which employs neural networks for anticancer activity classification for therapeutic peptides. Biomed Pharmacother 2020; 133:111051. [PMID: 33254015 DOI: 10.1016/j.biopha.2020.111051] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Revised: 10/08/2020] [Accepted: 11/19/2020] [Indexed: 12/12/2022] Open
Abstract
The prevalence of cancer as a threat to human life, responsible for 9.6 million deaths worldwide in 2018, motivates the search for new anticancer agents. While many options are currently available for treatment, these are often expensive and impact the human body unfavourably. Anticancer peptides represent a promising emerging field of anticancer therapeutics, which are characterized by favourable toxicity profile. The development of accurate in silico methods for anticancer peptide prediction is of paramount importance, as the amount of available sequence data is growing each year. This study leverages advances in machine learning research to produce a novel sequence-based deep neural network classifier for anticancer peptide activity. The classifier achieves performance comparable to the best-in-class, with a cross-validated accuracy of 98.3%, Matthews correlation coefficient of 0.91 and an Area Under the Curve of 0.95. This innovative classifier is available as a web server at https://research.timmons.eu/ennaact, facilitating in silico screening and design of new anticancer peptide chemotherapeutics by the research community.
Collapse
Affiliation(s)
- Patrick Brendan Timmons
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
| | - Chandralal M Hewage
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland.
| |
Collapse
|
41
|
Pérez-Peinado C, Valle J, Freire JM, Andreu D. Tumor Cell Attack by Crotalicidin (Ctn) and Its Fragment Ctn[15-34]: Insights into Their Dual Membranolytic and Intracellular Targeting Mechanism. ACS Chem Biol 2020; 15:2945-2957. [PMID: 33021779 DOI: 10.1021/acschembio.0c00596] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Crotalicidin (Ctn) and its fragment Ctn[15-34] are snake-venom-derived, cathelicidin-related peptides outstanding for their promising antimicrobial, antifungal, and antitumoral properties. In this study, we describe their membranolytic mechanisms as well as their putative interference with intracellular targets, both contributing to their antitumoral action against a pro-monocytic leukemia cell line. Initial flow cytometry assays demonstrated peptide ability to induce tumor cell membrane permeabilization and caspase-dependent apoptosis, without total activity reduction by serum proteases up to 24 h (Ctn) and 18 h (Ctn[15-34]). In addition, both Ctn and Ctn[15-34] showed preference for tumor cells rather than healthy cells, with selectivity ratios (tumoral vs healthy cells) of 17 and 7, respectively. Further microscopy and flow cytometry studies suggested their preferential accumulation in the cytoplasmic membrane and nucleus and proposed multiple predominant routes of peptide uptake, including direct entry and endocytosis. Affinity purification followed by proteomic identification experiments revealed both peptides to interact with proteins involved in DNA and protein metabolism, cell cycles, signal transduction, and/or programmed cell death, among others. These results suggest a putative role of Ctn and Ctn[15-34] to interact with key intracellular pathways, ultimately contributing to tumor cell death by necrosis/apoptosis. Altogether, this work proposes a dual mechanism underlying the antitumoral activity of Ctn and Ctn[15-34] and reinforces their potential as future therapeutic drugs.
Collapse
Affiliation(s)
- Clara Pérez-Peinado
- Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona Biomedical Research Park, 08003 Barcelona, Spain
| | - Javier Valle
- Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona Biomedical Research Park, 08003 Barcelona, Spain
| | - João M. Freire
- Drug Product Development, Janssen Vaccines and Prevention, Newtonweg 1, 2333-CP Leiden, The Netherlands
| | - David Andreu
- Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona Biomedical Research Park, 08003 Barcelona, Spain
| |
Collapse
|
42
|
Meng C, Wu J, Guo F, Dong B, Xu L. CWLy-pred: A novel cell wall lytic enzyme identifier based on an improved MRMD feature selection method. Genomics 2020; 112:4715-4721. [DOI: 10.1016/j.ygeno.2020.08.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2020] [Revised: 08/04/2020] [Accepted: 08/13/2020] [Indexed: 10/25/2022]
|
43
|
Hou R, Wu J, Xu L, Zou Q, Wu YJ. Computational Prediction of Protein Arginine Methylation Based on Composition-Transition-Distribution Features. ACS OMEGA 2020; 5:27470-27479. [PMID: 33134710 PMCID: PMC7594152 DOI: 10.1021/acsomega.0c03972] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/18/2020] [Accepted: 10/06/2020] [Indexed: 06/11/2023]
Abstract
Arginine methylation is one of the most essential protein post-translational modifications. Identifying the site of arginine methylation is a critical problem in biology research. Unfortunately, biological experiments such as mass spectrometry are expensive and time-consuming. Hence, predicting arginine methylation by machine learning is an alternative fast and efficient way. In this paper, we focus on the systematic characterization of arginine methylation with composition-transition-distribution (CTD) features. The presented framework consists of three stages. In the first stage, we extract CTD features from 1750 samples and exploit decision tree to generate accurate prediction. The accuracy of prediction can reach 96%. In the second stage, the support vector machine can predict the number of arginine methylation sites with 0.36 R-squared. In the third stage, experiments carried out with the updated arginine methylation site data set show that utilizing CTD features and adopting random forest as the classifier outperform previous methods. The accuracy of identification can reach 82.1 and 82.5% in single methylarginine and double methylarginine data sets, respectively. The discovery presented in this paper can be helpful for future research on arginine methylation.
Collapse
Affiliation(s)
- Ruiyan Hou
- Laboratory
of Molecular Toxicology, State Key Laboratory of Integrated Management
of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- College
of Life Science, University of Chinese Academy
of Sciences, Beijing 100049, China
| | - Jin Wu
- School
of Management, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Lei Xu
- School
of Electronic and Engineering, Shenzhen
Polytechnic, Shenzhen 518055, China
| | - Quan Zou
- Institute
of Fundamental and Frontier Sciences, University
of Electronic Science and Technology of China, Chengdu 610054, China
| | - Yi-Jun Wu
- Laboratory
of Molecular Toxicology, State Key Laboratory of Integrated Management
of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| |
Collapse
|
44
|
Guo Z, Wang P, Liu Z, Zhao Y. Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction. Front Bioeng Biotechnol 2020; 8:584807. [PMID: 33195148 PMCID: PMC7642589 DOI: 10.3389/fbioe.2020.584807] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2020] [Accepted: 09/11/2020] [Indexed: 01/19/2023] Open
Abstract
Thermophilicity is a very important property of proteins, as it sometimes determines denaturation and cell death. Thus, methods for predicting thermophilic proteins and non-thermophilic proteins are of interest and can contribute to the design and engineering of proteins. In this article, we describe the use of feature dimension reduction technology and LIBSVM to identify thermophilic proteins. The highest accuracy obtained by cross-validation was 96.02% with 119 parameters. When using only 16 features, we obtained an accuracy of 93.33%. We discuss the importance of the different characteristics in identification and report a comparison of the performance of support vector machine to that of other methods.
Collapse
Affiliation(s)
- Zifan Guo
- School of Aeronautics and Astronautic, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Pingping Wang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Zhendong Liu
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China
| | - Yuming Zhao
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
| |
Collapse
|
45
|
Dou L, Li X, Zhang L, Xiang H, Xu L. iGlu_AdaBoost: Identification of Lysine Glutarylation Using the AdaBoost Classifier. J Proteome Res 2020; 20:191-201. [PMID: 33090794 DOI: 10.1021/acs.jproteome.0c00314] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Abstract
Lysine glutarylation is a newly reported post-translational modification (PTM) that plays significant roles in regulating metabolic and mitochondrial processes. Accurate identification of protein glutarylation is the primary task to better investigate molecular functions and various applications. Due to the common disadvantages of the time-consuming and expensive nature of traditional biological sequencing techniques as well as the explosive growth of protein data, building precise computational models to rapidly diagnose glutarylation is a popular and feasible solution. In this work, we proposed a novel AdaBoost-based predictor called iGlu_AdaBoost to distinguish glutarylation and non-glutarylation sequences. Here, the top 37 features were chosen from a total of 1768 combined features using Chi2 following incremental feature selection (IFS) to build the model, including 188D, the composition of k-spaced amino acid pairs (CKSAAP), and enhanced amino acid composition (EAAC). With the help of the hybrid-sampling method SMOTE-Tomek, the AdaBoost algorithm was performed with satisfactory recall, specificity, and AUC values of 87.48%, 72.49%, and 0.89 over 10-fold cross validation as well as 72.73%, 71.92%, and 0.63 over independent test, respectively. Further feature analysis inferred that positively charged amino acids RK play critical roles in glutarylation recognition. Our model presented the well generalization ability and consistency of the prediction results of positive and negative samples, which is comparable to four published tools. The proposed predictor is an efficient tool to find potential glutarylation sites and provides helpful suggestions for further research on glutarylation mechanisms and concerned disease treatments.
Collapse
Affiliation(s)
- Lijun Dou
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiaoling Li
- Department of Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin 150000, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen 518172, China
| | - Huaikun Xiang
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| |
Collapse
|
46
|
Li Q, Xu L, Li Q, Zhang L. Identification and Classification of Enhancers Using Dimension Reduction Technique and Recurrent Neural Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:8852258. [PMID: 33133227 PMCID: PMC7591959 DOI: 10.1155/2020/8852258] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/25/2020] [Revised: 09/16/2020] [Accepted: 09/30/2020] [Indexed: 12/21/2022]
Abstract
Enhancers are noncoding fragments in DNA sequences, which play an important role in gene transcription and translation. However, due to their high free scattering and positional variability, the identification and classification of enhancers have a higher level of complexity than those of coding genes. In order to solve this problem, many computer studies have been carried out in this field, but there are still some deficiencies in these prediction models. In this paper, we use various feature extraction strategies, dimension reduction technology, and a comprehensive application of machine model and recurrent neural network model to achieve an accurate prediction of enhancer identification and classification with the accuracy of was 76.7% and 84.9%, respectively. The model proposed in this paper is superior to the previous methods in performance index or feature dimension, which provides inspiration for the prediction of enhancers by computer technology in the future.
Collapse
Affiliation(s)
- Qingwen Li
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Qingyuan Li
- Forestry and Fruit Tree Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| |
Collapse
|
47
|
Yu L, Jing R, Liu F, Luo J, Li Y. DeepACP: A Novel Computational Approach for Accurate Identification of Anticancer Peptides by Deep Learning Algorithm. MOLECULAR THERAPY-NUCLEIC ACIDS 2020; 22:862-870. [PMID: 33230481 PMCID: PMC7658571 DOI: 10.1016/j.omtn.2020.10.005] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/02/2020] [Accepted: 10/06/2020] [Indexed: 12/24/2022]
Abstract
Cancer is one of the most dangerous diseases to human health. The accurate prediction of anticancer peptides (ACPs) would be valuable for the development and design of novel anticancer agents. Current deep neural network models have obtained state-of-the-art prediction accuracy for the ACP classification task. However, based on existing studies, it remains unclear which deep learning architecture achieves the best performance. Thus, in this study, we first present a systematic exploration of three important deep learning architectures: convolutional, recurrent, and convolutional-recurrent networks for distinguishing ACPs from non-ACPs. We find that the recurrent neural network with bidirectional long short-term memory cells is superior to other architectures. By utilizing the proposed model, we implement a sequence-based deep learning tool (DeepACP) to accurately predict the likelihood of a peptide exhibiting anticancer activity. The results indicate that DeepACP outperforms several existing methods and can be used as an effective tool for the prediction of anticancer peptides. Furthermore, we visualize and understand the deep learning model. We hope that our strategy can be extended to identify other types of peptides and may provide more assistance to the development of proteomics and new drugs.
Collapse
Affiliation(s)
- Lezheng Yu
- School of Chemistry and Materials Science, Guizhou Education University, Guiyang 550018, China
- Corresponding author: Lezheng Yu, School of Chemistry and Materials Science, Guizhou Education University, Guiyang 550018, China.
| | - Runyu Jing
- College of Cybersecurity, Sichuan University, Chengdu 610065, China
| | - Fengjuan Liu
- School of Geography and Resources, Guizhou Education University, Guiyang 550018, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, Sichuan, China
- Corresponding author: Jiesi Luo, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, Sichuan, China.
| | - Yizhou Li
- College of Cybersecurity, Sichuan University, Chengdu 610065, China
| |
Collapse
|
48
|
Xu L, Liang G, Chen B, Tan X, Xiang H, Liao C. A Computational Method for the Identification of Endolysins and Autolysins. Protein Pept Lett 2020; 27:329-336. [PMID: 31577192 DOI: 10.2174/0929866526666191002104735] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2019] [Revised: 06/27/2019] [Accepted: 09/03/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Cell lytic enzyme is a kind of highly evolved protein, which can destroy the cell structure and kill the bacteria. Compared with antibiotics, cell lytic enzyme will not cause serious problem of drug resistance of pathogenic bacteria. Thus, the study of cell wall lytic enzymes aims at finding an efficient way for curing bacteria infectious. Compared with using antibiotics, the problem of drug resistance becomes more serious. Therefore, it is a good choice for curing bacterial infections by using cell lytic enzymes. Cell lytic enzyme includes endolysin and autolysin and the difference between them is the purpose of the break of cell wall. The identification of the type of cell lytic enzymes is meaningful for the study of cell wall enzymes. OBJECTIVE In this article, our motivation is to predict the type of cell lytic enzyme. Cell lytic enzyme is helpful for killing bacteria, so it is meaningful for study the type of cell lytic enzyme. However, it is time consuming to detect the type of cell lytic enzyme by experimental methods. Thus, an efficient computational method for the type of cell lytic enzyme prediction is proposed in our work. METHODS We propose a computational method for the prediction of endolysin and autolysin. First, a data set containing 27 endolysins and 41 autolysins is built. Then the protein is represented by tripeptides composition. The features are selected with larger confidence degree. At last, the classifier is trained by the labeled vectors based on support vector machine. The learned classifier is used to predict the type of cell lytic enzyme. RESULTS Following the proposed method, the experimental results show that the overall accuracy can attain 97.06%, when 44 features are selected. Compared with Ding's method, our method improves the overall accuracy by nearly 4.5% ((97.06-92.9)/92.9%). The performance of our proposed method is stable, when the selected feature number is from 40 to 70. The overall accuracy of tripeptides optimal feature set is 94.12%, and the overall accuracy of Chou's amphiphilic PseAAC method is 76.2%. The experimental results also demonstrate that the overall accuracy is improved by nearly 18% when using the tripeptides optimal feature set. CONCLUSION The paper proposed an efficient method for identifying endolysin and autolysin. In this paper, support vector machine is used to predict the type of cell lytic enzyme. The experimental results show that the overall accuracy of the proposed method is 94.12%, which is better than some existing methods. In conclusion, the selected 44 features can improve the overall accuracy for identification of the type of cell lytic enzyme. Support vector machine performs better than other classifiers when using the selected feature set on the benchmark data set.
Collapse
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Guangmin Liang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Baowen Chen
- School of Software, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Xu Tan
- School of Software, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Huaikun Xiang
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Changrui Liao
- Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen, China
| |
Collapse
|
49
|
Xu L, Jiang S, Wu J, Zou Q. An in silico approach to identification, categorization and prediction of nucleic acid binding proteins. Brief Bioinform 2020; 22:5892348. [PMID: 32793956 DOI: 10.1093/bib/bbaa171] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2020] [Revised: 06/22/2020] [Accepted: 07/01/2020] [Indexed: 01/29/2023] Open
Abstract
The interaction between proteins and nucleic acid plays an important role in many processes, such as transcription, translation and DNA repair. The mechanisms of related biological events can be understood by exploring the function of proteins in these interactions. The number of known protein sequences has increased rapidly in recent years, but the databases for describing the structure and function of protein have unfortunately grown quite slowly. Thus, improving such databases is meaningful for predicting protein-nucleic acid interactions. Furthermore, the mechanism of related biological events, such as viral infection or designing novel drug targets, can be further understood by understanding the function of proteins in these interactions. The information for each sequence, including its function and interaction sites, were collected and identified, and a database called PNIDB was built. The proteins in PNIDB were grouped into 27 classes, such as transcription, immune system, and structural protein, etc. The function of each protein was then predicted using a machine learning method. Using our method, the predictor was trained on labeled sequences, and then the function of a protein was predicted based on the trained classifier. The prediction accuracy achieved a score of 77.43% by 10-fold cross validation.
Collapse
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic
| | | | - Jin Wu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
| | - Quan Zou
- School of Management, Shenzhen Polytechnic
| |
Collapse
|
50
|
Li Q, Zhou W, Wang D, Wang S, Li Q. Prediction of Anticancer Peptides Using a Low-Dimensional Feature Model. Front Bioeng Biotechnol 2020; 8:892. [PMID: 32903381 PMCID: PMC7434836 DOI: 10.3389/fbioe.2020.00892] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2020] [Accepted: 07/10/2020] [Indexed: 01/09/2023] Open
Abstract
Cancer is still a severe health problem globally. The therapy of cancer traditionally involves the use of radiotherapy or anticancer drugs to kill cancer cells, but these methods are quite expensive and have side effects, which will cause great harm to patients. With the find of anticancer peptides (ACPs), significant progress has been achieved in the therapy of tumors. Therefore, it is invaluable to accurately identify anticancer peptides. Although biochemical experiments can solve this work, this method is expensive and time-consuming. To promote the application of anticancer peptides in cancer therapy, machine learning can be used to recognize anticancer peptides by extracting the feature vectors of anticancer peptides. Nevertheless, poor performance usually be found in training the machine learning model to utilizing high-dimensional features in practice. In order to solve the above job, this paper put forward a 19-dimensional feature model based on anticancer peptide sequences, which has lower dimensionality and better performance than some existing methods. In addition, this paper also separated a model with a low number of dimensions and acceptable performance. The few features identified in this study may represent the important features of anticancer peptides.
Collapse
Affiliation(s)
- Qingwen Li
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China
| | - Wenyang Zhou
- Center for Bioinformatics, School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
| | - Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Sui Wang
- Key Laboratory of Soybean Biology in Chinese Ministry of Education, Northeast Agricultural University, Harbin, China
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin, China
| | - Qingyuan Li
- Forestry and Fruit Tree Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan, China
| |
Collapse
|