1
Guan J, Xie P, Dong D, Liu Q, Zhao Z, Guo Y, Zhang Y, Lee TY, Yao L, Chiang YC. DeepKlapred: A deep learning framework for identifying protein lysine lactylation sites via multi-view feature fusion. Int J Biol Macromol 2024; 283:137668. [PMID: 39566793 DOI: 10.1016/j.ijbiomac.2024.137668] [Received: 10/17/2024] [Revised: 11/10/2024] [Accepted: 11/13/2024] [Indexed: 11/22/2024]
Abstract
Lysine lactylation (Kla) is a post-translational modification (PTM) that holds significant importance in the regulation of various biological processes. While traditional experimental methods are highly accurate for identifying Kla sites, they are both time-consuming and labor-intensive. Recent machine learning advances have enabled computational models for Kla site prediction. In this study, we propose a novel framework that integrates sequence embedding with sequence descriptors to enhance the representation of protein sequence features. Our framework employs a BiGRU-Transformer architecture to capture both local and global dependencies within the sequence, while incorporating six sequence descriptors to extract biochemical properties and evolutionary patterns. Additionally, we apply a cross-attention fusion mechanism to combine sequence embeddings with descriptor-based features, enabling the model to capture complex interactions between different feature representations. Our model demonstrated excellent performance in predicting Kla sites, achieving an accuracy of 0.998 on the training set and 0.969 on the independent set. Additionally, through attention analysis and motif discovery, our model provided valuable insights into key sequence patterns and regions that are crucial for Kla modification. This work not only deepens the understanding of Kla's functional roles but also holds the potential to positively impact future research in protein modification prediction and functional annotation.
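The cross-attention fusion step described in this abstract can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention in which sequence-embedding positions act as queries and descriptor vectors act as keys and values; the learned projection matrices, dimensions, and toy inputs are assumptions for illustration, not details taken from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Hypothetical toy inputs: 2 sequence-embedding positions, 3 descriptor vectors.
seq_embeddings = [[1.0, 0.0], [0.0, 1.0]]
descriptor_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(seq_embeddings, descriptor_feats, descriptor_feats)
```

Each fused row is a weighted average of the descriptor vectors, with weights determined by similarity to the corresponding embedding position.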
Affiliation(s)
- Jiahui Guan
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Peilin Xie
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Danhong Dong
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Qianchen Liu
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Zhihao Zhao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Yilin Guo
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Yilun Zhang
- School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, 1001 Daxue Road, Hsinchu 300093, Taiwan; Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, 1001 Daxue Road, Hsinchu 300093, Taiwan.
- Lantian Yao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China.
- Ying-Chih Chiang
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China.
2
Sultan MF, Shaon MSH, Karim T, Ali MM, Hasan MZ, Ahmed K, Bui FM, Chen L, Dhasarathan V, Moni MA. MLAFP-XN: Leveraging neural network model for development of antifungal peptide identification tool. Heliyon 2024; 10:e37820. [PMID: 39323787 PMCID: PMC11422610 DOI: 10.1016/j.heliyon.2024.e37820] [Received: 06/01/2024] [Revised: 08/23/2024] [Accepted: 09/10/2024] [Indexed: 09/27/2024]
Abstract
Infectious fungi are an increasing global concern. A promising approach to this problem is to use antifungal peptides (AFPs) to develop antifungal drugs that selectively eliminate fungal pathogens with minimal toxicity to the host. Accordingly, identifying precise therapeutic antifungal peptides is crucial for developing effective drugs and treatments. To this end, this study proposes MLAFP-XN, a neural network-based strategy for accurately detecting active AFPs in sequencing data. Eight feature extraction techniques are combined with the XGB feature selection strategy to form an enhanced methodology. A total of 24 classification models were evaluated, and the most effective four have been selected. Each of these models demonstrated superior accuracy on independent test sets, with respective scores of 97.93%, 99.47%, and 99.48%. Our model outperforms current state-of-the-art methods. In addition, we created a companion website to demonstrate our AFP recognition process and used SHAP to identify the most influential properties.
Affiliation(s)
- Md. Fahim Sultan
- Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Md. Shazzad Hossain Shaon
- Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Tasmin Karim
- Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Md. Mamun Ali
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Department of Software Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Md. Zahid Hasan
- Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Kawsar Ahmed
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Group of Bio-photomatiχ, Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Francis M. Bui
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Li Chen
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Vigneswaran Dhasarathan
- Department of ECE, Centre for IoT and AI (CITI), KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India
- Mohammad Ali Moni
- AI & Digital Health Technology, Artificial Intelligence & Cyber Future Institute, Charles Sturt University, Bathurst, NSW, 2795, Australia
- AI & Digital Health Technology, Rural Health Research Institute, Charles Sturt University, Orange, NSW 2800, Australia
3
Meher PK, Pradhan UK, Sethi PL, Naha S, Gupta A, Parsad R. PredPSP: a novel computational tool to discover pathway-specific photosynthetic proteins in plants. Plant Mol Biol 2024; 114:106. [PMID: 39316155 DOI: 10.1007/s11103-024-01500-6] [Received: 02/16/2024] [Accepted: 09/04/2024] [Indexed: 09/25/2024]
Abstract
Photosynthetic proteins play a crucial role in agricultural productivity by harnessing light energy for plant growth. Understanding these proteins, especially within C3 and C4 pathways, holds promise for improving crops in challenging environments. Despite existing models, a comprehensive computational framework specifically targeting plant photosynthetic proteins is lacking. The underutilization of plant datasets in computational algorithms accentuates the gap this study aims to fill by introducing a novel sequence-based computational method for identifying these proteins. The scope of this study encompassed diverse plant species, ensuring comprehensive representation across C3 and C4 pathways. Utilizing six deep learning models and seven shallow learning algorithms, paired with six sequence-derived feature sets followed by feature selection strategy, this study developed a comprehensive model for prediction of plant-specific photosynthetic proteins. Following 5-fold cross-validation analysis, LightGBM with 65 and 90 LGBM-VIM selected features respectively emerged as the best models for C3 (auROC: 91.78%, auPRC: 92.55%) and C4 (auROC: 99.05%, auPRC: 99.18%) plants. Validation using an independent dataset confirmed the robustness of the proposed model for both C3 (auROC: 87.23%, auPRC: 88.40%) and C4 (auROC: 92.83%, auPRC: 92.29%) categories. Comparison with existing methods demonstrated the superiority of the proposed model in predicting plant-specific photosynthetic proteins. This study further established a free online prediction server PredPSP ( https://iasri-sg.icar.gov.in/predpsp/ ) to facilitate ongoing efforts for identifying photosynthetic proteins in C3 and C4 plants. Being first of its kind, this study offers valuable insights into predicting plant-specific photosynthetic proteins which holds significant implications for plant biology.
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India.
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Padma Lochan Sethi
- Department of Bioinformatics, Odisha University of Agriculture & Technology, Bhubaneswar, 751003, Odisha, India
- Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
4
Ghafoor H, Asim MN, Ibrahim MA, Dengel A. ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution. Heliyon 2024; 10:e36041. [PMID: 39281576 PMCID: PMC11401092 DOI: 10.1016/j.heliyon.2024.e36041] [Received: 05/16/2024] [Revised: 08/01/2024] [Accepted: 08/08/2024] [Indexed: 09/18/2024]
Abstract
Protein solubility prediction is useful for the careful selection of highly effective candidate proteins for drug development. In recombinant protein synthesis, solubility prediction is valuable for optimizing key protein characteristics, including stability, functionality, and ease of purification. It also provides valuable information about potential biomarkers or therapeutic targets and helps in early forecasting of neurodegenerative diseases, cancer, and cardiovascular disorders. Traditional wet-lab experimental approaches to protein solubility prediction are error-prone, time-consuming, and costly. Researchers have harnessed Artificial Intelligence approaches to replace experimental approaches with computational predictors. These predictors infer the solubility of proteins by analyzing amino acid distributions in raw protein sequences. There is still considerable room for developing robust computational predictors, because existing predictors still fail to extract a comprehensive discriminative distribution of amino acids. To discriminate soluble proteins from insoluble proteins more precisely, this paper presents the ProSol-Multi predictor, which uses a novel MLCDE encoder and a Random Forest classifier. The MLCDE encoder transforms protein sequences into informative statistical vectors by capturing the multi-level correlation and discriminative distribution of amino acids within raw protein sequences. The performance of the proposed encoder is evaluated against 56 existing protein sequence encoding methods on a widely used protein solubility prediction benchmark dataset under two experimental settings, intrinsic and extrinsic. Intrinsic evaluation reveals that, among all sequence encoders, the proposed MLCDE encoder generates non-overlapping clusters of the soluble and insoluble classes. In extrinsic evaluation, 10 machine learning classifiers achieve better performance with the proposed MLCDE encoder than with the 56 existing protein sequence encoders. Moreover, across 4 public benchmark datasets, the proposed ProSol-Multi predictor outperforms 20 existing predictors by an average of 3% in accuracy and 2% in MCC and AU-ROC. The ProSol-Multi interactive web application is available at https://sds_genetic_analysis.opendfki.de/ProSol-Multi.
Affiliation(s)
- Hina Ghafoor
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Ali Ibrahim
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Andreas Dengel
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
5
Yadav AK, Gupta PK, Singh TR. PMTPred: machine-learning-based prediction of protein methyltransferases using the composition of k-spaced amino acid pairs. Mol Divers 2024; 28:2301-2315. [PMID: 39033257 DOI: 10.1007/s11030-024-10937-2] [Received: 05/06/2024] [Accepted: 07/10/2024] [Indexed: 07/23/2024]
Abstract
Protein methyltransferases (PMTs) are a group of enzymes that catalyze the transfer of a methyl group to their substrates. These enzymes play an important role in epigenetic regulation and can methylate various substrates, including DNA, RNA, proteins, and small-molecule secondary metabolites. Dysregulation of methyltransferases is implicated in various human cancers. Given the well-recognized significance of PMTs, reliable and efficient identification methods are essential. In the present work, we propose a machine-learning-based method for the identification of PMTs. Various sequence-based features were calculated, and prediction models were trained with various machine-learning algorithms using tenfold cross-validation. After evaluating each model on the dataset, the SVM-based CKSAAP model achieved the highest prediction accuracy with balanced sensitivity and specificity. This SVM model also outperformed deep-learning algorithms for the prediction of PMTs. In addition, cross-database validation was performed to ensure the robustness of the model. Feature importance was assessed using Shapley additive explanations (SHAP) values, providing insights into the contributions of different features to the model's predictions. Finally, the SVM-based CKSAAP model was implemented in a standalone tool, PMTPred, due to its consistent performance during independent testing and cross-database evaluation. We believe that PMTPred will be a useful and efficient tool for the identification of PMTs. PMTPred is freely available for download at https://github.com/ArvindYadav7/PMTPred and http://www.bioinfoindia.org/PMTPred/home.html for research and academic use.
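For readers unfamiliar with the CKSAAP descriptor named in this abstract, the following is a minimal sketch of the general technique: for each gap size k, count residue pairs separated by k positions and normalize the counts to frequencies. The function name `cksaap`, the default `k_max`, and the choice to skip pairs containing non-standard residues are assumptions for illustration; the abstract does not state the settings used in PMTPred.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(sequence, k_max=2):
    """Return {(k, pair): frequency} for gaps k = 0..k_max (400 pairs per gap)."""
    features = {}
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        total = 0
        # Pair residue i with the residue k positions further along (i + k + 1).
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # assumption: skip pairs with non-standard residues
                counts[pair] += 1
                total += 1
        for pair, count in counts.items():
            features[(k, pair)] = count / total if total else 0.0
    return features

# Toy sequence for illustration only.
feats = cksaap("ACAC", k_max=1)
```

With `k_max=1` the result holds 800 values (400 pairs for each of the two gap sizes), which could then be fed to a classifier such as an SVM.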
Affiliation(s)
- Arvind Kumar Yadav
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India
- Pradeep Kumar Gupta
- Department of Computer Science and Engineering, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India
- School of Computing, Department of Data Science and Engineering, Mohan Babu University, Tirupati- 517102, Andhra Pradesh, India
- Tiratha Raj Singh
- Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India.
- Centre of Excellence in Healthcare Technologies and Informatics (CHETI), Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan- 173234, Himachal Pradesh, India.
6
Ghafoor H, Asim MN, Ibrahim MA, Ahmed S, Dengel A. CAPTURE: Comprehensive anti-cancer peptide predictor with a unique amino acid sequence encoder. Comput Biol Med 2024; 176:108538. [PMID: 38759585 DOI: 10.1016/j.compbiomed.2024.108538] [Received: 01/08/2024] [Revised: 04/26/2024] [Accepted: 04/28/2024] [Indexed: 05/19/2024]
Abstract
The key properties of anticancer peptides (ACPs), including bioactivity, high efficacy, low toxicity, and lack of drug resistance, make them ideal candidates for cancer therapies. To explore the potential of ACPs and accelerate the development of cancer therapies, 53 Artificial Intelligence-supported computational predictors have been developed for ACP versus non-ACP classification, but only one predictor has been developed for annotating ACP functional types. These predictors extract amino acid distribution patterns to transform peptide sequences into statistical vectors, which are then fed to classifiers that discriminate peptide sequences and annotate peptide functional classes. Overall, these predictors still fail to extract diverse types of amino acid distribution patterns from peptide sequences. This paper presents a unique CARE encoder that transforms peptide sequences into statistical vectors by extracting 4 different types of distribution patterns: correlation, distribution, composition, and transition. On public benchmark datasets, the potential of the proposed encoder is explored under two evaluation settings, intrinsic and extrinsic. Extrinsic evaluation indicates that 12 different machine learning classifiers achieve superior performance with the proposed encoder compared to 55 existing encoders. Furthermore, intrinsic evaluation reveals that, unlike existing encoders, the proposed encoder generates more discriminative clusters for the ACP and non-ACP classes. Across 8 public benchmark ACP and non-ACP classification datasets, the CAPTURE predictor, based on the proposed encoder and an AdaBoost classifier, outperforms existing predictors by an average of 1%, 4%, and 2% in accuracy, recall, and MCC, respectively. In a generalizability case study across 7 benchmark anti-microbial peptide classification datasets, CAPTURE surpasses existing predictors by an average AU-ROC of 2%. The CAPTURE predictive pipeline, combined with the label powerset method, outperforms the state-of-the-art ACP functional type predictor by 5%, 5%, 5%, 6%, and 3% in terms of average accuracy, subset accuracy, precision, recall, and F1, respectively. The CAPTURE web application is available at https://sds_genetic_analysis.opendfki.de/CAPTURE.
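The label powerset method mentioned in this abstract turns a multi-label problem into a single-label multi-class one by treating each distinct combination of labels as its own class. A minimal sketch of that transformation follows; the helper names and the functional-type labels are hypothetical and are not taken from CAPTURE.

```python
def label_powerset_fit(label_sets):
    """Map each distinct label combination to one integer class id."""
    combo_to_class = {}
    for labels in label_sets:
        key = frozenset(labels)
        if key not in combo_to_class:
            combo_to_class[key] = len(combo_to_class)
    return combo_to_class

def label_powerset_transform(label_sets, combo_to_class):
    """Encode each label set as its single class id."""
    return [combo_to_class[frozenset(labels)] for labels in label_sets]

def label_powerset_inverse(class_ids, combo_to_class):
    """Decode class ids back into label sets."""
    class_to_combo = {c: combo for combo, c in combo_to_class.items()}
    return [set(class_to_combo[c]) for c in class_ids]

# Hypothetical functional-type labels, for illustration only.
y = [{"breast"}, {"lung", "breast"}, {"breast"}, {"colon"}]
mapping = label_powerset_fit(y)
y_single = label_powerset_transform(y, mapping)
y_back = label_powerset_inverse(y_single, mapping)
```

Any ordinary multi-class classifier can then be trained on `y_single`, and its predictions decoded back into label sets with `label_powerset_inverse`.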
Affiliation(s)
- Hina Ghafoor
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany.
- Muhammad Ali Ibrahim
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Andreas Dengel
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
7
Liao YH, Chen SZ, Bin YN, Zhao JP, Feng XL, Zheng CH. UsIL-6: An unbalanced learning strategy for identifying IL-6 inducing peptides by undersampling technique. Comput Methods Programs Biomed 2024; 250:108176. [PMID: 38677081 DOI: 10.1016/j.cmpb.2024.108176] [Received: 08/18/2022] [Revised: 03/26/2024] [Accepted: 04/11/2024] [Indexed: 04/29/2024]
Abstract
BACKGROUND AND OBJECTIVE: Interleukin-6 (IL-6) is a critical factor for early warning, monitoring, and prognosis in the inflammatory storm of COVID-19 cases. IL-6 inducing peptides, which can induce cytokine IL-6 production, are very important for the development of diagnosis and immunotherapy. Although existing methods have had some success in predicting IL-6 inducing peptides, there is still room for improvement in the performance of these models in practical application. METHODS: In this study, we proposed UsIL-6, a high-performance bioinformatics tool for identifying IL-6 inducing peptides. First, we extracted five groups of physicochemical properties and sequence structural information from IL-6 inducing peptide sequences and obtained a 636-dimensional feature vector. We also employed the NearMiss3 undersampling method and the StandardScaler normalization method to process the data. Then, a 40-dimensional optimal feature vector was obtained by the Boruta feature selection method. Finally, we combined this feature vector with an extremely randomized trees classifier to build the final model, UsIL-6. RESULTS: The AUC value of UsIL-6 on the independent test dataset was 0.87, and the BACC value was 0.808, which indicated that UsIL-6 had better performance than the existing methods in IL-6 inducing peptide recognition. CONCLUSIONS: The performance comparison on the independent test dataset confirmed that UsIL-6 achieved the highest performance, best robustness, and strongest generalization ability. We hope that UsIL-6 will become a valuable method to identify, annotate and characterize new IL-6 inducing peptides.
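Because this abstract deals with an unbalanced dataset and reports balanced accuracy (BACC), a short sketch of that metric may help. It uses the standard definition, the mean of sensitivity and specificity, and assumes binary labels with both classes present; the toy labels below are invented for illustration and are not data from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # assumes at least one positive sample
    specificity = tn / (tn + fp)  # assumes at least one negative sample
    return (sensitivity + specificity) / 2

# Invented toy labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
bacc = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy example plain accuracy is 0.8 while balanced accuracy is 0.6875, which shows why the balanced form is preferred when one class dominates.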
Affiliation(s)
- Yan-Hong Liao
- School of Mathematics and System Science, Xinjiang University, Urumqi, Xinjiang 830017, China
- Shou-Zhi Chen
- School of Mathematics and System Science, Xinjiang University, Urumqi, Xinjiang 830017, China
- Yan-Nan Bin
- School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China
- Jian-Ping Zhao
- School of Mathematics and System Science, Xinjiang University, Urumqi, Xinjiang 830017, China.
- Xin-Long Feng
- School of Mathematics and System Science, Xinjiang University, Urumqi, Xinjiang 830017, China.
- Chun-Hou Zheng
- School of Mathematics and System Science, Xinjiang University, Urumqi, Xinjiang 830017, China; School of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China
8
Guan J, Yao L, Xie P, Chung CR, Huang Y, Chiang YC, Lee TY. A two-stage computational framework for identifying antiviral peptides and their functional types based on contrastive learning and multi-feature fusion strategy. Brief Bioinform 2024; 25:bbae208. [PMID: 38706321 PMCID: PMC11070730 DOI: 10.1093/bib/bbae208] [Received: 02/04/2024] [Revised: 03/14/2024] [Accepted: 04/17/2024] [Indexed: 05/07/2024]
Abstract
Antiviral peptides (AVPs) have shown potential in inhibiting viral attachment, preventing viral fusion with host cells and disrupting viral replication due to their unique action mechanisms. They have now become a broad-spectrum, promising antiviral therapy. However, identifying effective AVPs is traditionally slow and costly. This study proposed a new two-stage computational framework for AVP identification. The first stage identifies AVPs from a wide range of peptides, and the second stage recognizes AVPs targeting specific families or viruses. This method integrates contrastive learning and multi-feature fusion strategy, focusing on sequence information and peptide characteristics, significantly enhancing predictive ability and interpretability. The evaluation results of the model show excellent performance, with accuracy of 0.9240 and Matthews correlation coefficient (MCC) score of 0.8482 on the non-AVP independent dataset, and accuracy of 0.9934 and MCC score of 0.9869 on the non-AMP independent dataset. Furthermore, our model can predict antiviral activities of AVPs against six key viral families (Coronaviridae, Retroviridae, Herpesviridae, Paramyxoviridae, Orthomyxoviridae, Flaviviridae) and eight viruses (FIV, HCV, HIV, HPIV3, HSV1, INFVA, RSV, SARS-CoV). Finally, to facilitate user accessibility, we built a user-friendly web interface deployed at https://awi.cuhk.edu.cn/~dbAMP/AVP/.
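The Matthews correlation coefficient (MCC) reported in this abstract can be computed directly from the four confusion-matrix counts. The sketch below uses the standard formula; the counts are invented for illustration and are not the study's data, and returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Invented confusion-matrix counts, for illustration only.
mcc = matthews_corrcoef(tp=90, tn=95, fp=5, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes differ in size.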
Affiliation(s)
- Jiahui Guan
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Lantian Yao
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- School of Science and Engineering, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Peilin Xie
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Chia-Ru Chung
- Department of Computer Science and Information Engineering, National Central University, 320317 Taoyuan, Taiwan
- Yixian Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Ying-Chih Chiang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, 2001 Longxiang Road, 518172 Shenzhen, China
- Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, 300093 Hsinchu, Taiwan
- Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, 300093 Hsinchu, Taiwan
9
Karim T, Shaon MSH, Sultan MF, Hasan MZ, Kafy AA. ANNprob-ACPs: A novel anticancer peptide identifier based on probabilistic feature fusion approach. Comput Biol Med 2024; 169:107915. [PMID: 38171261 DOI: 10.1016/j.compbiomed.2023.107915] [Received: 08/20/2023] [Revised: 12/28/2023] [Accepted: 12/29/2023] [Indexed: 01/05/2024]
Abstract
Anticancer peptides (ACPs) offer significant potential as cancer treatment drugs. Quickly identifying active compounds from protein sequences is crucial for healthcare and cancer treatment. In this paper, ANNprob-ACPs, a novel and effective model for detecting ACPs, has been implemented based on nine feature encoding techniques: AAC, CC, W2V, DPC, PAAC, QSO, CTDC, CTDT, and CKSAAGP. After analyzing the performance of several machine learning models, the six best models were selected based on their overall performance across every evaluation metric. The probability scores of each model were subsequently aggregated and used as input to our meta-model, called ANNprob-ACPs. Our model outperformed all others, showing its potential for highly effective identification of ACPs. The results of this study showed notable improvement in 10-fold cross-validation and independent testing, with accuracies of 93.72% and 90.62%, respectively. Our proposed model, ANNprob-ACPs, outperformed existing approaches in terms of accuracy and effectiveness in discovering ACPs. Using SHAP, this study found that the physicochemical properties of QSO and the compositional properties of DPC, AAC, and PAAC have the greatest impact on the model's performance; these properties have a major impact on a drug's interactions and future discoveries. Consequently, this model has a high probability of detecting ACPs more frequently. We developed a web server of ANNprob-ACPs, which is accessible at ANNprob-ACPs webserver.
Affiliation(s)
- Tasmin Karim
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
- Md Shazzad Hossain Shaon
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
- Md Fahim Sultan
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Md Zahid Hasan
- Department of Computer Science & Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Abdulla-Al Kafy
- Department of Urban & Regional Planning, Rajshahi University of Engineering & Technology (RUET), Rajshahi, 6204, Bangladesh.
| |
|
10
|
Yan J, Zhang B, Zhou M, Campbell-Valois FX, Siu SWI. A deep learning method for predicting the minimum inhibitory concentration of antimicrobial peptides against Escherichia coli using Multi-Branch-CNN and Attention. mSystems 2023; 8:e0034523. [PMID: 37431995 PMCID: PMC10506472 DOI: 10.1128/msystems.00345-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2023] [Accepted: 05/31/2023] [Indexed: 07/12/2023] Open
Abstract
Antimicrobial peptides (AMPs) are a promising alternative to antibiotics to combat drug resistance in pathogenic bacteria. However, the development of AMPs with high potency and specificity remains a challenge, and new tools to evaluate antimicrobial activity are needed to accelerate the discovery process. Therefore, we proposed MBC-Attention, a combination of a multi-branch convolution neural network architecture and attention mechanisms to predict the experimental minimum inhibitory concentration of peptides against Escherichia coli. The optimal MBC-Attention model achieved an average Pearson correlation coefficient (PCC) of 0.775 and a root mean squared error (RMSE) of 0.533 (log μM) in three independent tests of randomly drawn sequences from the data set. This results in a 5-12% improvement in PCC and a 6-13% improvement in RMSE compared to 17 traditional machine learning models and 2 optimally tuned models using random forest and support vector machine. Ablation studies confirmed that the two proposed attention mechanisms, global attention and local attention, contributed largely to performance improvement. IMPORTANCE Antimicrobial peptides (AMPs) are potential candidates for replacing conventional antibiotics to combat drug resistance in pathogenic bacteria. Therefore, it is necessary to evaluate the antimicrobial activity of AMPs quantitatively. However, wet-lab experiments are labor-intensive and time-consuming. To accelerate the evaluation process, we develop a deep learning method called MBC-Attention to regress the experimental minimum inhibitory concentration of AMPs against Escherichia coli. The proposed model outperforms traditional machine learning methods. Data, scripts to reproduce experiments, and the final production models are available on GitHub.
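The two metrics reported in this abstract, the Pearson correlation coefficient (PCC) and the root mean squared error (RMSE), can be computed as in the minimal sketch below. The function names and the toy values are illustrative assumptions and are not taken from the MBC-Attention code.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Covariance and variances without the 1/n factor (it cancels in the ratio).
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("PCC is undefined when one input is constant")
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    if n != len(y_pred) or n == 0:
        raise ValueError("inputs must have the same non-zero length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Hypothetical log-scale MIC values (experimental vs. predicted).
experimental = [0.5, 1.2, 2.0, 0.9]
predicted = [0.7, 1.0, 1.8, 1.1]
pcc_value = pearson_r(experimental, predicted)
rmse_value = rmse(experimental, predicted)
```

Because both the experimental and predicted values would be on the same log scale in such a setting, the RMSE is expressed in the same log units as the targets.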
Affiliation(s)
- Jielu Yan
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Bob Zhang
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Mingliang Zhou
- School of Computer Science, Chongqing University, Shapingba, Chongqing, China
| | - François-Xavier Campbell-Valois
- Host-Microbe Interactions Laboratory, Center for Chemical and Synthetic Biology, Department of Chemistry and Biomolecular Sciences, University of Ottawa, Ottawa, Ontario, Canada
- Centre for Infection, Immunity, and Inflammation, University of Ottawa, Ottawa, Ontario, Canada
- Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario, Canada
| | - Shirley W. I. Siu
- Institute of Science and Environment, University of Saint Joseph, Macau, China
| |
|
11
|
Xie L, Xie L. Elucidation of genome-wide understudied proteins targeted by PROTAC-induced degradation using interpretable machine learning. PLoS Comput Biol 2023; 19:e1010974. [PMID: 37590332 PMCID: PMC10464998 DOI: 10.1371/journal.pcbi.1010974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2023] [Revised: 08/29/2023] [Accepted: 07/27/2023] [Indexed: 08/19/2023] Open
Abstract
Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules that induce the degradation of target proteins by recruiting an E3 ligase. PROTACs have the potential to inactivate disease-related genes that are considered undruggable by small molecules, making them a promising therapy for the treatment of incurable diseases. However, only a few hundred proteins have been experimentally tested for their amenability to PROTACs, and it remains unclear which other proteins in the entire human genome can be targeted by PROTACs. In this study, we have developed PrePROTAC, an interpretable machine learning model based on a transformer-based protein sequence descriptor and random forest classification. PrePROTAC predicts genome-wide targets that can be degraded by CRBN, one of the E3 ligases. In the benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, an average precision of 0.84, and over 40% sensitivity at a false positive rate of 0.05. When evaluated by an external test set which comprised proteins from different structural folds than those in the training set, the performance of PrePROTAC did not drop significantly, indicating its generalizability. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method, which extends conventional SHAP analysis for original features to an embedding space through in silico mutagenesis. This method allowed us to identify key residues in the protein structure that play critical roles in PROTAC activity. The identified key residues were consistent with existing knowledge. Using PrePROTAC, we identified over 600 novel understudied proteins that are potentially degradable by CRBN and proposed PROTAC compounds for three novel drug targets associated with Alzheimer's disease.
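The "sensitivity at a false positive rate of 0.05" figure quoted in this abstract can be illustrated with the following minimal sketch. It scans every score threshold and keeps the best sensitivity whose false positive rate stays within the limit. The function name, the `max_fpr` parameter, and the toy data are assumptions for illustration, not part of PrePROTAC.

```python
def sensitivity_at_fpr(labels, scores, max_fpr=0.05):
    """Highest sensitivity (true positive rate) reachable with FPR <= max_fpr.

    labels: iterable of 0/1 ground-truth labels.
    scores: iterable of classifier scores, higher meaning "more positive".
    """
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    best = 0.0
    # Try each distinct score as a decision threshold (predict 1 if score >= thr).
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Toy example with hypothetical scores.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
strict = sensitivity_at_fpr(toy_labels, toy_scores, max_fpr=0.0)
relaxed = sensitivity_at_fpr(toy_labels, toy_scores, max_fpr=0.34)
```

The quadratic scan is kept for readability; for large datasets a single pass over the sorted scores would be the more usual choice.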
Affiliation(s)
- Li Xie
- Department of Computer Science, Hunter College, The City University of New York, New York City, New York, United States of America
| | - Lei Xie
- Department of Computer Science, Hunter College, The City University of New York, New York City, New York, United States of America
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York City, New York, United States of America
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York City, New York, United States of America
| |
|
12
|
Xie L, Xie L. Elucidation of Genome-wide Understudied Proteins targeted by PROTAC-induced degradation using Interpretable Machine Learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.23.529828. [PMID: 36865212 PMCID: PMC9980153 DOI: 10.1101/2023.02.23.529828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/03/2023]
Abstract
Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules. They induce the degradation of a target protein by recruiting an E3 ligase to the target. PROTACs can inactivate disease-related genes that are considered understudied, and thus have great potential as a new type of therapy for the treatment of incurable diseases. However, only hundreds of proteins have been experimentally tested for their amenability to PROTACs. It remains unclear which other proteins in the entire human genome can be targeted by PROTACs. For the first time, we have developed an interpretable machine learning model, PrePROTAC, which is based on a transformer-based protein sequence descriptor and random forest classification, to predict genome-wide PROTAC-induced targets degradable by CRBN, one of the E3 ligases. In the benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, a PR-AUC of 0.84, and over 40% sensitivity at a false positive rate of 0.05. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method to identify positions in the protein structure that play key roles in PROTAC activity. The key residues identified were consistent with our existing knowledge. We applied PrePROTAC to identify more than 600 novel understudied proteins that are potentially degradable by CRBN, and proposed PROTAC compounds for three novel drug targets associated with Alzheimer's disease.
Author Summary: Many human diseases remain incurable because disease-causing genes cannot be selectively and effectively targeted by small molecules. The proteolysis-targeting chimera (PROTAC), an organic compound that binds to both a target and a degradation-mediating E3 ligase, has emerged as a promising approach to selectively target disease-driving genes that are not druggable by small molecules. Nevertheless, not all proteins can be accommodated by E3 ligases and be effectively degraded. Knowledge of the degradability of a protein will be crucial for the design of PROTACs. However, only hundreds of proteins have been experimentally tested for their amenability to PROTACs. It remains unclear which other proteins in the entire human genome can be targeted by PROTACs. In this paper, we propose an interpretable machine learning model, PrePROTAC, that takes advantage of powerful protein language modeling. PrePROTAC achieves high accuracy when evaluated on an external dataset drawn from gene families different from those of the proteins in the training data, suggesting the generalizability of PrePROTAC. We apply PrePROTAC to the human genome and identify more than 600 understudied proteins that are potentially responsive to PROTACs. Furthermore, we design three PROTAC compounds for novel drug targets associated with Alzheimer's disease.
Affiliation(s)
- Li Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
| | - Lei Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, 10016, USA
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, 10021, USA
| |
|
13
|
Wei Z, Liu X, Yan R, Sun G, Yu W, Liu Q, Guo Q. Pixel-level multimodal fusion deep networks for predicting subcellular organelle localization from label-free live-cell imaging. Front Genet 2022; 13:1002327. [PMID: 36386823 PMCID: PMC9644055 DOI: 10.3389/fgene.2022.1002327] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2022] [Accepted: 09/26/2022] [Indexed: 01/25/2023] Open
Abstract
Complex intracellular organizations are commonly represented by dividing the metabolic process of cells into different organelles. Therefore, identifying sub-cellular organelle architecture is significant for understanding intracellular structural properties, specific functions, and biological processes in cells. However, the discrimination of these structures in the natural organizational environment and their functional consequences are not clear. In this article, we propose a new pixel-level multimodal fusion (PLMF) deep network that predicts the location of cellular organelles from label-free cell optical microscopy images, followed by deep-learning-based automated image denoising. It can help improve the specificity of label-free cell optical microscopy by using the Transformer-Unet network to predict the ground-truth imaging that corresponds to different sub-cellular organelle architectures. The proposed prediction method combines the advantages of a transformer's global prediction with a CNN's ability to analyze local detail in the background features of label-free cell optical microscopy images, so as to improve the prediction accuracy. Our experimental results showed that the PLMF network can achieve a Pearson's correlation coefficient (PCC) of over 0.91 between estimated and true fractions on lung cancer cell-imaging datasets. In addition, we applied the PLMF network method to cell images for label-free prediction of several different subcellular components simultaneously, rather than using several fluorescent labels. These results open up a new way for the time-resolved study of subcellular components in different cells, especially cancer cells.
Affiliation(s)
- Zhihao Wei
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
| | - Xi Liu
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
| | - Ruiqing Yan
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
| | - Guocheng Sun
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China; School of Mechanical Engineering & Hydrogen Energy Research Centre, Beijing Institute of Petrochemical Technology, Beijing, China
| | - Weiyong Yu
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
| | - Qiang Liu
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China
| | - Qianjin Guo
- Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China; School of Mechanical Engineering & Hydrogen Energy Research Centre, Beijing Institute of Petrochemical Technology, Beijing, China. Correspondence: Qianjin Guo
| |
|
14
|
Yi W, Sun A, Liu M, Liu X, Zhang W, Dai Q. Comparative Study on Feature Selection in Protein Structure and Function Prediction. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:1650693. [PMID: 36267316 PMCID: PMC9578875 DOI: 10.1155/2022/1650693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/27/2022] [Accepted: 09/14/2022] [Indexed: 11/18/2022]
Abstract
Many effective methods extract and fuse different protein features to study the relationship between protein sequence, structure, and function, but different methods have different strengths across protein structure and function problems, so valuable and contributing features must be selected to design more effective prediction methods. This work focused on feature selection methods in the study of protein structure and function, and systematically compared and analyzed the efficiency of different feature selection methods in the prediction of protein structures, protein disorders, protein molecular chaperones, and protein solubility. The results show that the feature selection method based on nonlinear SVM performs best in protein structure prediction, protein molecular chaperone prediction, and protein solubility prediction. After selection, the accuracy of the features is improved by 13.16% to 71%, especially for the Kmer features and PSSM features of proteins.
Affiliation(s)
- Wenjing Yi
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Ao Sun
- College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Manman Liu
- College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Wei Zhang
- College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| |
|
15
|
Multiple Parallel Fusion Network for Predicting Protein Subcellular Localization from Stimulated Raman Scattering (SRS) Microscopy Images in Living Cells. Int J Mol Sci 2022; 23:ijms231810827. [PMID: 36142736 PMCID: PMC9504098 DOI: 10.3390/ijms231810827] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Revised: 09/10/2022] [Accepted: 09/13/2022] [Indexed: 11/23/2022] Open
Abstract
Stimulated Raman Scattering Microscopy (SRS) is a powerful tool for label-free detailed recognition and investigation of the cellular and subcellular structures of living cells. Determining subcellular protein localization from the cell level of SRS images is one of the basic goals of cell biology, which can not only provide useful clues for their functions and biological processes but also help to determine the priority and select the appropriate target for drug development. However, the bottleneck in predicting subcellular protein locations of SRS cell imaging lies in modeling complicated relationships concealed beneath the original cell imaging data owing to the spectral overlap information from different protein molecules. In this work, a multiple parallel fusion network, MPFnetwork, is proposed to study the subcellular locations from SRS images. This model used a multiple parallel fusion model to construct feature representations and combined multiple nonlinear decomposing algorithms as the automated subcellular detection method. Our experimental results showed that the MPFnetwork could achieve over 0.93 dice correlation between estimated and true fractions on SRS lung cancer cell datasets. In addition, we applied the MPFnetwork method to cell images for label-free prediction of several different subcellular components simultaneously, rather than using several fluorescent labels. These results open up a new method for the time-resolved study of subcellular components in different cells, especially cancer cells.
|
16
|
Yan J, Zhang B, Zhou M, Kwok HF, Siu SWI. Multi-Branch-CNN: Classification of ion channel interacting peptides using multi-branch convolutional neural network. Comput Biol Med 2022; 147:105717. [PMID: 35752114 DOI: 10.1016/j.compbiomed.2022.105717] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2022] [Revised: 05/18/2022] [Accepted: 06/05/2022] [Indexed: 11/03/2022]
Abstract
Ligand peptides that have high affinity for ion channels are critical for regulating ion flux across the plasma membrane. These peptides are now being considered as potential drug candidates for many diseases, such as cardiovascular disease and cancers. In this work, we developed Multi-Branch-CNN, a CNN method with multiple input branches for identifying three types of ion channel peptide binders (sodium, potassium, and calcium) from intra- and inter-feature types. As for its real-world applications, prediction models that are able to recognize novel sequences having high or low similarities to training sequences are required. To this end, we tested our models on two test sets: a general test set including sequences spanning different similarity levels to those of the training set, and a novel-test set consisting of only sequences that bear little resemblance to sequences from the training set. Our experiments showed that the Multi-Branch-CNN method performs better than thirteen traditional ML algorithms (TML13), yielding an improvement in accuracy of 3.2%, 1.2%, and 2.3% on the test sets as well as 8.8%, 14.3%, and 14.6% on the novel-test sets for sodium, potassium, and calcium ion channels, respectively. We confirmed the effectiveness of Multi-Branch-CNN by comparing it to the standard CNN method with one input branch (Single-Branch-CNN) and an ensemble method (TML13-Stack). The data sets, script files to reproduce the experiments, and the final predictive models are freely available at https://github.com/jieluyan/Multi-Branch-CNN.
Affiliation(s)
- Jielu Yan
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China
| | - Bob Zhang
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China.
| | - Mingliang Zhou
- School of Computer Science, Chongqing University, Shapingba, Chongqing, China
| | - Hang Fai Kwok
- Department of Biomedical Sciences, Faculty of Health Sciences, University of Macau, Taipa, Macao Special Administrative Region of China.
| | - Shirley W I Siu
- Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China; Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macao Special Administrative Region of China.
| |
|
17
|
Pan G, Sun C, Liao Z, Tang J. Machine and Deep Learning for Prediction of Subcellular Localization. Methods Mol Biol 2022; 2361:249-261. [PMID: 34236666 DOI: 10.1007/978-1-0716-1641-3_15] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
Protein subcellular localization prediction (PSLP), which plays an important role in the field of computational biology, identifies the position and function of proteins in cells without high cost or laborious effort. In the past few decades, various methods with different algorithms have been proposed to solve the problem of subcellular localization prediction; machine learning and deep learning account for a large portion of those methods. To provide an overview of those methods, the first part of this article briefly reviews several state-of-the-art machine learning methods for subcellular localization prediction; then the materials used for subcellular localization prediction are described, and a simple prediction method that takes protein sequences as input and uses a convolutional neural network as the classifier is introduced. Finally, a list of notes is provided to indicate the major problems that may occur with this method.
Affiliation(s)
- Gaofeng Pan
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
| | - Chao Sun
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
| | - Zijun Liao
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA; Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fujian, China
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA; School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
| |
|
18
|
Li H, Tamang T, Nantasenamat C. Toward insights on antimicrobial selectivity of host defense peptides via machine learning model interpretation. Genomics 2021; 113:3851-3863. [PMID: 34480984 DOI: 10.1016/j.ygeno.2021.08.023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2020] [Revised: 08/22/2021] [Accepted: 08/25/2021] [Indexed: 10/20/2022]
Abstract
Host defense peptides are promising candidates for the development of novel antibiotics. To realize their therapeutic potential, high levels of target selectivity are essential. This study aims to identify factors governing selectivity by using the random forest algorithm to correlate peptide sequence information with bioactivity data. Satisfactory predictive models were obtained from out-of-bag prediction, yielding accuracies and Matthews correlation coefficients in excess of 0.80 and 0.57, respectively. Model interpretation using variable importance metrics and partial dependence plots indicated that selectivity was heavily influenced by the composition and distribution patterns of parameters related to molecular charge and solubility. Furthermore, the three investigated bacterial target species (Escherichia coli, Pseudomonas aeruginosa and Staphylococcus aureus) likely had a significant influence on how selectivity was realized: there appears to be a similar underlying selectivity mechanism based on charge-solubility properties, but one that is tailored to the target in question.
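For readers unfamiliar with the Matthews correlation coefficient (MCC) reported alongside accuracy in this abstract, the following minimal sketch computes it from binary confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; 0.0 is returned when any marginal total is
    zero, which is a common convention for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 45 true positives, 35 true negatives,
# 15 false positives, 5 false negatives (accuracy 0.80).
example_accuracy = (45 + 35) / (45 + 35 + 15 + 5)
example_mcc = mcc(tp=45, tn=35, fp=15, fn=5)
```

Unlike accuracy, MCC uses all four confusion-matrix cells, which is why the two metrics are often reported together for imbalanced classes.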
Affiliation(s)
- Hao Li
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Thinam Tamang
- Madan Bhandari Memorial College, Institute of Science and Technology, Tribhuvan University, Kathmandu 44602, Nepal
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
| |
|
19
|
Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, Akutsu T, Daly RJ, Webb GI, Zhao Q, Kurgan L, Song J. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021; 49:e60. [PMID: 33660783 PMCID: PMC8191785 DOI: 10.1093/nar/gkab122] [Citation(s) in RCA: 157] [Impact Index Per Article: 39.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2020] [Revised: 02/05/2021] [Accepted: 02/25/2021] [Indexed: 12/14/2022] Open
Abstract
Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
| | - Dongxu Xiang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Key Laboratory of Cancer Prevention and Therapy, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300060, China
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Roger J Daly
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Quanzhi Zhao
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China; Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou 450046, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| |
|
20
|
Ding Y, Tang J, Guo F. Human protein subcellular localization identification via fuzzy model on Kernelized Neighborhood Representation. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106596] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
|
21
|
Sahu SS, Loaiza CD, Kaundal R. Plant-mSubP: a computational framework for the prediction of single- and multi-target protein subcellular localization using integrated machine-learning approaches. AOB PLANTS 2020; 12:plz068. [PMID: 32528639 PMCID: PMC7274489 DOI: 10.1093/aobpla/plz068] [Citation(s) in RCA: 61] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/28/2019] [Accepted: 10/11/2019] [Indexed: 05/18/2023]
Abstract
The subcellular localization of proteins is very important for characterizing their function in a cell. Accurate computational prediction of subcellular locations has been an active area of interest. Most of the work has focused on single-localization prediction. Only a few studies have addressed multi-target localization, and these have not yet achieved good accuracy; in plant sciences, very limited work has been done. Here we report the development of a novel tool, Plant-mSubP, which is based on integrated machine learning approaches to efficiently predict subcellular localizations in plant proteomes. The proposed approach predicts with high accuracy 11 single localizations and three dual locations of the plant cell. Several features based on the composition and physicochemical properties of a protein, such as amino acid composition, pseudo amino acid composition, auto-correlation descriptors, quasi-sequence-order descriptors and hybrid features, are used to represent the protein. The performance of the proposed method has been assessed on a training set as well as an independent test set. Using the hybrid feature of the pseudo amino acid composition, N-Center-C terminal amino acid composition and the dipeptide composition (PseAAC-NCC-DIPEP), an overall accuracy of 81.97 %, 84.75 % and 87.88 % is achieved on the training data sets of single-label, combined single- and dual-label, and dual-label proteins, respectively. When tested on the independent data, an accuracy of 64.36 %, 64.84 % and 81.08 % is achieved on the single-label, combined single- and dual-label, and dual-label proteins, respectively. The prediction models have been implemented on a web server available at http://bioinfo.usu.edu/Plant-mSubP/. The results indicate that the proposed approach is comparable to the existing methods in single localization prediction and outperforms all other existing tools when compared for dual-label proteins. The prediction tool will be a useful resource for better annotation of various plant proteomes.
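The composition-based part of the hybrid feature named in this abstract can be illustrated with a minimal Python sketch. It is not the Plant-mSubP implementation: the function names, the terminal segment length of 25 residues and the handling of non-standard residues are assumptions made for illustration, and the PseAAC component is omitted.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    counts = Counter(r for r in seq.upper() if r in AMINO_ACIDS)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[a] / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair (400 values).

    Pairs containing a non-standard residue are ignored (an assumption).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DIPEPTIDES) or 1
    return [counts[d] / total for d in DIPEPTIDES]


def ncc_composition(seq, terminal=25):
    """AAC of the N-terminal, central and C-terminal segments (60 values).

    The segment length `terminal` is a hypothetical default, not a value
    taken from the paper.
    """
    seq = seq.upper()
    segments = (seq[:terminal], seq[terminal:-terminal], seq[-terminal:])
    return [value for segment in segments for value in aac(segment)]


def hybrid_features(seq, terminal=25):
    """Concatenate NCC and dipeptide composition into one 460-value vector."""
    return ncc_composition(seq, terminal) + dipeptide_composition(seq)
```

Concatenation is the simplest way to build a hybrid descriptor; a classifier would then be trained on these fixed-length vectors regardless of sequence length.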
Affiliation(s)
- Sitanshu S Sahu
- Department of Electronics and Communication Engineering, Birla Institute of Technology, Mesra, Ranchi, India
| | - Cristian D Loaiza
- Department of Plants, Soils, and Climate/Center for Integrated BioSystems, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA
| | - Rakesh Kaundal
- Department of Plants, Soils, and Climate/Center for Integrated BioSystems, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
| |
|
22
|
Shen Y, Ding Y, Tang J, Zou Q, Guo F. Critical evaluation of web-based prediction tools for human protein subcellular localization. Brief Bioinform 2019; 21:1628-1640. [DOI: 10.1093/bib/bbz106] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 05/07/2019] [Revised: 07/23/2019] [Accepted: 07/27/2019] [Indexed: 11/12/2022]
Abstract
Human protein subcellular localization has important research value for understanding biological processes, elucidating protein functions and identifying drug targets. Over the past decade, a number of protein subcellular localization prediction tools have been designed and made freely available online. The purpose of this paper is to summarize recent progress in research on the subcellular localization of human proteins, including commonly used data sets proposed in earlier work and the performance of all selected prediction tools against the same benchmark data set. We carry out a systematic evaluation of several publicly available subcellular localization prediction methods on various benchmark data sets. Among them, we find that mLASSO-Hum and pLoc-mHum provide a statistically significant improvement in accuracy relative to the other methods. Meanwhile, we build a new data set using the latest version of the UniProt database and construct a new GO-based prediction method, HumLoc-LBCI. We then test all selected prediction tools on the new data set. Finally, we discuss possible directions for the development of human protein subcellular localization prediction. Availability: The codes and data are available from http://www.lbci.cn/syn/.
Affiliation(s)
- Yinan Shen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- School of Computational Science and Engineering, University of South Carolina, Columbia, USA
- Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
|
23
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely based on experiments. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Considerable progress has been achieved in this regard. This review focuses on those methods that can deal with multi-label proteins, which may simultaneously exist in two or more subcellular location sites. Protein molecules with this characteristic are vitally important for finding multi-target drugs, a current trend in drug development. The review also focuses on those methods that have user-friendly web servers, so that the majority of experimental scientists can use them to get the desired results without needing to go through the detailed mathematics involved.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
|
25
|
Han GS, Yu ZG. ML-rRBF-ECOC: A Multi-Label Learning Classifier for Predicting Protein Subcellular Localization with Both Single and Multiple Sites. CURR PROTEOMICS 2019. [DOI: 10.2174/1570164616666190103143945] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
Background:
The subcellular localization of a protein is closely related to its functions and interactions. Growing evidence shows that proteins may simultaneously exist at, or move between, two or more different subcellular localizations. Therefore, predicting protein subcellular localization is an important but challenging problem.
Observation:
Most of the existing methods for predicting protein subcellular localization assume that a protein is located at a single site. Although a few methods have been proposed to deal with proteins with multiple sites, correlations between subcellular localizations are not efficiently taken into account. In this paper, we propose an integrated method for predicting protein subcellular localizations with both single and multiple sites.
Methods:
First, we extend the Multi-Label Radial Basis Function (ML-RBF) method to a regularized version, and augment the first layer of ML-RBF to take local correlations between subcellular localizations into account. Second, we embed the modified ML-RBF into a multi-label Error-Correcting Output Codes (ECOC) method in order to further consider the dependency between subcellular localizations. We name our method ML-rRBF-ECOC. Finally, the performance of ML-rRBF-ECOC is evaluated on three benchmark datasets.
Results:
The results demonstrate that ML-rRBF-ECOC is highly competitive with the related multi-label learning method and some state-of-the-art methods for predicting protein subcellular localizations with multiple sites. Considering the dependency between subcellular localizations can improve prediction performance.
Conclusion:
This also indicates that correlations between different subcellular localizations really exist. Our method at least plays a complementary role to existing methods for predicting protein subcellular localizations with multiple sites.
Affiliation(s)
- Guo-Sheng Han
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
| | - Zu-Guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China
| |
|
26
|
Yao Y, Li M, Xu H, Yan S, He P, Dai Q, Qi Z, Liao B. Protein Subcellular Localization Prediction based on PSI-BLAST Profile and Principal Component Analysis. CURR PROTEOMICS 2019. [DOI: 10.2174/1570164616666190126155744] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background:
Prediction of protein subcellular location is a meaningful task that has attracted much attention in recent years. In particular, the number of new protein sequences yielded by high-throughput sequencing technology in the post-genomic era has increased explosively.
Objective:
Protein subcellular localization prediction based solely on sequence data remains a challenging problem in computational biology.
Methods:
In this paper, three sets of evolutionary features are derived from the position-specific scoring matrix, which has shown great potential in other bioinformatics problems. A fusion model is built using the optimal parameter combination. Finally, principal component analysis and a support vector machine classifier are applied to predict protein subcellular localization on the NNPSL dataset and the Cell-PLoc 2.0 dataset.
Results:
Our experimental results show that the proposed method remarkably improves the prediction accuracy, and that features derived from the PSI-BLAST profile alone are appropriate for protein subcellular localization prediction.
Affiliation(s)
- Yuhua Yao
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
| | - Manzhi Li
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
| | - Huimin Xu
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Shoujiang Yan
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Pingan He
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
| | - Zhaohui Qi
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
| | - Bo Liao
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
| |
|
27
|
Chou KC, Cheng X, Xiao X. pLoc_bal-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by General PseAAC and Quasi-balancing Training Dataset. Med Chem 2019; 15:472-485. [DOI: 10.2174/1573406415666181218102517] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 08/28/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/24/2022]
Abstract
Background/Objective: Information on protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, there is high demand for powerful bioinformatics tools that can identify their subcellular localization in a timely and effective manner based on sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more effort is needed to further improve it, because pLoc-mEuk was trained on an extremely skewed dataset in which some subsets were about 200 times the size of others. Accordingly, it cannot avoid the bias caused by such an uneven training dataset.
Methods: To alleviate such bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems.
Results: To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/.
Conclusion: It is anticipated that the pLoc_bal-mEuk predictor holds very high potential to become a useful high-throughput tool in identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very hot trend in drug development.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
|
28
|
Xi B, Tao J, Liu X, Xu X, He P, Dai Q. RaaMLab: A MATLAB toolbox that generates amino acid groups and reduced amino acid modes. Biosystems 2019; 180:38-45. [PMID: 30904554 DOI: 10.1016/j.biosystems.2019.03.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/25/2018] [Revised: 12/25/2018] [Accepted: 03/06/2019] [Indexed: 01/31/2023]
Abstract
Amino acid (AA) classifications and their different biophysical and chemical characteristics have been widely applied to analyze and predict the structural, functional, expression and interaction profiles of proteins and peptides. We present RaaMLab, a free and open-source MATLAB toolbox, to facilitate studies on proteins and peptides, to generate AA groups and to extract the structural and physicochemical features of reduced AAs (RedAA). This toolbox offers 4 kinds of databases, including the physicochemical properties of AAs and their groupings, 49 AA classification methods and 5 types of biophysicochemical features of RedAAs. These features can be easily computed based on a user-defined alphabet size and the AA properties of the AA groupings. RaaMLab is open source and freely available at https://github.com/bioinfo0706/RaaMLab. The website also contains a tutorial, extensive documentation and examples.
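The idea of a reduced amino acid alphabet described in this abstract can be sketched in a few lines of Python. This is not RaaMLab's API (RaaMLab is a MATLAB toolbox); the six-group clustering below is a hypothetical grouping by broad physicochemical class, chosen only to show how residues are mapped to group representatives and how a reduced composition is computed.

```python
from collections import Counter

# Hypothetical grouping for illustration; not one of RaaMLab's 49 classifications.
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}


def build_mapping(groups):
    """Map each residue to the first letter of its group (the representative)."""
    return {res: members[0] for members in groups.values() for res in members}


def reduce_sequence(seq, mapping):
    """Rewrite a sequence in the reduced alphabet; unknown residues become 'X'."""
    return "".join(mapping.get(r, "X") for r in seq.upper())


def reduced_composition(seq, groups):
    """Fraction of residues falling into each group of the reduced alphabet."""
    reduced = reduce_sequence(seq, build_mapping(groups))
    counts = Counter(reduced)
    total = len(reduced) or 1  # avoid division by zero on empty input
    return {name: counts[members[0]] / total for name, members in groups.items()}
```

For example, `reduce_sequence("MKVLDE", build_mapping(GROUPS))` gives `"AKAADD"` under this grouping. Changing the `GROUPS` dictionary changes the alphabet size without touching the rest of the code.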
Affiliation(s)
- Baohang Xi
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Jin Tao
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
| | - Xinnan Xu
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Pingan He
- College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
| |
|
29
|
Cheng X, Xiao X, Chou KC. pLoc_bal-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by quasi-balancing training dataset and general PseAAC. J Theor Biol 2018; 458:92-102. [DOI: 10.1016/j.jtbi.2018.09.005] [Citation(s) in RCA: 65] [Impact Index Per Article: 9.3] [Received: 08/16/2018] [Revised: 09/05/2018] [Accepted: 09/07/2018] [Indexed: 01/03/2023]
|
30
|
Shen Y, Tang J, Guo F. Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC. J Theor Biol 2018; 462:230-239. [PMID: 30452958 DOI: 10.1016/j.jtbi.2018.11.012] [Citation(s) in RCA: 106] [Impact Index Per Article: 15.1] [Received: 09/17/2018] [Revised: 11/07/2018] [Accepted: 11/15/2018] [Indexed: 01/07/2023]
Abstract
Identifying the location of proteins in a cell plays an important role in understanding their functions, with applications in drug design, therapeutic target discovery and biological research. However, traditional subcellular localization experiments are time-consuming, laborious and small in scale. With the development of next-generation sequencing technology, the number of proteins has grown exponentially, which lays the foundation for computational methods for identifying protein subcellular localization. Although many methods for predicting subcellular localization of proteins have been proposed, most of them are limited to single-location proteins. In this paper, we propose a multi-kernel SVM to predict subcellular localization of both multi-location and single-location proteins. First, we make use of the evolutionary information extracted from the position-specific scoring matrix (PSSM) and the physicochemical properties of proteins, by Chou's general PseAAC and other efficient functions. Then, we propose a multi-kernel support vector machine (SVM) model to identify multi-label protein subcellular localization. Our method performs well in predicting subcellular localization of proteins, achieving an average precision of 0.7065 and 0.6889 on two human datasets, respectively. Both values are higher than those achieved by other existing methods. Therefore, we provide an efficient system via a novel perspective to study protein subcellular localization.
Affiliation(s)
- Yinan Shen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Yaguan Road, Jinnan District, Tianjin, PR China.
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Yaguan Road, Jinnan District, Tianjin, PR China; School of Computational Science and Engineering, University of South Carolina, Columbia, USA.
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Yaguan Road, Jinnan District, Tianjin, PR China.
| |
|
31
|
Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, Webb GI, Smith AI, Daly RJ, Chou KC, Song J. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018; 34:2499-2502. [PMID: 29528364 PMCID: PMC6658705 DOI: 10.1093/bioinformatics/bty140] [Citation(s) in RCA: 423] [Impact Index Per Article: 60.4] [Received: 12/02/2017] [Revised: 02/15/2018] [Accepted: 03/06/2018] [Indexed: 11/13/2022]
Abstract
Summary: Structural and physicochemical descriptors extracted from sequence data have been widely used to represent sequences and predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs/RNAs. Here, we present iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors. It also allows users to extract specific amino acid properties from the AAindex database. Furthermore, iFeature integrates 12 different types of commonly used feature clustering, selection and dimensionality reduction algorithms, greatly facilitating training, analysis and benchmarking of machine-learning models. The functionality of iFeature is made freely available via an online web server and a stand-alone toolkit. Availability and implementation: http://iFeature.erc.monash.edu/; https://github.com/Superzchen/iFeature/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhen Chen
- School of Basic Medical Science, Qingdao University, 38 Dengzhou Road, Qingdao, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang, China
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Tatiana T Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Yanan Wang
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
| | - A Ian Smith
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
| | - Roger J Daly
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA, USA
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
| |
|
32
|
Cheng X, Lin WZ, Xiao X, Chou KC. pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC. Bioinformatics 2018; 35:398-406. [DOI: 10.1093/bioinformatics/bty628] [Citation(s) in RCA: 79] [Impact Index Per Article: 11.3] [Received: 04/21/2018] [Accepted: 07/11/2018] [Indexed: 12/25/2022]
Affiliation(s)
- Xiang Cheng
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
| | - Wei-Zhong Lin
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
| | - Xuan Xiao
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
| | - Kuo-Chen Chou
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
33
|
Wang L, Zhao Y, Chen Y, Wang D. The effect of three novel feature extraction methods on the prediction of the subcellular localization of multi-site virus proteins. Bioengineered 2018; 9:196-202. [PMID: 28886267 PMCID: PMC5972939 DOI: 10.1080/21655979.2017.1373536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2017] [Accepted: 07/05/2017] [Indexed: 11/08/2022]
Abstract
Experimental methods play a crucial role in identifying the subcellular localization of proteins and building high-quality databases. However, more efficient, automated computational methods are required to predict the subcellular localization of proteins on a large scale. Various efficient feature extraction methods have been proposed to predict subcellular localization, but challenges remain. In this paper, three novel feature extraction methods are established to improve multi-site prediction. The first utilizes repetitive information via moving windows based on a dipeptide pseudo amino acid composition method (R-Dipeptide). The second utilizes the impact of each amino acid residue on its following residues based on pseudo amino acids (I-PseAAC). The third provides local information about protein sequences that reflects the strength of the physicochemical properties of residues (PseAAC2). The multi-label k-nearest neighbor algorithm (MLKNN) is used to predict the subcellular localization of multi-site virus proteins. The best overall accuracy values of R-Dipeptide, I-PseAAC, and PseAAC2 when applied to dataset S from Virus-mPLoc are 59.92%, 59.13%, and 57.94%, respectively.
Affiliation(s)
- Lei Wang
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Yaou Zhao
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Dong Wang
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| |
|
34
|
pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics 2018; 110:50-58. [DOI: 10.1016/j.ygeno.2017.08.005] [Citation(s) in RCA: 180] [Impact Index Per Article: 25.7] [Received: 07/06/2017] [Revised: 08/10/2017] [Accepted: 08/11/2017] [Indexed: 11/22/2022]
|
35
|
Ruiz-Blanco YB, Agüero-Chapin G, García-Hernández E, Álvarez O, Antunes A, Green J. Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone. BMC Bioinformatics 2017; 18:349. [PMID: 28732462 PMCID: PMC5521120 DOI: 10.1186/s12859-017-1758-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 02/07/2017] [Accepted: 07/13/2017] [Indexed: 11/10/2022]
Affiliation(s)
- Yasser B Ruiz-Blanco
- Facultad de Química y Farmacia, Universidad Central "Marta Abreu" de Las Villas, 54830, Santa Clara, Cuba; Theoretical Chemistry, Max Planck Institute für Kohlenforschung, 45470, Mulheim an der Ruhr, Germany
| | - Guillermin Agüero-Chapin
- CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal; Centro de Bioactivos Químicos (CBQ), Universidad Central "Marta Abreu" de Las Villas (UCLV), 54830, Santa Clara, Cuba; Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal
| | - Enrique García-Hernández
- Instituto de Química, Universidad Nacional Autónoma de México (UNAM), 04360, D.F, México, Mexico
| | - Orlando Álvarez
- Centro de Bioactivos Químicos (CBQ), Universidad Central "Marta Abreu" de Las Villas (UCLV), 54830, Santa Clara, Cuba
| | - Agostinho Antunes
- CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal; Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal
| | - James Green
- Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada
| |
|
36
|
Du X, Sun S, Hu C, Yao Y, Yan Y, Zhang Y. DeepPPI: Boosting Prediction of Protein-Protein Interactions with Deep Neural Networks. J Chem Inf Model 2017; 57:1499-1510. [PMID: 28514151 DOI: 10.1021/acs.jcim.7b00028] [Citation(s) in RCA: 141] [Impact Index Per Article: 17.6] [Indexed: 12/12/2022]
Abstract
The complex language of eukaryotic gene expression remains incompletely understood. Although many protein variants are statistically associated with human disease, nearly all such variants act through unknown mechanisms, for example, protein-protein interactions (PPIs). In this study, we address this challenge using a recent machine learning advance, deep neural networks (DNNs). We aim to improve the performance of PPI prediction and propose a method called DeepPPI (Deep neural networks for Protein-Protein Interactions prediction), which employs deep neural networks to effectively learn representations of proteins from common protein descriptors. The experimental results indicate that DeepPPI achieves superior performance on the test data set, with an Accuracy of 92.50%, Precision of 94.38%, Recall of 90.56%, Specificity of 94.49%, Matthews Correlation Coefficient of 85.08% and Area Under the Curve of 97.43%. Extensive experiments show that DeepPPI can learn useful features of protein pairs by layer-wise abstraction, and thus achieves better prediction performance than existing methods. The source code of our approach is available via http://ailab.ahu.edu.cn:8087/DeepPPI/index.html.
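The threshold-based metrics reported in this abstract (Accuracy, Precision, Recall, Specificity and Matthews Correlation Coefficient) all derive from the four cells of a binary confusion matrix. The following Python sketch shows the standard definitions; the function name and the convention of returning 0.0 for an undefined ratio are assumptions, and the Area Under the Curve is left out because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 when the denominator is zero (a convention, not a standard).
        return num / den if den else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),        # also called sensitivity
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

MCC ranges from -1 to 1, so a value reported as a percentage, as in the abstract, corresponds to the coefficient multiplied by 100.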
Affiliation(s)
- Xiuquan Du
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, and Center of Information Support & Assurance Technology, Anhui University, Hefei, 230601 Anhui, China
| | - Shiwei Sun
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, and Center of Information Support & Assurance Technology, Anhui University, Hefei, 230601 Anhui, China
| | - Changlin Hu
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, and Center of Information Support & Assurance Technology, Anhui University, Hefei, 230601 Anhui, China
| | - Yu Yao
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, and Center of Information Support & Assurance Technology, Anhui University, Hefei, 230601 Anhui, China
| | - Yuanting Yan
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, and Center of Information Support & Assurance Technology, Anhui University, Hefei, 230601 Anhui, China
| | - Yanping Zhang
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, and Center of Information Support & Assurance Technology, Anhui University, Hefei, 230601 Anhui, China
| |
|
37
|
Cheng X, Zhao SG, Xiao X, Chou KC. iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals. Oncotarget 2017; 8:58494-58503. [PMID: 28938573 PMCID: PMC5601669 DOI: 10.18632/oncotarget.17028] [Citation(s) in RCA: 96] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2017] [Accepted: 03/28/2017] [Indexed: 01/18/2023] Open
Abstract
As recommended by the World Health Organization (WHO), drug compounds have been classified into 14 main ATC (Anatomical Therapeutic Chemical) classes according to their therapeutic and chemical characteristics. Given an uncharacterized compound, can we develop a computational method to quickly identify which ATC class or classes it belongs to? The information thus obtained will help adjust our focus and selection in a timely manner, significantly speeding up the drug development process. But this problem is by no means an easy one, since some drug compounds may belong to two or more ATC classes. To address this problem, using the DO (Drug Ontology) approach based on the ChEBI (Chemical Entities of Biological Interest) database, we developed a predictor called iATC-mDO. Subsequently, hybridizing it with an existing drug ATC classifier, we constructed a predictor called iATC-mHyb. It has been demonstrated by rigorous cross-validation and from five different measuring angles that iATC-mHyb is remarkably superior to the best existing predictor in identifying the ATC classes for drug compounds. For the convenience of most experimental scientists, a user-friendly web-server for iATC-mHyb has been established at http://www.jci-bioinfo.cn/iATC-mHyb, by which users can easily get their desired results without the need to go through the complicated mathematical equations involved.
Affiliation(s)
- Xiang Cheng
- College of Information Science and Technology, Donghua University, Shanghai 201620, China; Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333001, China
| | - Shu-Guang Zhao
- College of Information Science and Technology, Donghua University, Shanghai 201620, China
| | - Xuan Xiao
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333001, China; Gordon Life Science Institute, Boston, MA 02478, USA
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| |
|
38
|
Xiao X, Cheng X, Su S, Mao Q, Chou KC. pLoc-mGpos: Incorporate Key Gene Ontology Information into General PseAAC for Predicting Subcellular Localization of Gram-Positive Bacterial Proteins. Natural Science 2017. [DOI: 10.4236/ns.2017.99032] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
39
|
Wan S, Mak MW, Kung SY. Transductive Learning for Multi-Label Protein Subchloroplast Localization Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:212-224. [PMID: 26887009 DOI: 10.1109/tcbb.2016.2527657] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Predicting the localization of chloroplast proteins at the sub-subcellular level is an essential yet challenging step to elucidate their functions. Most of the existing subchloroplast localization predictors are limited to predicting single-location proteins and ignore the multi-location chloroplast proteins. While recent studies have led to some multi-location chloroplast predictors, they usually perform poorly. This paper proposes an ensemble transductive learning method to tackle this multi-label classification problem. Specifically, given a protein in a dataset, its composition-based sequence information and profile-based evolutionary information are extracted separately. Each of these two kinds of features is compared with the corresponding features of the other proteins in the dataset. The comparisons lead to two similarity vectors, which are combined with weights to constitute an ensemble feature vector. A transductive learning model based on the least squares and nearest neighbor algorithms is proposed to process the ensemble features. We refer to the resulting predictor as EnTrans-Chlo. Experimental results on a stringent benchmark dataset and a novel dataset demonstrate that EnTrans-Chlo significantly outperforms state-of-the-art predictors and, in particular, gains more than 4% (absolute) improvement in overall actual accuracy. For readers' convenience, EnTrans-Chlo is freely available online at http://bioinfo.eie.polyu.edu.hk/EnTransChloServer/.
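The feature-construction step in this abstract (two similarity vectors combined by weights into one ensemble vector) can be sketched in a few lines of Python. This is an illustration only: the cosine similarity, the weight `w`, the function names, and the toy vectors are all assumptions, since the abstract does not specify the similarity measure or the weighting used by EnTrans-Chlo.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_vector(query, others):
    """Compare one protein's feature vector with those of the other proteins."""
    return [cosine(query, other) for other in others]


def ensemble_feature(comp_query, comp_others, prof_query, prof_others, w=0.5):
    """Weighted combination of composition-based and profile-based similarity vectors.

    w is a hypothetical mixing weight in [0, 1]; w = 1 uses composition only.
    """
    s_comp = similarity_vector(comp_query, comp_others)
    s_prof = similarity_vector(prof_query, prof_others)
    return [w * a + (1 - w) * b for a, b in zip(s_comp, s_prof)]


# Toy 2-dimensional features for one query protein and two other proteins.
feature = ensemble_feature(
    comp_query=[1.0, 0.0], comp_others=[[1.0, 0.0], [0.0, 1.0]],
    prof_query=[0.0, 1.0], prof_others=[[1.0, 0.0], [0.0, 1.0]],
    w=0.5,
)
print(feature)
```

The resulting vector has one entry per compared protein and would then be passed to the downstream classifier, which this sketch does not attempt to reproduce.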
|
40
|
Wan S, Mak MW, Kung SY. Ensemble Linear Neighborhood Propagation for Predicting Subchloroplast Localization of Multi-Location Proteins. J Proteome Res 2016; 15:4755-4762. [DOI: 10.1021/acs.jproteome.6b00686] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Shibiao Wan
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Man-Wai Mak
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Sun-Yuan Kung
- Department of Electrical Engineering, Princeton University, New Jersey 08540, United States
| |
|
41
|
Wan S, Mak MW, Kung SY. Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins. BMC Bioinformatics 2016; 17:97. [PMID: 26911432 PMCID: PMC4765148 DOI: 10.1186/s12859-016-0940-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2015] [Accepted: 01/27/2016] [Indexed: 11/10/2022] Open
Abstract
Background: Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high-performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved. Results: This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms can yield sparse and interpretable solutions. By using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 out of more than 8,000 GO terms, respectively, which play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that the GO terms of these categories also play vital roles in the prediction. With these essential GO terms, it is possible to decide not only where a protein is located but also why it resides there. Conclusions: Experimental results show that the outputs of both mEN and mLASSO are interpretable and that they perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers' convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.
Affiliation(s)
- Shibiao Wan
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China.
| | - Man-Wai Mak
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China.
| | - Sun-Yuan Kung
- Department of Electrical Engineering, Princeton University, New Jersey, USA.
| |
|
42
|
Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme. INT J MACH LEARN CYB 2015. [DOI: 10.1007/s13042-015-0460-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
|
43
|
Yousef A, Moghadam Charkari N. SFM: A novel sequence-based fusion method for disease genes identification and prioritization. J Theor Biol 2015. [DOI: 10.1016/j.jtbi.2015.07.010] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
44
|
Wan S, Mak MW, Kung SY. mLASSO-Hum: A LASSO-based interpretable human-protein subcellular localization predictor. J Theor Biol 2015; 382:223-34. [DOI: 10.1016/j.jtbi.2015.06.042] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2015] [Revised: 06/25/2015] [Accepted: 06/26/2015] [Indexed: 02/03/2023]
|
45
|
Yousef A, Charkari NM. A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification. J Biomed Inform 2015; 56:300-6. [DOI: 10.1016/j.jbi.2015.06.018] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2015] [Revised: 06/04/2015] [Accepted: 06/26/2015] [Indexed: 10/23/2022]
|
46
|
mPLR-Loc: An adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction. Anal Biochem 2015; 473:14-27. [DOI: 10.1016/j.ab.2014.10.014] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2014] [Revised: 09/29/2014] [Accepted: 10/21/2014] [Indexed: 01/16/2023]
|
47
|
Yang R, Zhang C, Gao R, Zhang L. An ensemble method with hybrid features to identify extracellular matrix proteins. PLoS One 2015; 10:e0117804. [PMID: 25680094 PMCID: PMC4334504 DOI: 10.1371/journal.pone.0117804] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2014] [Accepted: 01/02/2015] [Indexed: 12/29/2022] Open
Abstract
The extracellular matrix (ECM) is a dynamic composite of secreted proteins that play important roles in numerous biological processes such as tissue morphogenesis, differentiation and homeostasis. Furthermore, various diseases are caused by the dysfunction of ECM proteins. Therefore, identifying these important ECM proteins may assist in understanding related biological processes and drug development. In view of the serious imbalance in the training dataset, a Random Forest-based ensemble method with hybrid features is developed in this paper to identify ECM proteins. Hybrid features are employed by incorporating sequence composition, physicochemical properties, evolutionary and structural information. The Information Gain Ratio and Incremental Feature Selection (IGR-IFS) methods are adopted to select the optimal features. Finally, the resulting predictor, termed IECMP (Identify ECM Proteins), achieves a balanced accuracy of 86.4% using 10-fold cross-validation on the training dataset, which is much higher than the results obtained by other methods (ECMPRED: 71.0%, ECMPP: 77.8%). Moreover, when tested on a common independent dataset, our method also achieves significantly improved performance over ECMPP and ECMPRED. These results indicate that IECMP is an effective method for ECM protein prediction, with a more balanced prediction capability for positive and negative samples. It is anticipated that the proposed method will provide significant information to fully decipher the molecular mechanisms of ECM-related biological processes and discover candidate drug targets. For public access, we have developed a user-friendly web server for ECM protein identification that is freely accessible at http://iecmp.weka.cc.
Affiliation(s)
- Runtao Yang
- School of Control Science and Engineering, Shandong University, Jinan, China
| | - Chengjin Zhang
- School of Control Science and Engineering, Shandong University, Jinan, China
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, China
- * E-mail: (CJZ); (RG)
| | - Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan, China
- * E-mail: (CJZ); (RG)
| | - Lina Zhang
- School of Control Science and Engineering, Shandong University, Jinan, China
| |
|
48
|
acACS: improving the prediction accuracy of protein subcellular locations and protein classification by incorporating the average chemical shifts composition. ScientificWorldJournal 2014; 2014:864135. [PMID: 25110749 PMCID: PMC4106170 DOI: 10.1155/2014/864135] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2014] [Revised: 06/15/2014] [Accepted: 06/16/2014] [Indexed: 11/17/2022] Open
Abstract
The chemical shift is sensitive to changes in the local environments and can report the structural changes. The structure information of a protein can be represented by the average chemical shifts (ACS) composition, which has been broadly applied for enhancing the prediction accuracy in protein subcellular locations and protein classification. However, different kinds of ACS composition can solve different problems. We established an online web server named acACS, which can convert secondary structure into average chemical shift and then compose the vector for representing a protein by using the algorithm of auto covariance. Our solution is easy to use and can meet the needs of users.
|
49
|
Wan S, Mak MW, Kung SY. R3P-Loc: a compact multi-label predictor using ridge regression and random projection for protein subcellular localization. J Theor Biol 2014; 360:34-45. [PMID: 24997236 DOI: 10.1016/j.jtbi.2014.06.031] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2014] [Revised: 06/24/2014] [Accepted: 06/25/2014] [Indexed: 12/21/2022]
Abstract
Locating proteins within cellular contexts is of paramount significance in elucidating their biological functions. Computational methods based on knowledge databases (such as gene ontology annotation (GOA) database) are known to be more efficient than sequence-based methods. However, the predominant scenarios of knowledge-based methods are that (1) knowledge databases typically have enormous size and are growing exponentially, (2) knowledge databases contain redundant information, and (3) the number of extracted features from knowledge databases is much larger than the number of data samples with ground-truth labels. These properties render the extracted features liable to redundant or irrelevant information, causing the prediction systems to suffer from overfitting. To address these problems, this paper proposes an efficient multi-label predictor, namely R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases are created from Swiss-Prot and GOA databases. These databases possess almost the same amount of information as their full-size counterparts but with much smaller size. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the dimensions sevenfold and significantly outperform state-of-the-art predictors. This paper also demonstrates that the compact databases reduce the memory consumption by 39 times without causing degradation in prediction accuracy. For readers' convenience, the R3P-Loc server is available online at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.
Affiliation(s)
- Shibiao Wan
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
| | - Man-Wai Mak
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China.
| | - Sun-Yuan Kung
- Department of Electrical Engineering, Princeton University, NJ, USA.
| |
|
50
|
Yu CS, Cheng CW, Su WC, Chang KC, Huang SW, Hwang JK, Lu CH. CELLO2GO: a web server for protein subCELlular LOcalization prediction with functional gene ontology annotation. PLoS One 2014; 9:e99368. [PMID: 24911789 PMCID: PMC4049835 DOI: 10.1371/journal.pone.0099368] [Citation(s) in RCA: 286] [Impact Index Per Article: 26.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2013] [Accepted: 05/14/2014] [Indexed: 01/15/2023] Open
Abstract
CELLO2GO (http://cello.life.nctu.edu.tw/cello2go/) is a publicly available, web-based system for screening various properties of a targeted protein and its subcellular localization. Herein, we describe how this platform is used to obtain brief or detailed gene ontology (GO)-type categories, including subcellular localization(s), for the queried proteins by combining the CELLO localization-predicting and BLAST homology-searching approaches. Given a query protein sequence, CELLO2GO uses BLAST to search for homologous sequences that are GO annotated in an in-house database derived from the UniProt KnowledgeBase database. At the same time, CELLO attempts to predict at least one subcellular localization on the basis of the species in which the protein is found. When homologs for the query sequence have been identified, the numbers of terms found for each of their GO categories, i.e., cellular compartment, molecular function, and biological process, are summed and presented as pie charts representing possible functional annotations for the queried protein. Although the experimental subcellular localization of a protein may not be known, and thus not annotated, CELLO can confidently suggest a subcellular localization. CELLO2GO should be a useful tool for research involving complex subcellular systems because it combines CELLO and BLAST into one platform and its output is easily manipulated so that user-specific questions may be readily addressed.
Affiliation(s)
- Chin-Sheng Yu
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
- Master's Program in Biomedical Informatics and Biomedical Engineering, Feng Chia University, Taichung, Taiwan
| | - Chih-Wen Cheng
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
| | - Wen-Chi Su
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
| | - Kuei-Chung Chang
- Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
| | - Shao-Wei Huang
- Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan
| | - Jenn-Kang Hwang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Center of Bioinformatics Research, National Chiao Tung University, Hsinchu, Taiwan
| | - Chih-Hao Lu
- Graduate Institute of Basic Medical Science, China Medical University, Taichung, Taiwan
| |
|