1
|
Beltrán JF, Belén LH, Yáñez AJ, Jimenez L. Predicting viral proteins that evade the innate immune system: a machine learning-based immunoinformatics tool. BMC Bioinformatics 2024; 25:351. [PMID: 39522017 PMCID: PMC11550529 DOI: 10.1186/s12859-024-05972-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2024] [Accepted: 10/29/2024] [Indexed: 11/16/2024] Open
Abstract
Viral proteins that evade the host's innate immune response play a crucial role in pathogenesis, significantly impacting viral infections and potential therapeutic strategies. Identifying these proteins through traditional methods is challenging and time-consuming due to the complexity of virus-host interactions. Leveraging advancements in computational biology, we present VirusHound-II, a novel tool that utilizes machine learning techniques to predict viral proteins evading the innate immune response with high accuracy. We evaluated a comprehensive range of machine learning models, including ensemble methods, neural networks, and support vector machines. Using a dataset of 1337 viral proteins known to evade the innate immune response (VPEINRs) and an equal number of non-VPEINRs, we employed pseudo amino acid composition as the molecular descriptor. Our methodology involved a tenfold cross-validation strategy on 80% of the data for training, followed by testing on an independent dataset comprising the remaining 20%. The random forest model demonstrated superior performance metrics, achieving 0.9290 accuracy, 0.9283 F1 score, 0.9354 precision, and 0.9213 sensitivity in the independent testing phase. These results establish VirusHound-II as an advancement in computational virology, accessible via a user-friendly web application. We anticipate that VirusHound-II will be a crucial resource for researchers, enabling the rapid and reliable prediction of viral proteins evading the innate immune response. This tool has the potential to accelerate the identification of therapeutic targets and enhance our understanding of viral evasion mechanisms, contributing to the development of more effective antiviral strategies and advancing our knowledge of virus-host interactions.
Collapse
Affiliation(s)
- Jorge F Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile.
| | - Lisandra Herrera Belén
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
| | - Alejandro J Yáñez
- Departamento de Investigación y Desarrollo, Greenvolution SpA., Puerto Varas, Chile
- Interdisciplinary Center for Aquaculture Research (INCAR), Concepcion, Chile
| | - Luis Jimenez
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile
| |
Collapse
|
2
|
Hassan MT, Tayara H, Chong KT. NaII-Pred: An ensemble-learning framework for the identification and interpretation of sodium ion inhibitors by fusing multiple feature representation. Comput Biol Med 2024; 178:108737. [PMID: 38879934 DOI: 10.1016/j.compbiomed.2024.108737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 04/21/2024] [Accepted: 06/08/2024] [Indexed: 06/18/2024]
Abstract
High-affinity ligand peptides for ion channels are essential for controlling the flow of ions across the plasma membrane. These peptides are now being investigated as possible therapeutic possibilities for a variety of illnesses, including cancer and cardiovascular disease. So, the identification and interpretation of ligand peptide inhibitors to control ion flow across cells become pivotal for exploration. In this work, we developed an ensemble-based model, NaII-Pred, for the identification of sodium ion inhibitors. The ensemble model was trained, tested, and evaluated on three different datasets. The NaII-Pred method employs six different descriptors and a hybrid feature set in conjunction with five conventional machine learning classifiers to create 35 baseline models. Through an ensemble approach, the top five baseline models trained on the hybrid feature set were integrated to yield the final predictive model, NaII-Pred. Our proposed model, NaII-Pred, outperforms the baseline models and the current predictors on both datasets. We believe NaII-Pred will play a critical role in screening and identifying potential sodium ion inhibitors and will be an invaluable tool.
Collapse
Affiliation(s)
- Mir Tanveerul Hassan
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju, 54896, South Korea.
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea; Advances Electronics and Information Research Centre, Jeonbuk National University, Jeonju, 54896, South Korea.
| |
Collapse
|
3
|
Wang JM, Cui RK, Qian ZK, Yang ZZ, Li Y. Mining channel-regulated peptides from animal venom by integrating sequence semantics and structural information. Comput Biol Chem 2024; 109:108027. [PMID: 38340414 DOI: 10.1016/j.compbiolchem.2024.108027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Revised: 01/24/2024] [Accepted: 02/04/2024] [Indexed: 02/12/2024]
Abstract
Channel-regulated peptides (CRPs) derived from animal venom hold great promise as potential drug candidates for numerous diseases associated with channel proteins. However, discovering and identifying CRPs using traditional bio-experimental methods is a time-consuming and laborious process. While there were a few computational studies on CRPs, they were limited to specific channel proteins, relied heavily on complex feature engineering, and lacked the incorporation of multi-source information. To address these problems, we proposed a novel deep learning model, called DeepCRPs, based on graph neural networks for systematically mining CRPs from animal venom. By combining the sequence semantic and structural information, the classification performance of four CRPs was significantly enhanced, reaching an accuracy of 0.92. This performance surpassed baseline models with accuracies ranging from 0.77 to 0.89. Furthermore, we employed advanced interpretable techniques to explore sequence and structural determinants relevant to the classification of CRPs, yielding potentially valuable bio-function interpretations. Comprehensive experimental results demonstrated the precision and interpretive capability of DeepCRPs, making it an accurate and bio-explainable suit for the identification and categorization of CRPs. Our research will contribute to the discovery and development of toxin peptides targeting channel proteins. The source data and code are freely available at https://github.com/liyigerry/DeepCRPs.
Collapse
Affiliation(s)
- Jian-Ming Wang
- College of Mathematics and Computer Science, Dali University, Dali, China
| | - Rong-Kai Cui
- College of Mathematics and Computer Science, Dali University, Dali, China
| | - Zheng-Kun Qian
- College of Mathematics and Computer Science, Dali University, Dali, China
| | - Zi-Zhong Yang
- Yunnan Provincial Key Laboratory of Entomological Biopharmaceutical R&D, College of Pharmacy, Dali University, Dali, China
| | - Yi Li
- College of Mathematics and Computer Science, Dali University, Dali, China.
| |
Collapse
|
4
|
Yan J, Zhang B, Zhou M, Kwok HF, Siu SWI. Multi-Branch-CNN: Classification of ion channel interacting peptides using multi-branch convolutional neural network. Comput Biol Med 2022; 147:105717. [PMID: 35752114 DOI: 10.1016/j.compbiomed.2022.105717] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2022] [Revised: 05/18/2022] [Accepted: 06/05/2022] [Indexed: 11/03/2022]
Abstract
Ligand peptides that have high affinity for ion channels are critical for regulating ion flux across the plasma membrane. These peptides are now being considered as potential drug candidates for many diseases, such as cardiovascular disease and cancers. In this work, we developed Multi-Branch-CNN, a CNN method with multiple input branches for identifying three types of ion channel peptide binders (sodium, potassium, and calcium) from intra- and inter-feature types. As for its real-world applications, prediction models that are able to recognize novel sequences having high or low similarities to training sequences are required. To this end, we tested our models on two test sets: a general test set including sequences spanning different similarity levels to those of the training set, and a novel-test set consisting of only sequences that bear little resemblance to sequences from the training set. Our experiments showed that the Multi-Branch-CNN method performs better than thirteen traditional ML algorithms (TML13), yielding an improvement in accuracy of 3.2%, 1.2%, and 2.3% on the test sets as well as 8.8%, 14.3%, and 14.6% on the novel-test sets for sodium, potassium, and calcium ion channels, respectively. We confirmed the effectiveness of Multi-Branch-CNN by comparing it to the standard CNN method with one input branch (Single-Branch-CNN) and an ensemble method (TML13-Stack). The data sets, script files to reproduce the experiments, and the final predictive models are freely available at https://github.com/jieluyan/Multi-Branch-CNN.
Collapse
Affiliation(s)
- Jielu Yan
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China
| | - Bob Zhang
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China.
| | - Mingliang Zhou
- School of Computer Science, Chongqing University, Shapingba, Chongqing, China
| | - Hang Fai Kwok
- Department of Biomedical Sciences, Faculty of Health Sciences, University of Macau, Taipa, Macao Special Administrative Region of China.
| | - Shirley W I Siu
- Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China; Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macao Special Administrative Region of China.
| |
Collapse
|
5
|
Jeong DU, Lim KM. Artificial neural network model for predicting changes in ion channel conductance based on cardiac action potential shapes generated via simulation. Sci Rep 2021; 11:7831. [PMID: 33837240 PMCID: PMC8035260 DOI: 10.1038/s41598-021-87578-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2021] [Accepted: 03/30/2021] [Indexed: 11/09/2022] Open
Abstract
Many studies have revealed changes in specific protein channels due to physiological causes such as mutation and their effects on action potential duration changes. However, no studies have been conducted to predict the type of protein channel abnormalities that occur through an action potential (AP) shape. Therefore, in this study, we aim to predict the ion channel conductance that is altered from various AP shapes using a machine learning algorithm. We perform electrophysiological simulations using a single-cell model to obtain AP shapes based on variations in the ion channel conductance. In the AP simulation, we increase and decrease the conductance of each ion channel at a constant rate, resulting in 1,980 AP shapes and one standard AP shape without any changes in the ion channel conductance. Subsequently, we calculate the AP difference shapes between them and use them as the input of the machine learning model to predict the changed ion channel conductance. In this study, we demonstrate that the changed ion channel conductance can be predicted with high prediction accuracy, as reflected by an F1 score of 0.985, using only AP shapes and simple machine learning.
Collapse
Affiliation(s)
- Da Un Jeong
- IT Convergence Engineering, Kumoh National Institute of Technology, Gumi, 39253, Republic of Korea
| | - Ki Moo Lim
- Medical IT Convergence Engineering, Kumoh National Institute of Technology, Gumi, 39253, Republic of Korea.
| |
Collapse
|
6
|
Yao Y, Zhang S, Liang Y. iORI-ENST: identifying origin of replication sites based on elastic net and stacking learning. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2021; 32:317-331. [PMID: 33730950 DOI: 10.1080/1062936x.2021.1895884] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Accepted: 02/23/2021] [Indexed: 06/12/2023]
Abstract
DNA replication is not only the basis of biological inheritance but also the most fundamental process in all living organisms. It plays a crucial role in the cell-division cycle and gene expression regulation. Hence, the accurate identification of the origin of replication sites (ORIs) has a great meaning for further understanding the regulatory mechanism of gene expression and treating genic diseases. In this paper, a novel, feasible and powerful model, namely, iORI-ENST is designed for identifying ORIs. Firstly, we extract the different features by incorporating mono-nucleotide binary encoding and dinucleotide-based spatial autocorrelation. Subsequently, elastic net is utilized as the feature selection method to select the optimal feature set. And then stacking learning is employed to predict ORIs and non-ORIs, which contains random forest, adaboost, gradient boosting decision tree, extra trees and support vector machine. Finally, the ORI sites are identified on the benchmark datasets S1 and S2 with their accuracies of 91.41% and 95.07%, respectively. Meanwhile, an independent dataset S3 is employed to verify the validation and transferability of our model and its accuracy reaches 91.10%. Comparing with state-of-the-art methods, our model achieves more remarkable performance. The results show our model is a feasible, effective and powerful tool for identifying ORIs. The source code and datasets are available at https://github.com/YingyingYao/iORI-ENST.
Collapse
Affiliation(s)
- Y Yao
- School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
| | - S Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
| | - Y Liang
- School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
| |
Collapse
|
7
|
Awais M, Hussain W, Khan YD, Rasool N, Khan SA, Chou KC. iPhosH-PseAAC: Identify Phosphohistidine Sites in Proteins by Blending Statistical Moments and Position Relative Features According to the Chou's 5-Step Rule and General Pseudo Amino Acid Composition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:596-610. [PMID: 31144645 DOI: 10.1109/tcbb.2019.2919025] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Protein phosphorylation is one of the key mechanism in prokaryotes and eukaryotes and is responsible for various biological functions such as protein degradation, intracellular localization, the multitude of cellular processes, molecular association, cytoskeletal dynamics, and enzymatic inhibition/activation. Phosphohistidine (PhosH) has a key role in a number of biological processes, including central metabolism to signalling in eukaryotes and bacteria. Thus, identification of phosphohistidine sites in a protein sequence is crucial, and experimental identification can be expensive, time-taking, and laborious. To address this problem, here, we propose a novel computational model namely iPhosH-PseAAC for prediction of phosphohistidine sites in a given protein sequence using pseudo amino acid composition (PseAAC), statistical moments, and position relative features. The results of the proposed predictor are validated through self-consistency testing, 10-fold cross-validation, and jackknife testing. The self-consistency validation gave the 100 percent accuracy, whereas, for cross-validation, the accuracy achieved is 94.26 percent. Moreover, jackknife testing gave 97.07 percent accuracy for the proposed model. Thus, the proposed model iPhosH-PseAAC for prediction of iPhosH site has the great ability to predict the PhosH sites in given proteins.
Collapse
|
8
|
Abstract
During the last three decades or so, many efforts have been made to study the protein cleavage
sites by some disease-causing enzyme, such as HIV (Human Immunodeficiency Virus) protease
and SARS (Severe Acute Respiratory Syndrome) coronavirus main proteinase. It has become increasingly
clear <i>via</i> this mini-review that the motivation driving the aforementioned studies is quite wise,
and that the results acquired through these studies are very rewarding, particularly for developing peptide
drugs.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
9
|
|
10
|
Lissabet JFB, Belén LH, Farias JG. PPLK +C: A Bioinformatics Tool for Predicting Peptide Ligands of Potassium Channels Based on Primary Structure Information. Interdiscip Sci 2020; 12:258-263. [PMID: 31912313 DOI: 10.1007/s12539-019-00356-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2019] [Revised: 12/22/2019] [Accepted: 12/24/2019] [Indexed: 10/25/2022]
Abstract
Potassium channels play a key role in regulating the flow of ions through the plasma membrane, orchestrating many cellular processes including cell volume regulation, hormone secretion and electrical impulse formation. Ligand peptides of potassium channels are molecules used in basic and applied research and are now considered promising alternatives in the treatment of many diseases, such as cardiovascular diseases and cancer. Currently, there are various bioinformatics tools focused on the prediction of peptides with different activities. However, none of the current tools can predict ligand peptides of potassium channels. In this work, we developed a tool called PPLK+C; this is the first tool that can predict peptide ligands of potassium channels. We also evaluated several amino acid molecular features and four machine-learning algorithms for the prediction of potassium channel ligand peptides: random forest, nearest neighbors, support vector machine and artificial neural network. All the biological data used in this study for training and validating models were obtained from peptides with experimentally verified activity. PPLK+C is a bioinformatics software written in the Python programming language, which showed a high predictive capacity with a model generated with the random forest algorithm: 0.77 sensitivity, 0.94 specificity, 0.91 accuracy and 0.70 Matthews correlation coefficient. PPLK+C is a novel tool with a friendly interface that can be used for the discovery of novel ligand peptides of potassium channels with high reliability, using only primary structure information.
Collapse
Affiliation(s)
- Jorge Félix Beltrán Lissabet
- Department of Chemical Engineering, Faculty of Engineering and Sciences, Universidad de La Frontera, Av. Francisco Salazar 01145, P.O. Box 54-D, 4811230, Temuco, Chile
| | - Lisandra Herrera Belén
- Department of Chemical Engineering, Faculty of Engineering and Sciences, Universidad de La Frontera, Av. Francisco Salazar 01145, P.O. Box 54-D, 4811230, Temuco, Chile
| | - Jorge G Farias
- Department of Chemical Engineering, Faculty of Engineering and Sciences, Universidad de La Frontera, Av. Francisco Salazar 01145, P.O. Box 54-D, 4811230, Temuco, Chile.
| |
Collapse
|
11
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Collapse
|
12
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
13
|
Lu Z, Chou KC. pLoc_Deep-mGpos: Predict Subcellular Localization of Gram Positive Bacteria Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/jbise.2020.135005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
14
|
Shao Y, Chou KC. pLoc_Deep-mVirus: A CNN Model for Predicting Subcellular Localization of Virus Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126033] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
15
|
iRSpot-DTS: Predict recombination spots by incorporating the dinucleotide-based spare-cross covariance information into Chou's pseudo components. Genomics 2019; 111:1760-1770. [DOI: 10.1016/j.ygeno.2018.11.031] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2018] [Revised: 11/29/2018] [Accepted: 11/30/2018] [Indexed: 12/16/2022]
|
16
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
17
|
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
18
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
19
|
|
20
|
Xiao X, Cheng X, Chen G, Mao Q, Chou KC. pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset. Med Chem 2019; 15:496-509. [DOI: 10.2174/1573406415666181217114710] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/17/2022]
Abstract
Background/Objective:Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mVirus” was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as “multiplex proteins”, may simultaneously occur in, or move between two or more subcellular location sites. Despite the fact that it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained by an extremely skewed dataset in which some subset was over 10 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset.Methods:Using the Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance out the training dataset, we have developed a new predictor called “pLoc_bal-mVirus” for predicting the subcellular localization of multi-label virus proteins.Results:Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-theart predictor for the same purpose.Conclusion:Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed complicated mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and in-depth understanding of the biological process in a cell.
Collapse
Affiliation(s)
- Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Genqiang Chen
- College of Chemistry, Chemical Engineering and Biotechnology, Donghua University, Shanghai 201620, China
| | - Qi Mao
- College of Information Science and Technology, Donghua University, Shanghai, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
21
|
Chou KC, Cheng X, Xiao X. pLoc_bal-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by General PseAAC and Quasi-balancing Training Dataset. Med Chem 2019; 15:472-485. [DOI: 10.2174/1573406415666181218102517] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/24/2022]
Abstract
<P>Background/Objective: Information of protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, it is highly demanded to develop powerful bioinformatics tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mEuk was trained by an extremely skewed dataset where some subset was about 200 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. </P><P> Methods: To alleviate such bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experimentconfirmed dataset have indicated that the proposed new predictor is remarkably superior to pLocmEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems. </P><P> Results: To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/. </P><P> Conclusion: It is anticipated that the pLoc_bal-Euk predictor holds very high potential to become a useful high throughput tool in identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs that is currently a very hot trend trend in drug development.</P>
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
22
|
SPrenylC-PseAAC: A sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins. J Theor Biol 2019; 468:1-11. [DOI: 10.1016/j.jtbi.2019.02.007] [Citation(s) in RCA: 98] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2018] [Revised: 02/07/2019] [Accepted: 02/11/2019] [Indexed: 11/22/2022]
|
23
|
SPalmitoylC-PseAAC: A sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins. Anal Biochem 2019; 568:14-23. [DOI: 10.1016/j.ab.2018.12.019] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2018] [Revised: 12/19/2018] [Accepted: 12/22/2018] [Indexed: 02/06/2023]
|
24
|
Khan YD, Jamil M, Hussain W, Rasool N, Khan SA, Chou KC. pSSbond-PseAAC: Prediction of disulfide bonding sites by integration of PseAAC and statistical moments. J Theor Biol 2019; 463:47-55. [DOI: 10.1016/j.jtbi.2018.12.015] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2018] [Revised: 12/05/2018] [Accepted: 12/11/2018] [Indexed: 02/08/2023]
|
25
|
Xiao X, Xu ZC, Qiu WR, Wang P, Ge HT, Chou KC. iPSW(2L)-PseKNC: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics 2018; 111:1785-1793. [PMID: 30529532 DOI: 10.1016/j.ygeno.2018.12.001] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2018] [Revised: 11/20/2018] [Accepted: 12/04/2018] [Indexed: 12/20/2022]
Abstract
The promoter is a regulatory DNA region about 81-1000 base pairs long, usually located near the transcription start site (TSS) along upstream of a given gene. By combining a certain protein called transcription factor, the promoter provides the starting point for regulated gene transcription, and hence plays a vitally important role in gene transcriptional regulation. With explosive growth of DNA sequences in the post-genomic age, it has become an urgent challenge to develop computational method for effectively identifying promoters because the information thus obtained is very useful for both basic research and drug development. Although some prediction methods were developed in this regard, most of them were limited at merely identifying whether a query DNA sequence being of a promoter or not. However, based on their strength-distinct levels for transcriptional activation and expression, promoter should be divided into two categories: strong and weak types. Here a new two-layer predictor, called "iPSW(2L)-PseKNC", was developed by fusing the physicochemical properties of nucleotides and their nucleotide density into PseKNC (pseudo K-tuple nucleotide composition). Its 1st-layer serves to predict whether a query DNA sequence sample is of promoter or not, while its 2nd-layer is able to predict the strength of promoters. It has been observed through rigorous cross-validations that the 1st-layer sub-predictor is remarkably superior to the existing state-of-the-art predictors in identifying the promoters and non-promoters, and that the 2nd-layer sub-predictor can do what is beyond the reach of the existing predictors. Moreover, the web-server for iPSW(2L)-PseKNC has been established at http://www.jci-bioinfo.cn/iPSW(2L)-PseKNC, by which the majority of experimental scientists can easily get the results they need.
Collapse
Affiliation(s)
- Xuan Xiao
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA.
| | - Zhao-Chun Xu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China.
| | - Wang-Ren Qiu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA
| | - Peng Wang
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
| | - Hui-Ting Ge
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
| | - Kuo-Chen Chou
- The Gordon Life Science Institute, Boston, MA 02478, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
Collapse
|