1
|
Chi H, Chen H, Wang R, Zhang J, Jiang L, Zhang S, Jiang C, Huang J, Quan X, Liu Y, Zhang Q, Yang G. Proposing new early detection indicators for pancreatic cancer: Combining machine learning and neural networks for serum miRNA-based diagnostic model. Front Oncol 2023; 13:1244578. [PMID: 37601672 PMCID: PMC10437932 DOI: 10.3389/fonc.2023.1244578] [Citation(s) in RCA: 31] [Impact Index Per Article: 15.5] [Received: 06/22/2023] [Accepted: 07/18/2023] [Indexed: 08/22/2023] Open
Abstract
BACKGROUND Pancreatic cancer (PC) is a lethal malignancy that ranks seventh in terms of global cancer-related mortality. Despite advancements in treatment, the five-year survival rate remains low, emphasizing the urgent need for reliable early detection methods. MicroRNAs (miRNAs), a group of non-coding RNAs involved in critical gene regulatory mechanisms, have garnered significant attention as potential diagnostic and prognostic biomarkers for pancreatic cancer (PC). Their suitability stems from their accessibility and stability in blood, making them particularly appealing for clinical applications. METHODS In this study, we analyzed serum miRNA expression profiles from three independent PC datasets obtained from the Gene Expression Omnibus (GEO) database. To identify serum miRNAs associated with PC incidence, we employed three machine learning algorithms: Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Least Absolute Shrinkage and Selection Operator (LASSO), and Random Forest. We developed an artificial neural network model to assess the accuracy of the identified PC-related serum miRNAs (PCRSMs) and create a nomogram. These findings were further validated through qPCR experiments. Additionally, patient samples with PC were classified using the consensus clustering method. RESULTS Our analysis revealed three PCRSMs, namely hsa-miR-4648, hsa-miR-125b-1-3p, and hsa-miR-3201, using the three machine learning algorithms. The artificial neural network model demonstrated high accuracy in distinguishing between normal and pancreatic cancer samples, with verification and training groups exhibiting AUC values of 0.935 and 0.926, respectively. We also utilized the consensus clustering method to classify PC samples into two optimal subtypes. Furthermore, our investigation into the expression of PCRSMs unveiled a significant negative correlation between the expression of hsa-miR-125b-1-3p and age. 
CONCLUSION Our study introduces a novel artificial neural network model for early diagnosis of pancreatic cancer, carrying significant clinical implications. Furthermore, our findings provide valuable insights into the pathogenesis of pancreatic cancer and offer potential avenues for drug screening, personalized treatment, and immunotherapy against this lethal disease.
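The AUC values quoted in this abstract can be read as the probability that a randomly chosen cancer sample receives a higher model score than a randomly chosen normal sample. The following is a minimal sketch of that rank-based definition, not the authors' pipeline; the function name `roc_auc` and the toy labels and scores are hypothetical and not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (assumption, not study data): 1 = cancer, 0 = normal.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 8 of 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a sort-based rank computation would be the usual choice for large cohorts.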
Affiliation(s)
- Hao Chi
  - Clinical Medical College, Southwest Medical University, Luzhou, China
- Haiqing Chen
  - Clinical Medical College, Southwest Medical University, Luzhou, China
- Rui Wang
  - Department of General Surgery (Hepatobiliary Surgery), The Affiliated Hospital of Southwest Medical University, Luzhou, China
  - Nuclear Medicine and Molecular Imaging Key Laboratory of Sichuan Province, Luzhou, China
  - Academician (Expert) Workstation of Sichuan Province, Luzhou, China
- Jieying Zhang
  - First Teaching Hospital of Tianjin University of Traditional Chinese Medicine, Tianjin, China
  - National Clinical Research Center for Chinese Medicine Acupuncture and Moxibustion, Tianjin, China
- Lai Jiang
  - Clinical Medical College, Southwest Medical University, Luzhou, China
- Shengke Zhang
  - Clinical Medical College, Southwest Medical University, Luzhou, China
- Chenglu Jiang
  - Clinical Medical College, Southwest Medical University, Luzhou, China
- Jinbang Huang
  - Clinical Medical College, Southwest Medical University, Luzhou, China
- Xiaomin Quan
  - Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine Second Affiliated DongFang Hospital, Beijing, China
- Yunfei Liu
  - Department of General, Visceral, and Transplant Surgery, Ludwig-Maximilians-University Munich, Munich, Germany
- Qinhong Zhang
  - Shenzhen Frontiers in Chinese Medicine Research Co., Ltd., Shenzhen, China
- Guanhu Yang
  - Department of Specialty Medicine, Ohio University, Athens, OH, United States
|
2
|
Dey SS, Sharma PK, Munshi AD, Jaiswal S, Behera TK, Kumari K, G. B, Iquebal MA, Bhattacharya RC, Rai A, Kumar D. Genome wide identification of lncRNAs and circRNAs having regulatory role in fruit shelf life in health crop cucumber (Cucumis sativus L.). Frontiers in Plant Science 2022; 13:884476. [PMID: 35991462 PMCID: PMC9383263 DOI: 10.3389/fpls.2022.884476] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2022] [Accepted: 06/27/2022] [Indexed: 06/15/2023]
Abstract
Cucumber is an extremely perishable vegetable; however, under room conditions, the fruits become unfit for consumption 2-3 days after harvesting. One natural variant, DC-48 with an extended shelf-life was identified, fruits of which can be stored up to 10-15 days under room temperature. The genes involved in this economically important trait are regulated by non-coding RNAs. The study aims to identify the long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) by taking two contrasting genotypes, DC-48 and DC-83, at two different fruit developmental stages. The upper epidermis of the fruits was collected at 5 days and 10 days after pollination (DAP) for high throughput RNA sequencing. The differential expression analysis was performed to identify differentially expressed (DE) lncRNAs and circRNAs along with the network analysis of lncRNA, miRNA, circRNA, and mRNA interactions. A total of 97 DElncRNAs were identified where 18 were common under both the developmental stages (8 down regulated and 10 upregulated). Based on the back-spliced reads, 238 circRNAs were found to be distributed uniformly throughout the cucumber genomes with the highest numbers (71) in chromosome 4. The majority of the circRNAs (49%) were exonic in origin followed by inter-genic (47%) and intronic (4%) origin. The genes related to fruit firmness, namely, polygalacturonase, expansin, pectate lyase, and xyloglucan glycosyltransferase were present in the target sites and co-localized networks indicating the role of the lncRNA and circRNAs in their regulation. Genes related to fruit ripening, namely, trehalose-6-phosphate synthase, squamosa promoter binding protein, WRKY domain transcription factors, MADS box proteins, abscisic stress ripening inhibitors, and different classes of heat shock proteins (HSPs) were also found to be regulated by the identified lncRNA and circRNAs. Besides, ethylene biosynthesis and chlorophyll metabolisms were also found to be regulated by DElncRNAs and circRNAs. 
A total of 17 transcripts were also successfully validated through RT PCR data. These results would help the breeders to identify the complex molecular network and regulatory role of the lncRNAs and circRNAs in determining the shelf-life of cucumbers.
Affiliation(s)
- Shyam S. Dey
  - Division of Vegetable Science, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Parva Kumar Sharma
  - Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- A. D. Munshi
  - Division of Vegetable Science, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Sarika Jaiswal
  - Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- T. K. Behera
  - Division of Vegetable Science, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Khushboo Kumari
  - Division of Vegetable Science, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Boopalakrishnan G.
  - Division of Vegetable Science, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Mir Asif Iquebal
  - Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Anil Rai
  - Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Dinesh Kumar
  - Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
|
3
|
Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: A Comprehensive R Package for Generating Evolutionary-based Descriptors of Protein Sequences from PSSM Profiles. Biology Methods and Protocols 2022; 7:bpac008. [PMID: 35388370 PMCID: PMC8977839 DOI: 10.1093/biomethods/bpac008] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 03/03/2021] [Revised: 01/21/2022] [Indexed: 11/14/2022]
Abstract
Position-specific scoring matrix (PSSM), also called profile, is broadly used for representing the evolutionary history of a given protein sequence. Several investigations reported that the PSSM-based feature descriptors can improve the prediction of various protein attributes such as interaction, function, subcellular localization, secondary structure, disorder regions, and accessible surface area. While plenty of algorithms have been suggested for extracting evolutionary features from PSSM in recent years, there is not any integrated standalone tool for providing these descriptors. Here, we introduce PSSMCOOL, a flexible comprehensive R package that generates 38 PSSM-based feature vectors. To our best knowledge, PSSMCOOL is the first PSSM-based feature extraction tool implemented in R. With the growing demand for exploiting machine-learning algorithms in computational biology, this package would be a practical tool for machine-learning predictions.
Affiliation(s)
- Alireza Mohammadi
  - Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
- Javad Zahiri
  - Department of Neuroscience, University of California San Diego, California, USA
  - Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Saber Mohammadi
  - Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
- Mohsen Khodarahmi
  - Department of Radiology, Shahid Madani Hospital, Karaj, Iran
  - Bahar Medical Imaging Center, Karaj, Iran
  - Dr. Khodarahmi Medical Imaging Center, Karaj, Iran
- Seyed Shahriar Arab
  - Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
|
4
|
Abdennaji I, Zaied M, Girault JM. Prediction of protein structural class based on symmetrical recurrence quantification analysis. Comput Biol Chem 2021; 92:107450. [PMID: 33631460 DOI: 10.1016/j.compbiolchem.2021.107450] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/08/2020] [Accepted: 02/03/2021] [Indexed: 11/19/2022]
Abstract
Protein structural class prediction for low similarity sequences is a significant challenge and one of the deeply explored subjects. This plays an important role in drug design, folding recognition of protein, functional analysis and several other biology applications. In this paper, we worked with two benchmark databases existing in the literature (1) 25PDB and (2) 1189 to apply our proposed method for predicting protein structural class. Initially, we transformed protein sequences into DNA sequences and then into binary sequences. Furthermore, we applied symmetrical recurrence quantification analysis (the new approach), where we got 8 features from each symmetry plot computation. Moreover, the machine learning algorithms such as Linear Discriminant Analysis (LDA), Random Forest (RF) and Support Vector Machine (SVM) are used. In addition, comparison was made to find the best classifier for protein structural class prediction. Results show that symmetrical recurrence quantification as feature extraction method with RF classifier outperformed existing methods with an overall accuracy of 100% without overfitting.
Affiliation(s)
- Ines Abdennaji
  - Research Team in Intelligent Machines, National School of Engineers of Gabes, B.P. W, 6072 Gabes, Tunisia; GSII ESEO - LAUM UMR CNRS 6613, 49000 Angers, France
- Mourad Zaied
  - Research Team in Intelligent Machines, National School of Engineers of Gabes, B.P. W, 6072 Gabes, Tunisia; GSII ESEO - LAUM UMR CNRS 6613, 49000 Angers, France
- Jean-Marc Girault
  - Research Team in Intelligent Machines, National School of Engineers of Gabes, B.P. W, 6072 Gabes, Tunisia; GSII ESEO - LAUM UMR CNRS 6613, 49000 Angers, France
|
5
|
Prediction of Protein-Protein Interactions Based on Domain. Computational and Mathematical Methods in Medicine 2019; 2019:5238406. [PMID: 31531123 PMCID: PMC6720845 DOI: 10.1155/2019/5238406] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 05/20/2019] [Revised: 07/09/2019] [Accepted: 07/30/2019] [Indexed: 11/17/2022]
Abstract
Protein-protein interactions (PPIs) play a crucial role in various biological processes. To better comprehend the pathogenesis and treatments of various diseases, it is necessary to learn the detail of these interactions. However, the current experimental method still has many false-positive and false-negative problems. Computational prediction of protein-protein interaction has become a more important prediction method which can overcome the obstacles of the experimental method. In this work, we proposed a novel computational domain-based method for PPI prediction, and an SVM model for the prediction was built based on the physicochemical property of the domain. The outcomes of SVM and the domain-domain score were used to construct the prediction model for protein-protein interaction. The predicted results demonstrated the domain-based research can enhance the ability to predict protein interactions.
|
6
|
A novel feature selection method to predict protein structural class. Comput Biol Chem 2018; 76:118-129. [DOI: 10.1016/j.compbiolchem.2018.06.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 01/05/2018] [Revised: 05/14/2018] [Accepted: 06/30/2018] [Indexed: 01/05/2023]
|
7
|
Yu B, Lou L, Li S, Zhang Y, Qiu W, Wu X, Wang M, Tian B. Prediction of protein structural class for low-similarity sequences using Chou’s pseudo amino acid composition and wavelet denoising. J Mol Graph Model 2017; 76:260-273. [DOI: 10.1016/j.jmgm.2017.07.012] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Received: 04/21/2017] [Revised: 07/11/2017] [Accepted: 07/12/2017] [Indexed: 11/25/2022]
|
8
|
Yuan M, Yang Z, Huang G, Ji G. Feature selection by maximizing correlation information for integrated high-dimensional protein data. Pattern Recognit Lett 2017. [DOI: 10.1016/j.patrec.2017.03.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/27/2022]
|
9
|
Li L, Luo Q, Xiao W, Li J, Zhou S, Li Y, Zheng X, Yang H. A machine-learning approach for predicting palmitoylation sites from integrated sequence-based features. J Bioinform Comput Biol 2017; 15:1650025. [PMID: 27411307 DOI: 10.1142/s0219720016500256] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 02/05/2023]
Abstract
Palmitoylation is the covalent attachment of lipids to amino acid residues in proteins. As an important form of protein posttranslational modification, it increases the hydrophobicity of proteins, which contributes to the protein transportation, organelle localization, and functions, therefore plays an important role in a variety of cell biological processes. Identification of palmitoylation sites is necessary for understanding protein-protein interaction, protein stability, and activity. Since conventional experimental techniques to determine palmitoylation sites in proteins are both labor intensive and costly, a fast and accurate computational approach to predict palmitoylation sites from protein sequences is in urgent need. In this study, a support vector machine (SVM)-based method was proposed through integrating PSI-BLAST profile, physicochemical properties, [Formula: see text]-mer amino acid compositions (AACs), and [Formula: see text]-mer pseudo AACs into the principal feature vector. A recursive feature selection scheme was subsequently implemented to single out the most discriminative features. Finally, an SVM method was implemented to predict palmitoylation sites in proteins based on the optimal features. The proposed method achieved an accuracy of 99.41% and Matthews Correlation Coefficient of 0.9773 for a benchmark dataset. The result indicates the efficiency and accuracy of our method in prediction of palmitoylation sites based on protein sequences.
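For readers comparing the two metrics reported in this abstract, accuracy and the Matthews Correlation Coefficient (MCC) are both computed from the four confusion-matrix counts. The sketch below uses the standard MCC definition; the function names and the toy counts are hypothetical and do not reproduce the paper's confusion matrix.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC; returns 0.0 when any marginal total is zero."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts (assumption, not from the paper).
toy_counts = dict(tp=45, tn=50, fp=0, fn=5)
toy_accuracy = accuracy(**toy_counts)
toy_mcc = matthews_corrcoef(**toy_counts)
```

Unlike accuracy, MCC stays informative when the positive and negative classes are of very different sizes, which is why site-prediction papers often report both.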
Affiliation(s)
- Liqi Li
  - Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Qifa Luo
  - Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Weidong Xiao
  - Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Jinhui Li
  - Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Shiwen Zhou
  - National Drug Clinical Trial Institution, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Yongsheng Li
  - Institute of Cancer, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Xiaoqi Zheng
  - Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
- Hua Yang
  - Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
|
10
|
Li L, Li J, Xiao W, Li Y, Qin Y, Zhou S, Yang H. Prediction the Substrate Specificities of Membrane Transport Proteins Based on Support Vector Machine and Hybrid Features. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:947-953. [PMID: 26571537 DOI: 10.1109/tcbb.2015.2495140] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 02/05/2023]
Abstract
Membrane transport proteins and their substrate specificities play crucial roles in a variety of cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis. However, experimental methods for this purpose are time consuming, labor intensive, and costly. Therefore, we proposed a novel method based on support vector machine (SVM) to predict substrate specificities of membrane transport proteins by integrating features from position-specific score matrix (PSSM), PROFEAT, and Gene Ontology (GO). Jackknife cross-validation tests were performed on a benchmark dataset and an independent dataset to measure the performance of the proposed method. Overall accuracies of 96.16 and 80.45 percent were obtained for the two datasets, which are higher (by 2.12 to 20.44 percent) than those of the state-of-the-art tool. Comparison results indicate that the proposed model is more reliable and efficient for accurate prediction of the substrate specificities of membrane transport proteins.
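The jackknife test mentioned in this abstract is leave-one-out cross-validation: each sample is held out once, the model is trained on the rest, and the held-out prediction is scored. The sketch below illustrates the procedure only. It swaps in a nearest-centroid classifier as a stand-in for the SVM, and all function names and toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): return the label of the closest class mean."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - q) ** 2 for s, q in zip(sums[label], query))

    return min(sums, key=squared_distance)


def jackknife_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the remainder, and score the prediction."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical, well-separated toy data (assumption, not from the paper).
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = ["A", "A", "A", "B", "B", "B"]
toy_acc = jackknife_accuracy(toy_x, toy_y)
```

Any classifier with the same `(train_x, train_y, query)` signature can be passed as `predict`, so the loop itself does not depend on the choice of model.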
|
11
|
A Gram-Negative Bacterial Secreted Protein Types Prediction Method Based on PSI-BLAST Profile. BioMed Research International 2016; 2016:3206741. [PMID: 27563663 PMCID: PMC4985605 DOI: 10.1155/2016/3206741] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/13/2016] [Revised: 07/04/2016] [Accepted: 07/05/2016] [Indexed: 11/29/2022]
Abstract
Prediction of secreted protein types based solely on sequence data remains a challenging problem. In this study, we extract the long-range correlation information and linear correlation information from the position-specific score matrix (PSSM). A total of 6800 features are extracted at 17 different gaps; then, 309 features are selected by a filter feature selection method based on the training set. To verify the performance of our method, jackknife and independent dataset tests are performed on the test set and the reported overall accuracies are 93.60% and 100%, respectively. Comparison of our results with the existing method shows that our method provides favorable performance for secreted protein type prediction.
|
12
|
Zhang L, Kong L, Han X, Lv J. Structural class prediction of protein using novel feature extraction method from chaos game representation of predicted secondary structure. J Theor Biol 2016; 400:1-10. [DOI: 10.1016/j.jtbi.2016.04.011] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 01/11/2016] [Revised: 03/18/2016] [Accepted: 04/08/2016] [Indexed: 11/30/2022]
|
13
|
Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM. Computational and Mathematical Methods in Medicine 2015; 2015:370756. [PMID: 26788119 PMCID: PMC4693000 DOI: 10.1155/2015/370756] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 08/31/2015] [Revised: 11/19/2015] [Accepted: 12/01/2015] [Indexed: 11/17/2022]
Abstract
Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves the favorable and competitive performance. This will offer an important complementary to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.
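The autocovariance transformation (ACT) named in this abstract turns a variable-length PSSM into a fixed-length vector by correlating each column with a lagged copy of itself. The sketch below shows the commonly used unsegmented form under stated assumptions; it is not the paper's segmented variant, and the function name, the two-column toy matrix, and the chosen `max_lag` are hypothetical.

```python
def autocovariance_features(pssm, max_lag):
    """For each lag 1..max_lag and each column j, average the product of
    mean-centred scores at positions i and i + lag.

    pssm is a list of rows, one per residue; a real PSSM has 20 columns,
    so the output would have 20 * max_lag values.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Hypothetical 4-residue, 2-column matrix (assumption, kept tiny for readability).
toy_pssm = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
toy_act = autocovariance_features(toy_pssm, max_lag=2)
```

Because the output length depends only on the number of columns and `max_lag`, proteins of different lengths map to vectors of the same size, which is what makes the features usable by a standard classifier.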
|
14
|
Fan M, Zheng B, Li L. A novel Multi-Agent Ada-Boost algorithm for predicting protein structural class with the information of protein secondary structure. J Bioinform Comput Biol 2015; 13:1550022. [DOI: 10.1142/s0219720015500225] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Knowledge of the structural class of a given protein is important for understanding its folding patterns. Although a lot of efforts have been made, it still remains a challenging problem for prediction of protein structural class solely from protein sequences. The feature extraction and classification of proteins are the main problems in prediction. In this research, we extended our earlier work regarding these two aspects. In protein feature extraction, we proposed a scheme by calculating the word frequency and word position from sequences of amino acid, reduced amino acid, and secondary structure. For an accurate classification of the structural class of protein, we developed a novel Multi-Agent Ada-Boost (MA-Ada) method by integrating the features of Multi-Agent system into Ada-Boost algorithm. Extensive experiments were taken to test and compare the proposed method using four benchmark datasets in low homology. The results showed classification accuracies of 88.5%, 96.0%, 88.4%, and 85.5%, respectively, which are much better compared with the existing methods. The source code and dataset are available on request.
Affiliation(s)
- Ming Fan
  - Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, China
- Bin Zheng
  - Hunan Mechanical and Electrical Polytechnic, Chang Sha 410151, China
- Lihua Li
  - Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, China
|
15
|
Abbass J, Nebel JC. Customised fragments libraries for protein structure prediction based on structural class annotations. BMC Bioinformatics 2015; 16:136. [PMID: 25925397 PMCID: PMC4419399 DOI: 10.1186/s12859-015-0576-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 12/08/2014] [Accepted: 04/17/2015] [Indexed: 12/05/2022] Open
Abstract
Background Since experimental techniques are time and cost consuming, in silico protein structure prediction is essential to produce conformations of protein targets. When homologous structures are not available, fragment-based protein structure prediction has become the approach of choice. However, it still has many issues including poor performance when targets’ lengths are above 100 residues, excessive running times and sub-optimal energy functions. Taking advantage of the reliable performance of structural class prediction software, we propose to address some of the limitations of fragment-based methods by integrating structural constraints in their fragment selection process. Results Using Rosetta, a state-of-the-art fragment-based protein structure prediction package, we evaluated our proposed pipeline on 70 former CASP targets containing up to 150 amino acids. Using either CATH or SCOP-based structural class annotations, enhancement of structure prediction performance is highly significant in terms of both GDT_TS (at least +2.6, p-values < 0.0005) and RMSD (−0.4, p-values < 0.005). Although CATH and SCOP classifications are different, they perform similarly. Moreover, proteins from all structural classes benefit from the proposed methodology. Further analysis also shows that methods relying on class-based fragments produce conformations which are more relevant to user and converge quicker towards the best model as estimated by GDT_TS (up to 10% in average). This substantiates our hypothesis that usage of structurally relevant templates conducts to not only reducing the size of the conformation space to be explored, but also focusing on a more relevant area. Conclusions Since our methodology produces models the quality of which is up to 7% higher in average than those generated by a standard fragment-based predictor, we believe it should be considered before conducting any fragment-based protein structure prediction. 
Despite such progress, ab initio prediction remains a challenging task, especially for proteins of average and large sizes. Apart from improving search strategies and energy functions, integration of additional constraints seems a promising route, especially if they can be accurately predicted from sequence alone. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0576-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jad Abbass
  - Faculty of Science, Engineering and Computing, Kingston University, London, KT1 2EE, UK
- Jean-Christophe Nebel
  - Faculty of Science, Engineering and Computing, Kingston University, London, KT1 2EE, UK
|
16
|
Prediction of protein structural class using tri-gram probabilities of position-specific scoring matrix and recursive feature elimination. Amino Acids 2015; 47:461-8. [DOI: 10.1007/s00726-014-1878-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 07/08/2014] [Accepted: 11/17/2014] [Indexed: 10/24/2022]
|
17
|
Li L, Yu S, Xiao W, Li Y, Hu W, Huang L, Zheng X, Zhou S, Yang H. Protein submitochondrial localization from integrated sequence representation and SVM-based backward feature extraction. Molecular BioSystems 2015; 11:170-177. [PMID: 25335193 DOI: 10.1039/c4mb00340c] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 02/05/2023]
Abstract
Mitochondrion, a tiny energy factory, plays an important role in various biological processes of most eukaryotic cells. Mitochondrial defection is associated with a series of human diseases. Knowledge of the submitochondrial locations of proteins can help to reveal the biological functions of novel proteins, and understand the mechanisms underlying various biological processes occurring in the mitochondrion. However, experimental methods to determine protein submitochondrial locations are costly and time consuming. Thus it is essential to develop a fast and reliable computational method to predict protein submitochondrial locations. Here, we proposed a support vector machine (SVM) based approach for predicting protein submitochondrial locations. Information from the position-specific score matrix (PSSM), gene ontology (GO) and the protein feature (PROFEAT) was integrated into the principal features of this model. Then a recursive feature selection scheme was employed to select the optimal features. Finally, an SVM module was used to predict protein submitochondrial locations based on the optimal features. Through the jackknife cross-validation test, our method achieved an accuracy of 99.37% on benchmark dataset M317, and 100% on the other two datasets, M1105 and T86. These results indicate that our method is economic and effective for accurate prediction of the protein submitochondrial location.
Affiliation(s)
- Liqi Li
  - Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
|
18
|
Li L, Yu S, Xiao W, Li Y, Huang L, Zheng X, Zhou S, Yang H. Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM. BMC Bioinformatics 2014; 15:340. [PMID: 25409550 PMCID: PMC4289199 DOI: 10.1186/1471-2105-15-340] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/09/2014] [Accepted: 09/29/2014] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Identification of recombination hot/cold spots is critical for understanding the mechanism of recombination as well as the genome evolution process. However, experimental identification of recombination spots is both time-consuming and costly. An accurate and automated method for reliably and quickly identifying recombination spots is thus urgently needed. RESULTS Here we proposed a novel approach that fuses features from pseudo nucleic acid composition (PseNAC), including NAC, n-tier NAC and pseudo dinucleotide composition (PseDNC). Recursive feature extraction by a linear kernel support vector machine (SVM) was then used to rank the integrated feature vectors and extract optimal features. SVM was adopted for identifying recombination spots based on these optimal features. To evaluate the performance of the proposed method, a jackknife cross-validation test was employed on a benchmark dataset. The overall accuracy of this approach was 84.09%, which was 0.37% to 3.79% higher than that of state-of-the-art tools. CONCLUSIONS Comparison results suggested that linear kernel SVM is a useful vehicle for identifying recombination hot/cold spots.
Affiliation(s)
- Liqi Li
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Sanjiu Yu
- Institute of Cardiovascular Diseases of PLA, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Weidong Xiao
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Yongsheng Li
- Institute of Cancer, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Lan Huang
- Institute of Cardiovascular Diseases of PLA, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
- Shiwen Zhou
- National Drug Clinical Trial Institution, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Hua Yang
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
19
Li L, Yu S, Xiao W, Li Y, Li M, Huang L, Zheng X, Zhou S, Yang H. Prediction of bacterial protein subcellular localization by incorporating various features into Chou's PseAAC and a backward feature selection approach. Biochimie 2014; 104:100-107. [PMID: 24929100 DOI: 10.1016/j.biochi.2014.06.001] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 05/01/2014] [Accepted: 06/01/2014] [Indexed: 02/08/2023]
Abstract
Information on the subcellular localization of bacterial proteins is essential for protein function prediction, genome annotation and drug design. Here we proposed a novel approach to predict the subcellular localization of bacterial proteins by fusing features from position-specific score matrix (PSSM), Gene Ontology (GO) and PROFEAT. A backward feature selection approach based on a linear kernel SVM was then used to rank the integrated feature vectors and extract optimal features. Finally, SVM was applied to predict protein subcellular locations based on these optimal features. To validate the performance of our method, we employed jackknife cross-validation tests on three low-similarity datasets, i.e., M638, Gneg1456 and Gpos523. Overall accuracies of 94.98%, 93.21%, and 94.57% were achieved for these three datasets, which are 1.8% to 10.9% higher than those of state-of-the-art tools. Comparison results suggest that our method could serve as a very useful vehicle for expediting the prediction of bacterial protein subcellular localization.
Affiliation(s)
- Liqi Li
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Sanjiu Yu
- Institute of Cardiovascular Diseases of PLA, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Weidong Xiao
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Yongsheng Li
- Institute of Cancer, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Maolin Li
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Lan Huang
- Institute of Cardiovascular Diseases of PLA, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
- Shiwen Zhou
- National Drug Clinical Trial Institution, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
- Hua Yang
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China
20
Liu WX, Deng EZ, Chen W, Lin H. Identifying the subfamilies of voltage-gated potassium channels using feature selection technique. Int J Mol Sci 2014; 15:12940-51. [PMID: 25054318 PMCID: PMC4139883 DOI: 10.3390/ijms150712940] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 06/10/2014] [Revised: 07/13/2014] [Accepted: 07/14/2014] [Indexed: 11/16/2022] Open
Abstract
Voltage-gated K+ channels (VKCs) play important roles in biological processes, especially in the nervous system. Different subfamilies of VKCs have different biological functions. Thus, identifying the subfamily of a VKC has become a meaningful task because it can guide disease diagnosis and drug design. However, traditional wet-lab experimental methods are costly and time-consuming. It is highly desirable to develop an effective and powerful computational tool for identifying different subfamilies of VKCs. In this study, a predictor called iVKC-OTC has been developed by incorporating the optimized tripeptide composition (OTC) generated by a feature selection technique into the general form of pseudo-amino acid composition to identify six subfamilies of VKCs. One of the remarkable advantages of introducing the optimized tripeptide composition is that it avoids the notorious curse of dimensionality and overfitting problems in statistical prediction. On a benchmark dataset, using a jackknife test, the overall accuracy achieved by iVKC-OTC reached 96.77% in identifying the six subfamilies of VKCs, indicating that the new predictor is promising or at least may become a complementary tool to the existing methods in this area. It has not escaped our notice that the optimized tripeptide composition can also be used to investigate other protein classification problems.
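As an illustration of the raw feature space behind the optimized tripeptide composition, the sketch below computes the plain tripeptide composition of a protein sequence: the relative frequency of each of the 20^3 = 8000 possible tripeptides. It does not reproduce the feature selection step of iVKC-OTC, which the abstract does not specify; the function name, the handling of non-standard residues and the example sequence are assumptions made for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
INDEX = {tripeptide: i for i, tripeptide in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return an 8000-dimensional vector of tripeptide frequencies.

    Overlapping windows of length 3 are counted; windows containing a
    non-standard residue are skipped (an assumption of this sketch).
    """
    seq = sequence.upper()
    counts = [0] * len(TRIPEPTIDES)
    total = 0
    for i in range(len(seq) - 2):
        idx = INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIPEPTIDES)
    return [c / total for c in counts]


# Hypothetical short peptide, not a real channel sequence.
features = tripeptide_composition("MKVLAAGMKV")
```

A vector of this size is far larger than most benchmark datasets, which is why the abstract stresses selecting an optimized subset of tripeptides before classification.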
Affiliation(s)
- Wei-Xin Liu
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Wei Chen
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China.
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.