1
Liu S, Cui C, Chen H, Liu T. Ensemble Learning-Based Feature Selection for Phage Protein Prediction. Front Microbiol 2022; 13:932661. [PMID: 35910662 PMCID: PMC9335128 DOI: 10.3389/fmicb.2022.932661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] [Open Access]
Abstract
Phages recognize their hosts with high specificity. As natural enemies of bacteria, they have been used many times to treat super bacteria. Identifying phage proteins from the original sequence is very important for understanding the relationship between phages and their host bacteria and for developing new antimicrobial agents. However, traditional experimental methods are both expensive and time-consuming. In this study, an ensemble learning-based feature selection method is proposed to find important features for phage protein identification. The method uses four types of protein sequence-derived features, quantifies the importance of each feature by perturbing it and measuring the effect on the results, and finally concatenates the important features from the four feature types. In addition, we analyzed the selected features and their biological significance.
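The perturbation idea in the abstract above can be illustrated with permutation importance, which is one common way to perturb a feature; the abstract does not say which perturbation the authors used, so this is a sketch under that assumption. The nearest-centroid classifier is only a stand-in for the ensemble model, and all function names and the toy data are hypothetical.

```python
import random

def nearest_centroid_fit(X, y):
    """Return one mean vector (centroid) per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def accuracy(centroids, X, y):
    hits = sum(nearest_centroid_predict(centroids, x) == t for x, t in zip(X, y))
    return hits / len(y)

def perturbation_importance(centroids, X, y, n_repeats=20, seed=0):
    """Importance of feature j = mean accuracy drop after shuffling column j."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # the perturbation: break the feature/label link
            X_perturbed = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perturbed, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (assumption): feature 0 separates the classes,
# feature 1 is a constant, uninformative column.
X = [[0.0, 5.0], [0.1, 5.0], [0.2, 5.0], [1.0, 5.0], [1.1, 5.0], [1.2, 5.0]]
y = [0, 0, 0, 1, 1, 1]
model = nearest_centroid_fit(X, y)
importances = perturbation_importance(model, X, y)
```

Features whose shuffling barely changes accuracy are candidates for removal; the surviving features from each feature type could then be concatenated into one vector.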
Affiliation(s)
- Songbo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Chengmin Cui
  - Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing, China
- Huipeng Chen
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  - *Correspondence: Huipeng Chen
- Tong Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
2
Zhang S, Qiao H. KD-KLNMF: Identification of lncRNAs subcellular localization with multiple features and nonnegative matrix factorization. Anal Biochem 2020; 610:113995. [PMID: 33080214 DOI: 10.1016/j.ab.2020.113995] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/28/2020] [Revised: 09/07/2020] [Accepted: 10/12/2020] [Indexed: 12/18/2022]
Abstract
Long non-coding RNAs (lncRNAs) are functional RNA molecules longer than 200 nucleotides that have minimal or no protein-coding capacity. In recent years, a growing number of studies have shown that the subcellular localization of lncRNAs provides valuable clues to their biological functions, so identifying lncRNA subcellular localization is important. In this paper, a novel statistical model named KD-KLNMF is constructed to predict lncRNA subcellular localization. First, k-mer and dinucleotide-based spatial autocorrelation features are combined into the feature vector. Then, the Synthetic Minority Over-sampling Technique is used to deal with the imbalanced dataset. Next, Kullback-Leibler divergence-based nonnegative matrix factorization is applied to select optimal features. A support vector machine is then used as the classifier after comparison with other classifiers. Finally, the jackknife test is performed to evaluate the model. The overall accuracies reach 97.24% and 92.86% on the training dataset and the independent dataset, respectively. These results are better than those of previous methods, which indicates that the model will be a useful and feasible tool for identifying lncRNA subcellular localization. The datasets and source code are freely available at https://github.com/HuijuanQiao/KD-KLNMF.
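As an illustration of the k-mer part of the feature vector described above, the following minimal sketch computes normalized k-mer frequencies for an RNA sequence. It is not taken from the KD-KLNMF code; the function name, the `ACGU` alphabet, the handling of ambiguous bases and the example sequence are assumptions made for illustration.

```python
from itertools import product

def kmer_frequencies(sequence, k=2, alphabet="ACGU"):
    """Return normalized counts of every length-k word, in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(sequence) - k + 1  # number of sliding windows
    for i in range(total):
        word = sequence[i:i + k]
        if word in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[word] += 1
    if total <= 0:
        return [0.0] * len(kmers)
    return [counts[w] / total for w in kmers]

# Hypothetical 9-nt sequence: 8 overlapping dinucleotide windows.
features = kmer_frequencies("AUGGCUAUG", k=2)
```

With k = 2 the vector has 4^2 = 16 entries; larger k grows the dimension as 4^k, which is one reason a feature selection step is usually applied afterwards.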
Affiliation(s)
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Huijuan Qiao
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
3
Khan F, Khan M, Iqbal N, Khan S, Muhammad Khan D, Khan A, Wei DQ. Prediction of Recombination Spots Using Novel Hybrid Feature Extraction Method via Deep Learning Approach. Front Genet 2020; 11:539227. [PMID: 33093842 PMCID: PMC7527634 DOI: 10.3389/fgene.2020.539227] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 03/10/2020] [Accepted: 08/13/2020] [Indexed: 01/20/2023] [Open Access]
Abstract
Meiotic recombination is the driving force of evolutionary development and an important source of genetic variation. Meiotic recombination does not take place randomly along a chromosome but is concentrated in certain regions. Regions with a higher rate of meiotic recombination events are called hotspots, and regions where recombination events are less frequent are called coldspots. Prediction of meiotic recombination spots provides useful information about the basic functionality of inheritance and genome diversity. This study proposes an intelligent computational predictor called iRSpots-DNN for the identification of recombination spots. The proposed predictor is based on a novel feature extraction method and an optimized deep neural network (DNN). The DNN was employed as the classification engine, whereas the novel feature extraction method was developed to extract meaningful features for the identification of hotspots and coldspots across the yeast genome. Unlike previous algorithms, the proposed feature extraction avoids bias among the selected features and simultaneously preserves the sequence discriminant properties and the sequence-structure information. This study also considered other effective classifiers, namely support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF), to predict recombination spots. Experimental results on a benchmark dataset with 10-fold cross-validation showed that iRSpots-DNN achieved the highest accuracy, 95.81%. Additionally, the performance of the proposed iRSpots-DNN is significantly better than that of the existing predictors on a benchmark dataset. The relevant benchmark dataset and source code are freely available at: https://github.com/Fatima-Khan12/iRspot_DNN/tree/master/iRspot_DNN.
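The 10-fold cross-validation mentioned above partitions the samples into ten disjoint test folds, each used once for evaluation while the rest train the model. The sketch below only generates the index splits; it is not the authors' evaluation code, the function name is hypothetical, and the sample count of 25 is an arbitrary example. In practice the samples would usually be shuffled or stratified by class first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Hypothetical dataset of 25 samples split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

The reported accuracy is then the average (or pooled) accuracy over the ten held-out folds.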
Affiliation(s)
- Fatima Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Mukhtaj Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Nadeem Iqbal
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Salman Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Dost Muhammad Khan
  - Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Abbas Khan
  - Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Dong-Qing Wei
  - Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
  - State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation Center on Antibacterial Resistances, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Ministry of Education, Shanghai, China
  - Peng Cheng Laboratory, Shenzhen, China
4
Wang C, Wang W, Lu K, Zhang J, Chen P, Wang B. Predicting Drug-Target Interactions with Electrotopological State Fingerprints and Amphiphilic Pseudo Amino Acid Composition. Int J Mol Sci 2020; 21:5694. [PMID: 32784497 PMCID: PMC7570185 DOI: 10.3390/ijms21165694] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/18/2020] [Revised: 08/05/2020] [Accepted: 08/06/2020] [Indexed: 12/13/2022] [Open Access]
Abstract
Drug-target interaction (DTI) prediction plays an important role in drug development. Experimental methods for identifying DTIs are time-consuming, expensive and challenging. To solve these problems, machine learning-based methods have been introduced, but they are limited by the need for effective feature extraction and negative sampling. In this work, electrotopological state (E-state) fingerprints for drugs and amphiphilic pseudo amino acid composition (APAAC) for target proteins are tested as features. E-state fingerprints are extracted from both molecular electronic and topological features with the same metric. APAAC is an extension of amino acid composition (AAC) that uses hydrophilic and hydrophobic characteristics to capture sequence-order information. Using the combination of these feature pairs, the prediction model is built with support vector machines. To enhance the effectiveness of the features, a distance-based negative sampling method is proposed to obtain reliable negative samples. The area under the receiver operating characteristic curve (AUC) is above 98.5% for all three datasets in this work. Comparison with state-of-the-art methods demonstrates the effectiveness and efficiency of the proposed method, which will be helpful for further drug development.
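The amino acid composition (AAC) that APAAC extends is simply the fraction of each of the 20 standard residues in a protein. The following minimal sketch computes plain AAC only; it does not include the hydrophobicity- and hydrophilicity-based sequence-order terms of APAAC, and the function name and the example peptide are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return the 20-dimensional AAC vector (fractions summing to 1)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:  # empty or fully non-standard input
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]

# Hypothetical 4-residue peptide: K occurs twice, M and A once each.
aac = amino_acid_composition("MKKA")
```

Because AAC discards residue order, two different proteins with the same residue counts get identical vectors, which is the limitation the sequence-order terms of APAAC are meant to address.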
Affiliation(s)
- Cheng Wang
  - Department of Computer Science & Technology, Tongji University, Shanghai 201804, China
- Wenyan Wang
  - School of Electrical & Information Engineering, Anhui University of Technology, Ma’anshan 243002, China
  - Key Laboratory of Power Electronics and Motion Control Anhui Education Department, Ma’anshan 243032, China
- Kun Lu
  - School of Electrical & Information Engineering, Anhui University of Technology, Ma’anshan 243002, China
- Jun Zhang
  - Institutes of Physical Science and Information Technology & School of Internet, Anhui University, Hefei 230601, China
- Peng Chen
  - Institutes of Physical Science and Information Technology & School of Internet, Anhui University, Hefei 230601, China
  - Correspondence: (P.C.); (B.W.)
- Bing Wang
  - Department of Computer Science & Technology, Tongji University, Shanghai 201804, China
  - School of Electrical & Information Engineering, Anhui University of Technology, Ma’anshan 243002, China
  - Key Laboratory of Power Electronics and Motion Control Anhui Education Department, Ma’anshan 243032, China
  - Correspondence: (P.C.); (B.W.)
5
iDHS-DSAMS: Identifying DNase I hypersensitive sites based on the dinucleotide property matrix and ensemble bagged tree. Genomics 2020; 112:1282-1289. [DOI: 10.1016/j.ygeno.2019.07.017] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 06/15/2019] [Revised: 07/14/2019] [Accepted: 07/30/2019] [Indexed: 11/21/2022]
6
Yu B, Qiu W, Chen C, Ma A, Jiang J, Zhou H, Ma Q. SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting. Bioinformatics 2019; 36:1074-1081. [DOI: 10.1093/bioinformatics/btz734] [Citation(s) in RCA: 98] [Impact Index Per Article: 16.3] [Received: 01/25/2019] [Revised: 09/04/2019] [Accepted: 09/25/2019] [Indexed: 11/13/2022] [Open Access]
Abstract
Motivation
Mitochondria are essential organelles in most eukaryotes. They not only play an important role in energy metabolism but also take part in many critical cytopathological processes. Abnormal mitochondria can trigger a series of human diseases, such as Parkinson's disease, multifactor disorder and Type-II diabetes. Knowing the submitochondrial localization of proteins helps in understanding protein function when studying disease pathogenesis and drug design.
Results
We propose a new method, SubMito-XGBoost, for protein submitochondrial localization prediction. It comprises three steps: (i) the g-gap dipeptide composition (g-gap DC), pseudo-amino acid composition (PseAAC), auto-correlation function (ACF) and Bi-gram position-specific scoring matrix (Bi-gram PSSM) are employed to extract protein sequence features; (ii) the Synthetic Minority Oversampling Technique (SMOTE) is used to balance samples, and the ReliefF algorithm is applied for feature selection; and (iii) the obtained feature vectors are fed into XGBoost to predict protein submitochondrial locations. SubMito-XGBoost obtained satisfactory prediction results under leave-one-out cross-validation (LOOCV) compared with existing methods. The prediction accuracies of the SubMito-XGBoost method on the two training datasets M317 and M983 were 97.7% and 98.9%, which are 2.8–12.5% and 3.8–9.9% higher than other methods, respectively. The prediction accuracy on the independent test set M495 was 94.8%, which is significantly better than that of existing studies. The proposed method also achieves satisfactory predictive performance on plant and non-plant protein submitochondrial datasets. SubMito-XGBoost may also be useful in new drug design for the treatment of related diseases.
Availability and implementation
The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/SubMito-XGBoost/.
Supplementary information
Supplementary data are available at Bioinformatics online.
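Of the feature extractors listed in step (i) above, the g-gap dipeptide composition is the simplest to sketch: it counts residue pairs separated by g intervening positions. The code below is an illustration rather than the SubMito-XGBoost implementation; the function name, the example sequence, and the choice to normalize by the number of valid pairs (instead of the sequence length minus g minus 1) are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def g_gap_dipeptide_composition(sequence, g=0):
    """Return the 400-dimensional frequency vector of residue pairs
    (sequence[i], sequence[i + g + 1]); g = 0 gives adjacent dipeptides."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    n_pairs = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
            n_pairs += 1
    if n_pairs == 0:
        return [0.0] * len(pairs)
    return [counts[p] / n_pairs for p in pairs]

# Hypothetical sequence "ACDA" with g = 1 yields the pairs A_D and C_A.
vec = g_gap_dipeptide_composition("ACDA", g=1)
```

Varying g lets the descriptor capture correlations between residues that are close in the folded structure but not adjacent in the sequence.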
Affiliation(s)
- Bin Yu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China
  - School of Life Sciences, University of Science and Technology of China, Hefei 230027, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
  - School of Mathematics and Statistics, Changsha University of Science and Technology, Changsha 410114, China
- Wenying Qiu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Cheng Chen
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Jing Jiang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
  - School of Aerospace Engineering, Xiamen University, Xiamen 361001, China
- Hongyan Zhou
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China
  - Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
7
Li SH, Guan ZX, Zhang D, Zhang ZM, Huang J, Yang W, Lin H. Recent Advancement in Predicting Subcellular Localization of Mycobacterial Protein with Machine Learning Methods. Med Chem 2019; 16:605-619. [PMID: 31584379 DOI: 10.2174/1573406415666191004101913] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/01/2019] [Revised: 06/25/2019] [Accepted: 08/23/2019] [Indexed: 01/28/2023]
Abstract
Mycobacterium tuberculosis (MTB) causes tuberculosis (TB), which is reported to be one of the most dreadful epidemics. Although many biochemical molecular drugs have been developed to cope with this disease, drug resistance, especially multidrug resistance (MDR) and extensive drug resistance (XDR), poses a huge threat to treatment. Moreover, traditional biochemical experimental methods for tackling TB are time-consuming and costly. Thanks to the availability of enormous amounts of genomic and proteomic sequence data, TB can be tackled via sequence-based computational approaches, that is, bioinformatics. Studies on predicting the subcellular localization of mycobacterial proteins (MBP) with high precision and efficiency may help determine the biological function of these proteins and provide useful insights for protein function annotation as well as drug design. In this review, we report the progress that has been made in the computational prediction of the subcellular localization of MBP, covering the following aspects: 1) Construction of benchmark datasets. 2) Methods of feature extraction. 3) Techniques of feature selection. 4) Application of several published prediction algorithms. 5) The published results. 6) Further study on the prediction of subcellular localization of MBP.
Affiliation(s)
- Shi-Hao Li
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zheng-Xing Guan
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Dan Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zi-Mei Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Jian Huang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wuritu Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - Development and Planning Department, Inner Mongolia University, Hohhot, P.R. China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
8
Liang Y, Zhang S. iDHS-DMCAC: identifying DNase I hypersensitive sites with balanced dinucleotide-based detrending moving-average cross-correlation coefficient. SAR and QSAR in Environmental Research 2019; 30:429-445. [PMID: 31117818 DOI: 10.1080/1062936x.2019.1615546] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 05/15/2023]
Abstract
DNase I hypersensitive sites (DHSs) are associated with regulatory DNA elements, so understanding them well is significant for both biomedical research and the discovery of new drugs. Detecting DHSs with traditional experimental methods is laborious, time-consuming and inaccurate. More importantly, with the avalanche of genome sequences in the postgenomic age, it is essential to develop cost-effective computational approaches to identify DHSs. In this paper, we develop a statistical feature extraction model using the detrended moving-average cross-correlation (DMCA) coefficient descriptor based on a dinucleotide property matrix generated from 15 DNA dinucleotide properties, and this model is named iDHS-DMCAC. A 105-dimensional feature vector is constructed for a certain window on the two class-imbalanced benchmark datasets, with over-sampling and support vector machine algorithms. Rigorous cross-validations indicate that our predictor remarkably outperforms the existing models in both accuracy and stability. We anticipate that iDHS-DMCAC will become a very useful high-throughput tool, or at the very least, a complementary tool to the existing methods of identifying DNase I hypersensitive sites. The datasets and source codes of the proposed model are freely available at https://github.com/shengli0201/Datasets.
Affiliation(s)
- Y Liang
  - School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
- S Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
9
Yang Q, Jia C, Li T. Prediction of aptamer-protein interacting pairs based on sparse autoencoder feature extraction and an ensemble classifier. Math Biosci 2019; 311:103-108. [PMID: 30880100 DOI: 10.1016/j.mbs.2019.01.009] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 10/10/2018] [Revised: 01/29/2019] [Accepted: 01/29/2019] [Indexed: 10/27/2022]
Abstract
Aptamer-protein interacting pairs play important roles in physiological functions and structural characterization. Identifying aptamer-protein interacting pairs remains challenging and limited, despite the tremendous applications of aptamers. Therefore, it is vital to construct a high-performance prediction model for identifying aptamer-target interacting pairs. In this study, a novel ensemble method is presented to predict aptamer-protein interacting pairs by integrating sequence characteristics derived from aptamers and the target proteins. The features extracted for aptamers were the compositions of amino acids and pseudo K-tuple nucleotides. In addition, a sparse autoencoder was used to characterize features for the target protein sequences. To remove redundant features, gradient boosting decision tree (GBDT) and incremental feature selection (IFS) methods were used to obtain the optimum combination of sequence characteristics. Based on 616 selected features, an ensemble of three support vector machine (SVM) sub-classifiers was used to construct our prediction model. Evaluated on an independent dataset, our predictor obtained an accuracy of 75.7%, a Matthews correlation coefficient of 0.478, and a Youden's index of 0.538, which were superior to the values reached by other existing predictors. The results show that our model can be used to distinguish novel aptamer-protein interacting pairs and to reveal the interrelation between aptamers and proteins.
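The three evaluation measures quoted above follow directly from a binary confusion matrix. The sketch below uses the standard definitions (Youden's index is sensitivity plus specificity minus 1); the function name and the confusion-matrix counts are made up for illustration and are not the study's data.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, Matthews correlation coefficient and Youden's index
    from the four cells of a binary confusion matrix."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "mcc": mcc,
        "youden_j": sensitivity + specificity - 1.0,
    }

# Hypothetical counts: 40 TP, 10 FP, 40 TN, 10 FN.
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Unlike accuracy, both MCC and Youden's index stay near zero for a classifier that ignores the input, which makes them more informative on imbalanced datasets.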
Affiliation(s)
- Qing Yang
  - Institute of Environmental Systems Biology, College of Environmental and Engineering, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
- Cangzhi Jia
  - School of Science, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
- Taoying Li
  - Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
10
Tang H, Zhao YW, Zou P, Zhang CM, Chen R, Huang P, Lin H. HBPred: a tool to identify growth hormone-binding proteins. Int J Biol Sci 2018; 14:957-964. [PMID: 29989085 PMCID: PMC6036759 DOI: 10.7150/ijbs.24174] [Citation(s) in RCA: 136] [Impact Index Per Article: 19.4] [Received: 12/04/2017] [Accepted: 01/15/2018] [Indexed: 12/19/2022] [Open Access]
Abstract
Hormone-binding protein (HBP) is a kind of soluble carrier protein that can selectively and non-covalently interact with hormones. HBP plays an important role in growth, but its function is still unclear. Correct recognition of HBPs is the first step toward further studying their function and understanding their biological processes. However, it is difficult to correctly recognize HBPs among an ever-growing number of proteins through traditional biochemical experiments because of the high experimental cost and long experimental period. To overcome these disadvantages, we designed a computational method for identifying HBPs accurately in this study. First, we collected HBP data from UniProt to establish a high-quality benchmark dataset. Based on the dataset, the dipeptide composition was extracted from HBP residue sequences. To find the optimal features that provide key clues for HBP identification, analysis of variance (ANOVA) was performed for feature ranking. The optimal features were selected through the incremental feature selection strategy. Subsequently, the features were input into a support vector machine (SVM) for prediction model construction. Jackknife cross-validation results showed that 88.6% of HBPs and 81.3% of non-HBPs were correctly recognized, suggesting that our proposed model was powerful. This study provides a new strategy to identify HBPs. Moreover, based on the proposed model, we established a webserver called HBPred, which could be freely accessed at http://lin-group.cn/server/HBPred.
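ANOVA-based feature ranking, as mentioned above, scores each feature by the ratio of between-class to within-class variance (the one-way F statistic) and sorts features by that score. The sketch below shows this scoring with the textbook F formula; it is not the HBPred code, and the function names and the toy feature matrix are assumptions.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature, given its values per class."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    if ss_within == 0:  # perfectly separated or constant feature
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)

def rank_features(X, y):
    """Return feature indices sorted from highest to lowest F score."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, t in zip(X, y) if t == label] for label in labels]
        scores.append(anova_f(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy data (assumption): feature 0 differs between classes, feature 1 barely does.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

Incremental feature selection would then add features in this ranked order one at a time, retrain the classifier at each step, and keep the subset with the best cross-validated performance.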
Affiliation(s)
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Ya-Wei Zhao
  - Key Laboratory for NeuroInformation of Ministry of Education, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Ping Zou
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Chun-Mei Zhang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Rong Chen
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Po Huang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Hao Lin
  - Key Laboratory for NeuroInformation of Ministry of Education, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China