1
Li Z, Liao B, Li Y, Liu W, Chen M, Cai L. Gene function prediction based on combining gene ontology hierarchy with multi-instance multi-label learning. RSC Adv 2018; 8:28503-28509. [PMID: 35542493 PMCID: PMC9083914 DOI: 10.1039/c8ra05122d] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/14/2018] [Accepted: 07/12/2018] [Indexed: 12/04/2022] Open
Abstract
Gene function annotation, an important part of genome annotation, is a main challenge of the post-genome era. The Human Genome Project produced whole-genome data that provide abundant biological information for studying gene function. To extract useful knowledge from such large amounts of data, a promising strategy is to apply machine learning methods to mine the data and predict gene function. In this study, we improved multi-instance hierarchical clustering by using the gene ontology hierarchy to annotate gene function, combining the gene ontology hierarchy with a multi-instance multi-label learning framework. We then used the multi-label support vector machine (MLSVM) and multi-label k-nearest neighbor (MLKNN) algorithms to predict gene function. Finally, we evaluated our method on four yeast expression datasets. The results of the simulated experiments showed that our method is efficient.
Affiliation(s)
- Zejun Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Bo Liao
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Yun Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Wenhua Liu
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Min Chen
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Lijun Cai
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
2
Li W, Liao B, Zhu W, Chen M, Li Z, Wei X, Peng L, Huang G, Cai L, Chen H. Fisher Discrimination Regularized Robust Coding Based on a Local Center for Tumor Classification. Sci Rep 2018; 8:9152. [PMID: 29904059 PMCID: PMC6002553 DOI: 10.1038/s41598-018-27364-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/23/2018] [Accepted: 05/31/2018] [Indexed: 11/29/2022] Open
Abstract
Tumor classification is crucial to the clinical diagnosis and proper treatment of cancers. In recent years, the sparse representation-based classifier (SRC) has been proposed for tumor classification. The employed dictionary plays an important role in sparse representation-based or sparse coding-based classification. However, sparse representation-based tumor classification models have not made full use of the employed dictionary, which limits their performance. Furthermore, the sparse representation model assumes that the coding residual follows a Gaussian or Laplacian distribution, which may not effectively describe the coding residual in practical tumor classification. In the present study, we formulated a novel and effective cancer classification technique, Fisher discrimination regularized robust coding (FDRRC), by combining the Fisher discrimination dictionary learning method with the regularized robust coding (RRC) model, which searches for a maximum a posteriori solution to coding problems by assuming that the coding residual and representation coefficient are independent and identically distributed. The proposed FDRRC model is extensively evaluated on various tumor datasets and shows superior performance compared with various state-of-the-art tumor classification methods in a variety of classification tasks.
Affiliation(s)
- Weibiao Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- Bo Liao
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- Wen Zhu
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- Min Chen
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- Zejun Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- Xiaohui Wei
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- Lihong Peng
  - Hunan University of Technology, Zhu Zhou, Hunan, 412007, China
- Guohua Huang
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- Lijun Cai
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
- HaoWen Chen
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China
3
Identification of DNA-protein Binding Sites through Multi-Scale Local Average Blocks on Sequence Information. Molecules 2017; 22:2079. [PMID: 29182548 PMCID: PMC6149935 DOI: 10.3390/molecules22122079] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 10/31/2017] [Revised: 11/22/2017] [Accepted: 11/24/2017] [Indexed: 12/25/2022] Open
Abstract
DNA–protein interactions play pivotal roles in diverse biological processes and are essential for cell metabolism. Identifying them computationally is a sensible way to reduce the cost of in vitro and in vivo experiments. A variety of state-of-the-art studies have sought to improve the accuracy of DNA–protein binding site prediction. Nevertheless, structure-based approaches are limited when 3D information is unavailable, and predictive performance can still be improved. In this paper, we present a competitive method, the Multi-scale Local Average Blocks (MLAB) algorithm, to address this issue. Unlike structure-based approaches, MLAB not only extracts local evolutionary information from primary sequences but also uses predicted solvent accessibility. Moreover, the predictors of DNA–protein binding sites are built with an ensemble weighted sparse representation model with random under-sampling. To evaluate MLAB, we conduct comprehensive experiments on DNA–protein binding site prediction. MLAB achieves MCC values of 0.392, 0.315, 0.439 and 0.245 on the PDNA-543, PDNA-41, PDNA-316 and PDNA-52 datasets, respectively, which compares favorably with other leading methods. The MCC of our method is higher by at least 0.053, 0.015 and 0.064 on the PDNA-543, PDNA-41 and PDNA-316 datasets, respectively.
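The abstract does not spell out how the local average blocks are formed, so the following is only a minimal sketch of the general idea, assuming that a per-residue feature matrix (for example, PSSM rows) is cut into contiguous blocks along the sequence and each block is averaged column by column, with the procedure repeated at several block counts. The function names, the default `scales`, and the block boundaries are illustrative assumptions, not the authors' definitions.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows = residues, columns = per-residue features


def local_average_blocks(rows: Matrix, n_blocks: int) -> List[float]:
    """Split rows into n_blocks contiguous blocks and average every column per block."""
    if not rows or n_blocks < 1:
        raise ValueError("need a non-empty matrix and n_blocks >= 1")
    n_rows, n_cols = len(rows), len(rows[0])
    features: List[float] = []
    for b in range(n_blocks):
        # Integer boundaries so the blocks cover every row exactly once.
        start = b * n_rows // n_blocks
        end = (b + 1) * n_rows // n_blocks
        # Hypothetical fallback: if there are fewer rows than blocks,
        # reuse the nearest row instead of averaging an empty block.
        block = rows[start:end] or [rows[min(start, n_rows - 1)]]
        for c in range(n_cols):
            features.append(sum(r[c] for r in block) / len(block))
    return features


def multi_scale_features(rows: Matrix, scales: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Concatenate the block averages obtained at each scale (number of blocks)."""
    out: List[float] = []
    for n_blocks in scales:
        out.extend(local_average_blocks(rows, n_blocks))
    return out
```

With this layout the feature length is fixed at `n_cols * sum(scales)` regardless of sequence length, which is the property that makes such block averages convenient as classifier input.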
4
Ding Y, Tang J, Guo F. Identification of Protein-Ligand Binding Sites by Sequence Information and Ensemble Classifier. J Chem Inf Model 2017; 57:3149-3161. [PMID: 29125297 DOI: 10.1021/acs.jcim.7b00307] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Indexed: 11/29/2022]
Abstract
Identifying protein-ligand binding sites is an important process in drug discovery and structure-based drug design. Detecting protein-ligand binding sites with traditional experimental methods is expensive and time-consuming, so computational approaches offer many effective strategies for this problem. Many recent computational methods are based on structural information about proteins. However, these methods are limited in the common scenario where the sequence of the protein target is known but sufficient 3D structure information is unavailable. Studies indicate that sequence-based computational approaches for predicting protein-ligand binding sites are more practical. In this paper, we employ a novel computational model for protein-ligand binding site prediction that uses the protein sequence. We apply the Discrete Cosine Transform (DCT) to extract features from the Position-Specific Score Matrix (PSSM). To improve accuracy, Predicted Relative Solvent Accessibility (PRSA) information is also used. The predictor of protein-ligand binding sites is built with an ensemble weighted sparse representation model with random under-sampling. To evaluate our method, we conduct several comprehensive tests (testing sets for 12 types of ligands) for predicting protein-ligand binding sites. Results show that our method achieves a better Matthews correlation coefficient (MCC) than other leading methods on independent test sets of ATP (0.506), ADP (0.511), AMP (0.393), GDP (0.579), GTP (0.641), Mg2+ (0.317), Fe3+ (0.490) and HEME (0.640). Our proposed method outperforms earlier predictors in terms of MCC for 8 of the 12 ligand types.
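As a rough illustration of DCT-based feature extraction from a PSSM, the sketch below applies an unnormalized DCT-II to each PSSM column along the sequence and keeps only the first few low-frequency coefficients. The choice of DCT-II, the column-wise application, and the `keep` parameter are assumptions made for the example; the abstract does not state how the transform is windowed or truncated.

```python
import math
from typing import List


def dct2(signal: List[float]) -> List[float]:
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def dct_features(pssm: List[List[float]], keep: int) -> List[float]:
    """Transform each PSSM column and keep its first `keep` DCT coefficients.

    `pssm` has one row per residue and one column per amino-acid type.
    `keep` is a hypothetical truncation length chosen by the caller.
    """
    n_cols = len(pssm[0])
    features: List[float] = []
    for c in range(n_cols):
        column = [row[c] for row in pssm]
        # Low-frequency coefficients summarize the overall trend of the column.
        features.extend(dct2(column)[:keep])
    return features
```

Truncating to the leading coefficients gives a fixed-length vector that compresses most of the signal energy, which is the usual motivation for using the DCT as a feature extractor.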
Affiliation(s)
- Yijie Ding
  - School of Computer Science and Technology, Tianjin University, No. 135, Yaguan Road, Tianjin Haihe Education Park, Tianjin 300350, China
- Jijun Tang
  - School of Computer Science and Technology, Tianjin University, No. 135, Yaguan Road, Tianjin Haihe Education Park, Tianjin 300350, China
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Fei Guo
  - School of Computer Science and Technology, Tianjin University, No. 135, Yaguan Road, Tianjin Haihe Education Park, Tianjin 300350, China
5
Li W, Liao B, Zhu W, Chen M, Peng L, Wei X, Gu C, Li K. Maxdenominator Reweighted Sparse Representation for Tumor Classification. Sci Rep 2017; 7:46030. [PMID: 28393883 PMCID: PMC5385541 DOI: 10.1038/srep46030] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/14/2016] [Accepted: 03/08/2017] [Indexed: 11/09/2022] Open
Abstract
The classification of tumors is crucial for the proper treatment of cancer. The sparse representation-based classifier (SRC) exhibits good classification performance and has been successfully used to classify tumors using gene expression profile data. In this study, we propose a three-step maxdenominator reweighted sparse representation classification (MRSRC) method to classify tumors. First, we extract a set of metagenes from the training samples. These metagenes can capture the structures inherent to the data and are more effective for classification than the original gene expression data. Second, we use a reweighted regularization method to obtain the sparse representation coefficients. Reweighted regularization can enhance sparsity and yield better sparse representation coefficients. Third, we classify the data by using a maxdenominator residual error function. The maxdenominator strategy can reduce the residual error and improve the accuracy of the final classification. Extensive experiments on publicly available gene expression profile data sets show that the performance of MRSRC is comparable with or better than that of many existing representative methods.
Affiliation(s)
- Weibiao Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Bo Liao
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Wen Zhu
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Min Chen
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Li Peng
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Xiaohui Wei
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Changlong Gu
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Keqin Li
  - Department of Computer Science, State University of New York, New Paltz, New York 12561, USA
6
Liu W, Zhu W, Liao B, Chen H, Ren S, Cai L. Improving gene regulatory network structure using redundancy reduction in the MRNET algorithm. RSC Adv 2017. [DOI: 10.1039/c7ra01557g] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/21/2022] Open
Abstract
Inferring gene regulatory networks from expression data is a central problem in systems biology.
Affiliation(s)
- Wei Liu
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Wen Zhu
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Bo Liao
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Haowen Chen
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Siqi Ren
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Lijun Cai
  - College of Information Science and Engineering, Hunan University, Changsha, China
7
Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA. Int J Mol Sci 2015; 16:30343-61. [PMID: 26703574 PMCID: PMC4691178 DOI: 10.3390/ijms161226237] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 10/09/2015] [Revised: 12/07/2015] [Accepted: 12/11/2015] [Indexed: 01/01/2023] Open
Abstract
An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. Existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and the position-specific scoring matrix (PSSM), are insufficient to represent a protein sequence because each captures only a single perspective. This paper therefore proposes two fusion feature representations, DipPSSM and PseAAPSSM, which integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce balance factors to weight the importance of its components. The optimal values of the balance factors are sought with a genetic algorithm. Because the proposed representations are high-dimensional, linear discriminant analysis (LDA) is used to find their important low-dimensional structure, which is essential for classification and location prediction. Numerical experiments on two public datasets with a KNN classifier and cross-validation tests showed that, in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusion representations outperform the traditional representations in protein sub-nuclear localization, and the representations treated by LDA outperform the untreated ones.
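The abstract does not give the fusion formula, so the sketch below only illustrates one plausible reading of a balance factor: a single weight in [0, 1] that scales the two component vectors before they are concatenated. The weighting scheme, the function names, and the `score` callback are all hypothetical, and the plain grid search is a deliberate stand-in for the genetic algorithm mentioned in the abstract, not a reproduction of it.

```python
from typing import Callable, List, Sequence, Tuple


def fuse(a: Sequence[float], b: Sequence[float], weight: float) -> List[float]:
    """Concatenate two feature vectors, scaled by weight and (1 - weight)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * x for x in a] + [(1.0 - weight) * y for y in b]


def best_weight(score: Callable[[float], float], steps: int = 10) -> Tuple[float, float]:
    """Pick the balance factor with the highest score on an evenly spaced grid.

    `score` is a caller-supplied function (for example, cross-validated
    accuracy of a classifier trained on the fused features). A grid search
    is used here only for brevity in place of a genetic algorithm.
    """
    candidates = [i / steps for i in range(steps + 1)]
    best = max(candidates, key=score)
    return best, score(best)
```

In practice `score` would wrap the whole pipeline: fuse the two representations at the candidate weight, reduce dimensionality, and evaluate the classifier.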
8
Huang YA, You ZH, Gao X, Wong L, Wang L. Using Weighted Sparse Representation Model Combined with Discrete Cosine Transformation to Predict Protein-Protein Interactions from Protein Sequence. Biomed Res Int 2015; 2015:902198. [PMID: 26634213 PMCID: PMC4641304 DOI: 10.1155/2015/902198] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Received: 08/13/2015] [Accepted: 10/04/2015] [Indexed: 01/08/2023]
Abstract
Increasing demand for knowledge about protein-protein interactions (PPIs) is promoting the development of methods for predicting protein interaction networks. Although high-throughput technologies have generated considerable PPI data for various organisms, they have inevitable drawbacks such as high cost, long run times, and an inherently high false positive rate. For this reason, computational methods for predicting PPIs are drawing more and more attention. In this study, we report a computational method for predicting PPIs using protein sequence information. The main improvements come from adopting a novel protein sequence representation that applies the discrete cosine transform (DCT) to the substitution matrix representation (SMR), and from using a weighted sparse representation based classifier (WSRC). When applied to the PPI datasets of Yeast, Human, and H. pylori, the method achieved average accuracies of 96.28%, 96.30%, and 86.74%, respectively, significantly better than previous methods. These promising results indicate that the proposed method is feasible, robust, and powerful. To further evaluate the proposed method, we compared it with the state-of-the-art support vector machine (SVM) classifier. Extensive experiments were also performed in which we used Yeast PPI samples as the training set to predict PPIs in datasets from five other species.
Affiliation(s)
- Yu-An Huang
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
- Zhu-Hong You
  - School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
- Xin Gao
  - Department of Medical Imaging, Suzhou Institute of Biomedical Engineering and Technology, Suzhou, Jiangsu 215163, China
- Leon Wong
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
- Lirong Wang
  - School of Electronic and Information Engineering, Soochow University, Suzhou, Jiangsu 215123, China