1
Ma T, Jiang M, Pang S, Zhang Z, Hang H, Zhou W, Zhang Y. SeqMG-RPI: A Sequence-Based Framework Integrating Multi-Scale RNA Features and Protein Graphs for RNA-Protein Interaction Prediction. J Chem Inf Model 2025. [PMID: 40262169 DOI: 10.1021/acs.jcim.5c00176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/24/2025]
Abstract
RNA-protein interaction (RPI) plays a crucial role in cell biology, and accurate prediction of RPI is essential to understand molecular mechanisms and advance disease research. Some existing RPI prediction methods rely on a single feature, leaving significant room for improvement. In this paper, we propose a novel sequence-based RPI prediction method, called SeqMG-RPI. For RNA, SeqMG-RPI introduces an innovative multi-scale RNA feature that integrates three sequence-based representations: a multi-channel RNA feature, a k-mer frequency feature, and a k-mer sparse matrix feature. For protein, SeqMG-RPI utilizes a graph-based protein feature to capture protein information. Moreover, a novel neural network architecture is constructed for feature extraction and RPI prediction. Experiments from multiple perspectives across various datasets demonstrate that the proposed method outperforms existing methods, with better performance and generalization.
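As a rough illustration of the k-mer frequency feature named in this abstract, the sketch below counts overlapping k-mers in an RNA sequence and normalizes the counts. The function name `kmer_frequencies`, the default `k=3`, and the handling of ambiguous bases are assumptions made for this example; the abstract does not specify how SeqMG-RPI computes the feature.

```python
from itertools import product

def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Return the normalized k-mer frequency vector of an RNA sequence.

    Illustrative only: the vector has len(alphabet)**k entries, ordered
    lexicographically over `alphabet`.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

# "AUGGCUAUG" has 8 overlapping 2-mers; "AU" occurs twice, so its entry is 0.25.
vec = kmer_frequencies("AUGGCUAUG", k=2)
```

The fixed-length output makes sequences of different lengths comparable, which is the usual reason for using k-mer frequencies as model input.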
Affiliation(s)
- Teng Ma
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Mingjian Jiang
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Shunpeng Pang
  - School of Computer Engineering, Weifang University, Weifang 261061, China
- Zhi Zhang
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Huaibin Hang
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Wei Zhou
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
- Yuanyuan Zhang
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, China
2
Huang Y, Huang S, Zhang XF, Ou-Yang L, Liu C. NJGCG: A node-based joint Gaussian copula graphical model for gene networks inference across multiple states. Comput Struct Biotechnol J 2024; 23:3199-3210. [PMID: 39263209 PMCID: PMC11388165 DOI: 10.1016/j.csbj.2024.08.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2024] [Revised: 08/05/2024] [Accepted: 08/11/2024] [Indexed: 09/13/2024] Open
Abstract
Inferring the interactions between genes is essential for understanding the mechanisms underlying biological processes. Gene networks change with environment and state. The accumulation of gene expression data from multiple states makes it possible to estimate the gene networks in various states with computational methods. However, most existing gene network inference methods focus on estimating a gene network from a single state, ignoring the similarities between networks in different but related states. Moreover, in addition to individual edges, similarities and differences between networks may also be driven by hub genes, yet existing network inference methods rarely consider hub genes, which affects the accuracy of network estimation. In this paper, we propose a novel node-based joint Gaussian copula graphical (NJGCG) model to jointly infer multiple gene networks from gene expression data containing heterogeneous samples. Our model can handle various gene expression data with missing values. Furthermore, a tree-structured group lasso penalty is designed to identify the common and specific hub genes in different gene networks. Simulation studies show that our proposed method outperforms the compared methods in all cases. We also apply NJGCG to infer the gene networks for different stages of differentiation in mouse embryonic stem cells and different subtypes of breast cancer, and explore how the networks change across these stages and subtypes. The common and specific hub genes in the estimated gene networks are closely related to stem cell differentiation processes and heterogeneity within breast cancers.
Affiliation(s)
- Yun Huang
  - Department of Geriatrics, The First Affiliated Hospital of Fujian Medical University, Fuzhou 350005, China
  - Clinical Research Center for Geriatric Hypertension Disease of Fujian province, The First Affiliated Hospital of Fujian Medical University, Fuzhou 350005, China
- Sen Huang
  - Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Xiao-Fei Zhang
  - School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
- Le Ou-Yang
  - Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Chen Liu
  - Department of Oncology, Molecular Oncology Research Institute, The First Affiliated Hospital of Fujian Medical University, Fuzhou 350005, China
  - Department of Oncology, National Regional Medical Center, Binhai Campus of The First Affiliated Hospital, Fujian Medical University, Fuzhou 350212, China
  - Fujian Key Laboratory of Precision Medicine for Cancer, The First Affiliated Hospital of Fujian Medical University, Fuzhou 350005, China
3
Sun M, Hu H, Pang W, Zhou Y. ACP-BC: A Model for Accurate Identification of Anticancer Peptides Based on Fusion Features of Bidirectional Long Short-Term Memory and Chemically Derived Information. Int J Mol Sci 2023; 24:15447. [PMID: 37895128 PMCID: PMC10607064 DOI: 10.3390/ijms242015447] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/12/2023] [Revised: 09/10/2023] [Accepted: 10/20/2023] [Indexed: 10/29/2023] Open
Abstract
Anticancer peptides (ACPs) have been proven to possess potent anticancer activities. Although computational methods have emerged for rapid ACP identification, their accuracy still needs improvement. In this study, we propose a model called ACP-BC, a three-channel end-to-end model that utilizes various combinations of data augmentation techniques. In the first channel, features are extracted from the raw sequence using a bidirectional long short-term memory network. In the second channel, the entire sequence is converted into a chemical molecular formula, which is further simplified using Simplified Molecular Input Line Entry System notation to obtain deep abstract features through bidirectional encoder representations from transformers (BERT). In the third channel, we manually selected four effective features: dipeptide composition, binary profile feature, k-mer sparse matrix, and pseudo amino acid composition. Notably, the application of chemical BERT in predicting ACPs is novel and successfully integrated into our model. To validate the performance of our model, we selected two benchmark datasets, ACPs740 and ACPs240. ACP-BC achieved prediction accuracies of 87% and 90% on these two datasets, respectively, representing improvements of 1.3% and 7% over existing state-of-the-art methods on these datasets. Systematic comparative experiments thus show that ACP-BC can effectively identify anticancer peptides.
Affiliation(s)
- Mingwei Sun
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Haoyuan Hu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Wei Pang
  - School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK
- You Zhou
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
  - College of Software, Jilin University, Changchun 130012, China
4
Wang C, Xu S, Sun D, Liu ZP. ActivePPI: quantifying protein-protein interaction network activity with Markov random fields. Bioinformatics 2023; 39:btad567. [PMID: 37698984 PMCID: PMC10516639 DOI: 10.1093/bioinformatics/btad567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2023] [Revised: 08/11/2023] [Accepted: 09/11/2023] [Indexed: 09/14/2023] Open
Abstract
MOTIVATION: Protein-protein interactions (PPI) are crucial components of the biomolecular networks that enable cells to function. Biological experiments have identified a large number of PPI, and these interactions are stored in knowledge bases. However, these interactions are often restricted to specific cellular environments and conditions. Network activity can be characterized as the extent of agreement between a PPI network (PPIN) and a distinct cellular environment measured by protein mass spectrometry, and it can also be quantified as a statistical significance score. Without knowing the activity of these PPI in the cellular environments or specific phenotypes, it is impossible to reveal how these PPI perform and affect cellular functioning. RESULTS: To calculate the activity of PPIN in different cellular conditions, we proposed a PPIN activity evaluation framework named ActivePPI to measure the consistency between network architecture and protein measurement data. ActivePPI estimates the probability density of protein mass spectrometry abundance and models PPIN using a Markov-random-field-based method. Furthermore, an empirical P-value is derived from a nonparametric permutation test to quantify the significance of the match between PPIN structure and protein abundance data. Extensive numerical experiments demonstrate the superior performance of ActivePPI in network activity evaluation, pathway activity assessment, and optimal network architecture tuning tasks. In summary, ActivePPI is a versatile tool for evaluating PPI networks that can uncover the functional significance of protein interactions in crucial cellular biological processes and offer further insights into physiological phenomena. AVAILABILITY AND IMPLEMENTATION: All source code and data are freely available at https://github.com/zpliulab/ActivePPI.
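The empirical P-value from a permutation test mentioned in this abstract can be sketched generically as below. This is not ActivePPI's implementation: the scoring function `edge_agreement` is a toy stand-in for the Markov-random-field likelihood, and the function names, the toy network, and the abundance values are all hypothetical.

```python
import random

def edge_agreement(edges, abundance):
    """Toy stand-in score: sum over edges of the product of endpoint abundances."""
    return sum(abundance[u] * abundance[v] for u, v in edges)

def empirical_p_value(edges, abundance, n_perm=1000, seed=0):
    """Fraction of label permutations scoring at least as high as the observed data."""
    rng = random.Random(seed)
    observed = edge_agreement(edges, abundance)
    proteins = list(abundance)
    values = [abundance[p] for p in proteins]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # break the link between proteins and abundances
        permuted = dict(zip(proteins, values))
        if edge_agreement(edges, permuted) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical example: a triangle of highly abundant proteins among six measured ones.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
abundance = {"A": 5.0, "B": 4.0, "C": 6.0, "D": 0.1, "E": 0.2, "F": 0.1}
p = empirical_p_value(edges, abundance)
```

A small `p` here means that few random reassignments of abundances match the network as well as the observed data does.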
Affiliation(s)
- Chuanyuan Wang
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
- Shiyu Xu
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
- Duanchen Sun
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Zhi-Ping Liu
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
5
Fan Y, Xiong H, Sun G. DeepASDPred: a CNN-LSTM-based deep learning method for Autism spectrum disorders risk RNA identification. BMC Bioinformatics 2023; 24:261. [PMID: 37349705 DOI: 10.1186/s12859-023-05378-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/05/2023] [Accepted: 06/06/2023] [Indexed: 06/24/2023] Open
Abstract
BACKGROUND: Autism spectrum disorders (ASD) are a group of neurodevelopmental disorders characterized by difficulty communicating with society and others, behavioral difficulties, and a brain that processes information differently than normal. Genetics has a strong impact on ASD associated with early onset and distinctive signs. Currently, all known ASD risk genes are able to encode proteins, and some de novo mutations disrupting protein-coding genes have been demonstrated to cause ASD. Next-generation sequencing technology enables high-throughput identification of ASD risk RNAs. However, these efforts are time-consuming and expensive, so an efficient computational model for ASD risk gene prediction is necessary. RESULTS: In this study, we propose DeepASDPred, a predictor for ASD risk RNA based on deep learning. First, we encode the RNA transcript sequences with k-mer features and fuse them with the corresponding gene expression values to construct a feature matrix. After combining the chi-square test and logistic regression to select the best feature subset, we feed it into a binary classification model built from a convolutional neural network and long short-term memory for training and classification. The results of tenfold cross-validation show that our method outperforms the state-of-the-art methods. The dataset and source code are freely available at https://github.com/Onebear-X/DeepASDPred. CONCLUSIONS: Our experimental results show that DeepASDPred has outstanding performance in identifying ASD risk RNA genes.
Affiliation(s)
- Yongxian Fan
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Hui Xiong
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Guicong Sun
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
6
Jha K, Karmakar S, Saha S. Graph-BERT and language model-based framework for protein-protein interaction identification. Sci Rep 2023; 13:5663. [PMID: 37024543 PMCID: PMC10079975 DOI: 10.1038/s41598-023-31612-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 11/28/2022] [Accepted: 03/14/2023] [Indexed: 04/08/2023] Open
Abstract
Identification of protein-protein interactions (PPI) is among the critical problems in the domain of bioinformatics. With advances in artificial intelligence (AI) techniques, previous studies have utilized different AI-based models for PPI classification. The input to these models is the features extracted from different sources of protein information, mainly sequence-derived features. In this work, we present an AI-based PPI identification model utilizing a PPI network and protein sequences. The PPI network is represented as a graph where each node is a protein pair, and an edge is defined between two nodes if they share a common protein. Each node in the graph has a feature vector. We use a language model to extract feature vectors directly from protein sequences. The feature vectors of the two proteins in a pair are concatenated and used as the node feature vector in the PPI network graph. Finally, we use the Graph-BERT model to encode the PPI network graph with sequence-based features and learn the hidden representation of the feature vector for each node. The learned node representations are then fed to a fully connected layer, the output of which is fed into a softmax layer to classify the protein interactions. To assess the efficacy of the proposed PPI model, we have performed experiments on several PPI datasets. The experimental results demonstrate that the proposed approach surpasses the existing PPI works and designed baselines in classifying PPI.
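The graph construction described in this abstract (one node per protein pair, an edge whenever two pairs share a protein, node features formed by concatenating the two protein vectors) can be sketched as follows. The function names and the two-dimensional placeholder embeddings are hypothetical; in the paper the vectors come from a protein language model, which is not reproduced here.

```python
from itertools import combinations

def build_pair_graph(pairs):
    """Nodes are protein pairs; two nodes are linked if they share a protein."""
    nodes = [tuple(p) for p in pairs]
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        if set(nodes[i]) & set(nodes[j]):
            edges.add((i, j))
    return nodes, edges

def node_features(nodes, embeddings):
    """Concatenate the per-protein vectors of each pair into one node vector."""
    return [embeddings[a] + embeddings[b] for a, b in nodes]  # list concatenation

# Hypothetical protein pairs and placeholder per-protein vectors.
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5"), ("P1", "P5")]
embeddings = {f"P{i}": [float(i), float(i) / 10] for i in range(1, 6)}

nodes, edges = build_pair_graph(pairs)
features = node_features(nodes, embeddings)
# sorted(edges) == [(0, 1), (0, 3), (2, 3)]
```

The pairwise loop is quadratic in the number of pairs, which is fine for a sketch; a larger dataset would more likely index pairs by protein first.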
Affiliation(s)
- Kanchan Jha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India
- Sourav Karmakar
  - Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, West Bengal, 713209, India
- Sriparna Saha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India
7
Ren ZH, You ZH, Zou Q, Yu CQ, Ma YF, Guan YJ, You HR, Wang XF, Pan J. DeepMPF: deep learning framework for predicting drug-target interactions based on multi-modal representation with meta-path semantic analysis. J Transl Med 2023; 21:48. [PMID: 36698208 PMCID: PMC9876420 DOI: 10.1186/s12967-023-03876-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/16/2022] [Accepted: 01/05/2023] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND: Drug-target interaction (DTI) prediction has become a crucial prerequisite in drug design and drug discovery. However, traditional biological experiments are time-consuming and expensive, as there are abundant complex interactions in the large genomic and chemical spaces. To alleviate this, many computational methods have been developed to complement biological experiments and narrow the search space to a preferred candidate domain. However, most previous approaches cannot fully capture the association-behavior semantic information, based on several schemas, needed to represent the complex structure of heterogeneous biological networks. Additionally, DTI prediction based on a single modality cannot satisfy the demand for prediction accuracy. METHODS: We propose a multi-modal representation framework, 'DeepMPF', based on meta-path semantic analysis, which effectively utilizes heterogeneous information to predict DTI. Specifically, we first construct protein-drug-disease heterogeneous networks composed of three entities. Then the feature information is obtained under three views: sequence modality, heterogeneous structure modality and similarity modality. We propose six representative meta-path schemas to preserve the high-order nonlinear structure and capture hidden structural information of the heterogeneous network. Finally, DeepMPF generates highly representative comprehensive feature descriptors and calculates the probability of interaction through joint learning. RESULTS: To evaluate the predictive performance of DeepMPF, comparison experiments are conducted on four gold datasets. Our method obtains competitive performance on all datasets. We also explore the influence of the different feature embedding dimensions, learning strategies and classification methods. Notably, the drug repositioning experiments on COVID-19 and HIV demonstrate that DeepMPF can be applied to real-world problems and help drug discovery. Further analysis with molecular docking experiments enhances the credibility of the drug candidates predicted by DeepMPF. CONCLUSIONS: All the results demonstrate the effective predictive capability of DeepMPF for drug-target interactions. It can be utilized as a useful tool to prescreen the most promising drug candidates for a protein. The web server of the DeepMPF predictor is freely available at http://120.77.11.78/DeepMPF/, which can help relevant researchers in further study.
Affiliation(s)
- Zhong-Hao Ren
  - School of Information Engineering, Xijing University, Xi’an, 710100 China (GRID: grid.460132.2; ISNI: 0000 0004 1758 0275)
- Zhu-Hong You
  - School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China (GRID: grid.440588.5; ISNI: 0000 0001 0307 1240)
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054 China (GRID: grid.54549.39; ISNI: 0000 0004 0369 4060)
- Chang-Qing Yu
  - School of Information Engineering, Xijing University, Xi’an, 710100 China (GRID: grid.460132.2; ISNI: 0000 0004 1758 0275)
- Yan-Fang Ma
  - Department of Galactophore, The Third People’s Hospital of Gansu Province, Lanzhou, 730020 China (GRID: grid.417234.7; ISNI: 0000 0004 1808 3203)
- Yong-Jian Guan
  - School of Information Engineering, Xijing University, Xi’an, 710100 China (GRID: grid.460132.2; ISNI: 0000 0004 1758 0275)
- Hai-Ru You
  - School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China (GRID: grid.440588.5; ISNI: 0000 0001 0307 1240)
- Xin-Fei Wang
  - School of Information Engineering, Xijing University, Xi’an, 710100 China (GRID: grid.460132.2; ISNI: 0000 0004 1758 0275)
- Jie Pan
  - School of Information Engineering, Xijing University, Xi’an, 710100 China (GRID: grid.460132.2; ISNI: 0000 0004 1758 0275)
8
Vora DS, Kalakoti Y, Sundar D. Computational Methods and Deep Learning for Elucidating Protein Interaction Networks. Methods Mol Biol 2023; 2553:285-323. [PMID: 36227550 DOI: 10.1007/978-1-0716-2617-7_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Protein interactions play a critical role in all biological processes, but experimental identification of protein interactions is a time- and resource-intensive process. The advances in next-generation sequencing and multi-omics technologies have greatly benefited large-scale predictions of protein interactions using machine learning methods. A wide range of tools have been developed to predict protein-protein, protein-nucleic acid, and protein-drug interactions. Here, we discuss the applications, methods, and challenges faced when employing the various prediction methods. We also briefly describe ways to overcome the challenges and prospective future developments in the field of protein interaction biology.
Affiliation(s)
- Dhvani Sandip Vora
  - Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
- Yogesh Kalakoti
  - Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
- Durai Sundar
  - Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
  - School of Artificial Intelligence, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
9
MFIDMA: A Multiple Information Integration Model for the Prediction of Drug-miRNA Associations. Biology 2022; 12:biology12010041. [PMID: 36671734 PMCID: PMC9855084 DOI: 10.3390/biology12010041] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/20/2022] [Revised: 12/19/2022] [Accepted: 12/22/2022] [Indexed: 12/28/2022]
Abstract
Abnormal microRNA (miRNA) functions play significant roles in various pathological processes. Thus, predicting drug-miRNA associations (DMA) may hold great promise for identifying the potential targets of drugs. However, discovering the associations between drugs and miRNAs through wet experiments is time-consuming and laborious. Therefore, it is important to develop computational prediction methods to improve the efficiency of identifying DMA on a large scale. In this paper, a multiple features integration model (MFIDMA) is proposed to predict drug-miRNA associations. Specifically, we first formulated known DMA as a bipartite graph and utilized structural deep network embedding (SDNE) to learn the topological features from the graph. Second, the Word2vec algorithm was utilized to construct the attribute features of the miRNAs and drugs. Third, the two kinds of features were fed into a convolutional neural network (CNN) and a deep neural network (DNN) to integrate features and predict potential target miRNAs for the drugs. The MFIDMA model was evaluated on three different datasets under five-fold cross-validation and achieved average AUCs of 0.9407, 0.9444 and 0.8919. In addition, the MFIDMA model showed reliable results in the case studies of Verapamil and hsa-let-7c-5p, confirming that the proposed model can also predict DMA in real-world situations. The model was effective in analyzing the neighbors and topological features of the drug-miRNA network by SDNE. The experimental results indicated that MFIDMA is an accurate and robust model for predicting potential DMA, which is valuable for miRNA therapeutics research and drug discovery.
10
Wang J, Lu S, Wang SH, Zhang YD. A review on extreme learning machine. Multimedia Tools and Applications 2022; 81:41611-41660. [DOI: 10.1007/s11042-021-11007-7] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 08/21/2020] [Revised: 02/26/2021] [Accepted: 05/05/2021] [Indexed: 08/30/2023]
Abstract
Extreme learning machine (ELM) is a training algorithm for single hidden layer feedforward neural networks (SLFN), which converges much faster than traditional methods and yields promising performance. In this paper, we present a comprehensive review of ELM. First, we focus on the theoretical analysis, including universal approximation theory and generalization. Then, the various improvements are listed, which help ELM work better in terms of stability, efficiency, and accuracy. Because of its outstanding performance, ELM has been successfully applied in many real-time learning tasks for classification, clustering, and regression. In addition, we report the applications of ELM in medical imaging: MRI, CT, and mammogram. The controversies around ELM are also discussed. We aim to report these advances and identify some future perspectives.
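For readers unfamiliar with ELM, the basic scheme can be sketched in a few lines: the hidden-layer weights are drawn at random and left untrained, and only the output weights are solved in closed form by least squares. The sketch below uses NumPy; the function names, the `tanh` activation, the hidden-layer size, and the XOR toy data are choices made for this illustration and are not taken from the review.

```python
import numpy as np

def elm_fit(X, y, n_hidden=20, seed=0):
    """Train a single-hidden-layer network the ELM way.

    Hidden weights W and biases b are random and fixed; the output weights
    beta are the least-squares solution obtained via the pseudoinverse.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)          # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y    # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Apply the fixed random hidden layer, then the learned output weights."""
    return np.tanh(X @ W + b) @ beta

# Toy example: XOR, which a purely linear model cannot fit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
W, b, beta = elm_fit(X, y)
pred = elm_predict(X, W, b, beta)
```

Because no iterative gradient descent is involved, training reduces to one matrix pseudoinverse, which is the source of the speed advantage the abstract refers to.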
11
Graph Neural Network for Protein-Protein Interaction Prediction: A Comparative Study. Molecules (Basel, Switzerland) 2022; 27:molecules27186135. [PMID: 36144868 PMCID: PMC9501426 DOI: 10.3390/molecules27186135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2022] [Revised: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/17/2022]
Abstract
Proteins are the fundamental biological macromolecules which underlie practically all biological activities. Protein-protein interactions (PPIs), as they are known, are how proteins interact with other proteins in their environment to perform biological functions. Understanding PPIs reveals how cells behave and operate, such as the antigen recognition and signal transduction in the immune system. In the past decades, many computational methods have been developed to predict PPIs automatically, requiring less time and fewer resources than experimental techniques. In this paper, we present a comparative study of various graph neural networks for protein-protein interaction prediction. Five network models are analyzed and compared, including neural networks (NN), graph convolutional neural networks (GCN), graph attention networks (GAT), hyperbolic neural networks (HNN), and hyperbolic graph convolutions (HGCN). By utilizing the protein sequence information, all of these models can predict the interaction between proteins. Fourteen PPI datasets are extracted and utilized to compare the prediction performance of all these methods. The experimental results show that hyperbolic graph neural networks tend to have a better performance than the other methods on the protein-related datasets.
12
Su XR, Hu L, You ZH, Hu PW, Zhao BW. Multi-view heterogeneous molecular network representation learning for protein-protein interaction prediction. BMC Bioinformatics 2022; 23:234. [PMID: 35710342 PMCID: PMC9205098 DOI: 10.1186/s12859-022-04766-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/10/2022] [Accepted: 05/27/2022] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND: Protein-protein interaction (PPI) plays an important role in regulating cells and signals. Despite the ongoing efforts of the bioassay group, persistently incomplete data limits our ability to understand the molecular roots of human disease. Therefore, it is urgent to develop a computational method to predict PPIs from the perspective of the molecular system. METHODS: In this paper, a highly efficient computational model, MTV-PPI, is proposed for PPI prediction based on a heterogeneous molecular network by learning inter-view protein sequences and intra-view interactions between molecules simultaneously. On the one hand, the inter-view feature is extracted from the protein sequence by the k-mer method. On the other hand, we use a popular embedding method, LINE, to encode the heterogeneous molecular network and obtain the intra-view feature. Thus, the protein representation used in MTV-PPI is constructed by aggregating its inter-view and intra-view features. Finally, a random forest is used to predict potential PPIs. RESULTS: To prove the effectiveness of MTV-PPI, we conduct extensive experiments on a collected heterogeneous molecular network, obtaining an accuracy of 86.55%, sensitivity of 82.49%, precision of 89.79%, AUC of 0.9301 and AUPR of 0.9308. Further comparison experiments are performed with various protein representations and classifiers to indicate the effectiveness of MTV-PPI in predicting PPIs based on a complex network. CONCLUSION: The experimental results illustrate that MTV-PPI is a promising tool for PPI prediction, which may provide a new perspective for future interaction prediction research based on heterogeneous molecular networks.
Affiliation(s)
- Xiao-Rui Su
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011 China
  - University of Chinese Academy of Sciences, Beijing, 100049 China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, 830011 China
- Lun Hu
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011 China
  - University of Chinese Academy of Sciences, Beijing, 100049 China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, 830011 China
- Zhu-Hong You
  - School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129 China
- Peng-Wei Hu
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011 China
  - University of Chinese Academy of Sciences, Beijing, 100049 China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, 830011 China
- Bo-Wei Zhao
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011 China
  - University of Chinese Academy of Sciences, Beijing, 100049 China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, 830011 China
13
Jha K, Saha S, Singh H. Prediction of protein-protein interaction using graph neural networks. Sci Rep 2022; 12:8360. [PMID: 35589837 PMCID: PMC9120162 DOI: 10.1038/s41598-022-12201-9] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 02/23/2022] [Accepted: 04/18/2022] [Indexed: 01/09/2023] Open
Abstract
Proteins are the essential biological macromolecules required to perform nearly all biological processes, and cellular functions. Proteins rarely carry out their tasks in isolation but interact with other proteins (known as protein-protein interaction) present in their surroundings to complete biological activities. The knowledge of protein-protein interactions (PPIs) unravels the cellular behavior and its functionality. The computational methods automate the prediction of PPI and are less expensive than experimental methods in terms of resources and time. So far, most of the works on PPI have mainly focused on sequence information. Here, we use graph convolutional network (GCN) and graph attention network (GAT) to predict the interaction between proteins by utilizing protein's structural information and sequence features. We build the graphs of proteins from their PDB files, which contain 3D coordinates of atoms. The protein graph represents the amino acid network, also known as residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within the threshold distance. To extract the node/residue features, we use the protein language model. The input to the language model is the protein sequence, and the output is the feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. Obtained results demonstrate the effectiveness of the proposed approach as it outperforms the previous leading methods. The source code for training and data to train the model are available at https://github.com/JhaKanchan15/PPI_GNN.git .
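The residue contact network described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `residue_contact_graph`, the input layout (one list of atom coordinates per residue) and the 6.0 Å threshold are assumptions, since the abstract does not state the threshold value or the PDB parsing step.

```python
from itertools import combinations
from math import dist


def residue_contact_graph(residues, threshold=6.0):
    """Build an undirected residue contact graph.

    residues  -- list where item i is a list of (x, y, z) atom coordinates
                 for residue i (hypothetical layout; PDB parsing is omitted)
    threshold -- distance cutoff in angstroms (assumed value)

    Returns a set of edges (i, j) with i < j. Two residues are connected if
    at least one pair of atoms, one from each residue, lies within threshold.
    """
    edges = set()
    for i, j in combinations(range(len(residues)), 2):
        if any(dist(a, b) <= threshold
               for a in residues[i] for b in residues[j]):
            edges.add((i, j))
    return edges


# Toy example: residues 0 and 1 are close, residue 2 is far from both.
toy = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(4.0, 0.0, 0.0)],
    [(20.0, 0.0, 0.0)],
]
edges = residue_contact_graph(toy, threshold=6.0)
```

In the approach the abstract describes, each node would additionally carry a per-residue feature vector from a protein language model; that step is not shown here.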
Affiliation(s)
- Kanchan Jha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India.
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, Bihar, 801103, India
| | - Hiteshi Singh
- Department of Electrical Engineering, Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan, 342030, India
|
14
|
Li LP, Zhang B, Cheng L. CPIELA: Computational Prediction of Plant Protein–Protein Interactions by Ensemble Learning Approach From Protein Sequences and Evolutionary Information. Front Genet 2022; 13:857839. [PMID: 35360876 PMCID: PMC8963800 DOI: 10.3389/fgene.2022.857839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2022] [Accepted: 02/10/2022] [Indexed: 11/22/2022]
Abstract
Identification and characterization of plant protein–protein interactions (PPIs) are critical in elucidating the functions of proteins and molecular mechanisms in a plant cell. Although experimentally validated plant PPI data have become increasingly available for diverse plant species, high-throughput techniques are usually expensive and labor-intensive. With valuable plant PPI data accumulating in public databases, it is increasingly important to propose computational approaches that facilitate the identification of possible PPIs. In this article, we propose an effective framework for predicting plant PPIs by combining the position-specific scoring matrix (PSSM), local optimal-oriented pattern (LOOP), and ensemble rotation forest (ROF) model. Specifically, the plant protein sequence is first transformed into a PSSM, which preserves the protein's evolutionary information. Then, the local textural descriptor LOOP is employed to extract texture variation features from the PSSM. Finally, the ROF classifier is adopted to infer potential plant PPIs. The performance of CPIELA is evaluated via cross-validation on three plant PPI datasets: Arabidopsis thaliana, Zea mays, and Oryza sativa. The experimental results demonstrate that CPIELA achieved high average prediction accuracies of 98.63%, 98.09%, and 94.02%, respectively. To further verify the performance of CPIELA, we also compared it with other state-of-the-art methods on three gold standard datasets. The experimental results illustrate that CPIELA is efficient and reliable for predicting plant PPIs. It is anticipated that the CPIELA approach could become a useful tool for facilitating the identification of possible plant PPIs.
Affiliation(s)
- Li-Ping Li
- College of Grassland and Environment Sciences, Xinjiang Agricultural University, Urumqi, China
- Xinjiang Key Laboratory of Grassland Resources and Ecology, Urumqi, China
- *Correspondence: Li-Ping Li; Bo Zhang
| | - Bo Zhang
- College of Grassland and Environment Sciences, Xinjiang Agricultural University, Urumqi, China
- Xinjiang Key Laboratory of Grassland Resources and Ecology, Urumqi, China
- *Correspondence: Li-Ping Li; Bo Zhang
| | - Li Cheng
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi, China
|
15
|
Ren ZH, Yu CQ, Li LP, You ZH, Guan YJ, Li YC, Pan J. SAWRPI: A Stacking Ensemble Framework With Adaptive Weight for Predicting ncRNA-Protein Interactions Using Sequence Information. Front Genet 2022; 13:839540. [PMID: 35360836 PMCID: PMC8963817 DOI: 10.3389/fgene.2022.839540] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/20/2021] [Accepted: 02/07/2022] [Indexed: 11/13/2022]
Abstract
Non-coding RNAs (ncRNAs) play essential roles in biological processes such as gene regulation. One critical way in which ncRNAs carry out biological functions is through interactions with RNA-binding proteins (RBPs). Identifying the proteins involved in ncRNA-protein interactions helps clarify the function of ncRNAs. Many high-throughput experiments have been applied to recognize these interactions. Because such approaches are time- and labor-consuming, a great number of computational methods have been developed to advance research on ncRNA-protein interactions. However, these methods may not be applicable to all RNAs and proteins, particularly new ones. Additionally, most of them cannot handle long sequences well. In this work, a computational method, SAWRPI, is proposed to predict ncRNA-protein interactions from sequence information. More specifically, the raw features of proteins and ncRNAs are first extracted through the k-mer sparse matrix with SVD reduction and through learning nucleic acid symbols by natural language processing with a local fusion strategy, respectively. Then, to make classification easier, the Hilbert transformation is used to map the raw feature data into a new feature space. Finally, a stacking ensemble strategy is adopted to learn high-level abstract features automatically and generate the final prediction results. To confirm robustness and stability, three different datasets containing two kinds of interactions are used. In comparison with state-of-the-art methods and with other classification or feature extraction strategies, SAWRPI achieved high performance on the three datasets, which contain two kinds of lncRNA-protein interactions. Based on our findings, SAWRPI is a trustworthy, robust, yet simple method that can serve as a beneficial supplement for the task of predicting ncRNA-protein interactions.
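As a rough illustration of the k-mer sparse matrix named in this abstract, the sketch below assumes the common construction in which each row corresponds to one possible k-mer, each column to one sequence position, and an entry is 1 when that k-mer starts at that position. The function name, the 20-letter amino acid alphabet and the toy sequence are assumptions, and the SVD reduction step is omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter protein alphabet


def kmer_sparse_matrix(seq, k=2, alphabet=AMINO_ACIDS):
    """Return (kmers, matrix) for a sequence.

    matrix[i][j] is 1 if the k-mer kmers[i] starts at position j of seq,
    else 0. The matrix has len(alphabet)**k rows and len(seq) - k + 1 columns.
    Windows containing symbols outside the alphabet leave their column empty.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    row_of = {kmer: i for i, kmer in enumerate(kmers)}
    n_cols = max(len(seq) - k + 1, 0)
    matrix = [[0] * n_cols for _ in kmers]
    for j in range(n_cols):
        window = seq[j:j + k]
        if window in row_of:
            matrix[row_of[window]][j] = 1
    return kmers, matrix


# Toy protein fragment (hypothetical): windows are MK, KV, VL, LM, MK.
kmers, matrix = kmer_sparse_matrix("MKVLMK", k=2)
mk_row = matrix[kmers.index("MK")]
```

Each column holds exactly one nonzero entry, so the matrix keeps both the composition and the order of the k-mers; a dimensionality reduction such as SVD would then be applied to obtain a fixed-length feature vector.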
Affiliation(s)
- Zhong-Hao Ren
- School of Information Engineering, Xijing University, Xi’an, China
| | - Chang-Qing Yu
- School of Information Engineering, Xijing University, Xi’an, China
- *Correspondence: Li-Ping Li; Chang-Qing Yu
| | - Li-Ping Li
- School of Information Engineering, Xijing University, Xi’an, China
- *Correspondence: Li-Ping Li; Chang-Qing Yu
| | - Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi’an, China
| | - Yong-Jian Guan
- School of Information Engineering, Xijing University, Xi’an, China
| | - Yue-Chao Li
- School of Information Engineering, Xijing University, Xi’an, China
| | - Jie Pan
- School of Information Engineering, Xijing University, Xi’an, China
|
16
|
Ma M, Yang J, Liu R. A novel structure automatic-determined Fourier extreme learning machine for generalized Black–Scholes partial differential equation. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
17
|
Tawfeek MA, Alanazi S, Abd El-Aziz AA. Artificial Fish Swarm for Multi Protein Sequences Alignment in Bioinformatics. COMPUTERS, MATERIALS & CONTINUA 2022; 72:6091-6106. [DOI: 10.32604/cmc.2022.028391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Accepted: 03/18/2022] [Indexed: 09/01/2023]
|
18
|
Yi HC, You ZH, Guo ZH, Huang DS, Chan KCC. Learning Representation of Molecules in Association Network for Predicting Intermolecular Associations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2546-2554. [PMID: 32070992 DOI: 10.1109/tcbb.2020.2973091] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/10/2023]
Abstract
A key aim of post-genomic biomedical research is to systematically understand molecules and their interactions in human cells. Multiple biomolecules coordinate to sustain life activities, and interactions between various biomolecules are interconnected. However, existing studies usually focus only on associations between two or a very limited number of molecule types. In this study, we propose a computational framework based on network representation learning, MAN-SDNE, to predict any intermolecular associations. More specifically, we constructed a large-scale molecular association network of multiple human biomolecules by integrating associations among long non-coding RNA, microRNA, protein, drug, and disease, containing 6,528 molecular nodes and 105,546 associations of 9 kinds. The feature of each node is then represented by its network proximity and attribute features. Furthermore, these features are used to train a Random Forest classifier to predict intermolecular associations. MAN-SDNE achieves remarkable performance, with an AUC of 0.9552 and an AUPR of 0.9338 under five-fold cross-validation. To demonstrate its ability to predict specific types of interactions, a case study on predicting lncRNA-protein interactions using MAN-SDNE is also carried out. Experimental results demonstrate that this work offers systematic insight for understanding the synergistic associations between molecules and complex diseases and provides a network-based computational tool to systematically explore intermolecular interactions.
|
19
|
Zhang S, Shi H. iR5hmcSC: Identifying RNA 5-hydroxymethylcytosine with multiple features based on stacking learning. Comput Biol Chem 2021; 95:107583. [PMID: 34562726 DOI: 10.1016/j.compbiolchem.2021.107583] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/24/2021] [Revised: 09/02/2021] [Accepted: 09/12/2021] [Indexed: 01/27/2023]
Abstract
RNA 5-hydroxymethylcytosine (5hmC) modification is the basis of the translation of genetic information and the biological evolution. The study of its distribution in transcriptome is fundamentally crucial to reveal the biological significance of 5hmC. Biochemical experiments can use a variety of sequencing-based technologies to achieve high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, as well as expensive. Therefore, it is urgent to develop more effective and feasible computational methods. In this paper, a novel and powerful model called iR5hmcSC is designed for identifying 5hmC. Firstly, we extract the different features by K-mer, Pseudo Structure Status Composition and One-Hot encoding. Subsequently, the combination of chi-square test and logistic regression is utilized as the feature selection method to select the optimal feature sets. And then stacking learning, an ensemble learning method including random forest (RF), extra trees (EX), AdaBoost (Ada), gradient boosting decision tree (GBDT), and support vector machine (SVM), is used to recognize 5hmC and non-5hmC. Finally, 10-fold cross-validation test is performed to evaluate the model. The accuracy reaches 85.27% and 79.92% on benchmark dataset and independent dataset, respectively. The result is better than the state-of-the-art methods, which indicates that our model is a feasible tool to identify 5hmC. The datasets and source code are freely available at https://github.com/HongyanShi026/iR5hmcSC.
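Two of the sequence encodings named in this abstract, K-mer and One-Hot, can be sketched in a few lines. This is a generic illustration under assumed conventions (overlapping k-mers, normalized counts, the alphabet order A, C, G, U); it is not taken from the iR5hmcSC source code, and the Pseudo Structure Status Composition feature is not shown.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Normalized frequency of every possible k-mer, in lexicographic
    order of the given alphabet (overlapping windows, stride 1)."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


def one_hot(seq, alphabet="ACGU"):
    """One row per nucleotide; a 1 marks the matching alphabet position."""
    return [[1 if base == a else 0 for a in alphabet] for base in seq]


# Toy RNA fragment (hypothetical): 2-mers are AC, CG, GU, UA, AC.
freqs = kmer_frequencies("ACGUAC", k=2)
encoded = one_hot("AC")
```

The k-mer vector has a fixed length of 4^k regardless of sequence length, whereas the one-hot encoding grows with the sequence, which is one reason the two are often combined before feature selection.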
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China.
| | - Hongyan Shi
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China
|
20
|
Zhang Q, Yu W, Han K, Nandi AK, Huang DS. Multi-Scale Capsule Network for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1793-1800. [PMID: 32960766 DOI: 10.1109/tcbb.2020.3025579] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
Discovering DNA-protein binding sites, also known as motif discovery, is the foundation for further analysis of transcription factors (TFs). Deep learning algorithms such as convolutional neural networks (CNNs) have been introduced to the motif discovery task and have achieved state-of-the-art performance. However, due to the limitations of CNNs, CNN-based motif discovery methods do not take full advantage of the large-scale sequencing data generated by high-throughput sequencing technology. Hence, in this paper we propose a multi-scale capsule network architecture (MSC) that integrates a multi-scale CNN, a CNN variant able to extract motif features of different lengths, with a capsule network, a novel type of artificial neural network architecture aimed at improving on CNNs. The proposed method is tested on real ChIP-seq datasets, and the experimental results show a considerable improvement over two well-tested deep learning-based sequence models, DeepBind and DeepSEA.
|
21
|
Yuan X, Xu X, Zhao H, Duan J. ERINS: Novel Sequence Insertion Detection by Constructing an Extended Reference. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1893-1901. [PMID: 31751246 DOI: 10.1109/tcbb.2019.2954315] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
Next generation sequencing technology has led to the development of methods for the detection of novel sequence insertions (nsINS). Multiple signatures from short reads are usually extracted to improve nsINS detection performance. However, characterization of nsINSs larger than the mean insert size is still challenging. This article presents a new method, ERINS, to detect nsINS contents and genotypes across the full size spectrum. It integrates the features of structural variations and the mapping states of split reads to find nsINS breakpoints, and then adopts a left-most mapping strategy to infer nsINS content by iteratively extending the standard reference at each breakpoint. Finally, it realigns all reads to the extended reference and infers nsINS genotypes through statistical testing on read counts. We test and validate the performance of ERINS on simulated and real sequencing datasets. The simulation results demonstrate that it outperforms several peer methods with respect to sensitivity and precision. The real data application indicates that ERINS obtains results highly consistent with those previously reported and detects nsINSs over 200 base pairs that many other methods fail to detect. In conclusion, ERINS can be used as a supplement to existing tools and will become a routine approach for characterizing nsINSs.
|
22
|
Chen XG, Zhang W, Yang X, Li C, Chen H. ACP-DA: Improving the Prediction of Anticancer Peptides Using Data Augmentation. Front Genet 2021; 12:698477. [PMID: 34276801 PMCID: PMC8279753 DOI: 10.3389/fgene.2021.698477] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 04/21/2021] [Accepted: 06/07/2021] [Indexed: 12/09/2022]
Abstract
Anticancer peptides (ACPs) have provided a promising perspective for cancer treatment, and the prediction of ACPs is very important for the discovery of new cancer treatment drugs. It is time consuming and expensive to use experimental methods to identify ACPs, so computational methods for ACP identification are urgently needed. There have been many effective computational methods, especially machine learning-based methods, proposed for such predictions. Most of the current machine learning methods try to find suitable features or design effective feature learning techniques to accurately represent ACPs. However, the performance of these methods can be further improved for cases with insufficient numbers of samples. In this article, we propose an ACP prediction model called ACP-DA (Data Augmentation), which uses data augmentation for insufficient samples to improve the prediction performance. In our method, to better exploit the information of peptide sequences, peptide sequences are represented by integrating binary profile features and AAindex features, and then the samples in the training set are augmented in the feature space. After data augmentation, the samples are used to train the machine learning model, which is used to predict ACPs. The performance of ACP-DA exceeds that of existing methods, and ACP-DA achieves better performance in the prediction of ACPs compared with a method without data augmentation. The proposed method is available at http://github.com/chenxgscuec/ACPDA.
Affiliation(s)
- Xian-Gan Chen
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China
- Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China
- Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China
| | - Xiaofei Yang
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China
- Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China
- Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Chenhong Li
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China
- Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China
- Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Hengling Chen
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China
- Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China
- Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
|
23
|
Li X, Zhang S, Wong KC. Deep embedded clustering with multiple objectives on scRNA-seq data. Brief Bioinform 2021; 22:6209682. [PMID: 33822877 DOI: 10.1093/bib/bbab090] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/10/2020] [Revised: 02/17/2021] [Accepted: 02/25/2021] [Indexed: 12/19/2022]
Abstract
In recent years, single-cell RNA sequencing (scRNA-seq) technologies have been widely adopted to interrogate the gene expression of individual cells, bringing opportunities to understand the underlying processes in a high-throughput manner. Deep embedded clustering (DEC) has been shown to be successful on high-dimensional sparse scRNA-seq data, performing feature learning and cluster assignment jointly to identify cell types. However, the deep network architecture for embedded clustering is not trivial to optimize. Therefore, we propose an evolutionary multiobjective DEC that uses multiobjective evolutionary optimization to evolve the hyperparameters and architectures of DEC simultaneously and automatically. First, a denoising autoencoder is integrated into the DEC to project the high-dimensional sparse scRNA-seq data into a low-dimensional space. After that, to guide the evolution, three objective functions are formulated to balance the model's generality and clustering performance for robustness. Meanwhile, migration and mutation operators are proposed to optimize the objective functions and select suitable hyperparameters and architectures of DEC in the multiobjective framework. Multiple comparison analyses are conducted on twenty synthetic datasets and eight real datasets from different representative single-cell sequencing platforms to validate the effectiveness. The experimental results reveal that the proposed algorithm outperforms other state-of-the-art clustering methods under different metrics. In addition, marker gene identification, gene ontology enrichment and pathology analysis are conducted to reveal novel insights into cell type identification and characterization mechanisms.
Affiliation(s)
- Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
|
24
|
Yuan X, Yu J, Xi J, Yang L, Shang J, Li Z, Duan J. CNV_IFTV: An Isolation Forest and Total Variation-Based Detection of CNVs from Short-Read Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:539-549. [PMID: 31180897 DOI: 10.1109/tcbb.2019.2920889] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 06/09/2023]
Abstract
Accurate detection of copy number variations (CNVs) from short-read sequencing data is challenging due to the uneven distribution of reads and the unbalanced amplitudes of gains and losses. The direct use of read depths to measure CNVs tends to limit performance. Thus, robust computational approaches equipped with appropriate statistics are required to detect CNV regions and boundaries. This study proposes a new method called CNV_IFTV to address this need. CNV_IFTV assigns an anomaly score to each genome bin through a collection of isolation trees. The trees are trained with the isolation forest algorithm on subsamples of the measured read depths. Given the anomaly scores, CNV_IFTV uses a total variation model to smooth adjacent bins, leading to a denoised score profile. Finally, a statistical model is established to test the denoised scores for calling CNVs. CNV_IFTV is tested on both simulated and real data in comparison with several peer methods. The results indicate that the proposed method outperforms the peer methods. CNV_IFTV is a reliable tool for detecting CNVs from short-read sequencing data, even at low coverage and low tumor purity. The detection results on tumor samples can help evaluate known cancer genes and predict target drugs for disease diagnosis.
|
25
|
Zhang XF, Ou-Yang L, Yan T, Hu XT, Yan H. A Joint Graphical Model for Inferring Gene Networks Across Multiple Subpopulations and Data Types. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:1043-1055. [PMID: 31794418 DOI: 10.1109/tcyb.2019.2952711] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 06/10/2023]
Abstract
Reconstructing gene networks from gene expression data is a long-standing challenge. In most applications, the observations can be divided into several distinct but related subpopulations, and the gene expression measurements can be collected from multiple data types. Most existing methods are designed to estimate a single gene network from a single dataset. These methods may be suboptimal since they do not exploit the similarities and differences among different subpopulations and data types. In this article, we propose a joint graphical model to estimate multiple gene networks simultaneously. Our model decomposes each subpopulation-specific gene network into a sum of common and unique components and imposes a group lasso penalty on gene networks corresponding to different data types. The gene network variations across subpopulations can be learned automatically through the decomposition of the networks, and the similarities and differences among data types can be captured by the group lasso penalty. Simulation studies demonstrate that our method outperforms state-of-the-art methods. We also apply our method to The Cancer Genome Atlas breast cancer datasets to reconstruct subtype-specific gene networks. Hub nodes in the estimated subnetworks unique to individual cancer subtypes rediscover well-known genes associated with breast cancer subtypes and provide interesting predictions.
|
26
|
Yang XF, Zhou YK, Zhang L, Gao Y, Du PF. Predicting LncRNA Subcellular Localization Using Unbalanced Pseudo-k Nucleotide Compositions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190902151038] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/17/2022]
Abstract
Background:
Long non-coding RNAs (lncRNAs) are transcripts longer than 200 nucleotides that function in the regulation of gene expression. Growing evidence shows that the biological functions of lncRNAs are intimately related to their subcellular localizations. Therefore, it is very important to determine lncRNA subcellular localization.
Methods:
In this paper, we propose a novel method to predict the subcellular localization of lncRNAs. To utilize lncRNA sequence information more comprehensively, we exploit both the k-mer nucleotide composition and the sequence-order correlated factors of lncRNAs to represent lncRNA sequences. A feature selection technique based on Analysis of Variance (ANOVA) is then applied to obtain the optimal feature subset. Finally, we use a support vector machine (SVM) to perform the prediction.
Results:
The AUC of the proposed method reaches 0.9695, which indicates that the proposed predictor is an efficient and reliable tool for determining lncRNA subcellular localization. Furthermore, the predictor reaches a maximum overall accuracy of 90.37% in leave-one-out cross-validation, which clearly outperforms the existing state-of-the-art method.
Conclusion:
The proposed predictor is shown to be feasible and powerful for the prediction of lncRNA subcellular localization. To facilitate subsequent genetic sequence research, we have shared the source code at https://github.com/NicoleYXF/lncRNA.
Affiliation(s)
- Xiao-Fei Yang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Yuan-Ke Zhou
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Lin Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Yang Gao
- School of Medicine, Nankai University, Tianjin 300071, China
| | - Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
|
27
|
Bi XA, Wu H, Xie Y, Zhang L, Luo X, Fu Y. The exploration of Parkinson's disease: a multi-modal data analysis of resting functional magnetic resonance imaging and gene data. Brain Imaging Behav 2020; 15:1986-1996. [PMID: 32990896 DOI: 10.1007/s11682-020-00392-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Accepted: 08/31/2020] [Indexed: 02/02/2023]
Abstract
Parkinson's disease (PD) is the most common chronic degenerative neurological dyskinesia and an important threat to the health of the elderly. At present, research on PD is mainly based on single-modal data analysis, while fusion of multi-modal data may provide more meaningful information for understanding the pathogenesis of PD. In this paper, 104 samples with resting functional magnetic resonance imaging (rfMRI) and gene data are drawn from the Parkinson's Progression Markers Initiative (PPMI) and Alzheimer's Disease Neuroimaging Initiative (ADNI) databases to predict pathological brain areas and risk genes related to PD. In the experiment, Pearson correlation analysis is used to fuse the gene and brain area data into multi-modal sample characteristics, and the clustering evolution random forest (CERF) method is applied to detect the discriminative genes and brain areas. The experimental results indicate that, compared with several existing advanced methods, the CERF method further improves the discrimination between PD patients and healthy controls. More importantly, we find some interesting associations between brain areas and genes in PD patients. Based on these associations, we note that PD-related brain areas include the angular gyrus, thalamus, posterior cingulate gyrus and paracentral lobule, and that risk genes mainly include C6orf10, HLA-DPB1 and HLA-DOA. These discoveries contribute to the early prevention and clinical treatment of PD.
Affiliation(s)
- Xia-An Bi
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, People's Republic of China
- College of Information Science and Engineering, Hunan Normal University, Changsha, People's Republic of China
| | - Hao Wu
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, People's Republic of China
- College of Information Science and Engineering, Hunan Normal University, Changsha, People's Republic of China
| | - Yiming Xie
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, People's Republic of China
- College of Information Science and Engineering, Hunan Normal University, Changsha, People's Republic of China
| | - Lixia Zhang
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, People's Republic of China
- College of Information Science and Engineering, Hunan Normal University, Changsha, People's Republic of China
| | - Xun Luo
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, People's Republic of China
- College of Information Science and Engineering, Hunan Normal University, Changsha, People's Republic of China
| | - Yu Fu
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, People's Republic of China
- College of Information Science and Engineering, Hunan Normal University, Changsha, People's Republic of China
|
28
|
Ranjan A, Fahad MS, Fernandez-Baca D, Deepak A, Tripathi S. Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1648-1659. [PMID: 30998479 DOI: 10.1109/tcbb.2019.2911609] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/09/2023]
Abstract
The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for constructing two separate feature sets for proteins using a bi-directional long short-term memory network, based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences but also significantly higher overall accuracy than the state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for the multi-sized approach are +5.38 and +8.00 percent. When the two models are combined, the improvement in accuracy rises to +7.41 and +9.21 percent, respectively.
|
29
|
Khatun MS, Shoombuatong W, Hasan MM, Kurata H. Evolution of Sequence-based Bioinformatics Tools for Protein-protein Interaction Prediction. Curr Genomics 2020; 21:454-463. [PMID: 33093807 PMCID: PMC7536797 DOI: 10.2174/1389202921999200625103936] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 01/17/2020] [Revised: 03/19/2020] [Accepted: 05/27/2020] [Indexed: 12/22/2022] Open
Abstract
Protein-protein interactions (PPIs) are the physical connections between two or more proteins via electrostatic forces or hydrophobic effects. Identification of PPIs is pivotal because they contribute to many biological processes, including protein function, disease incidence, and therapy design. The experimental identification of PPIs via high-throughput technology is time-consuming and expensive. Bioinformatics approaches are expected to overcome these limitations. In this review, our main goal is to provide an inclusive view of the existing sequence-based computational prediction of PPIs. We first briefly introduce the currently available PPI databases and then review the state-of-the-art bioinformatics approaches, their working principles, and their performance. Finally, we discuss the caveats and future perspectives of next-generation algorithms for the prediction of PPIs.
Affiliation(s)
| | | | - Md. Mehedi Hasan
- Address correspondence to these authors at the Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Japan Society for the Promotion of Science, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo 102-0083, Japan; Tel: +81-948-297-828; and Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Biomedical Informatics R&D Center, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Tel: +81-948-297-828
| | - Hiroyuki Kurata
- Address correspondence to these authors at the Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Japan Society for the Promotion of Science, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo 102-0083, Japan; Tel: +81-948-297-828; and Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Biomedical Informatics R&D Center, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan; Tel: +81-948-297-828
|
30
|
Zhu H, Liu G, Zhou M, Xie Y, Abusorrah A, Kang Q. Optimizing Weighted Extreme Learning Machines for imbalanced classification and application to credit card fraud detection. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.04.078] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Indexed: 11/27/2022]
|
31
|
Yi HC, You ZH, Huang DS, Guo ZH, Chan KCC, Li Y. Learning Representations to Predict Intermolecular Interactions on Large-Scale Heterogeneous Molecular Association Network. iScience 2020; 23:101261. [PMID: 32580123 PMCID: PMC7317230 DOI: 10.1016/j.isci.2020.101261] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 12/03/2019] [Revised: 04/29/2020] [Accepted: 06/08/2020] [Indexed: 02/07/2023] Open
Abstract
Molecular components that are functionally interdependent in human cells constitute molecular association networks. Disease can be caused by the disturbance of multiple molecular interactions. New biomolecular regulatory mechanisms can be revealed by discovering new biomolecular interactions. To this end, a heterogeneous molecular association network is formed by systematically integrating comprehensive associations between miRNAs, lncRNAs, circRNAs, mRNAs, proteins, drugs, microbes, and complex diseases. We propose a machine learning method for predicting intermolecular interactions, named MMI-Pred. More specifically, a network embedding model is developed to fully exploit the network behavior of biomolecules, and attribute features are also calculated. Then, these discriminative features are combined to train a random forest classifier to predict intermolecular interactions. MMI-Pred achieves an accuracy of 93.50% in hybrid association prediction under 5-fold cross-validation. This work provides a systematic landscape and a machine learning method to model and infer complex associations between various biological components.
Affiliation(s)
- Hai-Cheng Yi
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhu-Hong You
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China.
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
| | - Zhen-Hao Guo
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| | - Keith C C Chan
- Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR 999077, China
| | - Yangming Li
- College of Engineering Technology, Rochester Institute of Technology, Rochester, NY 14623, USA
|
32
|
Luo X, Zhou M, Li S, Hu L, Shang M. Non-Negativity Constrained Missing Data Estimation for High-Dimensional and Sparse Matrices from Industrial Applications. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:1844-1855. [PMID: 30835233 DOI: 10.1109/tcyb.2019.2894283] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 06/09/2023]
Abstract
High-dimensional and sparse (HiDS) matrices are commonly seen in big-data-related industrial applications like recommender systems. Latent factor (LF) models have proven to be accurate and efficient in extracting hidden knowledge from them. However, they mostly fail to fulfill the non-negativity constraints that describe the non-negative nature of many industrial data. Moreover, existing models suffer from a slow convergence rate. An alternating-direction-method-of-multipliers-based non-negative LF (AMNLF) model decomposes the task of non-negative LF analysis on an HiDS matrix into small subtasks, where each subtask is solved based on the latest solutions to the previously solved ones, thereby achieving fast convergence and high prediction accuracy for the missing data. This paper theoretically analyzes the characteristics of an AMNLF model and presents detailed empirical studies of its performance on nine HiDS matrices from industrial applications currently in use. Its capability of addressing HiDS matrices is therefore justified in both theory and practice.
|
33
|
Yi HC, You ZH, Wang MN, Guo ZH, Wang YB, Zhou JR. RPI-SE: a stacking ensemble learning framework for ncRNA-protein interactions prediction using sequence information. BMC Bioinformatics 2020; 21:60. [PMID: 32070279 PMCID: PMC7029608 DOI: 10.1186/s12859-020-3406-0] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 07/18/2019] [Accepted: 02/11/2020] [Indexed: 01/03/2023] Open
Abstract
Background: The interactions between non-coding RNAs (ncRNAs) and proteins play an essential role in many biological processes. Several high-throughput experimental methods have been applied to detect ncRNA-protein interactions. However, these methods are time-consuming and expensive. Accurate and efficient computational methods can assist and accelerate the study of ncRNA-protein interactions. Results: In this work, we develop a stacking ensemble computational framework, RPI-SE, for effectively predicting ncRNA-protein interactions. More specifically, to fully exploit protein and RNA sequence features, a Position Weight Matrix combined with Legendre Moments is applied to obtain protein evolutionary information. Meanwhile, a k-mer sparse matrix is employed to extract efficient features of ncRNA sequences. Finally, an ensemble learning framework integrating different types of base classifiers is developed to predict ncRNA-protein interactions using these discriminative features. The accuracy and robustness of RPI-SE were evaluated on three benchmark data sets under five-fold cross-validation and compared with other state-of-the-art methods. Conclusions: The results demonstrate that RPI-SE is competent for the ncRNA-protein interaction prediction task, with high accuracy and robustness. It is anticipated that this work can provide a computational prediction tool to advance biomedical research related to ncRNA-protein interactions.
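The k-mer sparse matrix named in this abstract can be sketched as follows. This is an illustration under an assumed construction (one row per possible k-mer, one column per sequence position, and a 1 where that k-mer starts at that position); the paper's exact definition may differ, and the function name and example sequence are hypothetical.

```python
from itertools import product

def kmer_sparse_matrix(seq, k=3, alphabet="ACGU"):
    """Return (kmers, matrix), where matrix[i][j] == 1 iff kmers[i] starts at seq[j]."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    row_of = {kmer: i for i, kmer in enumerate(kmers)}
    n_cols = len(seq) - k + 1
    if n_cols < 1:
        raise ValueError("sequence is shorter than k")
    matrix = [[0] * n_cols for _ in kmers]
    for j in range(n_cols):
        i = row_of.get(seq[j:j + k])
        # Windows containing symbols outside the alphabet leave an all-zero column.
        if i is not None:
            matrix[i][j] = 1
    return kmers, matrix

# Hypothetical toy RNA sequence; its 2-mers are AC, CG, GU, UA, AC.
kmers, matrix = kmer_sparse_matrix("ACGUAC", k=2)
ac_row = matrix[kmers.index("AC")]
```

Unlike a plain k-mer frequency vector, this layout keeps the position of each k-mer, at the cost of a matrix whose width depends on the sequence length.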
Affiliation(s)
- Hai-Cheng Yi
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China.,University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Zhu-Hong You
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China.,University of Chinese Academy of Sciences, Beijing, 100049, China.
| | - Mei-Neng Wang
- School of Mathematics and Computer Science, Yichun University, Yichun, 336000, China
| | - Zhen-Hao Guo
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China
| | - Yan-Bin Wang
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China
| | - Ji-Ren Zhou
- Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, 830011, China
|
34
|
Xu S, Wang P, Zhang CX, Lu J. Spectral Learning Algorithm Reveals Propagation Capability of Complex Networks. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:4253-4261. [PMID: 30137020 DOI: 10.1109/tcyb.2018.2861568] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 05/20/2023]
Abstract
In network science and the data mining field, a long-standing and significant task is to predict the propagation capability of nodes in a complex network. Recently, an increasing number of unsupervised learning algorithms, such as the prominent PageRank (PR) and LeaderRank (LR), have been developed to address this issue. However, this paper finds that, in degree-uncorrelated networks, PR and LR are actually proportional to the in-degree of nodes. As a result, the two algorithms fail to accurately predict the nodes' propagation capability. To overcome this drawback, this paper proposes a new iterative algorithm called SpectralRank (SR), in which a node's propagation capability is assumed to be proportional to the number of its neighbors after adding a ground node to the network. Moreover, a weighted SR algorithm is also proposed to further incorporate a priori information about the node itself. A probabilistic framework is established as the theoretical foundation of the proposed algorithms. Simulations of the susceptible-infected-removed model on 32 networks, including directed, undirected, and binary ones, reveal the advantages of the SR-family methods (i.e., weighted and unweighted SR) over PR and LR. When compared with 11 other well-known algorithms, the indices in the SR-family always outperform the others. Therefore, the proposed measures provide new insights into the prediction of the nodes' propagation capability and have great implications for the control of spreading behaviors in complex networks.
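For context, the PageRank baseline that this abstract compares against can be sketched with a short power iteration. This is standard PageRank only, not the SpectralRank algorithm proposed in the paper; the damping factor, iteration count, function name and toy graph are assumptions made for illustration.

```python
def pagerank(edges, n, damping=0.85, iters=100):
    """Power-iteration PageRank on a directed graph with nodes 0..n-1."""
    out = [[] for _ in range(n)]
    for u, v in edges:
        out[u].append(v)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for u in range(n):
            if out[u]:
                # Split the damped rank of u evenly among its out-neighbours.
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
            else:
                # Dangling node: spread its damped rank uniformly over all nodes.
                share = damping * rank[u] / n
                for v in range(n):
                    new[v] += share
        rank = new
    return rank

# Hypothetical toy graph: node 2 receives links from nodes 0, 1 and 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
scores = pagerank(edges, 4)
in_degree = [sum(1 for _, v in edges if v == node) for node in range(4)]
```

Comparing `scores` with `in_degree` on larger graphs is one way to probe how closely PageRank tracks in-degree, which is the relationship the abstract discusses for degree-uncorrelated networks.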
|
35
|
Chen ZH, You ZH, Zhang WB, Wang YB, Cheng L, Alghazzawi D. Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model. Genes (Basel) 2019; 10:genes10110924. [PMID: 31726752 PMCID: PMC6896115 DOI: 10.3390/genes10110924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2019] [Revised: 11/05/2019] [Accepted: 11/06/2019] [Indexed: 11/22/2022] Open
Abstract
Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional biological experimental methods for predicting SIPs have been developed in the past few years. However, these methods are costly, time-consuming and inefficient, which often limits their use for predicting SIPs. Computational methods are therefore needed. In this paper, we propose, for the first time, a novel deep learning model combined with a natural language processing (NLP) method to predict potential SIPs from protein sequence information. More specifically, the protein sequence is de novo assembled by k-mers. Then, we obtain the global vectors representation of each protein sequence using an NLP technique. Finally, based on the knowledge of known self-interacting and non-interacting proteins, a multi-grained cascade forest model is trained to predict SIPs. Comprehensive experiments were performed on yeast and human datasets, which obtained accuracy rates of 91.45% and 93.12%, respectively. From our evaluations, the experimental results show that the use of amino acid semantic information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work has potential applications for various biological classification problems.
Affiliation(s)
- Zhan-Heng Chen
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; (Z.-H.C.); (W.-B.Z.); (Y.-B.W.); (L.C.)
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhu-Hong You
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; (Z.-H.C.); (W.-B.Z.); (Y.-B.W.); (L.C.)
- University of Chinese Academy of Sciences, Beijing 100049, China
- Correspondence: Tel.: +86-991-3835-823
| | - Wen-Bo Zhang
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; (Z.-H.C.); (W.-B.Z.); (Y.-B.W.); (L.C.)
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yan-Bin Wang
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; (Z.-H.C.); (W.-B.Z.); (Y.-B.W.); (L.C.)
| | - Li Cheng
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; (Z.-H.C.); (W.-B.Z.); (Y.-B.W.); (L.C.)
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Daniyal Alghazzawi
- Department of Information Systems, King Abdulaziz University, Jeddah 21589, Saudi Arabia
|
36
|
Yi HC, You ZH, Guo ZH. Construction and Analysis of Molecular Association Network by Combining Behavior Representation and Node Attributes. Front Genet 2019; 10:1106. [PMID: 31788002 PMCID: PMC6854842 DOI: 10.3389/fgene.2019.01106] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/08/2019] [Accepted: 10/15/2019] [Indexed: 11/13/2022] Open
Abstract
A key aim of post-genomic biomedical research is to systematically understand and model complex biomolecular activities. Biomolecular interactions are widespread and interrelated; multiple biomolecules coordinate to sustain life activities, and any disturbance of these complex connections can lead to abnormal life activities or complex diseases. However, much existing research focuses only on individual intermolecular interactions. In this work, we revealed, constructed, and analyzed a large-scale molecular association network of multiple biomolecules in humans by integrating associations among lncRNAs, miRNAs, proteins, drugs, and diseases, in which various associations are interconnected and any type of association can be predicted. We propose Molecular Association Network (MAN)–High-Order Proximity preserved Embedding (HOPE), a novel network representation learning based method that fully exploits the latent features of biomolecules to accurately predict associations between molecules. More specifically, the network representation learning algorithm HOPE was applied to learn the behavior features of nodes in the association network. Attribute features of nodes were also adopted. Then, the machine learning model CatBoost was trained to predict potential associations between any nodes. The performance of our method was evaluated under five-fold cross validation. A case study to predict miRNA-disease associations was also conducted to verify the prediction capability. MAN-HOPE achieves a high accuracy of 93.3% and an area under the receiver operating characteristic curve of 0.9793. The experimental results demonstrate the novelty of our systematic understanding of the intermolecular associations, and enable systematic exploration of the landscape of molecular interactions that shape specialized cellular functions.
Affiliation(s)
- Hai-Cheng Yi
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China.,University of Chinese Academy of Sciences, Beijing, China
| | - Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
| | - Zhen-Hao Guo
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
|
37
|
Hu L, Yuan X, Liu X, Xiong S, Luo X. Efficiently Detecting Protein Complexes from Protein Interaction Networks via Alternating Direction Method of Multipliers. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1922-1935. [PMID: 29994334 DOI: 10.1109/tcbb.2018.2844256] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/08/2023]
Abstract
Protein complexes are crucial in improving our understanding of the mechanisms employed by proteins. Various computational algorithms have thus been proposed to detect protein complexes from protein interaction networks. However, given the massive protein interactome data obtained by high-throughput technologies, existing algorithms, especially those that additionally consider biological information about proteins, either have low efficiency in performing their tasks or suffer from limited effectiveness. To address this issue, this work proposes to detect protein complexes from a protein interaction network with high efficiency and effectiveness. To do so, the original detection task is first formulated as an optimization problem according to the intuitive properties of protein complexes. After that, the framework of the alternating direction method of multipliers is applied to decompose this optimization problem into several subtasks, which can subsequently be solved separately and in parallel. An algorithm implementing this solution is then developed. Experimental results on five large protein interaction networks demonstrated that our algorithm outperformed state-of-the-art protein complex detection algorithms in terms of both effectiveness and efficiency. Moreover, as the number of parallel processes increases, one can expect even higher computational efficiency for the proposed algorithm with no compromise on effectiveness.
|
38
|
Zheng K, You ZH, Wang L, Zhou Y, Li LP, Li ZW. MLMDA: a machine learning approach to predict and validate MicroRNA-disease associations by integrating of heterogenous information sources. J Transl Med 2019; 17:260. [PMID: 31395072 PMCID: PMC6688360 DOI: 10.1186/s12967-019-2009-x] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 05/14/2019] [Accepted: 07/31/2019] [Indexed: 02/01/2023] Open
Abstract
Background: Emerging evidence shows that microRNA (miRNA) plays an important role in many human complex diseases. However, because traditional in vitro experiments are inherently time-consuming and expensive, more and more attention has been paid to the development of efficient and feasible computational methods to predict the potential associations between miRNA and disease. Methods: In this work, we present a machine learning-based model called MLMDA for predicting the association of miRNAs and diseases. More specifically, we first use the k-mer sparse matrix to extract miRNA sequence information, and combine it with miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity information. Then, more representative features are extracted from them through a deep auto-encoder neural network (AE). Finally, a random forest classifier is used to predict potential miRNA–disease associations. Results: The experimental results show that the MLMDA model achieves promising performance under fivefold cross validation with an AUC value of 0.9172, which is higher than that of the methods using different classifiers or different feature combination methods mentioned in this paper. In addition, to further evaluate the prediction performance of the MLMDA model, case studies are carried out with three human complex diseases: Lymphoma, Lung Neoplasm, and Esophageal Neoplasms. As a result, 39, 37 and 36 out of the top 40 predicted miRNAs are confirmed by other miRNA–disease association databases. Conclusions: These experimental results suggest that the MLMDA model could serve as a useful tool guiding the future experimental validation of promising miRNA biomarker candidates. The source code and datasets explored in this work are available at http://220.171.34.3:81/.
Affiliation(s)
- Kai Zheng
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China.
| | - Zhu-Hong You
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, 830011, China.
| | - Lei Wang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, 830011, China.,College of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277100, China.
| | - Yong Zhou
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China
| | - Li-Ping Li
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, 830011, China
| | - Zheng-Wei Li
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, China
|
39
|
Li Y, Li LP, Wang L, Yu CQ, Wang Z, You ZH. An Ensemble Classifier to Predict Protein-Protein Interactions by Combining PSSM-based Evolutionary Information with Local Binary Pattern Model. Int J Mol Sci 2019; 20:E3511. [PMID: 31319578 PMCID: PMC6679202 DOI: 10.3390/ijms20143511] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 06/02/2019] [Revised: 07/04/2019] [Accepted: 07/15/2019] [Indexed: 01/03/2023] Open
Abstract
Proteins play a critical role in the regulation of biological cell functions. Whether proteins interact with each other has become a fundamental problem, because proteins usually perform their functions by interacting with other proteins. Although a large amount of protein-protein interaction (PPI) data has been produced by high-throughput biotechnology, biological experimental techniques are time-consuming and costly. Thus, computational methods for predicting protein interactions have become a research hot spot. In this research, we propose an efficient computational method that combines the Rotation Forest (RF) classifier with the Local Binary Pattern (LBP) feature extraction method to predict PPIs from the perspective of the Position-Specific Scoring Matrix (PSSM). The proposed method achieved superior performance on the Yeast, Human, and H. pylori datasets, with average accuracies of 92.12%, 96.21%, and 86.59%, respectively. In addition, we also evaluated the performance of the proposed method on four independent datasets: C. elegans, H. pylori, H. sapiens, and M. musculus. These experimental results show that our model has good feasibility and robustness in predicting PPIs.
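The Local Binary Pattern (LBP) feature extraction mentioned in this abstract can be illustrated with a basic 3x3 LBP histogram over a numeric matrix. This is a generic LBP sketch, not the authors' implementation; the neighbour ordering, the greater-or-equal threshold rule, the function name and the toy matrix (a stand-in for a real PSSM) are assumptions.

```python
def lbp_histogram(matrix):
    """256-bin histogram of basic 3x3 LBP codes over the interior cells of a matrix."""
    rows, cols = len(matrix), len(matrix[0])
    # Eight neighbours, clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = matrix[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as large as the centre.
                if matrix[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist

# Hypothetical toy matrix; a real PSSM has one row per residue and 20 columns.
toy = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
hist = lbp_histogram(toy)
```

The histogram has a fixed length of 256 regardless of the matrix size, which is what makes it usable as a classifier input for proteins of different lengths.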
Affiliation(s)
- Yang Li
- School of Information Engineering, Xijing University, Xi'an 710123, China
| | - Li-Ping Li
- School of Information Engineering, Xijing University, Xi'an 710123, China.
| | - Lei Wang
- College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, China.
| | - Chang-Qing Yu
- School of Information Engineering, Xijing University, Xi'an 710123, China.
| | - Zheng Wang
- School of Information Engineering, Xijing University, Xi'an 710123, China
| | - Zhu-Hong You
- School of Information Engineering, Xijing University, Xi'an 710123, China
|
40
|
Yi HC, You ZH, Zhou X, Cheng L, Li X, Jiang TH, Chen ZH. ACP-DL: A Deep Learning Long Short-Term Memory Model to Predict Anticancer Peptides Using High-Efficiency Feature Representation. MOLECULAR THERAPY. NUCLEIC ACIDS 2019; 17:1-9. [PMID: 31173946 PMCID: PMC6554234 DOI: 10.1016/j.omtn.2019.04.025] [Citation(s) in RCA: 122] [Impact Index Per Article: 20.3] [Received: 03/09/2019] [Revised: 04/08/2019] [Accepted: 04/08/2019] [Indexed: 01/10/2023]
Abstract
Cancer is a well-known killer of human beings, which has led to countless deaths and misery. Anticancer peptides open a promising perspective for cancer treatment, and they have various attractive advantages. Conventional wet experiments are expensive and inefficient for finding and identifying novel anticancer peptides. There is an urgent need to develop a novel computational method to predict novel anticancer peptides. In this study, we propose a deep learning long short-term memory (LSTM) neural network model, ACP-DL, to effectively predict novel anticancer peptides. More specifically, to fully exploit peptide sequence information, we developed an efficient feature representation approach by integrating the binary profile feature and the k-mer sparse matrix of the reduced amino acid alphabet. Then we implemented a deep LSTM model to automatically learn how to distinguish anticancer peptides from non-anticancer peptides. To our knowledge, this is the first time that a deep LSTM model has been applied to predict anticancer peptides. Cross-validation experiments demonstrated that the proposed ACP-DL remarkably outperformed other comparison methods, with high accuracy and satisfactory specificity on benchmark datasets. In addition, we also contributed two new anticancer peptide benchmark datasets, ACP740 and ACP240, in this work. The source code and datasets are available at https://github.com/haichengyi/ACP-DL.
Affiliation(s)
- Hai-Cheng Yi
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhu-Hong You
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
| | - Xi Zhou
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| | - Li Cheng
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| | - Xiao Li
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| | - Tong-Hai Jiang
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| | - Zhan-Heng Chen
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
| |
|
41
|
Luo X, Zhou M. Effects of Extended Stochastic Gradient Descent Algorithms on Improving Latent Factor-Based Recommender Systems. IEEE Robot Autom Lett 2019. [DOI: 10.1109/lra.2019.2891986] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
|
42
|
Chen ZH, Li LP, He Z, Zhou JR, Li Y, Wong L. An Improved Deep Forest Model for Predicting Self-Interacting Proteins From Protein Sequence Using Wavelet Transformation. Front Genet 2019; 10:90. [PMID: 30881376 PMCID: PMC6405691 DOI: 10.3389/fgene.2019.00090] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 10/11/2018] [Accepted: 01/29/2019] [Indexed: 12/23/2022] Open
Abstract
Self-interacting proteins (SIPs), in which two or more copies of the same protein interact with each other, play significant roles in the understanding of cellular processes and cell functions. Although a number of experimental methods have been designed to detect SIPs, they remain extremely time-consuming, expensive, and challenging. Therefore, there is an urgent need to develop computational methods for predicting SIPs. In this study, we propose a deep forest based predictor for accurate prediction of SIPs using protein sequence information. More specifically, a novel feature representation method, which integrates the position-specific scoring matrix (PSSM) with the wavelet transform, is introduced. To evaluate the performance of the proposed method, cross-validation tests are performed on two widely used benchmark datasets. The experimental results show that the proposed model achieved high accuracies of 95.43% and 93.65% on the human and yeast datasets, respectively. The AUC values for the yeast and human datasets are 0.9203 and 0.9586, respectively. To further show the advantage of the proposed method, it is compared with several existing methods. The results demonstrate that the proposed model performs better than other SIP prediction methods. This work can offer biologists an effective architecture for detecting new SIPs.
Affiliation(s)
- Zhan-Heng Chen
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Li-Ping Li
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
| | - Zhou He
- College of Engineering and Applied Science, University of Colorado Boulder, Boulder, CO, United States
| | - Ji-Ren Zhou
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
| | - Yangming Li
- ECTET, Rochester Institute of Technology, Rochester, NY, United States
| | - Leon Wong
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- University of Chinese Academy of Sciences, Beijing, China
| |
|
43
|
Zhan ZH, Jia LN, Zhou Y, Li LP, Yi HC. BGFE: A Deep Learning Model for ncRNA-Protein Interaction Predictions Based on Improved Sequence Information. Int J Mol Sci 2019; 20:E978. [PMID: 30813451 PMCID: PMC6412311 DOI: 10.3390/ijms20040978] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 01/01/2019] [Revised: 02/19/2019] [Accepted: 02/20/2019] [Indexed: 11/26/2022] Open
Abstract
The interactions between ncRNAs and proteins are critical for regulating various cellular processes in organisms, such as the regulation of gene expression. However, because current experimental methods for identifying ncRNA-protein interactions are costly in money and materials, an innovative and practical computational approach with convincing prediction accuracy is needed. In this study, working from protein sequences, we put forward an effective deep learning method, named BGFE, to predict ncRNA-protein interactions. Protein sequences are represented by a bi-gram probability feature extraction method applied to the Position Specific Scoring Matrix (PSSM), and ncRNA sequences are represented by k-mer sparse matrices. Furthermore, to extract hidden high-level feature information, a stacked auto-encoder network is employed with the stacked ensemble integration strategy. We evaluate the performance of the proposed method on three datasets using five-fold cross-validation, with the features classified by a random forest classifier. The experimental results clearly demonstrate the effectiveness and the prediction accuracy of our approach. In general, the proposed method is helpful for predicting ncRNA-protein interactions and provides useful guidance for future biological research.
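The abstract does not spell out how the k-mer sparse matrix is built, so the following is only a minimal sketch of one common construction: each column marks which k-mer starts at that position of the sequence. The function name `kmer_sparse_matrix`, the default `k=3`, and the `ACGU` alphabet are assumptions made for illustration, not details taken from the paper.

```python
from itertools import product

import numpy as np


def kmer_sparse_matrix(seq, k=3, alphabet="ACGU"):
    """Return a (len(alphabet)**k, len(seq) - k + 1) matrix of 0s and 1s.

    Rows follow the order in which itertools.product enumerates k-mers over
    `alphabet`; entry (r, j) is 1 when seq[j:j + k] equals the r-th k-mer.
    """
    index = {"".join(p): r for r, p in enumerate(product(alphabet, repeat=k))}
    n_cols = len(seq) - k + 1
    if n_cols <= 0:
        raise ValueError("sequence is shorter than k")
    matrix = np.zeros((len(index), n_cols), dtype=np.uint8)
    for j in range(n_cols):
        r = index.get(seq[j:j + k])
        if r is not None:  # windows containing other symbols stay all-zero
            matrix[r, j] = 1
    return matrix


# Hypothetical toy sequence, not taken from any dataset in the paper.
m = kmer_sparse_matrix("ACGUAC", k=2)
print(m.shape)
```

Because every column has at most one non-zero entry, the matrix keeps positional information that a plain k-mer frequency vector discards.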
Affiliation(s)
- Zhao-Hui Zhan
- China University of Mining and Technology, Xuzhou 221116, China.
| | - Li-Na Jia
- College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, Shandong, China.
| | - Yong Zhou
- China University of Mining and Technology, Xuzhou 221116, China.
| | - Li-Ping Li
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
| | - Hai-Cheng Yi
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
| |
|
44
|
Chen ZH, You ZH, Li LP, Wang YB, Wong L, Yi HC. Prediction of Self-Interacting Proteins from Protein Sequence Information Based on Random Projection Model and Fast Fourier Transform. Int J Mol Sci 2019; 20:ijms20040930. [PMID: 30795499 PMCID: PMC6412412 DOI: 10.3390/ijms20040930] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 12/05/2018] [Revised: 01/06/2019] [Accepted: 01/07/2019] [Indexed: 12/30/2022] Open
Abstract
Predicting self-interacting proteins (SIPs) is an important task in bioinformatics. SIPs are proteins in which two or more identical copies, expressed from the same gene, interact with each other. Self-interaction plays a major role in the evolution of protein-protein interactions (PPIs) and in cellular functions. Owing to the limitations of experimental identification of self-interacting proteins, it is increasingly important to develop a useful computational tool for predicting SIPs from protein sequence information. Therefore, we propose a novel prediction model called RP-FFT that merges the Random Projection (RP) model and the Fast Fourier Transform (FFT) for detecting SIPs. First, each protein sequence was transformed into a Position Specific Scoring Matrix (PSSM) using Position Specific Iterated BLAST (PSI-BLAST). Second, the features of protein sequences were extracted by applying the FFT to the PSSM. Lastly, we evaluated the performance of RP-FFT and compared the RP classifier with the state-of-the-art support vector machine (SVM) classifier and other existing methods on the human and yeast datasets; under five-fold cross-validation, the RP-FFT model obtained high average accuracies of 96.28% and 91.87% on the human and yeast datasets, respectively. The experimental results demonstrated that our RP-FFT prediction model is reasonable and robust.
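The abstract says only that features were extracted by applying the FFT to the PSSM. As a hedged illustration, the sketch below transforms each of the 20 PSSM columns along the sequence axis and keeps the magnitudes of the lowest-frequency coefficients, which gives a fixed-length vector for proteins of any length. The function name `fft_pssm_features`, the parameter `n_coeffs`, and the zero-padding rule are assumptions, not the authors' published procedure.

```python
import numpy as np


def fft_pssm_features(pssm, n_coeffs=8):
    """Map an (L, 20) PSSM to a fixed-length vector of n_coeffs * 20 values.

    Each column is transformed with a real FFT along the sequence axis and
    the magnitudes of the first `n_coeffs` coefficients are kept. Proteins
    too short to yield `n_coeffs` coefficients are padded with zeros.
    """
    pssm = np.asarray(pssm, dtype=float)
    spectrum = np.abs(np.fft.rfft(pssm, axis=0))  # shape (L // 2 + 1, 20)
    features = np.zeros((n_coeffs, pssm.shape[1]))
    keep = min(n_coeffs, spectrum.shape[0])
    features[:keep] = spectrum[:keep]
    return features.ravel()


# Hypothetical random matrix standing in for a PSI-BLAST PSSM.
rng = np.random.default_rng(0)
print(fft_pssm_features(rng.normal(size=(120, 20))).shape)
```

Truncating to low-frequency magnitudes is one simple way to obtain equal-length inputs for a downstream classifier; other choices (for example, keeping phases or using a different number of coefficients) are equally plausible readings of the abstract.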
Affiliation(s)
- Zhan-Heng Chen
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
- University of Chinese Academy of Sciences, Beijing 100049, China.
| | - Zhu-Hong You
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
- University of Chinese Academy of Sciences, Beijing 100049, China.
| | - Li-Ping Li
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
| | - Yan-Bin Wang
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
| | - Leon Wong
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
- University of Chinese Academy of Sciences, Beijing 100049, China.
| | - Hai-Cheng Yi
- The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
- University of Chinese Academy of Sciences, Beijing 100049, China.
| |
|
45
|
Ijaq J, Malik G, Kumar A, Das PS, Meena N, Bethi N, Sundararajan VS, Suravajhala P. A model to predict the function of hypothetical proteins through a nine-point classification scoring schema. BMC Bioinformatics 2019; 20:14. [PMID: 30621574 PMCID: PMC6325861 DOI: 10.1186/s12859-018-2554-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 06/05/2018] [Accepted: 11/30/2018] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Hypothetical proteins (HPs) are those that are predicted to be expressed in an organism, but for which no evidence of existence is known. In the recent past, annotation and curation efforts have helped overcome the challenge of understanding their diverse functions. Researchers have developed techniques to decipher sequence-structure-function relationships, especially in terms of functional modelling of HPs, but using the features as classifiers for HPs has not been attempted. With the rise in the number of annotation strategies, next-generation sequencing methods have furthered understanding of the functions of HPs. RESULTS In our previous work, we developed a six-point classification scoring schema with annotation pertaining to protein family scores, orthology, protein interaction/association studies, bidirectional best BLAST hits, sorting signals, and known databases and visualizers, which were used to validate protein interactions. In this study, we introduced three more classifiers to our annotation system, viz. pseudogenes linked to HPs, homology modelling, and non-coding RNAs associated with HPs. We discuss the challenges and performance of these classifiers using machine learning heuristics, with accuracy improving for Perceptron (81.08 to 97.67), Naive Bayes (54.05 to 96.67), Decision tree J48 (67.57 to 97.00), and SMO_npolyk (59.46 to 96.67). CONCLUSION With the introduction of three new classification features, the nine-point classification scoring schema annotates HPs functionally with improved accuracy.
Affiliation(s)
- Johny Ijaq
- Department of Biotechnology, Osmania University, Hyderabad, 500007 India
- Bioclues.org, Kukatpally, Hyderabad, 500072 India
| | - Girik Malik
- Department of Pediatrics, The Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children’s Hospital, The Ohio State University, Columbus, OH USA
- Bioclues.org, Kukatpally, Hyderabad, 500072 India
- Labrynthe, New Delhi, India
| | - Anuj Kumar
- Bioclues.org, Kukatpally, Hyderabad, 500072 India
- Advanced Center for Computational and Applied Biotechnology, Uttarakhand Council for Biotechnology, Dehradun, 248007 India
| | - Partha Sarathi Das
- Bioclues.org, Kukatpally, Hyderabad, 500072 India
- Department of Microbiology, Bioinformatics Infrastructure Facility, Vidyasagar University, Midnapore, India
| | - Narendra Meena
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, RJ 302001 India
| | - Neeraja Bethi
- Department of Biotechnology, Osmania University, Hyderabad, 500007 India
| | | | - Prashanth Suravajhala
- Bioclues.org, Kukatpally, Hyderabad, 500072 India
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, RJ 302001 India
| |
|
46
|
You ZH, Huang W, Zhang S, Huang YA, Yu CQ, Li LP. An Efficient Ensemble Learning Approach for Predicting Protein-Protein Interactions by Integrating Protein Primary Sequence and Evolutionary Information. IEEE/ACM Trans Comput Biol Bioinform 2018; 16:809-817. [PMID: 30475726 DOI: 10.1109/tcbb.2018.2882423] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 06/09/2023]
Abstract
Protein-protein interactions (PPIs) perform a very important function in many cellular processes, including signal transduction, post-translational modifications, apoptosis, and cell growth. Deregulation of PPIs results in many diseases, including cancer and pernicious anemia. Although many high-throughput methods have been applied to generate a large amount of PPI data, they are generally expensive, inefficient, and labor-intensive. Hence, there is an urgent need for a computational method to accurately and rapidly detect PPIs. In this article, we propose a highly efficient approach to predict PPIs by integrating a new protein sequence substitution matrix feature representation with an ensemble weighted sparse representation model classifier. The proposed method is demonstrated on a Saccharomyces cerevisiae dataset, where it achieved 99.26% prediction accuracy with 98.53% sensitivity at a precision of 100%, a much higher predictive accuracy than current state-of-the-art algorithms. Extensive experiments on benchmark datasets from Human and Helicobacter pylori show that the proposed method achieves better success rates than other existing approaches to this problem. The experimental results illustrate that our proposed method offers an economical approach for the computational construction of PPI networks, which can be a helpful supplementary method for future proteomics research.
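For readers comparing the figures quoted in these abstracts, accuracy, sensitivity, and precision all follow from the four confusion-matrix counts under their standard definitions. The short sketch below shows that arithmetic; the function name `classification_metrics` and the counts in the example are hypothetical and are not results from any of the cited papers.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. there is at least one
    positive, one negative, and one positive prediction.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # also called recall
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts: 100 interacting and 100 non-interacting pairs.
print(classification_metrics(tp=90, fp=0, tn=100, fn=10))
```

A precision of 100% means there were no false positives, so in that case the accuracy is determined entirely by the sensitivity and the class balance.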
|
47
|
Zhan ZH, You ZH, Li LP, Zhou Y, Yi HC. Accurate Prediction of ncRNA-Protein Interactions From the Integration of Sequence and Evolutionary Information. Front Genet 2018; 9:458. [PMID: 30349558 PMCID: PMC6186793 DOI: 10.3389/fgene.2018.00458] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 08/07/2018] [Accepted: 09/19/2018] [Indexed: 12/18/2022] Open
Abstract
Non-coding RNA (ncRNA) plays a crucial role in numerous biological processes, including gene expression and post-transcriptional gene regulation. The biological function of ncRNA is mostly realized by binding with related proteins. Therefore, an accurate understanding of interactions between ncRNA and protein has a significant impact on current biological research. The major challenge at this stage is that traditional interaction pattern prediction methods spend a great deal of time and resources on classification. LightGBM, an efficient classifier, can address this long running time. In this study, we employed LightGBM as the integrated classifier and proposed a novel computational model for predicting ncRNA-protein interactions. More specifically, pseudo-Zernike moments and the singular value decomposition algorithm are employed to extract discriminative features from protein and ncRNA sequences. On four widely used datasets, RPI369, RPI488, RPI1807, and RPI2241, we evaluated the performance of LGBM and obtained superior performance, with AUCs of 0.799, 0.914, 0.989, and 0.762, respectively. The experimental results of 10-fold cross-validation show that the proposed method performs much better than existing methods in predicting ncRNA-protein interaction patterns and could be a useful tool in proteomics research.
Affiliation(s)
- Zhao-Hui Zhan
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
| | - Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
| | - Li-Ping Li
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
| | - Yong Zhou
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
| | - Hai-Cheng Yi
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
| |
|
48
|
Luo X, Zhou M, Li S, Xia Y, You ZH, Zhu Q, Leung H. Incorporation of Efficient Second-Order Solvers Into Latent Factor Models for Accurate Prediction of Missing QoS Data. IEEE Trans Cybern 2018; 48:1216-1228. [PMID: 28422674 DOI: 10.1109/tcyb.2017.2685521] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Indexed: 06/07/2023]
Abstract
Generating highly accurate predictions for missing quality-of-service (QoS) data is an important issue. Latent factor (LF)-based QoS predictors have proven effective in dealing with it. However, they are based on first-order solvers that cannot adequately address the target problem, which is inherently bilinear and nonconvex, thereby leaving a significant opportunity for accuracy improvement. This paper proposes incorporating an efficient second-order solver into LF-based predictors to raise their accuracy. To do so, we adopt the principle of Hessian-free optimization and avoid direct manipulation of a Hessian matrix by employing the efficiently obtainable product between its Gauss-Newton approximation and an arbitrary vector. In this way, second-order information is integrated into the predictors. Experimental results on two industrial QoS datasets indicate that, compared with state-of-the-art predictors, the newly proposed one achieves significantly higher prediction accuracy at the cost of an affordable computational burden. Hence, it is especially suitable for industrial applications requiring high prediction accuracy for unknown QoS data.
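The key step described here is a matrix-vector product with the Gauss-Newton approximation of the Hessian, computed without ever building that matrix. The sketch below illustrates the idea for a plain bilinear model in which the prediction for an observed user-service pair is the inner product of two factor rows; with a squared-error loss the Gauss-Newton matrix is J^T J up to a constant factor, where J is the Jacobian of the predictions. The function name, the absence of bias and regularization terms, and the `damping` parameter are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np


def gauss_newton_vector_product(P, Q, users, items, dP, dQ, damping=0.0):
    """Return (J^T J + damping * I) v for v = (dP, dQ) without forming J.

    The model predicts r_hat[n] = P[users[n]] @ Q[items[n]] for each observed
    entry n; J is the Jacobian of those predictions w.r.t. all of P and Q.
    """
    # J v: directional derivative of every prediction along (dP, dQ).
    jv = (np.einsum("nf,nf->n", dP[users], Q[items])
          + np.einsum("nf,nf->n", P[users], dQ[items]))
    # J^T (J v): scatter the per-observation values back onto the factors.
    gP = np.zeros_like(P)
    gQ = np.zeros_like(Q)
    np.add.at(gP, users, jv[:, None] * Q[items])
    np.add.at(gQ, items, jv[:, None] * P[users])
    return gP + damping * dP, gQ + damping * dQ


# Hypothetical toy problem: 3 users, 4 services, 2 latent factors.
rng = np.random.default_rng(0)
P, Q = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
dP, dQ = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
users, items = np.array([0, 1, 2, 0]), np.array([1, 3, 0, 2])
gP, gQ = gauss_newton_vector_product(P, Q, users, items, dP, dQ)
print(gP.shape, gQ.shape)
```

Each call costs time proportional to the number of observed entries times the number of latent factors, which is what makes it usable inside an iterative solver such as conjugate gradient.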
|
49
|
Li LP, Wang YB, You ZH, Li Y, An JY. PCLPred: A Bioinformatics Method for Predicting Protein-Protein Interactions by Combining Relevance Vector Machine Model with Low-Rank Matrix Approximation. Int J Mol Sci 2018; 19:ijms19041029. [PMID: 29596363 PMCID: PMC5979371 DOI: 10.3390/ijms19041029] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 03/02/2018] [Revised: 03/20/2018] [Accepted: 03/21/2018] [Indexed: 11/30/2022] Open
Abstract
Protein–protein interactions (PPIs) are key to protein functions and regulation within the cell cycle, DNA replication, and cellular signaling. Therefore, detecting whether a pair of proteins interact is of great importance for the study of molecular biology. As researchers have become aware of the importance of computational methods in predicting PPIs, many techniques have been developed for performing this task computationally. However, few of them really meet the needs of their users. In this paper, we develop a novel and efficient sequence-based method for predicting PPIs. The evolutionary features are extracted from the position-specific scoring matrix (PSSM) of each protein. The features are then fed into a robust relevance vector machine (RVM) classifier to distinguish between interacting and non-interacting protein pairs. To verify the performance of our method, five-fold cross-validation tests are performed on the Saccharomyces cerevisiae dataset. A high accuracy of 94.56%, with 94.79% sensitivity at 94.36% precision, was obtained. The experimental results illustrate that the proposed approach can extract the most significant features from each protein sequence and can be a promising and useful tool for proteomics research.
Affiliation(s)
- Li-Ping Li
- Department of Information Engineering, Xijing University, Xi'an 710123, China.
| | - Yan-Bin Wang
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Zhu-Hong You
- Department of Information Engineering, Xijing University, Xi'an 710123, China.
| | - Yang Li
- Department of Information Engineering, Xijing University, Xi'an 710123, China.
| | - Ji-Yong An
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China.
| |
|
50
|
Yi HC, You ZH, Huang DS, Li X, Jiang TH, Li LP. A Deep Learning Framework for Robust and Accurate Prediction of ncRNA-Protein Interactions Using Evolutionary Information. Mol Ther Nucleic Acids 2018; 11:337-344. [PMID: 29858068 PMCID: PMC5992449 DOI: 10.1016/j.omtn.2018.03.001] [Citation(s) in RCA: 84] [Impact Index Per Article: 12.0] [Received: 10/10/2017] [Revised: 02/02/2018] [Accepted: 03/04/2018] [Indexed: 01/01/2023]
Abstract
The interactions between non-coding RNAs (ncRNAs) and proteins play an important role in many biological processes, and the biological functions of ncRNAs are primarily achieved by binding with a variety of proteins. High-throughput biological techniques are used to identify protein molecules bound to specific ncRNAs, but they are usually expensive and time-consuming. Deep learning provides a powerful solution for computationally predicting RNA-protein interactions. In this work, we propose the RPI-SAN model, which uses a deep-learning stacked auto-encoder network to mine hidden high-level features from RNA and protein sequences and feeds them into a random forest (RF) model to predict ncRNA-binding proteins. Stacked assembling is further used to improve the accuracy of the proposed method. Four benchmark datasets, RPI2241, RPI488, RPI1807, and NPInter v2.0, were employed for the unbiased evaluation of five prediction tools: RPI-Pred, IPMiner, RPISeq-RF, lncPro, and RPI-SAN. The experimental results show that our RPI-SAN model achieves much better performance than the other methods, with accuracies of 90.77%, 89.7%, 96.1%, and 99.33%, respectively. It is anticipated that RPI-SAN can be used as an effective computational tool for future biomedical research and can accurately predict potential interacting ncRNA-protein pairs, providing reliable guidance for biological research.
Affiliation(s)
- Hai-Cheng Yi
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhu-Hong You
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China.
| | - Xiao Li
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
| | - Tong-Hai Jiang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
| | - Li-Ping Li
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
| |
|