1
|
Wu C, Lin B, Zhang J, Gao R, Song R, Liu ZP. AttentionEP: Predicting essential proteins via fusion of multiscale features by attention mechanisms. Comput Struct Biotechnol J 2024; 23:4315-4323. [PMID: 39697678 PMCID: PMC11652892 DOI: 10.1016/j.csbj.2024.11.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2024] [Revised: 11/17/2024] [Accepted: 11/25/2024] [Indexed: 12/20/2024] Open
Abstract
Identifying essential proteins is of utmost importance in the field of biomedical research due to their essential functions in cellular activities and their involvement in mechanisms related to diseases. In this research, a novel approach called AttentionEP for predicting essential proteins (EP) is introduced by attention mechanisms. This method leverages both cross-attention and self-attention frameworks, focusing on enhancing prediction accuracy through the integration of features across diverse scales. Spatial characteristics of proteins are obtained from the protein-protein interaction (PPI) network by employing Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT). Following this, Bidirectional Long Short-Term Memory networks (BiLSTM) are employed to derive temporal features from gene expression datasets. Furthermore, spatial characteristics are derived by integrating data on subcellular localization with the application of Deep Neural Networks (DNN). In order to effectively integrate features across multiple scales, initial steps involve the application of self-attention techniques to derive essential insights from each unique data set. Following this, mechanisms involving self-attention and cross-attention are employed to enhance the interaction between diverse information sources. To identify essential proteins, a classifier based on the ResNet architecture is developed. The findings from the experiments indicate that the method introduced here shows superior performance in identifying essential proteins, recording an Area Under the Curve (AUC) value of 0.9433. This approach shows a considerable advantage over established techniques. The findings of this study provide a significant advancement in the comprehension of critical proteins, revealing promising potential for applications in the development of therapeutics and addressing various diseases.
Collapse
Affiliation(s)
- Chuanyan Wu
- School of Intelligent Engineering, Shandong Management University, No.3500 Dingxiang Road, Jinan, Shandong, 250357, China
| | - Bentao Lin
- School of Intelligent Engineering, Shandong Management University, No.3500 Dingxiang Road, Jinan, Shandong, 250357, China
| | - Jialin Zhang
- School of Control Science and Engineering, Shandong University, No.17923 Jingshi Road, Jinan, Shandong, 250061, China
| | - Rui Gao
- School of Control Science and Engineering, Shandong University, No.17923 Jingshi Road, Jinan, Shandong, 250061, China
| | - Rui Song
- School of Control Science and Engineering, Shandong University, No.17923 Jingshi Road, Jinan, Shandong, 250061, China
| | - Zhi-Ping Liu
- School of Control Science and Engineering, Shandong University, No.17923 Jingshi Road, Jinan, Shandong, 250061, China
| |
Collapse
|
2
|
Byeon H, Shabaz M, Ramesh JVN, Dutta AK, Vijay R, Soni M, Patni JC, Rusho MA, Singh PP. Feature fusion-based food protein subcellular prediction for drug composition. Food Chem 2024; 454:139747. [PMID: 38797095 DOI: 10.1016/j.foodchem.2024.139747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2024] [Revised: 05/05/2024] [Accepted: 05/18/2024] [Indexed: 05/29/2024]
Abstract
The structure and function of dietary proteins, as well as their subcellular prediction, are critical for designing and developing new drug compositions and understanding the pathophysiology of certain diseases. As a remedy, we provide a subcellular localization method based on feature fusion and clustering for dietary proteins. Additionally, an enhanced PseAAC (Pseudo-amino acid composition) method is suggested, which builds upon the conventional PseAAC. The study initially builds a novel model of representing the food protein sequence by integrating autocorrelation, chi density, and improved PseAAC to better convey information about the food protein sequence. After that, the dimensionality of the fused feature vectors is reduced by using principal component analysis. With prediction accuracies of 99.24% in the Gram-positive dataset and 95.33% in the Gram-negative dataset, respectively, the experimental findings demonstrate the practicability and efficacy of the proposed approach. This paper is basically exploring pseudo-amino acid composition of not any clinical aspect but exploring a pharmaceutical aspect for drug repositioning.
Collapse
Affiliation(s)
- Haewon Byeon
- Department of AI and Software, Inje University, Gimhae 50834, Republic of Korea; Inje University Medical Big Data Research Center, Gimhae 50834, Republic of Korea
| | - Mohammad Shabaz
- Model Institute of Engineering and Technology, Jammu, J&K, India.
| | - Janjhyam Venkata Naga Ramesh
- Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur Dist., Andhra Pradesh 522302, India
| | - Ashit Kumar Dutta
- Department of Computer Science and Information Systems, College of Applied Sciences, AlMaarefa University, Ad Diriyah, Riyadh 13713, Saudi Arabia.
| | - Richa Vijay
- Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
| | - Mukesh Soni
- Dr. D. Y. Patil Vidyapeeth, Pune, Dr. D. Y. Patil School of Science & Technology, Tathawade, Pune, India
| | | | - Maher Ali Rusho
- Department of Lockheed Martin Engineering Management, University of Colorado, Boulder, Boulder, CO 80309, USA.
| | | |
Collapse
|
3
|
Yang W, Zou S, Gao H, Wang L, Ni W. A Novel Method for Targeted Identification of Essential Proteins by Integrating Chemical Reaction Optimization and Naive Bayes Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1274-1286. [PMID: 38536675 DOI: 10.1109/tcbb.2024.3382392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/10/2024]
Abstract
Targeted identification of essential proteins is of great significance for species identification, drug manufacturing, and disease treatment. It is a challenge to analyze the binding mechanism between essential proteins and improve the identification speed while ensuring the accuracy of the identification. This paper proposes a novel method called EPCRO for identifying essential proteins, which incorporates the chemical reaction optimization (CRO) algorithm and the naive Bayes model to effectively detect essential proteins. In EPCRO, the naive Bayes model is employed to analyze the homogeneity between proteins. In order to improve the identification rate and speed of essential proteins, the protein homogeneity rate is integrated into the CRO algorithm to balance between local and global searches. EPCRO is experimentally compared with 17 existing methods (including, DC, SC, IC, EC, LAC, NC, PeC, WDC, EPD-RW, RWHN, TEGS, CFMM, BSPM, AFSO-EP, CVIM, RWEP, and EPPSO-DC) based on biological datasets. The results show that EPCRO is superior to the above methods in identification accuracy and speed.
Collapse
|
4
|
Li G, Luo X, Hu Z, Wu J, Peng W, Liu J, Zhu X. Essential proteins discovery based on dominance relationship and neighborhood similarity centrality. Health Inf Sci Syst 2023; 11:55. [PMID: 37981988 PMCID: PMC10654316 DOI: 10.1007/s13755-023-00252-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2023] [Accepted: 10/13/2023] [Indexed: 11/21/2023] Open
Abstract
Essential proteins play a vital role in development and reproduction of cells. The identification of essential proteins helps to understand the basic survival of cells. Due to time-consuming, costly and inefficient with biological experimental methods for discovering essential proteins, computational methods have gained increasing attention. In the initial stage, essential proteins are mainly identified by the centralities based on protein-protein interaction (PPI) networks, which limit their identification rate due to many false positives in PPI networks. In this study, a purified PPI network is firstly introduced to reduce the impact of false positives in the PPI network. Secondly, by analyzing the similarity relationship between a protein and its neighbors in the PPI network, a new centrality called neighborhood similarity centrality (NSC) is proposed. Thirdly, based on the subcellular localization and orthologous data, the protein subcellular localization score and ortholog score are calculated, respectively. Fourthly, by analyzing a large number of methods based on multi-feature fusion, it is found that there is a special relationship among features, which is called dominance relationship, then, a novel model based on dominance relationship is proposed. Finally, NSC, subcellular localization score, and ortholog score are fused by the dominance relationship model, and a new method called NSO is proposed. In order to verify the performance of NSO, the seven representative methods (ION, NCCO, E_POC, SON, JDC, PeC, WDC) are compared on yeast datasets. The experimental results show that the NSO method has higher identification rate than other methods.
Collapse
Affiliation(s)
- Gaoshi Li
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
| | - Xinlong Luo
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
| | - Zhipeng Hu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
| | - Jingli Wu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
| | - Wei Peng
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500 Yunnan China
| | - Jiafei Liu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
| | - Xiaoshu Zhu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, 541004 China
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, 541004 Guangxi China
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, 541004 Guangxi China
- School of Computer and Information Security & School of Software Engineering, Guilin University of Electronic Science and Technology, Guilin, China
| |
Collapse
|
5
|
Zhao H, Liu G, Cao X. A seed expansion-based method to identify essential proteins by integrating protein-protein interaction sub-networks and multiple biological characteristics. BMC Bioinformatics 2023; 24:452. [PMID: 38036960 PMCID: PMC10688502 DOI: 10.1186/s12859-023-05583-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2023] [Accepted: 11/24/2023] [Indexed: 12/02/2023] Open
Abstract
BACKGROUND The identification of essential proteins is of great significance in biology and pathology. However, protein-protein interaction (PPI) data obtained through high-throughput technology include a high number of false positives. To overcome this limitation, numerous computational algorithms based on biological characteristics and topological features have been proposed to identify essential proteins. RESULTS In this paper, we propose a novel method named SESN for identifying essential proteins. It is a seed expansion method based on PPI sub-networks and multiple biological characteristics. Firstly, SESN utilizes gene expression data to construct PPI sub-networks. Secondly, seed expansion is performed simultaneously in each sub-network, and the expansion process is based on the topological features of predicted essential proteins. Thirdly, the error correction mechanism is based on multiple biological characteristics and the entire PPI network. Finally, SESN analyzes the impact of each biological characteristic, including protein complex, gene expression data, GO annotations, and subcellular localization, and adopts the biological data with the best experimental results. The output of SESN is a set of predicted essential proteins. CONCLUSIONS The analysis of each component of SESN indicates the effectiveness of all components. We conduct comparison experiments using three datasets from two species, and the experimental results demonstrate that SESN achieves superior performance compared to other methods.
Collapse
Affiliation(s)
- He Zhao
- College of Computer Science and Technology, Jilin University, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
| | - Guixia Liu
- College of Computer Science and Technology, Jilin University, Changchun, China.
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China.
| | - Xintian Cao
- College of Computer Science and Technology, Jilin University, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
| |
Collapse
|
6
|
Liu P, Liu C, Mao Y, Guo J, Liu F, Cai W, Zhao F. Identification of essential proteins based on edge features and the fusion of multiple-source biological information. BMC Bioinformatics 2023; 24:203. [PMID: 37198530 DOI: 10.1186/s12859-023-05315-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2023] [Accepted: 04/30/2023] [Indexed: 05/19/2023] Open
Abstract
BACKGROUND A major current focus in the analysis of protein-protein interaction (PPI) data is how to identify essential proteins. As massive PPI data are available, this warrants the design of efficient computing methods for identifying essential proteins. Previous studies have achieved considerable performance. However, as a consequence of the features of high noise and structural complexity in PPIs, it is still a challenge to further upgrade the performance of the identification methods. METHODS This paper proposes an identification method, named CTF, which identifies essential proteins based on edge features including h-quasi-cliques and uv-triangle graphs and the fusion of multiple-source information. We first design an edge-weight function, named EWCT, for computing the topological scores of proteins based on quasi-cliques and triangle graphs. Then, we generate an edge-weighted PPI network using EWCT and dynamic PPI data. Finally, we compute the essentiality of proteins by the fusion of topological scores and three scores of biological information. RESULTS We evaluated the performance of the CTF method by comparison with 16 other methods, such as MON, PeC, TEGS, and LBCC, the experiment results on three datasets of Saccharomyces cerevisiae show that CTF outperforms the state-of-the-art methods. Moreover, our method indicates that the fusion of other biological information is beneficial to improve the accuracy of identification.
Collapse
Affiliation(s)
- Peiqiang Liu
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China.
| | - Chang Liu
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
| | - Yanyan Mao
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
- College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao, China
| | - Junhong Guo
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
| | - Fanshu Liu
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
| | - Wangmin Cai
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
| | - Feng Zhao
- School of Computer Science and Technology, Shandong Technology and Business University, Yantai, China
| |
Collapse
|
7
|
Xue X, Zhang W, Fan A. Comparative analysis of gene ontology-based semantic similarity measurements for the application of identifying essential proteins. PLoS One 2023; 18:e0284274. [PMID: 37083829 PMCID: PMC10121005 DOI: 10.1371/journal.pone.0284274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2022] [Accepted: 03/28/2023] [Indexed: 04/22/2023] Open
Abstract
Identifying key proteins from protein-protein interaction (PPI) networks is one of the most fundamental and important tasks for computational biologists. However, the protein interactions obtained by high-throughput technology are characterized by a high false positive rate, which severely hinders the prediction accuracy of the current computational methods. In this paper, we propose a novel strategy to identify key proteins by constructing reliable PPI networks. Five Gene Ontology (GO)-based semantic similarity measurements (Jiang, Lin, Rel, Resnik, and Wang) are used to calculate the confidence scores for protein pairs under three annotation terms (Molecular function (MF), Biological process (BP), and Cellular component (CC)). The protein pairs with low similarity values are assumed to be low-confidence links, and the refined PPI networks are constructed by filtering the low-confidence links. Six topology-based centrality methods (the BC, DC, EC, NC, SC, and aveNC) are applied to test the performance of the measurements under the original network and refined network. We systematically compare the performance of the five semantic similarity metrics with the three GO annotation terms on four benchmark datasets, and the simulation results show that the performance of these centrality methods under refined PPI networks is relatively better than that under the original networks. Resnik with a BP annotation term performs best among all five metrics with the three annotation terms. These findings suggest the importance of semantic similarity metrics in measuring the reliability of the links between proteins and highlight the Resnik metric with the BP annotation term as a favourable choice.
Collapse
Affiliation(s)
- Xiaoli Xue
- School of Science, East China Jiaotong University, Nanchang, China
| | - Wei Zhang
- School of Science, East China Jiaotong University, Nanchang, China
| | - Anjing Fan
- School of Computer and Information Engineering, Anyang Normal University, Anyang, China
| |
Collapse
|
8
|
Wang L, Peng J, Kuang L, Tan Y, Chen Z. Identification of Essential Proteins Based on Local Random Walk and Adaptive Multi-View Multi-Label Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3507-3516. [PMID: 34788220 DOI: 10.1109/tcbb.2021.3128638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Accumulating evidences have indicated that essential proteins play vital roles in human physiological process. In recent years, although researches on prediction of essential proteins have been developing rapidly, there are as well various limitations such as unsatisfactory data suitability, low accuracy of predictive results and so on. In this manuscript, a novel method called RWAMVL was proposed to predict essential proteins based on the Random Walk and the Adaptive Multi-View multi-label Learning. In RWAMVL, considering that the inherent noise is ubiquitous in existing datasets of known protein-protein interactions (PPIs), a variety of different features including biological features of proteins and topological features of PPI networks were obtained by adopting adaptive multi-view multi-label learning first. And then, an improved random walk method was designed to detect essential proteins based on these different features. Finally, in order to verify the predictive performance of RWAMVL, intensive experiments were done to compare it with multiple state-of-the-art predictive methods under different expeditionary frameworks. And as a result, RWAMVL was proven that it can achieve better prediction accuracy than all those competitive methods, which demonstrated as well that RWAMVL may be a potential tool for prediction of key proteins in the future.
Collapse
|
9
|
Wang C, Zhang H, Ma H, Wang Y, Cai K, Guo T, Yang Y, Li Z, Zhu Y. Inference of pan-cancer related genes by orthologs matching based on enhanced LSTM model. Front Microbiol 2022; 13:963704. [PMID: 36267181 PMCID: PMC9577021 DOI: 10.3389/fmicb.2022.963704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open
Abstract
Many disease-related genes have been found to be associated with cancer diagnosis, which is useful for understanding the pathophysiology of cancer, generating targeted drugs, and developing new diagnostic and treatment techniques. With the development of the pan-cancer project and the ongoing expansion of sequencing technology, many scientists are focusing on mining common genes from The Cancer Genome Atlas (TCGA) across various cancer types. In this study, we attempted to infer pan-cancer associated genes by examining the microbial model organism Saccharomyces Cerevisiae (Yeast) by homology matching, which was motivated by the benefits of reverse genetics. First, a background network of protein-protein interactions and a pathogenic gene set involving several cancer types in humans and yeast were created. The homology between the human gene and yeast gene was then discovered by homology matching, and its interaction sub-network was obtained. This was undertaken following the principle that the homologous genes of the common ancestor may have similarities in expression. Then, using bidirectional long short-term memory (BiLSTM) in combination with adaptive integration of heterogeneous information, we further explored the topological characteristics of the yeast protein interaction network and presented a node representation score to evaluate the node ability in graphs. Finally, homologous mapping for human genes matched the important genes identified by ensemble classifiers for yeast, which may be thought of as genes connected to all types of cancer. One way to assess the performance of the BiLSTM model is through experiments on the database. On the other hand, enrichment analysis, survival analysis, and other outcomes can be used to confirm the biological importance of the prediction results. You may access the whole experimental protocols and programs at https://github.com/zhuyuan-cug/AI-BiLSTM/tree/master.
Collapse
Affiliation(s)
- Chao Wang
- Department of Surgery, Hepatic Surgery Center, Institute of Hepato-Pancreato-Biliary Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Houwang Zhang
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR, China
| | - Haishu Ma
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
| | - Yawen Wang
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
| | - Ke Cai
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
| | - Tingrui Guo
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
| | - Yuanhang Yang
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
| | - Zhen Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
| | - Yuan Zhu
- School of Automation, China University of Geosciences, Wuhan, China
- Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Engineering Research Center of Intelligent Technology for Geo-Exploration, Wuhan, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Shanghai, China
| |
Collapse
|
10
|
Agrawal S, Sisodia DS, Nagwani NK. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 2021; 59:2297-2310. [PMID: 34545514 DOI: 10.1007/s11517-021-02436-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2020] [Accepted: 08/29/2021] [Indexed: 11/24/2022]
Abstract
Advances in high-throughput techniques lead to evolving a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and drug repositioning. Protein subcellular localization is imperative for the functional characterization of protein sequences. Diverse techniques are used on protein sequences for feature extraction. However, many times a single feature extraction technique leads to poor prediction performance. In this paper, two feature augmentations are described through sequence induced, physicochemical, and evolutionary information of the amino acid residues. While augmented features preserve the sequence-order-information and protein-residue-properties. Two bacterial protein datasets Gram-Positive (G +) and Gram-Negative (G-) are utilized for the experimental work. After performing essential preprocessing on protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train the different individual and ensembles such as decision tree (C 4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF) with fivefold cross-validation. Prediction results of the model demonstrate that overall accuracy reported by C4.5 is highest 99.57% on G + and 97.47% on G- datasets with known protein sequences. Similarly, for the UPS overall accuracy of G + is 85.17% with SVM and 82.45% with G- dataset using MLP.
Collapse
Affiliation(s)
- Saurabh Agrawal
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.
| | - Dilip Singh Sisodia
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
| | - Naresh Kumar Nagwani
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
| |
Collapse
|
11
|
Li S, Zhang Z, Li X, Tan Y, Wang L, Chen Z. An iteration model for identifying essential proteins by combining comprehensive PPI network with biological information. BMC Bioinformatics 2021; 22:430. [PMID: 34496745 PMCID: PMC8425031 DOI: 10.1186/s12859-021-04300-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2020] [Accepted: 07/08/2021] [Indexed: 11/10/2022] Open
Abstract
Background Essential proteins have great impacts on cell survival and development, and played important roles in disease analysis and new drug design. However, since it is inefficient and costly to identify essential proteins by using biological experiments, then there is an urgent need for automated and accurate detection methods. In recent years, the recognition of essential proteins in protein interaction networks (PPI) has become a research hotspot, and many computational models for predicting essential proteins have been proposed successively. Results In order to achieve higher prediction performance, in this paper, a new prediction model called TGSO is proposed. In TGSO, a protein aggregation degree network is constructed first by adopting the node density measurement method for complex networks. And simultaneously, a protein co-expression interactive network is constructed by combining the gene expression information with the network connectivity, and a protein co-localization interaction network is constructed based on the subcellular localization data. And then, through integrating these three kinds of newly constructed networks, a comprehensive protein–protein interaction network will be obtained. Finally, based on the homology information, scores can be calculated out iteratively for different proteins, which can be utilized to estimate the importance of proteins effectively. Moreover, in order to evaluate the identification performance of TGSO, we have compared TGSO with 13 different latest competitive methods based on three kinds of yeast databases. And experimental results show that TGSO can achieve identification accuracies of 94%, 82% and 72% out of the top 1%, 5% and 10% candidate proteins respectively, which are to some degree superior to these state-of-the-art competitive models. Conclusions We constructed a comprehensive interactive network based on multi-source data to reduce the noise and errors in the initial PPI, and combined with iterative methods to improve the accuracy of necessary protein prediction, and means that TGSO may be conducive to the future development of essential protein recognition as well.
Collapse
Affiliation(s)
- Shiyuan Li
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China.,Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
| | - Zhen Zhang
- College of Electronic Information and Electrical Engineering, Changsha University, Changsha, 410022, China
| | - Xueyong Li
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China.,Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
| | - Yihong Tan
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China. .,Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China.
| | - Lei Wang
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China.,Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China
| | - Zhiping Chen
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410022, China. .,Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, 410022, China.
| |
Collapse
|
12
|
Peng J, Kuang L, Zhang Z, Tan Y, Chen Z, Wang L. A Novel Model for Identifying Essential Proteins Based on Key Target Convergence Sets. Front Genet 2021; 12:721486. [PMID: 34394201 PMCID: PMC8358660 DOI: 10.3389/fgene.2021.721486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2021] [Accepted: 06/30/2021] [Indexed: 11/20/2022] Open
Abstract
In recent years, many computational models have been designed to detect essential proteins based on protein-protein interaction (PPI) networks. However, due to the incompleteness of PPI networks, the prediction accuracy of these models is still not satisfactory. In this manuscript, a novel key target convergence sets based prediction model (KTCSPM) is proposed to identify essential proteins. In KTCSPM, a weighted PPI network and a weighted (Domain-Domain Interaction) network are constructed first based on known PPIs and PDIs downloaded from benchmark databases. And then, by integrating these two kinds of networks, a novel weighted PDI network is built. Next, through assigning a unique key target convergence set (KTCS) for each node in the weighted PDI network, an improved method based on the random walk with restart is designed to identify essential proteins. Finally, in order to evaluate the predictive effects of KTCSPM, it is compared with 12 competitive state-of-the-art models, and experimental results show that KTCSPM can achieve better prediction accuracy. Considering the satisfactory predictive performance achieved by KTCSPM, it indicates that KTCSPM might be a good supplement to the future research on prediction of essential proteins.
Collapse
Affiliation(s)
- Jiaxin Peng
- College of Computer, Xiangtan University, Xiangtan, China.,College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Linai Kuang
- College of Computer, Xiangtan University, Xiangtan, China
| | - Zhen Zhang
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Yihong Tan
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Zhiping Chen
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Lei Wang
- College of Computer, Xiangtan University, Xiangtan, China.,College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| |
Collapse
|
13
|
He X, Kuang L, Chen Z, Tan Y, Wang L. Method for Identifying Essential Proteins by Key Features of Proteins in a Novel Protein-Domain Network. Front Genet 2021; 12:708162. [PMID: 34267785 PMCID: PMC8276041 DOI: 10.3389/fgene.2021.708162] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2021] [Accepted: 05/31/2021] [Indexed: 11/21/2022] Open
Abstract
In recent years, due to low accuracy and high costs of traditional biological experiments, more and more computational models have been proposed successively to infer potential essential proteins. In this paper, a novel prediction method called KFPM is proposed, in which, a novel protein-domain heterogeneous network is established first by combining known protein-protein interactions with known associations between proteins and domains. Next, based on key topological characteristics extracted from the newly constructed protein-domain network and functional characteristics extracted from multiple biological information of proteins, a new computational method is designed to effectively integrate multiple biological features to infer potential essential proteins based on an improved PageRank algorithm. Finally, in order to evaluate the performance of KFPM, we compared it with 13 state-of-the-art prediction methods, experimental results show that, among the top 1, 5, and 10% of candidate proteins predicted by KFPM, the prediction accuracy can achieve 96.08, 83.14, and 70.59%, respectively, which significantly outperform all these 13 competitive methods. It means that KFPM may be a meaningful tool for prediction of potential essential proteins in the future.
Collapse
Affiliation(s)
- Xin He
- College of Computer, Xiangtan University, Xiangtan, China
| | - Linai Kuang
- College of Computer, Xiangtan University, Xiangtan, China
| | - Zhiping Chen
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| | - Yihong Tan
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| | - Lei Wang
- College of Computer, Xiangtan University, Xiangtan, China
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| |
Collapse
|
14
|
Meng Z, Kuang L, Chen Z, Zhang Z, Tan Y, Li X, Wang L. Method for Essential Protein Prediction Based on a Novel Weighted Protein-Domain Interaction Network. Front Genet 2021; 12:645932. [PMID: 33815480 PMCID: PMC8010314 DOI: 10.3389/fgene.2021.645932] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2020] [Accepted: 02/15/2021] [Indexed: 01/04/2023] Open
Abstract
In recent years a number of calculative models based on protein-protein interaction (PPI) networks have been proposed successively. However, due to false positives, false negatives, and the incompleteness of PPI networks, there are still many challenges affecting the design of computational models with satisfactory predictive accuracy when inferring key proteins. This study proposes a prediction model called WPDINM for detecting key proteins based on a novel weighted protein-domain interaction (PDI) network. In WPDINM, a weighted PPI network is constructed first by combining the gene expression data of proteins with topological information extracted from the original PPI network. Simultaneously, a weighted domain-domain interaction (DDI) network is constructed based on the original PDI network. Next, through integrating the newly obtained weighted PPI network and weighted DDI network with the original PDI network, a weighted PDI network is further constructed. Then, based on topological features and biological information, including the subcellular localization and orthologous information of proteins, a novel PageRank-based iterative algorithm is designed and implemented on the newly constructed weighted PDI network to estimate the criticality of proteins. Finally, to assess the prediction performance of WPDINM, we compared it with 12 kinds of competitive measures. Experimental results show that WPDINM can achieve a predictive accuracy rate of 90.19, 81.96, 70.72, 62.04, 55.83, and 51.13% in the top 1%, top 5%, top 10%, top 15%, top 20%, and top 25% separately, which exceeds the prediction accuracy achieved by traditional state-of-the-art competing measures. Owing to the satisfactory identification effect, the WPDINM measure may contribute to the further development of key protein identification.
Collapse
Affiliation(s)
- Zixuan Meng
- College of Computer, Xiangtan University, Xiangtan, China
| | - Linai Kuang
- College of Computer, Xiangtan University, Xiangtan, China
| | - Zhiping Chen
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| | - Zhen Zhang
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| | - Yihong Tan
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| | - Xueyong Li
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| | - Lei Wang
- College of Computer, Xiangtan University, Xiangtan, China
- College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, China
| |
Collapse
|
15
|
Zhao B, Hu S, Liu X, Xiong H, Han X, Zhang Z, Li X, Wang L. A Novel Computational Approach for Identifying Essential Proteins From Multiplex Biological Networks. Front Genet 2020; 11:343. [PMID: 32373163 PMCID: PMC7186452 DOI: 10.3389/fgene.2020.00343] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2020] [Accepted: 03/23/2020] [Indexed: 11/13/2022] Open
Abstract
The identification of essential proteins can help in understanding the minimum requirements for cell survival and development. Ever-increasing amounts of high-throughput data provide us with opportunities to detect essential proteins from protein interaction networks (PINs). Existing network-based approaches are limited by the poor quality of the underlying PIN data, which exhibits high rates of false positive and false negative results. To overcome this problem, researchers have focused on the prediction of essential proteins by combining PINs with other biological data, which has led to the emergence of various interactions between proteins. It remains challenging, however, to use aggregated multiplex interactions within a single analysis framework to identify essential proteins. In this study, we created a multiplex biological network (MON) by initially integrating PINs, protein domains, and gene expression profiles. Next, we proposed a new approach to discover essential proteins by extending the random walk with restart algorithm to the tensor, which provides a data model representation of the MON. In contrast to existing approaches, the proposed MON approach considers for the importance of nodes and the different types of interactions between proteins during the iteration. MON was implemented to identify essential proteins within two yeast PINs. Our comprehensive experimental results demonstrated that MON outperformed 11 other state-of-the-art approaches in terms of precision-recall curve, jackknife curve, and other criteria.
Collapse
Affiliation(s)
- Bihai Zhao
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China.,Hunan Provincial Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, China.,Hunan Provincial Key Laboratory of Nutrition and Quality Control of Aquatic Animals, Changsha University, Changsha, China
| | - Sai Hu
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Xiner Liu
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Huijun Xiong
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Xiao Han
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Zhihong Zhang
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China.,Hunan Provincial Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, China
| | - Xueyong Li
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China
| | - Lei Wang
- College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, China.,Hunan Provincial Key Laboratory of Industrial Internet Technology and Security, Changsha University, Changsha, China
| |
Collapse
|