1
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Affiliation(s)
- Pengzhen Jia
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Fuhao Zhang
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
  - College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
- Chaojin Wu
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
2
Zhang C, Gao Q, Li M, Yu T. Implementing link prediction in protein networks via feature fusion models based on graph neural networks. Comput Biol Chem 2024; 108:107980. [PMID: 38000328 DOI: 10.1016/j.compbiolchem.2023.107980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 10/07/2023] [Accepted: 11/02/2023] [Indexed: 11/26/2023]
Abstract
MOTIVATION Protein-protein interactions serve as the cornerstone for various biochemical processes within biological organisms. Existing research methodologies predominantly employ link prediction techniques to analyze these interaction networks. However, traditional approaches often fall short in delivering satisfactory predictive performance when applied to multi-species datasets. Current computational methods largely focus on analyzing the network topology, resulting in a somewhat monolithic feature set. The integration of diverse features in the model could potentially yield superior performance and broader applicability. To this end, we propose an autoencoder model built on graph neural networks, designed to enhance both predictive performance and generalizability by leveraging the integration of gene ontology. RESULTS In this research, we developed AGraphSAGE, a model specifically designed for analyzing protein-protein interaction network data. By seamlessly integrating gene ontology into the graph structure, we employed a dual-channel graph sampling and aggregation network that capitalizes on topological information to process high-dimensional features. Feature fusion is achieved through the implementation of graph attention mechanisms, and we adopted a link prediction framework as the experimental training model. Performance was evaluated on real-world datasets using key metrics, such as Area Under the Curve (AUC). A hyperparameter search space was established, and a Bayesian optimization strategy was applied to iteratively fine-tune the model, assessing the impact of various parameters on predictive efficacy. The experimental results validate that our proposed model is capable of effectively predicting protein-protein interactions across diverse biological species.
Affiliation(s)
- Chi Zhang
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar 161006, China
- Qian Gao
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar 161006, China
- Ming Li
  - College of Computer and Control Engineering, Qiqihar University, Qiqihar 161006, China
- Tianfei Yu
  - College of Life Science and Agriculture Forestry, Qiqihar University, Qiqihar 161006, China
3
Zhao BW, Su XR, Yang Y, Li DX, Li GD, Hu PW, Zhao YG, Hu L. Drug-disease association prediction using semantic graph and function similarity representation learning over heterogeneous information networks. Methods 2023; 220:106-114. [PMID: 37972913 DOI: 10.1016/j.ymeth.2023.10.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 10/13/2023] [Accepted: 10/28/2023] [Indexed: 11/19/2023] Open
Abstract
Discovering new indications for existing drugs is a promising development strategy at various stages of drug research and development. However, most existing computational models complete this task by constructing a variety of heterogeneous networks without considering the higher-order connectivity patterns available in heterogeneous biological information networks, which are believed to be useful for improving the accuracy of drug discovery. To this end, we propose a computational model, called SFRLDDA, for drug-disease association prediction using semantic graph and function similarity representation learning. Specifically, SFRLDDA first constructs a heterogeneous information network (HIN) from drug-disease, drug-protein and protein-disease associations, together with their biological knowledge. Second, different representation learning strategies are applied to obtain feature representations of drugs and diseases from different perspectives, over the semantic graph and the function similarity graphs respectively. Finally, SFRLDDA uses a Random Forest classifier to discover potential drug-disease associations (DDAs). Experimental results demonstrate that SFRLDDA yields the best performance when compared with other state-of-the-art models on three benchmark datasets. Moreover, case studies indicate that the simultaneous consideration of the semantic graph and the function similarity of drugs and diseases in the HIN allows SFRLDDA to predict DDAs precisely and in a more comprehensive manner.
Affiliation(s)
- Bo-Wei Zhao
  - The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
- Xiao-Rui Su
  - The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
- Yue Yang
  - The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
- Dong-Xu Li
  - The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
- Guo-Dong Li
  - The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
- Peng-Wei Hu
  - The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
- Yong-Gang Zhao
  - Department of Orthopaedic Surgery (hand and foot trauma), People's Hospital of Dongxihu, Wuhan 420100, China
- Lun Hu
  - The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
4
Li DX, Zhou P, Zhao BW, Su XR, Li GD, Zhang J, Hu PW, Hu L. BioCAIV: an integrative webserver for motif-based clustering analysis and interactive visualization of biological networks. BMC Bioinformatics 2023; 24:451. [PMID: 38030973 PMCID: PMC10685597 DOI: 10.1186/s12859-023-05574-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Accepted: 11/20/2023] [Indexed: 12/01/2023] Open
Abstract
BACKGROUND As an important task in bioinformatics, clustering analysis plays a critical role in understanding the functional mechanisms of many complex biological systems, which can be modeled as biological networks. The purpose of clustering analysis in biological networks is to identify functional modules of interest, but there is a lack of online clustering tools that visualize biological networks and provide in-depth biological analysis for discovered clusters. RESULTS Here we present BioCAIV, a novel webserver designed to maximize accessibility and applicability in the clustering analysis of biological networks. This, together with its user-friendly interface, helps biological researchers perform an accurate clustering analysis of biological networks and identify functionally significant modules for further assessment. CONCLUSIONS BioCAIV is an efficient clustering analysis webserver designed for a variety of biological networks. BioCAIV is freely available without registration requirements at http://bioinformatics.tianshanzw.cn:8888/BioCAIV/ .
Affiliation(s)
- Dong-Xu Li
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China
- Peng Zhou
  - School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Bo-Wei Zhao
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China
- Xiao-Rui Su
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China
- Guo-Dong Li
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China
- Jun Zhang
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China
- Peng-Wei Hu
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China
- Lun Hu
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Ürümqi, China
5
Ma C, Shi Y, Huang Y, Dai G. Raman spectroscopy-based prediction of ofloxacin concentration in solution using a novel loss function and an improved GA-CNN model. BMC Bioinformatics 2023; 24:409. [PMID: 37904084 PMCID: PMC10617066 DOI: 10.1186/s12859-023-05542-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2023] [Accepted: 10/20/2023] [Indexed: 11/01/2023] Open
Abstract
BACKGROUND Raman spectroscopy can measure the concentration of ofloxacin in solution more quickly and accurately than traditional detection methods. However, manual analysis of the collected Raman spectral data often ignores the nonlinear characteristics of the data and cannot accurately predict the concentration of the target sample. METHODS To address this drawback, this paper proposes a novel kernel-Huber loss function that combines the Huber loss function with the Gaussian kernel function. This function is used with an improved genetic algorithm-convolutional neural network (GA-CNN) to model and predict the Raman spectral data of different concentrations of ofloxacin in solution. In addition, the paper introduces recurrent neural network (RNN), long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM) and gated recurrent unit (GRU) models for comparison experiments, using root mean square error (RMSE) and residual predictive deviation (RPD) as evaluation metrics. RESULTS The proposed method achieved an [Formula: see text] of 0.9989 on the test set and improved by 3% over the traditional CNN. Multiple experiments were also conducted using RNN, LSTM, BiLSTM and GRU models, whose performance was evaluated using RMSE, RPD and other metrics. The results showed that the proposed method consistently outperformed these models. CONCLUSIONS This paper demonstrates the effectiveness of the proposed method for predicting the concentration of ofloxacin in solution based on Raman spectral data and discusses its advantages and limitations. The study also proposes a solution to the problem of applying deep learning methods to Raman spectral concentration prediction.
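The evaluation metrics and the base loss named in this abstract can be illustrated with a minimal sketch. It assumes one common definition of RPD (sample standard deviation of the reference values divided by RMSE) and shows only the standard Huber loss; the abstract does not specify how the Gaussian kernel is combined with it, so that part is not reproduced. The function names and the `delta` default are illustrative choices, not taken from the paper.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rpd(y_true, y_pred):
    """Residual predictive deviation, assumed here to be
    stdev(reference values) / RMSE; other variants exist."""
    return statistics.stdev(y_true) / rmse(y_true, y_pred)


def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear in the tails.
    This is NOT the paper's kernel-Huber loss, only its Huber component."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)


# Hypothetical concentrations, for illustration only.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(rmse(reference, predicted), rpd(reference, predicted))
```

A higher RPD indicates that the prediction error is small relative to the spread of the reference values, which is why it is often reported alongside RMSE.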
Affiliation(s)
- Chenyu Ma
  - School of Information and Control Engineering, Liaoning Petrochemical University, Fushun, 113001, China
- Yuanbo Shi
  - School of Artificial Intelligence and Software, Liaoning Petrochemical University, Fushun, 113001, China
- Yueyang Huang
  - School of Information and Control Engineering, Liaoning Petrochemical University, Fushun, 113001, China
- Gongwei Dai
  - School of Artificial Intelligence and Software, Liaoning Petrochemical University, Fushun, 113001, China
6
Luo X, Wang L, Hu P, Hu L. Predicting Protein-Protein Interactions Using Sequence and Network Information via Variational Graph Autoencoder. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:3182-3194. [PMID: 37155405 DOI: 10.1109/tcbb.2023.3273567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/10/2023]
Abstract
Protein-protein interactions (PPIs) play a critical role in proteomics studies, and a variety of computational algorithms have been developed to predict PPIs. Though effective, their performance is constrained by the high false-positive and false-negative rates observed in PPI data. To overcome this problem, a novel PPI prediction algorithm, namely PASNVGA, is proposed in this work by combining the sequence and network information of proteins via a variational graph autoencoder. To do so, PASNVGA first applies different strategies to extract the features of proteins from their sequence and network information, and obtains a more compact form of these features using principal component analysis. In addition, PASNVGA designs a scoring function to measure the higher-order connectivity between proteins so as to obtain a higher-order adjacency matrix. With all these features and adjacency matrices, PASNVGA trains a variational graph autoencoder model to further learn the integrated embeddings of proteins. The prediction task is then completed by using a simple feedforward neural network. Extensive experiments have been conducted on five PPI datasets collected from different species. Compared with several state-of-the-art algorithms, PASNVGA has been demonstrated to be a promising PPI prediction algorithm.
7
Wu J, Liu Z, Yang X, Lin Z. Improved compound-protein interaction site and binding affinity prediction using self-supervised protein embeddings. BMC Bioinformatics 2022; 23:543. [PMID: 36526969 PMCID: PMC9756525 DOI: 10.1186/s12859-022-05107-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 12/09/2022] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Compound-protein interaction site and binding affinity predictions are crucial for drug discovery and drug design. In recent years, many deep learning-based methods have been proposed for predictions related to compound-protein interaction. For protein inputs, how protein primary sequence and tertiary structure information is used has an impact on prediction results. RESULTS In this study, we propose a deep learning model based on a multi-objective neural network for compound-protein interaction site and binding affinity prediction. We used several kinds of self-supervised protein embeddings to enrich our protein inputs and used convolutional neural networks to extract features from them. Our results demonstrate that our model improved on previous models in both interaction site prediction and affinity prediction. In a case study, our model also predicted binding sites better, which further showed its effectiveness. CONCLUSION These results suggest that our model could be a helpful tool for compound-protein related predictions.
Affiliation(s)
- Jialin Wu
  - School of Biology and Biological Engineering, South China University of Technology, 382 East Outer Loop Road, University Park, Guangzhou, 510006 Guangdong, China
- Zhe Liu
  - School of Biology and Biological Engineering, South China University of Technology, 382 East Outer Loop Road, University Park, Guangzhou, 510006 Guangdong, China
- Xiaofeng Yang
  - School of Biology and Biological Engineering, South China University of Technology, 382 East Outer Loop Road, University Park, Guangzhou, 510006 Guangdong, China
- Zhanglin Lin
  - School of Biology and Biological Engineering, South China University of Technology, 382 East Outer Loop Road, University Park, Guangzhou, 510006 Guangdong, China
8
Zhang ML, Zhao BW, Su XR, He YZ, Yang Y, Hu L. RLFDDA: a meta-path based graph representation learning model for drug-disease association prediction. BMC Bioinformatics 2022; 23:516. [PMID: 36456957 PMCID: PMC9713188 DOI: 10.1186/s12859-022-05069-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/27/2022] [Accepted: 11/21/2022] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Drug repositioning is a very important task that provides critical information for exploring the potential efficacy of drugs. Yet developing computational models that can effectively predict drug-disease associations (DDAs) is still a challenging task. Previous studies suggest that the accuracy of DDA prediction can be improved by integrating different types of biological features. But how to conduct an effective integration remains a challenging problem for accurately discovering new indications for approved drugs. METHODS In this paper, we propose a novel meta-path based graph representation learning model, namely RLFDDA, to predict potential DDAs on heterogeneous biological networks. RLFDDA first calculates drug-drug similarities and disease-disease similarities as the intrinsic biological features of drugs and diseases. A heterogeneous network is then constructed by integrating DDAs, disease-protein associations and drug-protein associations. With such a network, RLFDDA adopts a meta-path random walk model to learn the latent representations of drugs and diseases, which are concatenated to construct joint representations of drug-disease associations. As the last step, we employ the random forest classifier to predict potential DDAs with their joint representations. RESULTS To demonstrate the effectiveness of RLFDDA, we have conducted a series of experiments on two benchmark datasets by following a ten-fold cross-validation scheme. The results show that RLFDDA yields the best performance in terms of AUC and F1-score when compared with several state-of-the-art DDA prediction models. We have also conducted a case study on a common drug and a common disease, i.e., paclitaxel and lung tumors, and found that 7 out of the top-10 diseases and 8 out of the top-10 drugs have already been validated for paclitaxel and lung tumors respectively with literature evidence. Hence, the promising performance of RLFDDA may provide a new perspective for novel DDA discovery over heterogeneous networks.
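The meta-path random walk step mentioned in the METHODS part can be sketched as follows. This is a generic illustration of a type-constrained walk on a toy heterogeneous network, not RLFDDA's actual implementation; the graph, the node names, the function `metapath_walk` and the chosen meta-path (drug, protein, disease, protein, drug) are all assumptions made for the example.

```python
import random

# Toy heterogeneous network (hypothetical nodes): drugs d*, proteins p*, disease s1.
NODE_TYPE = {"d1": "drug", "d2": "drug", "p1": "protein", "p2": "protein", "s1": "disease"}
NEIGHBORS = {
    "d1": ["p1"],
    "d2": ["p2"],
    "p1": ["d1", "s1"],
    "p2": ["d2", "s1"],
    "s1": ["p1", "p2"],
}


def metapath_walk(neighbors, node_type, start, metapath, walk_length, rng):
    """Random walk whose node types follow a cyclic meta-path.

    `metapath` starts and ends with the same type (e.g. drug ... drug),
    so the repeated last type is dropped and the pattern is cycled.
    The walk stops early if no neighbor of the required type exists.
    """
    pattern = metapath[:-1]
    walk = [start]
    pos = 0
    while len(walk) < walk_length:
        pos = (pos + 1) % len(pattern)
        wanted = pattern[pos]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break
        walk.append(rng.choice(candidates))
    return walk


walk = metapath_walk(
    NEIGHBORS, NODE_TYPE, "d1",
    ["drug", "protein", "disease", "protein", "drug"],
    walk_length=9, rng=random.Random(0),
)
print(walk)
```

In approaches of this kind, many such walks are typically collected and passed to a skip-gram style model to obtain node embeddings; whether RLFDDA does exactly this is not stated in the abstract.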
Affiliation(s)
- Meng-Long Zhang
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, China
- Bo-Wei Zhao
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, China
- Xiao-Rui Su
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, China
- Yi-Zhou He
  - School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Yue Yang
  - School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
- Lun Hu
  - The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, China
9
Multiple instance neural networks based on sparse attention for cancer detection using T-cell receptor sequences. BMC Bioinformatics 2022; 23:469. [PMID: 36348271 PMCID: PMC9644450 DOI: 10.1186/s12859-022-05012-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/18/2022] [Accepted: 10/26/2022] [Indexed: 11/11/2022] Open
Abstract
Early detection of cancers has been much explored due to its paramount importance in biomedical fields. Among the different types of data used to answer this biological question, studies based on T cell receptors (TCRs) have recently come under the spotlight due to the growing appreciation of the roles of the host immune system in tumor biology. However, the one-to-many correspondence between a patient and multiple TCR sequences hinders researchers from simply adopting classical statistical/machine learning methods. There have been recent attempts to model this type of data in the context of multiple instance learning (MIL). Despite the novel application of MIL to cancer detection using TCR sequences and the demonstrated adequate performance in several tumor types, there is still room for improvement, especially for certain cancer types. Furthermore, explainable neural network models have not been fully investigated for this application. In this article, we propose multiple instance neural networks based on sparse attention (MINN-SA) to enhance the performance in cancer detection and explainability. The sparse attention structure drops out uninformative instances in each bag, achieving both interpretability and better predictive performance in combination with the skip connection. Our experiments show that MINN-SA yields the highest area under the ROC curve scores on average measured across 10 different types of cancers, compared to existing MIL approaches. Moreover, we observe from the estimated attentions that MINN-SA can identify the TCRs that are specific for tumor antigens in the same T cell repertoire.
10
Hu L, Li Z, Tang Z, Zhao C, Zhou X, Hu P. Effectively predicting HIV-1 protease cleavage sites by using an ensemble learning approach. BMC Bioinformatics 2022; 23:447. [PMID: 36303135 PMCID: PMC9608884 DOI: 10.1186/s12859-022-04999-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2022] [Accepted: 10/13/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The site information of substrates that can be cleaved by human immunodeficiency virus 1 proteases (HIV-1 PRs) is of great significance for designing effective inhibitors against HIV-1 viruses. A variety of machine learning-based algorithms have been developed to predict HIV-1 PR cleavage sites by extracting relevant features from substrate sequences. However, relying only on the sequence information is not sufficient to ensure a promising performance due to the uncertainty in the way of separating the datasets used for training and testing. Moreover, the existence of noisy data, i.e., false positive and false negative cleavage sites, could negatively influence the prediction accuracy. RESULTS In this work, an ensemble learning algorithm for predicting HIV-1 PR cleavage sites, namely EM-HIV, is proposed by training a set of weak learners, i.e., biased support vector machine classifiers, with the asymmetric bagging strategy. By doing so, the impact of data imbalance and noisy data can be alleviated. Besides, in order to make full use of substrate sequences, the features used by EM-HIV are collected from three different coding schemes, including amino acid identities, chemical properties and variable-length coevolutionary patterns, for the purpose of constructing more relevant feature vectors of octamers. Experimental results on three independent benchmark datasets demonstrate that EM-HIV outperforms state-of-the-art prediction algorithms in terms of several evaluation metrics. Hence, EM-HIV can be regarded as a useful tool to accurately predict HIV-1 PR cleavage sites.
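The asymmetric bagging strategy mentioned in this abstract can be illustrated with a minimal sketch. It follows one common formulation, assumed here rather than taken from the paper: every bag keeps all examples of the minority class (taken to be the positive, cleaved octamers) and pairs them with an equally sized random subset of the majority class, and the trained weak learners are combined by majority vote. The function names are hypothetical, negatives are drawn without replacement within a bag for simplicity, and the biased SVM training itself is left out.

```python
import random


def asymmetric_bags(positives, negatives, n_bags, rng):
    """Build balanced training bags: all positives plus a random
    subset of negatives of the same size (assumed formulation)."""
    k = min(len(positives), len(negatives))
    bags = []
    for _ in range(n_bags):
        sampled_negatives = rng.sample(negatives, k)
        bags.append((list(positives), sampled_negatives))
    return bags


def majority_vote(classifiers, x):
    """Combine weak learners (callables returning True/False) by strict majority."""
    votes = sum(1 for clf in classifiers if clf(x))
    return votes * 2 > len(classifiers)


# Made-up example identifiers standing in for octamer feature vectors.
positives = ["pos_1", "pos_2", "pos_3"]
negatives = [f"neg_{i}" for i in range(20)]
bags = asymmetric_bags(positives, negatives, n_bags=5, rng=random.Random(0))
print(len(bags), [len(neg) for _, neg in bags])
```

Each bag would then be used to train one weak learner, so that no single learner sees the full, imbalanced negative set.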
Affiliation(s)
- Lun Hu
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Zhenfeng Li
  - School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Zehai Tang
  - School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Cheng Zhao
  - School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Xi Zhou
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Pengwei Hu
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China