1. Dhall A, Patiyal S, Raghava GPS. A hybrid method for discovering interferon-gamma inducing peptides in human and mouse. Sci Rep 2024; 14:26859. [PMID: 39501025; PMCID: PMC11538504; DOI: 10.1038/s41598-024-77957-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Accepted: 10/28/2024] [Indexed: 11/08/2024] Open
Abstract
Interferon-gamma (IFN-γ) is a pleiotropic cytokine essential for both innate and adaptive immune responses. It exhibits both pro-inflammatory and anti-inflammatory properties, making it a promising therapeutic candidate for treating various infectious diseases and cancers. We present IFNepitope2, a host-specific technique for annotating IFN-γ inducing peptides; it is an updated version of IFNepitope, introduced by Dhanda et al. The dataset used to develop the prediction method contains 25,492 and 7,983 experimentally validated IFN-γ inducing peptides in human and mouse hosts, respectively. In the initial phase, machine learning techniques were used to develop classification models using a wide range of peptide features. Next, to improve on these machine learning-based (alignment-free) models, we explored the potential of the similarity-based technique BLAST. Finally, a hybrid model was developed that combines the best machine learning-based model with BLAST. In most cases, models based on extra trees performed better than other machine learning techniques. Among peptide features, compositional features, particularly dipeptide composition, performed better than one-hot encoding or binary profiles. Our best machine learning-based models achieved AUROCs of 0.89 and 0.83 for human and mouse hosts, respectively. The hybrid model achieved AUROCs of 0.90 and 0.85 for human and mouse hosts, respectively. All models were evaluated on an independent validation dataset not used for training or testing. The newly developed method performs better than the existing method on the independent dataset. The major objective of this study is to predict, design, and scan IFN-γ inducing peptides; to that end, a web server and software have been developed ( https://webs.iiitd.edu.in/raghava/ifnepitope2/ ). The method is also available as a standalone package at https://github.com/raghavagps/ifnepitope2 and on the Python Package Index at https://pypi.org/project/ifnepitope2/ .
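The abstract singles out dipeptide composition as the best-performing feature. As a rough illustration of what that feature looks like, the sketch below computes a 400-dimensional vector of adjacent residue-pair percentages for a peptide. The function name, the handling of non-standard residues (skipped), and the percentage scaling are assumptions made here for illustration; this is not the IFNepitope2 implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide):
    """Return 400 values: the percentage of each adjacent residue pair.

    Assumption: pairs containing a non-standard residue are skipped,
    and percentages are taken over the pairs that were counted.
    """
    peptide = peptide.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(peptide) - 1):
        pair = peptide[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Peptide too short (or no valid pairs): return an all-zero vector.
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[d] / total for d in DIPEPTIDES]
```

A vector like this, computed per peptide, could then be passed to any standard classifier such as an extra-trees model.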
Affiliation(s)
- Anjali Dhall
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, (Near Govind Puri Metro Station), New Delhi, 110020, India
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, (Near Govind Puri Metro Station), New Delhi, 110020, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, (Near Govind Puri Metro Station), New Delhi, 110020, India
2. Patiyal S, Tiwari P, Ghai M, Dhapola A, Dhall A, Raghava GPS. A hybrid approach for predicting transcription factors. Front Bioinform 2024; 4:1425419. [PMID: 39119181; PMCID: PMC11306938; DOI: 10.3389/fbinf.2024.1425419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 07/03/2024] [Indexed: 08/10/2024] Open
Abstract
Transcription factors are essential DNA-binding proteins that regulate the transcription rate of several genes and control the expression of genes inside a cell. The prediction of transcription factors with high precision is important for understanding biological processes such as cell differentiation, intracellular signaling, and cell-cycle control. In this study, we developed a hybrid method that combines alignment-based and alignment-free methods for predicting transcription factors with higher accuracy. All models have been trained, tested, and evaluated on a large dataset that contains 19,406 transcription factors and 523,560 non-transcription factor protein sequences. To avoid biases in evaluation, the datasets were divided into training and validation/independent datasets, where 80% of the data was used for training, and the remaining 20% was used for external validation. In the case of alignment-free methods, models were developed using machine learning techniques and the composition-based features of a protein. Our best alignment-free model obtained an AUC of 0.97 on an independent dataset. In the case of the alignment-based method, we used BLAST at different cut-offs to predict the transcription factors. Although the alignment-based method demonstrated excellent performance, it was unable to cover all transcription factors due to instances of no hits. To combine the strengths of both methods, we developed a hybrid method that combines alignment-free and alignment-based methods. In the hybrid method, we added the scores of the alignment-free and alignment-based methods and achieved a maximum AUC of 0.99 on the independent dataset. The method proposed in this study performs better than existing methods. We incorporated the best models in the webserver/Python Package Index/standalone package of "TransFacPred" (https://webs.iiitd.edu.in/raghava/transfacpred).
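The hybrid step described above, adding the alignment-free and alignment-based scores, can be sketched as follows. The abstract states only that the two scores were added; the specific BLAST scoring used here (+0.5 for a top hit to a known transcription factor, -0.5 for a top hit to a non-transcription factor, 0 when BLAST returns no hit) and the function name are assumptions for illustration.

```python
def hybrid_score(ml_score, blast_hit=None, blast_weight=0.5):
    """Combine an alignment-free score with an alignment-based one.

    ml_score:   classifier probability in [0, 1] (alignment-free).
    blast_hit:  "positive", "negative", or None when BLAST finds no hit.
    blast_weight is a hypothetical value, not taken from the paper.
    """
    if blast_hit == "positive":
        blast_score = blast_weight
    elif blast_hit == "negative":
        blast_score = -blast_weight
    elif blast_hit is None:
        # No hit: fall back on the machine learning score alone.
        blast_score = 0.0
    else:
        raise ValueError(f"unexpected blast_hit value: {blast_hit!r}")
    return ml_score + blast_score
```

Because a missing hit contributes zero, every sequence still receives a score, which is how a combination of this kind addresses the coverage gap of the alignment-based method.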
Affiliation(s)
- Gajendra P. S. Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
3. Huang G, Tang X, Zheng P. DeepHLAPred: a deep learning-based method for non-classical HLA binder prediction. BMC Genomics 2023; 24:706. [PMID: 37993812; PMCID: PMC10666343; DOI: 10.1186/s12864-023-09796-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Accepted: 11/08/2023] [Indexed: 11/24/2023] Open
Abstract
Human leukocyte antigen (HLA) is closely involved in regulating the human immune system. Despite great advances in detecting classical HLA Class I binders, there are few methods or toolkits for recognizing non-classical HLA Class I binders. To fill this gap, we have developed a deep learning-based tool called DeepHLAPred. DeepHLAPred uses electron-ion interaction pseudo potential, integer numerical mapping, and accumulated amino acid frequency as the initial representation of a non-classical HLA binder sequence. A deep learning module then refines these into high-level representations. The module comprises two parallel convolutional neural networks, each followed by a maximum pooling layer, a dropout layer, and a bi-directional long short-term memory network. The experimental results show that DeepHLAPred reaches state-of-the-art performance on both the cross-validation test and the independent test. Extensive testing demonstrates the rationality of DeepHLAPred. We further analyzed the sequence patterns of non-classical HLA Class I binders using information entropy, which reveals these patterns to a certain extent. In addition, we have developed a user-friendly web server, available at http://www.biolscience.cn/DeepHLApred/ . The tool and the analysis are helpful for detecting non-classical HLA Class I binders. The source code and data are available at https://github.com/tangxingyu0/DeepHLApred .
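Two of the initial representations named above, integer numerical mapping and accumulated amino acid frequency, are simple enough to sketch. The code below is one plausible reading and not the DeepHLAPred implementation: the integer assignment (alphabetical, with 0 reserved for unknown residues) and the definition of accumulated frequency (the count of the current residue in the prefix ending at position i, divided by i) are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical mapping: residues numbered 1..20, 0 reserved for unknowns.
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def integer_mapping(seq):
    """Map each residue to an integer code (0 for non-standard residues)."""
    return [AA_TO_INT.get(aa, 0) for aa in seq.upper()]


def accumulated_frequency(seq):
    """For position i (1-based), return how often the residue at i has
    occurred in seq[:i], divided by i."""
    seen = {}
    values = []
    for i, aa in enumerate(seq.upper(), start=1):
        seen[aa] = seen.get(aa, 0) + 1
        values.append(seen[aa] / i)
    return values
```

Both functions return one number per residue, so their outputs can be stacked position by position as input channels for a convolutional network.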
Affiliation(s)
- Guohua Huang
  - School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, Hunan, 410215, China
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Xingyu Tang
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Peijie Zheng
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
4. Naorem LD, Sharma N, Raghava GPS. A web server for predicting and scanning of IL-5 inducing peptides using alignment-free and alignment-based method. Comput Biol Med 2023; 158:106864. [PMID: 37058758; DOI: 10.1016/j.compbiomed.2023.106864] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/01/2022] [Revised: 03/06/2023] [Accepted: 03/30/2023] [Indexed: 04/16/2023]
Abstract
Interleukin-5 (IL-5) is an attractive therapeutic target because of its pivotal role in several eosinophil-mediated diseases. The aim of this study is to develop a model for predicting IL-5 inducing antigenic regions in a protein with high precision. All models in this study were trained, tested, and validated on 1907 experimentally validated IL-5 inducing and 7759 non-IL-5 inducing peptides obtained from IEDB. Our primary analysis indicates that IL-5 inducing peptides are dominated by certain residues such as Ile, Asn, and Tyr. It was also observed that binders of a wide range of HLA alleles can induce IL-5. Initially, alignment-based methods were developed using similarity and motif search. These alignment-based methods provide high precision but poor coverage. To overcome this limitation, we explored alignment-free methods, which are mainly machine learning-based models. First, models were developed using binary profiles, and an eXtreme Gradient Boosting-based model achieved a maximum AUC of 0.59. Second, composition-based models were developed, and our dipeptide-based random forest model achieved a maximum AUC of 0.74. Third, a random forest model developed using 250 selected dipeptides achieved an AUC of 0.75 and an MCC of 0.29 on the validation dataset, the best among the alignment-free models. To improve performance, we developed an ensemble or hybrid method that combines alignment-based and alignment-free methods. Our hybrid method achieved an AUC of 0.94 with an MCC of 0.60 on a validation/independent dataset. The best hybrid model developed in this study has been incorporated into a user-friendly web server and a standalone package named 'IL5pred' (https://webs.iiitd.edu.in/raghava/il5pred/).
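The abstract reports the Matthews correlation coefficient (MCC) alongside AUC, which is informative here because the dataset is imbalanced (1907 positives against 7759 negatives). For reference, MCC can be computed from a confusion matrix as in the minimal sketch below. The function name is chosen for illustration, and returning 0.0 when the denominator is zero is a common convention rather than something stated in the source.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: undefined MCC (an empty row or column) is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike accuracy, this value stays near zero for a classifier that simply predicts the majority class, which is why it is a useful companion to AUC on skewed data.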
Affiliation(s)
- Leimarembi Devi Naorem
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Neelam Sharma
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
5. Jing Y, Zhang S, Wang H. DapNet-HLA: Adaptive dual-attention mechanism network based on deep learning to predict non-classical HLA binding sites. Anal Biochem 2023; 666:115075. [PMID: 36740003; DOI: 10.1016/j.ab.2023.115075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 01/30/2023] [Accepted: 02/02/2023] [Indexed: 02/05/2023]
Abstract
Human leukocyte antigen (HLA) plays a vital role in immunomodulation. Studies have shown that immunotherapy based on non-classical HLA has important applications in cancer, COVID-19, and allergic diseases. However, there are few deep learning methods for predicting non-classical HLA alleles. In this work, an adaptive dual-attention network named DapNet-HLA is established based on existing datasets. First, amino acid sequences are transformed into numerical vectors using a lookup table. To overcome the feature sparsity problem caused by one-hot encoding, a fused word embedding method is used to map each amino acid to a low-dimensional word vector that is optimized during training of the classifier. Then, we use the GCB (group convolution block), SENet attention (squeeze-and-excitation networks), BiLSTM (bidirectional long short-term memory network), and the Bahdanau attention mechanism to construct the classifier. SENet assigns higher weights to the more informative feature maps, so that the model can be trained to achieve better results. The attention mechanism, introduced for encoder-decoder models, is used to improve the effectiveness of RNN, LSTM, or GRU (gated recurrent unit) networks. The ablation experiment shows that DapNet-HLA has the best adaptability across five datasets. On the five test datasets, the ACC and MCC of DapNet-HLA are 4.89% and 0.0933 higher than those of the comparison method, respectively. In the ROC and PR curves from 5-fold cross-validation, the AUC of each fold fluctuates only slightly, which indicates the robustness of DapNet-HLA. The code and datasets are accessible at https://github.com/JYY625/DapNet-HLA.
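The role of the squeeze-and-excitation step, re-weighting feature maps so that informative channels count for more, can be illustrated with a small dependency-free sketch. This is a generic squeeze-and-excitation recalibration, not the DapNet-HLA code: the weights are random rather than trained, biases are omitted, and the channel count, sequence length, and reduction ratio are hypothetical values.

```python
import math
import random


def se_recalibrate(feature_maps, w1, w2):
    """Squeeze-and-excitation over 1D feature maps.

    feature_maps: list of C channels, each a list of L activations.
    w1: (C // r) x C weight matrix; w2: C x (C // r) weight matrix.
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average pooling gives one summary value per channel.
    z = [sum(channel) / len(channel) for channel in feature_maps]
    # Excitation: bottleneck layer with ReLU, then a layer with sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every activation in a channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, feature_maps)]
    return scaled, gates


# Toy usage with random (untrained) weights; all sizes are hypothetical.
rng = random.Random(0)
C, L, r = 4, 6, 2
x = [[rng.random() for _ in range(L)] for _ in range(C)]
w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[rng.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
scaled, gates = se_recalibrate(x, w1, w2)
```

Each gate lies strictly between 0 and 1, so the step can only damp a channel relative to the others; in a trained network the gates are learned so that less useful channels are damped more.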
Affiliation(s)
- Yuanyuan Jing
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Houqiang Wang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
6. Pande A, Patiyal S, Lathwal A, Arora C, Kaur D, Dhall A, Mishra G, Kaur H, Sharma N, Jain S, Usmani SS, Agrawal P, Kumar R, Kumar V, Raghava GPS. Pfeature: A Tool for Computing Wide Range of Protein Features and Building Prediction Models. J Comput Biol 2023; 30:204-222. [PMID: 36251780; DOI: 10.1089/cmb.2022.0241] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Indexed: 01/24/2023] Open
Abstract
In the last three decades, a wide range of protein features have been discovered for annotating proteins. Numerous attempts have been made to integrate these features in a software package or platform so that users can compute a wide range of features from a single source. To complement the existing methods, we developed Pfeature, a method for computing a wide range of protein features. Pfeature allows users to compute more than 200,000 features required for predicting the overall function of a protein, residue-level annotation of a protein, and the function of chemically modified peptides. It has six major modules, namely, composition, binary profiles, evolutionary information, structural features, patterns, and model building. The composition module computes most of the existing compositional features, plus novel features. The binary profile module encodes each type of residue in an amino acid sequence as well as its position. The evolutionary information module computes the evolutionary information of a protein in the form of a position-specific scoring matrix profile generated using Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST); it is suited to annotating a protein and its residues. The structural module computes structural features/descriptors from the tertiary structure of a protein. These features are suitable for predicting the therapeutic potential of a protein containing non-natural or chemically modified residues. The model-building module implements various machine learning techniques for developing classification and regression models, as well as feature selection. Pfeature also allows the generation of overlapping patterns and features from a protein. A user-friendly Pfeature is available as a web server, Python library, and stand-alone package.
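Two of the ideas described above, overlapping patterns and binary profiles, can be sketched in a few lines. The code below is written independently for illustration and does not use or reproduce the Pfeature API; the function names, the window size in the usage note, and the all-zero encoding of non-standard residues are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def overlapping_patterns(sequence, window):
    """Return every window-length segment, sliding one residue at a time."""
    if window <= 0:
        raise ValueError("window must be positive")
    # A sequence shorter than the window yields an empty list.
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]


def binary_profile(pattern):
    """One-hot encode a pattern: 20 values per residue, position preserved.

    Assumption: a non-standard residue is encoded as 20 zeros.
    """
    profile = []
    for aa in pattern.upper():
        profile.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return profile
```

For example, `overlapping_patterns("ACDEF", 3)` yields `ACD`, `CDE`, and `DEF`, and each of those patterns maps to a 60-value binary profile that can serve as the input for a residue-level prediction.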
Affiliation(s)
- Akshara Pande
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Anjali Lathwal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Chakit Arora
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Dilraj Kaur
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Anjali Dhall
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Gaurav Mishra
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
  - Department of Electrical Engineering, Shiv Nadar University, Greater Noida, India
- Harpreet Kaur
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Neelam Sharma
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Shipra Jain
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Salman Sadullah Usmani
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Piyush Agrawal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Rajesh Kumar
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Vinod Kumar
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
7. Patiyal S, Dhall A, Bajaj K, Sahu H, Raghava GPS. Prediction of RNA-interacting residues in a protein using CNN and evolutionary profile. Brief Bioinform 2023; 24:6901899. [PMID: 36516298; DOI: 10.1093/bib/bbac538] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/08/2022] [Revised: 09/28/2022] [Accepted: 11/08/2022] [Indexed: 12/15/2022] Open
Abstract
This paper describes Pprint2, an improved version of Pprint, a method developed for predicting RNA-interacting residues in a protein. The training and independent/validation datasets used in this study comprise 545 and 161 non-redundant RNA-binding proteins, respectively. All models were trained on the training dataset and evaluated on the validation dataset. The preliminary analysis reveals that positively charged amino acids, such as H, R, and K, are more prominent among the RNA-interacting residues. Initially, machine learning-based models were developed using binary profiles and obtained a maximum area under the curve (AUC) of 0.68 on the validation dataset. Performance improved significantly, from an AUC of 0.68 to 0.76, when evolutionary profiles were used instead of binary profiles. The performance of our evolutionary profile-based model improved further, from an AUC of 0.76 to 0.82, when a convolutional neural network was used to develop the model. Our final model, based on a convolutional neural network using evolutionary information, achieved an AUC of 0.82 with a Matthews correlation coefficient of 0.49 on the validation dataset. Our best model outperforms existing methods when evaluated on the independent/validation dataset. User-friendly standalone software and a web-based server named 'Pprint2' have been developed for predicting RNA-interacting residues (https://webs.iiitd.edu.in/raghava/pprint2 and https://github.com/raghavagps/pprint2).
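The comparisons above rest on AUC values (0.68, 0.76, 0.82). As a reminder of what that number measures, the sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen positive residue receives a higher score than a randomly chosen negative one. The function name and the toy labels and scores are illustrative assumptions, and the pairwise loop is quadratic, so it is meant for small examples only.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for positive (e.g. RNA-interacting), 0 for negative.
    scores: predicted scores, higher meaning more likely positive.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.2, 0.1]`, three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75; a value of 0.5 corresponds to random ranking.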
Affiliation(s)
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Anjali Dhall
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Khushboo Bajaj
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Harshita Sahu
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India