1
Han KS, Kim HK, Kim MH, Pak MH, Pak SJ, Choe MM, Kim CS. PredIDR2: Improving accuracy of protein intrinsic disorder prediction by updating deep convolutional neural network and supplementing DisProt data. Int J Biol Macromol 2025; 306:141801. [PMID: 40054813 DOI: 10.1016/j.ijbiomac.2025.141801] [Received: 12/06/2024] [Revised: 03/03/2025] [Accepted: 03/04/2025] [Indexed: 05/11/2025]
Abstract
Intrinsically disordered proteins (IDPs) or regions (IDRs) are widespread in proteomes, are involved in several important biological processes, and are implicated in many diseases. Many computational methods for IDR prediction are being developed to narrow the gap between the slow experimental annotation of proteins and the rapid growth in the number of non-annotated proteins, and their performance is blindly tested in the community-driven experiment, the Critical Assessment of protein Intrinsic Disorder (CAID). In this paper, we developed the PredIDR2 series, an updated version of PredIDR (which was tested in CAID2), to accurately predict intrinsically disordered regions from protein sequence. It includes four methods that differ in their input features and in how the negative samples of the training set are produced. The PredIDR2 series (AUC_ROC = 0.952) performs markedly better than our previous PredIDR (AUC_ROC = 0.933) on the Disorder-PDB dataset of CAID2, which seems to be mainly attributable to the introduction of a new deep convolutional neural network and the augmentation of the training data, especially from the DisProt database. The PredIDR2 series outperforms the state-of-the-art IDR prediction methods that participated in CAID2 in terms of AUC_ROC, AUC_PR and DC_mae and is among the seven top-performing methods in terms of MCC. The PredIDR2 series can be freely used through the CAID Prediction Portal available at https://caid.idpcentral.org/portal or downloaded as a Singularity container from https://biocomputingup.it/shared/caid-predictors/.
Affiliation(s)
- Kun-Sop Han
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Ha-Kyong Kim
  - Branch of Biotechnology, State Academy of Sciences, Pyongyang, Democratic People's Republic of Korea
- Myong-Hyok Kim
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Myong-Hyon Pak
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Song-Jin Pak
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Mun-Myong Choe
  - University of Science and Technology, Pyongyang, Democratic People's Republic of Korea
- Chol-Song Kim
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
2
Xiong S, Cai J, Shi H, Cui F, Zhang Z, Wei L. UMPPI: Unveiling Multilevel Protein-Peptide Interaction Prediction via Language Models. J Chem Inf Model 2025; 65:3789-3799. [PMID: 40077987 DOI: 10.1021/acs.jcim.4c02365] [Indexed: 03/14/2025]
Abstract
Protein-peptide interactions are essential to cellular processes and disease mechanisms. Identifying protein-peptide binding residues is critical for understanding peptide function and advancing drug discovery. However, experimental methods are costly and time-intensive, while existing computational approaches often predict interactions or binding residues separately, lack effective feature integration, or rely heavily on limited high-quality structural data. To address these challenges, we propose UMPPI (Unveiling Multilevel Protein-Peptide Interaction), a multiobjective framework based on the pretrained protein language model ESM2. UMPPI simultaneously predicts binary protein-peptide interactions and binding residues on both peptides and proteins through a multiobjective optimization strategy. By integrating ESM2 to encode sequences and extract latent structural information, UMPPI bridges the gap between sequence-based and structure-based methods. Extensive experiments demonstrated that UMPPI successfully captured binary interactions between peptides and proteins and identified the binding residues on peptides and proteins. UMPPI can serve as a useful tool for protein-peptide interaction prediction and identification of critical binding residues, thereby facilitating the peptide drug discovery process.
Affiliation(s)
- Shuwen Xiong
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
- Jiajie Cai
  - School of Software, Shandong University, Jinan 250101, China
- Hua Shi
  - School of Optoelectronic and Communication Engineering, Xiamen University of Technology, Xiamen 361005, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Leyi Wei
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
  - School of Software, Shandong University, Jinan 250101, China
3
Alanazi W, Meng D, Pollastri G. Advancements in one-dimensional protein structure prediction using machine learning and deep learning. Comput Struct Biotechnol J 2025; 27:1416-1430. [PMID: 40242292 PMCID: PMC12002955 DOI: 10.1016/j.csbj.2025.04.005] [Received: 01/13/2025] [Revised: 04/01/2025] [Accepted: 04/02/2025] [Indexed: 04/18/2025] [Open Access]
Abstract
The accurate prediction of protein structures remains a cornerstone challenge in structural bioinformatics, essential for understanding the intricate relationship between protein sequence, structure, and function. Recent advancements in Machine Learning (ML) and Deep Learning (DL) have revolutionized this field, offering innovative approaches to tackle one-dimensional (1D) protein structure annotations, including secondary structure, solvent accessibility, and intrinsic disorder. This review highlights the evolution of predictive methodologies, from early machine learning models to sophisticated deep learning frameworks that integrate sequence embeddings and pretrained language models. Key advancements, such as AlphaFold's transformative impact on structure prediction and the rise of protein language models (PLMs), have enabled unprecedented accuracy in capturing sequence-structure relationships. Furthermore, we explore the role of specialized datasets, benchmarking competitions, and multimodal integration in shaping state-of-the-art prediction models. By addressing challenges in data quality, scalability, interpretability, and task-specific optimization, this review underscores the transformative impact of ML, DL, and PLMs on 1D protein prediction while providing insights into emerging trends and future directions in this rapidly evolving field.
Affiliation(s)
- Wafa Alanazi
  - School of Computer Science, University College Dublin, Belfield, Dublin D04 C1P1, Ireland
  - Department of Computer Science, College of Science, Northern Border University, Arar, Saudi Arabia
- Di Meng
  - School of Computer Science, University College Dublin, Belfield, Dublin D04 C1P1, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin, Belfield, Dublin D04 C1P1, Ireland
4
Asadi M, Ghasemi Y, Nezafat N, Sarkari B, Baneshi M, Mostafavi-Pour Z, Anbardar MH, Savardashtaki A. Designing a novel multi-epitope antigen for diagnosing human cytomegalovirus infection: An immunoinformatics approach. Biotechnol Appl Biochem 2025; 72:469-483. [PMID: 39417400 DOI: 10.1002/bab.2677] [Received: 01/25/2024] [Accepted: 09/13/2024] [Indexed: 10/19/2024]
Abstract
Human cytomegalovirus (HCMV) infection can lead to congenital infections and severe complications, particularly in immunocompromised individuals. Current serological tests for diagnosing HCMV infection often face limitations in sensitivity and specificity. Developing multi-epitope antigens for serological assays offers the potential for enhancing diagnostic accuracy. This study aimed to design a novel multi-epitope antigen for HCMV infection diagnosis using immunoinformatic approaches. Five tegument proteins (universal protein resource [UniProt] ID: P08318, P06725, F5HC97, Q6RX10, and F5HC05) were selected based on their antigenic properties and literature review. Six linear B-cell epitopes were predicted within conserved regions of each antigen sequence and linked with appropriate linkers. The designed multi-epitope antigen underwent thorough evaluation for physicochemical properties, solubility, antigenicity, and cross-reactivity. Additionally, the three-dimensional structure of the antigen was predicted, refined, and validated. The nucleotide sequence of the designed antigen was optimized for successful expression in Escherichia coli and inserted into a pET23a (+) vector. Immunoinformatic analysis revealed that the multi-epitope antigen is stable and antigenic and lacks cross-reactivity. Our findings suggest that this multi-epitope antigen is a promising candidate for diagnosing HCMV infection. However, further validation through laboratory testing is required to confirm its diagnostic efficacy.
Affiliation(s)
- Marzieh Asadi
  - Department of Medical Biotechnology, School of Advanced Medical Sciences and Technologies, Shiraz University of Medical Sciences, Shiraz, Iran
  - Student Research Committee, Shiraz University of Medical Sciences, Shiraz, Iran
- Younes Ghasemi
  - Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, Shiraz, Iran
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
- Navid Nezafat
  - Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, Shiraz, Iran
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
- Bahador Sarkari
  - Department of Parasitology and Mycology, School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
- Maryam Baneshi
  - Department of Medical Biotechnology, School of Advanced Medical Sciences and Technologies, Shiraz University of Medical Sciences, Shiraz, Iran
- Zohreh Mostafavi-Pour
  - Recombinant Protein Laboratory, Department of Biochemistry, School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
- Amir Savardashtaki
  - Department of Medical Biotechnology, School of Advanced Medical Sciences and Technologies, Shiraz University of Medical Sciences, Shiraz, Iran
  - Infertility Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
5
Sun J, Ru J, Cribbs AP, Xiong D. PyPropel: a Python-based tool for efficiently processing and characterising protein data. BMC Bioinformatics 2025; 26:70. [PMID: 40025421 PMCID: PMC11871610 DOI: 10.1186/s12859-025-06079-3] [Received: 10/21/2024] [Accepted: 02/10/2025] [Indexed: 03/04/2025] [Open Access]
Abstract
BACKGROUND The volume of protein sequence data has grown exponentially in recent years, driven by advancements in metagenomics. Despite this, a substantial proportion of these sequences remain poorly annotated, underscoring the need for robust bioinformatics tools to facilitate efficient characterisation and annotation for functional studies. RESULTS We present PyPropel, a Python-based computational tool developed to streamline the large-scale analysis of protein data, with a particular focus on applications in machine learning. PyPropel integrates sequence and structural data pre-processing, feature generation, and post-processing for model performance evaluation and visualisation, offering a comprehensive solution for handling complex protein datasets. CONCLUSION PyPropel provides added value over existing tools by offering a unified workflow that encompasses the full spectrum of protein research, from raw data pre-processing to functional annotation and model performance analysis, thereby supporting efficient protein function studies.
Affiliation(s)
- Jianfeng Sun
  - Botnar Research Centre, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Jinlong Ru
  - Chair of Prevention of Microbial Diseases, School of Life Sciences Weihenstephan, Technical University of Munich, 85354, Freising, Germany
- Adam P Cribbs
  - Botnar Research Centre, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Dapeng Xiong
  - Department of Computational Biology, Cornell University, Ithaca, 14853, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, 14853, USA
6
Khatooni Z, Broderick G, Anand SK, Wilson HL. Combined immunoinformatic approaches with computational biochemistry for development of subunit-based vaccine against Lawsonia intracellularis. PLoS One 2025; 20:e0314254. [PMID: 39992906 PMCID: PMC11849901 DOI: 10.1371/journal.pone.0314254] [Received: 07/22/2024] [Accepted: 11/07/2024] [Indexed: 02/26/2025] [Open Access]
Abstract
Lawsonia intracellularis (LI) are obligate intracellular bacteria and the causative agent of proliferative hemorrhagic enteropathy that significantly impacts the health of piglets and the profitability of the swine industry. In this study, we used immunoinformatic and computational methodologies such as homology modelling, molecular docking, molecular dynamics (MD) simulation, and free energy calculations in a novel three-stage approach to identify strong T and B cell epitopes in the LI proteome. From ∼1342 LI proteins, we narrowed our focus to 256 proteins that were either not well-identified (unknown role) or were expressed at a higher frequency in pathogenic strains relative to non-pathogenic strains. At stage 1, these proteins were analyzed for predicted virulence, antigenicity, solubility, and probability of residing within a membrane. At stage 2, we used NetMHCpan-4.1 to identify over ten thousand cytotoxic T lymphocyte epitopes (CTLEs), and 286 CTLEs were ranked as having high predicted binding affinity for the SLA-1 and SLA-2 complexes. At stage 3, we used homology modeling to predict the structures of the top-ranked CTLEs and we subjected each of them to molecular docking analysis with SLA-1*0401 and SLA-2*0402. The top-ranked 25 SLA-CTLE complexes were selected as input for subsequent MD simulations to fully investigate the atomic-level dynamics of proteins under the natural thermal fluctuation of water and thus potentially provide deep insight into the CTLE-SLA interaction. We also performed free energy evaluation by Molecular Mechanics/Poisson-Boltzmann Surface Area to predict epitope interactions and binding affinities to SLA-1 and SLA-2. We identified the top five CTLEs having the strongest binding energy to the indicated SLAs (-305.6 kJ/mol, -219.5 kJ/mol, -214.8 kJ/mol, -139.5 kJ/mol and -92.6 kJ/mol, respectively). We also performed B-cell epitope prediction, and the top-ranked 5 CTLEs and 3 B-cell epitopes were organized into a multi-epitope subunit antigen vaccine construct joined using EAAAK, AAY, KK, and GGGGG linkers, with 40 residues of the LI DnaK protein attached to the N-terminus to further enhance the antigenicity of the vaccine construct. Blind docking studies showed strong interactions between our vaccine construct and swine Toll-like receptor 5. Collectively, these molecular modeling and immunoinformatic analyses present a useful in silico protocol for the discovery of candidate antigens in many viral and bacterial pathogens.
Affiliation(s)
- Zahed Khatooni
  - Vaccine and Infectious Disease Organization (VIDO), University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Gordon Broderick
  - Vaccine and Infectious Disease Organization (VIDO), University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Sanjeev K. Anand
  - Now with Modulant Biosciences LLC, Fishers, IN, United States of America
- Heather L. Wilson
  - Vaccine and Infectious Disease Organization (VIDO), University of Saskatchewan, Saskatoon, Saskatchewan, Canada
  - Department of Veterinary Microbiology, Western College of Veterinary Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
  - School of Public Health, Vaccinology & Immunotherapeutics program, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
7
Alanazi W, Meng D, Pollastri G. PaleAle 6.0: Prediction of Protein Relative Solvent Accessibility by Leveraging Pre-Trained Language Models (PLMs). Biomolecules 2025; 15:49. [PMID: 39858443 PMCID: PMC11764203 DOI: 10.3390/biom15010049] [Received: 11/28/2024] [Revised: 12/18/2024] [Accepted: 12/31/2024] [Indexed: 01/27/2025] [Open Access]
Abstract
Predicting the relative solvent accessibility (RSA) of a protein is critical to understanding its 3D structure and biological function. RSA prediction, especially when homology transfer cannot provide information about a protein's structure, is a significant step toward addressing the protein structure prediction challenge. Today, deep learning is arguably the most powerful method for predicting RSA and other structural features of proteins. In particular, recent breakthroughs in deep learning, driven by the integration of natural language processing (NLP) algorithms, have significantly advanced the field of protein research. Inspired by the remarkable success of NLP techniques, this study leverages pre-trained language models (PLMs) to enhance RSA prediction. We present a deep neural network architecture based on a combination of bidirectional recurrent neural networks and convolutional layers that can analyze long-range interactions within protein sequences and predict protein RSA using ESM-2 encoding. The final predictor, PaleAle 6.0, predicts RSA in real values as well as two-state (exposure threshold of 25%) and four-state (exposure thresholds of 4%, 25%, and 50%) discrete classifications. On the 2022 test set, PaleAle 6.0 achieved over 82% accuracy for two-state RSA (RSA_2C) and 59.75% accuracy for four-state RSA (RSA_4C), with a Pearson correlation coefficient (PCC) of 77.88 for real-value RSA prediction. When evaluated on the more challenging 2024 test set, PaleAle 6.0 maintained a strong performance, achieving 79.74% accuracy in the two-state prediction and 55.30% accuracy in the four-state prediction, with a PCC of 73.08 for real-value predictions, outperforming all previously benchmarked predictors.
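The two-state and four-state discretization described in this abstract can be illustrated with a minimal Python sketch. It is not taken from PaleAle 6.0: the function name `rsa_class`, the constant names, and the choice to treat a value exactly at a threshold as belonging to the more exposed class are all assumptions made for illustration. Only the threshold values (25%, and 4%/25%/50%) come from the abstract.

```python
from bisect import bisect_right

# Exposure thresholds quoted in the abstract, expressed as fractions of full exposure.
TWO_STATE_THRESHOLDS = (0.25,)
FOUR_STATE_THRESHOLDS = (0.04, 0.25, 0.50)


def rsa_class(rsa, thresholds):
    """Map a real-valued RSA in [0, 1] to a class index.

    The index is the number of thresholds that `rsa` meets or exceeds, so
    0 is the most buried class. Treating a value equal to a threshold as
    the more exposed class is an assumption, not something the abstract states.
    """
    if not 0.0 <= rsa <= 1.0:
        raise ValueError("RSA must lie in [0, 1]")
    return bisect_right(thresholds, rsa)


if __name__ == "__main__":
    for value in (0.03, 0.10, 0.30, 0.80):
        print(value,
              rsa_class(value, TWO_STATE_THRESHOLDS),
              rsa_class(value, FOUR_STATE_THRESHOLDS))
```

Using sorted thresholds with a binary search keeps the two-state and four-state cases in one function; a different boundary convention would only require swapping `bisect_right` for `bisect_left`.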
Affiliation(s)
- Wafa Alanazi
  - School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland
  - Department of Computer Science, College of Science, Northern Border University, Arar 73241, Saudi Arabia
- Di Meng
  - School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland
- Gianluca Pollastri
  - School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland
8
Chatzimiltis S, Agathocleous M, Promponas VJ, Christodoulou C. Post-processing enhances protein secondary structure prediction with second order deep learning and embeddings. Comput Struct Biotechnol J 2025; 27:243-251. [PMID: 39866664 PMCID: PMC11764030 DOI: 10.1016/j.csbj.2024.12.022] [Received: 02/16/2024] [Revised: 12/20/2024] [Accepted: 12/21/2024] [Indexed: 01/28/2025] [Open Access]
Abstract
Protein Secondary Structure Prediction (PSSP) is regarded as a challenging task in bioinformatics, and numerous approaches to achieve a more accurate prediction have been proposed. Accurate PSSP can be instrumental in inferring protein tertiary structures and their functions. Machine Learning and in particular Deep Learning approaches show promising results for the PSSP problem. In this paper, we deploy a Convolutional Neural Network (CNN) trained with the Subsampled Hessian Newton (SHN) method (a Hessian Free Optimisation variant), with a two-dimensional input representation of embeddings extracted from a language model pretrained with protein sequences. Utilising a CNN trained with the SHN method and the input embeddings, we achieved on average a 79.96% per residue (Q3) accuracy on the CB513 dataset and 81.45% Q3 accuracy on the PISCES dataset (without any post-processing techniques applied). The application of ensembles and filtering techniques to the results of the CNN improved the overall prediction performance. The Q3 accuracy on the CB513 dataset increased to 93.65% and on the PISCES dataset to 87.13%. Moreover, our method was evaluated using the CASP13 dataset, where we showed that as the post-processing window size increased, the prediction performance increased as well. In fact, with the biggest post-processing window size (limited by the smallest CASP13 protein), we achieved a Q3 accuracy of 98.12% and a Segment Overlap (SOV) score of 96.98 on the CASP13 dataset when the CNNs were trained with the PISCES dataset. Finally, we showed that input representations from embeddings can perform as well as representations extracted from multiple sequence alignments.
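For readers unfamiliar with the metric, the per-residue Q3 accuracy quoted above is simply the percentage of residues whose predicted three-state label matches the observed one. A minimal Python sketch follows; the function name `q3_accuracy` and the H/E/C label alphabet are illustrative assumptions, and this is not the evaluation code used in the paper.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state.

    Both arguments are equal-length sequences of three-state labels,
    for example strings over the (assumed) alphabet H, E, C.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(observed) == 0:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    # Hypothetical eight-residue example with one mismatch at position 3.
    print(q3_accuracy("HHHEECCC", "HHEEECCC"))
```

Dataset-level figures such as those in the abstract are typically obtained by averaging over proteins or by pooling residues; which of the two the authors used is not stated here.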
Affiliation(s)
- Sotiris Chatzimiltis
  - University of Cyprus, Department of Computer Science, Nicosia, Cyprus
  - 5G/6GIC, Institute for Communication Systems (ICS), University of Surrey, Guildford, United Kingdom
- Michalis Agathocleous
  - University of Cyprus, Department of Computer Science, Nicosia, Cyprus
  - University of Nicosia, Department of Computer Science, Nicosia, Cyprus
9
Han KS, Song SR, Pak MH, Kim CS, Ri CP, Del Conte A, Piovesan D. PredIDR: Accurate prediction of protein intrinsic disorder regions using deep convolutional neural network. Int J Biol Macromol 2025; 284:137665. [PMID: 39571839 DOI: 10.1016/j.ijbiomac.2024.137665] [Received: 09/10/2024] [Revised: 10/29/2024] [Accepted: 11/13/2024] [Indexed: 12/02/2024]
Abstract
The involvement of protein intrinsic disorder in essential biological processes is well known in structural biology. However, experimental methods for detecting intrinsic structural disorder and directly measuring the highly dynamic behavior of protein structure are limited. To address this issue, several computational methods to predict intrinsic disorder from protein sequences have been developed, and their performance is evaluated by the Critical Assessment of protein Intrinsic Disorder (CAID). In this paper, we describe a new computational method, PredIDR, which provides accurate prediction of intrinsically disordered regions in proteins, mimicking experimental X-ray missing residues. Indeed, missing residues in the Protein Data Bank (PDB) were used as positive examples to train a deep convolutional neural network which produces two types of output for short and long regions. PredIDR took part in the second round of CAID and was as accurate as the top state-of-the-art IDR prediction methods. PredIDR can be freely used through the CAID Prediction Portal available at https://caid.idpcentral.org/portal or downloaded as a Singularity container from https://biocomputingup.it/shared/caid-predictors/.
Affiliation(s)
- Kun-Sop Han
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Se-Ryong Song
  - Branch of Biotechnology, State Academy of Sciences, Pyongyang, Democratic People's Republic of Korea
- Myong-Hyon Pak
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Chol-Song Kim
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Chol-Pyok Ri
  - University of Sciences, Pyongyang, Democratic People's Republic of Korea
- Alessio Del Conte
  - Department of Biomedical Sciences, University of Padova, Padova, Italy
- Damiano Piovesan
  - Department of Biomedical Sciences, University of Padova, Padova, Italy
10
Wu T, Cheng W, Cheng J. Improving Protein Secondary Structure Prediction by Deep Language Models and Transformer Networks. Methods Mol Biol 2025; 2867:43-53. [PMID: 39576574 DOI: 10.1007/978-1-0716-4196-5_3] [Indexed: 11/24/2024]
Abstract
Protein secondary structure prediction is useful for many applications. It can be considered a language translation problem, that is, translating a sequence of 20 different amino acids into a sequence of secondary structure symbols (e.g., alpha helix, beta strand, and coil). Here, we develop a novel protein secondary structure predictor called TransPross based on the transformer network and attention mechanism widely used in natural language processing to directly extract the evolutionary information from the protein language (i.e., raw multiple sequence alignment [MSA] of a protein) to predict the secondary structure. The method is different from traditional methods that first generate an MSA and then calculate expert-curated statistical profiles from the MSA as input. The attention mechanism used by TransPross can effectively capture long-range residue-residue interactions in protein sequences to predict secondary structures. Benchmarked on several datasets, TransPross outperforms the state-of-the-art methods. Moreover, our experiment shows that the prediction accuracy of TransPross positively correlates with the depth of MSAs, and it is able to achieve an average prediction accuracy (i.e., Q3 score) above 80% for hard targets with few homologous sequences in their MSAs. TransPross is freely available at https://github.com/BioinfoMachineLearning/TransPro.
Affiliation(s)
- Tianqi Wu
  - Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, USA
- Weihang Cheng
  - Department of Chemistry, Hubei University, Wuhan, Hubei, China
- Jianlin Cheng
  - Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, USA
11
Srivastava G, Liu M, Ni X, Pu L, Brylinski M. Machine Learning Techniques to Infer Protein Structure and Function from Sequences: A Comprehensive Review. Methods Mol Biol 2025; 2867:79-104. [PMID: 39576576 DOI: 10.1007/978-1-0716-4196-5_5] [Indexed: 11/24/2024]
Abstract
The elucidation of protein structure and function plays a pivotal role in understanding biological processes and facilitating drug discovery. With the exponential growth of protein sequence data, machine learning techniques have emerged as powerful tools for predicting protein characteristics from sequences alone. This review provides a comprehensive overview of the importance and application of machine learning in inferring protein structure and function. We discuss various machine learning approaches, primarily focusing on convolutional neural networks and natural language processing, and their utilization in predicting protein secondary and tertiary structures, residue-residue contacts, protein function, and subcellular localization. Furthermore, we highlight the challenges associated with using machine learning techniques in this context, such as the availability of high-quality training datasets and the interpretability of models. We also delve into the latest progress in the field concerning the advancements made in the development of intricate deep learning architectures. Overall, this review underscores the significance of machine learning in advancing our understanding of protein structure and function, and its potential to revolutionize drug discovery and personalized medicine.
Affiliation(s)
- Gopal Srivastava
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA
- Mengmeng Liu
  - Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, USA
- Xialong Ni
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA
- Limeng Pu
  - Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, USA
- Michal Brylinski
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA
  - Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, USA
12
Nael MA, Ghoneim MM, Almuqbil M, Al-Serwi RH, El-Sherbiny M, Mostafa AE, Elokely KM. An evaluation of the precision of computational methods used in drug development initiatives. J Biomol Struct Dyn 2024:1-15. [PMID: 39659185 DOI: 10.1080/07391102.2024.2435633] [Received: 01/04/2024] [Accepted: 03/29/2024] [Indexed: 12/12/2024]
Abstract
Computational approaches are commonly employed to expedite and provide decision-making for the drug development process. Drug development programs that involve targets without known crystal structures can be quite challenging. In many cases, a viable approach is to generate reliable homology models using the amino acid sequence of the target. This is followed by a series of validation steps, druggable pocket detection, and then moving forward with lead identification and validation. This study commenced by conducting an initial benchmark exercise using a series of computationally designed sequences for steroid-binding proteins. By conducting an unbiased comparison with the released X-ray crystal structures, the homology models that were generated demonstrated reliable outcomes. The aligned homology models showed a root mean square deviation (RMSD) of less than 0.6 Å when compared to the corresponding X-ray structures. Three different methods were used to detect the druggable cavities for comparison, and the identified pockets closely resembled those of the crystal structures. The achievement of near-native pose prediction was made possible by utilizing the comprehensive binding energy function that characterizes the interaction between each pose and the neighboring residues. In order to address the issue of limited correlation between entropy and internal energy in docking, an alternative was devised by incorporating entropy as a post-docking optimization step to enhance the accuracy of ligand binding affinity predictions and improve the overall quality of the results.
Affiliation(s)
- Manal A Nael
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Tanta University, Tanta, Egypt
  - Department of Chemistry, Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania, USA
- Mohammed M Ghoneim
  - Department of Pharmacy Practice, College of Pharmacy, AlMaarefa University, Riyadh, Saudi Arabia
- Mansour Almuqbil
  - Clinical Pharmacy Department, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia
- Rasha Hamed Al-Serwi
  - Department of Basic Dental Sciences, College of Dentistry, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Mohamed El-Sherbiny
  - Department of Basic Medical Sciences, College of Medicine, AlMaarefa University, Riyadh, Saudi Arabia
- Ahmad E Mostafa
  - Pharmacognosy and Medicinal Plants Department, Faculty of Pharmacy, Al-Azhar University, Cairo, Egypt
- Khaled M Elokely
  - Department of Chemistry, Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania, USA
13
Dong B, Liu Z, Xu D, Hou C, Dong G, Zhang T, Wang G. SERT-StructNet: Protein secondary structure prediction method based on multi-factor hybrid deep model. Comput Struct Biotechnol J 2024; 23:1364-1375. [PMID: 38596312 PMCID: PMC11001767 DOI: 10.1016/j.csbj.2024.03.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Revised: 03/20/2024] [Accepted: 03/21/2024] [Indexed: 04/11/2024] Open
Abstract
Protein secondary structure prediction (PSSP) is a pivotal research endeavour that plays a crucial role in the comprehensive elucidation of protein functions and properties. Current prediction methodologies center on deep-learning techniques, particularly those using multi-factor features. Diverging from existing approaches, in this study, we placed special emphasis on the effects of amino acid properties and protein secondary structure propensity scores (SSPs) on secondary structure during the meticulous selection of multi-factor features. This differential feature-selection strategy results in a distinctive and effective amalgamation of the sequence and property features. To harness these multi-factor features optimally, we introduced a hybrid deep feature extraction model. The model initially employs mechanisms such as dilated convolution (D-Conv) and a channel attention network (SENet) for local feature extraction and targeted channel enhancement. Subsequently, a combination of recurrent neural network variants (BiGRU and BiLSTM), along with a transformer module, was employed to achieve global bidirectional information consideration and feature enhancement. This approach to multi-factor feature input and multi-level feature processing enabled a comprehensive exploration of intricate associations among amino acid residues in protein sequences, yielding a Q3 accuracy of 84.9% and an Sov score of 85.1%. The overall performance surpasses that of the comparable methods. This study introduces a novel and efficient method for determining the PSSP domain, which is poised to deepen our understanding of the practical applications of protein molecular structures.
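For readers unfamiliar with the Q3 metric reported above, the following minimal sketch shows how it is commonly computed: the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the observed one. The two label strings are invented for illustration and do not come from the paper.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label (H, E or C) is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    allowed = set("HEC")
    if not set(predicted) <= allowed or not set(observed) <= allowed:
        raise ValueError("labels must be drawn from H, E and C")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Hypothetical 14-residue example with two mismatches
observed = "CCHHHHHCCEEEEC"
predicted = "CCHHHHCCCEEECC"
print(f"Q3 = {q3_accuracy(predicted, observed):.1f}%")
```

The Sov score mentioned alongside Q3 is segment-based and is not covered by this sketch.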
Affiliation(s)
- Benzhi Dong
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Zheng Liu
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Dali Xu
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Chang Hou
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Guanghui Dong
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Tianjiao Zhang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
14
Dong B, Su H, Xu D, Hou C, Liu Z, Niu N, Wang G. ILMCNet: A Deep Neural Network Model That Uses PLM to Process Features and Employs CRF to Predict Protein Secondary Structure. Genes (Basel) 2024; 15:1350. [PMID: 39457474 PMCID: PMC11507629 DOI: 10.3390/genes15101350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2024] [Revised: 10/07/2024] [Accepted: 10/18/2024] [Indexed: 10/28/2024] Open
Abstract
BACKGROUND Protein secondary structure prediction (PSSP) is a critical task in computational biology, pivotal for understanding protein function and advancing medical diagnostics. Recently, approaches that integrate multiple amino acid sequence features have gained significant attention in PSSP research. OBJECTIVES We aim to automatically extract additional features represented by evolutionary information from a large number of sequences while simultaneously incorporating positional information for more comprehensive sequence features. Additionally, we consider the interdependence between secondary structures during the prediction stage. METHODS To this end, we propose a deep neural network model, ILMCNet, which utilizes a language model and Conditional Random Field (CRF). Protein language models (PLMs) pre-trained on sequences from multiple large databases can provide sequence features that incorporate evolutionary information. ILMCNet uses positional encoding to ensure that the input features include positional information. To better utilize these features, we propose a hybrid network architecture that employs a Transformer Encoder to enhance features and integrates a feature extraction module combining a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory Network (BiLSTM). This design enables deep extraction of localized features while capturing global bidirectional information. In the prediction stage, ILMCNet employs CRF to capture the interdependencies between secondary structures. RESULTS Experimental results on benchmark datasets such as CB513, TS115, NEW364, CASP11, and CASP12 demonstrate that the prediction performance of our method surpasses that of comparable approaches. CONCLUSIONS This study proposes a new approach to PSSP research and is expected to play an important role in other protein-related research fields, such as protein tertiary structure prediction.
Affiliation(s)
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; (B.D.); (H.S.); (D.X.); (C.H.); (Z.L.); (N.N.)
15
Sanjeevi M, Mohan A, Ramachandran D, Jeyaraman J, Sekar K. CSSP-2.0: A refined consensus method for accurate protein secondary structure prediction. Comput Biol Chem 2024; 112:108158. [PMID: 39053174 DOI: 10.1016/j.compbiolchem.2024.108158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Revised: 06/19/2024] [Accepted: 07/18/2024] [Indexed: 07/27/2024]
Abstract
Studying the relationship between sequences and their corresponding three-dimensional structure assists structural biologists in solving the protein-folding problem. Despite several experimental and in-silico approaches, decoding the three-dimensional structure from the sequence remains a mystery. In such cases, the accuracy of the structure prediction plays an indispensable role. To address this issue, an updated web server (CSSP-2.0) has been created to improve the accuracy of our previous version of CSSP by deploying the existing algorithms. It takes probabilities as input and predicts the consensus for the secondary structure as a highly accurate three-state Q3 (helix, strand, and coil). This prediction is achieved using six recent top-performing methods: MUFOLD-SS, RaptorX, PSSpred v4, PSIPRED, JPred v4, and Porter 5.0. CSSP-2.0 validation includes datasets involving various protein classes from the PDB, CullPDB, and AlphaFold databases. Our results indicate a significant improvement in the accuracy of the consensus Q3 prediction. Using CSSP-2.0, crystallographers can sort out the stable regular secondary structures from the entire complex structure, which would aid in inferring the functional annotation of hypothetical proteins. The web server is freely available at https://bioserver3.physics.iisc.ac.in/cgi-bin/cssp-2/.
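The abstract does not spell out how the per-method probabilities are combined, so the sketch below should be read as one plausible consensus scheme rather than the CSSP-2.0 algorithm: it averages the helix, strand and coil probabilities across methods for each residue and takes the most probable state. All names and numbers are hypothetical.

```python
STATES = ("H", "E", "C")  # helix, strand, coil

def consensus_q3(per_method_probs):
    """Average (pH, pE, pC) across methods per residue and return the argmax state string.

    per_method_probs: list with one entry per method, each a list of
    (pH, pE, pC) tuples, one tuple per residue.
    """
    if not per_method_probs:
        raise ValueError("at least one method is required")
    n_res = len(per_method_probs[0])
    if any(len(m) != n_res for m in per_method_probs):
        raise ValueError("all methods must cover the same residues")
    labels = []
    for i in range(n_res):
        mean = [sum(m[i][k] for m in per_method_probs) / len(per_method_probs)
                for k in range(3)]
        labels.append(STATES[mean.index(max(mean))])
    return "".join(labels)

# Three hypothetical predictors over a three-residue peptide
method_a = [(0.8, 0.1, 0.1), (0.4, 0.5, 0.1), (0.1, 0.2, 0.7)]
method_b = [(0.7, 0.1, 0.2), (0.6, 0.3, 0.1), (0.2, 0.2, 0.6)]
method_c = [(0.6, 0.2, 0.2), (0.3, 0.6, 0.1), (0.1, 0.1, 0.8)]
print(consensus_q3([method_a, method_b, method_c]))
```

Plain averaging weights every method equally; a weighted mean or majority vote would be equally valid readings of "consensus".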
Affiliation(s)
- Madhumathi Sanjeevi
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India; Structural Biology and Bio-Computing Laboratory, Department of Bioinformatics, Alagappa University, Karaikudi 630004, India
- Ajitha Mohan
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
- Jeyakanthan Jeyaraman
  - Structural Biology and Bio-Computing Laboratory, Department of Bioinformatics, Alagappa University, Karaikudi 630004, India
- Kanagaraj Sekar
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
16
Dong B, Liu Z, Xu D, Hou C, Niu N, Wang G. Impact of Multi-Factor Features on Protein Secondary Structure Prediction. Biomolecules 2024; 14:1155. [PMID: 39334921 PMCID: PMC11430196 DOI: 10.3390/biom14091155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2024] [Revised: 09/05/2024] [Accepted: 09/10/2024] [Indexed: 09/30/2024] Open
Abstract
Protein secondary structure prediction (PSSP) plays a crucial role in resolving protein functions and properties. Significant progress has been made in this field in recent years, and the use of a variety of protein-related features, including amino acid sequences, position-specific score matrices (PSSM), amino acid properties, and secondary structure trend factors, to improve prediction accuracy is an important technical route for it. However, a comprehensive evaluation of the impact of these factor features in secondary structure prediction is lacking in the current work. This study quantitatively analyzes the impact of several major factors on secondary structure prediction models using a more explanatory four-class machine learning approach. The applicability of each factor in the different types of methods, the extent to which the different methods work on each factor, and the evaluation of the effect of multi-factor combinations are explored in detail. Through experiments and analyses, it was found that PSSM performs best in methods with strong high-dimensional features and complex feature extraction capabilities, while amino acid sequences, although performing poorly overall, perform relatively well in methods with strong linear processing capabilities. Also, the combination of amino acid properties and trend factors significantly improved the prediction performance. This study provides empirical evidence for future researchers to optimize multi-factor feature combinations and apply them to protein secondary structure prediction models, which is beneficial in further optimizing the use of these factors to enhance the performance of protein secondary structure prediction models.
Affiliation(s)
- Na Niu
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; (B.D.); (Z.L.); (D.X.); (C.H.)
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; (B.D.); (Z.L.); (D.X.); (C.H.)
17
Omidian M, Mostafavi-Pour Z, Asadi M, Sharifdini M, Nezafat N, Pouryousef A, Savardashtaki A, Taheri-Anganeh M, Mikaeili F, Sarkari B. Design and expression of a chimeric recombinant antigen (SsIR-Ss1a) for the serodiagnosis of human strongyloidiasis: Evaluation of performance, sensitivity, and specificity. PLoS Negl Trop Dis 2024; 18:e0012320. [PMID: 39008519 PMCID: PMC11271862 DOI: 10.1371/journal.pntd.0012320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Revised: 07/25/2024] [Accepted: 06/26/2024] [Indexed: 07/17/2024] Open
Abstract
BACKGROUND The sensitivity of parasitological and molecular methods is unsatisfactory for the diagnosis of strongyloidiasis, and serological techniques remain the most effective diagnostic approach. The present study aimed to design and produce a chimeric recombinant antigen from Strongyloides stercoralis immunoreactive antigen (SsIR) and Ss1a antigens, using immuno-informatics approaches, and to evaluate its diagnostic performance in an ELISA system for the diagnosis of human strongyloidiasis. METHODOLOGY/PRINCIPAL FINDINGS The coding sequences for SsIR and Ss1a were selected from GenBank and were gene-optimized. Using bioinformatics analysis, the regions with the highest antigenicity that did not overlap with other parasite antigens were selected. The chimeric recombinant antigen SsIR-Ss1a was constructed. The solubility and physicochemical properties of the designed construct were analyzed and its tertiary structures were built and evaluated. The construct was expressed into the pET-23a (+) expression vector and the optimized DNA sequences of SsIR-Ss1a (873 bp) were cloned into competent E. coli DH5α cells. Diagnostic performances of the produced recombinant antigen, along with a commercial kit, were evaluated in an indirect ELISA system, using a panel of sera from strongyloidiasis patients and controls. The physicochemical and bioinformatics evaluations revealed that the designed chimeric construct is soluble, has a molecular weight of 35 kDa, and is antigenic. Western blotting confirmed the immunoreactivity of the produced chimeric recombinant antigen with the sera of strongyloidiasis patients. The sensitivity and specificity of the indirect ELISA system, using the produced SsIR-Ss1a chimeric antigen, were found to be 93.94% (95% CI, 0.803 to 0.989) and 97.22% (95% CI, 0.921 to 0.992) respectively.
CONCLUSIONS/SIGNIFICANCE The preliminary findings of this study suggest that the produced SsIR-Ss1a chimeric antigen shows promise in the diagnosis of human strongyloidiasis. However, these results are based on a limited panel of samples, and further research with a larger sample size is necessary to confirm its accuracy. The construct has potential as an antigen in the ELISA system for the serological diagnosis of this neglected parasitic infection, but additional validation is required.
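Sensitivity and specificity with 95% confidence intervals, as reported in this abstract, can be derived from a 2x2 table of test results. The sketch below uses the Wilson score interval, which is an assumption since the abstract does not name the interval method, and the counts are hypothetical rather than the study's data.

```python
import math

def proportion_with_wilson_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the Wilson score interval (z=1.96 for ~95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, center - half, center + half

# Hypothetical counts: true/false positives and negatives from an ELISA panel
tp, fn = 45, 5    # infected sera: detected / missed
tn, fp = 90, 10   # control sera: correctly negative / falsely positive

sensitivity, sens_lo, sens_hi = proportion_with_wilson_ci(tp, tp + fn)
specificity, spec_lo, spec_hi = proportion_with_wilson_ci(tn, tn + fp)
print(f"sensitivity {sensitivity:.3f} (95% CI {sens_lo:.3f} to {sens_hi:.3f})")
print(f"specificity {specificity:.3f} (95% CI {spec_lo:.3f} to {spec_hi:.3f})")
```

With a small panel the interval is wide, which is the quantitative side of the abstract's caution about the limited sample size.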
Affiliation(s)
- Mostafa Omidian
  - Student Research Committee, Shiraz University of Medical Sciences, Shiraz, Iran
- Zohreh Mostafavi-Pour
  - Recombinant Proteins Laboratory, Department of Biochemistry, School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
- Marzieh Asadi
  - Department of Medical Biotechnology, School of Advanced Medical Sciences and Technologies, Shiraz University of Medical Sciences, Shiraz, Iran
- Meysam Sharifdini
  - Department of Medical Parasitology and Mycology, School of Medicine, Guilan University of Medical Sciences, Rasht, Iran
- Navid Nezafat
  - Pharmaceutical Sciences Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
  - Department of Pharmaceutical Biotechnology, School of Pharmacy, Shiraz University of Medical Sciences, Shiraz, Iran
- Ali Pouryousef
  - Student Research Committee, Shiraz University of Medical Sciences, Shiraz, Iran
- Amir Savardashtaki
  - Department of Medical Biotechnology, School of Advanced Medical Sciences and Technologies, Shiraz University of Medical Sciences, Shiraz, Iran
- Mortaza Taheri-Anganeh
  - Cellular and Molecular Research Center, Cellular and Molecular Medicine Research Institute, Urmia University of Medical Sciences, Urmia, Iran
- Fattaneh Mikaeili
  - Student Research Committee, Shiraz University of Medical Sciences, Shiraz, Iran
- Bahador Sarkari
  - Student Research Committee, Shiraz University of Medical Sciences, Shiraz, Iran
  - Basic Sciences in Infectious Diseases Research Center, Shiraz University of Medical Sciences, Shiraz, Iran
18
Dong Y, Quan H, Ma C, Shan L, Deng L. TGC-ARG: Anticipating Antibiotic Resistance via Transformer-Based Modeling and Contrastive Learning. Int J Mol Sci 2024; 25:7228. [PMID: 39000335 PMCID: PMC11241484 DOI: 10.3390/ijms25137228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 06/25/2024] [Accepted: 06/27/2024] [Indexed: 07/16/2024] Open
Abstract
In various domains, including everyday activities, agricultural practices, and medical treatments, the escalating challenge of antibiotic resistance poses a significant concern. Traditional approaches to studying antibiotic resistance genes (ARGs) often require substantial time and effort and are limited in accuracy. Moreover, the decentralized nature of existing data repositories complicates comprehensive analysis of antibiotic resistance gene sequences. In this study, we introduce a novel computational framework named TGC-ARG designed to predict potential ARGs. This framework takes protein sequences as input, utilizes SCRATCH-1D for protein secondary structure prediction, and employs feature extraction techniques to derive distinctive features from both sequence and structural data. Subsequently, a Siamese network is employed to foster a contrastive learning environment, enhancing the model's ability to effectively represent the data. Finally, a multi-layer perceptron (MLP) integrates and processes sequence embeddings alongside predicted secondary structure embeddings to forecast ARG presence. To evaluate our approach, we curated a pioneering open dataset termed ARSS (Antibiotic Resistance Sequence Statistics). Comprehensive comparative experiments demonstrate that our method surpasses current state-of-the-art methodologies. Additionally, through detailed case studies, we illustrate the efficacy of our approach in predicting potential ARGs.
Affiliation(s)
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China; (Y.D.); (H.Q.); (C.M.); (L.S.)
19
Cao Y, Qiu B, Ning X, Fan L, Qin Y, Yu D, Yang C, Ma H, Liao X, You C. Enhancing Machine-Learning Prediction of Enzyme Catalytic Temperature Optima through Amino Acid Conservation Analysis. Int J Mol Sci 2024; 25:6252. [PMID: 38892439 PMCID: PMC11173260 DOI: 10.3390/ijms25116252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 05/22/2024] [Accepted: 05/30/2024] [Indexed: 06/21/2024] Open
Abstract
Enzymes play a crucial role in various industrial production and pharmaceutical developments, serving as catalysts for numerous biochemical reactions. Determining the optimal catalytic temperature (Topt) of enzymes is crucial for optimizing reaction conditions, enhancing catalytic efficiency, and accelerating the industrial processes. However, due to the limited availability of experimentally determined Topt data and the insufficient accuracy of existing computational methods in predicting Topt, there is an urgent need for a computational approach to predict the Topt values of enzymes accurately. In this study, using phosphatase (EC 3.1.3.X) as an example, we constructed a machine learning model utilizing amino acid frequency and protein molecular weight information as features and employing the K-nearest neighbors regression algorithm to predict the Topt of enzymes. Usually, when conducting engineering for enzyme thermostability, researchers tend not to modify conserved amino acids. Therefore, we utilized this machine learning model to predict the Topt of phosphatase sequences after removing conserved amino acids. We found that the predictive model's mean coefficient of determination (R2) value increased from 0.599 to 0.755 compared to the model based on the complete sequences. Subsequently, experimental validation on 10 phosphatase enzymes with undetermined optimal catalytic temperatures shows that the predicted values of most phosphatase enzymes based on the sequence without conservative amino acids are closer to the experimental optimal catalytic temperature values. This study lays the foundation for the rapid selection of enzymes suitable for industrial conditions.
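The modelling idea in this abstract, K-nearest neighbors regression on amino acid frequencies plus molecular weight, can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the toy sequences, the molecular weights, the Topt values, the crude weight scaling and the choice of k are hypothetical and not taken from the paper, and the conserved-residue removal step is not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(sequence, mol_weight_kda):
    """20 amino acid frequencies plus molecular weight (scaled by 1/100, an arbitrary choice)."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts] + [mol_weight_kda / 100.0]

def knn_predict(train, query, k=3):
    """Mean target of the k training points closest to the query (Euclidean distance)."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical training set: (features, Topt in degrees Celsius)
train = [
    (featurize("MKKLLAAG", 30.0), 37.0),
    (featurize("MKKLLAAGG", 31.0), 40.0),
    (featurize("MEEVVRRPW", 55.0), 70.0),
    (featurize("MEEVVRRPWW", 56.0), 75.0),
]
query = featurize("MKKLLAAGA", 30.5)
print(knn_predict(train, query, k=2))
```

In practice the features would be standardized and k chosen by cross-validation; the fixed scaling above only keeps molecular weight from dominating the distance.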
Affiliation(s)
- Yinyin Cao
  - College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, China; (Y.C.)
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
- Boyu Qiu
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
  - Department of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230022, China
- Xiao Ning
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Lin Fan
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Yanmei Qin
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Dong Yu
  - College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, China; (Y.C.)
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
- Chunhe Yang
  - College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, China; (Y.C.)
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
- Hongwu Ma
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
  - National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Xiaoping Liao
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
  - National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Chun You
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China; (B.Q.); (H.M.)
  - University of Chinese Academy of Sciences, Beijing 100049, China
  - National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
20
Yin S, Mi X, Shukla D. Leveraging machine learning models for peptide-protein interaction prediction. RSC Chem Biol 2024; 5:401-417. [PMID: 38725911 PMCID: PMC11078210 DOI: 10.1039/d3cb00208j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 02/07/2024] [Indexed: 05/12/2024] Open
Abstract
Peptides play a pivotal role in a wide range of biological activities by participating in up to 40% of protein-protein interactions in cellular processes. They also demonstrate remarkable specificity and efficacy, making them promising candidates for drug development. However, predicting peptide-protein complexes by traditional computational approaches, such as docking and molecular dynamics simulations, remains a challenge due to the high computational cost, the flexible nature of peptides, and the limited structural information on peptide-protein complexes. In recent years, the surge of available biological data has given rise to an increasing number of machine learning models for predicting peptide-protein interactions. These models offer efficient solutions to the challenges associated with traditional computational approaches. Furthermore, they offer enhanced accuracy, robustness, and interpretability in their predictive outcomes. This review presents a comprehensive overview of machine learning and deep learning models that have emerged in recent years for the prediction of peptide-protein interactions.
Affiliation(s)
- Song Yin
  - Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign Urbana 61801 Illinois USA
- Xuenan Mi
  - Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign Urbana IL 61801 USA
- Diwakar Shukla
  - Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign Urbana 61801 Illinois USA
  - Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign Urbana IL 61801 USA
  - Department of Bioengineering, University of Illinois Urbana-Champaign Urbana IL 61801 USA
21
Zhang Y, Zhou C. PfgPDI: Pocket feature-enabled graph neural network for protein-drug interaction prediction. J Bioinform Comput Biol 2024; 22:2450004. [PMID: 38812467 DOI: 10.1142/s0219720024500045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2024]
Abstract
Biomolecular interaction recognition between ligands and proteins is an essential task, which largely enhances the safety and efficacy in the drug discovery and development stage. Studying the interaction between proteins and ligands can improve the understanding of disease pathogenesis and lead to more effective drug targets. Additionally, it can aid in determining drug parameters, ensuring proper absorption, distribution, and metabolism within the body. Due to incomplete feature representation or the model's inadequate adaptation to protein-ligand complexes, the existing methodologies suffer from suboptimal predictive accuracy. To address these pitfalls, in this study, we designed a new deep learning method based on transformer and GCN. We first utilized the transformer network to grasp crucial information of the original protein sequences within the SMILES sequences and connected them to prevent falling into a local optimum. Furthermore, a series of dilated convolutions is performed to obtain the pocket features and SMILES features, which are subsequently subjected to graph convolution to optimize the connections. The combined representations are fed into the proposed model for classification prediction. Experiments conducted on various protein-ligand binding prediction methods prove the effectiveness of our proposed method. It is expected that PfgPDI can contribute to drug prediction and accelerate the development of new drugs, while also serving as a valuable partner for drug testing and Research and Development engineers.
Affiliation(s)
- Yiqian Zhang
  - School of Electrical and Information, Northeast Agricultural University, Harbin 150030, P. R. China
- Changjian Zhou
  - Department of Data and Computing, Northeast Agricultural University, Harbin 150030, P. R. China
22
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Affiliation(s)
- Pengzhen Jia
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Fuhao Zhang
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
  - College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
- Chaojin Wu
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
23
Wu H, Chen R, Li X, Zhang Y, Zhang J, Yang Y, Wan J, Zhou Y, Chen H, Li J, Li R, Zou G. ESKtides: a comprehensive database and mining method for ESKAPE phage-derived antimicrobial peptides. Database (Oxford) 2024; 2024:baae022. [PMID: 38531599 PMCID: PMC10965241 DOI: 10.1093/database/baae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 12/06/2023] [Accepted: 03/06/2024] [Indexed: 03/28/2024]
Abstract
'Superbugs' have received increasing attention from researchers, such as ESKAPE bacteria (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa and Enterobacter spp.), which directly led to about 1 270 000 deaths in 2019. Recently, phage peptidoglycan hydrolases (PGHs)-derived antimicrobial peptides were proposed as new antibacterial agents against multidrug-resistant bacteria. However, there is still a lack of methods for mining antimicrobial peptides based on phages or phage PGHs. Here, by using a collection of 6809 genomes of ESKAPE isolates and corresponding phages in public databases, based on a unified annotation process of all the genomes, PGHs were systematically identified, from which peptides were mined. As a result, a total of 12 067 248 peptides with high antibacterial activities were determined. A user-friendly tool was developed to predict the phage PGHs-derived antimicrobial peptides from customized genomes, which also allows the calculation of peptide phylogeny, physicochemical properties, and secondary structure. Finally, a user-friendly and intuitive database, ESKtides (http://www.phageonehealth.cn:9000/ESKtides), was designed for data browsing, searching and downloading, which provides a rich peptide library based on ESKAPE prophages and phages. Database URL: 10.1093/database/baae022.
Affiliation(s)
- Hongfang Wu
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Rongxian Chen
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Xuejian Li
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Yue Zhang
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Jianwei Zhang
- National Key Laboratory of Crop Genetic Improvement, Shizishan Street No. 1, Wuhan 430070, China
- Yanbo Yang
- College of Informatics, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Jun Wan
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Yang Zhou
- College of Fisheries, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Huanchun Chen
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- College of Veterinary Medicine, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Jinquan Li
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- College of Veterinary Medicine, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road No. 97, Shenzhen 518000, China
- Shenzhen Institute of Quality & Safety Inspection and Research, Buxin Road No. 97, Shenzhen 518000, China
- Runze Li
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Geng Zou
- National Key Laboratory of Agricultural Microbiology, College of Biomedicine and Health, Huazhong Agricultural University, Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
- Hubei Hongshan Laboratory, College of Food Science and Technology, Huazhong Agricultural University, Shizishan Street No. 1, Wuhan 430070, China
|
24
|
Yin S, Mi X, Shukla D. Leveraging Machine Learning Models for Peptide-Protein Interaction Prediction. ARXIV 2024:arXiv:2310.18249v2. [PMID: 37961736 PMCID: PMC10635286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Peptides play a pivotal role in a wide range of biological activities by participating in up to 40% of protein-protein interactions in cellular processes. They also demonstrate remarkable specificity and efficacy, making them promising candidates for drug development. However, predicting peptide-protein complexes with traditional computational approaches, such as docking and molecular dynamics simulations, remains a challenge because of the high computational cost, the flexible nature of peptides, and the limited structural information on peptide-protein complexes. In recent years, the surge of available biological data has given rise to an increasing number of machine learning models for predicting peptide-protein interactions. These models offer efficient solutions to the challenges associated with traditional computational approaches. Furthermore, they offer enhanced accuracy, robustness, and interpretability in their predictive outcomes. This review presents a comprehensive overview of machine learning and deep learning models that have emerged in recent years for the prediction of peptide-protein interactions.
Affiliation(s)
- Song Yin
- Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- These authors contributed to the work equally
- Xuenan Mi
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- These authors contributed to the work equally
- Diwakar Shukla
- Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
|
25
|
Yin Z, Chen Y, Hao Y, Pandiyan S, Shao J, Wang L. FOTF-CPI: A compound-protein interaction prediction transformer based on the fusion of optimal transport fragments. iScience 2024; 27:108756. [PMID: 38230261 PMCID: PMC10790010 DOI: 10.1016/j.isci.2023.108756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 11/05/2023] [Accepted: 12/13/2023] [Indexed: 01/18/2024] Open
Abstract
Compound-protein interaction (CPI) affinity prediction plays an important role in reducing the cost and time of drug discovery. However, the interpretability of how fragments function in CPI is impacted by the fact that current methods ignore the affinity relationships between fragments of compounds and fragments of proteins in CPI modeling. This article introduces an improved Transformer called FOTF-CPI (a Fusion of Optimal Transport Fragments compound-protein interaction prediction model). We use an optimal transport-based fragmentation approach to improve the model's understanding of compound and protein sequences. Additionally, a fused attention mechanism is employed, which combines the features of fragments to capture full affinity information. This fused attention redistributes higher attention scores to fragments with higher affinity. Experimental results show FOTF-CPI achieves an average 2% higher performance than other models on all three datasets. Furthermore, the visualization confirms the potential of FOTF-CPI for drug discovery applications.
Affiliation(s)
- Zeyu Yin
- School of Information Science and Technology, Nantong University, Nantong 226001, China
- Yu Chen
- School of Information Science and Technology, Nantong University, Nantong 226001, China
- Yajie Hao
- School of Information Science and Technology, Nantong University, Nantong 226001, China
- Sanjeevi Pandiyan
- Research Center for Intelligent Information Technology, Nantong University, Nantong 226001, China
- Jinsong Shao
- School of Information Science and Technology, Nantong University, Nantong 226001, China
- Li Wang
- School of Information Science and Technology, Nantong University, Nantong 226001, China
- Research Center for Intelligent Information Technology, Nantong University, Nantong 226001, China
|
26
|
Matinyan S, Filipcik P, Abrahams JP. Deep learning applications in protein crystallography. Acta Crystallogr A Found Adv 2024; 80:1-17. [PMID: 38189437 PMCID: PMC10833361 DOI: 10.1107/s2053273323009300] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/29/2023] [Accepted: 10/24/2023] [Indexed: 01/09/2024] Open
Abstract
Deep learning techniques can recognize complex patterns in noisy, multidimensional data. In recent years, researchers have started to explore the potential of deep learning in the field of structural biology, including protein crystallography. This field has some significant challenges, in particular producing high-quality and well ordered protein crystals. Additionally, collecting diffraction data with high completeness and quality, and determining and refining protein structures can be problematic. Protein crystallographic data are often high-dimensional, noisy and incomplete. Deep learning algorithms can extract relevant features from these data and learn to recognize patterns, which can improve the success rate of crystallization and the quality of crystal structures. This paper reviews progress in this field.
Affiliation(s)
- Jan Pieter Abrahams
- Biozentrum, Basel University, Basel, Switzerland
- Paul Scherrer Institute, Villigen, Switzerland
|
27
|
Molzahn C, Kuechler ER, Zemlyankina I, Nierves L, Ali T, Cole G, Wang J, Albu RF, Zhu M, Cashman NR, Gilch S, Karsan A, Lange PF, Gsponer J, Mayor T. Shift of the insoluble content of the proteome in the aging mouse brain. Proc Natl Acad Sci U S A 2023; 120:e2310057120. [PMID: 37906643 PMCID: PMC10636323 DOI: 10.1073/pnas.2310057120] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/16/2023] [Accepted: 09/24/2023] [Indexed: 11/02/2023] Open
Abstract
During aging, the cellular response to unfolded proteins is believed to decline, resulting in diminished proteostasis. In model organisms, such as Caenorhabditis elegans, proteostatic decline with age has been linked to proteome solubility shifts and the onset of protein aggregation. However, this correlation has not been extensively characterized in aging mammals. To uncover age-dependent changes in the insoluble portion of a mammalian proteome, we analyzed the detergent-insoluble fraction of mouse brain tissue by mass spectrometry. We identified a group of 171 proteins, including the small heat shock protein α-crystallin, that become enriched in the detergent-insoluble fraction obtained from old mice. To enhance our ability to detect features associated with proteins in that fraction, we complemented our data with a meta-analysis of studies reporting the detergent-insoluble proteins in various mouse models of aging and neurodegeneration. Strikingly, insoluble proteins from young and old mice are distinct in several features in our study and across the collected literature data. In younger mice, proteins are more likely to be disordered, part of membraneless organelles, and involved in RNA binding. These traits become less prominent with age, as an increased number of structured proteins enter the pellet fraction. This analysis suggests that age-related changes to proteome organization lead a group of proteins with specific features to become detergent-insoluble. Importantly, these features are not consistent with those associated with proteins driving membraneless organelle formation. We see no evidence in our system of a general increase of condensate proteins in the detergent-insoluble fraction with age.
Affiliation(s)
- Cristen Molzahn
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Edward Leong Center for Healthy Aging, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- Erich R. Kuechler
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Irina Zemlyankina
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Lorenz Nierves
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Michael Cuccione Childhood Cancer Research Program, British Columbia Children's Hospital Research Institute, Vancouver, BC V5Z 4H4, Canada
- Tahir Ali
- Faculty of Veterinary Medicine and Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4Z6, Canada
- Grace Cole
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- British Columbia Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Jing Wang
- Division of Neurology and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- Razvan F. Albu
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Mang Zhu
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Neil R. Cashman
- Division of Neurology and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- Sabine Gilch
- Faculty of Veterinary Medicine and Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4Z6, Canada
- Aly Karsan
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- British Columbia Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Philipp F. Lange
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Michael Cuccione Childhood Cancer Research Program, British Columbia Children's Hospital Research Institute, Vancouver, BC V5Z 4H4, Canada
- British Columbia Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Jörg Gsponer
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Thibault Mayor
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Edward Leong Center for Healthy Aging, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
|
28
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
|
29
|
Milchevskiy YV, Milchevskaya VY, Nikitin AM, Kravatsky YV. Effective Local and Secondary Protein Structure Prediction by Combining a Neural Network-Based Approach with Extensive Feature Design and Selection without Reliance on Evolutionary Information. Int J Mol Sci 2023; 24:15656. [PMID: 37958639 PMCID: PMC10648199 DOI: 10.3390/ijms242115656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 10/24/2023] [Accepted: 10/25/2023] [Indexed: 11/15/2023] Open
Abstract
Protein structure prediction continues to pose multiple challenges despite outstanding progress that is largely attributable to the use of novel machine learning techniques. One of the widely used representations of local 3D structure, protein blocks (PBs), can be treated in a similar way to secondary structure classes. Here, we present a new approach for predicting local conformation in terms of PB classes solely from amino acid sequences. We apply the RMSD metric to ensure unambiguous future 3D protein structure recovery. The selection of statistically assessed features is a key component of the proposed method. We suggest that ML input features should be created from the statistically significant predictors that are derived from the amino acids' physicochemical properties and the resolved structures' statistics. The statistical significance of the suggested features was assessed using a stepwise regression analysis that permitted the evaluation of the contribution and statistical significance of each predictor. We used the set of 380 statistically significant predictors as a learning model for the regression neural network that was trained using the PISCES30 dataset. When using the same dataset and metrics for benchmarking, our method outperformed all other methods reported in the literature for the CB513 nonredundant dataset (for the PBs, Q16 = 81.01%, and for the DSSP, Q3 = 85.99% and Q8 = 79.35%).
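For readers unfamiliar with the Q3, Q8 and Q16 figures quoted above, the sketch below shows how such per-residue accuracies are typically computed: the percentage of positions where the predicted class equals the reference class. The function names are hypothetical, and the 8-to-3 state reduction shown (H, G, I to helix; E, B to strand; everything else to coil) is one common convention, assumed here rather than taken from the paper.

```python
# One common DSSP 8-state to 3-state reduction (an assumption; other
# conventions exist and the cited paper may use a different one).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "C": "C", "-": "C",  # turns, bends, coil
}


def q_accuracy(predicted: str, reference: str) -> float:
    """Percentage of residues whose predicted class matches the reference.

    With 3, 8 or 16 classes in the alphabet this is Q3, Q8 or Q16.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


def reduce_to_3(states: str) -> str:
    """Map an 8-state DSSP string onto the 3-state alphabet."""
    return "".join(DSSP8_TO_3[s] for s in states)


if __name__ == "__main__":
    pred8, ref8 = "HHGGEETC", "HHHHEEBC"
    print("Q8:", q_accuracy(pred8, ref8))
    print("Q3:", q_accuracy(reduce_to_3(pred8), reduce_to_3(ref8)))
```

Because several 8-state classes collapse into one 3-state class, Q3 is never lower than Q8 for the same pair of sequences under this mapping.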
Affiliation(s)
- Yury V. Milchevskiy
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Str., 32, 119991 Moscow, Russia
- Vladislava Y. Milchevskaya
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Str., 32, 119991 Moscow, Russia
- Institute of Medical Statistics and Bioinformatics, University of Cologne, 50931 Cologne, Germany
- Alexei M. Nikitin
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Str., 32, 119991 Moscow, Russia
- Yury V. Kravatsky
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Str., 32, 119991 Moscow, Russia
- Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 119991 Moscow, Russia
|
30
|
Kumari R S, Sethi G, Krishna R. Development of multi-epitope based subunit vaccine against Mycobacterium Tuberculosis using immunoinformatics approach. J Biomol Struct Dyn 2023; 42:12365-12384. [PMID: 37880982 DOI: 10.1080/07391102.2023.2270065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Accepted: 10/07/2023] [Indexed: 10/27/2023]
Abstract
The etiological agent of tuberculosis (TB), Mycobacterium tuberculosis, is a deadly pathogen that adapts to thrive within the host. Since 2020, the COVID-19 pandemic has had colossal health, societal, and economic consequences, which have affected the reporting of new TB cases and deaths. As per the WHO 2022 report, 10.6 million people were diagnosed with TB, and 1.6 million died worldwide. The increase in resistant strains of tuberculosis is making it harder to reach the goals of the End TB strategy. A reliable and efficient TB vaccine that may avert both primary infection and recurrence of latent TB in adults and adolescents is of the utmost importance. In this study, we used computational techniques to predict the ability of HLA molecules to display epitopes for six TB proteins (PPE68, PE_PGRS17, EspC, LDT4, RpfD, and RpfC) to design the multi-epitope subunit vaccine. From the target proteins, the potential B-cell, helper T lymphocyte (HTL), and cytotoxic T lymphocyte (CTL) epitopes were predicted and linked together with LPA adjuvant, and the vaccine was designed. The vaccine's physicochemical analysis demonstrates that it is non-allergic, non-toxic, and antigenic. Then, the vaccine structure was predicted, improved, and verified to yield the optimal structure. The developed vaccine's binding mechanism with distinct immunogenic receptors (TLR2 and MHC-II) was assessed using molecular docking. Molecular dynamics simulation and MMPBSA analysis were performed to understand the complexes' dynamics and stability. Immune simulation was used to anticipate the vaccine's immunogenic attributes. In silico cloning was employed to demonstrate the efficient expression of the designed vaccine in E. coli as a host. Moreover, in vitro and in vivo animal testing is required to determine the efficacy of the in silico developed vaccine. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Savita Kumari R
- Department of Bioinformatics, Pondicherry University, Puducherry, India
- Guneswar Sethi
- Department of Bioinformatics, Pondicherry University, Puducherry, India
- Department of Predictive Toxicology, Korea Institute of Toxicology (KIT), Republic of Korea
- Ramadas Krishna
- Department of Bioinformatics, Pondicherry University, Puducherry, India
|
31
|
Gurdo N, Volke DC, McCloskey D, Nikel PI. Automating the design-build-test-learn cycle towards next-generation bacterial cell factories. N Biotechnol 2023; 74:1-15. [PMID: 36736693 DOI: 10.1016/j.nbt.2023.01.002] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 12/04/2022] [Revised: 01/15/2023] [Accepted: 01/22/2023] [Indexed: 02/04/2023]
Abstract
Automation is playing an increasingly significant role in synthetic biology. Groundbreaking technologies, developed over the past 20 years, have enormously accelerated the construction of efficient microbial cell factories. Integrating state-of-the-art tools (e.g. for genome engineering and analytical techniques) into the design-build-test-learn cycle (DBTLc) will shift the metabolic engineering paradigm from an almost artisanal labor towards a fully automated workflow. Here, we provide a perspective on how a fully automated DBTLc could be harnessed to construct the next-generation bacterial cell factories in a fast, high-throughput fashion. Innovative toolsets and approaches that pushed the boundaries in each segment of the cycle are reviewed to this end. We also present the most recent efforts on automation of the DBTLc, which heralds a fully autonomous pipeline for synthetic biology in the near future.
Affiliation(s)
- Nicolás Gurdo
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Daniel C Volke
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Douglas McCloskey
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Pablo Iván Nikel
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
|
32
|
Jiang Y, Wang R, Feng J, Jin J, Liang S, Li Z, Yu Y, Ma A, Su R, Zou Q, Ma Q, Wei L. Explainable Deep Hypergraph Learning Modeling the Peptide Secondary Structure Prediction. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2023; 10:e2206151. [PMID: 36794291 PMCID: PMC10104664 DOI: 10.1002/advs.202206151] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 10/24/2022] [Revised: 01/20/2023] [Indexed: 06/18/2023]
Abstract
Accurately predicting peptide secondary structures remains a challenging task due to the lack of discriminative information in short peptides. In this study, PHAT is proposed, a deep hypergraph learning framework for the prediction of peptide secondary structures and the exploration of downstream tasks. The framework includes a novel interpretable deep hypergraph multi-head attention network that uses residue-based reasoning for structure prediction. The algorithm can incorporate sequential semantic information from large-scale biological corpus and structural semantic information from multi-scale structural segmentation, leading to better accuracy and interpretability even with extremely short peptides. The interpretable models are able to highlight the reasoning of structural feature representations and the classification of secondary substructures. The importance of secondary structures in peptide tertiary structure reconstruction and downstream functional analysis is further demonstrated, highlighting the versatility of our models. To facilitate the use of the model, an online server is established which is accessible via http://inner.wei-group.net/PHAT/. The work is expected to assist in the design of functional peptides and contribute to the advancement of structural biology research.
Affiliation(s)
- Yi Jiang
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
- Ruheng Wang
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
- Jiuxin Feng
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
- Junru Jin
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
- Sirui Liang
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
- Zhongshen Li
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
- Yingying Yu
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
- Anjun Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
- Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Leyi Wei
- School of Software, Shandong University, Jinan, Shandong 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, Shandong 250101, China
|
33
|
Gao T, Zhao Y, Zhang L, Wang H. Secondary and Topological Structural Merge Prediction of Alpha-Helical Transmembrane Proteins Using a Hybrid Model Based on Hidden Markov and Long Short-Term Memory Neural Networks. Int J Mol Sci 2023; 24:5720. [PMID: 36982795 PMCID: PMC10057634 DOI: 10.3390/ijms24065720] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/07/2023] [Revised: 03/11/2023] [Accepted: 03/13/2023] [Indexed: 03/19/2023] Open
Abstract
Alpha-helical transmembrane proteins (αTMPs) play essential roles in drug targeting and disease treatments. Due to the challenges of using experimental methods to determine their structure, αTMPs have far fewer known structures than soluble proteins. The topology of transmembrane proteins (TMPs) can determine the spatial conformation relative to the membrane, while the secondary structure helps to identify their functional domain. They are highly correlated on αTMPs sequences, and achieving a merge prediction is instructive for further understanding the structure and function of αTMPs. In this study, we implemented a hybrid model combining Deep Learning Neural Networks (DNNs) with a Class Hidden Markov Model (CHMM), namely HDNNtopss. DNNs extract rich contextual features through stacked attention-enhanced Bidirectional Long Short-Term Memory (BiLSTM) networks and Convolutional Neural Networks (CNNs), and CHMM captures state-associative temporal features. The hybrid model not only reasonably considers the probability of the state path but also has a fitting and feature-extraction capability for deep learning, which enables flexible prediction and makes the resulting sequence more biologically meaningful. It outperforms current advanced merge-prediction methods with a Q4 of 0.779 and an MCC of 0.673 on the independent test dataset, which have practical, solid significance. In comparison to advanced prediction methods for topological and secondary structures, it achieves the highest topology prediction with a Q2 of 0.884, which has a strong comprehensive performance. At the same time, we implemented a joint training method, Co-HDNNtopss, and achieved a good performance to provide an important reference for similar hybrid-model training.
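The abstract reports a Matthews correlation coefficient (MCC) alongside its Q4 and Q2 accuracies. As a reminder of what MCC measures, here is a minimal sketch of the standard binary form, computed from confusion-matrix counts. The function name is hypothetical, and the binary setting is an assumption: the paper's 0.673 may come from a multiclass or per-class variant that this sketch does not reproduce.

```python
import math


def binary_mcc(y_true: list[int], y_pred: list[int]) -> float:
    """Matthews correlation coefficient for 0/1 labels.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    returned as 0.0 when the denominator is zero (a common convention).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # 1 = residue in a transmembrane helix, 0 = not (illustrative labels).
    print(binary_mcc([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```

Unlike plain accuracy, MCC stays near zero for a predictor that ignores the input, even when one class dominates, which is why it is often reported next to Q-scores.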
Affiliation(s)
- Ting Gao
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun 130117, China; (T.G.); (Y.Z.)
| | - Yutong Zhao
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun 130117, China; (T.G.); (Y.Z.)
| | - Li Zhang
- School of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China;
| | - Han Wang
- School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun 130117, China; (T.G.); (Y.Z.)
| |
|
34
|
Gormez Y, Aydin Z. IGPRED-MultiTask: A Deep Learning Model to Predict Protein Secondary Structure, Torsion Angles and Solvent Accessibility. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1104-1113. [PMID: 35849663 DOI: 10.1109/tcbb.2022.3191395] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 05/04/2023]
Abstract
Protein secondary structure, solvent accessibility and torsion angle predictions are preliminary steps to predict 3D structure of a protein. Deep learning approaches have achieved significant improvements in predicting various features of protein structure. In this study, IGPRED-Multitask, a deep learning model with multi-task learning architecture based on deep inception network, graph convolutional network and a bidirectional long short-term memory is proposed. Moreover, hyper-parameters of the model are fine-tuned using Bayesian optimization, which is faster and more effective than grid search. The same benchmark test data sets as in the OPUS-TASS paper including TEST2016, TEST2018, CASP12, CASP13, CASPFM, HARD68, CAMEO93, CAMEO93_HARD, as well as the train and validation sets, are used for fair comparison with the literature. Statistically significant improvements are observed in secondary structure prediction on 4 datasets, in phi angle prediction on 2 datasets and in psi angle prediction on 3 datasets compared to the state-of-the-art methods. For solvent accessibility prediction, TEST2016 and TEST2018 datasets are used only to assess the performance of the proposed model.
|
35
|
Yuan L, Ma Y, Liu Y. Ensemble deep learning models for protein secondary structure prediction using bidirectional temporal convolution and bidirectional long short-term memory. Front Bioeng Biotechnol 2023; 11:1051268. [PMID: 36860882 PMCID: PMC9968878 DOI: 10.3389/fbioe.2023.1051268] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/26/2022] [Accepted: 02/03/2023] [Indexed: 02/16/2023] Open
Abstract
Protein secondary structure prediction (PSSP) is a challenging task in computational biology. However, existing models with deep architectures are not sufficient and comprehensive for deep long-range feature extraction of long sequences. This paper proposes a novel deep learning model to improve protein secondary structure prediction. In the model, our proposed bidirectional temporal convolutional network (BTCN) can extract the bidirectional deep local dependencies in protein sequences segmented by the sliding window technique, the bidirectional long short-term memory (BLSTM) network can extract the global interactions between residues, and our proposed multi-scale bidirectional temporal convolutional network (MSBTCN) can further capture the bidirectional multi-scale long-range features of residues while preserving the hidden layer information more comprehensively. In particular, we also propose that fusing the features of 3-state and 8-state protein secondary structure prediction can further improve the prediction accuracy. Moreover, we also propose and compare multiple novel deep models by combining bidirectional long short-term memory with temporal convolutional network (TCN), reverse temporal convolutional network (RTCN), multi-scale temporal convolutional network (multi-scale bidirectional temporal convolutional network), bidirectional temporal convolutional network and multi-scale bidirectional temporal convolutional network, respectively. Furthermore, we demonstrate that the reverse prediction of secondary structure outperforms the forward prediction, suggesting that amino acids at later positions have a greater impact on secondary structure recognition. Experimental results on benchmark datasets including CASP10, CASP11, CASP12, CASP13, CASP14, and CB513 show that our methods achieve better prediction performance compared to five state-of-the-art methods.
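The sliding window segmentation mentioned in this abstract can be pictured with a small sketch. It is only an assumption-laden illustration: the function name, the window size, and the padding character `X` are hypothetical, and the abstract does not state how the authors pad or size their windows.

```python
def sliding_windows(sequence: str, window: int = 5, pad: str = "X") -> list[str]:
    """Return one fixed-length window per residue, centred on that residue.

    The sequence ends are padded so that every residue, including the first
    and last, sits in the middle of its own window.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window] for i in range(len(sequence))]


# Toy sequence (illustrative only): five residues give five windows.
windows = sliding_windows("MKTAY", window=3)
```

Each window would then be encoded (for example, as a profile or embedding matrix) before being passed to a local feature extractor; that encoding step is outside the scope of this sketch.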
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
| | - Yuming Ma
- *Correspondence: Yuming Ma; Yihui Liu
| | - Yihui Liu
- *Correspondence: Yuming Ma; Yihui Liu
| |
|
36
|
Gogoi CR, Rahman A, Saikia B, Baruah A. Protein Dihedral Angle Prediction: The State of the Art. ChemistrySelect 2023. [DOI: 10.1002/slct.202203427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Affiliation(s)
| | - Aziza Rahman
- Department of Chemistry Dibrugarh University Dibrugarh Assam India
| | - Bondeepa Saikia
- Department of Chemistry Dibrugarh University Dibrugarh Assam India
| | - Anupaul Baruah
- Department of Chemistry Dibrugarh University Dibrugarh Assam India
| |
|
37
|
Milchevskiy YV, Milchevskaya VY, Kravatsky YV. Method to Generate Complex Predictive Features for Machine Learning-Based Prediction of the Local Structure and Functions of Proteins. Mol Biol 2023. [DOI: 10.1134/s0026893323010089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
|
38
|
Yuan L, Ma Y, Liu Y. Protein secondary structure prediction based on Wasserstein generative adversarial networks and temporal convolutional networks with convolutional block attention modules. Mathematical Biosciences and Engineering: MBE 2023; 20:2203-2218. [PMID: 36899529 DOI: 10.3934/mbe.2023102] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/18/2023]
Abstract
As an important task in bioinformatics, protein secondary structure prediction (PSSP) not only benefits protein function research and tertiary structure prediction, but also promotes the design and development of new drugs. However, current PSSP methods cannot sufficiently extract effective features. In this study, we propose a novel deep learning model WGACSTCN, which combines Wasserstein generative adversarial network with gradient penalty (WGAN-GP), convolutional block attention module (CBAM) and temporal convolutional network (TCN) for 3-state and 8-state PSSP. In the proposed model, the mutual game of generator and discriminator in the WGAN-GP module can effectively extract protein features, our CBAM-TCN local extraction module can capture key deep local interactions in protein sequences segmented by the sliding window technique, and the CBAM-TCN long-range extraction module can further capture the key deep long-range interactions in sequences. We evaluate the performance of the proposed model on seven benchmark datasets. Experimental results show that our model exhibits better prediction performance compared to the four state-of-the-art models. The proposed model has strong feature extraction ability, which can extract important information more comprehensively.
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| | - Yuming Ma
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| | - Yihui Liu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
| |
|
39
|
Mufassirin MMM, Newton MAH, Sattar A. Artificial intelligence for template-free protein structure prediction: a comprehensive review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10350-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
|
40
|
Newton MH, Zaman R, Mataeimoghadam F, Rahman J, Sattar A. Constraint Guided Beta-Sheet Refinement for Protein Structure Prediction. Comput Biol Chem 2022; 101:107773. [DOI: 10.1016/j.compbiolchem.2022.107773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Revised: 09/15/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022]
|
41
|
Salaikumaran MR, Kasamuthu PS, Aathmanathan VS, Burra VLSP. An in silico approach to study the role of epitope order in the multi-epitope-based peptide (MEBP) vaccine design. Sci Rep 2022; 12:12584. [PMID: 35869117 PMCID: PMC9307121 DOI: 10.1038/s41598-022-16445-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2021] [Accepted: 07/11/2022] [Indexed: 11/09/2022] Open
Abstract
With different countries facing multiple waves, with some SARS-CoV-2 variants more deadly and virulent, the COVID-19 pandemic is becoming more dangerous by the day and the world is facing an even more dreadful extended pandemic with exponential positive cases and increasing death rates. There is an urgent need for more efficient and faster methods of vaccine development against SARS-CoV-2. Compared to experimental protocols, the opportunities to innovate are very high in immunoinformatics/in silico approaches, especially with the recent adoption of structural bioinformatics in peptide vaccine design. In recent times, multi-epitope-based peptide vaccine candidates (MEBPVCs) have shown extraordinarily high humoral and cellular responses to immunization. Most of the publications claim that respective reported MEBPVC(s) assembled using a set of in silico predicted epitopes, to be the computationally validated potent vaccine candidate(s) ready for experimental validation. However, in this article, for a given set of predicted epitopes, it is shown that the published MEBPVC is one among the many possible variants and there is high likelihood of finding more potent MEBPVCs than the published candidates. To test the same, a methodology is developed where novel MEBP variants are derived by changing the epitope order of the published MEBPVC. Further, to overcome the limitations of current qualitative methods of assessment of MEBPVC, to enable quantitative comparison and ranking for the discovery of more potent MEBPVCs, novel predictors, Percent Epitope Accessibility (PEA), Receptor specific MEBP vaccine potency (RMVP), MEBP vaccine potency (MVP) are introduced. The MEBP variants indeed showed varied MVP scores indicating varied immunogenicity. Further, the MEBP variants with IDs, SPVC_446 and SPVC_537, had the highest MVP scores indicating these variants to be more potent MEBPVCs than the published MEBPVC and hence should be preferred candidates for immediate experimental testing and validation. The method enables quicker selection and high throughput experimental validation of vaccine candidates. This study also opens the opportunity to develop new software tools for designing more potent MEBPVCs in less time.
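Deriving variants by changing epitope order, as this abstract describes, amounts to enumerating orderings of a fixed epitope set. The sketch below shows that enumeration only. The function name, the placeholder epitope strings, and the `GG` linker are hypothetical, and the PEA, RMVP and MVP scores are not implemented because the abstract does not define them.

```python
from itertools import permutations
from typing import Iterable, Iterator


def epitope_order_variants(epitopes: Iterable[str], linker: str = "") -> Iterator[str]:
    """Yield one concatenated peptide per ordering of the given epitopes."""
    for order in permutations(epitopes):
        yield linker.join(order)


# Placeholder epitopes and linker (illustrative only, not from the paper).
epitopes = ["AAK", "CDE", "FGH"]
variants = list(epitope_order_variants(epitopes, linker="GG"))
```

The number of orderings grows factorially with the number of epitopes, so a real screen over many epitopes would need sampling or pruning rather than full enumeration.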
|
42
|
Yuan L, Hu X, Ma Y, Liu Y. DLBLS_SS: protein secondary structure prediction using deep learning and broad learning system. RSC Adv 2022; 12:33479-33487. [PMID: 36505696 PMCID: PMC9682407 DOI: 10.1039/d2ra06433b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 11/16/2022] [Indexed: 11/24/2022] Open
Abstract
Protein secondary structure prediction (PSSP) is not only beneficial to the study of protein structure and function but also to the development of drugs. As a challenging task in computational biology, experimental methods for PSSP are time-consuming and expensive. In this paper, we propose a novel PSSP model DLBLS_SS based on deep learning and broad learning system (BLS) to predict 3-state and 8-state secondary structure. We first use a bidirectional long short-term memory (BLSTM) network to extract global features in residue sequences. Then, our proposed SEBTCN based on temporal convolutional networks (TCN) and channel attention can capture bidirectional key long-range dependencies in sequences. We also use BLS to rapidly optimize fused features while further capturing local interactions between residues. We conduct extensive experiments on public test sets including CASP10, CASP11, CASP12, CASP13, CASP14 and CB513 to evaluate the performance of the model. Experimental results show that our model exhibits better 3-state and 8-state PSSP performance compared to five state-of-the-art models.
Affiliation(s)
- Lu Yuan
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Xiaopei Hu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Yuming Ma
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| | - Yihui Liu
- School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences) Jinan 250353 China
| |
|
43
|
Targeting hydrophobicity in biofilm-associated protein (Bap) as a novel antibiofilm strategy against Staphylococcus aureus biofilm. Biophys Chem 2022; 289:106860. [DOI: 10.1016/j.bpc.2022.106860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Revised: 07/20/2022] [Accepted: 07/20/2022] [Indexed: 11/23/2022]
|
44
|
Qing R, Hao S, Smorodina E, Jin D, Zalevsky A, Zhang S. Protein Design: From the Aspect of Water Solubility and Stability. Chem Rev 2022; 122:14085-14179. [PMID: 35921495 PMCID: PMC9523718 DOI: 10.1021/acs.chemrev.1c00757] [Citation(s) in RCA: 102] [Impact Index Per Article: 34.0] [Received: 08/31/2021] [Indexed: 12/13/2022]
Abstract
Water solubility and structural stability are key merits for proteins defined by the primary sequence and 3D-conformation. Their manipulation represents important aspects of the protein design field that relies on the accurate placement of amino acids and molecular interactions, guided by underlying physiochemical principles. Emulated designer proteins with well-defined properties both fuel the knowledge-base for more precise computational design models and are used in various biomedical and nanotechnological applications. The continuous developments in protein science, increasing computing power, new algorithms, and characterization techniques provide sophisticated toolkits for solubility design beyond guess work. In this review, we summarize recent advances in the protein design field with respect to water solubility and structural stability. After introducing fundamental design rules, we discuss the transmembrane protein solubilization and de novo transmembrane protein design. Traditional strategies to enhance protein solubility and structural stability are introduced. The designs of stable protein complexes and high-order assemblies are covered. Computational methodologies behind these endeavors, including structure prediction programs, machine learning algorithms, and specialty software dedicated to the evaluation of protein solubility and aggregation, are discussed. The findings and opportunities for Cryo-EM are presented. This review provides an overview of significant progress and prospects in accurate protein design for solubility and stability.
Affiliation(s)
- Rui Qing
- State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- The David H. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Shilei Hao
- Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, College of Bioengineering, Chongqing University, Chongqing 400030, China
| | - Eva Smorodina
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo 0424, Norway
| | - David Jin
- Avalon GloboCare Corp., Freehold, New Jersey 07728, United States
| | - Arthur Zalevsky
- Laboratory of Bioinformatics Approaches in Combinatorial Chemistry and Biology, Shemyakin−Ovchinnikov Institute of Bioorganic Chemistry RAS, Moscow 117997, Russia
| | - Shuguang Zhang
- Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| |
|
45
|
Wang L, Zhong H, Xue Z, Wang Y. Res-Dom: predicting protein domain boundary from sequence using deep residual network and Bi-LSTM. Bioinformatics Advances 2022; 2:vbac060. [PMID: 36699417 PMCID: PMC9710680 DOI: 10.1093/bioadv/vbac060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/22/2022] [Revised: 07/01/2022] [Accepted: 08/30/2022] [Indexed: 01/28/2023]
Abstract
Motivation: Protein domains are the basic units of proteins that can fold, function and evolve independently. Protein domain boundary partition plays an important role in protein structure prediction, understanding their biological functions, annotating their evolutionary mechanisms and protein design. Although there are many methods that have been developed to predict domain boundaries from protein sequence over the past two decades, there is still much room for improvement. Results: In this article, a novel domain boundary prediction tool called Res-Dom was developed, which is based on a deep residual network, bidirectional long short-term memory (Bi-LSTM) and transfer learning. We used deep residual neural networks to extract higher-order residue-related information. In addition, we also used a pre-trained protein language model called ESM to extract sequence embedded features, which can summarize sequence context information more abundantly. To improve the global representation of these deep residual networks, a Bi-LSTM network was also designed to consider long-range interactions between residues. Res-Dom was then tested on an independent test set including 342 proteins and generated correct single-domain and multi-domain classifications with a Matthews correlation coefficient of 0.668, which was 17.6% higher than the second-best compared method. For domain boundaries, the normalized domain overlapping score of Res-Dom was 0.849, which was 5% higher than the second-best compared method. Furthermore, Res-Dom required significantly less time than most of the recently developed state-of-the-art domain prediction methods. Availability and implementation: All source code, datasets and model are available at http://isyslab.info/Res-Dom/.
Affiliation(s)
- Lei Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China; School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Haolin Zhong
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Zhidong Xue
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Yan Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China; School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| |
|
46
|
Roca-Martinez J, Lazar T, Gavalda-Garcia J, Bickel D, Pancsa R, Dixit B, Tzavella K, Ramasamy P, Sanchez-Fornaris M, Grau I, Vranken WF. Challenges in describing the conformation and dynamics of proteins with ambiguous behavior. Front Mol Biosci 2022; 9:959956. [PMID: 35992270 PMCID: PMC9382080 DOI: 10.3389/fmolb.2022.959956] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/02/2022] [Accepted: 06/27/2022] [Indexed: 11/13/2022] Open
Abstract
Traditionally, our understanding of how proteins operate and how evolution shapes them is based on two main data sources: the overall protein fold and the protein amino acid sequence. However, a significant part of the proteome shows highly dynamic and/or structurally ambiguous behavior, which cannot be correctly represented by the traditional fixed set of static coordinates. Representing such protein behaviors remains challenging and necessarily involves a complex interpretation of conformational states, including probabilistic descriptions. Relating protein dynamics and multiple conformations to their function as well as their physiological context (e.g., post-translational modifications and subcellular localization), therefore, remains elusive for much of the proteome, with studies to investigate the effect of protein dynamics relying heavily on computational models. We here investigate the possibility of delineating three classes of protein conformational behavior: order, disorder, and ambiguity. These definitions are explored based on three different datasets, using interpretable machine learning from a set of features, from AlphaFold2 to sequence-based predictions, to understand the overlap and differences between these datasets. This forms the basis for a discussion on the current limitations in describing the behavior of dynamic and ambiguous proteins.
Affiliation(s)
- Joel Roca-Martinez
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
| | - Tamas Lazar
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- VIB-VUB Center for Structural Biology, Brussels, Belgium
| | - Jose Gavalda-Garcia
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
| | - David Bickel
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
| | - Rita Pancsa
- Research Centre for Natural Sciences, Institute of Enzymology, Budapest, Hungary
| | - Bhawna Dixit
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
- IBiTech-Biommeda, Universiteit Gent, Gent, Belgium
| | - Konstantina Tzavella
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
| | - Pathmanaban Ramasamy
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
- VIB-UGent Center for Medical Biotechnology, Universiteit Gent, Gent, Belgium
| | - Maite Sanchez-Fornaris
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
- Department of Computer Sciences, University of Camagüey, Camagüey, Cuba
| | - Isel Grau
- Information Systems, Eindhoven University of Technology, Eindhoven, Netherlands
| | - Wim F. Vranken
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
| |
|
47
|
Pulse labeling reveals the tail end of protein folding by proteome profiling. Cell Rep 2022; 40:111096. [PMID: 35858568 PMCID: PMC9893312 DOI: 10.1016/j.celrep.2022.111096] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 04/12/2021] [Revised: 05/18/2022] [Accepted: 06/23/2022] [Indexed: 02/04/2023] Open
Abstract
Accurate and efficient folding of nascent protein sequences into their native states requires support from the protein homeostasis network. Herein we probe which newly translated proteins are thermo-sensitive, making them susceptible to misfolding and aggregation under heat stress using pulse-SILAC mass spectrometry. We find a distinct group of proteins that is highly sensitive to this perturbation when newly synthesized but not once matured. These proteins are abundant and highly structured. Notably, they display a tendency to form β sheet secondary structures, have more complex folding topology, and are enriched for chaperone-binding motifs, suggesting a higher demand for chaperone-assisted folding. These polypeptides are also more often components of stable protein complexes in comparison with other proteins. Combining these findings suggests the existence of a specific subset of proteins in the cell that is particularly vulnerable to misfolding and aggregation following synthesis before reaching the native state.
|
48
|
Huffman A, Ong E, Hur J, D’Mello A, Tettelin H, He Y. COVID-19 vaccine design using reverse and structural vaccinology, ontology-based literature mining and machine learning. Brief Bioinform 2022; 23:bbac190. [PMID: 35649389 PMCID: PMC9294427 DOI: 10.1093/bib/bbac190] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 01/30/2022] [Revised: 04/13/2022] [Accepted: 04/26/2022] [Indexed: 12/11/2022] Open
Abstract
Rational vaccine design, especially vaccine antigen identification and optimization, is critical to successful and efficient vaccine development against various infectious diseases including coronavirus disease 2019 (COVID-19). In general, computational vaccine design includes three major stages: (i) identification and annotation of experimentally verified gold standard protective antigens through literature mining, (ii) rational vaccine design using reverse vaccinology (RV) and structural vaccinology (SV) and (iii) post-licensure vaccine success and adverse event surveillance and its usage for vaccine design. Protegen is a database of experimentally verified protective antigens, which can be used as gold standard data for rational vaccine design. RV predicts protective antigen targets primarily from genome sequence analysis. SV refines antigens through structural engineering. Recently, RV and SV approaches, with the support of various machine learning methods, have been applied to COVID-19 vaccine design. The analysis of post-licensure vaccine adverse event report data also provides valuable results in terms of vaccine safety and how vaccines should be used or paused. Ontology standardizes and incorporates heterogeneous data and knowledge in a human- and computer-interpretable manner, further supporting machine learning and vaccine design. Future directions on rational vaccine design are discussed.
Affiliation(s)
- Anthony Huffman
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
| | - Edison Ong
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
| | - Junguk Hur
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota 58202, USA
| | - Adonis D’Mello
- Department of Microbiology and Immunology, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Hervé Tettelin
- Department of Microbiology and Immunology, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Yongqun He
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
- Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
| |
|
49
|
Gou X, Feng X, Shi H, Guo T, Xie R, Liu Y, Wang Q, Li H, Yang B, Chen L, Lu Y. PPVED: A machine learning tool for predicting the effect of single amino acid substitution on protein function in plants. Plant Biotechnology Journal 2022; 20:1417-1431. [PMID: 35398963 PMCID: PMC9241370 DOI: 10.1111/pbi.13823] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/16/2021] [Accepted: 04/03/2022] [Indexed: 05/31/2023]
Abstract
Single amino acid substitution (SAAS) produces the most common variant of protein function change under physiological conditions. As the number of SAAS events in plants has increased exponentially, an effective prediction tool is required to help identify and distinguish functional SAASs from the whole genome as either potentially causal traits or as variants. Here, we constructed a plant SAAS database that stores 12 865 SAASs in 6172 proteins and developed a tool called Plant Protein Variation Effect Detector (PPVED) that predicts the effect of SAASs on protein function in plants. PPVED achieved an 87% predictive accuracy when applied to plant SAASs, an accuracy that was much higher than those from six human database software: SIFT, PROVEAN, PANTHER-PSEP, PhD-SNP, PolyPhen-2, and MutPred2. The predictive effect of six SAASs from three proteins in Arabidopsis and maize was validated with wet lab experiments, of which five substitution sites were accurately predicted. PPVED could facilitate the identification and characterization of genetic variants that explain observed phenotype variations in plants, contributing to solutions for challenges in functional genomics and systems biology. PPVED can be accessed under a CC-BY (4.0) license via http://www.ppved.org.cn.
Affiliation(s)
- Xiangjian Gou
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Wenjiang, Sichuan, China
- Maize Research Institute, Sichuan Agricultural University, Wenjiang, Sichuan, China
| | - Xuanjun Feng
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Wenjiang, Sichuan, China
- Maize Research Institute, Sichuan Agricultural University, Wenjiang, Sichuan, China
| | - Haoran Shi
- Chengdu Academy of Agricultural and Forestry Sciences, Wenjiang, Sichuan, China
| | - Tingting Guo
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, Hubei, China
| | - Rongqian Xie
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Wenjiang, Sichuan, China
- Maize Research Institute, Sichuan Agricultural University, Wenjiang, Sichuan, China
| | - Yaxi Liu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Wenjiang, Sichuan, China
- Triticeae Research Institute, Sichuan Agricultural University, Wenjiang, Sichuan, China
| | - Qi Wang
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Wenjiang, Sichuan, China
| | - Hongxiang Li
- College of Information Engineering, Sichuan Agricultural University, Ya’an, Sichuan, China
| | - Banglie Yang
- College of Information Engineering, Sichuan Agricultural University, Ya’an, Sichuan, China
| | - Lixue Chen
- College of Information Engineering, Sichuan Agricultural University, Ya’an, Sichuan, China
| | - Yanli Lu
- State Key Laboratory of Crop Gene Exploration and Utilization in Southwest China, Wenjiang, Sichuan, China
- Maize Research Institute, Sichuan Agricultural University, Wenjiang, Sichuan, China
|
50
|
Huang X, Shi Y, Yan J, Qu W, Li X, Tan J. LPI-CSFFR: Combining serial fusion with feature reuse for predicting LncRNA-protein interactions. Comput Biol Chem 2022; 99:107718. [PMID: 35785626 DOI: 10.1016/j.compbiolchem.2022.107718] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/28/2022] [Revised: 05/24/2022] [Accepted: 06/22/2022] [Indexed: 11/03/2022]
Abstract
Long non-coding RNAs (lncRNAs) play important roles in a series of life activities, and they function primarily with proteins. Wet-lab methods for studying lncRNA-protein interactions (lncRPIs) are time-consuming and expensive. In this study, we propose a novel feature fusion method, LPI-CSFFR, to train and predict lncRPIs based on a Convolutional Neural Network (CNN) with feature reuse and serial fusion of sequences, secondary structures, and physicochemical properties of proteins and lncRNAs. The experimental results indicate that LPI-CSFFR achieves excellent performance on the datasets RPI1460 and RPI1807, with accuracies of 83.7 % and 98.1 %, respectively. We further compare LPI-CSFFR with state-of-the-art methods on the same benchmark datasets to evaluate its performance. In addition, to test the generalization performance of the model, we independently test sample pairs from five model organisms, among which Mus musculus gives the highest prediction accuracy, 99.5 %, and we find multiple hotspot proteins after constructing an interaction network. Finally, we test the predictive power of LPI-CSFFR on sample pairs with unknown interactions. The results indicate that LPI-CSFFR is promising for predicting potential lncRPIs. The relevant source code and the data used in this study are available at https://github.com/JianjunTan-Beijing/LPI-CSFFR.
Affiliation(s)
- Xiaoqian Huang
  - Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Yi Shi
  - Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Jing Yan
  - Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Wenyan Qu
  - Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Xiaoyi Li
  - Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Jianjun Tan
  - Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
|