1
Yuan R, Zhang J, Zhou J, Cong Q. Recent progress and future challenges in structure-based protein-protein interaction prediction. Mol Ther 2025; 33:2252-2268. [PMID: 40195117 DOI: 10.1016/j.ymthe.2025.04.003] [Received: 02/07/2025] [Revised: 03/05/2025] [Accepted: 04/02/2025] [Indexed: 04/09/2025]
Abstract
Protein-protein interactions (PPIs) play a fundamental role in cellular processes, and understanding these interactions is crucial for advances in both basic biological science and biomedical applications. This review presents an overview of recent progress in computational methods for modeling protein complexes and predicting PPIs based on 3D structures, focusing on the transformative role of artificial intelligence-based approaches. We further discuss the expanding biomedical applications of PPI research, including the elucidation of disease mechanisms, drug discovery, and therapeutic design. Despite these advances, significant challenges remain in predicting host-pathogen interactions, interactions between intrinsically disordered regions, and interactions related to immune responses. These challenges merit further exploration and represent the frontier of research in this field.
Affiliation(s)
- Rongqing Yuan
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jing Zhang
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.

2
Jiang J, Zhang C, Ke L, Hayes N, Zhu Y, Qiu H, Zhang B, Zhou T, Wei GW. A review of machine learning methods for imbalanced data challenges in chemistry. Chem Sci 2025; 16:7637-7658. [PMID: 40271022 PMCID: PMC12013631 DOI: 10.1039/d5sc00270b] [Received: 01/13/2025] [Accepted: 04/06/2025] [Indexed: 04/25/2025]
Abstract
Imbalanced data, where certain classes are significantly underrepresented in a dataset, is a widespread machine learning (ML) challenge across various fields of chemistry, yet it remains inadequately addressed. This data imbalance can lead to biased ML or deep learning (DL) models, which fail to accurately predict the underrepresented classes, thus limiting the robustness and applicability of these models. With the rapid advancement of ML and DL algorithms, several promising solutions to this issue have emerged, prompting the need for a comprehensive review of current methodologies. In this review, we examine the prominent ML approaches used to tackle the imbalanced data challenge in different areas of chemistry, including resampling techniques, data augmentation techniques, algorithmic approaches, and feature engineering strategies. Each of these methods is evaluated in the context of its application across various aspects of chemistry, such as drug discovery, materials science, cheminformatics, and catalysis. We also explore future directions for overcoming the imbalanced data challenge and emphasize data augmentation via physical models, large language models (LLMs), and advanced mathematics. The benefit of balanced data in new material design and production and the persistent challenges are discussed. Overall, this review aims to elucidate the prevalent ML techniques applied to mitigate the impacts of imbalanced data within the field of chemistry and offer insights into future directions for research and application.
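Of the resampling techniques named in this abstract, random oversampling is the simplest to illustrate. The sketch below is not taken from the review: the function name `random_oversample`, the toy descriptors, and the "active"/"inactive" labels are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples that belong to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: 6 "inactive" compounds versus 2 "active" ones.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y = ["inactive"] * 6 + ["active"] * 2

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

Oversampling only repeats existing minority samples, so it is best applied to the training split alone; applying it before a train/test split would leak duplicates into the test set.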
Affiliation(s)
- Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, USA
- Chunhuan Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Lu Ke
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Nicole Hayes
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, USA
- Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Huahai Qiu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Tianshou Zhou
- Key Laboratory of Computational Mathematics, Guangdong Province, School of Mathematics, Sun Yat-sen University, Guangzhou 510006, P. R. China
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, USA

3
Shao D, Zou Y, Ma L, Yi S. Multiscale and global-local U-Net for protein-protein interaction site prediction. Comput Biol Chem 2025; 118:108485. [PMID: 40306099 DOI: 10.1016/j.compbiolchem.2025.108485] [Received: 12/13/2024] [Revised: 03/18/2025] [Accepted: 04/21/2025] [Indexed: 05/02/2025]
Abstract
Precise prediction of protein-protein interaction sites (PPIS) is fundamental to deciphering cellular mechanisms and accelerating therapeutic discovery. Despite significant advancements in computational approaches, current methods frequently fail to integrate multiscale features that simultaneously capture global context and local interactions. We present Multiscale and Global-Local U-Net for Protein-Protein Interaction Site Prediction (MGU-PPIS), a novel architecture designed to address this critical limitation. Our model leverages a U-Net framework with implemented multi-level pooling to extract comprehensive multiscale features. Within each scale, we synergistically combine Transformer networks, Graph Convolutional Networks (GCNs), and Graph Attention Networks (GATs) to simultaneously capture global patterns and local structural motifs. We implement Laplacian positional encoding to effectively represent global protein structural characteristics. In our framework, proteins are conceptualized as graph structures where individual residues function as nodes and their spatial relationships define edges. The model processes information through an innovative two-stage U-Net architecture, where output features from the initial stage serve as refined inputs for the subsequent stage. This dual-stage design, coupled with our graph-based representation, enables MGU-PPIS to extract a rich spectrum of multiscale features encompassing both global context and local interactions at each scale. Comprehensive experimental validation demonstrates that MGU-PPIS significantly outperforms state-of-the-art methods in predictive accuracy. Beyond introducing a novel computational strategy for PPIS prediction, our work establishes a foundation for advances in protein functional analysis and structure-based drug design.
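The graph representation and Laplacian positional encoding described in this abstract can be sketched generically. This is not the MGU-PPIS code: the 8 Å distance cutoff, the function names, and the toy coordinates are assumptions, and the encoding shown is the common recipe of taking the first non-trivial eigenvectors of the normalized graph Laplacian.

```python
import numpy as np


def residue_graph(coords, cutoff=8.0):
    """Adjacency matrix with residues as nodes; an edge joins two residues
    whose coordinates are closer than `cutoff` (assumed to be in angstroms)."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dists < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj


def laplacian_positional_encoding(adj, k=2):
    """Return the k eigenvectors that follow the trivial one for the
    symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)  # eigenvalues come back in ascending order
    return eigvecs[:, 1:k + 1]


# Hypothetical C-alpha trace: 6 residues spaced 3.8 angstroms apart along x.
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(6)])
adj = residue_graph(coords)
pe = laplacian_positional_encoding(adj, k=2)
```

Each row of `pe` would be concatenated with that residue's other node features. Eigenvector signs are arbitrary, which is why implementations of this encoding often flip signs at random during training.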
Affiliation(s)
- Dangguo Shao
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
- Yuyang Zou
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
- Lei Ma
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China.
- Sanli Yi
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China.

4
Han YL, Yin HH, Li C, Du J, He Y, Guan YX. Discovery of New Pentapeptide Inhibitors Against Amyloid-β Aggregation Using Word2Vec and Molecular Simulation. ACS Chem Neurosci 2025; 16:1055-1065. [PMID: 39999409 DOI: 10.1021/acschemneuro.4c00661] [Indexed: 02/27/2025]
Abstract
Alzheimer's disease (AD) is characterized by the aggregation of amyloid-β (Aβ) peptides into toxic oligomers and fibrils. The efficacy of existing peptide inhibitors based on the central hydrophobic core (CHC) sequence of Aβ42 remains limited due to self-aggregation or poor inhibition. This study aimed to identify novel pentapeptide inhibitors with high similarity and low binding energy to the CHC region LVFFA using a new computational screening workflow based on Word2Vec and molecular simulation. The antimicrobial peptides and human brain protein sequences were used for training the Word2Vec model. After tuning the parameters of the Word2Vec model, 1017 pentapeptides with high similarity to LVFFA were identified. Molecular docking was employed to estimate the affinity of the pentapeptides for the target of Aβ14-42 pentamer, and 103 peptides with favorable docking scores were obtained. Finally, five pentapeptides with a low binding energy and high binding stability via molecular dynamics simulation were experimentally validated using thioflavin T assays. Surprisingly, one pentapeptide, i.e., PALIR, exhibited significant inhibition surpassing the positive control LPFFN. This study demonstrates an effective combinatorial strategy to discover new peptide inhibitors. With PALIR representing a promising lead candidate, further optimization of PALIR could aid in the development of improved therapies to prevent amyloid toxicity in AD.
Affiliation(s)
- Yin-Lei Han
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
- Huan-Huan Yin
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
- Chen Li
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
- Jiangyue Du
- Department of General Practice, Sir Run Run Shaw Hospital of Zhejiang University, Hangzhou 310020, China
- Yi He
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- Yi-Xin Guan
- College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China

5
Koul M, Kaushik S, Singh K, Sharma D. VITALdb: to select the best viroinformatics tools for a desired virus or application. Brief Bioinform 2025; 26:bbaf084. [PMID: 40063348 PMCID: PMC11892104 DOI: 10.1093/bib/bbaf084] [Received: 10/14/2024] [Revised: 01/14/2025] [Accepted: 02/17/2025] [Indexed: 05/13/2025]
Abstract
The recent pandemics of viral diseases, COVID-19/mpox (humans) and lumpy skin disease (cattle), have kept us glued to viral research. These pandemics, along with the recent human metapneumovirus outbreak, have exposed the urgency of early diagnosis of viral infections, vaccine development, and discovery of novel antiviral drugs and therapeutics. To support this, an armamentarium of virus-specific computational tools is currently available. VITALdb (VIroinformatics Tools and ALgorithms database) is a resource of ~360 viroinformatics tools encompassing all major viruses (SARS-CoV-2, influenza virus, human immunodeficiency virus, papillomavirus, herpes simplex virus, hepatitis virus, dengue virus, Ebola virus, Zika virus, etc.) and several diverse applications [structural and functional annotation, antiviral peptide development, subspecies characterization, recognition of viral recombination, inhibitor identification, phylogenetic analysis, virus-host prediction, viral metagenomics, detection of mutation(s), primer design, etc.]. Resources, tools, and other utilities mentioned in this article will not only facilitate further developments in the realm of viroinformatics but also provide a strong impetus for translating fundamental knowledge into applied research. Most importantly, VITALdb is an indispensable resource for selecting the best tool(s) to carry out a desired task and hence will prove to be a vital database (VITALdb) for the scientific community. Database URL: https://compbio.iitr.ac.in/vitaldb.
Affiliation(s)
- Mira Koul
- Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India
- Shalini Kaushik
- Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India
- Kavya Singh
- Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India
- Deepak Sharma
- Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India

6
Charoenkwan P, Chumnanpuen P, Schaduangrat N, Shoombuatong W. Deepstack-ACE: A deep stacking-based ensemble learning framework for the accelerated discovery of ACE inhibitory peptides. Methods 2025; 234:131-140. [PMID: 39709069 DOI: 10.1016/j.ymeth.2024.12.005] [Received: 06/21/2024] [Revised: 11/27/2024] [Accepted: 12/07/2024] [Indexed: 12/23/2024]
Abstract
Identifying angiotensin-I-converting enzyme (ACE) inhibitory peptides accurately is crucial for understanding the primary factor that regulates the renin-angiotensin system and for providing guidance in developing new potential drugs. Given the inherent experimental complexities, using computational methods for in silico peptide identification could be indispensable for facilitating the high-throughput characterization of ACE inhibitory peptides. In this paper, we propose a novel deep stacking-based ensemble learning framework, termed Deepstack-ACE, to precisely identify ACE inhibitory peptides. In Deepstack-ACE, the input peptide sequences are fed into the word2vec embedding technique to generate sequence representations. Then, these representations are employed to train five powerful deep learning methods, including long short-term memory, convolutional neural network, multi-layer perceptron, gated recurrent unit network, and recurrent neural network, for the construction of base-classifiers. Finally, the optimized stacked model is constructed based on the best combination of selected base-classifiers. Benchmarking experiments showed that Deepstack-ACE attained a more accurate and robust identification of ACE inhibitory peptides compared to its base-classifiers and several conventional machine learning classifiers. Remarkably, in the independent test, our proposed model significantly outperformed the current state-of-the-art methods, with a balanced accuracy of 0.916, a sensitivity of 0.911, and a Matthews correlation coefficient of 0.826. Moreover, we developed a user-friendly web server for Deepstack-ACE, which is freely available at https://pmlabqsar.pythonanywhere.com/Deepstack-ACE. We anticipate that our proposed Deepstack-ACE model can provide a faster and reasonably accurate identification of ACE inhibitory peptides.
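For readers comparing the reported figures, the three metrics quoted in this abstract follow directly from a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the toy labels are assumptions and do not reproduce the paper's data.

```python
import math


def binary_metrics(y_true, y_pred):
    """Sensitivity, balanced accuracy and Matthews correlation coefficient
    for 0/1 labels. Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn)            # recall on the positive class
    specificity = tn / (tn + fp)            # recall on the negative class
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical predictions for 4 inhibitory (1) and 4 non-inhibitory (0) peptides.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Balanced accuracy and MCC are both less sensitive to class imbalance than plain accuracy, which is presumably why they are reported for an independent test set.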
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
- Pramote Chumnanpuen
- Department of Zoology, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; Kasetsart University International College (KUIC), Kasetsart University, Bangkok 10900, Thailand
- Nalini Schaduangrat
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.

7
Feng T, Chen X, Wu S, Tang W, Zhou H, Fang Z. Predicting the bacterial host range of plasmid genomes using the language model-based one-class support vector machine algorithm. Microb Genom 2025; 11. [PMID: 39932495 DOI: 10.1099/mgen.0.001355] [Indexed: 05/08/2025]
Abstract
The prediction of the plasmid host range is crucial for investigating the dissemination of plasmids and the transfer of resistance and virulence genes mediated by plasmids. Several machine learning-based tools have been developed to predict plasmid host ranges. These tools have been trained and tested based on the bacterial host records of plasmids in related databases. Typically, a plasmid genome in databases such as the National Center for Biotechnology Information is annotated with only one or a few bacterial hosts, which does not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobile plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to digitally represent the encoded proteins on plasmid genomes. Since it is difficult to confirm which host a particular plasmid definitely cannot enter, we developed a machine learning approach for predicting whether a plasmid can enter a specific bacterium as a no-negative samples learning task. Using multiple one-class support vector machine (SVM) models that do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera and 56 species. In the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods, and we found that the area under the curve (AUC), F1-score, recall, precision and accuracy of most taxonomic unit prediction models exceeded 0.9. Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, thus successfully predicting the majority of hosts previously reported. Through feature importance calculation for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity and survival. 
These findings provide significant insight into the mechanisms through which bacteria adjust to diverse environments through plasmids. The HRPredict algorithm is expected to facilitate in-depth research on the spread of broad-host-range plasmids and enable host-range predictions for novel plasmids reconstructed from microbiome sequencing data.
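The idea of learning from positive samples only can be illustrated with scikit-learn's `OneClassSVM`. This is a generic sketch, not HRPredict itself: the random vectors stand in for the word-vector representation of plasmid-encoded proteins, and the `nu` and `gamma` settings are arbitrary assumptions. In the setting the abstract describes, one such model would be trained per host taxonomic unit.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical 8-dimensional embeddings of plasmids recorded in one host taxon.
# Only positives are available; no plasmid is known to be a true negative.
known_host_plasmids = rng.normal(loc=0.0, scale=1.0, size=(200, 8))

# nu bounds the fraction of training points treated as outliers (assumed value).
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(known_host_plasmids)

# predict() returns +1 for "resembles the training positives" and -1 otherwise.
queries = np.vstack([
    np.zeros(8),        # close to the centre of the training distribution
    np.full(8, 25.0),   # far from anything seen in training
])
predictions = model.predict(queries)
```

Because no negatives are used in training, evaluation still needs some source of negative examples, which is the role of the indirectly constructed negative samples mentioned in the abstract.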
Affiliation(s)
- Tao Feng
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Guangzhou Chest Hospital, Hengzhigang Road 1066, Guangzhou, 510095, PR China
- Xirao Chen
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Shufang Wu
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Waijiao Tang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Hongwei Zhou
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Zhencheng Fang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China

8
Taha K. Protein-protein interaction detection using deep learning: A survey, comparative analysis, and experimental evaluation. Comput Biol Med 2025; 185:109449. [PMID: 39644584 DOI: 10.1016/j.compbiomed.2024.109449] [Received: 08/19/2024] [Revised: 11/13/2024] [Accepted: 11/14/2024] [Indexed: 12/09/2024]
Abstract
This survey paper provides a comprehensive analysis of various Deep Learning (DL) techniques and algorithms for detecting protein-protein interactions (PPIs). It examines the scalability, interpretability, accuracy, and efficiency of each technique, offering a detailed empirical and experimental evaluation. Empirically, the techniques are assessed based on four key criteria, while experimentally, they are ranked by specific algorithms and broader methodological categories. Deep Neural Networks (DNNs) demonstrated high accuracy but faced limitations such as overfitting and low interpretability. Convolutional Neural Networks (CNNs) were highly efficient at extracting hierarchical features from biological sequences, while Generative Stochastic Networks (GSNs) excelled in handling uncertainty. Long Short-Term Memory (LSTM) networks effectively captured temporal dependencies within PPI sequences, though they presented scalability challenges. This paper concludes with insights into potential improvements and future directions for advancing DL techniques in PPI identification, highlighting areas where further optimization can enhance performance and applicability.
Affiliation(s)
- Kamal Taha
- Department of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates.

9
Bowyer S, Allen DJ, Furnham N. Unveiling the ghost: machine learning's impact on the landscape of virology. J Gen Virol 2025; 106. [PMID: 39804261 DOI: 10.1099/jgv.0.002067] [Indexed: 05/02/2025]
Abstract
The complexity and speed of evolution in viruses with RNA genomes make predictive identification of variants with epidemic or pandemic potential challenging. In recent years, machine learning has become an increasingly capable technology for addressing this challenge, as advances in methods and computational power have dramatically improved the performance of models and led to their widespread adoption across industries and disciplines. Nascent applications of machine learning technology to virus research have now expanded, providing new tools for handling large-scale datasets and leading to a reshaping of existing workflows for phenotype prediction, phylogenetic analysis, drug discovery and more. This review explores how machine learning has been applied to and has impacted the study of viruses, before addressing the strengths and limitations of its techniques and finally highlighting the next steps that are needed for the technology to reach its full potential in this challenging and ever-relevant research area.
Affiliation(s)
- Sebastian Bowyer
- Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, UK
- David J Allen
- Department of Comparative Biomedical Sciences, Section Infection and Immunity, School of Veterinary Medicine, Faculty of Health and Medical Sciences, University of Surrey, Guildford, UK
- Nicholas Furnham
- Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, UK

10
Chen H, Liu J, Tang G, Hao G, Yang G. Bioinformatic Resources for Exploring Human-virus Protein-protein Interactions Based on Binding Modes. Genomics Proteomics Bioinformatics 2024; 22:qzae075. [PMID: 39404802 PMCID: PMC11658832 DOI: 10.1093/gpbjnl/qzae075] [Received: 03/16/2023] [Revised: 10/05/2024] [Accepted: 10/11/2024] [Indexed: 12/21/2024]
Abstract
Historically, there have been many outbreaks of viral diseases that have continued to claim millions of lives. Research on human-virus protein-protein interactions (PPIs) is vital to understanding the principles of human-virus relationships, providing an essential foundation for developing virus control strategies to combat diseases. The rapidly accumulating data on human-virus PPIs offer unprecedented opportunities for bioinformatics research around human-virus PPIs. However, available detailed analyses and summaries to help use these resources systematically and efficiently are lacking. Here, we comprehensively review the bioinformatic resources used in human-virus PPI research, and discuss and compare their functions, performance, and limitations. This review aims to provide researchers with a bioinformatic toolbox that will hopefully better facilitate the exploration of human-virus PPIs based on binding modes.
Affiliation(s)
- Huimin Chen
- State Key Laboratory of Green Pesticide, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, China
- Jiaxin Liu
- State Key Laboratory of Green Pesticide, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, China
- Gege Tang
- State Key Laboratory of Green Pesticide, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, China
- Gefei Hao
- State Key Laboratory of Green Pesticide, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, China
- State Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang 550025, China
- Guangfu Yang
- State Key Laboratory of Green Pesticide, International Joint Research Center for Intelligent Biosensor Technology and Health, Central China Normal University, Wuhan 430079, China

11
Zouari S, Ali F, Masmoudi A, Ghazalah SA, Alghamdi W, Kateb FA, Ibrahim N. Deep-GB: A novel deep learning model for globular protein prediction using CNN-BiLSTM architecture and enhanced PSSM with trisection strategy. IET Syst Biol 2024; 18:208-217. [PMID: 39514139 DOI: 10.1049/syb2.12108] [Received: 09/03/2024] [Revised: 09/30/2024] [Accepted: 10/27/2024] [Indexed: 11/16/2024]
Abstract
Globular proteins (GPs) play vital roles in a wide range of biological processes, encompassing enzymatic catalysis and immune responses. Enzymes, among these globular proteins, facilitate biochemical reactions, while others, such as haemoglobin, contribute to essential physiological functions such as oxygen transport. Given the importance of these roles, accurately identifying globular proteins is essential. To address the need for precise GP identification, this research introduces an innovative approach that employs a hybrid deep learning model called Deep-GB. We generated two datasets based on primary sequences and developed a novel feature descriptor called Consensus Sequence-based Trisection-Position Specific Scoring Matrix (CST-PSSM). The model training phase involved the application of deep learning techniques, including the bidirectional long short-term memory network (BiLSTM), gated recurrent unit (GRU), and convolutional neural network (CNN). The BiLSTM and CNN were hybridised for ensemble learning. The CST-PSSM-based ensemble model achieved the most accurate predictive outcomes, outperforming other competitive predictors across both training and testing datasets. This demonstrates the potential of harnessing deep learning for precise GP prediction as a robust tool to expedite research, streamline drug discovery, and unveil novel therapeutic targets.
Affiliation(s)
- Sonia Zouari
- National Engineering School of Sfax, University of Sfax, Sfax, Tunisia
- Farman Ali
- Department of Computer Science, Bahria University Islamabad Campus, Islamabad, Pakistan
- Atef Masmoudi
- Department of Computer Science, College of Computer Science, King Khalid University, Abha, Saudi Arabia
- Sarah Abu Ghazalah
- Department of Informatics and Computer System, College of Computer Science, King Khalid University, Abha, Saudi Arabia
- Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Faris A Kateb
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Nouf Ibrahim
- Family Medicine Clinic, Makkah Armed Force Medical Center, Makkah, Saudi Arabia

12
Zhang L, Wang S, Wang Y, Zhao T. HBFormer: a single-stream framework based on hybrid attention mechanism for identification of human-virus protein-protein interactions. Bioinformatics 2024; 40:btae724. [PMID: 39673490 PMCID: PMC11648999 DOI: 10.1093/bioinformatics/btae724] [Received: 09/12/2024] [Revised: 10/27/2024] [Accepted: 12/02/2024] [Indexed: 12/16/2024]
Abstract
MOTIVATION: Exploring human-virus protein-protein interactions (PPIs) is crucial for unraveling the underlying pathogenic mechanisms of viruses. Limitations in the coverage and scalability of high-throughput approaches have impeded the identification of certain key interactions. Current popular computational methods adopt a two-stream pipeline to identify PPIs, which can only achieve relation modeling of protein pairs at the classification phase. However, the fitting capacity of the classifier is insufficient to comprehensively mine the complex interaction patterns between protein pairs. RESULTS: In this study, we propose a pioneering single-stream framework, HBFormer, that combines a hybrid attention mechanism and a multimodal feature fusion strategy for identifying human-virus PPIs. The Transformer architecture based on hybrid attention can bridge the bidirectional information flows between human protein and viral protein, thus unifying joint feature learning and relation modeling of protein pairs. The experimental results demonstrate that HBFormer not only achieves superior performance on multiple human-virus PPI datasets but also outperforms 5 other state-of-the-art human-virus PPI identification methods. Moreover, ablation studies and scalability experiments further validate the effectiveness of our single-stream framework. AVAILABILITY AND IMPLEMENTATION: Code and datasets are available at https://github.com/RmQ5v/HBFormer.
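The contrast between a two-stream pipeline and a single-stream one can be made concrete with a toy attention layer. The sketch below is an assumption-laden illustration, not the HBFormer architecture: it concatenates the two proteins' token embeddings and applies one plain self-attention step with identity query/key/value projections, so that every human token can attend to every viral token and vice versa. All names and shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_stream_attention(human_tokens, virus_tokens):
    """Scaled dot-product self-attention over the concatenated pair.
    Learned projections are omitted (identity) to keep the sketch short."""
    tokens = np.concatenate([human_tokens, virus_tokens], axis=0)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = softmax(scores, axis=-1)
    return weights @ tokens, weights


rng = np.random.default_rng(0)
human = rng.normal(size=(5, 16))   # 5 hypothetical human-protein tokens
virus = rng.normal(size=(3, 16))   # 3 hypothetical viral-protein tokens

fused, weights = single_stream_attention(human, virus)
# Attention that human tokens pay to viral tokens (the cross-protein block).
cross_block = weights[:5, 5:]
```

In a two-stream design the off-diagonal blocks of `weights` would not exist, because each protein is encoded separately and the pair is only combined at the classifier.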
Affiliation(s)
- Liyuan Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150000, China
| | - Sicong Wang
- Institute of Bioinformatics, Harbin Institute of Technology, Harbin, Heilongjiang 150000, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150000, China
| | - Tianyi Zhao
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150000, China
|
13
|
Wang C, Zou Q. MFPSP: Identification of fungal species-specific phosphorylation site using offspring competition-based genetic algorithm. PLoS Comput Biol 2024; 20:e1012607. [PMID: 39556608 DOI: 10.1371/journal.pcbi.1012607] [Received: 07/26/2024] [Accepted: 11/03/2024] [Indexed: 11/20/2024]
Abstract
Protein phosphorylation is essential in various signal transduction and cellular processes. To date, most tools are designed for model organisms; only a handful of methods are suitable for prediction in fungal species, and their performance still leaves much to be desired. In this study, a novel tool called MFPSP is developed for phosphorylation site prediction in multiple fungal species. The amino acid sequence features were derived from physicochemical and distributed information, and an offspring competition-based genetic algorithm was applied to choose the most effective feature subset. The comparison results show that MFPSP achieves more advanced and balanced performance than several state-of-the-art toolkits. Feature contribution and interaction exploration indicate that the proposed model is efficient in uncovering concealed patterns within sequences. We anticipate that MFPSP will serve as a valuable bioinformatics tool, benefiting practical experiments by pre-screening potential phosphorylation sites and enhancing our functional understanding of phosphorylation modifications in fungi. The source code and datasets are accessible at https://github.com/AI4HKB/MFPSP/.
Affiliation(s)
- Chao Wang
- Center for Genomic and Personalized Medicine, Guangxi key Laboratory for Genomic and Personalized Medicine, Guangxi Collaborative Innovation Center for Genomic and Personalized Medicine, Guangxi Medical University, Nanning, Guangxi, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
14
|
Beltrán JF, Belén LH, Yáñez AJ, Jimenez L. Predicting viral proteins that evade the innate immune system: a machine learning-based immunoinformatics tool. BMC Bioinformatics 2024; 25:351. [PMID: 39522017 PMCID: PMC11550529 DOI: 10.1186/s12859-024-05972-7] [Received: 08/22/2024] [Accepted: 10/29/2024] [Indexed: 11/16/2024]
Abstract
Viral proteins that evade the host's innate immune response play a crucial role in pathogenesis, significantly impacting viral infections and potential therapeutic strategies. Identifying these proteins through traditional methods is challenging and time-consuming due to the complexity of virus-host interactions. Leveraging advancements in computational biology, we present VirusHound-II, a novel tool that utilizes machine learning techniques to predict viral proteins evading the innate immune response with high accuracy. We evaluated a comprehensive range of machine learning models, including ensemble methods, neural networks, and support vector machines. Using a dataset of 1337 viral proteins known to evade the innate immune response (VPEINRs) and an equal number of non-VPEINRs, we employed pseudo amino acid composition as the molecular descriptor. Our methodology involved a tenfold cross-validation strategy on 80% of the data for training, followed by testing on an independent dataset comprising the remaining 20%. The random forest model demonstrated superior performance metrics, achieving 0.9290 accuracy, 0.9283 F1 score, 0.9354 precision, and 0.9213 sensitivity in the independent testing phase. These results establish VirusHound-II as an advancement in computational virology, accessible via a user-friendly web application. We anticipate that VirusHound-II will be a crucial resource for researchers, enabling the rapid and reliable prediction of viral proteins evading the innate immune response. This tool has the potential to accelerate the identification of therapeutic targets and enhance our understanding of viral evasion mechanisms, contributing to the development of more effective antiviral strategies and advancing our knowledge of virus-host interactions.
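The four metrics reported above are all derived from a binary confusion matrix. The following is a minimal sketch of how they relate; the function names and the example counts are hypothetical and are not taken from VirusHound-II. As a consistency check, the F1 score is the harmonic mean of precision and sensitivity, so the reported precision (0.9354) and sensitivity (0.9213) can be combined and compared with the reported F1.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1_score(precision, sensitivity),
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=90, fp=10, tn=80, fn=20)

# Precision and sensitivity as reported in the abstract.
reported_f1 = round(f1_score(0.9354, 0.9213), 4)
```

With the abstract's precision and sensitivity, `reported_f1` evaluates to 0.9283, which agrees with the F1 score stated for the independent test.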
Affiliation(s)
- Jorge F Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile.
| | - Lisandra Herrera Belén
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
| | - Alejandro J Yáñez
- Departamento de Investigación y Desarrollo, Greenvolution SpA., Puerto Varas, Chile
- Interdisciplinary Center for Aquaculture Research (INCAR), Concepcion, Chile
| | - Luis Jimenez
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile
|
15
|
Beltrán JF, Herrera-Belén L, Yáñez AJ, Jimenez L. Prediction of viral oncoproteins through the combination of generative adversarial networks and machine learning techniques. Sci Rep 2024; 14:27108. [PMID: 39511292 PMCID: PMC11543823 DOI: 10.1038/s41598-024-77028-y] [Received: 07/07/2024] [Accepted: 10/18/2024] [Indexed: 11/15/2024]
Abstract
Viral oncoproteins play crucial roles in transforming normal cells into cancer cells, representing a significant factor in the etiology of various cancers. Traditionally, identifying these oncoproteins is both time-consuming and costly. With advancements in computational biology, bioinformatics tools based on machine learning have emerged as effective methods for predicting biological activities. Here, for the first time, we propose an innovative approach that combines Generative Adversarial Networks (GANs) with supervised learning methods to enhance the accuracy and generalizability of viral oncoprotein prediction. Our methodology evaluated multiple machine learning models, including Random Forest, Multilayer Perceptron, Light Gradient Boosting Machine, eXtreme Gradient Boosting, and Support Vector Machine. In ten-fold cross-validation on our training dataset, the GAN-enhanced Random Forest model demonstrated superior performance metrics: 0.976 accuracy, 0.976 F1 score, 0.977 precision, 0.976 sensitivity, and 1.0 AUC. During independent testing, this model achieved 0.982 accuracy, 0.982 F1 score, 0.982 precision, 0.982 sensitivity, and 1.0 AUC. These results establish our new tool, VirOncoTarget, accessible via a web application. We anticipate that VirOncoTarget will be a valuable resource for researchers, enabling rapid and reliable viral oncoprotein prediction and advancing our understanding of their role in cancer biology.
Affiliation(s)
- Jorge F Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile.
| | - Lisandra Herrera-Belén
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
| | - Alejandro J Yáñez
- Departamento de Investigación y Desarrollo, Greenvolution SpA, Puerto Varas, Chile
- Interdisciplinary Center for Aquaculture Research (INCAR), Concepcion, Chile
| | - Luis Jimenez
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile
|
16
|
Tayebi Z, Ali S, Patterson M. TCellR2Vec: efficient feature selection for TCR sequences for cancer classification. PeerJ Comput Sci 2024; 10:e2239. [PMID: 39650499 PMCID: PMC11622898 DOI: 10.7717/peerj-cs.2239] [Received: 03/06/2024] [Accepted: 07/14/2024] [Indexed: 12/11/2024]
Abstract
Cancer remains one of the leading causes of death globally. New immunotherapies that harness the patient's immune system to fight cancer show promise, but their development requires analyzing the diversity of immune cells called T-cells. T-cells have receptors that recognize and bind to cancer cells. Sequencing these T-cell receptors provides insights into the immune response, but extracting useful information is challenging. In this study, we propose a new computational method, TCellR2Vec, to select key features from T-cell receptor sequences for classifying different cancer types. We extracted features such as amino acid composition, charge, and diversity measures and combined them with other sequence embedding techniques. For our experiments, we used a dataset of over 50,000 T-cell receptor sequences from five cancer types, which showed that TCellR2Vec improved classification accuracy and efficiency over baseline methods. These results demonstrate TCellR2Vec's ability to capture informative aspects of complex T-cell receptor sequences. By improving computational analysis of the immune response, TCellR2Vec could aid the development of personalized immunotherapies tailored to each patient's T-cells. This has important implications for creating more effective cancer treatments based on the individual's immune system.
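Two of the feature types named above, amino acid composition and charge, can be illustrated with a short sketch. This is not the TCellR2Vec implementation: the function names, the crude charge rule (K and R counted as +1, D and E as -1, histidine and pH ignored), and the example sequence are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# sequence maps to a feature vector of the same length.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each standard amino acid in the sequence."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def net_charge(seq):
    """Crude net charge: +1 per K or R, -1 per D or E (assumed rule)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


# Hypothetical receptor fragment used only as example input.
example_seq = "CASSLGQAYEQYF"
features = composition(example_seq) + [net_charge(example_seq)]
```

The resulting vector has 21 entries (20 composition fractions plus one charge value) and could be concatenated with other sequence embeddings before classification.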
Affiliation(s)
- Zahra Tayebi
- Computer Science, Georgia State University, Atlanta, GA, United States of America
| | - Sarwan Ali
- Computer Science, Georgia State University, Atlanta, GA, United States of America
| | - Murray Patterson
- Computer Science, Georgia State University, Atlanta, GA, United States of America
|
17
|
Chen C, He Z, Zhao J, Zhu X, Li J, Wu X, Chen Z, Chen H, Jia G. Zoonotic outbreak risk prediction with long short-term memory models: a case study with schistosomiasis, echinococcosis, and leptospirosis. BMC Infect Dis 2024; 24:1062. [PMID: 39333964 PMCID: PMC11437667 DOI: 10.1186/s12879-024-09892-y] [Received: 03/12/2024] [Accepted: 09/05/2024] [Indexed: 09/30/2024]
Abstract
BACKGROUND: Zoonotic infections, characterized by high pathogen diversity, wide geographic reach and great harm to society, have become a major global public health problem. Early and accurate prediction of their outbreaks is crucial for disease control. The aim of this study was to develop risk prediction models for zoonotic diseases based on time-series incidence data, using three zoonotic diseases in mainland China as cases. METHODS: The incidence data for schistosomiasis, echinococcosis, and leptospirosis were downloaded from the Scientific Data Centre of the National Ministry of Health of China, and were processed by interpolation, dynamic curve reconstruction and time series decomposition. Data were decomposed into three distinct components: the trend component, the seasonal component, and the residual component. The trend component was used as input to construct the Long Short-Term Memory (LSTM) prediction model, while the seasonal component was used to compare periods and amplitudes. Finally, the accuracy of the hybrid LSTM prediction model was comprehensively evaluated. RESULTS: This study employed trend series of incidence numbers and incidence rates of three zoonotic diseases for modeling. The predicted incidence numbers and incidence rates were very close to the real incidence data. Model evaluation revealed that the prediction error of the hybrid LSTM model was smaller than that of the single LSTM. Thus, these results demonstrate that using trend sequences as model input leads to better-fitting predictive models. CONCLUSIONS: Our study successfully developed LSTM hybrid models for disease outbreak risk prediction using three zoonotic diseases as case studies. We demonstrate that the LSTM, when combined with time series decomposition, delivers more accurate results than conventional LSTM models using the raw data series.
Disease outbreak trends can be predicted more accurately using hybrid models.
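The decomposition into trend, seasonal and residual components described in METHODS can be sketched as follows. This is a classical additive moving-average decomposition, which is an assumption: the abstract does not name the decomposition algorithm. The function name `decompose_additive`, the period of 4, and the toy series are hypothetical, and the LSTM that would consume the trend component is not shown.

```python
from statistics import mean


def decompose_additive(series, period):
    """Split a series into trend, seasonal and residual parts (additive model).

    The trend is a centered moving average over one period, so it is
    undefined (None) for the first and last period // 2 points.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: give the two end points half weight.
            total -= 0.5 * (window[0] + window[-1])
        trend[i] = total / period

    # Average the detrended values at each position within the period.
    phase_means = []
    for phase in range(period):
        detrended = [series[i] - trend[i]
                     for i in range(phase, n, period) if trend[i] is not None]
        phase_means.append(mean(detrended))
    offset = mean(phase_means)  # force the seasonal part to sum to zero
    seasonal = [phase_means[i % period] - offset for i in range(n)]

    residual = [None if trend[i] is None else series[i] - trend[i] - seasonal[i]
                for i in range(n)]
    return trend, seasonal, residual


# Toy series: linear trend 0.5 * t plus a repeating zero-sum seasonal pattern.
pattern = [1.0, -2.0, 3.0, -2.0]
toy = [0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, residual = decompose_additive(toy, period=4)
```

On this toy input the recovered trend follows the underlying line and the seasonal component reproduces the repeating pattern, which is the property that makes the trend a smoother input sequence for a forecasting model.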
Affiliation(s)
- Chunrong Chen
- College of Animal Science and Technology, Guangxi University, Nanning, 530004, China
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
| | - Zhaoyuan He
- College of Animal Science and Technology, Guangxi University, Nanning, 530004, China
| | - Jin Zhao
- College of Animal Science and Technology, Guangxi University, Nanning, 530004, China
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
| | - Xuhui Zhu
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Jiabao Li
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
- College of Data Science, Taiyuan University of Technology, Taiyuan, 030024, China
| | - Xinnan Wu
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
- College of Data Science, Taiyuan University of Technology, Taiyuan, 030024, China
| | - Zhongting Chen
- College of Animal Science and Technology, Guangxi University, Nanning, 530004, China
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
| | - Hailan Chen
- College of Animal Science and Technology, Guangxi University, Nanning, 530004, China.
| | - Gengjie Jia
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China.
|
18
|
Kurata H, Harun-Or-Roshid M, Tsukiyama S, Maeda K. PredIL13: Stacking a variety of machine and deep learning methods with ESM-2 language model for identifying IL13-inducing peptides. PLoS One 2024; 19:e0309078. [PMID: 39172871 PMCID: PMC11340954 DOI: 10.1371/journal.pone.0309078] [Received: 05/23/2024] [Accepted: 08/05/2024] [Indexed: 08/24/2024]
Abstract
Interleukin (IL)-13 has emerged as one of the recently identified cytokines. Since IL-13 contributes to the severity of COVID-19 and alters crucial biological processes, it is urgent to explore novel molecules or peptides capable of inducing IL-13. Computational prediction has received attention as a complementary method to in vivo and in vitro experimental identification of IL-13-inducing peptides, because experimental identification is time-consuming, laborious, and expensive. A few computational tools have been presented, including IL13Pred and iIL13Pred. To increase prediction capability, we have developed PredIL13, a cutting-edge ensemble learning method with the latest ESM-2 protein language model. This method stacked the probability scores output by 168 single-feature machine/deep learning models, and then trained a logistic regression-based meta-classifier with the stacked probability score vectors. The key technology was to implement ESM-2 and to select the optimal single-feature models according to their absolute weight coefficient for logistic regression (AWCLR), an indicator of the importance of each single-feature model. In particular, the sequential deletion of single-feature models based on the iterative AWCLR ranking (SDIWC) method constructed the meta-classifier consisting of the top 16 single-feature models, named PredIL13, while considering the model's accuracy. PredIL13 greatly outperformed the state-of-the-art predictors and is thus an invaluable tool for accelerating the detection of IL13-inducing peptides within the human genome.
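The idea of ranking base models by the absolute weight that a logistic regression meta-classifier assigns to their stacked probability scores can be sketched as below. This is only an illustration of the ranking step under stated assumptions: the plain gradient-descent fit, the mean-centering of the scores, the function names, and the two toy base models are hypothetical, and the iterative deletion procedure (SDIWC) is not reproduced.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_base_models(score_matrix, labels, names):
    """Rank base models by absolute meta-classifier weight, largest first."""
    d = len(names)
    means = [sum(row[j] for row in score_matrix) / len(score_matrix)
             for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in score_matrix]
    w, _ = fit_logistic(centered, labels)
    return sorted(zip(names, (abs(wj) for wj in w)),
                  key=lambda pair: pair[1], reverse=True)


# Toy stacked scores: column 0 tracks the label, column 1 is uninformative.
scores = [[0.9, 0.4], [0.9, 0.6], [0.1, 0.4], [0.1, 0.6]]
labels = [1, 1, 0, 0]
ranking = rank_base_models(scores, labels, ["model_a", "model_b"])
```

In this toy case the informative base model receives the larger absolute weight and is ranked first; a selection procedure would then drop the lowest-ranked models and refit.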
Affiliation(s)
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Kawazu, Iizuka, Fukuoka, Japan
| | - Md. Harun-Or-Roshid
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Kawazu, Iizuka, Fukuoka, Japan
| | - Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Kawazu, Iizuka, Fukuoka, Japan
| | - Kazuhiro Maeda
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Kawazu, Iizuka, Fukuoka, Japan
|
19
|
Guevara-Barrientos D, Kaundal R. Malivhu: A Comprehensive Bioinformatics Resource for Filtering SARS and MERS Virus Proteins by Their Classification, Family and Species, and Prediction of Their Interactions Against Human Proteins. Bioinform Biol Insights 2024; 18:11779322241263671. [PMID: 39148721 PMCID: PMC11325310 DOI: 10.1177/11779322241263671] [Received: 01/28/2024] [Accepted: 06/04/2024] [Indexed: 08/17/2024]
Abstract
The COVID-19 pandemic is still ongoing, having taken more than 6 million human lives, and it seems that the world will have to learn how to live with the virus. Consequently, there is a need to develop different treatments against it, not only vaccines but also new medicines. Human-virus protein-protein interactions (PPIs) play a key part in drug-target discovery, but finding them experimentally can be costly and sometimes unreliable. Computational methods have therefore arisen as a powerful alternative for predicting these interactions, reducing costs and helping researchers confirm only selected interactions instead of trying all possible combinations in the laboratory. Malivhu is a tool that predicts human-virus PPIs through a 4-phase process using machine learning models, where phase 1 filters ssRNA(+) class virus proteins, phase 2 filters Coronaviridae family proteins, phase 3 filters severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) species proteins, and phase 4 predicts human-SARS-CoV/SARS-CoV-2/MERS protein-protein interactions. The performance of the models was measured with Matthews correlation coefficient, F1-score, specificity, sensitivity, and accuracy scores, giving accuracies of 99.07%, 99.83%, and 100% for the first 3 phases, respectively, and 94.24% for human-SARS-CoV PPI, 94.50% for human-SARS-CoV-2 PPI, and 95.45% for human-MERS PPI on independent testing. All the prediction models developed for each of the 4 phases were implemented as a web server, which is freely available at https://kaabil.net/malivhu/.
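For readers less familiar with the first metric listed above, the standard definition of the Matthews correlation coefficient for a binary classifier is given below. The symbols TP, TN, FP and FN (true positives, true negatives, false positives, false negatives) are the usual confusion-matrix counts and are not taken from the abstract.

```latex
\mathrm{MCC} =
  \frac{TP \cdot TN - FP \cdot FN}
       {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
```

MCC ranges from -1 to 1, where 1 means perfect agreement, 0 is no better than chance, and -1 is total disagreement. Because it uses all four counts, it is less easily inflated by class imbalance than accuracy alone.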
Affiliation(s)
- David Guevara-Barrientos
- Department of Computer Science, College of Science, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
| | - Rakesh Kaundal
- Department of Computer Science, College of Science, Utah State University, Logan, UT, USA
- Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA
- Department of Plants, Soils & Climate, College of Agriculture and Applied Sciences, Utah State University, Logan, UT, USA
|
20
|
Volzhenin K, Bittner L, Carbone A. SENSE-PPI reconstructs interactomes within, across, and between species at the genome scale. iScience 2024; 27:110371. [PMID: 39055916 PMCID: PMC11269938 DOI: 10.1016/j.isci.2024.110371] [Received: 12/12/2023] [Revised: 05/04/2024] [Accepted: 06/21/2024] [Indexed: 07/28/2024]
Abstract
Ab initio computational reconstructions of protein-protein interaction (PPI) networks will provide invaluable insights into cellular systems, enabling the discovery of novel molecular interactions and elucidating biological mechanisms within and between organisms. Leveraging the latest generation protein language models and recurrent neural networks, we present SENSE-PPI, a sequence-based deep learning model that efficiently reconstructs ab initio PPIs, distinguishing partners among tens of thousands of proteins and identifying specific interactions within functionally similar proteins. SENSE-PPI demonstrates high accuracy, limited training requirements, and versatility in cross-species predictions, even with non-model organisms and human-virus interactions. Its performance decreases for phylogenetically more distant model and non-model organisms, but signal alteration is very slow. In this regard, it demonstrates the important role of parameters in protein language models. SENSE-PPI is very fast and can test 10,000 proteins against themselves in a matter of hours, enabling the reconstruction of genome-wide proteomes.
Affiliation(s)
- Konstantin Volzhenin
- Sorbonne Université, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
| | - Lucie Bittner
- Institut de Systématique, Evolution, Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, Paris, France
- Institut Universitaire de France, Paris, France
| | - Alessandra Carbone
- Sorbonne Université, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), 75005 Paris, France
- Institut Universitaire de France, Paris, France
|
21
|
Jin R, Ye Q, Wang J, Cao Z, Jiang D, Wang T, Kang Y, Xu W, Hsieh CY, Hou T. AttABseq: an attention-based deep learning prediction method for antigen-antibody binding affinity changes based on protein sequences. Brief Bioinform 2024; 25:bbae304. [PMID: 38960407 PMCID: PMC11221889 DOI: 10.1093/bib/bbae304] [Received: 11/09/2023] [Revised: 04/15/2024] [Accepted: 06/11/2024] [Indexed: 07/05/2024]
Abstract
The optimization of therapeutic antibodies through traditional techniques, such as candidate screening via hybridoma or phage display, is resource-intensive and time-consuming. In recent years, computational and artificial intelligence-based methods have been actively developed to accelerate and improve the development of therapeutic antibodies. In this study, we developed an end-to-end sequence-based deep learning model, termed AttABseq, for predicting the antigen-antibody binding affinity changes caused by antibody mutations. AttABseq is a highly efficient and generic attention-based model that takes diverse antigen-antibody complex sequences as input to predict the binding affinity changes of residue mutations. The assessment on three benchmark datasets illustrates that AttABseq is 120% more accurate than other sequence-based models in terms of the Pearson correlation coefficient between the predicted and experimental binding affinity changes. Moreover, AttABseq either outperforms or competes favorably with the structure-based approaches. Furthermore, AttABseq consistently demonstrates robust predictive capabilities across a diverse array of conditions, underscoring its capacity for generalization across a wide spectrum of antigen-antibody complexes. It imposes no constraints on the number of altered residues, and because it relies on sequences alone it is particularly applicable in scenarios where crystallographic structures remain unavailable. The attention-based interpretability analysis indicates that the causal effects of point mutations on antibody-antigen binding affinity changes can be visualized at the residue level, which might assist automated antibody sequence optimization. We believe that AttABseq provides a highly competitive answer to therapeutic antibody optimization.
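The evaluation metric named above, the Pearson correlation coefficient between predicted and experimental binding affinity changes, can be computed as in this minimal sketch. The function name and the example values are hypothetical and are not taken from the AttABseq benchmarks.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical affinity changes: experimental values and two sets of predictions.
experimental = [0.1, 0.5, 1.2, -0.3]
scaled_predictions = [2 * v for v in experimental]   # perfectly linear
reversed_predictions = [-v for v in experimental]    # perfectly anti-correlated

r_scaled = pearson(experimental, scaled_predictions)
r_reversed = pearson(experimental, reversed_predictions)
```

The coefficient measures linear agreement only, so predictions that are consistently scaled still score 1, which is why it is often reported alongside an error measure.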
Affiliation(s)
- Ruofan Jin
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
- College of Life Science, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Qing Ye
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Jike Wang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Zheng Cao
- College of Computer Science and Technology, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Dejun Jiang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Tianyue Wang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Yu Kang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Wanting Xu
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Chang-Yu Hsieh
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Tingjun Hou
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
|
22
|
Zeng X, Meng FF, Wen ML, Li SJ, Li Y. GNNGL-PPI: multi-category prediction of protein-protein interactions using graph neural networks based on global graphs and local subgraphs. BMC Genomics 2024; 25:406. [PMID: 38724906 PMCID: PMC11080243 DOI: 10.1186/s12864-024-10299-x] [Received: 01/17/2024] [Accepted: 04/10/2024] [Indexed: 05/13/2024]
Abstract
Most proteins exert their functions by interacting with other proteins, making the identification of protein-protein interactions (PPI) crucial for understanding biological activities, pathological mechanisms, and clinical therapies. Developing effective and reliable computational methods for predicting PPI can significantly reduce the reliance on time-consuming and labor-intensive traditional biological experiments. However, accurately identifying the specific categories of protein-protein interactions and improving the prediction accuracy of the computational methods remain dual challenges. To tackle these challenges, we proposed a novel graph neural network method called GNNGL-PPI for multi-category prediction of PPI based on global graphs and local subgraphs. GNNGL-PPI consisted of two main components: using Graph Isomorphism Network (GIN) to extract global graph features from the PPI network graph, and employing GIN As Kernel (GIN-AK) to extract local subgraph features from the subgraphs of protein vertices. Additionally, considering the imbalanced distribution of samples in each category within the benchmark datasets, we introduced an Asymmetric Loss (ASL) function to further enhance the predictive performance of the method. Through evaluations on six benchmark test sets formed by three different dataset partitioning algorithms (Random, BFS, DFS), GNNGL-PPI outperformed the state-of-the-art multi-category prediction methods of PPI, as measured by the comprehensive performance evaluation metric F1-measure. Furthermore, interpretability analysis confirmed the effectiveness of GNNGL-PPI as a reliable multi-category prediction method for predicting protein-protein interactions.
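The Asymmetric Loss mentioned above addresses class imbalance by down-weighting easy negatives more strongly than positives. The sketch below follows the commonly used general formulation of ASL for a single label; the function name, the default focusing parameters (0 for positives, 4 for negatives) and the probability margin of 0.05 are assumptions for illustration and are not settings reported for GNNGL-PPI.

```python
import math


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric loss for one predicted probability p and binary label y.

    Positives use a focal-style term with exponent gamma_pos.
    Negatives first shift the probability down by `margin` (clipped at 0),
    then apply a stronger focusing exponent gamma_neg, so that confident,
    easy negatives contribute little or nothing to the loss.
    """
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    shifted = max(p - margin, 0.0)
    return -(shifted ** gamma_neg) * math.log(max(1.0 - shifted, eps))


# Hypothetical predictions for one interaction category.
loss_positive = asymmetric_loss(0.5, 1)        # uncertain positive
loss_easy_negative = asymmetric_loss(0.04, 0)  # below the margin: ignored
loss_mild_negative = asymmetric_loss(0.2, 0)   # small contribution
loss_hard_negative = asymmetric_loss(0.9, 0)   # confidently wrong: large
```

In a multi-category setting the per-label losses would be summed over categories and samples, so rare categories are not swamped by the many easy negatives.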
Affiliation(s)
- Xin Zeng
- College of Mathematics and Computer Science, Dali University, 671003, Dali, China
| | - Fan-Fang Meng
- College of Mathematics and Computer Science, Dali University, 671003, Dali, China
| | - Meng-Liang Wen
- State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Yunnan University, 650000, Kunming, China
| | - Shu-Juan Li
- Yunnan Institute of Endemic Diseases Control & Prevention, 671000, Dali, China
| | - Yi Li
- College of Mathematics and Computer Science, Dali University, 671003, Dali, China.
|
23
|
Pan J, Zhang Z, Li Y, Yu J, You Z, Li C, Wang S, Zhu M, Ren F, Zhang X, Sun Y, Wang S. A microbial knowledge graph-based deep learning model for predicting candidate microbes for target hosts. Brief Bioinform 2024; 25:bbae119. [PMID: 38555472 PMCID: PMC10981679 DOI: 10.1093/bib/bbae119] [Received: 12/18/2023] [Revised: 02/23/2024] [Accepted: 03/02/2024] [Indexed: 04/02/2024]
Abstract
Predicting interactions between microbes and hosts plays a critical role in microbiome population genetics and microbial ecology and evolution. Systematically characterizing the sophisticated mechanisms and signal interplay between microbes and hosts is a significant challenge in addressing global health risks. Identifying microbe-host interactions (MHIs) can not only provide helpful insights into their fundamental regulatory mechanisms, but also facilitate the development of targeted therapies for microbial infections. In recent years, computational methods have become an appealing alternative due to the high risk and cost of wet-lab experiments. Therefore, in this study, we utilized rich microbial metagenomic information to construct a novel heterogeneous microbial network (HMN)-based model named KGVHI to predict candidate microbes for target hosts. Specifically, KGVHI first built an HMN by integrating human proteins, viruses and pathogenic bacteria with their biological attributes. Then KGVHI adopted a knowledge graph embedding strategy to capture the global topological structure information of the whole network. A natural language processing algorithm was used to extract the local biological attribute information from the nodes in the HMN. Finally, we combined the local and global information and fed it into a blended deep neural network (DNN) for training and prediction. Compared to state-of-the-art methods, the comprehensive experimental results show that our model obtains excellent results on the three corresponding MHI datasets. Furthermore, we also conducted two pathogenic bacteria case studies to further indicate that KGVHI has excellent predictive capabilities for potential MHI pairs.
Affiliation(s)
- Jie Pan
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Zhen Zhang
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Ying Li
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Jiaoyang Yu
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Zhuhong You
- School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
| | - Chenyu Li
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Shixu Wang
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Minghui Zhu
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Fengzhi Ren
- North China Pharmaceutical Group, Shijiazhuang 050015, Hebei, China
- National Microbial Medicine Engineering & Research Center, Shijiazhuang 050015, Hebei, China
| | - Xuexia Zhang
- North China Pharmaceutical Group, Shijiazhuang 050015, Hebei, China
- National Microbial Medicine Engineering & Research Center, Shijiazhuang 050015, Hebei, China
| | - Yanmei Sun
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
| | - Shiwei Wang
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, College of Life Sciences, Northwest University, Xi’an 710069, China
|
24
|
Ma Y, Zhao Y, Ma Y. Kernel Bayesian nonlinear matrix factorization based on variational inference for human-virus protein-protein interaction prediction. Sci Rep 2024; 14:5693. [PMID: 38454139 PMCID: PMC10920681 DOI: 10.1038/s41598-024-56208-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Accepted: 03/04/2024] [Indexed: 03/09/2024] Open
Abstract
Identification of potential human-virus protein-protein interactions (PPIs) contributes to the understanding of the mechanisms of viral infection and to the development of antiviral drugs. Existing computational models often have many hyperparameters that must be tuned manually, which limits their computational efficiency and generalization ability. To address this, this study proposes a kernel Bayesian logistic matrix decomposition model with automatic rank determination, VKBNMF, for the prediction of human-virus PPIs. VKBNMF introduces auxiliary information into the logistic matrix decomposition and places priors on the latent variables to build a Bayesian framework for automatic parameter search. In addition, we construct a variational inference framework for VKBNMF to keep the solution efficient. The experimental results show that, in the paired-PPI scenario, VKBNMF achieves an average AUPR of 0.9101, 0.9316, 0.8727, and 0.9517 on the four benchmark datasets, respectively, and in the scenarios of new human (viral) proteins, VKBNMF still achieves a higher hit rate. The case study further demonstrates that VKBNMF can be used as an effective tool for the prediction of human-virus PPIs.
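The abstract does not give the model equations. As a rough illustration of the generic logistic matrix factorization idea only (not VKBNMF's kernel terms, priors, automatic rank determination, or variational inference), the interaction probability for a human-virus protein pair can be sketched as a logistic function of the inner product of two latent factor vectors. The function name and the example factor values below are hypothetical.

```python
import math

def interaction_probability(human_factors, virus_factors):
    """Logistic link applied to the inner product of two latent vectors."""
    score = sum(h * v for h, v in zip(human_factors, virus_factors))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical 3-dimensional latent factors, for illustration only.
u_human = [0.5, -1.0, 2.0]
v_virus = [1.0, 0.5, 0.25]

p = interaction_probability(u_human, v_virus)
print(round(p, 4))
```

In a full model the latent vectors would be learned from the known interaction matrix; here they are fixed by hand so that only the scoring rule is shown.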
Affiliation(s)
- Yingjun Ma
- School of Mathematics and Statistics, Xiamen University of Technology, Xiamen, China
| | - Yongbiao Zhao
- School of Computer, Central China Normal University, Wuhan, China
| | - Yuanyuan Ma
- School of Computer Engineering, Hubei University of Arts and Science, Xiangyang, China.
- Hubei Key Laboratory of Power System Design and Test for Electrical Vehicle, Hubei University of Arts and Science, Xiangyang, China.
|
25
|
Zhang HQ, Liu SH, Li R, Yu JW, Ye DX, Yuan SS, Lin H, Huang CB, Tang H. MIBPred: Ensemble Learning-Based Metal Ion-Binding Protein Classifier. ACS Omega 2024; 9:8439-8447. [PMID: 38405489 PMCID: PMC10882704 DOI: 10.1021/acsomega.3c09587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 01/16/2024] [Accepted: 01/22/2024] [Indexed: 02/27/2024]
Abstract
In biological organisms, metal ion-binding proteins participate in numerous metabolic activities and are closely associated with various diseases. To accurately predict whether a protein binds to metal ions and the type of metal ion-binding protein, this study proposed a classifier named MIBPred. The classifier incorporated advanced Word2Vec technology from the field of natural language processing to extract semantic features of the protein sequence language and combined them with position-specific score matrix (PSSM) features. Furthermore, an ensemble learning model was employed for the metal ion-binding protein classification task. In the model, we independently trained XGBoost, LightGBM, and CatBoost algorithms and integrated the output results through an SVM voting mechanism. This innovative combination has led to a significant breakthrough in the predictive performance of our model. As a result, we achieved accuracies of 95.13% and 85.19%, respectively, in predicting metal ion-binding proteins and their types. Our research not only confirms the effectiveness of Word2Vec technology in extracting semantic information from protein sequences but also highlights the outstanding performance of the MIBPred classifier in the problem of metal ion-binding protein types. This study provides a reliable tool and method for the in-depth exploration of the structure and function of metal ion-binding proteins.
Affiliation(s)
- Hong-Qi Zhang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shang-Hua Liu
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Rui Li
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Jun-Wen Yu
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Dong-Xin Ye
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lin
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Cheng-Bing Huang
- School of Computer Science and Technology, Aba Teachers University, Aba 623002, China
| | - Hua Tang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou 646000, China
- Central Nervous System Drug Key Laboratory of Sichuan Province, Luzhou 646000, China
|
26
|
Fu X, Yuan Y, Qiu H, Suo H, Song Y, Li A, Zhang Y, Xiao C, Li Y, Dou L, Zhang Z, Cui F. AGF-PPIS: A protein-protein interaction site predictor based on an attention mechanism and graph convolutional networks. Methods 2024; 222:142-151. [PMID: 38242383 DOI: 10.1016/j.ymeth.2024.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/21/2023] [Revised: 01/04/2024] [Accepted: 01/13/2024] [Indexed: 01/21/2024] Open
Abstract
Protein-protein interactions play an important role in various biological processes, and knowledge of these interactions has a wide range of applications. Therefore, the correct identification of protein-protein interaction sites is crucial. In this paper, we propose a novel predictor for protein-protein interaction sites, AGF-PPIS, which combines a multi-head self-attention mechanism (introducing a graph structure), a graph convolutional network, and a feed-forward neural network. We use the Euclidean distances between protein residues to generate the corresponding protein graph as the input of AGF-PPIS. On the independent test dataset Test_60, AGF-PPIS achieves superior performance over comparative methods in terms of seven different evaluation metrics (ACC, precision, recall, F1-score, MCC, AUROC, AUPRC), which demonstrates the validity and superiority of the proposed AGF-PPIS model. The source code and usage instructions for AGF-PPIS are available at https://github.com/fxh1001/AGF-PPIS.
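The abstract says only that the protein graph is built from Euclidean distances between residues. A minimal sketch of one common way to do this follows, assuming one 3D coordinate per residue and a distance cutoff; the cutoff value, the toy coordinates, and the function name are illustrative assumptions, not details taken from AGF-PPIS.

```python
import math

def residue_graph(coords, cutoff):
    """Return a symmetric 0/1 adjacency matrix linking residues within cutoff."""
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # math.dist computes the Euclidean distance between two points.
            if math.dist(coords[i], coords[j]) <= cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency

# Toy coordinates (in angstroms) for four residues; the last one is far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
graph = residue_graph(coords, cutoff=8.0)  # the 8.0 cutoff is an assumed value
for row in graph:
    print(row)
```

The resulting matrix is the kind of input a graph convolutional layer typically consumes, with per-residue features supplied separately.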
Affiliation(s)
- Xiuhao Fu
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Ye Yuan
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Haoye Qiu
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Haodong Suo
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Yingying Song
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Anqi Li
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Yupeng Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Cuilin Xiao
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Yazi Li
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Lijun Dou
- Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, USA
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China.
| | - Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China.
|
27
|
Wu S, Feng T, Tang W, Qi C, Gao J, He X, Wang J, Zhou H, Fang Z. metaProbiotics: a tool for mining probiotic from metagenomic binning data based on a language model. Brief Bioinform 2024; 25:bbae085. [PMID: 38487846 PMCID: PMC10940841 DOI: 10.1093/bib/bbae085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 01/26/2024] [Accepted: 02/15/2024] [Indexed: 03/18/2024] Open
Abstract
Beneficial bacteria remain largely unexplored. Lacking systematic methods, understanding probiotic community traits becomes challenging, leading to various conclusions about their probiotic effects among different publications. We developed language model-based metaProbiotics to rapidly detect probiotic bins from metagenomes, demonstrating superior performance in simulated benchmark datasets. Testing on gut metagenomes from probiotic-treated individuals, it revealed the probioticity of intervention strains-derived bins and other probiotic-associated bins beyond the training data, such as a plasmid-like bin. Analyses of these bins revealed various probiotic mechanisms and bai operon as probiotic Ruminococcaceae's potential marker. In different health-disease cohorts, these bins were more common in healthy individuals, signifying their probiotic role, but relevant health predictions based on the abundance profiles of these bins faced cross-disease challenges. To better understand the heterogeneous nature of probiotics, we used metaProbiotics to construct a comprehensive probiotic genome set from global gut metagenomic data. Module analysis of this set shows that diseased individuals often lack certain probiotic gene modules, with significant variation of the missing modules across different diseases. Additionally, different gene modules on the same probiotic have heterogeneous effects on various diseases. We thus believe that gene function integrity of the probiotic community is more crucial in maintaining gut homeostasis than merely increasing specific gene abundance, and adding probiotics indiscriminately might not boost health. We expect that the innovative language model-based metaProbiotics tool will promote novel probiotic discovery using large-scale metagenomic data and facilitate systematic research on bacterial probiotic effects. The metaProbiotics program can be freely downloaded at https://github.com/zhenchengfang/metaProbiotics.
Affiliation(s)
- Shufang Wu
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
| | - Tao Feng
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
| | - Waijiao Tang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
| | - Cancan Qi
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
| | - Jie Gao
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China
| | - Xiaolong He
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
| | - Jiaxuan Wang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
| | - Hongwei Zhou
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
| | - Zhencheng Fang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
|
28
|
Yang X, Wuchty S, Liang Z, Ji L, Wang B, Zhu J, Zhang Z, Dong Y. Multi-modal features-based human-herpesvirus protein-protein interaction prediction by using LightGBM. Brief Bioinform 2024; 25:bbae005. [PMID: 38279649 PMCID: PMC10818167 DOI: 10.1093/bib/bbae005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 12/25/2023] [Accepted: 01/01/2021] [Indexed: 01/28/2024] Open
Abstract
The identification of human-herpesvirus protein-protein interactions (PPIs) is an essential and important entry point to understand the mechanisms of viral infection, especially in malignant tumor patients with common herpesvirus infection. While natural language processing (NLP)-based embedding techniques have emerged as powerful approaches, the application of multi-modal embedding feature fusion to predict human-herpesvirus PPIs is still limited. Here, we established a multi-modal embedding feature fusion-based LightGBM method to predict human-herpesvirus PPIs. In particular, we applied document and graph embedding approaches to represent sequence, network and function modal features of human and herpesviral proteins. Training our LightGBM models through our compiled non-rigorous and rigorous benchmarking datasets, we obtained significantly better performance compared to individual-modal features. Furthermore, our model outperformed traditional feature encodings-based machine learning methods and state-of-the-art deep learning-based methods using various benchmarking datasets. In a transfer learning step, we show that our model that was trained on human-herpesvirus PPI dataset without cytomegalovirus data can reliably predict human-cytomegalovirus PPIs, indicating that our method can comprehensively capture multi-modal fusion features of protein interactions across various herpesvirus subtypes. The implementation of our method is available at https://github.com/XiaodiYangpku/MultimodalPPI/.
Affiliation(s)
- Xiaodi Yang
- Department of Hematology, Peking University First Hospital, Beijing, China
| | - Stefan Wuchty
- Department of Computer Science, University of Miami, Miami FL, 33146, USA
- Department of Biology, University of Miami, Miami FL, 33146, USA
- Institute of Data Science and Computation, University of Miami, Miami, FL 33146, USA
- Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33136, USA
| | - Zeyin Liang
- Department of Hematology, Peking University First Hospital, Beijing, China
| | - Li Ji
- Department of Hematology, Peking University First Hospital, Beijing, China
| | - Bingjie Wang
- Department of Hematology, Peking University First Hospital, Beijing, China
| | - Jialin Zhu
- Department of Hematology, Peking University First Hospital, Beijing, China
| | - Ziding Zhang
- State Key Laboratory of Animal Biotech Breeding, College of Biological Sciences, China Agricultural University, Beijing 100193, China
| | - Yujun Dong
- Department of Hematology, Peking University First Hospital, Beijing, China
|
29
|
Feng T, Wu S, Zhou H, Fang Z. MOBFinder: a tool for mobilization typing of plasmid metagenomic fragments based on a language model. Gigascience 2024; 13:giae047. [PMID: 39101782 PMCID: PMC11299106 DOI: 10.1093/gigascience/giae047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 05/31/2024] [Accepted: 06/24/2024] [Indexed: 08/06/2024] Open
Abstract
BACKGROUND: Mobilization typing (MOB) is a classification scheme for plasmid genomes based on their relaxase gene. The host ranges of plasmids of different MOB categories are diverse, and MOB is crucial for investigating plasmid mobilization, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented characteristics of metagenomic contigs. RESULTS: We developed MOBFinder, an 11-class classifier, for categorizing plasmid fragments into 10 MOB types and a nonmobilizable category. We first performed MOB typing to classify complete plasmid genomes according to relaxase information and then constructed an artificial benchmark dataset of plasmid metagenomic fragments (PMFs) from those complete plasmid genomes whose MOB types are well annotated. Next, based on natural language models, we used word vectors to characterize the PMFs. Several random forest classification models were trained and integrated to predict fragments of different lengths. Evaluating the tool using the benchmark dataset, we found that MOBFinder outperforms previous tools such as MOBscan and MOB-suite, with an overall accuracy approximately 59% higher than that of MOB-suite. Moreover, the balanced accuracy, harmonic mean, and F1-score reached up to 99% for some MOB types. When applied to a cohort of patients with type 2 diabetes (T2D), MOBFinder offered insights suggesting that the MOBF type plasmid, which is widely present in Escherichia and Klebsiella, and the MOBQ type plasmid might accelerate antibiotic resistance transmission in patients with T2D. CONCLUSIONS: To the best of our knowledge, MOBFinder is the first tool for MOB typing of PMFs. The tool is freely available at https://github.com/FengTaoSMU/MOBFinder.
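The abstract does not say how a plasmid fragment is turned into "words" before word vectors are computed. One widely used convention for DNA language models, assumed here purely for illustration, is to slide a window over the sequence and treat each overlapping k-mer as a word. The function name and the choice of k below are hypothetical and are not taken from MOBFinder.

```python
def kmer_words(sequence, k=4):
    """Split a DNA sequence into overlapping k-mers (one 'word' per position)."""
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping k-mers.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy fragment, for illustration only.
words = kmer_words("ATGCGTA", k=4)
print(words)
```

A word-embedding model would then map each k-mer to a vector, and the vectors of a fragment could be pooled into a single feature vector for a downstream classifier.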
Affiliation(s)
- Tao Feng
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Shufang Wu
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Hongwei Zhou
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Zhencheng Fang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
|
30
|
Chen HM, Liu JX, Liu D, Hao GF, Yang GF. Human-virus protein-protein interactions maps assist in revealing the pathogenesis of viral infection. Rev Med Virol 2024; 34:e2517. [PMID: 38282401 DOI: 10.1002/rmv.2517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Revised: 09/12/2023] [Accepted: 01/16/2024] [Indexed: 01/30/2024]
Abstract
Many significant viral infections have been recorded in human history and have caused enormous negative impacts worldwide. Human-virus protein-protein interactions (PPIs) mediate viral infection and immune processes in the host. The identification, quantification, and localization of human-virus PPIs and the construction of human-virus PPI maps are critical prerequisites for understanding the biophysical basis of the viral invasion process and characterising the framework for all protein functions. With the technological revolution and the introduction of artificial intelligence, human-virus PPI maps have expanded rapidly in the past decade and have shed light on solving complicated biomedical problems. However, there is still a lack of prospective insight into the field. In this work, we comprehensively review and compare the effectiveness, potential, and limitations of diverse approaches for constructing large-scale human-virus PPI maps, including experimental methods based on biophysics and biochemistry, databases of human-virus PPIs, computational methods based on artificial intelligence, and tools for visualising PPI maps. The work aims to provide a toolbox for researchers, hoping to better assist in deciphering the relationship between humans and viruses.
Affiliation(s)
- Hui-Min Chen
- National Key Laboratory of Green Pesticide, Central China Normal University, Wuhan, China
| | - Jia-Xin Liu
- National Key Laboratory of Green Pesticide, Central China Normal University, Wuhan, China
| | - Di Liu
- CAS Key Laboratory of Special Pathogens and Biosafety, Wuhan Institute of Virology, Center for Biosafety Mega-Science, Chinese Academy of Sciences, Wuhan, China
| | - Ge-Fei Hao
- National Key Laboratory of Green Pesticide, Central China Normal University, Wuhan, China
- National Key Laboratory of Green Pesticide, Key Laboratory of Green Pesticide and Agricultural Bioengineering, Ministry of Education, Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang, China
| | - Guang-Fu Yang
- National Key Laboratory of Green Pesticide, Central China Normal University, Wuhan, China
|
31
|
Wang S, Liu Y, Liu Y, Zhang Y, Zhu X. BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT. PeerJ 2023; 11:e16600. [PMID: 38089911 PMCID: PMC10712318 DOI: 10.7717/peerj.16600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 11/15/2023] [Indexed: 12/18/2023] Open
Abstract
DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes and plays important roles in various developmental and physiological processes and in a wide range of human diseases. Thus, it is essential to accurately detect 5mC sites. Although current sequencing technologies can map genome-wide 5mC sites, these experimental methods are both costly and time-consuming. To achieve fast and accurate prediction of 5mC sites, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (bidirectional encoder representations from transformers) model using human promoter sequences as the language corpus. BERT is a deep bidirectional language representation model based on the Transformer. Second, we fine-tuned the domain-specific BERT model on the 5mC training dataset to build the model. The cross-validation results show that our model achieves an AUROC of 0.966, which is higher than other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. Furthermore, on the independent test set, our model achieves an AUROC of 0.966, which is also higher than other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate the use of our model, we built a webserver which can be freely accessed at: http://5mc-pred.zhulab.org.cn.
Affiliation(s)
- Shuyu Wang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yinbo Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yufeng Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yong Zhang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
|
32
|
Braun M, Gruber CC, Krassnigg A, Kummer A, Lutz S, Oberdorfer G, Siirola E, Snajdrova R. Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design. ACS Catal 2023; 13:14454-14469. [PMID: 37942268 PMCID: PMC10629211 DOI: 10.1021/acscatal.3c03417] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 07/25/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 11/10/2023]
Abstract
Emerging computational tools promise to revolutionize protein engineering for biocatalytic applications and accelerate the development timelines previously needed to optimize an enzyme to its more efficient variant. For over a decade, the benefits of predictive algorithms have helped scientists and engineers navigate the complexity of functional protein sequence space. More recently, spurred by dramatic advances in underlying computational tools, the promise of faster, cheaper, and more accurate enzyme identification, characterization, and engineering has catapulted terms such as artificial intelligence and machine learning to the must-have vocabulary in the field. This Perspective aims to showcase the current status of applications in pharmaceutical industry and also to discuss and celebrate the innovative approaches in protein science by highlighting their potential in selected recent developments and offering thoughts on future opportunities for biocatalysis. It also critically assesses the technology's limitations, unanswered questions, and unmet challenges.
Affiliation(s)
- Markus Braun
- Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
| | - Christian C Gruber
- Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
| | - Andreas Krassnigg
- Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
| | - Arkadij Kummer
- Moderna, Inc., 200 Technology Square, Cambridge, Massachusetts 02139, United States
| | - Stefan Lutz
- Codexis Inc., 200 Penobscot Drive, Redwood City, California 94063, United States
| | - Gustav Oberdorfer
- Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
| | - Elina Siirola
- Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
| | - Radka Snajdrova
- Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
|
33
|
Halsana AA, Chakroborty T, Halder AK, Basu S. DensePPI: A Novel Image-Based Deep Learning Method for Prediction of Protein-Protein Interactions. IEEE Trans Nanobioscience 2023; 22:904-911. [PMID: 37028059 DOI: 10.1109/tnb.2023.3251192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2023]
Abstract
Protein-protein interactions (PPIs) are crucial for understanding the behaviour of living organisms and identifying disease associations. This paper proposes DensePPI, a novel deep convolution strategy applied to the 2D image map generated from interacting protein pairs for PPI prediction. A colour encoding scheme is introduced to embed the bigram interaction possibilities of amino acids into RGB colour space to enhance the learning and prediction task. The DensePPI model is trained on 5.5 million sub-images of size 128×128 generated from nearly 36,000 interacting and 36,000 non-interacting benchmark protein pairs. The performance is evaluated on independent datasets from five different organisms: Caenorhabditis elegans, Escherichia coli, Helicobacter pylori, Homo sapiens and Mus musculus. The proposed model achieves an average prediction accuracy of 99.95% on these datasets, considering inter-species and intra-species interactions. DensePPI is compared with state-of-the-art methods and outperforms them on different evaluation metrics. The improved performance of DensePPI indicates the efficiency of the image-based encoding of sequence information combined with a deep learning architecture for PPI prediction. The enhanced performance on diverse test sets shows that DensePPI is useful for both intra-species and cross-species interaction prediction. The dataset, supplementary file, and the developed models are available at https://github.com/Aanzil/DensePPI for academic use only.
|
34
|
Liu T, Gao H, Ren X, Xu G, Liu B, Wu N, Luo H, Wang Y, Tu T, Yao B, Guan F, Teng Y, Huang H, Tian J. Protein-protein interaction and site prediction using transfer learning. Brief Bioinform 2023; 24:bbad376. [PMID: 37870286 DOI: 10.1093/bib/bbad376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 09/14/2023] [Accepted: 10/02/2023] [Indexed: 10/24/2023] Open
Abstract
Advanced language models have made it possible to recognize protein-protein interactions (PPIs) and interaction sites from protein sequences or structures. Here, we trained the MindSpore ProteinBERT (MP-BERT) model, a Bidirectional Encoder Representation from Transformers, using protein pairs as inputs, making it suitable for identifying PPIs and their respective interaction sites. The pretrained model (MP-BERT) was fine-tuned as MPB-PPI (MP-BERT on PPI) and demonstrated superiority over state-of-the-art models on diverse benchmark datasets for predicting PPIs. Moreover, the model's capability to recognize PPIs was evaluated across multiple organisms. An amalgamated organism model was designed, exhibiting a high level of generalization across the majority of organisms and attaining an accuracy of 92.65%. The model was also customized to predict interaction site propensity by fine-tuning it with PPI site data as MPB-PPISP. Our method facilitates the prediction of both PPIs and their interaction sites, thereby illustrating the potency of transfer learning in dealing with the protein pair task.
Affiliation(s)
- Tuoyu Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Han Gao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Xiaopu Ren
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Guoshun Xu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Bo Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Ningfeng Wu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Huiying Luo
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Yuan Wang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Tao Tu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Bin Yao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Feifei Guan
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Yue Teng
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing 100071, China
- Huoqing Huang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Jian Tian
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
35
Lee M. Recent Advances in Deep Learning for Protein-Protein Interaction Analysis: A Comprehensive Review. Molecules 2023; 28:5169. [PMID: 37446831 DOI: 10.3390/molecules28135169] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 05/30/2023] [Revised: 06/30/2023] [Accepted: 06/30/2023] [Indexed: 07/15/2023]
Abstract
Deep learning, a potent branch of artificial intelligence, is steadily leaving its transformative imprint across multiple disciplines. Within computational biology, it is expediting progress in the understanding of Protein-Protein Interactions (PPIs), key components governing a wide array of biological functionalities. Hence, an in-depth exploration of PPIs is crucial for decoding the intricate biological system dynamics and unveiling potential avenues for therapeutic interventions. As the deployment of deep learning techniques in PPI analysis proliferates at an accelerated pace, there exists an immediate demand for an exhaustive review that encapsulates and critically assesses these novel developments. Addressing this requirement, this review offers a detailed analysis of the literature from 2021 to 2023, highlighting the cutting-edge deep learning methodologies harnessed for PPI analysis. Thus, this review stands as a crucial reference for researchers in the discipline, presenting an overview of the recent studies in the field. This consolidation helps elucidate the dynamic paradigm of PPI analysis, the evolution of deep learning techniques, and their interdependent dynamics. This scrutiny is expected to serve as a vital aid for researchers, both well-established and newcomers, assisting them in maneuvering the rapidly shifting terrain of deep learning applications in PPI analysis.
Affiliation(s)
- Minhyeok Lee
- School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
36
Ao C, Ye X, Sakurai T, Zou Q, Yu L. m5U-SVM: identification of RNA 5-methyluridine modification sites based on multi-view features of physicochemical features and distributed representation. BMC Biol 2023; 21:93. [PMID: 37095510 PMCID: PMC10127088 DOI: 10.1186/s12915-023-01596-0] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 12/20/2022] [Accepted: 04/12/2023] [Indexed: 04/26/2023]
Abstract
BACKGROUND: RNA 5-methyluridine (m5U) modifications are obtained by methylation at the C5 position of uridine catalyzed by pyrimidine methylation transferase, which is related to the development of human diseases. Accurate identification of m5U modification sites from RNA sequences can contribute to the understanding of their biological functions and the pathogenesis of related diseases. Compared to traditional experimental methods, easy-to-use computational methods based on machine learning can identify modification sites from RNA sequences in an efficient and time-saving manner. Despite the good performance of these computational methods, there are some drawbacks and limitations.
RESULTS: In this study, we have developed a novel predictor, m5U-SVM, based on multi-view features and machine learning algorithms to construct predictive models for identifying m5U modification sites from RNA sequences. In this method, we used four traditional physicochemical features and distributed representation features. The optimized multi-view features were obtained from the four fused traditional physicochemical features by using the two-step LightGBM and IFS methods, and then the distributed representation features were fused with the optimized physicochemical features to obtain the new multi-view features. The best performing classifier, support vector machine, was identified by screening different machine learning algorithms. In comparison, the proposed model performs better than the existing state-of-the-art tool.
CONCLUSIONS: m5U-SVM provides an effective tool that successfully captures sequence-related attributes of modifications and can accurately predict m5U modification sites from RNA sequences. The identification of m5U modification sites helps to understand and delve into the related biological processes and functions.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
37
Yang S, Yang Z, Yang J. 4mCBERT: A computing tool for the identification of DNA N4-methylcytosine sites by sequence- and chemical-derived information based on ensemble learning strategies. Int J Biol Macromol 2023; 231:123180. [PMID: 36646347 DOI: 10.1016/j.ijbiomac.2023.123180] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/19/2022] [Revised: 11/26/2022] [Accepted: 12/30/2022] [Indexed: 01/15/2023]
Abstract
N4-methylcytosine (4mC) is an important DNA chemical modification pattern which is a new methylation modification discovered in recent years and plays critical roles in gene expression regulation, defense against invading genetic elements, genomic imprinting, and so on. Identifying 4mC sites from DNA sequence segments contributes to discovering more novel modification patterns. In this paper, we present a model called 4mCBERT that encodes DNA sequence segments by sequence characteristics including one-hot, electron-ion interaction pseudopotential, nucleotide chemical property, word2vec and chemical information containing physicochemical properties (PCP), chemical bidirectional encoder representations from transformers (chemical BERT) and employs an ensemble learning framework to develop a prediction model. PCP and chemical BERT features are firstly constructed and applied to predict 4mC sites and show positive contributions to identifying 4mC. For the Matthews correlation coefficient, 4mCBERT significantly outperformed other state-of-the-art models on six independent benchmark datasets including A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus by 4.32 % to 24.39 %, 2.52 % to 31.65 %, 2 % to 16.49 %, 6.63 % to 35.15 %, 8.59 % to 61.85 %, and 8.45 % to 34.45 %, respectively. Moreover, 4mCBERT is designed to allow users to predict 4mC sites and retrain 4mC prediction models. In brief, 4mCBERT shows higher performance on six benchmark datasets by incorporating sequence- and chemical-driven information and is available at http://cczubio.top/4mCBERT and https://github.com/abcair/4mCBERT.
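As an illustration only, the one-hot encoding named in this abstract and the Matthews correlation coefficient used to report its results can be sketched as follows. The function names and the A, C, G, T column order are assumptions made for this sketch; the abstract does not describe 4mCBERT's actual implementation.

```python
import math

# Assumed column order for the sketch; the abstract does not state one.
NUCLEOTIDES = "ACGT"


def one_hot_encode(segment):
    """Return one 4-element 0/1 vector per nucleotide.

    Symbols outside A/C/G/T (for example N) map to an all-zero vector.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES]
            for base in segment.upper()]


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A call such as `one_hot_encode("ACGT")` yields one row per base, and `matthews_corrcoef` ranges from -1 (every prediction wrong) to 1 (every prediction right).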
Affiliation(s)
- Sen Yang
- School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China; The Affiliated Changzhou No 2 People's Hospital of Nanjing Medical University, Changzhou 213164, China
- Zexi Yang
- School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China
- Jun Yang
- School of Educational Sciences, Yili Normal University, Yining 835000, China
38
Karamveer, Tiwary BK. Genomic coevolution of papillomavirus and immune system in placental mammals indicates the role of IFN-γ in the emergence of new variants. Carcinogenesis 2023:bgad007. [PMID: 36827464 DOI: 10.1093/carcin/bgad007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2022] [Indexed: 02/26/2023]
Abstract
Papillomaviruses (PVs) are causative agents for warts and cancers in different parts of the body in the mammalian lineage. Therefore, these viruses are proposed as model organisms to study host immune responses to pathogens causing chronic infections. The virus-associated cancer progression depends on two integral processes, namely angiogenesis and immune response (AIR). The angiogenesis process aids in tumour progression through vessel formation and maturation but the host immune response, in contrast, makes every attempt to eliminate pathogens and thereby maintain healthy tissues. However, the evolutionary contribution of individual viral genes and host AIR genes in carcinogenesis is yet to be explored. Here, we applied the evolutionary genomics approach to find correlated evolution between six PV genes and 23 host AIR-related genes. We estimated that IFN-γ is the only host gene evolving in a correlated manner with all six PV genes under study. Furthermore, three papillomavirus genes, L2, E6, and E7, are found to interact with two-thirds of host AIR-related genes. Moreover, a combined differential gene expression analysis and network analysis showed that inflammatory cytokine IFN-γ is the key regulator of hub genes in the PPI network of the differentially expressed genes. Functional enrichment of these hub genes is consistent with their established role in different cancers and viral infections. Overall, we conclude that IFN-γ maintains selective pressure on mammalian PV genes and seems to be a potential biomarker for PV-related cancers. This study demonstrates the evolutionary importance of IFN-γ in deciding the fate of carcinogenic PV variants.
Affiliation(s)
- Karamveer
- Department of Bioinformatics, School of Life Sciences, Pondicherry University, Pondicherry 605 014, India
- Basant K Tiwary
- Department of Bioinformatics, School of Life Sciences, Pondicherry University, Pondicherry 605 014, India
39
Kang Y, Xu Y, Wang X, Pu B, Yang X, Rao Y, Chen J. HN-PPISP: a hybrid network based on MLP-Mixer for protein-protein interaction site prediction. Brief Bioinform 2023; 24:6833645. [PMID: 36403092 DOI: 10.1093/bib/bbac480] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 07/05/2022] [Revised: 09/16/2022] [Accepted: 10/09/2022] [Indexed: 11/21/2022]
Abstract
MOTIVATION: Biological experimental approaches to protein-protein interaction (PPI) site prediction are critical for understanding the mechanisms of biochemical processes but are time-consuming and laborious. With the development of Deep Learning (DL) techniques, the most popular Convolutional Neural Networks (CNN)-based methods have been proposed to address these problems. Although significant progress has been made, these methods still have limitations in encoding the characteristics of each amino acid in protein sequences. Current methods cannot efficiently explore the nature of Position Specific Scoring Matrix (PSSM), secondary structure and raw protein sequences by processing them all together. For PPI site prediction, how to effectively model the PPI context with attention to prediction remains an open problem. In addition, the long-distance dependencies of PPI features are important, which is very challenging for many CNN-based methods because CNNs struggle to match auto-regressive models like Transformers at capturing such dependencies.
RESULTS: To effectively mine the properties of PPI features, a novel hybrid neural network named HN-PPISP is proposed, which integrates a Multi-layer Perceptron Mixer (MLP-Mixer) module for local feature extraction and a two-stage multi-branch module for global feature capture. The model combines the merits of Transformer, TextCNN and Bi-LSTM as a powerful alternative for PPI site prediction. On the one hand, this is the first application of an advanced Transformer (i.e. MLP-Mixer) with a hybrid network for sequence-based PPI prediction. On the other hand, unlike existing methods that treat global features altogether, the proposed two-stage multi-branch hybrid module firstly assigns different attention scores to the input features and then encodes the features through different branch modules. In the first stage, different improved attention modules are hybridized to extract features from the raw protein sequences, secondary structure and PSSM, respectively. In the second stage, a multi-branch network is designed to aggregate information from both branches in parallel. The two branches encode the features and extract dependencies through several operations such as TextCNN, Bi-LSTM and different activation functions. Experimental results on real-world public datasets show that our model consistently achieves state-of-the-art performance over seven remarkable baselines.
AVAILABILITY: The source code of the HN-PPISP model is available at https://github.com/ylxu05/HN-PPISP.
Affiliation(s)
- Yan Kang
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
- Yulong Xu
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
- Xinchao Wang
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
- Bin Pu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, P.R. China
- Xuekun Yang
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
- Yulong Rao
- National Pilot School of Software, Yunnan University, Kunming, 650091, P.R. China
- Jianguo Chen
- School of Software Engineering, Sun Yat-Sen University, Zhuhai, 519082, P.R. China
40
Ma Y, Zhong J. Logistic tensor decomposition with sparse subspace learning for prediction of multiple disease types of human-virus protein-protein interactions. Brief Bioinform 2023; 24:6961474. [PMID: 36573486 DOI: 10.1093/bib/bbac604] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Revised: 12/04/2022] [Accepted: 12/08/2022] [Indexed: 12/28/2022]
Abstract
Viral infection involves a large number of protein-protein interactions (PPIs) between the virus and the host, and the identification of these PPIs plays an important role in revealing viral infection and pathogenesis. Existing computational models focus on predicting whether human proteins and viral proteins interact, and rarely take into account the types of diseases associated with these interactions. Although there are computational models based on a matrix and tensor decomposition for predicting multi-type biological interaction relationships, these methods cannot effectively model high-order nonlinear relationships of biological entities and are not suitable for integrating multiple features. To this end, we propose a novel computational framework, LTDSSL, to determine human-virus PPIs under different disease types. LTDSSL utilizes logistic functions to model nonlinear associations, sets importance levels to emphasize the importance of observed interactions and utilizes sparse subspace learning of multiple features to improve model performance. Experimental results show that LTDSSL has better predictive performance for both new disease types and new triples than the state-of-the-art methods. In addition, the case study further demonstrates that LTDSSL can effectively predict human-viral PPIs under various disease types.
Affiliation(s)
- Yingjun Ma
- School of Mathematics and Statistics, Xiamen University of Technology, Xiamen, 361024, China
- Junjiang Zhong
- School of Mathematics and Statistics, Xiamen University of Technology, Xiamen, 361024, China
41
Ortiz-Vilchis P, De-la-Cruz-García JS, Ramirez-Arellano A. Identification of Relevant Protein Interactions with Partial Knowledge: A Complex Network and Deep Learning Approach. Biology (Basel) 2023; 12:140. [PMID: 36671832 PMCID: PMC9856098 DOI: 10.3390/biology12010140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 01/11/2023] [Accepted: 01/12/2023] [Indexed: 01/18/2023]
Abstract
Protein-protein interactions (PPIs) are the basis for understanding most cellular events in biological systems. Several experimental methods, e.g., biochemical, molecular, and genetic methods, have been used to identify protein-protein associations. However, some of them, such as mass spectrometry, are time-consuming and expensive. Machine learning (ML) techniques have been widely used to characterize PPIs, increasing the number of proteins analyzed simultaneously and optimizing time and resources for identifying and predicting protein-protein functional linkages. Previous ML approaches have focused on well-known networks or specific targets but not on identifying relevant proteins with partial or null knowledge of the interaction networks. The proposed approach aims to generate a relevant protein sequence based on bidirectional Long-Short Term Memory (LSTM) with partial knowledge of interactions. The general framework comprises conducting a scale-free and fractal complex network analysis. The outcome of these analyses is then used to fine-tune the fractal method for the vital protein extraction of PPI networks. The results show that several PPI networks are self-similar or fractal, but that both features cannot coexist. The generated protein sequences (by the bidirectional LSTM) also contain an average of 39.5% of proteins in the original sequence. The average length of the generated sequences was 17% of the original one. Finally, 95% of the generated sequences were true.
Affiliation(s)
- Pilar Ortiz-Vilchis
- Sección de Estudios de Posgrado e Investigación, Escuela Superior de Medicina, Instituto Politécnico Nacional, Mexico City 11340, Mexico
- Jazmin-Susana De-la-Cruz-García
- Sección de Estudios de Posgrado e Investigación, Unidad Profesional Interdisciplinaria de Ingeniería y Ciencias Sociales y Administrativas, Instituto Politécnico Nacional, Mexico City 08400, Mexico
- Aldo Ramirez-Arellano
- Sección de Estudios de Posgrado e Investigación, Unidad Profesional Interdisciplinaria de Ingeniería y Ciencias Sociales y Administrativas, Instituto Politécnico Nacional, Mexico City 08400, Mexico
42
Murakami Y, Mizuguchi K. Recent developments of sequence-based prediction of protein-protein interactions. Biophys Rev 2022; 14:1393-1411. [PMID: 36589735 PMCID: PMC9789376 DOI: 10.1007/s12551-022-01038-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Accepted: 12/08/2022] [Indexed: 12/25/2022]
Abstract
The identification of protein-protein interactions (PPIs) can lead to a better understanding of cellular functions and biological processes of proteins and contribute to the design of drugs to target disease-causing PPIs. In addition, targeting host-pathogen PPIs is useful for elucidating infection mechanisms. Although several experimental methods have been used to identify PPIs, these methods have yet to draw complete PPI networks. Hence, computational techniques are increasingly required for the prediction of potential PPIs that have never been observed experimentally. Recent high-performance sequence-based methods have contributed to the construction of PPI networks and the elucidation of pathogenetic mechanisms in specific diseases. However, the usefulness of these methods depends on the quality and quantity of training data of PPIs. In this brief review, we introduce currently available PPI databases and recent sequence-based methods for predicting PPIs. Also, we discuss key issues in this field and present future perspectives of the sequence-based PPI predictions.
Affiliation(s)
- Yoichi Murakami
- Tokyo University of Information Sciences, 4-1 Onaridai, Wakaba-Ku, Chiba, 265-8501 Japan
- Kenji Mizuguchi
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita-Shi, Osaka, 565-0871 Japan
- National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito Asagi, Ibaraki, Osaka 567-0085 Japan
43
Neumann D, Roy S, Minhas FUAA, Ben-Hur A. On the choice of negative examples for prediction of host-pathogen protein interactions. Front Bioinform 2022; 2:1083292. [PMID: 36591335 PMCID: PMC9798088 DOI: 10.3389/fbinf.2022.1083292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Accepted: 11/14/2022] [Indexed: 12/23/2022]
Abstract
As practitioners of machine learning in the area of bioinformatics we know that the quality of the results crucially depends on the quality of our labeled data. While there is a tendency to focus on the quality of positive examples, the negative examples are equally important. In this opinion paper we revisit the problem of choosing negative examples for the task of predicting protein-protein interactions, either among proteins of a given species or for host-pathogen interactions, and describe important issues that are prevalent in the current literature. The challenge in creating datasets for this task is the noisy nature of the experimentally derived interactions and the lack of information on non-interacting proteins. A standard approach is to choose random pairs of non-interacting proteins as negative examples. Since the interactomes of all species are only partially known, this leads to a very small percentage of false negatives. This is especially true for host-pathogen interactions. To address this perceived issue, some researchers have chosen to select negative examples as pairs of proteins whose sequence similarity to the positive examples is sufficiently low. This clearly reduces the chance for false negatives, but also makes the problem much easier than it really is, leading to over-optimistic accuracy estimates. We demonstrate the effect of this form of bias using a selection of recent protein interaction prediction methods of varying complexity, and urge researchers to pay attention to the details of generating their datasets for potential biases like this.
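The standard approach mentioned in this abstract, drawing random protein pairs that are not among the known positives, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, its parameters and the fixed seed are hypothetical, and all unordered pairs are enumerated, which is only practical for small protein sets. For host-pathogen data the candidates would be host-by-pathogen pairs rather than all pairs within one set.

```python
import itertools
import random


def sample_random_negatives(proteins, positives, n_negatives, seed=0):
    """Draw random unordered protein pairs absent from the positive set.

    Pairs drawn this way are unlabeled rather than verified non-interactors,
    so a small fraction may be undiscovered interactions (false negatives).
    """
    known = {frozenset(pair) for pair in positives}
    candidates = [
        pair
        for pair in itertools.combinations(sorted(set(proteins)), 2)
        if frozenset(pair) not in known
    ]
    if n_negatives > len(candidates):
        raise ValueError("not enough unlabeled pairs to sample from")
    # A seeded generator keeps the sketch reproducible.
    return random.Random(seed).sample(candidates, n_negatives)
```

The sketch deliberately applies no sequence-similarity filter to the negatives, since the abstract argues that such filtering makes the task artificially easy.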
Affiliation(s)
- Don Neumann
- Department of Computer Science, Colorado State University, Fort Collins, CO, United States
- Soumyadip Roy
- Department of Computer Science, Colorado State University, Fort Collins, CO, United States
- Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, United States
44
Asim MN, Fazeel A, Ibrahim MA, Dengel A, Ahmed S. MP-VHPPI: Meta predictor for viral host protein-protein interaction prediction in multiple hosts and viruses. Front Med (Lausanne) 2022; 9:1025887. [PMID: 36465911 PMCID: PMC9709337 DOI: 10.3389/fmed.2022.1025887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Accepted: 10/17/2022] [Indexed: 09/19/2023]
Abstract
Viral-host protein-protein interaction (VHPPI) prediction is essential to decoding molecular mechanisms of viral pathogens and host immunity processes that eventually help to control the propagation of viral diseases and to design optimized therapeutics. Multiple AI-based predictors have been developed to predict diverse VHPPIs across a wide range of viruses and hosts; however, these predictors produce better performance only for specific types of hosts and viruses. The prime objective of this research is to develop a robust meta predictor (MP-VHPPI) capable of more accurately predicting VHPPI across multiple hosts and viruses. The proposed meta predictor makes use of two well-known encoding methods, Amphiphilic Pseudo-Amino Acid Composition (APAAC) and Quasi-sequence (QS) Order, that capture amino acid sequence order and distributional information to most effectively generate the numerical representation of complete viral-host raw protein sequences. A feature agglomeration method is utilized to transform the original feature space into a more informative feature space. Random forest (RF) and Extra tree (ET) classifiers are trained on the optimized feature space of the APAAC and QS order encoders separately and on the combination of both encodings. The predictions of both classifiers are then fed to the Support Vector Machine (SVM) classifier that makes final predictions. The proposed meta predictor is evaluated over 7 different benchmark datasets, where it outperforms existing VHPPI predictors with an average improvement of 3.07, 6.07, 2.95, and 2.85% in terms of accuracy, Matthews correlation coefficient, precision, and sensitivity, respectively. To facilitate the scientific community, the MP-VHPPI web server is available at https://sds_genetic_analysis.opendfki.de/MP-VHPPI/.
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Ahtisham Fazeel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
45
Cross-attention PHV: Prediction of human and virus protein-protein interactions using cross-attention-based neural networks. Comput Struct Biotechnol J 2022; 20:5564-5573. [PMID: 36249566 PMCID: PMC9546503 DOI: 10.1016/j.csbj.2022.10.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/09/2022] [Revised: 10/05/2022] [Accepted: 10/05/2022] [Indexed: 11/30/2022]
Abstract
Cross-attention PHV implements two key technologies: cross-attention mechanism and 1D-CNN. It accurately predicts PPIs between human and unknown influenza viruses/SARS-CoV-2. It extracts critical taxonomic and evolutionary differences responsible for PPI prediction.
Viral infections represent a major health concern worldwide. The alarming rate at which SARS-CoV-2 spreads, for example, led to a worldwide pandemic. Viruses incorporate genetic material into the host genome to hijack host cell functions such as the cell cycle and apoptosis. In these viral processes, protein–protein interactions (PPIs) play critical roles. Therefore, the identification of PPIs between humans and viruses is crucial for understanding the infection mechanism and host immune responses to viral infections and for discovering effective drugs. Experimental methods including mass spectrometry-based proteomics and yeast two-hybrid assays are widely used to identify human-virus PPIs, but these experimental methods are time-consuming, expensive, and laborious. To overcome this problem, we developed a novel computational predictor, named cross-attention PHV, by implementing two key technologies of the cross-attention mechanism and a one-dimensional convolutional neural network (1D-CNN). The cross-attention mechanisms were very effective in enhancing prediction and generalization abilities. Application of 1D-CNN to the word2vec-generated feature matrices reduced computational costs, thus extending the allowable length of protein sequences to 9000 amino acid residues. Cross-attention PHV outperformed existing state-of-the-art models using a benchmark dataset and accurately predicted PPIs for unknown viruses. Cross-attention PHV also predicted human–SARS-CoV-2 PPIs with area under the curve values >0.95. The Cross-attention PHV web server and source codes are freely available at https://kurata35.bio.kyutech.ac.jp/Cross-attention_PHV/ and https://github.com/kuratahiroyuki/Cross-Attention_PHV, respectively.
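The cross-attention mechanism named in this abstract can be sketched generically as scaled dot-product attention in which the residues of one protein act as queries over the residues of the other. This is a simplified illustration and not the Cross-attention PHV architecture: learned projection matrices, multiple heads and the 1D-CNN are omitted, and the function names and toy embeddings are assumptions for this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections.

    queries:     per-residue embeddings of one protein (list of vectors)
    keys_values: per-residue embeddings of the partner protein, used as
                 both keys and values (a simplification for this sketch)
    Returns one attended vector per query residue.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([
            sum(w * v[d] for w, v in zip(weights, keys_values))
            for d in range(dim)
        ])
    return attended
```

Each output row is a weighted average of the partner protein's residue embeddings, so a human residue's representation becomes conditioned on the viral sequence; swapping the two arguments gives the other direction.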
Key Words
- 1D-CNN, One-dimensional-CNN
- AC, Accuracy
- AUC, Area under the curve
- CNN, Convolutional neural network
- Convolutional neural network
- DT, Decision tree
- F1, F1-score
- HV-PPIs, Human-virus PPIs
- HuV-PPI, Human–unknown virus PPI
- Human
- LR, Linear regression
- MCC, Matthews correlation coefficient
- PPIs, Protein-protein interactions
- Protein–protein interaction
- RF, Random forest
- SARS-CoV-2
- SARS-CoV-2, Severe acute respiratory syndrome coronavirus 2
- SN, Sensitivity
- SP, Specificity
- SVM, Support vector machine
- T-SNE, T-distributed stochastic neighbor embedding
- Virus
- W2V, Word2vec
- Word2vec
46
Madan S, Demina V, Stapf M, Ernst O, Fröhlich H. Accurate prediction of virus-host protein-protein interactions via a Siamese neural network using deep protein sequence embeddings. Patterns (New York, N.Y.) 2022; 3:100551. [PMID: 36124304 PMCID: PMC9481957 DOI: 10.1016/j.patter.2022.100551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2022] [Revised: 03/28/2022] [Accepted: 06/16/2022] [Indexed: 11/13/2022]
Abstract
Prediction and understanding of virus-host protein-protein interactions (PPIs) have relevance for the development of novel therapeutic interventions. In addition, virus-like particles open novel opportunities to deliver therapeutics to targeted cell types and tissues. Given our incomplete knowledge of PPIs on the one hand and the cost and time associated with experimental procedures on the other, we here propose a deep learning approach to predict virus-host PPIs. Our method (Siamese Tailored deep sequence Embedding of Proteins [STEP]) is based on recent deep protein sequence embedding techniques, which we integrate into a Siamese neural network. After showing the state-of-the-art performance of STEP on external datasets, we apply it to two use cases, severe acute respiratory syndrome coronavirus 2 and John Cunningham polyomavirus, to predict virus-host PPIs. Altogether our work highlights the potential of deep sequence embedding techniques originating from the field of NLP as well as explainable artificial intelligence methods for the analysis of biological sequences.
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Institute of Computer Science, University of Bonn, 53115 Bonn, Germany
- Marcus Stapf
- NEUWAY Pharma GmbH, In den Dauen 6A, 53117 Bonn, Germany
- Oliver Ernst
- NEUWAY Pharma GmbH, In den Dauen 6A, 53117 Bonn, Germany
- Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, 53113 Bonn, Germany
47
Kumar S, Kumar GS, Maitra SS, Malý P, Bharadwaj S, Sharma P, Dwivedi VD. Viral informatics: bioinformatics-based solution for managing viral infections. Brief Bioinform 2022; 23:6659740. [PMID: 35947964 DOI: 10.1093/bib/bbac326] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/27/2022] [Revised: 06/26/2022] [Accepted: 07/18/2022] [Indexed: 11/13/2022]
Abstract
Several new viral infections have emerged in the human population and are establishing themselves as global pandemics. With advancements in translational research, the scientific community has developed potential therapeutics to eradicate or control certain viral infections, such as smallpox and polio, responsible for billions of disabilities and deaths in the past. Unfortunately, some viral infections, such as dengue virus (DENV) and human immunodeficiency virus-1 (HIV-1), still prevail due to a lack of specific therapeutics, while new pathogenic viral strains or variants are emerging because of high genetic recombination or cross-species transmission. Consequently, to combat emerging viral infections, bioinformatics-based strategies have been developed for characterizing viruses and for developing new, effective therapeutics for their eradication or management. This review attempts to provide a single platform for the wide range of available bioinformatics-based approaches, including bioinformatics methods for the identification and management of emerging or evolved viral strains, genome analysis concerning pathogenicity and epidemiology, computational methods for designing viral therapeutics, and consolidated information in the form of databases on known pathogenic viruses. This review of generally applicable viral informatics approaches aims to provide an overview of available resources capable of carrying out the desired task, and may be used to develop additional strategies that improve the quality of translational viral informatics research.
Affiliation(s)
- Sanjay Kumar
- School of Biotechnology, Jawaharlal Nehru University, New Delhi, India; Center for Bioinformatics, Computational and Systems Biology, Pathfinder Research and Training Foundation, Greater Noida, India
- Geethu S Kumar
- Department of Life Science, School of Basic Science and Research, Sharda University, Greater Noida, Uttar Pradesh, India; Center for Bioinformatics, Computational and Systems Biology, Pathfinder Research and Training Foundation, Greater Noida, India
- Petr Malý
- Laboratory of Ligand Engineering, Institute of Biotechnology of the Czech Academy of Sciences v.v.i., BIOCEV Research Center, Vestec, Czech Republic
- Shiv Bharadwaj
- Laboratory of Ligand Engineering, Institute of Biotechnology of the Czech Academy of Sciences v.v.i., BIOCEV Research Center, Vestec, Czech Republic
- Pradeep Sharma
- Department of Biophysics, All India Institute of Medical Sciences, New Delhi, India
- Vivek Dhar Dwivedi
- Center for Bioinformatics, Computational and Systems Biology, Pathfinder Research and Training Foundation, Greater Noida, India; Institute of Advanced Materials, IAAM, 59053 Ulrika, Sweden
48
Kurata H, Tsukiyama S, Manavalan B. iACVP: markedly enhanced identification of anti-coronavirus peptides using a dataset-specific word2vec model. Brief Bioinform 2022; 23:6623727. [PMID: 35772910 DOI: 10.1093/bib/bbac265] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 04/06/2022] [Revised: 05/23/2022] [Accepted: 06/06/2022] [Indexed: 01/22/2023]
Abstract
The COVID-19 pandemic caused several million deaths worldwide. Development of anti-coronavirus drugs is thus urgent. Unlike conventional non-peptide drugs, antiviral peptide drugs are highly specific, easy to synthesize and modify, and not highly susceptible to drug resistance. To reduce the time and expense involved in screening thousands of peptides and assaying their antiviral activity, computational predictors for identifying anti-coronavirus peptides (ACVPs) are needed. However, few experimentally verified ACVP samples are available, even though a relatively large number of antiviral peptides (AVPs) have been discovered. In this study, we attempted to predict ACVPs using an AVP dataset and a small collection of ACVPs. Using conventional features, a binary profile and a word-embedding word2vec (W2V), we systematically explored five different machine learning methods: Transformer, Convolutional Neural Network, bidirectional Long Short-Term Memory, Random Forest (RF) and Support Vector Machine. Via exhaustive searches, we found that the RF classifier with W2V consistently achieved better performance on different datasets. The two main controlling factors were: (i) the dataset-specific W2V dictionary was generated from the training and independent test datasets instead of the widely used general UniProt proteome and (ii) a systematic search was conducted and determined the optimal k-mer value in W2V, which provides greater discrimination between positive and negative samples. Therefore, our proposed method, named iACVP, consistently provides better prediction performance compared with existing state-of-the-art methods. To assist experimentalists in identifying putative ACVPs, we implemented our model as a web server accessible via the following link: http://kurata35.bio.kyutech.ac.jp/iACVP.
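The k-mer value mentioned in this abstract controls how a peptide is cut into "words" before a word2vec model is trained on them. A minimal sketch of that tokenization step follows, assuming overlapping k-mers with a stride of one residue; the abstract does not say whether iACVP uses overlapping or non-overlapping k-mers, and the function name and example peptide are hypothetical.

```python
def kmer_tokens(sequence, k):
    """Split a peptide sequence into overlapping k-mers (stride 1).

    Returns an empty list when the sequence is shorter than k.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical peptide, used only to show how k changes the vocabulary.
peptide = "GLFDIVKK"
for k in (1, 2, 3):
    print(k, kmer_tokens(peptide, k))
```

Larger k gives fewer, more specific tokens per peptide, which is presumably why the value is worth searching over rather than fixing in advance.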
Affiliation(s)
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
49
Decoding the protein-ligand interactions using parallel graph neural networks. Sci Rep 2022; 12:7624. [PMID: 35538084 PMCID: PMC9086424 DOI: 10.1038/s41598-022-10418-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 12/09/2021] [Accepted: 04/06/2022] [Indexed: 12/13/2022]
Abstract
Protein-ligand interactions (PLIs) are essential for biochemical functionality, and their identification is crucial for estimating biophysical properties for rational therapeutic design. Currently, experimental characterization of these properties is the most accurate method; however, it is very time-consuming and labor-intensive. A number of computational methods have been developed in this context, but most existing PLI prediction methods depend heavily on 2D protein sequence data. Here, we present a novel parallel graph neural network (GNN) to integrate knowledge representation and reasoning for PLI prediction to perform deep learning guided by expert knowledge and informed by 3D structural data. We develop two distinct GNN architectures: [Formula: see text] is the base implementation that employs distinct featurization to enhance domain-awareness, while [Formula: see text] is a novel implementation that can predict with no prior knowledge of the intermolecular interactions. The comprehensive evaluation demonstrated that GNN can successfully capture the binary interactions between ligand and protein's 3D structure with 0.979 test accuracy for [Formula: see text] and 0.958 for [Formula: see text] for predicting activity of a protein-ligand complex. These models are further adapted for regression tasks to predict experimental binding affinities and [Formula: see text] crucial for a compound's potency and efficacy. We achieve a Pearson correlation coefficient of 0.66 and 0.65 on experimental affinity and 0.50 and 0.51 on [Formula: see text] with [Formula: see text] and [Formula: see text], respectively, outperforming similar 2D sequence-based models. Our method can serve as an interpretable and explainable artificial intelligence (AI) tool for predicted activity, potency, and biophysical properties of lead candidates.
To this end, we show the utility of [Formula: see text] on SARS-CoV-2 protein targets by screening a large compound library and comparing the prediction with the experimentally measured data.
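The regression results in this abstract are reported as Pearson correlation coefficients between predicted and experimentally measured values. For readers who want to reproduce that kind of score on their own predictions, here is a minimal sketch of the standard formula; the function name is an assumption for this example, and the code is not taken from the paper.

```python
import math


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_m)
```

A value near 1 means predictions rise and fall with the measurements, a value near 0 means no linear relationship, and the coefficient is insensitive to a constant offset or rescaling of either list.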
50
Yang X, Yang S, Ren P, Wuchty S, Zhang Z. Deep Learning-Powered Prediction of Human-Virus Protein-Protein Interactions. Front Microbiol 2022; 13:842976. [PMID: 35495666 PMCID: PMC9051481 DOI: 10.3389/fmicb.2022.842976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2021] [Accepted: 03/25/2022] [Indexed: 11/13/2022]
Abstract
Identifying human-virus protein-protein interactions (PPIs) is an essential step for understanding viral infection mechanisms and antiviral response of the human host. Recent advances in high-throughput experimental techniques enable the significant accumulation of human-virus PPI data, which have further fueled the development of machine learning-based human-virus PPI prediction methods. Emerging as a very promising method to predict human-virus PPIs, deep learning shows the powerful ability to integrate large-scale datasets, learn complex sequence-structure relationships of proteins and convert the learned patterns into final prediction models with high accuracy. Focusing on the recent progresses of deep learning-powered human-virus PPI predictions, we review technical details of these newly developed methods, including dataset preparation, deep learning architectures, feature engineering, and performance assessment. Moreover, we discuss the current challenges and potential solutions and provide future perspectives of human-virus PPI prediction in the coming post-AlphaFold2 era.
Affiliation(s)
- Xiaodi Yang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China
- Shiping Yang
- State Key Laboratory of Plant Physiology and Biochemistry, College of Biological Sciences, China Agricultural University, Beijing, China
- Panyu Ren
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China
- Stefan Wuchty
- Department of Computer Science, University of Miami, Miami, FL, United States
- Department of Biology, University of Miami, Miami, FL, United States
- Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL, United States
- Ziding Zhang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China
- *Correspondence: Ziding Zhang,