1
Li Y, Zou Q, Dai Q, Stalin A, Luo X. Identifying the DNA methylation preference of transcription factors using ProtBERT and SVM. PLoS Comput Biol 2025; 21:e1012513. [PMID: 40359430 PMCID: PMC12121914 DOI: 10.1371/journal.pcbi.1012513] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2024] [Revised: 05/29/2025] [Accepted: 04/29/2025] [Indexed: 05/15/2025] Open
Abstract
Transcription factors (TFs) can affect gene expression by binding to certain specific DNA sequences. This binding process of TFs may be modulated by DNA methylation. A subset of TFs that serve as methylation readers preferentially bind to certain methylated DNA and are defined as TFPMs. The identification of TFPMs enhances our understanding of DNA methylation's role in gene regulation. However, their experimental identification is resource-demanding. In this study, we propose a novel two-step computational approach to classify TFs and TFPMs. First, we employed a fine-tuned ProtBERT model to differentiate between the classes of TFs and non-TFs. Second, we combined the Reduced Amino Acid Category (RAAC) with K-mer and SVM to predict the potential of TFs to bind to methylated DNA. Comparative experiments demonstrate that our proposed methods outperform all existing approaches and emphasize the efficiency of our computational framework in classifying TFs and TFPMs. Cross-species validation on an independent mouse dataset further demonstrates the generalizability of our proposed framework. In addition, we conducted predictions on all human transcription factors and found that most of the top 20 proteins belong to the Krueppel C2H2-type Zinc-finger family. So far, some studies have demonstrated a partial correlation between this family and DNA methylation and confirmed the preference of some of its members, thereby showing the robustness of our approach.
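The second step described in this abstract (reduce the amino acid alphabet, count k-mers, then feed the counts to an SVM) can be illustrated with a minimal sketch. The five-group alphabet below is a hypothetical grouping chosen for illustration, not the RAAC scheme selected in the paper, and `reduce_sequence` and `kmer_features` are assumed names.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet: these five groups are illustrative only,
# not the clustering used by the authors.
RAAC_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "B": "FWYH",    # aromatic
    "C": "STNQ",    # polar, uncharged
    "D": "KR",      # positively charged
    "E": "DEGP",    # acidic plus small/special residues
}
AA_TO_GROUP = {aa: g for g, members in RAAC_GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group letter, skipping unknown residues."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP)


def kmer_features(seq, k=2):
    """Return normalized k-mer frequencies over the reduced alphabet."""
    reduced = reduce_sequence(seq)
    alphabet = sorted(RAAC_GROUPS)
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]
```

The resulting fixed-length vector (25 values for k = 2 over five groups) is the kind of input an SVM classifier could be trained on; which reduction scheme and which k the authors used is not stated in the abstract.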
Affiliation(s)
- Yanchao Li
  - School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Qi Dai
  - College of Life Science and Medicine, Zhejiang Sci-Tech University, Hangzhou, Zhejiang, China
- Antony Stalin
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Ximei Luo
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
2
Luo J, Zhao K, Chen J, Yang C, Qu F, Liu Y, Jin X, Yan K, Zhang Y, Liu B. iMFP-LG: Identify Novel Multi-functional Peptides Using Protein Language Models and Graph-based Deep Learning. Genomics, Proteomics & Bioinformatics 2025; 22:qzae084. [PMID: 39585308 PMCID: PMC12011362 DOI: 10.1093/gpbjnl/qzae084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 10/25/2024] [Accepted: 11/21/2024] [Indexed: 11/26/2024]
Abstract
Functional peptides are short amino acid fragments that have a wide range of beneficial functions for living organisms. The majority of previous studies have focused on mono-functional peptides, but an increasing number of multi-functional peptides have been discovered. Although there have been enormous experimental efforts to assay multi-functional peptides, only a small portion of millions of known peptides has been explored. The development of effective and accurate techniques for identifying multi-functional peptides can facilitate their discovery and mechanistic understanding. In this study, we presented iMFP-LG, a method for multi-functional peptide identification based on protein language models (pLMs) and graph attention networks (GATs). Our comparative analyses demonstrated that iMFP-LG outperformed the state-of-the-art methods in identifying both multi-functional bioactive peptides and multi-functional therapeutic peptides. The interpretability of iMFP-LG was also illustrated by visualizing attention patterns in pLMs and GATs. Given the outstanding performance of iMFP-LG in identifying multi-functional peptides, we employed iMFP-LG to screen novel peptides with both anti-microbial and anti-cancer functions from millions of known peptides in the UniRef90 database. As a result, eight candidate peptides were identified, among which one candidate was validated to possess both anti-bacterial and anti-cancer properties through molecular structure alignment and biological experiments. We anticipate that iMFP-LG can assist in the discovery of multi-functional peptides and contribute to the advancement of peptide drug design.
Affiliation(s)
- Jiawei Luo
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Kejuan Zhao
  - School of Science, Harbin Institute of Technology, Shenzhen 518055, China
- Junjie Chen
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Caihua Yang
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Fuchuan Qu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Yumeng Liu
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518055, China
- Xiaopeng Jin
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518055, China
- Ke Yan
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 10081, China
- Yang Zhang
  - School of Science, Harbin Institute of Technology, Shenzhen 518055, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 10081, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 10081, China
3
Jiang R, Yue Z, Shang L, Wang D, Wei N. PEZy-miner: An artificial intelligence driven approach for the discovery of plastic-degrading enzyme candidates. Metab Eng Commun 2024; 19:e00248. [PMID: 39310048 PMCID: PMC11414552 DOI: 10.1016/j.mec.2024.e00248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Revised: 07/14/2024] [Accepted: 09/03/2024] [Indexed: 09/25/2024] Open
Abstract
Plastic waste has caused a global environmental crisis. Biocatalytic depolymerization mediated by enzymes has emerged as an efficient and sustainable alternative for plastic treatment and recycling. However, it is challenging and time-consuming to discover novel plastic-degrading enzymes using conventional cultivation-based or omics methods. There is a growing interest in developing effective computational methods to identify new enzymes with desirable plastic degradation functionalities by exploring the ever-increasing databases of protein sequences. In this study, we designed an innovative machine learning-based framework, named PEZy-Miner, to mine for enzymes with high potential in degrading plastics of interest. Two datasets integrating information from experimentally verified enzymes and homologs with unknown plastic-degrading activity were created respectively, covering eleven types of plastic substrates. Protein language models and binary classification models were developed to predict enzymatic degradation of plastics along with confidence and uncertainty estimation. PEZy-Miner exhibited high prediction accuracy and stability when validated on experimentally verified enzymes. Furthermore, by masking the experimentally verified enzymes and blending them into homolog dataset, PEZy-Miner effectively concentrated the experimentally verified entries by 14∼30 times while shortlisting promising plastic-degrading enzyme candidates. We applied PEZy-Miner to 0.1 million putative sequences, out of which 27 new sequences were identified with high confidence. This study provided a new computational tool for mining and recommending promising new plastic-degrading enzymes.
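The "concentrated the experimentally verified entries by 14∼30 times" result reads like a fold-enrichment measure: the share of verified enzymes in the shortlist divided by their share in the full blended dataset. The sketch below shows that calculation under this assumed interpretation; `enrichment_fold` and its arguments are hypothetical names, and the abstract does not give the exact formula the authors used.

```python
def enrichment_fold(scores, is_verified, top_n):
    """Fold enrichment of verified entries among the top_n highest scores.

    scores      -- model confidence per sequence (higher = more promising)
    is_verified -- 1 if the sequence is an experimentally verified enzyme, else 0
    top_n       -- size of the shortlist
    """
    if len(scores) != len(is_verified):
        raise ValueError("scores and is_verified must have the same length")
    if not 0 < top_n <= len(scores):
        raise ValueError("top_n must be between 1 and the number of items")
    base_rate = sum(is_verified) / len(is_verified)
    if base_rate == 0:
        raise ValueError("no verified entries to enrich")
    # Rank by score, highest first, and measure the verified share in the shortlist.
    ranked = sorted(zip(scores, is_verified), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(flag for _, flag in ranked[:top_n]) / top_n
    return top_rate / base_rate
```

A value of 1 means the shortlist is no richer in verified enzymes than a random draw; larger values mean the ranking pushes verified entries toward the top.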
Affiliation(s)
- Renjing Jiang
  - Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, Urbana, IL, 61801, United States
- Zhenrui Yue
  - School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
- Lanyu Shang
  - School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
- Dong Wang
  - School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
- Na Wei
  - Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, Urbana, IL, 61801, United States
4
Sui J, Chen J, Chen Y, Iwamori N, Sun J. GASIDN: identification of sub-Golgi proteins with multi-scale feature fusion. BMC Genomics 2024; 25:1019. [PMID: 39478465 PMCID: PMC11526662 DOI: 10.1186/s12864-024-10954-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Accepted: 10/24/2024] [Indexed: 11/02/2024] Open
Abstract
The Golgi apparatus is a crucial component of the inner membrane system in eukaryotic cells, playing a central role in protein biosynthesis. Dysfunction of the Golgi apparatus has been linked to neurodegenerative diseases. Accurate identification of sub-Golgi protein types is therefore essential for developing effective treatments for such diseases. Due to the expensive and time-consuming nature of experimental methods for identifying sub-Golgi protein types, various computational methods have been developed as identification tools. However, the majority of these methods rely solely on neighboring features in the protein sequence and neglect the crucial spatial structure information of the protein. To discover alternative methods for accurately identifying sub-Golgi proteins, we have developed a model called GASIDN. The GASIDN model extracts multi-dimension features by utilizing a 1D convolution module on protein sequences and a graph learning module on contact maps constructed from AlphaFold2. The model utilizes the deep representation learning model SeqVec to initialize protein sequences. GASIDN achieved accuracy values of 98.4% and 96.4% in independent testing and ten-fold cross-validation, respectively, outperforming the majority of previous predictors. To the best of our knowledge, this is the first method that utilizes multi-scale feature fusion to identify and locate sub-Golgi proteins. In order to assess the generalizability and scalability of our model, we conducted experiments to apply it in the identification of proteins from other organelles, including plant vacuoles and peroxisomes. The results obtained from these experiments demonstrated promising outcomes, indicating the effectiveness and versatility of our model. The source code and datasets can be accessed at https://github.com/SJNNNN/GASIDN .
Affiliation(s)
- Jianan Sui
  - School of Information Science and Engineering, University of Jinan, Jinan, China
- Jiazi Chen
  - Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, Fukuoka, Japan
- Yuehui Chen
  - School of Artificial Intelligence Institute and Information Science and Engineering, University of Jinan, Jinan, China
- Naoki Iwamori
  - Laboratory of Zoology, Graduate School of Bioresource and Bioenvironmental Sciences, Kyushu University, Fukuoka-shi, Fukuoka, Japan
- Jin Sun
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
5
Sheng D, Jin C, Yue K, Yue M, Liang Y, Xue X, Li P, Zhao G, Zhang L. Pan-cancer atlas of tumor-resident microbiome, immunity and prognosis. Cancer Lett 2024; 598:217077. [PMID: 38908541 DOI: 10.1016/j.canlet.2024.217077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 05/23/2024] [Accepted: 06/14/2024] [Indexed: 06/24/2024]
Abstract
The existence of a microbiome in human tumors has been widely established, but an evaluation of the contribution of intratumoral bacteria and fungi to tumor immunity and prognosis from a pan-cancer perspective is still lacking. We designed an improved microbial analysis pipeline to reduce interference from host sequences, complemented with integration analysis of intratumoral microbiota at species level with clinical indicators, tumor microenvironment, and prognosis across cancer types. We found that intratumoral microbiota is associated with immunophenotyping, with high-immunity subtypes showing greater bacterial and fungal richness compared to low-immunity groups. We also noted that the combination of fungi and bacteria demonstrated promising prognostic value across cancer types. We thus present The Cancer Microbiota (TCMbio), an interactive platform that provides the intratumoral bacteria and fungi data, and a comprehensive analysis module for 33 types of cancers. This led to the discovery of clinical and prognostic significance of intratumoral microbes.
Affiliation(s)
- Dashuang Sheng
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China; State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China
- Chuandi Jin
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Kaile Yue
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Min Yue
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Yijia Liang
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Xinxin Xue
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Pingfu Li
  - Shandong Huxley Medical Technology Co., Ltd., Jinan, China
- Guoping Zhao
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China; State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China; CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Lei Zhang
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China; State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China
6
Su Z, Wu Y, Cao K, Du J, Cao L, Wu Z, Wu X, Wang X, Song Y, Wang X, Duan H. APEX-pHLA: A novel method for accurate prediction of the binding between exogenous short peptides and HLA class I molecules. Methods 2024; 228:38-47. [PMID: 38772499 DOI: 10.1016/j.ymeth.2024.05.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Revised: 04/28/2024] [Accepted: 05/18/2024] [Indexed: 05/23/2024] Open
Abstract
Human leukocyte antigen (HLA) molecules play a critically significant role in immunotherapy due to their capacity to recognize and bind exogenous antigens such as peptides, subsequently delivering them to immune cells. Predicting the binding between peptides and HLA molecules (pHLA) can expedite the screening of immunogenic peptides and facilitate vaccine design. However, traditional experimental methods are time-consuming and inefficient. In this study, an efficient method based on deep learning was developed for predicting peptide-HLA binding, which treated peptide sequences as linguistic entities. It combined the architectures of textCNN and BiLSTM to create a deep neural network model called APEX-pHLA. This model operated without limitations related to HLA class I allele variants and peptide segment lengths, enabling efficient encoding of sequence features for both HLA and peptide segments. On the independent test set, the model achieved Accuracy, ROC_AUC, F1, and MCC of 0.9449, 0.9850, 0.9453, and 0.8899, respectively. Similarly, on an external test set, the results were 0.9803, 0.9574, 0.8835, and 0.7863, respectively. These results outperformed fifteen methods previously reported in the literature. The accurate prediction capability of the APEX-pHLA model in peptide-HLA binding might provide valuable insights for future HLA vaccine design.
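Three of the four metrics reported in this abstract (Accuracy, F1, and MCC) follow directly from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; `binary_metrics` is a hypothetical helper name and the counts in the test are made-up numbers, not data from the paper. ROC_AUC is omitted because it requires the ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1 and Matthews correlation coefficient from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is conventionally set to 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}
```

Because accuracy counts true negatives while F1 does not, the two can diverge when classes are imbalanced, which is one possible reason a test set can show high accuracy alongside a noticeably lower F1 and MCC.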
Affiliation(s)
- Zhihao Su
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Yejian Wu
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Kaiqiang Cao
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Jie Du
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Lujing Cao
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Zhipeng Wu
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Xinyi Wu
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Xinqiao Wang
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Ying Song
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Xudong Wang
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
- Hongliang Duan
  - Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
7
Machaca V, Goyzueta V, Cruz MG, Sejje E, Pilco LM, López J, Túpac Y. Transformers meets neoantigen detection: a systematic literature review. J Integr Bioinform 2024; 21:jib-2023-0043. [PMID: 38960869 PMCID: PMC11377031 DOI: 10.1515/jib-2023-0043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2023] [Accepted: 03/20/2024] [Indexed: 07/05/2024] Open
Abstract
Cancer immunology offers a new alternative to traditional cancer treatments, such as radiotherapy and chemotherapy. One notable alternative is the development of personalized vaccines based on cancer neoantigens. Moreover, Transformers are considered a revolutionary development in artificial intelligence with a significant impact on natural language processing (NLP) tasks and have been utilized in proteomics studies in recent years. In this context, we conducted a systematic literature review to investigate how Transformers are applied in each stage of the neoantigen detection process. Additionally, we mapped current pipelines and examined the results of clinical trials involving cancer vaccines.
Affiliation(s)
- Erika Sejje
  - Universidad Nacional de San Agustín, Arequipa, Perú
- Yván Túpac
  - Universidad Católica San Pablo, Arequipa, Perú
8
Bulashevska A, Nacsa Z, Lang F, Braun M, Machyna M, Diken M, Childs L, König R. Artificial intelligence and neoantigens: paving the path for precision cancer immunotherapy. Front Immunol 2024; 15:1394003. [PMID: 38868767 PMCID: PMC11167095 DOI: 10.3389/fimmu.2024.1394003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 05/13/2024] [Indexed: 06/14/2024] Open
Abstract
Cancer immunotherapy has witnessed rapid advancement in recent years, with a particular focus on neoantigens as promising targets for personalized treatments. The convergence of immunogenomics, bioinformatics, and artificial intelligence (AI) has propelled the development of innovative neoantigen discovery tools and pipelines. These tools have revolutionized our ability to identify tumor-specific antigens, providing the foundation for precision cancer immunotherapy. AI-driven algorithms can process extensive amounts of data, identify patterns, and make predictions that were once challenging to achieve. However, the integration of AI comes with its own set of challenges, leaving space for further research. With particular focus on the computational approaches, in this article we have explored the current landscape of neoantigen prediction, the fundamental concepts behind it, and the challenges and their potential solutions, providing a comprehensive overview of this rapidly evolving field.
Affiliation(s)
- Alla Bulashevska
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Zsófia Nacsa
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Franziska Lang
  - TRON - Translational Oncology at the University Medical Center of the Johannes Gutenberg University gGmbH, Mainz, Germany
- Markus Braun
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Martin Machyna
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Mustafa Diken
  - TRON - Translational Oncology at the University Medical Center of the Johannes Gutenberg University gGmbH, Mainz, Germany
- Liam Childs
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Renate König
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
9
Liu M, Wu T, Li X, Zhu Y, Chen S, Huang J, Zhou F, Liu H. ACPPfel: Explainable deep ensemble learning for anticancer peptides prediction based on feature optimization. Front Genet 2024; 15:1352504. [PMID: 38487252 PMCID: PMC10937565 DOI: 10.3389/fgene.2024.1352504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Accepted: 02/19/2024] [Indexed: 03/17/2024] Open
Abstract
Background: Cancer is a significant global health problem that continues to cause a high number of deaths worldwide. Traditional cancer treatments often come with risks that can compromise the functionality of vital organs. As a potential alternative to these conventional therapies, anticancer peptides (ACPs) have garnered attention for their small size, high specificity, and reduced toxicity, making them a promising option for cancer treatment. Methods: However, the process of identifying effective ACPs through wet-lab screening experiments is time-consuming and labor-intensive. To overcome this challenge, a deep ensemble learning method is constructed to predict ACPs in this study. To evaluate the reliability of the framework, four different datasets are used for training and testing. During the training process of the model, integration of feature selection methods, feature dimensionality reduction measures, and optimization of the deep ensemble model are carried out. Finally, we explored the interpretability of features that affected the final prediction results and built a web server platform to facilitate anticancer peptide prediction, which can be used by all researchers for further studies. This web server can be accessed at http://lmylab.online:5001/. Results: The method achieves an accuracy of 98.53% and an AUC (area under the curve) of 0.9972 on the ACPfel dataset, and it also shows improvements on other datasets.
Affiliation(s)
- Mingyou Liu
  - School of Biology and Engineering (School of Health Medicine Modern Industry), Guizhou Medical University, Guiyang, China
  - Engineering Research Center of Health Medicine Biotechnology of Guizhou Province, Guizhou Medical University, Guiyang, China
- Tao Wu
  - School of Biology and Engineering (School of Health Medicine Modern Industry), Guizhou Medical University, Guiyang, China
- Xue Li
  - School of Biology and Engineering (School of Health Medicine Modern Industry), Guizhou Medical University, Guiyang, China
  - Engineering Research Center of Health Medicine Biotechnology of Guizhou Province, Guizhou Medical University, Guiyang, China
- Yingxue Zhu
  - School of Biology and Engineering (School of Health Medicine Modern Industry), Guizhou Medical University, Guiyang, China
  - Engineering Research Center of Health Medicine Biotechnology of Guizhou Province, Guizhou Medical University, Guiyang, China
- Sen Chen
  - School of Biology and Engineering (School of Health Medicine Modern Industry), Guizhou Medical University, Guiyang, China
- Jian Huang
  - School of Life Science and Technology, University of Electronic Science and Technology, Chengdu, China
  - School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
- Fengfeng Zhou
  - School of Biology and Engineering (School of Health Medicine Modern Industry), Guizhou Medical University, Guiyang, China
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
- Hongmei Liu
  - School of Biology and Engineering (School of Health Medicine Modern Industry), Guizhou Medical University, Guiyang, China
  - Engineering Research Center of Health Medicine Biotechnology of Guizhou Province, Guizhou Medical University, Guiyang, China
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
10
Conev A, Fasoulis R, Hall-Swan S, Ferreira R, Kavraki LE. HLAEquity: Examining biases in pan-allele peptide-HLA binding predictors. iScience 2024; 27:108613. [PMID: 38188519 PMCID: PMC10770483 DOI: 10.1016/j.isci.2023.108613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 11/13/2023] [Accepted: 11/29/2023] [Indexed: 01/09/2024] Open
Abstract
Peptide-HLA (pHLA) binding prediction is essential in screening peptide candidates for personalized peptide vaccines. Machine learning (ML) pHLA binding prediction tools are trained on vast amounts of data and are effective in screening peptide candidates. Most ML models report the ability to generalize to HLA alleles unseen during training ("pan-allele" models). However, the use of datasets with imbalanced allele content raises concerns about biased model performance. First, we examine the data bias of two ML-based pan-allele pHLA binding predictors. We find that the pHLA datasets overrepresent alleles from geographic populations of high-income countries. Second, we show that the identified data bias is perpetuated within ML models, leading to algorithmic bias and subpar performance for alleles expressed in low-income geographic populations. We draw attention to the potential therapeutic consequences of this bias, and we challenge the use of the term "pan-allele" to describe models trained with currently available public datasets.
Affiliation(s)
- Anja Conev
  - Department of Computer Science, Rice University, Houston, TX, USA
- Romanos Fasoulis
  - Department of Computer Science, Rice University, Houston, TX, USA
- Sarah Hall-Swan
  - Department of Computer Science, Rice University, Houston, TX, USA
- Rodrigo Ferreira
  - Department of Computer Science, Rice University, Houston, TX, USA
- Lydia E. Kavraki
  - Department of Computer Science, Rice University, Houston, TX, USA
11
Zeng X, Meng FF, Li X, Zhong KY, Jiang B, Li Y. GHGPR-PPIS: A graph convolutional network for identifying protein-protein interaction site using heat kernel with Generalized PageRank techniques and edge self-attention feature processing block. Comput Biol Med 2024; 168:107683. [PMID: 37984202 DOI: 10.1016/j.compbiomed.2023.107683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Revised: 10/10/2023] [Accepted: 11/06/2023] [Indexed: 11/22/2023]
Abstract
Accurately pinpointing protein-protein interaction sites (PPIS) at the molecular level is of utmost significance for annotating protein function and comprehending the mechanisms underpinning various diseases. Numerous computational methods for predicting PPIS have emerged and have mitigated the labor and time constraints associated with traditional experimental methods. However, the predictive accuracy of these methods has yet to reach the desired threshold. In this context, we proposed a graph-based computational model called GHGPR-PPIS. This model leveraged a graph convolutional network using heat kernel (GraphHeat) in conjunction with Generalized PageRank techniques (GHGPR) to predict PPIS. Additionally, building upon the GHGPR framework, we devised an edge self-attention feature processing block, further augmenting the performance of the model. Experimental findings demonstrated that GHGPR-PPIS surpassed all competing state-of-the-art models when evaluated on the benchmark test set. On two distinct independent test sets and a specific protein chain, GHGPR-PPIS consistently demonstrated superior generalization performance and practical applicability compared to the comparative model, AGAT-PPIS. Lastly, leveraging the t-SNE dimensionality reduction algorithm and clustering visualization technique, we conducted an interpretability analysis of the effectiveness of GHGPR-PPIS by comparing the outputs from different stages of the model.
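One common way to combine a heat kernel with Generalized PageRank is to propagate node features over the normalized adjacency matrix for several hops and weight hop k by the heat-kernel coefficient e^(-t) t^k / k!. The sketch below illustrates that general idea in plain Python; it is an assumption about the flavor of propagation involved, not the GHGPR layer from the paper, and `heat_kernel_gpr`, `t`, and `steps` are hypothetical names.

```python
import math


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with added self-loops."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def heat_kernel_gpr(adj, features, t=1.0, steps=10):
    """Return sum_k w_k * A_hat^k * X with w_k = exp(-t) * t**k / k!."""
    a_hat = normalize_adjacency(adj)
    hop = [row[:] for row in features]  # A_hat^0 * X
    out = [[0.0] * len(features[0]) for _ in features]
    for k in range(steps + 1):
        weight = math.exp(-t) * t ** k / math.factorial(k)
        for i, row in enumerate(hop):
            for j, value in enumerate(row):
                out[i][j] += weight * value
        hop = matmul(a_hat, hop)  # advance one more hop
    return out
```

Larger `t` shifts weight toward distant hops, so each residue's representation mixes in more of its extended neighborhood; in Generalized PageRank models the hop weights are typically learned rather than fixed, which this sketch does not attempt.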
Affiliation(s)
- Xin Zeng
  - College of Mathematics and Computer Science, Dali University, Dali, 671003, China
- Fan-Fang Meng
  - College of Mathematics and Computer Science, Dali University, Dali, 671003, China
- Xin Li
  - College of Mathematics and Computer Science, Dali University, Dali, 671003, China
- Kai-Yang Zhong
  - College of Mathematics and Computer Science, Dali University, Dali, 671003, China
- Bei Jiang
  - Yunnan Key Laboratory of Screening and Research on Anti-pathogenic Plant Resources from Western Yunnan, Dali University, Dali, 671000, China
- Yi Li
  - College of Mathematics and Computer Science, Dali University, Dali, 671003, China
12
Zeng X, Zhong KY, Jiang B, Li Y. Fusing Sequence and Structural Knowledge by Heterogeneous Models to Accurately and Interpretively Predict Drug-Target Affinity. Molecules 2023; 28:8005. [PMID: 38138496 PMCID: PMC10745601 DOI: 10.3390/molecules28248005] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 10/21/2023] [Revised: 12/06/2023] [Accepted: 12/06/2023] [Indexed: 12/24/2023] Open
Abstract
Drug-target affinity (DTA) prediction is crucial for understanding molecular interactions and aiding drug discovery and development. While various computational methods have been proposed for DTA prediction, their predictive accuracy remains limited, and they fail to capture the structural nuances of interactions. With increasingly accurate and accessible structure prediction of targets, we developed a novel deep learning model, named S2DTA, to accurately predict DTA by fusing sequence features of drug SMILES, targets, and pockets and their corresponding graph structural features using heterogeneous models based on graph and semantic networks. Experiments showed that more complex feature representations yielded negligible gains in model performance, whereas integrating heterogeneous models clearly improved predictive accuracy. Compared with three state-of-the-art methods (DeepDTA, GraphDTA, and DeepDTAF), S2DTA exhibited a 25.2% reduction in mean absolute error (MAE) and a 20.1% decrease in root mean square error (RMSE). S2DTA also improved other key metrics, namely the Pearson Correlation Coefficient (PCC), Spearman correlation, Concordance Index (CI), and R2, which increased by 19.6%, 17.5%, 8.1%, and 49.4%, respectively. Finally, we conducted an interpretability analysis of the effectiveness of S2DTA using a bidirectional self-attention mechanism. The analysis results supported that S2DTA was an effective and accurate tool for predicting DTA.
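The error and ranking metrics named in this abstract can be illustrated with a minimal sketch. It assumes plain Python lists of measured and predicted affinities; all function names are hypothetical, and this is not the S2DTA evaluation code, only the textbook definitions of MAE, RMSE, PCC, and the concordance index, together with the relative-change formula behind statements such as "a 25.2% reduction".

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(y_true, y_pred):
    """Pearson correlation coefficient (PCC)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs ranked in the same order.

    Pairs with equal true values are skipped; tied predictions count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            diff_true = y_true[i] - y_true[j]
            diff_pred = y_pred[i] - y_pred[j]
            if diff_pred == 0:
                concordant += 0.5
            elif (diff_true > 0) == (diff_pred > 0):
                concordant += 1.0
    return concordant / comparable


def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline
```

For example, a hypothetical baseline MAE of 1.000 and a new MAE of 0.748 would give `relative_reduction(1.0, 0.748)` of about 25.2; these two values are illustrative only and are not taken from the paper.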
Affiliation(s)
- Xin Zeng
  - College of Mathematics and Computer Science, Dali University, Dali 671003, China
- Kai-Yang Zhong
  - College of Mathematics and Computer Science, Dali University, Dali 671003, China
- Bei Jiang
  - Yunnan Key Laboratory of Screening and Research on Anti-Pathogenic Plant Resources from Western Yunnan, Dali University, Dali 671000, China
- Yi Li
  - College of Mathematics and Computer Science, Dali University, Dali 671003, China
|
13
|
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan
|
14
|
Huang G, Tang X, Zheng P. DeepHLAPred: a deep learning-based method for non-classical HLA binder prediction. BMC Genomics 2023; 24:706. [PMID: 37993812 PMCID: PMC10666343 DOI: 10.1186/s12864-023-09796-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Accepted: 11/08/2023] [Indexed: 11/24/2023] Open
Abstract
Human leukocyte antigen (HLA) is closely involved in regulating the human immune system. Despite great advances in detecting classical HLA Class I binders, there are few methods or toolkits for recognizing non-classical HLA Class I binders. To fill this gap, we have developed a deep learning-based tool called DeepHLAPred. DeepHLAPred used the electron-ion interaction pseudopotential, integer numerical mapping, and accumulated amino acid frequency as the initial representation of non-classical HLA binder sequences. A deep learning module was then used to refine these into high-level representations. The module comprised two parallel convolutional neural networks, each followed by a maximum pooling layer, a dropout layer, and a bi-directional long short-term memory network. The experimental results showed that DeepHLAPred reached state-of-the-art performance on the cross-validation test and the independent test. Extensive tests demonstrated the rationality of DeepHLAPred. We further analyzed the sequence pattern of non-classical HLA Class I binders by information entropy, which reflected the sequence pattern to a certain extent. In addition, we have developed a user-friendly webserver for convenient use, which is available at http://www.biolscience.cn/DeepHLApred/ . The tool and the analysis are helpful for detecting non-classical HLA Class I binders. The source code and data are available at https://github.com/tangxingyu0/DeepHLApred .
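Two of the initial sequence representations named in this abstract can be sketched briefly. The sketch rests on assumptions: the alphabet ordering used for integer mapping is arbitrary, and the accumulated frequency follows the common definition in which position i receives the fraction of the first i residues that equal residue i. The function names are hypothetical, and none of this is taken from the DeepHLAPred source code.

```python
# Assumed ordering of the 20 standard amino acids; the ordering used by the
# original tool is not given in the abstract.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: idx + 1 for idx, aa in enumerate(ALPHABET)}


def integer_map(seq):
    """Map each residue to a 1-based integer; unknown residues map to 0."""
    return [AA_TO_INT.get(aa, 0) for aa in seq]


def accumulated_frequency(seq):
    """For each position i (1-based), return the fraction of the first i
    residues that are identical to the residue at position i."""
    counts = {}
    encoded = []
    for i, aa in enumerate(seq, start=1):
        counts[aa] = counts.get(aa, 0) + 1
        encoded.append(counts[aa] / i)
    return encoded
```

Both encodings yield one number per residue, so they can be stacked as channels of equal length before being passed to a convolutional layer.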
Affiliation(s)
- Guohua Huang
  - School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, Hunan, 410215, China
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Xingyu Tang
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Peijie Zheng
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
|
15
|
Wang GA, Yan X, Li X, Liu Y, Xia J, Zhu X. MSTL-Kace: Prediction of Prokaryotic Lysine Acetylation Sites Based on Multistage Transfer Learning Strategy. ACS Omega 2023; 8:41930-41942. [PMID: 37969991 PMCID: PMC10634282 DOI: 10.1021/acsomega.3c07086] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/16/2023] [Revised: 10/11/2023] [Accepted: 10/13/2023] [Indexed: 11/17/2023]
Abstract
As one of the most important post-translational modifications (PTMs), lysine acetylation (Kace) plays an important role in various biological activities. Traditional experimental methods for identifying Kace sites are inefficient and expensive. Instead, several machine learning methods have been developed for Kace site prediction, and hand-crafted features have been used to encode the protein sequences. However, two challenges remain: the complex biological information may be under-represented by these hand-crafted features, and the small-sample issue of some species needs to be addressed. We propose a novel model, MSTL-Kace, which was developed based on a transfer learning strategy with a pretrained bidirectional encoder representations from transformers (BERT) model. In this model, the high-level embeddings were extracted from species-specific BERT models, and a two-stage fine-tuning strategy was used to deal with the small-sample issue. Specifically, a domain-specific BERT model was pretrained using all of the sequences in our data sets and was then fine-tuned, or two-stage fine-tuned, on the training data set of each species to obtain the species-specific BERT models. Afterward, the embeddings of residues were extracted from the fine-tuned model and fed to different downstream learning algorithms. After comparison, the best model for the six prokaryotic species was built by using a random forest. The results for the independent test sets show that our model outperforms the state-of-the-art methods on all six species. The source codes and data for MSTL-Kace are available at https://github.com/leo97king/MSTL-Kace.
Affiliation(s)
- Gang-Ao Wang
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Xiaodi Yan
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Xiang Li
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Yinbo Liu
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
- Junfeng Xia
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, Anhui, China
- Xiaolei Zhu
  - School of Sciences, Anhui Agricultural University, Hefei 230036, Anhui, China
|
16
|
Wang Y, Wang Z, Liu Y, Yu Q, Liu Y, Luo C, Wang S, Liu H, Liu M, Zhang G, Fan Y, Li K, Huang L, Duan M, Zhou F. Reconstructing the cytokine view for the multi-view prediction of COVID-19 mortality. BMC Infect Dis 2023; 23:622. [PMID: 37735372 PMCID: PMC10514938 DOI: 10.1186/s12879-023-08291-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2022] [Accepted: 04/28/2023] [Indexed: 09/23/2023] Open
Abstract
BACKGROUND: Coronavirus disease 2019 (COVID-19) is a rapidly developing and sometimes lethal pulmonary disease. Accurately predicting COVID-19 mortality will facilitate optimal patient treatment and medical resource deployment, but this remains an unmet need in clinical practice. Both complete blood counts and cytokine levels were observed to be modified by COVID-19 infection. This study aimed to use inexpensive and easily accessible complete blood counts to build an accurate COVID-19 mortality prediction model. The cytokine fluctuations reflect the inflammatory storm induced by COVID-19, but their levels are not as commonly accessible as complete blood counts. Therefore, this study explored the possibility of predicting cytokine levels based on complete blood counts. METHODS: We used complete blood counts to predict cytokine levels. The predictive model includes an autoencoder, principal component analysis, and linear regression models. We used classifiers such as support vector machine and feature selection models such as adaptive boost to predict the mortality of COVID-19 patients. RESULTS: Complete blood counts and original cytokine levels reached COVID-19 mortality classification area under the curve (AUC) values of 0.9678 and 0.9111, respectively, and the cytokine levels predicted by the feature set alone reached a classification AUC value of 0.9844. The predicted cytokine levels were more significantly associated with COVID-19 mortality than the original values. CONCLUSIONS: Integrating the predicted cytokine levels with complete blood counts improved on a COVID-19 mortality prediction model that used complete blood counts only. Both the cytokine level prediction models and the COVID-19 mortality prediction models are publicly available at http://www.healthinformaticslab.org/supp/resources.php .
Affiliation(s)
- Yueying Wang
  - College of Computer Science and Technology, Jilin University, 130012, Changchun, China
  - School of Biology and Engineering, Guizhou Medical University, 550025, Guiyang, Guizhou, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 130012, Changchun, China
  - Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, 130021, Changchun, Jilin Province, China
- Zhao Wang
  - College of Software, Jilin University, 130012, Changchun, China
- Yaqing Liu
  - College of Computer Science and Technology, Jilin University, 130012, Changchun, China
- Qiong Yu
  - Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, 130021, Changchun, Jilin Province, China
- Yujia Liu
  - College of Software, Jilin University, 130012, Changchun, China
- Changfan Luo
  - College of Software, Jilin University, 130012, Changchun, China
- Siyang Wang
  - College of Software, Jilin University, 130012, Changchun, China
- Hongmei Liu
  - School of Biology and Engineering, Guizhou Medical University, 550025, Guiyang, Guizhou, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 130012, Changchun, China
  - Engineering Research Center of Medical Biotechnology, Guizhou Medical University, 550025, Guiyang, Guizhou, China
- Mingyou Liu
  - School of Biology and Engineering, Guizhou Medical University, 550025, Guiyang, Guizhou, China
- Gongyou Zhang
  - School of Biology and Engineering, Guizhou Medical University, 550025, Guiyang, Guizhou, China
- Yusi Fan
  - College of Software, Jilin University, 130012, Changchun, China
- Kewei Li
  - College of Computer Science and Technology, Jilin University, 130012, Changchun, China
  - School of Biology and Engineering, Guizhou Medical University, 550025, Guiyang, Guizhou, China
- Lan Huang
  - College of Computer Science and Technology, Jilin University, 130012, Changchun, China
  - School of Biology and Engineering, Guizhou Medical University, 550025, Guiyang, Guizhou, China
- Meiyu Duan
  - College of Computer Science and Technology, Jilin University, 130012, Changchun, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 130012, Changchun, China
- Fengfeng Zhou
  - College of Computer Science and Technology, Jilin University, 130012, Changchun, China
  - School of Biology and Engineering, Guizhou Medical University, 550025, Guiyang, Guizhou, China
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 130012, Changchun, China
|
17
|
Li F, Liu S, Li K, Zhang Y, Duan M, Yao Z, Zhu G, Guo Y, Wang Y, Huang L, Zhou F. EpiTEAmDNA: Sequence feature representation via transfer learning and ensemble learning for identifying multiple DNA epigenetic modification types across species. Comput Biol Med 2023; 160:107030. [PMID: 37196456 DOI: 10.1016/j.compbiomed.2023.107030] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/02/2023] [Revised: 04/21/2023] [Accepted: 05/10/2023] [Indexed: 05/19/2023]
Abstract
Methylation is a major DNA epigenetic modification that regulates biological processes without altering the DNA sequence, and multiple types of DNA methylation have been discovered, including 6mA, 5hmC, and 4mC. Multiple computational approaches have been developed to automatically identify DNA methylation residues using machine learning or deep learning algorithms. Machine learning (ML) based methods are difficult to transfer to other DNA methylation site prediction tasks using additional knowledge. Deep learning (DL) methods may facilitate the transfer learning of knowledge from similar tasks, but they are often ineffective on small datasets. This study proposes an integrated feature representation framework, EpiTEAmDNA, based on the strategies of transfer learning and ensemble learning, which is evaluated on multiple DNA methylation types across 15 species. EpiTEAmDNA integrates a convolutional neural network (CNN) with conventional machine learning methods and outperforms the existing DL-based methods on small datasets when no additional knowledge is available. The experimental data suggest that the EpiTEAmDNA models may be further improved via transfer learning based on additional knowledge. The evaluation experiments on the independent test datasets also suggest that the proposed EpiTEAmDNA framework outperforms the existing models in most prediction tasks of the 3 DNA methylation types across 15 species. The source code, pre-trained global model, and the EpiTEAmDNA feature representation framework are freely available at http://www.healthinformaticslab.org/supp/.
Affiliation(s)
- Fei Li
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Shuai Liu
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Kewei Li
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Yaqi Zhang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Meiyu Duan
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Zhaomin Yao
  - College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
- Gancheng Zhu
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Yutong Guo
  - College of Life Sciences, Jilin University, Changchun, Jilin, 130012, China
- Ying Wang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Lan Huang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
- Fengfeng Zhou
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
|
18
|
Kalemati M, Darvishi S, Koohi S. CapsNet-MHC predicts peptide-MHC class I binding based on capsule neural networks. Commun Biol 2023; 6:492. [PMID: 37147498 PMCID: PMC10162658 DOI: 10.1038/s42003-023-04867-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 11/02/2022] [Accepted: 04/24/2023] [Indexed: 05/07/2023] Open
Abstract
The Major Histocompatibility Complex (MHC) binds peptides derived from pathogens and presents them to killer T cells on the cell surface. Developing computational methods for accurate, fast, and explainable peptide-MHC binding prediction can facilitate immunotherapies and vaccine development. Various deep learning-based methods rely on separate feature extraction from the peptide and MHC sequences and ignore their pairwise binding information. This paper develops a capsule neural network-based method that efficiently captures the features of the peptide-MHC complex to predict peptide-MHC class I binding. Various evaluations confirmed that our method outperforms the alternative methods and provides accurate predictions even when less data are available. Moreover, to provide precise insights into the results, we explored the essential features that contributed to the prediction. Since the simulation results were consistent with the experimental studies, we concluded that our method can be used for accurate, rapid, and interpretable peptide-MHC binding prediction to assist biological therapies.
Affiliation(s)
- Mahmood Kalemati
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Saeid Darvishi
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Somayyeh Koohi
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
|
19
|
iUP-BERT: Identification of Umami Peptides Based on BERT Features. Foods 2022; 11:3742. [PMID: 36429332 PMCID: PMC9689418 DOI: 10.3390/foods11223742] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/23/2022] [Revised: 11/14/2022] [Accepted: 11/16/2022] [Indexed: 11/23/2022] Open
Abstract
Umami is an important, widely used taste component of food seasoning. Umami peptides are specific structural peptides that endow foods with a favorable umami taste. Laboratory approaches used to identify umami peptides are time-consuming and labor-intensive, making them unsuitable for rapid screening. Here, we developed a novel peptide sequence-based umami peptide predictor, namely iUP-BERT, which was based on a deep learning pretrained neural network feature extraction method. After optimization, a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformers) in conjunction with the synthetic minority over-sampling technique (SMOTE) and support vector machine (SVM) methods was adopted for model creation to generate predicted probabilistic scores of potential umami peptides. Extensive empirical experiments on cross-validation and an independent test showed that iUP-BERT outperformed the existing methods, highlighting its effectiveness and robustness. Finally, an open-access iUP-BERT web server was built. To our knowledge, this is the first efficient sequence-based umami predictor created based on a single deep-learning pretrained neural network feature extraction method. By predicting umami peptides, iUP-BERT can support further research to improve the palatability of dietary supplements.
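The oversampling step named in this abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name, parameters, and toy data are hypothetical, and this is not the iUP-BERT implementation; in practice a maintained library such as imbalanced-learn would normally be used, with the resampled features then passed to an SVM.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolation.

    minority: list of feature vectors (lists of floats), at least two.
    k: number of nearest minority neighbours considered per sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class, which is the property that distinguishes SMOTE from simply duplicating samples.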
|
20
|
Dong B, Li M, Jiang B, Gao B, Li D, Zhang T. Antimicrobial Peptides Prediction method based on sequence multidimensional feature embedding. Front Genet 2022; 13:1069558. [PMID: 36468005 PMCID: PMC9714691 DOI: 10.3389/fgene.2022.1069558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 11/02/2022] [Indexed: 09/10/2024] Open
Abstract
Antimicrobial peptides (AMPs) are alkaline substances with efficient bactericidal activity produced in living organisms. As the best substitute for antibiotics, they have received increasing attention in scientific research and clinical application. AMPs can be produced by almost all organisms and are capable of killing a wide variety of pathogenic microorganisms. In addition to being antibacterial, natural AMPs have many other therapeutically important activities, such as wound healing, antioxidant, and immunomodulatory effects. Discovering new AMPs with wet-lab experimental methods is expensive and difficult, and bioinformatics technology can effectively address this problem. Recently, some deep learning methods have been applied to the prediction of AMPs and achieved good results. To further improve the prediction accuracy of AMPs, this paper designs a new deep learning method based on sequence multidimensional representation. By encoding and embedding sequence features and then feeding them into the model to identify AMPs, high-precision classification of AMPs and non-AMPs with lengths of 10-200 is achieved. The results show that our method improved accuracy by 1.05% compared to the most advanced model in independent data validation without decreasing other indicators.
Affiliation(s)
- Benzhi Dong
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Mengna Li
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Bei Jiang
  - Tianjin Second People's Hospital, Tianjin Institute of Hepatology, Tianjin, China
- Bo Gao
  - Department of Radiology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Dan Li
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|
21
|
Khare E, Gonzalez-Obeso C, Kaplan DL, Buehler MJ. CollagenTransformer: End-to-End Transformer Model to Predict Thermal Stability of Collagen Triple Helices Using an NLP Approach. ACS Biomater Sci Eng 2022; 8:4301-4310. [PMID: 36149671 DOI: 10.1021/acsbiomaterials.2c00737] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 11/28/2022]
Abstract
Collagen is one of the most important structural proteins in biology, and its structural hierarchy plays a crucial role in many mechanically important biomaterials. Here, we demonstrate how transformer models can be used to predict, directly from the primary amino acid sequence, the thermal stability of collagen triple helices, measured via the melting temperature Tm. We report two distinct transformer architectures to compare performance. First, we train a small transformer model from scratch, using our collagen data set featuring only 633 sequence-to-Tm pairings. Second, we use a large pretrained transformer model, ProtBERT, and fine-tune it for a particular downstream task by utilizing sequence-to-Tm pairings, using a deep convolutional network to translate natural language processing BERT embeddings into required features. Both the small transformer model and the fine-tuned ProtBERT model have similar R2 values of test data (R2 = 0.84 vs 0.79, respectively), but the ProtBERT is a much larger pretrained model that may not always be applicable for other biological or biomaterials questions. Specifically, we show that the small transformer model requires only 0.026% of the number of parameters compared to the much larger model but reaches almost the same accuracy for the test set. We compare the performance of both models against 71 newly published sequences for which Tm has been obtained as a validation set and find reasonable agreement, with ProtBERT outperforming the small transformer model. The results presented here are, to our best knowledge, the first demonstration of the use of transformer models for relatively small data sets and for the prediction of specific biophysical properties of interest. We anticipate that the work presented here serves as a starting point for transformer models to be applied to other biophysical problems.
Affiliation(s)
- Eesha Khare
  - Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
  - Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
- David L Kaplan
  - Tufts University, Medford, Massachusetts 02155, United States
- Markus J Buehler
  - Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
  - Center for Computational Science and Engineering, Schwarzman College of Computing, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
|