1
Su L, Ma Z, Ji H, Kong J, Yan W, Zhang Q, Li J, Zuo M. From prediction to design: Revealing the mechanisms of umami peptides using interpretable deep learning, quantum chemical simulations, and module substitution. Food Chem 2025; 483:144301. [PMID: 40233511 DOI: 10.1016/j.foodchem.2025.144301] [Received: 01/07/2025] [Revised: 03/24/2025] [Accepted: 04/08/2025] [Indexed: 04/17/2025]
Abstract
This study screened and designed umami peptides using a deep learning model and a module substitution strategy. The predictive model, which integrates pre-training, enhanced-feature, and contrastive learning modules, achieved an accuracy of 0.94, outperforming other models by 2-9%. Umami peptides were identified through virtual hydrolysis, model predictions, and sensory evaluation. Peptides EN, ETR, GK4, RK5, ER6, EF7, IL8, VR9, DL10, and PK14 demonstrated umami taste and exhibited umami-enhancing effects with MSG. The module substitution strategy, in which highly contributive modules from umami peptides replace the corresponding modules in bitter peptides, facilitates peptide design and modification. The mechanisms underlying module substitution and taste presentation were elucidated via molecular docking and active site analysis, revealing that substituted peptides form more hydrogen bonds and hydrophobic interactions with T1R1/T1R3. Amino acids D, E, Q, K, and R were critical for umami taste. This study provides an efficient tool for rapid umami peptide screening and expands the repository.
Affiliation(s)
- Lijun Su
- National Engineering Research Center for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China; School of Food and Health, Beijing Technology and Business University, Beijing 100048, China
- Zhenren Ma
- National Engineering Research Center for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China
- Huizhuo Ji
- National Engineering Research Center for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China; School of Food and Health, Beijing Technology and Business University, Beijing 100048, China
- Jianlei Kong
- National Engineering Research Center for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China.
- Wenjing Yan
- National Engineering Research Center for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China
- Qingchuan Zhang
- National Engineering Research Center for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China
- Jian Li
- School of Food and Health, Beijing Technology and Business University, Beijing 100048, China
- Min Zuo
- School of Information, Beijing Wuzi University, Beijing 101126, China.
2
Asim MN, Asif T, Hassan F, Dengel A. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford) 2025; 2025:baaf027. [PMID: 40448683 DOI: 10.1093/database/baaf027] [Received: 09/28/2024] [Revised: 02/06/2025] [Accepted: 03/26/2025] [Indexed: 06/02/2025]
Abstract
Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to transition from wet-lab to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between the two fields, various review articles have been written. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of the wide range of tasks and applications. To address the need for a comprehensive review that presents a holistic view of this wide array of tasks and applications, this manuscript makes the following contributions. It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of the 63 protein sequence analysis tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape encompassing 627 benchmark datasets for the 63 diverse protein sequence analysis tasks. It highlights the use of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by summarizing current state-of-the-art performance across the 63 protein sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Tayyaba Asif
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Faiza Hassan
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Andreas Dengel
- German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
- Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
3
Liang Y, Li M. A deep learning model for prediction of lysine crotonylation sites by fusing multi-features based on multi-head self-attention mechanism. Sci Rep 2025; 15:18940. [PMID: 40442183 PMCID: PMC12122789 DOI: 10.1038/s41598-025-04058-5] [Received: 01/08/2025] [Accepted: 05/23/2025] [Indexed: 06/02/2025]
Abstract
Lysine crotonylation (Kcr) is an important post-translational modification that is present in both histone and non-histone proteins and plays a key role in a variety of biological processes such as metabolism and cell differentiation. Therefore, rapid and accurate identification of this modification has become a key task in studying its biological effects. In the past few years, several computational methods have been developed, but there is room for improvement in prediction performance. In this paper, we propose an effective model named DeepMM-Kcr, which is based on multiple features and an innovative deep learning framework. The features comprise natural language processing features and hand-crafted features: the natural language processing features include token embedding and positional embedding encoded by a transformer, and the hand-crafted features include one-hot, amino acid index, and position-weighted amino acid composition encoded by a bidirectional long short-term memory network. The natural language processing features and hand-crafted features are then fused by a multi-head self-attention mechanism. Finally, a deep learning framework is constructed based on a convolutional neural network, a bidirectional gated recurrent unit, and a multilayer perceptron for robust prediction of Kcr sites. On the independent test set, the accuracy of DeepMM-Kcr is the highest among the existing models. The experimental results show that our model has very good performance in predicting Kcr sites. The source datasets and codes (in Python) are publicly available at https://github.com/yunyunliang88/DeepMM-Kcr.
Affiliation(s)
- Yunyun Liang
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, People's Republic of China.
- Minwei Li
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, People's Republic of China
4
Ji L, Hou W, Zhou H, Xiong L, Liu C, Yuan Z, Li L. EBMGP: a deep learning model for genomic prediction based on Elastic Net feature selection and bidirectional encoder representations from transformer's embedding and multi-head attention pooling. TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2025; 138:103. [PMID: 40253568 PMCID: PMC12009238 DOI: 10.1007/s00122-025-04894-z] [Received: 10/13/2024] [Accepted: 03/27/2025] [Indexed: 04/21/2025]
Abstract
Enhancing early selection through genomic estimated breeding values is pivotal for reducing generation intervals and accelerating breeding programs. Recently, deep learning (DL) approaches have gained prominence in genomic prediction (GP). Here, we introduce a novel DL framework for GP based on Elastic Net feature selection and bidirectional encoder representations from transformer's embedding and multi-head attention pooling (EBMGP). EBMGP applies Elastic Net for the selection of features, thereby diminishing the computational burden and bolstering the predictive accuracy. In EBMGP, SNPs are treated as "words," and groups of adjacent SNPs with similar LD levels are considered "sentences." By applying bidirectional encoder representations from transformers embeddings, this method models SNPs in a manner analogous to human language, capturing complex genetic interactions at both the "word" and "sentence" scales. This flexible representation seamlessly integrates into any DL network and demonstrates a marked improvement in predictive performance for EBMGP and SoyDNGP compared to the widely used one-hot representation. We propose multi-head attention pooling, which can adaptively assign weights to features while learning features from multiple subspaces through multi-heads for a high level of semantic understanding. In a comprehensive comparative analysis across four diverse plant and animal datasets, EBMGP outperformed competing models in 13 out of 16 tasks, achieving accuracy gains ranging from 0.74 to 9.55% over the second-best model. These results underscore EBMGP's robustness in genomic prediction and highlight its potential for deep learning applications in life sciences.
Affiliation(s)
- Lu Ji
- Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Basic Biology Laboratory, Hunan First Normal University, Changsha, 410205, China
- Wei Hou
- College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Heng Zhou
- Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Liwen Xiong
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, Beijing, 100049, China
- Chunhai Liu
- Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Zheming Yuan
- Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China.
- Lanzhi Li
- Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China.
5
Yue Y, Fan H, Zhao J, Xia J. Protein language model-based prediction for plant miRNA encoded peptides. PeerJ Comput Sci 2025; 11:e2733. [PMID: 40134870 PMCID: PMC11935769 DOI: 10.7717/peerj-cs.2733] [Received: 10/18/2024] [Accepted: 02/05/2025] [Indexed: 03/27/2025]
Abstract
Plant miRNA encoded peptides (miPEPs), which are short peptides derived from small open reading frames within primary miRNAs, play a crucial role in regulating diverse plant traits. Identifying plant miPEPs is challenging because few known miPEPs are available for training. Existing prediction methods, including miPEPPred-FRL, rely on manually encoded features to infer plant miPEPs. Recent advances in deep learning modeling of protein sequences provide an opportunity to improve the representation of key features by leveraging large datasets of protein sequences. In this study, we propose an accurate prediction model, called pLM4PEP, which integrates ESM2 peptide embedding with machine learning methods. Our model not only demonstrates precise identification capabilities for plant miPEPs, but also achieves remarkable results across diverse datasets that include other bioactive peptides. The source code and datasets of pLM4PEP are available at https://github.com/xialab-ahu/pLM4PEP.
Affiliation(s)
- Yishan Yue
- College of Mathematics and System Science, Xinjiang University, Urumqi, Xinjiang, China
- Henghui Fan
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui, China
- Jianping Zhao
- College of Mathematics and System Science, Xinjiang University, Urumqi, Xinjiang, China
- Junfeng Xia
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui, China
6
Li F, Bin Y, Zhao J, Zheng C. DeepPD: A Deep Learning Method for Predicting Peptide Detectability Based on Multi-feature Representation and Information Bottleneck. Interdiscip Sci 2025; 17:200-214. [PMID: 39661307 DOI: 10.1007/s12539-024-00665-4] [Received: 01/18/2024] [Revised: 10/07/2024] [Accepted: 10/09/2024] [Indexed: 12/12/2024]
Abstract
Peptide detectability measures the relationship between the protein composition and abundance in the sample and the peptides identified during the analytical procedure. This relationship has significant implications for the fundamental tasks of proteomics. Existing methods primarily rely on a single type of feature representation, which limits their ability to capture the intricate and diverse characteristics of peptides. In response to this limitation, we introduce DeepPD, an innovative deep learning framework incorporating multi-feature representation and the information bottleneck principle (IBP) to predict peptide detectability. DeepPD extracts semantic information from peptides using evolutionary scale modeling 2 (ESM-2) and integrates sequence and evolutionary information to construct the feature space collaboratively. The IBP effectively guides the feature learning process, minimizing redundancy in the feature space. Experimental results across various datasets demonstrate that DeepPD outperforms state-of-the-art methods. Furthermore, we demonstrate that DeepPD exhibits competitive generalization and transfer learning capabilities across diverse datasets and species. In conclusion, DeepPD emerges as the most effective method for predicting peptide detectability, showcasing its potential applicability to other protein sequence prediction tasks.
Affiliation(s)
- Fenglin Li
- College of Mathematics and System Science, Xinjiang University, Urumqi, 830046, China
- Yannan Bin
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Information Materials and Intelligent Sensing Laboratory of Anhui Province, and School of Artificial Intelligence, Anhui University, Hefei, 230601, China
- Jianping Zhao
- College of Mathematics and System Science, Xinjiang University, Urumqi, 830046, China.
- Chunhou Zheng
- College of Mathematics and System Science, Xinjiang University, Urumqi, 830046, China.
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Information Materials and Intelligent Sensing Laboratory of Anhui Province, and School of Artificial Intelligence, Anhui University, Hefei, 230601, China.
7
Clark JD, Mi X, Mitchell DA, Shukla D. Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning. DIGITAL DISCOVERY 2025; 4:343-354. [PMID: 39649639 PMCID: PMC11622008 DOI: 10.1039/d4dd00170b] [Received: 06/20/2024] [Accepted: 11/28/2024] [Indexed: 12/11/2024]
Abstract
Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
Affiliation(s)
- Joseph D Clark
- School of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Xuenan Mi
- Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Douglas A Mitchell
- Department of Biochemistry, Vanderbilt University School of Medicine Nashville TN 37232 USA
- Department of Chemistry, Vanderbilt University Nashville TN 37232 USA
- Diwakar Shukla
- Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Department of Bioengineering, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Department of Chemistry, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
8
Zhao J, Liu H, Kang L, Gao W, Lu Q, Rao Y, Yue Z. deep-AMPpred: A Deep Learning Method for Identifying Antimicrobial Peptides and Their Functional Activities. J Chem Inf Model 2025; 65:997-1008. [PMID: 39792442 DOI: 10.1021/acs.jcim.4c01913] [Indexed: 01/12/2025]
Abstract
Antimicrobial peptides (AMPs) are small peptides that play an important role in disease defense. As the problem of pathogen resistance caused by the misuse of antibiotics intensifies, the identification of AMPs as alternatives to antibiotics has become a hot topic. Accurately identifying AMPs using computational methods has been a key issue in the field of bioinformatics in recent years. Although there are many machine learning-based AMP identification tools, most of them either ignore functional activities or cover only a few of them. Predicting the multiple activities of antimicrobial peptides can help discover candidate peptides with broad-spectrum antimicrobial ability. We propose deep-AMPpred, a two-stage AMP predictor in which the first stage distinguishes AMPs from other peptides and the second stage solves the multilabel problem of predicting 13 common functional activities of AMPs. deep-AMPpred uses the ESM-2 model to encode AMP features and integrates CNN, BiLSTM, and CBAM models to identify AMPs and their functional activities. The ESM-2 model captures the global contextual features of the peptide sequence, while CNN, BiLSTM, and CBAM combine local feature extraction, long-term and short-term dependency modeling, and attention mechanisms to improve the performance of deep-AMPpred in predicting AMPs and their functions. Experimental results demonstrate that deep-AMPpred performs well in accurately identifying AMPs and predicting their functional activities. This confirms the effectiveness of using the ESM-2 model to capture meaningful peptide sequence features and integrating multiple deep learning models for AMP identification and activity prediction.
Affiliation(s)
- Jun Zhao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Key Laboratory of Agricultural Sensors for Ministry of Agriculture and Rural Affairs, Anhui Agricultural University, Hefei, Anhui 230036, China
- Hangcheng Liu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Key Laboratory of Agricultural Sensors for Ministry of Agriculture and Rural Affairs, Anhui Agricultural University, Hefei, Anhui 230036, China
- Leyao Kang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Key Laboratory of Agricultural Sensors for Ministry of Agriculture and Rural Affairs, Anhui Agricultural University, Hefei, Anhui 230036, China
- Wanling Gao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Key Laboratory of Agricultural Sensors for Ministry of Agriculture and Rural Affairs, Anhui Agricultural University, Hefei, Anhui 230036, China
- Quan Lu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Key Laboratory of Agricultural Sensors for Ministry of Agriculture and Rural Affairs, Anhui Agricultural University, Hefei, Anhui 230036, China
- Yuan Rao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Key Laboratory of Agricultural Sensors for Ministry of Agriculture and Rural Affairs, Anhui Agricultural University, Hefei, Anhui 230036, China
- Zhenyu Yue
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Key Laboratory of Agricultural Sensors for Ministry of Agriculture and Rural Affairs, Anhui Agricultural University, Hefei, Anhui 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, Hefei, Anhui 230036, China
9
Luo J, Zhao K, Chen J, Yang C, Qu F, Liu Y, Jin X, Yan K, Zhang Y, Liu B. iMFP-LG: Identify Novel Multi-functional Peptides Using Protein Language Models and Graph-based Deep Learning. GENOMICS, PROTEOMICS & BIOINFORMATICS 2025; 22:qzae084. [PMID: 39585308 PMCID: PMC12011362 DOI: 10.1093/gpbjnl/qzae084] [Received: 07/10/2023] [Revised: 10/25/2024] [Accepted: 11/21/2024] [Indexed: 11/26/2024]
Abstract
Functional peptides are short amino acid fragments that have a wide range of beneficial functions for living organisms. The majority of previous studies have focused on mono-functional peptides, but an increasing number of multi-functional peptides have been discovered. Although there have been enormous experimental efforts to assay multi-functional peptides, only a small portion of millions of known peptides has been explored. The development of effective and accurate techniques for identifying multi-functional peptides can facilitate their discovery and mechanistic understanding. In this study, we presented iMFP-LG, a method for multi-functional peptide identification based on protein language models (pLMs) and graph attention networks (GATs). Our comparative analyses demonstrated that iMFP-LG outperformed the state-of-the-art methods in identifying both multi-functional bioactive peptides and multi-functional therapeutic peptides. The interpretability of iMFP-LG was also illustrated by visualizing attention patterns in pLMs and GATs. Given the outstanding performance of iMFP-LG in identifying multi-functional peptides, we employed iMFP-LG to screen novel peptides with both anti-microbial and anti-cancer functions from millions of known peptides in the UniRef90 database. As a result, eight candidate peptides were identified, among which one candidate was validated to possess both anti-bacterial and anti-cancer properties through molecular structure alignment and biological experiments. We anticipate that iMFP-LG can assist in the discovery of multi-functional peptides and contribute to the advancement of peptide drug design.
Affiliation(s)
- Jiawei Luo
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Kejuan Zhao
- School of Science, Harbin Institute of Technology, Shenzhen 518055, China
- Junjie Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Caihua Yang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Fuchuan Qu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
- Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518055, China
- Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518055, China
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 10081, China
- Yang Zhang
- School of Science, Harbin Institute of Technology, Shenzhen 518055, China
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 10081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 10081, China
10
Guan C, Fernandes FC, Franco OL, de la Fuente-Nunez C. Leveraging large language models for peptide antibiotic design. CELL REPORTS. PHYSICAL SCIENCE 2025; 6:102359. [PMID: 39949833 PMCID: PMC11823563 DOI: 10.1016/j.xcrp.2024.102359] [Indexed: 02/16/2025]
Abstract
Large language models (LLMs) have significantly impacted various domains of our society, including recent applications in complex fields such as biology and chemistry. These models, built on sophisticated neural network architectures and trained on extensive datasets, are powerful tools for designing, optimizing, and generating molecules. This review explores the role of LLMs in discovering and designing antibiotics, focusing on peptide molecules. We highlight advancements in drug design and outline the challenges of applying LLMs in these areas.
Affiliation(s)
- Changge Guan
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
- Department of Chemistry, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
- These authors contributed equally
- Fabiano C. Fernandes
- Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
- Departamento de Ciência da Computação, Instituto Federal de Brasília, Campus Taguatinga, Brasília, Brazil
- These authors contributed equally
- Octavio L. Franco
- Centro de Análises Proteômicas e Bioquímicas, Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brazil
- S-Inova Biotech, Programa de Pós-Graduação em Biotecnologia, Universidade Católica Dom Bosco, Campo Grande, Brazil
- Cesar de la Fuente-Nunez
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
- Department of Chemistry, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
11
Bizzotto E, Zampieri G, Treu L, Filannino P, Di Cagno R, Campanaro S. Classification of bioactive peptides: A systematic benchmark of models and encodings. Comput Struct Biotechnol J 2024; 23:2442-2452. [PMID: 38867723 PMCID: PMC11168199 DOI: 10.1016/j.csbj.2024.05.040] [Received: 03/19/2024] [Revised: 05/10/2024] [Accepted: 05/22/2024] [Indexed: 06/14/2024]
Abstract
Bioactive peptides are short amino acid chains possessing biological activity and exerting physiological effects relevant to human health. Despite their therapeutic value, their identification remains a major problem, as it mainly relies on time-consuming in vitro tests. While bioinformatic tools for the identification of bioactive peptides are available, they are focused on specific functional classes and have not been systematically tested on realistic settings. To tackle this problem, bioactive peptide sequences and functions were here gathered from a variety of databases to generate a unified collection of bioactive peptides from microbial fermentation. This collection was organized into nine functional classes including some previously studied and some unexplored such as immunomodulatory, opioid and cardiovascular peptides. Upon assessing their sequence properties, four alternative encoding methods were tested in combination with a multitude of machine learning algorithms, from basic classifiers like logistic regression to advanced algorithms like BERT. Tests on a total of 171 models showed that, while some functions are intrinsically easier to detect, no single combination of classifiers and encoders worked universally well for all classes. For this reason, we unified all the best individual models for each class and generated CICERON (Classification of bIoaCtive pEptides fRom micrObial fermeNtation), a classification tool for the functional classification of peptides. State-of-the-art classifiers were found to underperform on our realistic benchmark dataset compared to the models included in CICERON. Altogether, our work provides a tool for real-world peptide classification and can serve as a benchmark for future model development.
Affiliation(s)
- Edoardo Bizzotto
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
- Guido Zampieri
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
- Laura Treu
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
- Pasquale Filannino
  - Department of Soil, Plant and Food Science, University of Bari Aldo Moro, Via G. Amendola 165/a, Bari 70126, Italy
- Raffaella Di Cagno
  - Faculty of Agricultural, Environmental and Food Sciences, Free University of Bolzano, Piazza Universita, 5, Bolzano 39100, Italy
- Stefano Campanaro
  - Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy
|
12
|
Luo X, Chi ASY, Lin AH, Ong TJ, Wong L, Rahman CR. Benchmarking recent computational tools for DNA-binding protein identification. Brief Bioinform 2024; 26:bbae634. [PMID: 39657630 PMCID: PMC11630855 DOI: 10.1093/bib/bbae634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2024] [Revised: 10/29/2024] [Accepted: 11/20/2024] [Indexed: 12/12/2024] Open
Abstract
Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various cellular processes. In this paper, we conduct an unbiased benchmarking of 11 state-of-the-art computational tools as well as traditional tools such as ScanProsite, BLAST, and HMMER for identifying DBPs. We highlight the data leakage issue in conventional datasets leading to inflated performance. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions regarding these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at https://github.com/Rafeed-bot/DNA_BP_Benchmarking.
Affiliation(s)
- Xizi Luo
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
- Amadeus Song Yi Chi
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
- Andre Huikai Lin
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
- Tze Jet Ong
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
- Limsoon Wong
  - School of Computing, National University of Singapore, Singapore 119077, Singapore
|
13
|
Liu X, Luo J, Wang X, Zhang Y, Chen J. Directed evolution of antimicrobial peptides using multi-objective zeroth-order optimization. Brief Bioinform 2024; 26:bbae715. [PMID: 39800873 PMCID: PMC11725395 DOI: 10.1093/bib/bbae715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2024] [Revised: 12/08/2024] [Accepted: 12/27/2024] [Indexed: 01/16/2025] Open
Abstract
Antimicrobial peptides (AMPs) are emerging as promising therapeutic compounds that exhibit broad-spectrum antimicrobial activity with high specificity and good tolerability. Natural AMPs usually need further rational design to improve antimicrobial activity and decrease toxicity to human cells. Although several algorithms have been developed to optimize AMPs with desired properties, they explore the variations of AMPs in a discrete amino acid sequence space and usually suffer from low efficiency, lack of diversity, and local optima. In this work, we propose a novel directed evolution method, named PepZOO, for optimizing multiple properties of AMPs in a continuous representation space guided by multi-objective zeroth-order optimization. PepZOO projects AMPs from a discrete amino acid sequence space into a continuous latent representation space using a variational autoencoder. Subsequently, the latent embeddings of prototype AMPs are taken as starting points and iteratively updated under the guidance of multi-objective zeroth-order optimization. Experimental results demonstrate that PepZOO outperforms state-of-the-art methods in improving multiple properties, namely antimicrobial function, activity, toxicity, and binding affinity to the targets. Molecular docking and molecular dynamics simulations are further employed to validate the effectiveness of our method. Moreover, by aligning the evolutionary sequences, PepZOO can reveal important motifs that are required to maintain a particular property during the evolution. PepZOO provides a novel research paradigm that optimizes AMPs by exploring property change instead of exploring sequence mutations, accelerating the discovery of potential therapeutic peptides.
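Zeroth-order optimization updates a point using only function evaluations, with no analytic gradient. The following is a minimal sketch of that idea, not the PepZOO implementation: it uses a single toy objective in place of the paper's multiple property predictors, and the names `zo_gradient`, `zo_minimize`, `toy_loss` and `TARGET` are hypothetical, as are the step size, smoothing radius and number of directions.

```python
import random


def zo_gradient(f, z, mu=1e-3, num_dirs=20, rng=random):
    """Estimate the gradient of f at z from function values only.

    Averages forward-difference slopes along random Gaussian directions.
    """
    d = len(z)
    base = f(z)
    grad = [0.0] * d
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z_pert = [zi + mu * ui for zi, ui in zip(z, u)]
        slope = (f(z_pert) - base) / mu  # directional finite difference
        for i in range(d):
            grad[i] += slope * u[i] / num_dirs
    return grad


def zo_minimize(f, z0, lr=0.05, steps=200, seed=0):
    """Plain descent on f using the zeroth-order gradient estimate."""
    rng = random.Random(seed)
    z = list(z0)
    for _ in range(steps):
        g = zo_gradient(f, z, rng=rng)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


# Toy stand-in for a property predictor on a latent vector (assumption):
# squared distance to a hypothetical "ideal" latent point.
TARGET = [0.5, -1.0, 2.0, 0.0]


def toy_loss(z):
    return sum((zi - ti) ** 2 for zi, ti in zip(z, TARGET))


start = [0.0, 0.0, 0.0, 0.0]
optimized = zo_minimize(toy_loss, start)
print(toy_loss(start), toy_loss(optimized))
```

In a setting like the one the abstract describes, `z` would be a latent embedding from the autoencoder and `f` would combine several property scores; how those objectives are combined is not specified here and is left out of the sketch.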
Affiliation(s)
- Xianliang Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, HIT Campus, Shenzhen University Town, Nanshan District, Shenzhen 518055, Guangdong, China
- Jiawei Luo
  - School of Computer Science and Technology, Harbin Institute of Technology, HIT Campus, Shenzhen University Town, Nanshan District, Shenzhen 518055, Guangdong, China
- Xinyan Wang
  - Core Research Facility, Southern University of Science and Technology, No. 1088 Xueyuan Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Yang Zhang
  - School of Science, Harbin Institute of Technology, HIT Campus, Shenzhen University Town, Nanshan District, Shenzhen 518055, Guangdong, China
- Junjie Chen
  - School of Computer Science and Technology, Harbin Institute of Technology, HIT Campus, Shenzhen University Town, Nanshan District, Shenzhen 518055, Guangdong, China
|
14
|
Tang Q, Xiang Y, Gao W, Zhu L, Xu Z, Li Y, Yue Z. TeaTFactor: A Prediction Tool for Tea Plant Transcription Factors Based on BERT. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:2123-2132. [PMID: 39150804 DOI: 10.1109/tcbb.2024.3444466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/18/2024]
Abstract
A transcription factor (TF) is a sequence-specific DNA-binding protein that plays key roles in cell-fate decisions by regulating gene expression. Predicting TFs is important for the tea plant research community, as TFs regulate gene expression and thereby influence plant growth, development, and stress responses. Identifying TFs through wet-lab experimental validation is challenging because of their rarity and the high cost and time requirements. As a result, computational methods are increasingly popular. The pre-training strategy has been applied to many tasks in natural language processing (NLP) and has achieved impressive performance. In this paper, we present a novel recognition algorithm named TeaTFactor that utilizes pre-training for TF prediction. The model is built upon the BERT architecture and was initially pre-trained using protein data from UniProt. Subsequently, the model was fine-tuned using the collected TF data of tea plants. We evaluated four different word segmentation methods and the existing state-of-the-art prediction tools. According to the comprehensive experimental results and a case study, our model is superior to existing models and achieves the goal of accurate identification. In addition, we have developed a web server at http://teatfactor.tlds.cc, which we believe will facilitate future studies on tea transcription factors and advance the field of crop synthetic biology.
|
15
|
Qi D, Song C, Liu T. PreDBP-PLMs: Prediction of DNA-binding proteins based on pre-trained protein language models and convolutional neural networks. Anal Biochem 2024; 694:115603. [PMID: 38986796 DOI: 10.1016/j.ab.2024.115603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2024] [Revised: 06/15/2024] [Accepted: 07/06/2024] [Indexed: 07/12/2024]
Abstract
The recognition of DNA-binding proteins (DBPs) is the crucial step to understanding their roles in various biological processes such as genetic regulation, gene expression, cell cycle control, DNA repair, and replication within cells. However, conventional experimental methods for identifying DBPs are usually time-consuming and expensive. Therefore, there is an urgent need to develop rapid and efficient computational methods for the prediction of DBPs. In this study, we proposed a novel predictor named PreDBP-PLMs to further improve the identification accuracy of DBPs by fusing the pre-trained protein language model (PLM) ProtT5 embedding with evolutionary features as input to the classic convolutional neural network (CNN) model. Firstly, the ProtT5 embedding was combined with different evolutionary features derived from the position-specific scoring matrix (PSSM) to represent protein sequences. Then, the optimal feature combination was selected and input to the CNN classifier for the prediction of DBPs. Finally, the 5-fold cross-validation (CV), the leave-one-out CV (LOOCV), and the independent set test were adopted to examine the performance of PreDBP-PLMs on the benchmark datasets. Compared to the existing state-of-the-art predictors, PreDBP-PLMs exhibits an accuracy improvement of 0.5 % and 5.2 % on the PDB186 and PDB2272 datasets, respectively. It demonstrated that the proposed method could serve as a useful tool for the recognition of DBPs.
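The 5-fold cross-validation and leave-one-out cross-validation named in this abstract differ only in the number of folds: LOOCV is k-fold with k equal to the number of samples. A minimal sketch of the index splitting follows; `k_fold_indices` is a hypothetical helper, it does no shuffling or stratification, and it is not the authors' evaluation code.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


five_fold = k_fold_indices(10, 5)      # 5-fold CV on 10 samples
leave_one_out = k_fold_indices(4, 4)   # LOOCV on 4 samples
for train, test in five_fold:
    print(train, test)
```

Each sample appears in exactly one test fold, so every sequence is predicted once by a model that did not see it during training.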
Affiliation(s)
- Dawei Qi
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
- Chen Song
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
|
16
|
Gao W, Zhao J, Gui J, Wang Z, Chen J, Yue Z. Comprehensive Assessment of BERT-Based Methods for Predicting Antimicrobial Peptides. J Chem Inf Model 2024; 64:7772-7785. [PMID: 39316765 DOI: 10.1021/acs.jcim.4c00507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2024]
Abstract
In recent years, the prediction of antimicrobial peptides (AMPs) has gained prominence due to their high antibacterial activity and reduced susceptibility to drug resistance, making them potential antibiotic substitutes. To advance the field of AMP recognition, an increasing number of natural language processing methods are being applied. These methods differ in their pretraining models, pretraining data sets, word vector embeddings, feature encoding methods, and downstream classification models. Here, we provide a comprehensive survey of current BERT-based methods for AMP prediction. An independent benchmark test data set is constructed to evaluate the predictive capabilities of the surveyed tools. Furthermore, we compare the predictive performance of these computational methods on six different public AMP databases. LM_pred (BFD) outperformed all other surveyed tools owing to its abundant pretraining data set and unique vector embedding approach. To avoid the impact of the varying training data sets used by different methods on prediction performance, we retrained the models and performed 5-fold cross-validation experiments on the same data set. Additionally, to explore the applicability and generalization ability of the models, we constructed a short peptide data set and an external data set to test the retrained models. Although these BERT-based prediction methods can achieve good prediction performance, there is still room for improvement in recognition accuracy. Building on the continuous improvement of protein language models, we propose an AMP prediction method based on the ESM-2 pretrained model, called iAMP-bert. Experimental results demonstrate that iAMP-bert outperforms other approaches. iAMP-bert is freely accessible to the public at http://iamp.aielab.cc/.
Affiliation(s)
- Wanling Gao
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Jun Zhao
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Jianfeng Gui
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Zehan Wang
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Jie Chen
  - National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, Guangdong 518060, China
- Zhenyu Yue
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
|
17
|
Zhao Y, Zhang S, Liang Y. HemoFuse: multi-feature fusion based on multi-head cross-attention for identification of hemolytic peptides. Sci Rep 2024; 14:22518. [PMID: 39342017 PMCID: PMC11438874 DOI: 10.1038/s41598-024-74326-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Accepted: 09/25/2024] [Indexed: 10/01/2024] Open
Abstract
Hemolytic peptides are therapeutic peptides that damage red blood cells. However, therapeutic peptides used in medical treatment must exhibit low toxicity to red blood cells to achieve the desired therapeutic effect. Therefore, accurate prediction of the hemolytic activity of therapeutic peptides is essential for the development of peptide therapies. In this study, a multi-feature cross-fusion model, HemoFuse, for hemolytic peptide identification is proposed. The feature vectors of peptide sequences are transformed by word embedding technique and four hand-crafted feature extraction methods. We apply multi-head cross-attention mechanism to hemolytic peptide identification for the first time. It captures the interaction between word embedding features and hand-crafted features by calculating the attention of all positions in them, so that multiple features can be deeply fused. Moreover, we visualize the features obtained by this module to enhance its interpretability. On the comprehensive integrated dataset, HemoFuse achieves ideal results, with ACC, SP, SN, MCC, F1, AUC, and AP of 0.7575, 0.8814, 0.5793, 0.4909, 0.6620, 0.8387, and 0.7118, respectively. Compared with HemoDL proposed by Yang et al., it is 3.32%, 3.89%, 5.93%, 10.6%, 8.17%, 5.88%, and 2.72% higher. Other ablation experiments also prove that our model is reasonable and efficient. The codes and datasets are accessible at https://github.com/z11code/Hemo .
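The threshold-based metrics listed in this abstract (ACC, SP, SN, MCC, F1) all derive from the four counts of a binary confusion matrix; AUC and AP need ranked scores and are not covered. A minimal sketch with the standard definitions follows. The counts in the example are hypothetical, not data from the paper, and the function assumes no class or prediction column is empty.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SP": sp, "SN": sn, "MCC": mcc, "F1": f1}


# Hypothetical counts: 60 hemolytic and 100 non-hemolytic test peptides.
m = binary_metrics(tp=40, fp=10, tn=90, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A pattern of high SP with much lower SN, as in the reported figures, means the classifier misses a sizeable share of true positives while rarely raising false alarms.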
Affiliation(s)
- Ya Zhao
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, P. R. China
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, P. R. China
- Yunyun Liang
  - School of Science, Xi'an Polytechnic University, Xi'an, 710048, P. R. China
|
18
|
Zhang B, Hou Z, Yang Y, Wong KC, Zhu H, Li X. SOFB is a comprehensive ensemble deep learning approach for elucidating and characterizing protein-nucleic-acid-binding residues. Commun Biol 2024; 7:679. [PMID: 38830995 PMCID: PMC11148103 DOI: 10.1038/s42003-024-06332-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 05/15/2024] [Indexed: 06/05/2024] Open
Abstract
Proteins and nucleic-acids are essential components of living organisms that interact in critical cellular processes. Accurate prediction of nucleic acid-binding residues in proteins can contribute to a better understanding of protein function. However, the discrepancy between protein sequence information and obtained structural and functional data renders most current computational models ineffective. Therefore, it is vital to design computational models based on protein sequence information to identify nucleic acid binding sites in proteins. Here, we implement an ensemble deep learning model-based nucleic-acid-binding residues on proteins identification method, called SOFB, which characterizes protein sequences by learning the semantics of biological dynamics contexts, and then develop an ensemble deep learning-based sequence network to learn feature representation and classification by explicitly modeling dynamic semantic information. Among them, the language learning model, which is constructed from natural language to biological language, captures the underlying relationships of protein sequences, and the ensemble deep learning-based sequence network consisting of different convolutional layers together with Bi-LSTM refines various features for optimal performance. Meanwhile, to address the imbalanced issue, we adopt ensemble learning to train multiple models and then incorporate them. Our experimental results on several DNA/RNA nucleic-acid-binding residue datasets demonstrate that our proposed model outperforms other state-of-the-art methods. In addition, we conduct an interpretability analysis of the identified nucleic acid binding residue sequences based on the attention weights of the language learning model, revealing novel insights into the dynamic semantic information that supports the identified nucleic acid binding residues. 
SOFB is available at https://github.com/Encryptional/SOFB and https://figshare.com/articles/online_resource/SOFB_figshare_rar/25499452.
Affiliation(s)
- Bin Zhang
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Zilong Hou
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Yuning Yang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong, Hong Kong SAR
- Haoran Zhu
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun, China
|
19
|
Chaudhari JK, Pant S, Jha R, Pathak RK, Singh DB. Biological big-data sources, problems of storage, computational issues, and applications: a comprehensive review. Knowl Inf Syst 2024; 66:3159-3209. [DOI: 10.1007/s10115-023-02049-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 09/12/2023] [Accepted: 12/11/2023] [Indexed: 01/03/2025]
|
20
|
Cordoves-Delgado G, García-Jacas CR. Predicting Antimicrobial Peptides Using ESMFold-Predicted Structures and ESM-2-Based Amino Acid Features with Graph Deep Learning. J Chem Inf Model 2024; 64:4310-4321. [PMID: 38739853 DOI: 10.1021/acs.jcim.3c02061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Currently, antimicrobial resistance constitutes a serious threat to human health. Drugs based on antimicrobial peptides (AMPs) constitute one of the alternatives to address it. Shallow and deep learning (DL)-based models have mainly been built from amino acid sequences to predict AMPs. Recent advances in tertiary (3D) structure prediction have opened new opportunities in this field. In this sense, models based on graphs derived from predicted peptide structures have recently been proposed. However, these models do not use state-of-the-art approaches to encode evolutionary information and, in addition, they are memory- and time-consuming because they depend on multiple sequence alignment. Herein, we present a framework to create alignment-free models based on graph representations generated from ESMFold-predicted peptide structures, whose nodes are characterized with amino acid-level evolutionary information derived from the Evolutionary Scale Modeling (ESM-2) models. A graph attention network (GAT) was implemented to assess the usefulness of the framework in AMP classification. To this end, a set of 67,058 peptides was used. The proposed methodology made it possible to build GAT models with generalization abilities consistently better than those of 20 state-of-the-art non-DL-based and DL-based models. The best GAT models were developed using evolutionary information derived from the 36- and 33-layer ESM-2 models. Similarity studies showed that the best GAT models encoded different chemical spaces, and thus they were fused to significantly improve the classification. In general, the results suggest that esm-AxP-GDL is a promising tool to develop good, structure-dependent, and alignment-free models that can be successfully applied in the screening of large data sets. This framework should be useful not only to classify AMPs but also for modeling other peptide and protein activities.
Affiliation(s)
- Greneter Cordoves-Delgado
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- César R García-Jacas
  - Cátedras CONAHCYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
|
21
|
Lobanov MY, Slizen MV, Dovidchenko NV, Panfilov AV, Surin AA, Likhachev IV, Galzitskaya OV. Comparison of deep learning models with simple method to assess the problem of antimicrobial peptides prediction. Mol Inform 2024; 43:e202200181. [PMID: 36961202 DOI: 10.1002/minf.202200181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 03/20/2023] [Accepted: 03/23/2023] [Indexed: 03/25/2023]
Abstract
Antibiotic-resistant strains are an emerging threat to public health. The usage of antimicrobial peptides (AMPs) is one of the promising approaches to solve this problem. For the development of new AMPs, it is necessary to have reliable prediction methods. Recently, deep learning approaches have been used to predict AMP. In this paper, we want to compare simple and complex methods for these purposes. We used the BERT transformer to create sequence embeddings and the multilayer perceptron (MLP) and light attention (LA) approaches for classification. One of them reached about 80 % accuracy and specificity in benchmark testing, which is on par with the best available methods. For comparison, we proposed a simple method using only the amino acid composition of proteins or peptides. This method has shown good results, at the level of the best methods. We have prepared a special server for predicting the ability of AMPs by amino acid composition: http://bioproteom.protres.ru/antimicrob/.
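The amino acid composition used by the simple method is a 20-dimensional vector of residue frequencies. A minimal sketch of how such a feature vector can be computed follows; it illustrates the feature only, not the authors' classifier or server, and `amino_acid_composition` and the example peptide are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code, in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence.

    Characters outside the 20 standard letters are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical example peptide, rich in lysine (K).
features = amino_acid_composition("GIGKFLHSAKKFGKAFVGEIMNS")
for aa, frac in zip(AMINO_ACIDS, features):
    if frac > 0:
        print(aa, round(frac, 3))
```

Because the vector ignores residue order, two peptides with the same residues in a different arrangement get identical features, which is part of what makes the approach simple.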
Affiliation(s)
- M Y Lobanov
  - Laboratory of Bioinformatics and Proteomics, Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
- M V Slizen
  - Laboratory of Bioinformatics and Proteomics, Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
- N V Dovidchenko
  - Laboratory of Bioinformatics and Proteomics, Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
- A V Panfilov
  - Laboratory of Bioinformatics and Proteomics, Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
- A A Surin
  - Faculty of Applied math, MIREA - Russian Technological University, Moscow, 119454, Russia
- I V Likhachev
  - Laboratory of Bioinformatics and Proteomics, Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
  - Institute of Mathematical Problems of Biology branch of Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, 142290, Pushchino, Russia
- O V Galzitskaya
  - Laboratory of Bioinformatics and Proteomics, Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
  - Laboratory of Structure and Function of Muscle Proteins, Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
|
22
|
Li H, Meng J, Wang Z, Tang Y, Xia S, Wang Y, Qin Z, Luan Y. miPEPPred-FRL: A Novel Method for Predicting Plant MiRNA-Encoded Peptides Using Adaptive Feature Representation Learning. J Chem Inf Model 2024; 64:2889-2900. [PMID: 37733290 DOI: 10.1021/acs.jcim.3c01020] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 09/22/2023]
Abstract
MicroRNAs (miRNAs) are an essential type of small molecule RNAs that play significant regulatory roles in organisms. Recent studies have demonstrated that small open reading frames (sORFs) harbored in primary miRNAs (pri-miRNAs) can encode small peptides, known as miPEPs. Plant miPEPs can increase the abundance and activity of cognate miRNAs by promoting the transcription of their corresponding pri-miRNAs, thereby modulating plant traits. Biological experiments are the most effective way to accurately identify miPEPs; however, they are time-consuming and expensive. Hence, an efficient computational method for the identification of miPEPs on a large scale is highly desirable. Up to now, there have been no specialized computational tools for identifying miPEPs. In this work, a novel predictor named miPEPPred-FRL based on an adaptive feature representation learning framework that consists of the feature transformation module and the cascade architecture has been proposed. The feature transformation module integrating a newly designed feature selection method and classifier selection rule is developed to convert sequence-based features into primary class and probabilistic features, which are then fed into the improved cascade architecture to obtain more stable and discriminative augmented features. Finally, the augmented features are utilized to construct the final predictor. Cross-validation experiments illustrate that the novel feature selection method and classifier selection rule contribute to boosting the feature representation ability of the framework. Furthermore, the high accuracy of miPEPPred-FRL on independent testing data suggests that it is a trustworthy and valuable tool for the identification of miPEPs.
Affiliation(s)
- Haibin Li
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Jun Meng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Zhaowei Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Youwei Tang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Shihao Xia
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Yu Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Zhaojing Qin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Yushi Luan
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116024, China
|
23
|
Chen L, Hu Z, Rong Y, Lou B. Deep2Pep: A deep learning method in multi-label classification of bioactive peptide. Comput Biol Chem 2024; 109:108021. [PMID: 38308955 DOI: 10.1016/j.compbiolchem.2024.108021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 12/27/2023] [Accepted: 01/18/2024] [Indexed: 02/05/2024]
Abstract
Functional peptides are easy to absorb and have few side effects, which has attracted increasing interest from pharmaceutical scientists. However, due to limitations in laboratory funding and human resources, it is difficult to screen functional peptides from a large number of peptides with unknown functions. With the development of machine learning and deep learning, the combination of computational methods and biological information provides an effective way to identify peptide functions. To explore the value of multi-functional active peptides, a new deep learning method named Deep2Pep (Deep learning to Peptides) was constructed, based on sequence encoding, embedding, and a language tokenizer. It predicts antimicrobial, antihypertensive, antioxidant, and antihyperglycemic activity of peptides by converting sequence information into numerical vectors and combining BiLSTM, an attention-residual algorithm, and a BERT encoder. The results showed that Deep2Pep had a Hamming loss of 0.095, a subset accuracy of 0.737, and a macro F1-score of 0.734, outperforming other models. BiLSTM played the primary role in Deep2Pep, while the BERT encoder played an auxiliary role. Deep learning algorithms were used in this study to accurately predict the four active functions of peptides, and the approach is expected to provide an effective reference for predicting multi-functional peptides.
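The three multi-label metrics reported in this abstract can be computed directly from binary label matrices. A minimal sketch with the standard definitions follows; the four columns stand for the four activities named above, the toy labels are hypothetical rather than data from the paper, and a label with no true or predicted positives is scored as F1 = 0 by assumption.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    n_labels = len(y_true[0])
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / (len(y_true) * n_labels)


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label set is predicted exactly."""
    exact = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Columns: antimicrobial, antihypertensive, antioxidant, antihyperglycemic.
y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1]]

print(hamming_loss(y_true, y_pred))     # 3 wrong slots out of 16
print(subset_accuracy(y_true, y_pred))  # 2 exact matches out of 4
print(macro_f1(y_true, y_pred))
```

Hamming loss is lower-is-better and credits partially correct label sets, whereas subset accuracy counts a peptide as correct only when all four labels match.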
Affiliation(s)
- Lihua Chen
  - School of Perfume and Aroma Technology, Shanghai Institute of Technology, Shanghai 201418, China
- Zhenkang Hu
  - School of Perfume and Aroma Technology, Shanghai Institute of Technology, Shanghai 201418, China
- Yuzhi Rong
  - School of Perfume and Aroma Technology, Shanghai Institute of Technology, Shanghai 201418, China
- Bao Lou
  - Institute of Hydrobiology, Zhejiang Academy of Agricultural Sciences, Hangzhou 310021, China
|
24
|
Palacios A, Acharya P, Peidl A, Beck M, Blanco E, Mishra A, Bawa-Khalfe T, Pakhrin S. SumoPred-PLM: human SUMOylation and SUMO2/3 sites Prediction using Pre-trained Protein Language Model. NAR Genom Bioinform 2024; 6:lqae011. [PMID: 38327870 PMCID: PMC10849187 DOI: 10.1093/nargab/lqae011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 11/17/2023] [Accepted: 01/17/2024] [Indexed: 02/09/2024] Open
Abstract
SUMOylation is an essential post-translational modification system with the ability to regulate nearly all aspects of cellular physiology. Three major paralogues of the small ubiquitin-like modifier, SUMO1, SUMO2 and SUMO3, are covalently attached to lysine residues at consensus sites in protein substrates. Biochemical studies continue to identify unique biological functions for protein targets conjugated to SUMO1 versus the highly homologous SUMO2 and SUMO3 paralogues. Yet, the field has failed to harness contemporary AI approaches, including pre-trained protein language models, to fully expand and/or recognize the SUMOylated proteome. Herein, we present a novel, deep learning-based approach called SumoPred-PLM for human SUMOylation prediction with sensitivity, specificity, Matthews correlation coefficient, and accuracy of 74.64%, 73.36%, 0.48 and 74.00%, respectively, on the CPLM 4.0 independent test dataset. In addition, this novel platform uses contextualized embeddings obtained from a pre-trained protein language model, ProtT5-XL-UniRef50, to identify SUMO2/3-specific conjugation sites. The results demonstrate that SumoPred-PLM is a powerful and unique computational tool to predict SUMOylation sites in proteins and accelerate discovery.
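The Matthews correlation coefficient is a unitless value in [-1, 1], not a percentage, and the reported figures can be cross-checked against each other. The worked calculation below assumes a balanced test set (equal numbers of positive and negative sites); that assumption is not stated in the abstract, but it is consistent with the reported accuracy being the exact mean of sensitivity and specificity.

```latex
% General definition from confusion-matrix counts
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

% Balanced classes (assumption): normalise each class to size 1, so that
% TP = Sn, FN = 1 - Sn, TN = Sp, FP = 1 - Sp
\mathrm{MCC} = \frac{Sn + Sp - 1}{\sqrt{1 - (Sn - Sp)^2}}

% With Sn = 0.7464 and Sp = 0.7336
\mathrm{MCC} = \frac{0.7464 + 0.7336 - 1}{\sqrt{1 - 0.0128^2}}
             = \frac{0.4800}{0.99992} \approx 0.48,
\qquad
\mathrm{ACC} = \frac{Sn + Sp}{2} = 0.7400
```

Under this assumption the sensitivity and specificity imply an MCC of about 0.48 and an accuracy of 74.00%.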
Affiliation(s)
- Andrew Vargas Palacios
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, TX 77002, USA
| | - Pujan Acharya
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, TX 77002, USA
| | - Anthony Stephen Peidl
- Department of Biology and Biochemistry, Center for Nuclear Receptors & Cell Signaling, University of Houston, Houston, TX 77204, USA
| | - Moriah Rene Beck
- Department of Chemistry and Biochemistry, Wichita State University, 1845 Fairmount St., Wichita, KS 67260, USA
| | - Eduardo Blanco
- Department of Computer Science, University of Arizona, 1040 4th St., Tucson, AZ 85721, USA
| | - Avdesh Mishra
- Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, USA
| | - Tasneem Bawa-Khalfe
- Department of Biology and Biochemistry, Center for Nuclear Receptors & Cell Signaling, University of Houston, Houston, TX 77204, USA
| | - Subash Chandra Pakhrin
- Department of Computer Science and Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, TX 77002, USA
| |
|
25
|
Zhuang J, Gao W, Su R. EnAMP: A novel deep learning ensemble antibacterial peptide recognition algorithm based on multi-features. J Bioinform Comput Biol 2024; 22:2450001. [PMID: 38406833 DOI: 10.1142/s021972002450001x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Antimicrobial peptides (AMPs), as the preferred alternatives to antibiotics, have wide application with good prospects. Identifying AMPs through wet lab experiments remains expensive, time-consuming and challenging. Many machine learning methods have been proposed to predict AMPs and have achieved good results. In this work, we combine two kinds of word embedding features with the statistical features of peptide sequences to develop an ensemble classifier named EnAMP. In EnAMP, two deep neural networks are trained on Word2vec and GloVe word embedding features of peptide sequences, respectively, while statistical features of peptide sequences are used to train random forest and support vector machine classifiers. The average of the four classifiers is the final prediction result. Compared with other state-of-the-art algorithms on six datasets, EnAMP outperforms most existing models with similar computational costs. Even when compared with high computational cost algorithms based on Bidirectional Encoder Representation from Transformers (BERT), the performance of our model is comparable. EnAMP source code and the data are available at https://github.com/ruisue/EnAMP.
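The averaging step described in this abstract can be illustrated with a minimal sketch. It assumes each of the four classifiers outputs a probability that a peptide is an AMP and that the averaged score is thresholded at 0.5; the threshold, the function name `ensemble_average`, and the example probabilities are hypothetical and not taken from the EnAMP code.

```python
from statistics import fmean


def ensemble_average(probabilities, threshold=0.5):
    """Average per-classifier AMP probabilities and apply a decision threshold.

    Returns the mean score and a boolean label (True = predicted AMP).
    """
    score = fmean(probabilities)
    return score, score >= threshold


# Hypothetical outputs for one peptide from the four base classifiers:
# Word2vec DNN, GloVe DNN, random forest, support vector machine.
score, is_amp = ensemble_average([0.9, 0.7, 0.6, 0.4])
print(f"mean score = {score:.2f}, predicted AMP = {is_amp}")
```

Unweighted averaging gives each base model equal influence, so a single dissenting classifier (here the fourth) does not flip the decision on its own.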
Affiliation(s)
- Jujuan Zhuang
- School of Science, Dalian Maritime University, Dalian, Liaoning, P. R. China
| | - Wanquan Gao
- School of Science, Dalian Maritime University, Dalian, Liaoning, P. R. China
| | - Rui Su
- School of Science, Dalian Maritime University, Dalian, Liaoning, P. R. China
| |
|
26
|
Liu F, Yuan C, Chen H, Yang F. Prediction of linear B-cell epitopes based on protein sequence features and BERT embeddings. Sci Rep 2024; 14:2464. [PMID: 38291341 PMCID: PMC10828400 DOI: 10.1038/s41598-024-53028-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 01/26/2024] [Indexed: 02/01/2024] Open
Abstract
Linear B-cell epitopes (BCEs) play a key role in the development of peptide vaccines and immunodiagnostic reagents. Therefore, the accurate identification of linear BCEs is of great importance in the prevention of infectious diseases and the diagnosis of related diseases. The experimental methods used to identify BCEs are both expensive and time-consuming, and they do not meet the demand for identification of large-scale protein sequence data. As a result, there is a need to develop an efficient and accurate computational method to rapidly identify linear BCE sequences. In this work, we developed the new linear BCE prediction method LBCE-BERT. This method is based on peptide chain sequence information and natural language model BERT embedding information, using an XGBoost classifier. The model was trained on three benchmark datasets for hyperparameter selection and was subsequently evaluated on several test datasets. The results indicate that our proposed method outperforms others in terms of AUROC and accuracy. The LBCE-BERT model is publicly available at: https://github.com/Lfang111/LBCE-BERT.
Affiliation(s)
- Fang Liu
- School of Humanistic Medicine, Anhui Medical University, Hefei, 230032, Anhui, China
| | - ChengCheng Yuan
- School of Biomedical Engineering, Anhui Medical University, Hefei, 230030, Anhui, China
| | - Haoqiang Chen
- School of Humanistic Medicine, Anhui Medical University, Hefei, 230032, Anhui, China
| | - Fei Yang
- School of Biomedical Engineering, Anhui Medical University, Hefei, 230030, Anhui, China.
| |
|
27
|
Wang R, Wang T, Zhuo L, Wei J, Fu X, Zou Q, Yao X. Diff-AMP: tailored designed antimicrobial peptide framework with all-in-one generation, identification, prediction and optimization. Brief Bioinform 2024; 25:bbae078. [PMID: 38446739 PMCID: PMC10939340 DOI: 10.1093/bib/bbae078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 01/25/2024] [Accepted: 02/08/2024] [Indexed: 03/08/2024] Open
Abstract
Antimicrobial peptides (AMPs), short peptides with diverse functions, effectively target and combat various organisms. The widespread misuse of chemical antibiotics has led to increasing microbial resistance. Due to their low drug resistance and toxicity, AMPs are considered promising substitutes for traditional antibiotics. While existing deep learning technology enhances AMP generation, it also presents certain challenges. Firstly, AMP generation overlooks the complex interdependencies among amino acids. Secondly, current models fail to integrate crucial tasks like screening, attribute prediction and iterative optimization. Consequently, we develop an integrated deep learning framework, Diff-AMP, that automates AMP generation, identification, attribute prediction and iterative optimization. We innovatively integrate kinetic diffusion and attention mechanisms into the reinforcement learning framework for efficient AMP generation. Additionally, our prediction module incorporates pre-training and transfer learning strategies for precise AMP identification and screening. We employ a convolutional neural network for multi-attribute prediction and a reinforcement learning-based iterative optimization strategy to produce diverse AMPs. This framework automates molecule generation, screening, attribute prediction and optimization, thereby advancing AMP research. We have also deployed Diff-AMP on a web server, with code, data and server details available in the Data Availability section.
Affiliation(s)
- Rui Wang
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325000 Wenzhou, China
| | - Tao Wang
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325000 Wenzhou, China
| | - Linlin Zhuo
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325000 Wenzhou, China
| | - Jinhang Wei
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, 325000 Wenzhou, China
| | - Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, 410012 Changsha, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 611730 Chengdu, China
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, 999078 Macao, China
| |
|
28
|
Yu H, Wang R, Qiao J, Wei L. Multi-CGAN: Deep Generative Model-Based Multiproperty Antimicrobial Peptide Design. J Chem Inf Model 2024; 64:316-326. [PMID: 38135439 DOI: 10.1021/acs.jcim.3c01881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2023]
Abstract
Antimicrobial peptides are peptides that are effective against bacteria and viruses, and the discovery of new antimicrobial peptides is of great importance to human life and health. Although the design of antimicrobial peptides using machine learning methods has achieved good results in recent years, it remains a challenge to learn and design novel antimicrobial peptides with multiple properties of interest from peptide data with certain property labels. To this end, we propose Multi-CGAN, a deep generative model-based architecture that can learn from single-attribute peptide data and generate antimicrobial peptide sequences with multiple attributes that we need, which may have a potentially wide range of uses in drug discovery. In particular, we verified that our Multi-CGAN generated peptides with the desired properties have good performance in terms of generation rate. Moreover, a comprehensive statistical analysis demonstrated that our generated peptides are diverse and have a low probability of being homologous to the training data. Interestingly, we found that the performance of many popular deep learning methods on the antimicrobial peptide prediction task can be improved by using Multi-CGAN to expand the data on the training set of the original task, indicating the high quality of our generated peptides and the robust ability of our method. In addition, we also investigated whether it is possible to directionally generate peptide sequences with specified properties by controlling the input noise sampling for our model.
Affiliation(s)
- Haoqing Yu
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Ruheng Wang
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Jianbo Qiao
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| |
|
29
|
Wang S, Liu Y, Liu Y, Zhang Y, Zhu X. BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT. PeerJ 2023; 11:e16600. [PMID: 38089911 PMCID: PMC10712318 DOI: 10.7717/peerj.16600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 11/15/2023] [Indexed: 12/18/2023] Open
Abstract
DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes and plays important roles in various developmental and physiological processes and in a wide range of human diseases. Thus, it is essential to accurately detect 5mC sites. Although current sequencing technologies can map genome-wide 5mC sites, these experimental methods are both costly and time-consuming. To achieve a fast and accurate prediction of 5mC sites, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (bidirectional encoder representations from transformers) model by using human promoter sequences as language corpus. BERT is a deep two-way language representation model based on Transformer. Second, we fine-tuned the domain-specific BERT model based on the 5mC training dataset to build the model. The cross-validation results show that our model achieves an AUROC of 0.966, which is higher than other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. Furthermore, our model was evaluated on the independent test set, where it achieves an AUROC of 0.966, which is also higher than other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate the use of our model, we built a webserver which can be freely accessed at: http://5mc-pred.zhulab.org.cn.
Affiliation(s)
- Shuyu Wang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yinbo Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yufeng Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yong Zhang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| |
|
30
|
Ma Y, Pei Y, Li C. Predictive Recognition of DNA-binding Proteins Based on Pre-trained Language Model BERT. J Bioinform Comput Biol 2023; 21:2350028. [PMID: 38248912 DOI: 10.1142/s0219720023500282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/23/2024]
Abstract
Identifying proteins is crucial for disease diagnosis and treatment. With the increase of known proteins, large-scale batch predictions are essential. However, traditional biological experiments are time-consuming and expensive, which makes it difficult to accomplish this task efficiently. Deep learning algorithms based on big data analysis, by contrast, have shown potential in this respect. In recent years, language representation models, especially BERT, have made significant advancements in natural language processing. In this paper, using three protein segmentation methods and three encoder numbers, nine BERT models with different sizes are constructed to predict whether known proteins are DNA-binding proteins or not. Furthermore, based on the concept of protein motifs, multi-scale convolutional networks are fused into the models to extract the local features of DNA-binding proteins. Finally, we find that when each amino acid in the protein is treated as a word, the larger the number of encoders, the better the model predictions. Our proposed algorithm achieves 81.88% sensitivity and 0.39 MCC value on the test set. Furthermore, it achieves 62.41% accuracy on the independent test set PDB2272. It is evident that our proposed method can be a tool to assist in the identification of DNA-binding proteins.
Affiliation(s)
- Yue Ma
- School of Computer Science and Technology, Tiangong University, Tianjin, P. R. China
| | - Yongzhen Pei
- School of Mathematical Sciences, Tiangong University, Tianjin, P. R. China
| | - Changguo Li
- Department of Basic Science, Army Military Transportation University, Tianjin, P. R. China
| |
|
31
|
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
- AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan
| |
|
32
|
Guan C, Luo J, Li S, Tan ZL, Wang Y, Chen H, Yamamoto N, Zhang C, Lu Y, Chen J, Xing XH. Exploration of DPP-IV Inhibitory Peptide Design Rules Assisted by the Deep Learning Pipeline That Identifies the Restriction Enzyme Cutting Site. ACS OMEGA 2023; 8:39662-39672. [PMID: 37901493 PMCID: PMC10601436 DOI: 10.1021/acsomega.3c05571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2023] [Accepted: 09/27/2023] [Indexed: 10/31/2023]
Abstract
The mining of antidiabetic dipeptidyl peptidase IV (DPP-IV) inhibitory peptides (DPP-IV-IPs) is currently a costly and laborious process. Due to the absence of rational peptide design rules, it relies on cumbersome screening of unknown enzyme hydrolysates. Here, we present an enhanced deep learning model called bidirectional encoder representation (BERT)-DPPIV, specifically designed to classify DPP-IV-IPs and explore their design rules to discover potent candidates. The end-to-end model utilizes a fine-tuned BERT architecture to extract structural/functional information from input peptides and accurately identify DPP-IV-IPs among them. Experimental results on the benchmark data set showed that BERT-DPPIV yielded state-of-the-art accuracy and MCC of 0.894 and 0.790, surpassing the 0.797 and 0.594 obtained by the sequence-feature model. Furthermore, we leveraged the attention mechanism to uncover that our model could recognize the restriction enzyme cutting site and specific residues that contribute to the inhibition of DPP-IV. Moreover, guided by BERT-DPPIV, proposed design rules for DPP-IV inhibitory tripeptides and pentapeptides were validated, and they can be used to screen potent DPP-IV-IPs.
Affiliation(s)
- Changge Guan
- Key Laboratory for Industrial Biocatalysis, Ministry of Education of China, Department of Chemical Engineering, Tsinghua University, Beijing 100084, China
| | - Jiawei Luo
- Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
| | - Shucheng Li
- Key Laboratory for Industrial Biocatalysis, Ministry of Education of China, Department of Chemical Engineering, Tsinghua University, Beijing 100084, China
| | - Zheng Lin Tan
- School of Life Science and Technology, Tokyo Institute of Technology, 4259 Nagatsutacho, Midori Ward, Yokohama, Kanagawa Prefecture 226-0026, Japan
| | - Yi Wang
- Key Laboratory for Industrial Biocatalysis, Ministry of Education of China, Department of Chemical Engineering, Tsinghua University, Beijing 100084, China
| | - Haihong Chen
- Institute of Biopharmaceutical and Health Engineering, Tsinghua Shenzhen International Graduate School, Shenzhen 518055, China
- Institute of Biomedical Health Technology and Engineering, Shenzhen Bay Laboratory, Shenzhen 518118, China
| | - Naoyuki Yamamoto
- School of Life Science and Technology, Tokyo Institute of Technology, 4259 Nagatsutacho, Midori Ward, Yokohama, Kanagawa Prefecture 226-0026, Japan
| | - Chong Zhang
- Key Laboratory for Industrial Biocatalysis, Ministry of Education of China, Department of Chemical Engineering, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
| | - Yuan Lu
- Key Laboratory for Industrial Biocatalysis, Ministry of Education of China, Department of Chemical Engineering, Tsinghua University, Beijing 100084, China
| | - Junjie Chen
- Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
| | - Xin-Hui Xing
- Key Laboratory for Industrial Biocatalysis, Ministry of Education of China, Department of Chemical Engineering, Tsinghua University, Beijing 100084, China
- Institute of Biopharmaceutical and Health Engineering, Tsinghua Shenzhen International Graduate School, Shenzhen 518055, China
- Institute of Biomedical Health Technology and Engineering, Shenzhen Bay Laboratory, Shenzhen 518118, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
| |
|
33
|
Zhang J, Yan W, Zhang Q, Li Z, Liang L, Zuo M, Zhang Y. Umami-BERT: An interpretable BERT-based model for umami peptides prediction. Food Res Int 2023; 172:113142. [PMID: 37689906 DOI: 10.1016/j.foodres.2023.113142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 06/12/2023] [Accepted: 06/13/2023] [Indexed: 09/11/2023]
Abstract
Umami peptides have received extensive attention due to their ability to enhance flavors and provide nutritional benefits. The increasing demand for novel umami peptides and the vast number of peptides present in food call for more efficient methods to screen umami peptides, and further exploration is necessary. Therefore, the purpose of this study is to develop a deep learning (DL) model for rapid screening of umami peptides. The Umami-BERT model was devised utilizing a novel two-stage training strategy with Bidirectional Encoder Representations from Transformers (BERT) and the inception network. In the pre-training stage, attention mechanisms were applied to a large number of bioactive peptide sequences to acquire high-dimensional generalized features. In the re-training stage, umami peptide prediction was carried out on the UMP789 dataset, which was developed from the latest research. The model achieved an accuracy (ACC) of 93.23% and MCC of 0.78 on the balanced dataset, as well as an ACC of 95.00% and MCC of 0.85 on the unbalanced dataset. The results demonstrated that Umami-BERT could predict umami peptides directly from their amino acid sequences and exceeded the performance of other models. Furthermore, Umami-BERT enabled analysis of the attention patterns learned by the model. The amino acids Alanine (A), Cysteine (C), Aspartate (D), and Glutamic acid (E) were found to be the most significant contributors to umami peptides. Additionally, the patterns of summarized umami peptides involving A, C, D, and E were analyzed based on the learned attention weights. Consequently, Umami-BERT exhibited great potential in the large-scale screening of candidate peptides and offers novel insight for the further exploration of umami peptides.
Affiliation(s)
- Jingcheng Zhang
- Food Laboratory of Zhongyuan, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China.
| | - Wenjing Yan
- National Engineering Research Centre for Agri-product Quality Traceability, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China.
| | - Qingchuan Zhang
- National Engineering Research Centre for Agri-product Quality Traceability, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China.
| | - Zihan Li
- National Engineering Research Centre for Agri-product Quality Traceability, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China.
| | - Li Liang
- Food Laboratory of Zhongyuan, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China.
| | - Min Zuo
- National Engineering Research Centre for Agri-product Quality Traceability, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China.
| | - Yuyu Zhang
- Food Laboratory of Zhongyuan, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China; Key Laboratory of Flavor Science of China General Chamber of Commerce, Beijing Technology and Business University, No. 11/33, Fucheng Road, Haidian District, Beijing 100048, China.
| |
|
34
|
Yao L, Zhang Y, Li W, Chung C, Guan J, Zhang W, Chiang Y, Lee T. DeepAFP: An effective computational framework for identifying antifungal peptides based on deep learning. Protein Sci 2023; 32:e4758. [PMID: 37595093 PMCID: PMC10503419 DOI: 10.1002/pro.4758] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/05/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/20/2023]
Abstract
Fungal infections have become a significant global health issue, affecting millions worldwide. Antifungal peptides (AFPs) have emerged as a promising alternative to conventional antifungal drugs due to their low toxicity and low propensity for inducing resistance. In this study, we developed a deep learning-based framework called DeepAFP to efficiently identify AFPs. DeepAFP fully leverages and mines composition information, evolutionary information, and physicochemical properties of peptides by employing combined kernels from multiple branches of convolutional neural network with bi-directional long short-term memory layers. In addition, DeepAFP integrates a transfer learning strategy to obtain efficient representations of peptides for improving model performance. DeepAFP demonstrates strong predictive ability on carefully curated datasets, yielding an accuracy of 93.29% and an F1-score of 93.45% on the DeepAFP-Main dataset. The experimental results show that DeepAFP outperforms existing AFP prediction tools, achieving state-of-the-art performance. Finally, we provide a downloadable AFP prediction tool to meet the demands of large-scale prediction and facilitate the usage of our framework by the public or other researchers. Our framework can accurately identify AFPs in a short time without requiring significant human and material resources, and hence can accelerate the development of AFPs as well as contribute to the treatment of fungal infections. Furthermore, our method can provide new perspectives for other biological sequence analysis tasks.
Affiliation(s)
- Lantian Yao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, China
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
| | - Yuntian Zhang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, China
| | - Wenshuo Li
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
| | - Chia‐Ru Chung
- Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
| | - Jiahui Guan
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, China
| | - Wenyang Zhang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, China
| | - Ying‐Chih Chiang
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, China
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, China
| | - Tzong‐Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Center for Intelligent Drug Systems and Smart Bio‐devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| |
|
35
|
Ju H, Bai J, Jiang J, Che Y, Chen X. Comparative evaluation and analysis of DNA N4-methylcytosine methylation sites using deep learning. Front Genet 2023; 14:1254827. [PMID: 37671040 PMCID: PMC10476523 DOI: 10.3389/fgene.2023.1254827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
DNA N4-methylcytosine (4mC) is significantly involved in biological processes, such as DNA expression, repair, and replication. Therefore, accurate prediction methods are urgently needed. Deep learning methods have transformed applications that previously required sequencing expertise into engineering challenges that do not require such expertise to solve. Here, we compare a variety of state-of-the-art deep learning models on six benchmark datasets to evaluate their performance in 4mC methylation site detection. We visualize the statistical analysis of the datasets and the performance of different deep-learning models. We conclude that deep learning can greatly expand the potential of methylation site prediction.
Affiliation(s)
- Hong Ju
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Jie Bai
- Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education, Hangzhou, China
| | - Jing Jiang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Yusheng Che
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Xin Chen
- Department of Neurosurgical Laboratory, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| |
|
36
|
Xu J, Li F, Li C, Guo X, Landersdorfer C, Shen HH, Peleg AY, Li J, Imoto S, Yao J, Akutsu T, Song J. iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities. Brief Bioinform 2023; 24:bbad240. [PMID: 37369638 PMCID: PMC10359087 DOI: 10.1093/bib/bbad240] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 10/26/2022] [Revised: 05/30/2023] [Accepted: 06/08/2023] [Indexed: 06/29/2023] Open
Abstract
Antimicrobial peptides (AMPs) are short peptides that play crucial roles in diverse biological processes and have various functional activities against target organisms. Due to the abuse of chemical antibiotics and microbial pathogens' increasing resistance to antibiotics, AMPs have the potential to be alternatives to antibiotics. As such, the identification of AMPs has become a widely discussed topic. A variety of computational approaches have been developed to identify AMPs based on machine learning algorithms. However, most of them are not capable of predicting the functional activities of AMPs, and those predictors that can specify activities only focus on a few of them. In this study, we first surveyed 10 predictors that can identify AMPs and their functional activities in terms of the features they employed and the algorithms they utilized. Then, we constructed comprehensive AMP datasets and proposed a new deep learning-based framework, iAMPCN (identification of AMPs based on CNNs), to identify AMPs and their related 22 functional activities. Our experiments demonstrate that iAMPCN significantly improved the prediction performance of AMPs and their corresponding functional activities based on four types of sequence features. Benchmarking experiments on the independent test datasets showed that iAMPCN outperformed a number of state-of-the-art approaches for predicting AMPs and their functional activities. Furthermore, we analyzed the amino acid preferences of different AMP activities and evaluated the model on datasets of varying sequence redundancy thresholds. To facilitate the community-wide identification of AMPs and their corresponding functional types, we have made the source codes of iAMPCN publicly available at https://github.com/joy50706/iAMPCN/tree/master. We anticipate that iAMPCN can be explored as a valuable tool for identifying potential AMPs with specific functional activities for further experimental validation.
Affiliation(s)
- Jing Xu
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC 3800, Australia
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
| | - Cornelia Landersdorfer
- Monash Institute of Pharmaceutical Sciences, Monash University, Melbourne, VIC 3800, Australia
| | - Hsin-Hui Shen
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Department of Materials Science and Engineering, Faculty of Engineering, Monash University, Clayton, VIC, 3800, Australia
| | - Anton Y Peleg
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Department of Infectious Diseases, Alfred Hospital, Alfred Health, Melbourne, Victoria, Australia
| | - Jian Li
- Monash Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
| | - Seiya Imoto
- Division of Health Medical Intelligence, Human Genome Center, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
| | | | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan
|
37
|
Jing Y, Zhang S, Wang H. DapNet-HLA: Adaptive dual-attention mechanism network based on deep learning to predict non-classical HLA binding sites. Anal Biochem 2023; 666:115075. [PMID: 36740003 DOI: 10.1016/j.ab.2023.115075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 01/30/2023] [Accepted: 02/02/2023] [Indexed: 02/05/2023]
Abstract
Human leukocyte antigen (HLA) plays a vital role in immunomodulatory function. Studies have shown that immunotherapy based on non-classical HLA has essential applications in cancer, COVID-19, and allergic diseases. However, there are few deep learning methods to predict non-classical HLA alleles. In this work, an adaptive dual-attention network named DapNet-HLA is established based on existing datasets. Firstly, amino acid sequences are transformed into digital vectors by looking up the table. To overcome the feature sparsity problem caused by unique one-hot encoding, the fused word embedding method is used to map each amino acid to a low-dimensional word vector optimized with the training of the classifier. Then, we use the GCB (group convolution block), SENet attention (squeeze-and-excitation networks), BiLSTM (bidirectional long short-term memory network), and Bahdanau attention mechanism to construct the classifier. The use of SENet can make the weight of the effective feature map high, so that the model can be trained to achieve better results. Attention mechanism is an Encoder-Decoder model used to improve the effectiveness of RNN, LSTM or GRU (gated recurrent neural network). The ablation experiment shows that DapNet-HLA has the best adaptability for five datasets. On the five test datasets, the ACC index and MCC index of DapNet-HLA are 4.89% and 0.0933 higher than the comparison method, respectively. According to the ROC curve and PR curve verified by the 5-fold cross-validation, the AUC value of each fold has a slight fluctuation, which proves the robustness of the DapNet-HLA. The codes and datasets are accessible at https://github.com/JYY625/DapNet-HLA.
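As a rough illustration of the lookup-table encoding step described in this abstract, the sketch below maps residues to integer indices and shows the sparse one-hot form that a learned embedding layer would replace. The alphabet order, the padding index 0, and the function names are assumptions made for illustration; they are not taken from the DapNet-HLA code.

```python
# The 20 standard amino acids in an arbitrary fixed order (an assumption);
# index 0 is reserved for padding and for unknown residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_indices(sequence, max_len):
    """Map each residue to its table index, then truncate or pad to max_len."""
    indices = [AA_TO_INDEX.get(aa, 0) for aa in sequence.upper()]
    indices = indices[:max_len]
    return indices + [0] * (max_len - len(indices))


def one_hot(indices):
    """Expand each index into a sparse 21-dimensional row, shown for comparison."""
    size = len(AMINO_ACIDS) + 1
    return [[1 if j == idx else 0 for j in range(size)] for idx in indices]


encoded = encode_indices("ACDK", max_len=6)
sparse = one_hot(encoded)
print(encoded)
```

In a trained model, the integer indices would typically feed an embedding layer that learns a dense low-dimensional vector per residue, which is the alternative to the one-hot rows above.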
Affiliation(s)
- Yuanyuan Jing
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
| | - Houqiang Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
|
38
|
Liu Y, Wang S, Li X, Liu Y, Zhu X. NeuroPpred-SVM: A New Model for Predicting Neuropeptides Based on Embeddings of BERT. J Proteome Res 2023; 22:718-728. [PMID: 36749151 DOI: 10.1021/acs.jproteome.2c00363] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/08/2023]
Abstract
Neuropeptides play pivotal roles in different physiological processes and are related to different kinds of diseases. Identification of neuropeptides is of great benefit for studying the mechanism of these physiological processes and the treatment of neurological disorders. Several state-of-the-art neuropeptide predictors have been developed by using a two-layer stacking ensemble algorithm. Although the two-layer stacking ensemble algorithm can improve the feature representability, these models are complex and not as efficient as models based on a single classifier. In this study, we propose a new model, NeuroPpred-SVM, to predict neuropeptides based on the embeddings of Bidirectional Encoder Representations from Transformers (BERT) and other sequential features by using a support vector machine (SVM). The experimental results indicate that our model achieved a cross-validation area under the receiver operating characteristic curve (AUROC) of 0.969 on the training data set and an AUROC of 0.966 on the independent test set. In a comparison with four other state-of-the-art models (NeuroPIpred, PredNeuroP, NeuroPpred-Fuse, and NeuroPpred-FRL) on the independent test set, our model achieved the highest AUROC, Matthews correlation coefficient, accuracy, and specificity, which indicates that it outperforms the existing models. We believe that NeuroPpred-SVM could be a useful tool for identifying neuropeptides with high accuracy and low cost. The data sets and Python code are available at https://github.com/liuyf-a/NeuroPpred-SVM.
Affiliation(s)
- Yufeng Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Shuyu Wang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Xiang Li
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Yinbo Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
|
39
|
Deep learning drives efficient discovery of novel antihypertensive peptides from soybean protein isolate. Food Chem 2023; 404:134690. [DOI: 10.1016/j.foodchem.2022.134690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Revised: 09/29/2022] [Accepted: 10/17/2022] [Indexed: 11/06/2022]
|
40
|
Yu H, Luo X. IPPF-FE: an integrated peptide and protein function prediction framework based on fused features and ensemble models. Brief Bioinform 2023; 24:6834141. [PMID: 36403184 DOI: 10.1093/bib/bbac476] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/05/2022] [Revised: 09/23/2022] [Accepted: 10/05/2022] [Indexed: 11/21/2022] Open
Abstract
The prediction of peptide and protein function is important for research and industrial applications, and many machine learning methods have been developed for this purpose. The existing models have encountered many challenges, including the lack of effective and comprehensive features and the limited applicability of each model. Here, we introduce an Integrated Peptide and Protein function prediction Framework based on Fused features and Ensemble models (IPPF-FE), which can accurately capture the relationship between features and labels. The results indicated that IPPF-FE outperformed existing state-of-the-art (SOTA) models on more than 8 different categories of peptide and protein tasks. In addition, t-distributed Stochastic Neighbour Embedding demonstrated the advantages of IPPF-FE. We anticipate that our method will become a versatile tool for peptide and protein prediction tasks and shed light on the future development of related models. The model is open source and available in the GitHub repository https://github.com/Luo-SynBioLab/IPPF-FE.
Affiliation(s)
- Han Yu
- Center for Synthetic Biochemistry, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Xiaozhou Luo
- Shenzhen Key Laboratory for the Intelligent Microbial Manufacturing of Medicines, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.,University of Chinese Academy of Sciences, Beijing 100049, China.,CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.,Center for Synthetic Biochemistry, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
|
41
|
Liu Y, Liu Y, Wang S, Zhu X. LBCE-XGB: A XGBoost Model for Predicting Linear B-Cell Epitopes Based on BERT Embeddings. Interdiscip Sci 2023; 15:293-305. [PMID: 36646842 DOI: 10.1007/s12539-023-00549-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 12/28/2022] [Accepted: 01/03/2023] [Indexed: 01/18/2023]
Abstract
Accurately detecting linear B-cell epitopes (BCEs) is of great value in vaccine design, immunodiagnostic testing, antibody production, and disease prevention and treatment. Wet-lab experiments for determining linear BCEs are both expensive and laborious, and they cannot meet the recognition needs of modern massive protein sequence data. Computational methods, by contrast, can efficiently identify linear BCEs at low cost. Although several computational methods are available, their performance is still not satisfactory. Thus, we propose a new method, LBCE-XGB, to predict linear BCEs based on the XGBoost algorithm. To represent the biological information concealed in peptide sequences, the embeddings of the residues were obtained from a pre-trained domain-specific BERT model. In addition, five other types of attributes, including amino acid composition and amino acid antigenicity scale, were also extracted. The best feature combination was determined according to the cross-validation results. Against the models developed by other deep learning and machine learning algorithms, LBCE-XGB achieves the top performance with an AUROC of 0.845 for fivefold cross-validation. The results on the independent test set show that our model attains an AUROC of 0.838, which is substantially higher than other state-of-the-art methods. The outcomes indicate that the representations of BERT could be an effective feature in predicting linear BCEs, and we believe that LBCE-XGB could be a useful tool for detecting linear B-cell epitopes with high accuracy and low cost.
Affiliation(s)
- Yufeng Liu
- School of Sciences, Anhui Agricultural University, Hefei, 230036, Anhui, China
| | - Yinbo Liu
- School of Sciences, Anhui Agricultural University, Hefei, 230036, Anhui, China
| | - Shuyu Wang
- School of Sciences, Anhui Agricultural University, Hefei, 230036, Anhui, China
| | - Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, 230036, Anhui, China.
|
42
|
Application of a deep generative model produces novel and diverse functional peptides against microbial resistance. Comput Struct Biotechnol J 2022; 21:463-471. [PMID: 36618982 PMCID: PMC9804011 DOI: 10.1016/j.csbj.2022.12.029] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/18/2022] [Revised: 12/13/2022] [Accepted: 12/16/2022] [Indexed: 12/23/2022] Open
Abstract
Antimicrobial resistance could threaten millions of lives in the immediate future. Antimicrobial peptides (AMPs) are an alternative to conventional antibiotic practice against infectious diseases. Despite the potential contribution of AMPs to the field of antibiotics, their development and optimization have encountered serious challenges. Cutting-edge methods with novel and improved selectivity toward resistant targets must be established to create AMP-driven treatments. Here, we present AMPTrans-lstm, a deep generative network-based approach for the rational design of AMPs. The AMPTrans-lstm pipeline involves pre-training, transfer learning, and module identification. The AMPTrans-lstm model has two sub-models, namely, a long short-term memory (LSTM) sampler and a Transformer converter, which can be connected in series to make full use of the stability of the LSTM and the novelty of the Transformer model. These elements can generate AMP candidates, which can then be tailored for specific applications. By analyzing the generated sequences and the AMPs used for training, we show that AMPTrans-lstm can expand the design space of the training AMPs and produce reasonable and brand-new AMP sequences. AMPTrans-lstm can generate functional peptides for antimicrobial resistance with good novelty and diversity, so it is an efficient AMP design tool.
|
43
|
iUP-BERT: Identification of Umami Peptides Based on BERT Features. Foods 2022; 11:foods11223742. [PMID: 36429332 PMCID: PMC9689418 DOI: 10.3390/foods11223742] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/23/2022] [Revised: 11/14/2022] [Accepted: 11/16/2022] [Indexed: 11/23/2022] Open
Abstract
Umami is an important and widely used taste component of food seasoning. Umami peptides are specific structural peptides endowing foods with a favorable umami taste. Laboratory approaches used to identify umami peptides are time-consuming and labor-intensive, which makes them unsuitable for rapid screening. Here, we developed a novel peptide sequence-based umami peptide predictor, namely iUP-BERT, which was based on the deep learning pretrained neural network feature extraction method. After optimization, a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformer) in conjunction with the synthetic minority over-sampling technique (SMOTE) and support vector machine (SVM) methods was adopted for model creation to generate predicted probabilistic scores of potential umami peptides. Further extensive empirical experiments on cross-validation and an independent test showed that iUP-BERT outperformed the existing methods, highlighting its effectiveness and robustness. Finally, an open-access iUP-BERT web server was built. To our knowledge, this is the first efficient sequence-based umami predictor created based on a single deep-learning pretrained neural network feature extraction method. By predicting umami peptides, iUP-BERT can help in further research to improve the palatability of dietary supplements in the future.
|
44
|
García-Jacas CR, García-González LA, Martinez-Rios F, Tapia-Contreras IP, Brizuela CA. Handcrafted versus non-handcrafted (self-supervised) features for the classification of antimicrobial peptides: complementary or redundant? Brief Bioinform 2022; 23:6754757. [PMID: 36215083 DOI: 10.1093/bib/bbac428] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/22/2022] [Revised: 08/28/2022] [Accepted: 09/02/2022] [Indexed: 12/14/2022] Open
Abstract
Antimicrobial peptides (AMPs) have received a great deal of attention given their potential to become a plausible option to fight multi-drug resistant bacteria as well as other pathogens. Quantitative sequence-activity models (QSAMs) have been helpful in discovering new AMPs because they make it possible to explore a large universe of peptide sequences and help reduce the number of wet lab experiments. A main aspect in the building of QSAMs based on shallow learning is to determine an optimal set of protein descriptors (features) required to discriminate between sequences with different antimicrobial activities. These features are generally handcrafted from peptide sequence datasets that are labeled with specific antimicrobial activities. However, recent developments have shown that unsupervised approaches can be used to determine features that outperform human-engineered (handcrafted) features. Thus, knowing which of these two approaches contributes to a better classification of AMPs is a fundamental question for designing more accurate models. Here, we present a systematic and rigorous study to compare both types of features. Experimental outcomes show that non-handcrafted features lead to better performance than handcrafted features. However, the experiments also prove that an improvement in performance is achieved when both types of features are merged. A relevance analysis reveals that non-handcrafted features have higher information content than handcrafted features, while an interaction-based importance analysis reveals that handcrafted features are more important. These findings suggest that there is complementarity between both types of features. Comparisons regarding state-of-the-art deep models show that shallow models yield better performances both when fed with non-handcrafted features alone and when fed with non-handcrafted and handcrafted features together.
Affiliation(s)
- César R García-Jacas
- Cátedras CONACYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
| | - Luis A García-González
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
| | | | - Issac P Tapia-Contreras
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
| | - Carlos A Brizuela
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
|
45
|
Dong B, Li M, Jiang B, Gao B, Li D, Zhang T. Antimicrobial Peptides Prediction method based on sequence multidimensional feature embedding. Front Genet 2022; 13:1069558. [PMID: 36468005 PMCID: PMC9714691 DOI: 10.3389/fgene.2022.1069558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 11/02/2022] [Indexed: 09/10/2024] Open
Abstract
Antimicrobial peptides (AMPs) are alkaline substances with efficient bactericidal activity produced in living organisms. As the best substitute for antibiotics, they have received more and more attention in scientific research and clinical application. AMPs can be produced by almost all organisms and are capable of killing a wide variety of pathogenic microorganisms. In addition to being antibacterial, natural AMPs have many other therapeutically important activities, such as wound healing, antioxidant and immunomodulatory effects. Discovering new AMPs with wet-lab experimental methods is expensive and difficult, and bioinformatics technology can effectively address this problem. Recently, some deep learning methods have been applied to the prediction of AMPs and achieved good results. To further improve the prediction accuracy of AMPs, this paper designs a new deep learning method based on sequence multidimensional representation. By encoding and embedding sequence features and then feeding them into the model to identify AMPs, high-precision classification of AMPs and non-AMPs with lengths of 10-200 is achieved. The results show that our method improved accuracy by 1.05% compared to the most advanced model in independent data validation without decreasing other indicators.
Affiliation(s)
- Benzhi Dong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Mengna Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Bei Jiang
- Tianjin Second People's Hospital, Tianjin Institute of Hepatology, Tianjin, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Dan Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|
46
|
An J, Weng X. Collectively encoding protein properties enriches protein language models. BMC Bioinformatics 2022; 23:467. [DOI: 10.1186/s12859-022-05031-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 10/31/2022] [Indexed: 11/10/2022] Open
Abstract
Pre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
|
47
|
Pang Y, Yao L, Xu J, Wang Z, Lee TY. Integrating transformer and imbalanced multi-label learning to identify antimicrobial peptides and their functional activities. Bioinformatics 2022; 38:5368-5374. [PMID: 36326438 PMCID: PMC9750108 DOI: 10.1093/bioinformatics/btac711] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/02/2022] [Revised: 10/08/2022] [Accepted: 11/02/2022] [Indexed: 11/06/2022] Open
Abstract
MOTIVATION Antimicrobial peptides (AMPs) have the potential to inhibit multiple types of pathogens and to heal infections. Computational strategies can assist in characterizing novel AMPs from proteome or collections of synthetic sequences and discovering their functional abilities toward different microbial targets without intensive labor. RESULTS Here, we present a deep learning-based method for computer-aided novel AMP discovery that utilizes the transformer neural network architecture with knowledge from natural language processing to extract peptide sequence information. We implemented the method for two AMP-related tasks: the first is to discriminate AMPs from other peptides, and the second task is identifying AMPs functional activities related to seven different targets (gram-negative bacteria, gram-positive bacteria, fungi, viruses, cancer cells, parasites and mammalian cell inhibition), which is a multi-label problem. In addition, asymmetric loss was adopted to resolve the intrinsic imbalance of dataset, particularly for the multi-label scenarios. The evaluation showed that our proposed scheme achieves the best performance for the first task (96.85% balanced accuracy) and has a more unbiased prediction for the second task (79.83% balanced accuracy averaged across all functional activities) when compared with that of strategies without imbalanced learning or deep learning. AVAILABILITY AND IMPLEMENTATION The source code and data of this study are available at https://github.com/BiOmicsLab/TransImbAMP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
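The abstract does not spell out the loss, so the following is only a sketch of the commonly used asymmetric loss form for imbalanced multi-label problems: negatives are down-weighted by a larger focusing exponent and by a probability margin, so that the many easy negatives contribute little. The hyperparameter values (`gamma_pos`, `gamma_neg`, `margin`) and the function name are assumptions for illustration and are not taken from the TransImbAMP repository.

```python
import math


def asymmetric_loss(probs, labels, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Mean asymmetric loss over (probability, label) pairs for one sample.

    probs  -- predicted probabilities, one per functional-activity label
    labels -- 0/1 ground-truth values, same length as probs
    """
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Positive term: focal-style weighting with exponent gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin so that
            # very easy negatives (p <= margin) contribute exactly zero loss.
            p_shifted = max(p - margin, 0.0)
            total += -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))
    return total / len(probs)


# Seven labels, as in the seven activity targets listed in the abstract;
# the probabilities and labels here are made-up illustration values.
example = asymmetric_loss(
    [0.9, 0.1, 0.02, 0.6, 0.03, 0.2, 0.01],
    [1, 0, 0, 1, 0, 0, 0],
)
print(example)
```

With these settings a confidently rejected negative costs nothing, while a missed positive is penalized like ordinary cross-entropy, which is one way to counter the label imbalance the abstract mentions.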
Affiliation(s)
| | | | - Jingyi Xu
- School of Life and Health Sciences, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Shenzhen 518172, China
| | - Zhuo Wang
- To whom correspondence should be addressed.
|
48
|
Yan J, Cai J, Zhang B, Wang Y, Wong DF, Siu SWI. Recent Progress in the Discovery and Design of Antimicrobial Peptides Using Traditional Machine Learning and Deep Learning. Antibiotics (Basel) 2022; 11:1451. [PMID: 36290108 PMCID: PMC9598685 DOI: 10.3390/antibiotics11101451] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 09/20/2022] [Revised: 10/11/2022] [Accepted: 10/13/2022] [Indexed: 11/16/2022] Open
Abstract
Antimicrobial resistance has become a critical global health problem due to the abuse of conventional antibiotics and the rise of multi-drug-resistant microbes. Antimicrobial peptides (AMPs) are a group of natural peptides that show promise as next-generation antibiotics due to their low toxicity to the host, broad spectrum of biological activity, including antibacterial, antifungal, antiviral, and anti-parasitic activities, and great therapeutic potential, such as anticancer, anti-inflammatory, etc. Most importantly, AMPs kill bacteria by damaging cell membranes using multiple mechanisms of action rather than targeting a single molecule or pathway, making it difficult for bacterial drug resistance to develop. However, experimental approaches used to discover and design new AMPs are very expensive and time-consuming. In recent years, there has been considerable interest in using in silico methods, including traditional machine learning (ML) and deep learning (DL) approaches, to drug discovery. While there are a few papers summarizing computational AMP prediction methods, none of them focused on DL methods. In this review, we aim to survey the latest AMP prediction methods achieved by DL approaches. First, the biology background of AMP is introduced, then various feature encoding methods used to represent the features of peptide sequences are presented. We explain the most popular DL techniques and highlight the recent works based on them to classify AMPs and design novel peptide sequences. Finally, we discuss the limitations and challenges of AMP prediction.
Affiliation(s)
- Jielu Yan
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Jianxiu Cai
- Faculty of Applied Sciences, Macao Polytechnic University, Macau, China
- Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macau, China
| | - Bob Zhang
- PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Yapeng Wang
- Faculty of Applied Sciences, Macao Polytechnic University, Macau, China
| | - Derek F. Wong
- NLP2CT Lab, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
| | - Shirley W. I. Siu
- Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macau, China
- School of Pharmaceutical Sciences, Universiti Sains Malaysia, Pulau Pinang 11800, Malaysia
|
49
|
PD-BertEDL: An Ensemble Deep Learning Method Using BERT and Multivariate Representation to Predict Peptide Detectability. Int J Mol Sci 2022; 23:ijms232012385. [PMID: 36293242 PMCID: PMC9604182 DOI: 10.3390/ijms232012385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 10/11/2022] [Accepted: 10/12/2022] [Indexed: 12/03/2022] Open
Abstract
Peptide detectability is defined as the probability of identifying a peptide from a mixture of standard samples, which is a key step in protein identification and analysis. Exploring effective methods for predicting peptide detectability is helpful for disease treatment and clinical research. However, most existing computational methods for predicting peptide detectability rely on a single type of information. With the increasing complexity of feature representation, it is necessary to explore the influence of multivariate information on peptide detectability. Thus, we propose an ensemble deep learning method, PD-BertEDL. Bidirectional encoder representations from transformers (BERT) is introduced to capture the context information of peptides. Context information, sequence information, and physicochemical information of peptides were combined to construct the multivariate feature space of peptides. We use different deep learning methods to capture the high-quality features of the different categories of peptide information and use an average fusion strategy to integrate the three model prediction results, in order to solve the heterogeneity problem and to enhance the robustness and adaptability of the model. The experimental results show that PD-BertEDL is superior to the existing prediction methods; it can effectively predict peptide detectability and provide strong support for protein identification and quantitative analysis, as well as disease treatment.
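The average fusion step mentioned in this abstract can be sketched in a few lines: each model outputs one probability per peptide, and the fused score is their arithmetic mean. The function name, the example probabilities, and the 0.5 decision threshold are assumptions for illustration; the abstract does not give these details.

```python
def average_fusion(*model_probs):
    """Average per-peptide probabilities across any number of models."""
    lengths = {len(probs) for probs in model_probs}
    if len(lengths) != 1:
        raise ValueError("every model must score the same number of peptides")
    n_models = len(model_probs)
    return [sum(scores) / n_models for scores in zip(*model_probs)]


# Hypothetical outputs of three models (for example context, sequence, and
# physicochemical branches) for two peptides.
fused = average_fusion([0.9, 0.2], [0.6, 0.4], [0.75, 0.3])
labels = [int(p >= 0.5) for p in fused]
print(fused, labels)
```

Averaging treats the three models as equally reliable; a weighted mean would be the natural variation if one branch were known to be stronger.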
|
50
|
Chen S, Li Q, Zhao J, Bin Y, Zheng C. NeuroPred-CLQ: incorporating deep temporal convolutional networks and multi-head attention mechanism to predict neuropeptides. Brief Bioinform 2022; 23:6672901. [DOI: 10.1093/bib/bbac319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Revised: 06/27/2022] [Accepted: 07/14/2022] [Indexed: 11/13/2022] Open
Abstract
Neuropeptides (NPs) are a particular class of informative substances in the immune system and physiological regulation. They play a crucial role in regulating physiological functions in various biological growth and developmental stages. In addition, NPs are crucial for developing new drugs for the treatment of neurological diseases. With the development of molecular biology techniques, some data-driven tools have emerged to predict NPs. However, it is necessary to improve the predictive performance of these tools for NPs. In this study, we developed a deep learning model (NeuroPred-CLQ) based on the temporal convolutional network (TCN) and multi-head attention mechanism to identify NPs effectively and translate the internal relationships of peptide sequences into numerical features by the Word2vec algorithm. The experimental results show that NeuroPred-CLQ learns data information effectively, achieving 93.6% accuracy and 98.8% AUC on the independent test set. The model has better performance in identifying NPs than the state-of-the-art predictors. Visualization of features using t-distributed stochastic neighbor embedding (t-SNE) shows that NeuroPred-CLQ can clearly distinguish the positive NPs from the negative ones. We believe that NeuroPred-CLQ can facilitate drug development and clinical trial studies to treat neurological disorders.
Affiliation(s)
- Shouzhi Chen
- School of Mathematics and System Science, Xinjiang University , Urumqi, China
| | - Qing Li
- School of Mathematics and System Science, Xinjiang University , Urumqi, China
| | - Jianping Zhao
- School of Mathematics and System Science, Xinjiang University , Urumqi, China
| | - Yannan Bin
- School of Computer Science and Technology, Anhui University , Hefei, China
| | - Chunhou Zheng
- School of Mathematics and System Science, Xinjiang University , Urumqi, China
- School of Computer Science and Technology, Anhui University , Hefei, China
|