1
Asim MN, Asif T, Hassan F, Dengel A. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford) 2025; 2025:baaf027. [PMID: 40448683 DOI: 10.1093/database/baaf027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Revised: 02/06/2025] [Accepted: 03/26/2025] [Indexed: 06/02/2025]
Abstract
Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to move from wet-lab experiments to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between the two fields, various review articles have been written. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of the wide range of tasks and applications. To meet the need for a comprehensive review that presents a holistic view of this wide array of tasks and applications, this manuscript makes the following contributions:
- It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks.
- It equips AI researchers with the biological foundations of 63 protein sequence analysis tasks.
- It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases.
- It presents a rich data landscape, encompassing 627 benchmark datasets of 63 diverse protein sequence analysis tasks.
- It highlights the utilization of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications.
- It accelerates the development of AI-driven applications by reporting current state-of-the-art performance across 63 protein sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Tayyaba Asif
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Faiza Hassan
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
2
Yang X, Liao M, Ye B, Xia J, Zhao J. iEnhancer-GDM: A Deep Learning Framework Based on Generative Adversarial Network and Multi-head Attention Mechanism to Identify Enhancers and Their Strength. Interdiscip Sci 2025:10.1007/s12539-025-00703-9. [PMID: 40335860 DOI: 10.1007/s12539-025-00703-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 03/13/2025] [Accepted: 03/13/2025] [Indexed: 05/09/2025]
Abstract
Enhancers are short DNA fragments capable of significantly increasing the frequency of gene transcription. They often exert their effects on targeted genes over long distances, either in cis or in trans configurations. Identifying enhancers poses a challenge due to their variable position and sensitivities. Genetic variants within enhancer regions have been implicated in human diseases, highlighting the critical importance of enhancer identification and strength prediction. Here, we develop a two-layer predictor named iEnhancer-GDM to identify enhancers and to predict enhancer strength. To address the challenges posed by the limited size of the enhancer training dataset, which could cause issues such as model overfitting and low classification accuracy, we introduce a Wasserstein generative adversarial network (WGAN-GP) to augment the dataset. We employ a dna2vec embedding layer to encode raw DNA sequences into numerical feature representations, and then integrate a multi-scale convolutional neural network, a bidirectional long short-term memory network and a multi-head attention mechanism for feature representation and classification. Our results validate the effectiveness of data augmentation with WGAN-GP. Our model iEnhancer-GDM achieves superior performance on an independent test dataset, outperforming existing models with improvements of 2.45% for enhancer identification and 11.5% for enhancer strength prediction. iEnhancer-GDM advances precise enhancer identification and strength prediction, thereby helping to understand the functions of enhancers and their associations in genomics.
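The embedding step mentioned in this abstract starts from k-mers, the "words" of a DNA sequence. The following is a minimal sketch of that tokenization step only, not the authors' pipeline: the function names `kmer_tokens` and `build_vocab`, the fixed `k=3`, and the stride of 1 are illustrative assumptions, and a real dna2vec layer would map each index to a learned vector.

```python
def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (hypothetical helper)."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct k-mer an integer index in order of first appearance."""
    vocab: dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


tokens = kmer_tokens("ACGTAC", k=3)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]  # integer ids an embedding layer would look up
print(tokens, ids)
```

The integer ids are what an embedding layer would consume; the choice of k and stride is a modelling decision the abstract does not specify.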
Affiliation(s)
- Xiaomei Yang
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, 830017, China
- Meng Liao
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, 830017, China
- Bin Ye
  - Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China.
  - School of Computer Science and Technology, Anhui University, Hefei, 230601, China.
- Junfeng Xia
  - Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China.
- Jianping Zhao
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, 830017, China
3
Olenyi T, Marquet C, Grekova A, Houri L, Heinzinger M, Dallago C, Rost B. TMVisDB: Annotation and 3D-visualization of Transmembrane Proteins. J Mol Biol 2025:168997. [PMID: 40133784 DOI: 10.1016/j.jmb.2025.168997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 02/07/2025] [Accepted: 02/07/2025] [Indexed: 03/27/2025]
Abstract
Since the rise of cellular life, transmembrane proteins (TMPs) have been crucial to various cellular processes through their central role as gates and gatekeepers. Despite their importance, experimental high-resolution structures for TMPs remain underrepresented due to experimental challenges. Given their leap in performance, structure prediction methods have begun to close the gap. However, identifying the membrane regions and topology in three-dimensional structure files on a large scale still requires additional in silico predictions. Here, we introduce TMVisDB to sieve through millions of predicted structures for TMPs. This resource enables both browsing through 46 million predicted TMPs and visualizing them along with their topological annotations without having to tap into costly AlphaFold3-style predictions. TMVisDB joins AlphaFoldDB structure predictions and transmembrane topology predictions from the protein language model (pLM) based method TMbed. We showcase the utility of TMVisDB for the analysis of proteins through two use cases, namely the B-lymphocyte antigen CD20 (Homo sapiens) and the cellulose synthase (Novosphingobium sp. P6W). We demonstrate the value of TMVisDB for large-scale analyses through findings pertaining to all TMPs predicted for the human proteome. TMVisDB is freely available at https://tmvisdb.rostlab.org.
Affiliation(s)
- Tobias Olenyi
  - School of Computation, Information, and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany.
- Céline Marquet
  - School of Computation, Information, and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany
- Anastasia Grekova
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), 69117 Heidelberg, Germany; TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
- Leen Houri
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
- Michael Heinzinger
  - School of Computation, Information, and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany
- Christian Dallago
  - School of Computation, Information, and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany; NVIDIA DE GmbH, Einsteinstraße 172, 81677 Munich, Germany
- Burkhard Rost
  - School of Computation, Information, and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany.
4
Charoenkwan P, Chumnanpuen P, Schaduangrat N, Shoombuatong W. Deepstack-ACE: A deep stacking-based ensemble learning framework for the accelerated discovery of ACE inhibitory peptides. Methods 2025; 234:131-140. [PMID: 39709069 DOI: 10.1016/j.ymeth.2024.12.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2024] [Revised: 11/27/2024] [Accepted: 12/07/2024] [Indexed: 12/23/2024] Open
Abstract
Identifying angiotensin-I-converting enzyme (ACE) inhibitory peptides accurately is crucial for understanding the primary factor that regulates the renin-angiotensin system and for providing guidance in developing new potential drugs. Given the inherent experimental complexities, using computational methods for in silico peptide identification could be indispensable for facilitating the high-throughput characterization of ACE inhibitory peptides. In this paper, we propose a novel deep stacking-based ensemble learning framework, termed Deepstack-ACE, to precisely identify ACE inhibitory peptides. In Deepstack-ACE, the input peptide sequences are fed into the word2vec embedding technique to generate sequence representations. Then, these representations are employed to train five powerful deep learning methods, including long short-term memory, convolutional neural network, multi-layer perceptron, gated recurrent unit network, and recurrent neural network, for the construction of base-classifiers. Finally, the optimized stacked model is constructed based on the best combination of selected base-classifiers. Benchmarking experiments showed that Deepstack-ACE attained a more accurate and robust identification of ACE inhibitory peptides compared to its base-classifiers and several conventional machine learning classifiers. Remarkably, in the independent test, our proposed model significantly outperformed the current state-of-the-art methods, with a balanced accuracy of 0.916, a sensitivity of 0.911, and a Matthews correlation coefficient of 0.826. Moreover, we developed a user-friendly web server for Deepstack-ACE, which is freely available at https://pmlabqsar.pythonanywhere.com/Deepstack-ACE. We anticipate that our proposed Deepstack-ACE model can provide a faster and reasonably accurate identification of ACE inhibitory peptides.
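The three metrics reported in this abstract (balanced accuracy, sensitivity, Matthews correlation coefficient) can all be derived from a binary confusion matrix. The sketch below shows the standard formulas; the function name `binary_metrics` and the confusion-matrix counts are made up for illustration and are not taken from the paper.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Illustrative counts only (hypothetical test set of 200 peptides).
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(metrics)
```

Balanced accuracy and MCC are often preferred over plain accuracy when positive and negative classes are imbalanced, which is common in peptide activity datasets.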
Affiliation(s)
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
- Pramote Chumnanpuen
  - Department of Zoology, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; Kasetsart University International College (KUIC), Kasetsart University, Bangkok 10900, Thailand
- Nalini Schaduangrat
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Watshara Shoombuatong
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
5
Brizuela CA, Liu G, Stokes JM, de la Fuente‐Nunez C. AI Methods for Antimicrobial Peptides: Progress and Challenges. Microb Biotechnol 2025; 18:e70072. [PMID: 39754551 PMCID: PMC11702388 DOI: 10.1111/1751-7915.70072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 11/18/2024] [Accepted: 12/16/2024] [Indexed: 01/06/2025] Open
Abstract
Antimicrobial peptides (AMPs) are promising candidates to combat multidrug-resistant pathogens. However, the high cost of extensive wet-lab screening has made AI methods for identifying and designing AMPs increasingly important, with machine learning (ML) techniques playing a crucial role. AI approaches have recently revolutionised this field by accelerating the discovery of new peptides with anti-infective activity, particularly in preclinical mouse models. Initially, classical ML approaches dominated the field, but recently there has been a shift towards deep learning (DL) models. Despite significant contributions, existing reviews have not thoroughly explored the potential of large language models (LLMs), graph neural networks (GNNs) and structure-guided AMP discovery and design. This review aims to fill that gap by providing a comprehensive overview of the latest advancements, challenges and opportunities in using AI methods, with a particular emphasis on LLMs, GNNs and structure-guided design. We discuss the limitations of current approaches and highlight the most relevant topics to address in the coming years for AMP discovery and design.
Affiliation(s)
- Gary Liu
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jonathan M. Stokes
  - Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Cesar de la Fuente‐Nunez
  - Machine Biology Group, Department of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Department of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Department of Chemistry, School of Arts and Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
6
Singh A, Tanwar M, Singh TP, Sharma S, Sharma P. An escape from ESKAPE pathogens: A comprehensive review on current and emerging therapeutics against antibiotic resistance. Int J Biol Macromol 2024; 279:135253. [PMID: 39244118 DOI: 10.1016/j.ijbiomac.2024.135253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 08/29/2024] [Accepted: 08/30/2024] [Indexed: 09/09/2024]
Abstract
The rise of antimicrobial resistance has positioned ESKAPE pathogens as a serious global health threat, primarily due to the limitations and frequent failures of current treatment options. This growing risk has spurred the scientific community to seek innovative antibiotic therapies and improved oversight strategies. This review aims to provide a comprehensive overview of the origins and resistance mechanisms of ESKAPE pathogens, while also exploring next-generation treatment strategies for these infections. In addition, it will address both traditional and novel approaches to combating antibiotic resistance, offering insights into potential new therapeutic avenues. Emerging research underscores the urgency of developing new antimicrobial agents and strategies to overcome resistance, highlighting the need for novel drug classes and combination therapies. Advances in genomic technologies and a deeper understanding of microbial pathogenesis are crucial in identifying effective treatments. Integrating precision medicine and personalized approaches could enhance therapeutic efficacy. The review also emphasizes the importance of global collaboration in surveillance and stewardship, as well as policy reforms, enhanced diagnostic tools, and public awareness initiatives, to address resistance on a worldwide scale.
Affiliation(s)
- Anamika Singh
  - Department of Biophysics, All India Institute of Medical Sciences, New Delhi 110029, India
- Mansi Tanwar
  - Department of Biophysics, All India Institute of Medical Sciences, New Delhi 110029, India
- T P Singh
  - Department of Biophysics, All India Institute of Medical Sciences, New Delhi 110029, India
- Sujata Sharma
  - Department of Biophysics, All India Institute of Medical Sciences, New Delhi 110029, India.
- Pradeep Sharma
  - Department of Biophysics, All India Institute of Medical Sciences, New Delhi 110029, India.
7
Banerjee P, Eulenstein O, Friedberg I. Discovering genomic islands in unannotated bacterial genomes using sequence embedding. Bioinformatics Advances 2024; 4:vbae089. [PMID: 38911822 PMCID: PMC11193100 DOI: 10.1093/bioadv/vbae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 05/26/2024] [Accepted: 06/11/2024] [Indexed: 06/25/2024]
Abstract
Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences.
Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes.
Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.
Affiliation(s)
- Priyanka Banerjee
  - Department of Computer Science, Iowa State University, Ames, IA 50011, United States
- Oliver Eulenstein
  - Department of Computer Science, Iowa State University, Ames, IA 50011, United States
- Iddo Friedberg
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, United States
8
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET- 1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF- IOS-1546218, R35GM122515, NSF CBET- 1728858, NIH R01AI130945, to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Affiliation(s)
- Tymor Hamamsy
  - Center for Data Science, New York University, New York, NY, USA
- James T Morton
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Robert Blackwell
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Daniel Berenberg
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
- Nicholas Carriero
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Julia Koehler Leman
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Kyunghyun Cho
  - Center for Data Science, New York University, New York, NY, USA.
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
  - Prescient Design, New York, NY, USA.
  - CIFAR, Toronto, Ontario, Canada.
- Richard Bonneau
  - Center for Data Science, New York University, New York, NY, USA.
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
  - Prescient Design, New York, NY, USA.
  - Department of Biology, New York University, New York, NY, USA.
9
Zhuang J, Huang X, Liu S, Gao W, Su R, Feng K. MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites. J Chem Inf Model 2024; 64:4322-4333. [PMID: 38733561 DOI: 10.1021/acs.jcim.3c02088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2024]
Abstract
Revealing the mechanisms that influence transcription factor binding specificity is the key to understanding gene regulation. In previous studies, DNA double helix structure and one-hot embedding have been used successfully to design computational methods for predicting transcription factor binding sites (TFBSs). However, although a DNA sequence can be viewed as a kind of biological language, word embedding representations from natural language processing have not been properly considered in TFBS prediction models. In our work, we integrate different types of DNA sequence features to design a multichannel deep learning framework, namely MulTFBS, in which one-hot encoding, word embedding encoding (which can incorporate contextual information and extract the global features of the sequences), and three-dimensional double-helix structural features are trained in separate channels. To extract high-level sequence information effectively, we use a spatial-temporal network that combines convolutional neural networks and bidirectional long short-term memory networks with an attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with an average R² of 0.698 and an average PCC of 0.833, which are 5.4% and 3.2% higher, respectively, than the second-best method, CRPTS. In addition, we evaluate the classification performance of MulTFBS for distinguishing bound or unbound regions on TF ChIP-seq data. The results show that our framework also performs well in the TFBS classification tasks.
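Two ingredients named in this abstract are easy to make concrete: the one-hot channel and the two regression metrics. The sketch below is illustrative only. It assumes that R² means the coefficient of determination and PCC the Pearson correlation coefficient (the abstract does not define them), and the helper names and toy values are hypothetical.

```python
import math

BASES = "ACGT"


def one_hot(sequence: str) -> list[list[int]]:
    """Encode DNA as a len(sequence) x 4 matrix; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient (PCC) between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


encoded = one_hot("ACGN")
y_true = [1.0, 2.0, 3.0, 4.0]   # toy binding intensities, not real data
y_pred = [1.1, 1.9, 3.2, 3.8]   # toy model predictions
r2 = r_squared(y_true, y_pred)
pcc = pearson(y_true, y_pred)
print(encoded, round(r2, 3), round(pcc, 3))
```

Note that R² penalizes systematic offsets between prediction and target while PCC does not, which is one reason regression benchmarks often report both.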
Affiliation(s)
- Jujuan Zhuang
  - The School of Science, Dalian Maritime University, Dalian 116026, China
- Xinru Huang
  - The School of Science, Dalian Maritime University, Dalian 116026, China
- Shuhan Liu
  - The School of Science, Dalian Maritime University, Dalian 116026, China
- Wanquan Gao
  - The School of Science, Dalian Maritime University, Dalian 116026, China
- Rui Su
  - The School of Science, Dalian Maritime University, Dalian 116026, China
- Kexin Feng
  - The School of Science, Dalian Maritime University, Dalian 116026, China
10
Greer SF, Rabiey M, Studholme DJ, Grant M. The potential of bacteriocins and bacteriophages to control bacterial disease of crops with a focus on Xanthomonas spp. J R Soc N Z 2024; 55:302-326. [PMID: 39677383 PMCID: PMC11639067 DOI: 10.1080/03036758.2024.2345315] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/14/2024] [Accepted: 04/02/2024] [Indexed: 12/17/2024]
Abstract
Crop production plays a crucial role in ensuring global food security and maintaining economic stability. The presence of bacterial phytopathogens, particularly Xanthomonas species (a key focus of this review), poses significant threats to crops, leading to substantial economic losses. Current control strategies, such as the use of chemicals and antibiotics, face challenges such as environmental impact and the development of antimicrobial resistance. This review discusses the potential of bacteriocins (bacterially derived proteinaceous antimicrobials) and bacteriophages (viruses that target bacteria) as sustainable alternatives for effectively managing Xanthomonas diseases. We focus on the diversity of bacteriocins found within xanthomonads by identifying and predicting the structures of candidate bacteriocin genes from publicly available genome sequences using BAGEL4 and AlphaFold. Harnessing the power of bacteriocins and bacteriophages has great potential as an eco-friendly and sustainable approach for precision control of Xanthomonas diseases in agriculture. However, realising the full potential of these natural antimicrobials requires continued research, field trials and collaboration among scientists, regulators and farmers. This collective effort is crucial to establishing these alternatives as promising substitutes for traditional disease management methods.
Affiliation(s)
- Shannon F. Greer
  - School of Life Sciences, University of Warwick, Innovation Campus, Stratford-upon-Avon, UK
- Mojgan Rabiey
  - School of Life Sciences, University of Warwick, Innovation Campus, Stratford-upon-Avon, UK
  - School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, UK
- Murray Grant
  - School of Life Sciences, University of Warwick, Innovation Campus, Stratford-upon-Avon, UK
  - School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, UK
11
Harrigan WL, Ferrell BD, Wommack KE, Polson SW, Schreiber ZD, Belcaid M. Improvements in viral gene annotation using large language models and soft alignments. BMC Bioinformatics 2024; 25:165. [PMID: 38664627 PMCID: PMC11046836 DOI: 10.1186/s12859-024-05779-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 04/12/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings.
RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect.
CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
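The idea of replacing a substitution matrix with embedding similarity at the residue level can be sketched with a Smith-Waterman-style local alignment. This is not the authors' algorithm: the cosine-minus-offset scoring, the linear gap penalty, the parameter values, and the two-dimensional toy "embeddings" are all assumptions chosen to keep the example small.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two per-residue embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_local_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Local alignment where the match score is cosine(emb_a[i], emb_b[j]) - offset.

    The offset (hypothetical) makes dissimilar residues score negatively,
    playing the role a substitution matrix plays in classic alignment.
    """
    n, m = len(emb_a), len(emb_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell to recover aligned residue index pairs.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
        if math.isclose(H[i][j], H[i - 1][j - 1] + s):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


# Toy per-residue embeddings (real ones would come from a protein language model).
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[0.0, 1.0], [1.0, 1.0]]
score, pairs = soft_local_align(emb_a, emb_b)
print(score, pairs)
```

Because the traceback returns explicit residue pairs, the result can be rendered like a conventional alignment, which is the interpretability property the abstract emphasizes over pooled-embedding comparisons.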
Affiliation(s)
- William L Harrigan
- Hawai'i Institute of Marine Biology, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA
- Barbra D Ferrell
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- K Eric Wommack
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Shawn W Polson
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19713, USA
- Zachary D Schreiber
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Mahdi Belcaid
- Department of Computer Science, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA.
|
12
|
Akhter S, Miller JH. BPAGS: a web application for bacteriocin prediction via feature evaluation using alternating decision tree, genetic algorithm, and linear support vector classifier. Frontiers in Bioinformatics 2024; 3:1284705. [PMID: 38268970 PMCID: PMC10807691 DOI: 10.3389/fbinf.2023.1284705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 12/12/2023] [Indexed: 01/26/2024] Open
Abstract
The use of bacteriocins has emerged as a propitious strategy in the development of new drugs to combat antibiotic resistance, given their ability to kill bacteria with both broad and narrow natural spectra. Hence, a compelling requirement arises for a precise and efficient computational model that can accurately predict novel bacteriocins. Machine learning's ability to learn patterns and features from bacteriocin sequences that are difficult to capture using sequence matching-based methods makes it a potentially superior choice for accurate prediction. A web application for predicting bacteriocin was created in this study, utilizing a machine learning approach. The feature sets employed in the application were chosen using alternating decision tree (ADTree), genetic algorithm (GA), and linear support vector classifier (linear SVC)-based feature evaluation methods. Initially, potential features were extracted from the physicochemical, structural, and sequence-profile attributes of both bacteriocin and non-bacteriocin protein sequences. We assessed the candidate features first using the Pearson correlation coefficient, followed by separate evaluations with ADTree, GA, and linear SVC to eliminate unnecessary features. Finally, we constructed random forest (RF), support vector machine (SVM), decision tree (DT), logistic regression (LR), k-nearest neighbors (KNN), and Gaussian naïve Bayes (GNB) models using reduced feature sets. We obtained the overall top performing model using SVM with ADTree-reduced features, achieving an accuracy of 99.11% and an AUC value of 0.9984 on the testing dataset. We also assessed the predictive capabilities of our best-performing models for each reduced feature set relative to our previously developed software solution, a sequence alignment-based tool, and a deep-learning approach. 
A web application, titled BPAGS (Bacteriocin Prediction based on ADTree, GA, and linear SVC), was developed to incorporate the predictive models built using ADTree, GA, and linear SVC-based feature sets. Currently, the web-based tool provides classification results with associated probability values and has options to add new samples in the training data to improve the predictive efficacy. BPAGS is freely accessible at https://shiny.tricities.wsu.edu/bacteriocin-prediction/.
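The first screening step mentioned in this abstract, assessing candidate features with the Pearson correlation coefficient, can be sketched as follows. The abstract does not say how the coefficient was applied, so this sketch assumes a pairwise feature-to-feature filter that drops any feature strongly correlated with one already kept; the function names and the 0.9 threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Greedy redundancy filter.

    columns:   dict mapping feature name -> list of values (one per sequence).
    threshold: assumed cutoff on |r|; not taken from the paper.
    Returns the names of the features that are kept, in input order.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

The surviving features would then go to a second-stage selector (ADTree, GA, or linear SVC in the abstract) before model training.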
Affiliation(s)
- Suraiya Akhter
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States
- School of Engineering and Applied Sciences, Washington State University Tri-Cities, Richland, WA, United States
- John H. Miller
- School of Engineering and Applied Sciences, Washington State University Tri-Cities, Richland, WA, United States
|
13
|
Aguilera-Puga MDC, Cancelarich NL, Marani MM, de la Fuente-Nunez C, Plisson F. Accelerating the Discovery and Design of Antimicrobial Peptides with Artificial Intelligence. Methods Mol Biol 2024; 2714:329-352. [PMID: 37676607 DOI: 10.1007/978-1-0716-3441-7_18] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Indexed: 09/08/2023]
Abstract
Peptides modulate many processes of human physiology targeting ion channels, protein receptors, or enzymes. They represent valuable starting points for the development of new biologics against communicable and non-communicable disorders. However, turning native peptide ligands into druggable materials requires high selectivity and efficacy, predictable metabolism, and good safety profiles. Machine learning models have gradually emerged as cost-effective and time-saving solutions to predict and generate new proteins with optimal properties. In this chapter, we will discuss the evolution and applications of predictive modeling and generative modeling to discover and design safe and effective antimicrobial peptides. We will also present their current limitations and suggest future research directions, applicable to peptide drug design campaigns.
Affiliation(s)
- Mariana D C Aguilera-Puga
- Centro de Investigación y de Estudios Avanzados del IPN (CINVESTAV-IPN), Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Irapuato, Guanajuato, Mexico
- CINVESTAV-IPN, Unidad Irapuato, Departamento de Biotecnología y Bioquímica, Irapuato, Guanajuato, Mexico
- Natalia L Cancelarich
- Instituto Patagónico para el Estudio de los Ecosistemas Continentales (IPEEC), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Puerto Madryn, Argentina
- Mariela M Marani
- Instituto Patagónico para el Estudio de los Ecosistemas Continentales (IPEEC), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Puerto Madryn, Argentina
- Cesar de la Fuente-Nunez
- Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA.
- Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA.
- Fabien Plisson
- Centro de Investigación y de Estudios Avanzados del IPN (CINVESTAV-IPN), Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Irapuato, Guanajuato, Mexico.
- CINVESTAV-IPN, Unidad Irapuato, Departamento de Biotecnología y Bioquímica, Irapuato, Guanajuato, Mexico.
|
14
|
Wang J, Zhang H, Chen N, Zeng T, Ai X, Wu K. PorcineAI-Enhancer: Prediction of Pig Enhancer Sequences Using Convolutional Neural Networks. Animals (Basel) 2023; 13:2935. [PMID: 37760334 PMCID: PMC10526013 DOI: 10.3390/ani13182935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 08/21/2023] [Accepted: 09/05/2023] [Indexed: 09/29/2023] Open
Abstract
Understanding the mechanisms of gene expression regulation is crucial in animal breeding. Cis-regulatory DNA sequences, such as enhancers, play a key role in regulating gene expression. Identifying enhancers is challenging, despite the use of experimental techniques and computational methods. Enhancer prediction in the pig genome is particularly significant due to the costliness of high-throughput experimental techniques. The study constructed a high-quality database of pig enhancers by integrating information from multiple sources. A deep learning prediction framework called PorcineAI-enhancer was developed for the prediction of pig enhancers. This framework employs convolutional neural networks for feature extraction and classification. PorcineAI-enhancer showed excellent performance in predicting pig enhancers, validated on an independent test dataset. The model demonstrated reliable prediction capability for unknown enhancer sequences and performed remarkably well on tissue-specific enhancer sequences. The study developed a deep learning prediction framework, PorcineAI-enhancer, for predicting pig enhancers. The model demonstrated significant predictive performance and potential for tissue-specific enhancers. This research provides valuable resources for future studies on gene expression regulation in pigs.
Affiliation(s)
- Ji Wang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Han Zhang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Nanzhu Chen
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Tong Zeng
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Xiaohua Ai
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
- Keliang Wu
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China
|
15
|
Akhter S, Miller JH. BaPreS: a software tool for predicting bacteriocins using an optimal set of features. BMC Bioinformatics 2023; 24:313. [PMID: 37592230 PMCID: PMC10433575 DOI: 10.1186/s12859-023-05330-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/27/2022] [Accepted: 05/09/2023] [Indexed: 08/19/2023] Open
Abstract
BACKGROUND Antibiotic resistance is a major public health concern around the globe. As a result, researchers continually look for new compounds to develop antibiotic drugs for combating antibiotic-resistant bacteria. Bacteriocins are promising antimicrobial agents for fighting antibiotic resistance because they exhibit both broad and narrow killing spectra. Sequence matching methods are widely used to identify bacteriocins by comparing them with known bacteriocin sequences; however, these methods often fail to detect new bacteriocin sequences due to their high diversity. A machine learning approach can help find new, highly dissimilar bacteriocins for developing highly effective antibiotic drugs. The aim of this work is to develop a machine learning-based software tool called BaPreS (Bacteriocin Prediction Software) using an optimal set of features for detecting bacteriocin protein sequences with high accuracy. We extracted potential features from known bacteriocin and non-bacteriocin sequences by considering the physicochemical and structural properties of the protein sequences. Then we reduced the feature set using statistical justification and the recursive feature elimination technique. Finally, we built support vector machine (SVM) and random forest (RF) models using the selected features and utilized the best machine learning model to implement the software tool. RESULTS We applied BaPreS to an established dataset and evaluated its prediction performance. The results show that the software tool can achieve a prediction accuracy of 95.54% for testing protein sequences. This tool allows users to add new bacteriocin or non-bacteriocin sequences to the training dataset to further enhance the predictive power of the tool. We compared the prediction performance of BaPreS with a popular sequence matching-based tool and a deep learning-based method, and our software tool outperformed both.
CONCLUSIONS BaPreS is a bacteriocin prediction tool that can be used to discover new highly dissimilar bacteriocins for developing highly effective antibiotic drugs. This software tool can be used with Windows, Linux and macOS operating systems. The open-source software package and its user manual are available at https://github.com/suraiya14/BaPreS .
Affiliation(s)
- Suraiya Akhter
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA.
- School of Engineering and Applied Sciences, Washington State University Tri-Cities, Richland, WA, USA.
- John H Miller
- School of Engineering and Applied Sciences, Washington State University Tri-Cities, Richland, WA, USA.
|
16
|
Zouhir A, Souiai O, Harigua E, Cherif A, Chaalia AB, Sebei K. ANTIPSEUDOBASE: Database of Antimicrobial Peptides and Essential Oils Against Pseudomonas. Int J Pept Res Ther 2023. [DOI: 10.1007/s10989-023-10511-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2023]
|
17
|
Walsh AM, Leech J, Huttenhower C, Delhomme-Nguyen H, Crispie F, Chervaux C, Cotter P. Integrated molecular approaches for fermented food microbiome research. FEMS Microbiol Rev 2023; 47:fuad001. [PMID: 36725208 PMCID: PMC10002906 DOI: 10.1093/femsre/fuad001] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 02/08/2022] [Revised: 12/28/2022] [Accepted: 01/09/2023] [Indexed: 02/03/2023] Open
Abstract
Molecular technologies, including high-throughput sequencing, have expanded our perception of the microbial world. Unprecedented insights into the composition and function of microbial communities have generated large interest, with numerous landmark studies published in recent years relating the important roles of microbiomes and the environment (especially diet and nutrition) in human, animal, and global health. As such, food microbiomes represent an important cross-over between the environment and host. This is especially true of fermented food microbiomes, which actively introduce microbial metabolites and, to a lesser extent, live microbes into the human gut. Here, we discuss the history of fermented foods, and examine how molecular approaches have advanced research of these fermented foods over the past decade. We highlight how various molecular approaches have helped us to understand the ways in which microbes shape the qualities of these products, and we summarize the impacts of consuming fermented foods on the gut. Finally, we explore how advances in bioinformatics could be leveraged to enhance our understanding of fermented foods. This review highlights how integrated molecular approaches are changing our understanding of the microbial communities associated with food fermentation, the creation of unique food products, and their influences on the human microbiome and health.
Affiliation(s)
- Aaron M Walsh
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork and APC Microbiome Ireland, P61 C996, Ireland
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- John Leech
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork and APC Microbiome Ireland, P61 C996, Ireland
- Curtis Huttenhower
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Fiona Crispie
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork and APC Microbiome Ireland, P61 C996, Ireland
- Christian Chervaux
- Danone Nutricia Research, Centre Daniel Carasso, Palaiseau 91120, France
- Paul D Cotter
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork and APC Microbiome Ireland, P61 C996, Ireland
|
18
|
BADASS: BActeriocin-Diversity ASsessment Software. BMC Bioinformatics 2023; 24:24. [PMID: 36670373 PMCID: PMC9854158 DOI: 10.1186/s12859-022-05106-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/01/2022] [Accepted: 12/07/2022] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Bacteriocins are defined as thermolabile peptides produced by bacteria with biological activity against taxonomically related species. These antimicrobial peptides have a wide application including disease treatment, food conservation, and probiotics. However, even with a large industrial and biotechnological application potential, these peptides are still poorly studied and explored. BADASS is software with a user-friendly graphical interface applied to the search and analysis of bacteriocin diversity in whole-metagenome shotgun sequencing data. RESULTS The search for bacteriocin sequences is performed with tools such as BLAST or DIAMOND using the BAGEL4 database as a reference. The putative bacteriocin sequences identified are used to determine the abundance and richness of the three classes of bacteriocins. Abundance is calculated by comparing the reads identified as bacteriocins to the reads identified as 16S rRNA gene using SILVA database as a reference. BADASS has a complete pipeline that starts with the quality assessment of the raw data. At the end of the analysis, BADASS generates several plots of richness and abundance automatically as well as tabular files containing information about the main bacteriocins detected. The user is able to change the main parameters of the analysis in the graphical interface. To demonstrate how the software works, we used four datasets from WMS studies using default parameters. Lantibiotics were the most abundant bacteriocins in the four datasets. This class of bacteriocin is commonly produced by Streptomyces sp. CONCLUSIONS With a user-friendly graphical interface and a complete pipeline, BADASS proved to be a powerful tool for prospecting bacteriocin sequences in Whole-Metagenome Shotgun Sequencing (WMS) data. This tool is publicly available at https://sourceforge.net/projects/badass/ .
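The abundance and richness summary described in this abstract can be sketched roughly as below. This is not BADASS code: it assumes that abundance is the ratio of bacteriocin-assigned reads to 16S rRNA-assigned reads, and that richness is the number of distinct bacteriocins seen per class. The input format (one `(name, class)` tuple per matching read) and the function name are hypothetical.

```python
def summarize_bacteriocins(hits, n_16s_reads):
    """Per-class abundance and richness from read-level search hits.

    hits:        list of (bacteriocin_name, bacteriocin_class) tuples,
                 one per read matched against the reference database.
    n_16s_reads: number of reads identified as 16S rRNA gene, used here
                 as the normalizer (an assumption about the exact ratio).
    """
    if n_16s_reads <= 0:
        raise ValueError("need at least one 16S rRNA read to normalize")

    per_class = {}
    for name, cls in hits:
        entry = per_class.setdefault(cls, {"reads": 0, "names": set()})
        entry["reads"] += 1
        entry["names"].add(name)

    return {
        cls: {
            "abundance": entry["reads"] / n_16s_reads,  # reads per 16S read
            "richness": len(entry["names"]),            # distinct bacteriocins
        }
        for cls, entry in per_class.items()
    }
```

Normalizing by 16S rRNA reads makes samples with different sequencing depths and bacterial loads comparable, which is presumably why the abstract uses that reference.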
|
19
|
Hou Z, Yang Y, Ma Z, Wong KC, Li X. Learning the protein language of proteome-wide protein-protein binding sites via explainable ensemble deep learning. Commun Biol 2023; 6:73. [PMID: 36653447 PMCID: PMC9849350 DOI: 10.1038/s42003-023-04462-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 06/20/2022] [Accepted: 01/11/2023] [Indexed: 01/20/2023] Open
Abstract
Protein-protein interactions (PPIs) govern cellular pathways and processes, by significantly influencing the functional expression of proteins. Therefore, accurate identification of protein-protein interaction binding sites has become a key step in the functional analysis of proteins. However, since most computational methods are designed based on biological features, there are no available protein language models to directly encode amino acid sequences into distributed vector representations to model their characteristics for protein-protein binding events. Moreover, the number of experimentally detected protein interaction sites is much smaller than that of protein-protein interactions or protein sites in protein complexes, resulting in unbalanced data sets that leave room for improvement in their performance. To address these problems, we develop an ensemble deep learning model (EDLM)-based protein-protein interaction (PPI) site identification method (EDLMPPI). Evaluation results show that EDLMPPI outperforms state-of-the-art techniques including several PPI site prediction models on three widely-used benchmark datasets including Dset_448, Dset_72, and Dset_164, which demonstrated that EDLMPPI is superior to those PPI site prediction models by nearly 10% in terms of average precision. In addition, the biological and interpretable analyses provide new insights into protein binding site identification and characterization mechanisms from different perspectives. The EDLMPPI webserver is available at http://www.edlmppi.top:5002/ .
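Because this abstract reports its gain in terms of average precision on imbalanced binding-site data, a short sketch of that metric may help. It uses the common ranking definition (mean of precision at each true positive, ordered by descending score); whether the paper computed it in exactly this way is an assumption, and the function name is hypothetical.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = binding site, 0 = not).

    Residues are ranked by descending score; precision is sampled at the
    rank of every true positive and averaged over all positives.
    Tied scores keep their input order (a simplification).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0

    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos
```

Unlike accuracy, this score does not reward a model for labeling every residue as non-binding, which is why it suits datasets where interaction sites are rare.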
Affiliation(s)
- Zilong Hou
- School of Artificial Intelligence, Jilin University, Jilin, China
- Yuning Yang
- Information Science and Technology, Northeast Normal University, Jilin, China
- Zhiqiang Ma
- Information Science and Technology, Northeast Normal University, Jilin, China
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China.
|
20
|
Ortiz-Vilchis P, De-la-Cruz-García JS, Ramirez-Arellano A. Identification of Relevant Protein Interactions with Partial Knowledge: A Complex Network and Deep Learning Approach. Biology 2023; 12:140. [PMID: 36671832 PMCID: PMC9856098 DOI: 10.3390/biology12010140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 01/11/2023] [Accepted: 01/12/2023] [Indexed: 01/18/2023]
Abstract
Protein-protein interactions (PPIs) are the basis for understanding most cellular events in biological systems. Several experimental methods, e.g., biochemical, molecular, and genetic methods, have been used to identify protein-protein associations. However, some of them, such as mass spectrometry, are time-consuming and expensive. Machine learning (ML) techniques have been widely used to characterize PPIs, increasing the number of proteins analyzed simultaneously and optimizing time and resources for identifying and predicting protein-protein functional linkages. Previous ML approaches have focused on well-known networks or specific targets but not on identifying relevant proteins with partial or null knowledge of the interaction networks. The proposed approach aims to generate a relevant protein sequence based on bidirectional Long Short-Term Memory (LSTM) with partial knowledge of interactions. The general framework comprises conducting a scale-free and fractal complex network analysis. The outcome of these analyses is then used to fine-tune the fractal method for the vital protein extraction of PPI networks. The results show that several PPI networks are self-similar or fractal, but that both features cannot coexist. The generated protein sequences (by the bidirectional LSTM) also contain an average of 39.5% of proteins in the original sequence. The average length of the generated sequences was 17% of the original one. Finally, 95% of the generated sequences were true.
Affiliation(s)
- Pilar Ortiz-Vilchis
- Sección de Estudios de Posgrado e Investigación, Escuela Superior de Medicina, Instituto Politécnico Nacional, Mexico City 11340, Mexico
- Jazmin-Susana De-la-Cruz-García
- Sección de Estudios de Posgrado e Investigación, Unidad Profesional Interdisciplinaria de Ingeniería y Ciencias Sociales y Administrativas, Instituto Politécnico Nacional, Mexico City 08400, Mexico
- Aldo Ramirez-Arellano
- Sección de Estudios de Posgrado e Investigación, Unidad Profesional Interdisciplinaria de Ingeniería y Ciencias Sociales y Administrativas, Instituto Politécnico Nacional, Mexico City 08400, Mexico
|
21
|
Talat A, Khan AU. Artificial intelligence as a smart approach to develop antimicrobial drug molecules: A paradigm to combat drug-resistant infections. Drug Discov Today 2023; 28:103491. [PMID: 36646245 DOI: 10.1016/j.drudis.2023.103491] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 08/31/2022] [Revised: 01/01/2023] [Accepted: 01/05/2023] [Indexed: 01/15/2023]
Abstract
Antimicrobial resistance (AMR) is a silent pandemic with the third highest global mortality. The antibiotic development pipeline is scarce even though AMR has escalated uncontrollably. Artificial intelligence (AI) is a revolutionary approach, accelerating drug discovery because of its fast pace, cost efficiency, lower labor requirements, and fewer chances of failure. AI has been used to discover several beta-lactamase inhibitors and antibiotic alternatives from antimicrobial peptides (AMPs), nonribosomal peptides, bacteriocins, and marine natural products. The significant recent increase in the use of AI platforms by pharmaceutical companies could result in the discovery of efficient antibiotic alternatives with lower chances of resistance generation.
Affiliation(s)
- Absar Talat
- Medical Microbiology and Molecular Biology Laboratory, Interdisciplinary Biotechnology Unit, Aligarh Muslim University, Aligarh, India
- Asad U Khan
- Medical Microbiology and Molecular Biology Laboratory, Interdisciplinary Biotechnology Unit, Aligarh Muslim University, Aligarh, India.
|
22
|
Accurate Prediction of Anti-hypertensive Peptides Based on Convolutional Neural Network and Gated Recurrent unit. Interdiscip Sci 2022; 14:879-894. [PMID: 35474167 DOI: 10.1007/s12539-022-00521-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/05/2021] [Revised: 03/30/2022] [Accepted: 04/06/2022] [Indexed: 12/30/2022]
Abstract
Hypertension (HT) is a common disease and one of the major causes of cardiovascular disease. High blood pressure can lead to several conditions, including impaired heart and kidney function, cerebral hemorrhage, and myocardial infarction. Because of the limitations of laboratory methods, identifying bioactive peptides for the treatment of HT takes a long time, so the identification of anti-hypertensive peptides (AHTPs) is of immediate significance. With the prevalence of machine learning, it has been suggested as a supplementary method for AHTP classification. We therefore develop a new model to identify AHTPs based on multiple features and deep learning. The deep model combines a convolutional neural network (CNN) and a gated recurrent unit (GRU). The convolution structure is used to reduce the feature dimension and running time. The data processed by the CNN are fed into the recurrent GRU structure, where important information is filtered through the reset gate and update gate. Finally, the output layer adopts a sigmoid activation function. For feature extraction, we use Kmer, the deviation between the dipeptide frequency and the expected mean (DDE), encoding based on grouped weight (EBGW), enhanced grouped amino acid composition (EGAAC), and dipeptide binary profile and frequency (DBPF). Kmer, DDE, EBGW, and EGAAC are widely used in protein research. DBPF is a new feature representation method designed by us: it maps dipeptides to binary numbers and produces a binary coding file and a frequency file. These features are then concatenated and input into our proposed model for prediction and analysis. In tenfold cross-validation, the model outperforms previous methods, with accuracies of 96.23% and 99.10%, respectively, a substantial improvement over those methods.
It shows that the combination of convolution calculation and recurrent structure has a positive impact on the classification of AHTPs. The results show that this method is a feasible, efficient and competitive sequence analysis tool for AHTPs. Meanwhile, we design a friendly online prediction tool and it is freely accessible at http://ahtps.zhanglab.site/ .
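Of the descriptors listed in this abstract, DDE is compact enough to sketch. The abstract gives no formula, so the version below follows the definition commonly used in the peptide-encoding literature (observed dipeptide composition, standardized by a codon-based theoretical mean and variance). Treat it as an assumption about what the authors computed; the function name is hypothetical.

```python
import math
from itertools import product

# Number of codons encoding each amino acid in the standard genetic code
# (61 sense codons in total; stop codons excluded).
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}
TOTAL_CODONS = sum(CODONS.values())


def dde(sequence):
    """Dipeptide deviation from expected mean: 400 values keyed by dipeptide.

    Dipeptides containing non-standard residues still count toward the
    number of positions but get no feature of their own.
    """
    seq = sequence.upper()
    n_pairs = len(seq) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")

    counts = {}
    for a, b in zip(seq, seq[1:]):
        counts[a + b] = counts.get(a + b, 0) + 1

    features = {}
    for a, b in product(CODONS, repeat=2):
        dc = counts.get(a + b, 0) / n_pairs                       # observed
        tm = (CODONS[a] / TOTAL_CODONS) * (CODONS[b] / TOTAL_CODONS)  # expected
        tv = tm * (1 - tm) / n_pairs                              # variance
        features[a + b] = (dc - tm) / math.sqrt(tv)
    return features
```

The resulting 400 values could be concatenated with the other descriptors before being passed to a model, as the abstract describes for its combined feature vector.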
|
23
|
Zhang B, Zhao M, Tian J, Lei L, Huang R. Novel antimicrobial agents targeting the Streptococcus mutans biofilms discovery through computer technology. Front Cell Infect Microbiol 2022; 12:1065235. [PMID: 36530419 PMCID: PMC9751416 DOI: 10.3389/fcimb.2022.1065235] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/09/2022] [Accepted: 11/16/2022] [Indexed: 12/02/2022] Open
Abstract
Dental caries is one of the most prevalent and costly biofilm-associated infectious diseases worldwide. Streptococcus mutans (S. mutans) is well recognized as the major causative factor of dental caries due to its acidogenicity, aciduricity, and ability to synthesize extracellular polymeric substances (EPSs). EPSs are considered a virulence factor of cariogenic biofilm, enhancing biofilm resistance to antimicrobial agents and virulence compared with planktonic bacterial cells. Traditional anti-caries therapies, such as chlorhexidine and antibiotics, are characterized by side effects and drug resistance. With the development of computer technology, several novel approaches are being used to synthesize or discover antimicrobial agents. In this mini review, we summarize novel antimicrobial agents targeting S. mutans biofilms that were discovered through computer technology. Drug repurposing of small molecules expands the original medical indications and lowers drug development costs and risks. Computer-aided drug design (CADD) has been used to identify compounds with optimal interactions with the target via in silico screening and computational methods. Synthetic antimicrobial peptides (AMPs) based on rational design, computational design, or high-throughput screening have shown increased selectivity for both single- and multi-species biofilms. These methods provide potential therapeutic agents to promote targeted control of oral microbial biofilms in the near future.
Affiliation(s)
- Bin Zhang
- Key Laboratory of Shaanxi Province for Craniofacial Precision Medicine Research, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- Clinical Research Center of Shaanxi Province for Dental and Maxillofacial Diseases, Center of Oral Public Health, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- Min Zhao
- Key Laboratory of Shaanxi Province for Craniofacial Precision Medicine Research, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- Clinical Research Center of Shaanxi Province for Dental and Maxillofacial Diseases, Center of Oral Public Health, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- Jiangang Tian
- Key Laboratory of Shaanxi Province for Craniofacial Precision Medicine Research, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- Clinical Research Center of Shaanxi Province for Dental and Maxillofacial Diseases, Center of Oral Public Health, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- Lei Lei
- State Key Laboratory of Oral Diseases, Department of Preventive Dentistry, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- *Correspondence: Lei Lei; Ruizhe Huang
- Ruizhe Huang
- Key Laboratory of Shaanxi Province for Craniofacial Precision Medicine Research, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- Clinical Research Center of Shaanxi Province for Dental and Maxillofacial Diseases, Center of Oral Public Health, College of Stomatology, Xi’an Jiaotong University, Xi’an, China
- *Correspondence: Lei Lei; Ruizhe Huang
|
24
|
García-Jacas CR, García-González LA, Martinez-Rios F, Tapia-Contreras IP, Brizuela CA. Handcrafted versus non-handcrafted (self-supervised) features for the classification of antimicrobial peptides: complementary or redundant? Brief Bioinform 2022; 23:6754757. [PMID: 36215083 DOI: 10.1093/bib/bbac428] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/22/2022] [Revised: 08/28/2022] [Accepted: 09/02/2022] [Indexed: 12/14/2022] Open
Abstract
Antimicrobial peptides (AMPs) have received a great deal of attention given their potential to become a plausible option to fight multi-drug resistant bacteria as well as other pathogens. Quantitative sequence-activity models (QSAMs) have been helpful in discovering new AMPs because they make it possible to explore a large universe of peptide sequences and help reduce the number of wet-lab experiments. A main aspect of building QSAMs based on shallow learning is determining an optimal set of protein descriptors (features) required to discriminate between sequences with different antimicrobial activities. These features are generally handcrafted from peptide sequence datasets that are labeled with specific antimicrobial activities. However, recent developments have shown that unsupervised approaches can be used to determine features that outperform human-engineered (handcrafted) features. Thus, knowing which of these two approaches contributes to a better classification of AMPs is a fundamental question for designing more accurate models. Here, we present a systematic and rigorous study to compare both types of features. Experimental outcomes show that non-handcrafted features lead to better performance than handcrafted features. However, the experiments also show that performance improves further when both types of features are merged. A relevance analysis reveals that non-handcrafted features have higher information content than handcrafted features, while an interaction-based importance analysis reveals that handcrafted features are more important. These findings suggest that there is complementarity between both types of features. Comparisons with state-of-the-art deep models show that shallow models yield better performance both when fed with non-handcrafted features alone and when fed with non-handcrafted and handcrafted features together.
Affiliation(s)
- César R García-Jacas
  - Cátedras CONACYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Luis A García-González
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Issac P Tapia-Contreras
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Carlos A Brizuela
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
|
25
|
Liao M, Zhao JP, Tian J, Zheng CH. iEnhancer-DCLA: using the original sequence to identify enhancers and their strength based on a deep learning framework. BMC Bioinformatics 2022; 23:480. [PMCID: PMC9664816 DOI: 10.1186/s12859-022-05033-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/29/2022] [Accepted: 11/02/2022] [Indexed: 11/16/2022] Open
Abstract
Enhancers are small regions of DNA that bind to proteins, which enhance the transcription of genes. An enhancer may be located upstream or downstream of the gene. It is not necessarily close to the gene it acts on, because the entangled structure of chromatin allows positions far apart in the sequence to come into contact with each other. Therefore, identifying enhancers and their strength is a complex and challenging task. In this article, a new prediction method based on deep learning, called iEnhancer-DCLA, is proposed to identify enhancers and enhancer strength. First, we use word2vec to convert k-mers into numeric vectors to construct an input matrix. Second, we use a convolutional neural network and a bidirectional long short-term memory network to extract sequence features, and finally use the attention mechanism to extract relatively important features. In the task of predicting enhancers and their strengths, this method improves on most evaluation metrics to a certain extent. In summary, we believe that this method provides new ideas for the analysis of enhancers.
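The k-mer step mentioned in this abstract can be illustrated with a small sketch. It covers only the tokenization that would precede word2vec training, not the embedding model itself. The function names, the default k = 3 and the stride of 1 are assumptions made for illustration; the abstract does not state the values used by iEnhancer-DCLA.

```python
from collections import Counter


def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into k-mer "words" (overlapping when stride < k)."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocabulary(sequences: list[str], k: int = 3) -> dict[str, int]:
    """Map every k-mer seen in the corpus to an integer id, most frequent first."""
    counts = Counter(tok for s in sequences for tok in kmer_tokens(s, k))
    return {kmer: idx for idx, (kmer, _) in enumerate(counts.most_common())}


def encode(sequence: str, vocab: dict[str, int], k: int = 3) -> list[int]:
    """Turn a sequence into the list of k-mer ids; unknown k-mers are skipped."""
    return [vocab[tok] for tok in kmer_tokens(sequence, k) if tok in vocab]


if __name__ == "__main__":
    print(kmer_tokens("ACGTAC"))
    vocab = build_vocabulary(["ACGACG"])
    print(vocab, encode("ACGACG", vocab))
```

The resulting token lists are the kind of "sentences" an embedding method such as word2vec consumes; each id would then be replaced by its learned vector to form the input matrix.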
|
26
|
Sun TJ, Bu HL, Yan X, Sun ZH, Zha MS, Dong GF. LABAMPsGCN: A framework for identifying lactic acid bacteria antimicrobial peptides based on graph convolutional neural network. Front Genet 2022; 13:1062576. [PMID: 36406112 PMCID: PMC9669054 DOI: 10.3389/fgene.2022.1062576] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/06/2022] [Accepted: 10/24/2022] [Indexed: 08/01/2023] Open
Abstract
Lactic acid bacteria antimicrobial peptides (LABAMPs) are a class of active polypeptides produced during the metabolic process of lactic acid bacteria, which can inhibit or kill pathogenic bacteria or spoilage bacteria in food. LABAMPs have broad application in important practical fields closely related to human beings, such as food production and efficient agricultural planting. However, screening for antimicrobial peptides by biological experiments is time-consuming and laborious. Therefore, it is urgent to develop a model to predict LABAMPs. In this work, we design a graph convolutional neural network framework for identifying LABAMPs. We build a heterogeneous graph based on amino acids, tripeptides and their relationships, and learn the weights of a graph convolutional network (GCN). Our GCN iteratively learns the embedded words and sequence weights in the graph under the supervision of the input sequence labels. We applied 10-fold cross-validation to two training datasets and obtained accuracies of 0.9163 and 0.9379, respectively. These are higher than those of other machine learning and GNN algorithms. On an independent test dataset, the accuracies for the two datasets are 0.9130 and 0.9291, which are 1.08% and 1.57% higher than the best methods of other online webservers.
Affiliation(s)
- Tong-Jie Sun
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- He-Long Bu
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Xin Yan
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Zhi-Hong Sun
  - College of Food Science and Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Mu-Su Zha
  - College of Food Science and Engineering, Inner Mongolia Agricultural University, Hohhot, China
- Gai-Fang Dong
  - College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
|
27
|
Cui Z, Chen ZH, Zhang QH, Gribova V, Filaretov VF, Huang DS. RMSCNN: A Random Multi-Scale Convolutional Neural Network for Marine Microbial Bacteriocins Identification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3663-3672. [PMID: 34699364 DOI: 10.1109/tcbb.2021.3122183] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
The abuse of traditional antibiotics has led to an increase in the resistance of bacteria and viruses. Similar to the function of antibacterial peptides, bacteriocins are more common as a kind of peptides produced by bacteria that have bactericidal or bacterial effects. More importantly, the marine environment is one of the most abundant resources for extracting marine microbial bacteriocins (MMBs). Identifying bacteriocins from marine microorganisms is a common goal for the development of new drugs. Effective use of MMBs will greatly alleviate the current antibiotic abuse problem. In this work, deep learning is used to identify meaningful MMBs. We propose a random multi-scale convolutional neural network method. In the scale setting, we set a random model to update the scale value randomly. The scale selection method can reduce the contingency caused by artificial setting under certain conditions, thereby making the method more extensive. The results show that the classification performance of the proposed method is better than the state-of-the-art classification methods. In addition, some potential MMBs are predicted, and some different sequence analyses are performed on these candidates. It is worth mentioning that after sequence analysis, the HNH endonucleases of different marine bacteria are considered as potential bacteriocins.
|
28
|
Yan J, Cai J, Zhang B, Wang Y, Wong DF, Siu SWI. Recent Progress in the Discovery and Design of Antimicrobial Peptides Using Traditional Machine Learning and Deep Learning. Antibiotics (Basel) 2022; 11:1451. [PMID: 36290108 PMCID: PMC9598685 DOI: 10.3390/antibiotics11101451] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 09/20/2022] [Revised: 10/11/2022] [Accepted: 10/13/2022] [Indexed: 11/16/2022] Open
Abstract
Antimicrobial resistance has become a critical global health problem due to the abuse of conventional antibiotics and the rise of multi-drug-resistant microbes. Antimicrobial peptides (AMPs) are a group of natural peptides that show promise as next-generation antibiotics due to their low toxicity to the host, their broad spectrum of biological activity (including antibacterial, antifungal, antiviral, and anti-parasitic activities), and their great therapeutic potential, such as anticancer and anti-inflammatory effects. Most importantly, AMPs kill bacteria by damaging cell membranes using multiple mechanisms of action rather than targeting a single molecule or pathway, making it difficult for bacterial drug resistance to develop. However, experimental approaches used to discover and design new AMPs are very expensive and time-consuming. In recent years, there has been considerable interest in using in silico methods, including traditional machine learning (ML) and deep learning (DL) approaches, for drug discovery. While there are a few papers summarizing computational AMP prediction methods, none of them focuses on DL methods. In this review, we aim to survey the latest AMP prediction methods based on DL approaches. First, the biological background of AMPs is introduced; then various feature encoding methods used to represent the features of peptide sequences are presented. We explain the most popular DL techniques and highlight the recent works based on them to classify AMPs and design novel peptide sequences. Finally, we discuss the limitations and challenges of AMP prediction.
Affiliation(s)
- Jielu Yan
  - PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Jianxiu Cai
  - Faculty of Applied Sciences, Macao Polytechnic University, Macau, China
  - Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macau, China
- Bob Zhang
  - PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Yapeng Wang
  - Faculty of Applied Sciences, Macao Polytechnic University, Macau, China
- Derek F. Wong
  - NLP2CT Lab, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Shirley W. I. Siu
  - Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macau, China
  - School of Pharmaceutical Sciences, Universiti Sains Malaysia, Pulau Pinang 11800, Malaysia
|
29
|
Li X, Zhang S, Shi H. An improved residual network using deep fusion for identifying RNA 5-methylcytosine sites. Bioinformatics 2022; 38:4271-4277. [PMID: 35866985 DOI: 10.1093/bioinformatics/btac532] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/11/2022] [Revised: 06/30/2022] [Accepted: 07/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: 5-Methylcytosine (m5C) is a crucial post-transcriptional modification. With the development of technology, it has been found widely in various RNAs. Numerous studies have indicated that m5C plays an essential role in various activities of organisms, such as tRNA recognition, stabilization of RNA structure and RNA metabolism. Traditional identification by wet-lab experiments is costly and time-consuming. Therefore, computational models are commonly used to identify m5C sites. Given the vast computing advantages of deep learning, it is feasible to construct the predictive model through deep learning algorithms.
RESULTS: In this study, we construct a model to identify m5C based on a deep fusion approach with an improved residual network. First, sequence features are extracted from the RNA sequences using Kmer, K-tuple nucleotide frequency component (KNFC), Pseudo dinucleotide composition (PseDNC) and Physical and chemical property (PCP). Kmer and KNFC extract information from a statistical point of view. PseDNC and PCP extract information from the physicochemical properties of RNA sequences. Then, the two parts of information are fused into new features using bidirectional long short-term memory and attention mechanisms, respectively. Next, the fused features are fed into the improved residual network for classification. Finally, 10-fold cross-validation and independent set testing are used to verify the credibility of the model. The results show that the accuracy reaches 91.87%, 95.55%, 92.27% and 95.60% on the training sets and independent test sets of Arabidopsis thaliana and M. musculus, respectively. This is a considerable improvement compared to previous studies and demonstrates the robust performance of our model.
AVAILABILITY AND IMPLEMENTATION: The data and code related to the study are available at https://github.com/alivelxj/m5c-DFRESG.
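Of the four encodings listed in this abstract, the Kmer feature is the simplest to illustrate. The sketch below computes normalized k-mer frequencies over the RNA alphabet. It is a generic version under stated assumptions: the function name, the default k = 2 and the handling of non-ACGU characters are illustrative choices, and it is not taken from the authors' repository.

```python
from itertools import product


def kmer_frequencies(rna: str, k: int = 2) -> dict[str, float]:
    """Return the relative frequency of each of the 4**k possible RNA k-mers.

    Windows containing characters outside A/C/G/U are ignored (an assumption),
    and T is read as U so that DNA-style input also works.
    """
    counts = {"".join(p): 0 for p in product("ACGU", repeat=k)}
    seq = rna.upper().replace("T", "U")
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # Avoid division by zero for sequences shorter than k.
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


if __name__ == "__main__":
    features = kmer_frequencies("ACGUACGU", k=2)
    # A fixed key order gives a fixed-length feature vector (16 values for k = 2).
    vector = [features[key] for key in sorted(features)]
    print(len(vector), vector)
```

A fixed-length vector of this kind is what makes statistical encodings easy to combine with physicochemical ones before classification.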
Affiliation(s)
- Xinjie Li
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
- Hongyan Shi
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
|
30
|
Antimicrobial peptides with cell-penetrating activity as prophylactic and treatment drugs. Biosci Rep 2022; 42:231731. [PMID: 36052730 PMCID: PMC9508529 DOI: 10.1042/bsr20221789] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/14/2022] [Revised: 08/31/2022] [Accepted: 09/01/2022] [Indexed: 01/18/2023] Open
Abstract
Health is fundamental for the development of individuals and the evolution of species. In that sense, it is relevant for human societies to understand how the human body has developed molecular strategies to maintain health. In the present review, we summarize diverse evidence that supports the role of peptides in this endeavor. Of particular interest to the present review are antimicrobial peptides (AMP) and cell-penetrating peptides (CPP). Different experimental evidence indicates that AMP/CPP are able to regulate autophagy, which in turn regulates the immune system response. AMP also assist in the establishment of the microbiota, which in turn is critical for different behavioral and health aspects of humans. Thus, AMP and CPP are multifunctional peptides that regulate two aspects of our bodies that are fundamental to our health: autophagy and microbiota. While the multifunctional nature of these peptides is now clear, we are still in the early stages of developing computational strategies to assist experimentalists in identifying selective multifunctional AMP/CPP to control unhealthy conditions. For instance, both AMP and CPP are computationally characterized as amphipathic and cationic, yet neither of these features is relevant for differentiating these peptides from non-AMP or non-CPP. The present review aims to highlight current knowledge that may facilitate the development of AMP design tools for preventing or treating illness.
|
31
|
Chen S, Li Q, Zhao J, Bin Y, Zheng C. NeuroPred-CLQ: incorporating deep temporal convolutional networks and multi-head attention mechanism to predict neuropeptides. Brief Bioinform 2022; 23:6672901. [DOI: 10.1093/bib/bbac319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Revised: 06/27/2022] [Accepted: 07/14/2022] [Indexed: 11/13/2022] Open
Abstract
Neuropeptides (NPs) are a particular class of informative substances in the immune system and physiological regulation. They play a crucial role in regulating physiological functions in various biological growth and developmental stages. In addition, NPs are crucial for developing new drugs for the treatment of neurological diseases. With the development of molecular biology techniques, some data-driven tools have emerged to predict NPs. However, the predictive performance of these tools for NPs still needs to be improved. In this study, we developed a deep learning model (NeuroPred-CLQ) based on the temporal convolutional network (TCN) and multi-head attention mechanism to identify NPs effectively, translating the internal relationships of peptide sequences into numerical features with the Word2vec algorithm. The experimental results show that NeuroPred-CLQ learns data information effectively, achieving 93.6% accuracy and 98.8% AUC on the independent test set. The model performs better in identifying NPs than the state-of-the-art predictors. Visualization of features using t-distributed stochastic neighbor embedding (t-SNE) shows that NeuroPred-CLQ can clearly distinguish the positive NPs from the negative ones. We believe NeuroPred-CLQ can facilitate drug development and clinical trial studies to treat neurological disorders.
Affiliation(s)
- Shouzhi Chen
  - School of Mathematics and System Science, Xinjiang University, Urumqi, China
- Qing Li
  - School of Mathematics and System Science, Xinjiang University, Urumqi, China
- Jianping Zhao
  - School of Mathematics and System Science, Xinjiang University, Urumqi, China
- Yannan Bin
  - School of Computer Science and Technology, Anhui University, Hefei, China
- Chunhou Zheng
  - School of Mathematics and System Science, Xinjiang University, Urumqi, China
  - School of Computer Science and Technology, Anhui University, Hefei, China
|
32
|
Rational Discovery of Antimicrobial Peptides by Means of Artificial Intelligence. MEMBRANES 2022; 12:membranes12070708. [PMID: 35877911 PMCID: PMC9320227 DOI: 10.3390/membranes12070708] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/14/2022] [Revised: 07/05/2022] [Accepted: 07/06/2022] [Indexed: 11/16/2022]
Abstract
Antibiotic resistance is a worldwide public health problem due to the costs and mortality rates it generates. However, the large pharmaceutical industries have stopped searching for new antibiotics because of their low profitability, given the rapid replacement rates imposed by the increasingly observed resistance acquired by microorganisms. Alternatively, antimicrobial peptides (AMPs) have emerged as potent molecules with a much lower rate of resistance generation. The discovery of these peptides is carried out through extensive in vitro screenings of either rational or non-rational libraries. These processes are tedious and expensive and generate only a few AMP candidates, most of which fail to show the required activity and physicochemical properties for practical applications. This work proposes implementing an artificial intelligence algorithm to reduce the required experimentation and increase the efficiency of high-activity AMP discovery. Our deep learning (DL) model, called AMPs-Net, outperforms the state-of-the-art method by 8.8% in average precision. Furthermore, it is highly accurate in predicting the antibacterial and antiviral capacity of a large number of AMPs. Our search led to the identification of two unreported antimicrobial motifs and two novel antimicrobial peptides related to them. Moreover, by coupling DL with molecular dynamics (MD) simulations, we were able to find a multifunctional peptide with promising therapeutic effects. Our work validates our previously proposed pipeline for a more efficient rational discovery of novel AMPs.
|
33
|
Heinzinger M, Littmann M, Sillitoe I, Bordin N, Orengo C, Rost B. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genom Bioinform 2022; 4:lqac043. [PMID: 35702380 PMCID: PMC9188115 DOI: 10.1093/nargab/lqac043] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 11/22/2021] [Revised: 03/25/2022] [Accepted: 05/17/2022] [Indexed: 12/23/2022] Open
Abstract
Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has a better ability to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
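The idea of embedding-based annotation transfer described above can be sketched as a nearest-neighbor lookup: the query inherits the label of the closest annotated embedding. The sketch below is a minimal illustration, not the ProtTucker or EAT implementation. The function names, the use of Euclidean distance, the tiny two-dimensional vectors and the example labels are all assumptions for illustration; real pLM embeddings have hundreds or thousands of dimensions.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotation(
    query: list[float],
    lookup: list[tuple[list[float], str]],
) -> tuple[str, float]:
    """Return the label of the nearest lookup embedding and its distance.

    The distance is returned so that callers can reject hits that are too far
    away to be trusted (the cut-off itself is left to the caller).
    """
    if not lookup:
        raise ValueError("lookup set is empty")
    best_label, best_dist = "", math.inf
    for embedding, label in lookup:
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


if __name__ == "__main__":
    # Hypothetical annotated embeddings with made-up class labels.
    annotated = [([0.0, 0.0], "class-A"), ([1.0, 1.0], "class-B")]
    print(transfer_annotation([0.9, 1.1], annotated))
```

Because the lookup needs only one distance computation per annotated protein and no alignment, this style of transfer is cheap at query time, which is consistent with the speed advantage the abstract reports.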
Affiliation(s)
- Michael Heinzinger
  - TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Maria Littmann
  - TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Burkhard Rost
  - TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
|
34
|
PDAUG: a Galaxy based toolset for peptide library analysis, visualization, and machine learning modeling. BMC Bioinformatics 2022; 23:197. [PMID: 35643441 PMCID: PMC9148462 DOI: 10.1186/s12859-022-04727-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/29/2021] [Accepted: 05/11/2022] [Indexed: 11/28/2022] Open
Abstract
Background: Computational methods based on initial screening and prediction of peptides for desired functions have proven to be effective alternatives to the lengthy and expensive biochemical experimental methods traditionally utilized in peptide research, thus saving time and effort. However, for many researchers, the lack of expertise in utilizing programming libraries, of access to computational resources, and of flexible pipelines are big hurdles to adopting these advanced methods.
Results: To address the above-mentioned barriers, we have implemented the peptide design and analysis under Galaxy (PDAUG) package, a Galaxy-based, Python-powered collection of tools, workflows, and datasets for rapid in-silico peptide library analysis. In contrast to existing methods like standard programming libraries or rigid single-function web-based tools, PDAUG offers an integrated GUI-based toolset, providing flexibility to build and distribute reproducible pipelines and workflows without programming expertise. Finally, we demonstrate the usability of PDAUG in predicting anticancer properties of peptides using four different feature sets and assess the suitability of various ML algorithms.
Conclusion: PDAUG offers tools for peptide library generation, data visualization, built-in and public database peptide sequence retrieval, peptide feature calculation, and machine learning (ML) modeling. Additionally, this toolset allows researchers to combine PDAUG with hundreds of compatible existing Galaxy tools for limitless analytic strategies.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04727-6.
|
35
|
Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: The case of gluten bibliome. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.10.100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
36
|
Xie W, Zheng Z, Zhang W, Huang L, Lin Q, Wong KC. SRG-vote: Predicting miRNA-gene relationships via embedding and LSTM ensemble. IEEE J Biomed Health Inform 2022; 26:4335-4344. [PMID: 35471879 DOI: 10.1109/jbhi.2022.3169542] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
Targeted therapy for one or a set of genes has made it possible to apply precision medicine to different patients in the presence of tumor heterogeneity. However, how to regulate those genes is still problematic. One of the natural regulators of genes is microRNAs. Thus, a better understanding of the miRNA-gene interaction mechanism might contribute to future diagnosis, prevention, and cancer therapy. The interactions between microRNAs and genes play an essential role in molecular genetics. The in-vivo experiments validating the relationships between them are time-consuming, costly, and labor-intensive. With the development of high-throughput technology, we now deal with vast amounts of biological data. However, extracting features from tremendous amounts of raw data and building a mathematical model is still challenging. Machine learning and deep learning algorithms have become powerful tools for dealing with biological data. Inspired by this, in this paper, we propose a model that combines feature/embedding extraction methods, deep learning algorithms, and a voting system. We leverage doc2vec to generate sequential embeddings from molecular sequences. Geometrical embeddings were generated with role2vec, GCN, and GMM from the complex network built from similarity and pair-wise datasets. For the deep learning algorithms, we leveraged LSTM and Bi-LSTM according to the different embeddings and features. Finally, we adopted a voting system to balance results from different data sources. The results show that our voting system can achieve a higher AUC than the existing benchmark. The case studies demonstrate that our model can reveal potential relationships between miRNAs and genes. The source code, features, and predictive results can be downloaded at https://github.com/Xshelton/SRG-vote.
|
37
|
García-Jacas CR, Pinacho-Castellanos SA, García-González LA, Brizuela CA. Do deep learning models make a difference in the identification of antimicrobial peptides? Brief Bioinform 2022; 23:6563422. [PMID: 35380616 DOI: 10.1093/bib/bbac094] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/23/2021] [Revised: 02/16/2022] [Accepted: 02/23/2022] [Indexed: 12/21/2022] Open
Abstract
In the last few decades, antimicrobial peptides (AMPs) have been explored as an alternative to classical antibiotics, which in turn motivated the development of machine learning models to predict antimicrobial activities in peptides. The first generation of these predictors consisted of what are now known as shallow learning-based models. These models require the computation and selection of molecular descriptors to characterize each peptide sequence and train the models. The second generation, known as deep learning-based models, which no longer require the explicit computation and selection of those descriptors, started to be used in the prediction task of AMPs just four years ago. The superior performance claimed for deep models over shallow models has created a prevalent inertia toward using deep learning to identify AMPs. However, methodological flaws and/or modeling biases in the building of deep models do not support such superiority. Here, we analyze the main pitfalls that led to biased conclusions on the leading performance of deep models. We also analyze whether deep models truly achieve better predictions than shallow models by performing fair studies on different state-of-the-art benchmarking datasets. The experiments reveal that deep models do not outperform shallow models in the classification of AMPs, and that both types of models codify similar chemical information since their predictions are highly similar. Thus, according to the currently available datasets, we conclude that deep learning may not be the most suitable approach for developing models to identify AMPs, mainly because shallow models achieve comparable-to-superior performance and are simpler (Ockham's razor principle). Even so, we suggest the use of deep learning only when its capabilities lead to significantly better performance gains worth the additional computational cost.
Affiliation(s)
- César R García-Jacas
- Cátedras CONACYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
| | - Sergio A Pinacho-Castellanos
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México.,Centro de Investigación y Desarrollo de Tecnología Digital (CITEDI), Instituto Politécnico Nacional (IPN), 22435 Tijuana, Baja California, México
| | - Luis A García-González
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
| | - Carlos A Brizuela
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
| |
|
38
|
Prediction of Linear Cationic Antimicrobial Peptides Active against Gram-Negative and Gram-Positive Bacteria Based on Machine Learning Models. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12073631] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/16/2022]
Abstract
Antimicrobial peptides (AMPs) are considered promising alternatives to conventional antibiotics for overcoming the growing problem of antibiotic resistance. Computational prediction approaches are receiving increasing interest as a way to identify and design the best candidate AMPs prior to in vitro tests. In this study, we focused on linear cationic peptides with non-hemolytic activity, downloaded from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP). Based on the MIC (minimum inhibitory concentration) values, we assigned a positive label to a peptide if it shows antimicrobial activity; otherwise, the peptide is labeled as negative. We treated peptides showing antimicrobial activity against Gram-negative and against Gram-positive bacteria separately and created two datasets accordingly. Ten different physico-chemical properties of the peptides were calculated and used as features. Following data exploration and data preprocessing steps, a variety of classification algorithms were used with 100-fold Monte Carlo cross-validation to build models and to predict the antimicrobial activity of the peptides. Among the generated models, Random Forest yielded the best performance metrics for both the Gram-negative dataset (Accuracy: 0.98, Recall: 0.99, Specificity: 0.97, Precision: 0.97, AUC: 0.99, F1: 0.98) and the Gram-positive dataset (Accuracy: 0.95, Recall: 0.95, Specificity: 0.95, Precision: 0.90, AUC: 0.97, F1: 0.92) after outlier elimination. This prediction approach might be useful for evaluating the antibacterial potential of a candidate peptide sequence before moving to experimental studies.
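The Monte Carlo cross-validation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the 20% test fraction, the function names `monte_carlo_splits` and `mean_accuracy`, and the generic `fit`/`predict` callables are all assumptions made for illustration. The idea shown is only that the data are repeatedly shuffled and split at random (here 100 times), and the per-split scores are averaged.

```python
import random


def monte_carlo_splits(n_samples, n_repeats=100, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random hold-out splits.

    Unlike k-fold cross-validation, the test sets of different repeats
    may overlap, because each repeat reshuffles all samples independently.
    """
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def mean_accuracy(features, labels, fit, predict, **split_kwargs):
    """Average test accuracy over all Monte Carlo splits.

    `fit(train_features, train_labels)` returns a model object and
    `predict(model, feature)` returns a label; both are hypothetical
    callables supplied by the caller.
    """
    scores = []
    for train_idx, test_idx in monte_carlo_splits(len(labels), **split_kwargs):
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

Any classifier can be plugged in through `fit` and `predict`; a Random Forest, as in the abstract, would need an external library and is deliberately left out of this sketch.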
|
39
|
Nguyen TTD, Ho QT, Le NQK, Phan VD, Ou YY. Use Chou's 5-Steps Rule With Different Word Embedding Types to Boost Performance of Electron Transport Protein Prediction Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1235-1244. [PMID: 32750894 DOI: 10.1109/tcbb.2020.3010975] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Living organisms receive necessary energy substances directly from cellular respiration. The completion of electron storage and transportation requires the process of cellular respiration with the aid of electron transport chains. Therefore, the work of deciphering electron transport proteins is inevitably needed. The identification of these proteins with high performance has a prompt dependence on the choice of methods for feature extraction and machine learning algorithm. In this study, protein sequences served as natural language sentences comprising words. The nominated word embedding-based feature sets, hinged on the word embedding modulation and protein motif frequencies, were useful for feature choosing. Five word embedding types and a variety of conjoint features were examined for such feature selection. The support vector machine algorithm consequentially was employed to perform classification. The performance statistics within the 5-fold cross-validation including average accuracy, specificity, sensitivity, as well as MCC rates surpass 0.95. Such metrics in the independent test are 96.82, 97.16, 95.76 percent, and 0.9, respectively. Compared to state-of-the-art predictors, the proposed method can generate more preferable performance above all metrics indicating the effectiveness of the proposed method in determining electron transport proteins. Furthermore, this study reveals insights about the applicability of various word embeddings for understanding surveyed sequences.
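The metrics reported in this abstract (accuracy, specificity, sensitivity, and MCC) all derive from a binary confusion matrix. The following Python sketch shows the standard definitions; the function name `binary_metrics` and the example counts are illustrative assumptions, not values from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is absent.
        return numerator / denominator if denominator else 0.0

    # The MCC denominator is zero whenever any row or column of the
    # confusion matrix is empty; MCC is conventionally set to 0 then.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall on the positive class
        "specificity": ratio(tn, tn + fp),   # recall on the negative class
        "mcc": ratio(tp * tn - fp * fn, denom),
    }


# Hypothetical counts, chosen only to exercise the formulas.
print(binary_metrics(tp=90, fp=5, tn=95, fn=10))
```

MCC ranges from -1 to 1 rather than 0 to 1, which is why the abstract reports it as 0.9 alongside percentages for the other three metrics.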
|
40
|
Tsukiyama S, Hasan MM, Deng HW, Kurata H. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Brief Bioinform 2022; 23:6539171. [PMID: 35225328 PMCID: PMC8921755 DOI: 10.1093/bib/bbac053] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 12/08/2021] [Revised: 01/28/2022] [Accepted: 01/31/2022] [Indexed: 01/29/2023]
Abstract
N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, and regulation of gene expression. Several experimental methods have been used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect 6mA and compensate for these shortcomings of experimental methods, we proposed a novel deep learning approach called BERT6mA. To compare BERT6mA with other deep learning approaches, we used benchmark datasets covering 11 species. BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed higher or comparable performance relative to the state-of-the-art models, although it performed poorly in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to BERT6mA. The models pretrained and fine-tuned on specific species presented higher performance than other models, even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.
Affiliation(s)
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hiroyuki Kurata
- Corresponding author: Hiroyuki Kurata, Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan. Tel: 81-948-29-7828; E-mail:
| |
|
41
|
PRIP: A Protein-RNA Interface Predictor Based on Semantics of Sequences. Life (Basel) 2022; 12:life12020307. [PMID: 35207594 PMCID: PMC8879494 DOI: 10.3390/life12020307] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/05/2022] [Revised: 01/28/2022] [Accepted: 02/04/2022] [Indexed: 01/08/2023]
Abstract
RNA–protein interactions play an indispensable role in many biological processes. Growing evidence has indicated that aberration of the RNA–protein interaction is associated with many serious human diseases. The precise and quick detection of RNA–protein interactions is crucial to finding new functions and to uncovering the mechanism of interactions. Although many methods have been presented to recognize RNA-binding sites, there is much room left for the improvement of predictive accuracy. We present a sequence semantics-based method (called PRIP) for predicting RNA-binding interfaces. PRIP extracts semantic embeddings by pre-training Word2vec on the corpus. Extreme gradient boosting was employed to train a classifier. PRIP obtained an SN of 0.73 over the five-fold cross-validation and an SN of 0.67 over the independent test, outperforming the state-of-the-art methods. Compared with other methods, PRIP learned the hidden relations between words in the context. The analysis of the semantic relationships implied that the semantics of some words were specific to RNA-binding interfaces. This method is helpful for exploring the mechanism of RNA–protein interactions from a semantics point of view.
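Treating a protein sequence as a sentence requires first deciding what a "word" is. A common choice, assumed here rather than taken from the PRIP paper, is overlapping k-mers. The Python sketch below tokenizes a sequence into k-mers and enumerates the (center, context) pairs that a skip-gram Word2vec model would be trained on; the values of `k`, `stride`, and `window` and the example sequence are illustrative only.

```python
def kmer_words(sequence, k=3, stride=1):
    """Split a sequence into k-mer 'words' (overlapping when stride < k)."""
    sequence = sequence.strip().upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def skipgram_pairs(words, window=2):
    """List the (center, context) pairs within a symmetric context window."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        pairs.extend((center, words[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical peptide fragment used only to show the tokenization.
words = kmer_words("MKTAYIAK", k=3)
print(words)
print(skipgram_pairs(words, window=1)[:4])
```

The resulting word lists form the corpus on which an embedding model is pre-trained; the learned vectors can then serve as features for a downstream classifier such as gradient boosting.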
|
42
|
Chu X, Sun T, Li Q, Xu Y, Zhang Z, Lai L, Pei J. Prediction of liquid-liquid phase separating proteins using machine learning. BMC Bioinformatics 2022; 23:72. [PMID: 35168563 PMCID: PMC8845408 DOI: 10.1186/s12859-022-04599-w] [Citation(s) in RCA: 96] [Impact Index Per Article: 32.0] [Received: 08/08/2021] [Accepted: 02/02/2022] [Indexed: 12/20/2022]
Abstract
Background The liquid–liquid phase separation (LLPS) of biomolecules in cells underpins the formation of membraneless organelles, which are condensates of protein, nucleic acid, or both, and play critical roles in cellular function. Dysregulation of LLPS is implicated in a number of diseases. Although the LLPS of biomolecules has been investigated intensively in recent years, knowledge of the prevalence and distribution of phase separation proteins (PSPs) still lags behind. Development of computational methods to predict PSPs is therefore of great importance for a comprehensive understanding of the biological function of LLPS.
Results Based on the PSPs collected in LLPSDB, we developed a sequence-based prediction tool for LLPS proteins (PSPredictor), which is an attempt at general-purpose PSP prediction that does not depend on specific protein types. Our method combines componential and sequential information during the protein embedding stage and adopts a machine learning algorithm for the final prediction. The proposed method achieves a tenfold cross-validation accuracy of 94.71% and outperforms previously reported PSP prediction tools. For further applications, we built a user-friendly PSPredictor web server (http://www.pkumdl.cn/PSPredictor), which is accessible for prediction of potential PSPs.
Conclusions PSPredictor could identify novel scaffold proteins for stress granules and predict PSP candidates in the human genome for further study. For further applications, we built a user-friendly PSPredictor web server (http://www.pkumdl.cn/PSPredictor), which provides valuable information for potential PSP recognition. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04599-w.
Affiliation(s)
- Xiaoquan Chu
- College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
| | - Tanlin Sun
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
| | - Qian Li
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Youjun Xu
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
| | - Zhuqing Zhang
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China.
| | - Luhua Lai
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China.,Beijing National Laboratory for Molecular Science, State Key Laboratory for Structural Chemistry of Unstable and Stable Species, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China.,Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China.
| | - Jianfeng Pei
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China.
| |
|
43
|
Jin Y, Yang Y. ProtPlat: an efficient pre-training platform for protein classification based on FastText. BMC Bioinformatics 2022; 23:66. [PMID: 35148686 PMCID: PMC8832758 DOI: 10.1186/s12859-022-04604-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/13/2021] [Accepted: 02/02/2022] [Indexed: 11/24/2022]
Abstract
Background For the past decades, benefitting from the rapid growth of protein sequence data in public databases, a lot of machine learning methods have been developed to predict physicochemical properties or functions of proteins using amino acid sequence features. However, the prediction performance often suffers from the lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample issue in computer vision and natural language processing fields, while specific pre-training techniques for protein sequences are few. Results In this paper, we propose a pre-training platform for representing protein sequences, called ProtPlat, which uses the Pfam database to train a three-layer neural network, and then uses specific training data from downstream tasks to fine-tune the model. ProtPlat can learn good representations for amino acids, and at the same time achieve efficient classification. We conduct experiments on three protein classification tasks, including the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that the pre-training can enhance model performance effectively and ProtPlat is competitive to the state-of-the-art predictors, especially for small datasets. We implement the ProtPlat platform as a web service (https://compbio.sjtu.edu.cn/protplat) that is accessible to the public. Conclusions To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for the pre-training of protein sequences, which is featured by a large-scale supervised training based on Pfam database and an efficient learning model, FastText. The experimental results of three downstream classification tasks demonstrate the efficacy of ProtPlat. 
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04604-2.
Affiliation(s)
- Yuan Jin
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
| | - Yang Yang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China.
| |
|
44
|
Li H, Shi L, Gao W, Zhang Z, Zhang L, Wang G. dPromoter-XGBoost: Detecting promoters and strength by combining multiple descriptors and feature selection using XGBoost. Methods 2022; 204:215-222. [PMID: 34998983 DOI: 10.1016/j.ymeth.2022.01.001] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/19/2021] [Revised: 12/13/2021] [Accepted: 01/02/2022] [Indexed: 12/12/2022]
Abstract
Promoters, which are responsible for stimulating the transcription and expression of specific genes, play an irreplaceable role in biological processes and genetics. Promoter abnormalities have been found in some diseases, and the level of promoter-binding transcription factors can be used as a marker before a disease occurs. Hence, detecting promoters from DNA sequences has important biological significance; in particular, distinguishing strong promoters can help to elucidate differences in gene expression and the mechanisms of specific diseases. With the introduction of third-generation sequencing, it is difficult for experimental promoter labeling to keep pace with sequencing. Many computational models have been designed to fill this gap and identify unlabeled DNA. However, their feature representation methods are very limited and cannot fully reflect the information contained in the original samples. With the aim of avoiding information loss, we propose a computational model that combines multiple descriptors and feature selection to jointly represent samples. Notably, a new feature descriptor called the K-mer word vector is defined. The promoter model, built on multiple feature descriptors dominated by the K-mer word vector, achieves performance similar to existing methods, and its sensitivity of 85.72% indicates that it can identify promoters more effectively than other methods. Furthermore, the promoter strength model surpasses published methods, and its accuracy of 77.00% greatly improves the ability to distinguish between strong and weak promoters.
Affiliation(s)
- Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China; Yangtze Delta Region Institute, University of Electronic Science and Technology, Quzhou, China
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
| | - Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
| |
|
45
|
Word2vec neural model-based technique to generate protein vectors for combating COVID-19: a machine learning approach. INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY : AN OFFICIAL JOURNAL OF BHARATI VIDYAPEETH'S INSTITUTE OF COMPUTER APPLICATIONS AND MANAGEMENT 2022; 14:3291-3299. [PMID: 35611155 PMCID: PMC9119569 DOI: 10.1007/s41870-022-00949-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/11/2022] [Accepted: 04/13/2022] [Indexed: 12/15/2022]
Abstract
The world was ambushed in 2019 by the COVID-19 virus, which affected the health, economy, and lifestyle of individuals worldwide. One way of combating such a public health concern is by using appropriate, rapid, and unbiased diagnostic tools for quick detection of infected people. However, a current dearth of bioinformatics tools necessitates modeling studies to help diagnose COVID-19 cases. Molecular-based methods for detecting COVID-19, such as the real-time reverse transcription polymerase chain reaction (rRT-PCR), are time-consuming and prone to contamination. Modern bioinformatics tools have made it possible to create large databases of protein sequences of various diseases, apply data mining techniques, and accurately diagnose diseases. However, the current sequence alignment tools that use these databases are not able to detect novel COVID-19 viral sequences due to high sequence dissimilarity. The objective of this study, therefore, was to develop models that can classify COVID-19 viral sequences accurately and rapidly using protein vectors generated by a neural word embedding technique. Five machine learning models, namely K nearest neighbor regression (KNN), support vector machine (SVM), random forest (RF), linear discriminant analysis (LDA), and logistic regression, were developed using datasets from the National Center for Biotechnology. Our results suggest that the RF model performed better than all other models, with 99% accuracy on the training dataset and 99.5% accuracy on the testing dataset. The implication of this study is that rapid detection of the COVID-19 virus in suspected cases could potentially save lives, as less time will be needed to ascertain the status of a patient.
|
46
|
Attention-Based Deep Multiple-Instance Learning for Classifying Circular RNA and Other Long Non-Coding RNA. Genes (Basel) 2021; 12:genes12122018. [PMID: 34946967 PMCID: PMC8701965 DOI: 10.3390/genes12122018] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/11/2021] [Revised: 12/14/2021] [Accepted: 12/17/2021] [Indexed: 12/23/2022]
Abstract
Circular RNA (circRNA) is a distinct, circular form of long non-coding RNA (lncRNA) that has specific roles in transcriptional regulation and multiple biological processes. Distinguishing circRNA from other lncRNA is necessary for relevant research. In this study, we designed an attention-based multi-instance learning (MIL) network architecture, fed with a raw sequence, to learn the sparse features of RNA sequences and to accomplish the circRNA identification task. The model outperformed the state-of-the-art models. Moreover, after validating the effectiveness of the attention mechanism on a handwritten digit dataset, we obtained the key sequence loci underlying circRNA recognition based on the corresponding attention scores. Motif enrichment analysis then identified some of the key motifs for circRNA formation. In conclusion, we designed a deep learning network architecture suitable for learning gene sequences with sparse features and implemented it for the circRNA identification task, and the model has strong representation capability in indicating some key loci.
|
47
|
David L, Brata AM, Mogosan C, Pop C, Czako Z, Muresan L, Ismaiel A, Dumitrascu DI, Leucuta DC, Stanculete MF, Iaru I, Popa SL. Artificial Intelligence and Antibiotic Discovery. Antibiotics (Basel) 2021; 10:antibiotics10111376. [PMID: 34827314 PMCID: PMC8614913 DOI: 10.3390/antibiotics10111376] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 10/13/2021] [Revised: 11/01/2021] [Accepted: 11/08/2021] [Indexed: 01/13/2023]
Abstract
Over recent decades, a new antibiotic crisis has been unfolding due to decreased research in this domain, a low return on investment for the companies that develop the drugs, a lengthy and difficult research process, a low success rate for candidate molecules, increased use of antibiotics in farms, and overall inappropriate use of antibiotics. This has led to a series of pathogens developing antibiotic resistance, which poses severe threats to public health systems while also driving up the costs of hospitalization and treatment. Moreover, without proper action and collaboration between academic and health institutions, a catastrophic trend might develop, with the possibility of returning to a pre-antibiotic era. Nevertheless, emerging AI-based technologies have started to enter the field of antibiotic and drug development, offering a new perspective on an ever-growing problem. Cheaper and faster research can be achieved through algorithms that identify hit compounds, thereby further accelerating the development of new antibiotics, which represents a vital step in solving the current antibiotic crisis. The aim of this review is to provide an extended overview of the current artificial intelligence-based technologies that are used for antibiotic discovery, together with their technological and economic impact on the industrial sector.
Affiliation(s)
- Liliana David
- 2nd Medical Department, “Iuliu Hatieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania; (L.D.); (A.I.); (S.L.P.)
| | - Anca Monica Brata
- Faculty of Environmental Protection, University of Oradea, 410048 Oradea, Romania
- Correspondence:
| | - Cristina Mogosan
- Department of Pharmacology, Physiology and Pathophysiology, Faculty of Pharmacy, “Iuliu Hațieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania; (C.M.); (C.P.); (I.I.)
| | - Cristina Pop
- Department of Pharmacology, Physiology and Pathophysiology, Faculty of Pharmacy, “Iuliu Hațieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania; (C.M.); (C.P.); (I.I.)
| | - Zoltan Czako
- Department of Computer Science, Technical University of Cluj-Napoca, 400027 Cluj-Napoca, Romania;
| | - Lucian Muresan
- Department of Cardiology, “Emile Muller” Hospital, 68200 Mulhouse, France;
| | - Abdulrahman Ismaiel
- 2nd Medical Department, “Iuliu Hatieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania; (L.D.); (A.I.); (S.L.P.)
| | - Dinu Iuliu Dumitrascu
- Department of Anatomy, “Iuliu Hatieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania;
| | - Daniel Corneliu Leucuta
- Department of Medical Informatics and Biostatistics, “Iuliu Hatieganu” University of Medicine and Pharmacy, 400349 Cluj-Napoca, Romania;
| | - Mihaela Fadygas Stanculete
- Department of Neurosciences, Discipline of Psychiatry and Pediatric Psychiatry, “Iuliu Hatieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania;
| | - Irina Iaru
- Department of Pharmacology, Physiology and Pathophysiology, Faculty of Pharmacy, “Iuliu Hațieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania; (C.M.); (C.P.); (I.I.)
| | - Stefan Lucian Popa
- 2nd Medical Department, “Iuliu Hatieganu” University of Medicine and Pharmacy, 400000 Cluj-Napoca, Romania; (L.D.); (A.I.); (S.L.P.)
| |
|
48
|
Tsukiyama S, Hasan MM, Fujii S, Kurata H. LSTM-PHV: prediction of human-virus protein-protein interactions by LSTM with word2vec. Brief Bioinform 2021; 22:bbab228. [PMID: 34160596 PMCID: PMC8574953 DOI: 10.1093/bib/bbab228] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 03/03/2021] [Revised: 04/27/2021] [Accepted: 05/25/2021] [Indexed: 12/30/2022]
Abstract
Viral infection involves a large number of protein-protein interactions (PPIs) between human and virus. The PPIs range from the initial binding of viral coat proteins to host membrane receptors to the hijacking of host transcription machinery. However, few interspecies PPIs have been identified, because experimental methods including mass spectrometry are time-consuming and expensive, and molecular dynamics simulation is limited to proteins whose 3D structures are solved. Sequence-based machine learning methods are expected to overcome these problems. We first developed an LSTM model with word2vec, named LSTM-PHV, to predict PPIs between human and virus by using amino acid sequences alone. LSTM-PHV effectively learnt the training data with a highly imbalanced ratio of positive to negative samples and achieved AUCs of 0.976 and 0.973 and accuracies of 0.984 and 0.985 on the training and independent datasets, respectively. In predicting PPIs between human and unknown or new viruses, LSTM-PHV greatly outperformed the existing state-of-the-art PPI predictors. Interestingly, learning only sequence contexts as words is sufficient for PPI prediction. Use of uniform manifold approximation and projection demonstrated that LSTM-PHV clearly distinguished the positive PPI samples from the negative ones. We presented the LSTM-PHV online web server and support data that are freely available at http://kurata35.bio.kyutech.ac.jp/LSTM-PHV.
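Because this abstract reports AUC alongside accuracy on a highly imbalanced dataset, it may help to see how AUC is defined. The Python sketch below uses the pairwise (Mann-Whitney) formulation: AUC is the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration, not data from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels; scores: matching predicted scores.
    Runs in O(n_pos * n_neg) time, which is fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example (1 positive, 9 negatives): a classifier that
# always predicts "negative" reaches 90% accuracy, yet its constant
# scores give an AUC of only 0.5.
labels = [1] + [0] * 9
print(roc_auc(labels, [0.0] * 10))
```

This is why AUC is often reported together with accuracy when negatives greatly outnumber positives: it depends on how well the scores rank positives above negatives, not on the class ratio.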
Affiliation(s)
- Sho Tsukiyama
- Department of Interdisciplinary Informatics in the Kyushu Institute of Technology, Japan
| | | | - Satoshi Fujii
- Department of Bioscience and Bioinformatics in the Kyushu Institute of Technology, Japan
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics in the Kyushu Institute of Technology, Japan
| |
|
49
|
Ostrovsky-Berman M, Frankel B, Polak P, Yaari G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing. Front Immunol 2021; 12:680687. [PMID: 34367141 PMCID: PMC8340020 DOI: 10.3389/fimmu.2021.680687] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/15/2021] [Accepted: 06/22/2021] [Indexed: 11/13/2022]
Abstract
The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). These huge immune repertoires in each individual present investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid textual sequences in a vector space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec on amino acid 3-gram sequences, continuing to longer BCR sequences, and finally to entire repertoires. Our work demonstrates Immune2vec to be a reliable low-dimensional representation that preserves relevant information of immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec along with machine learning approaches to patient data exemplifies how distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.
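One simple way to go from 3-gram vectors to a fixed-length vector for a whole receptor sequence is to pool the vectors of the 3-grams it contains. The Python sketch below uses mean pooling over overlapping 3-grams; this pooling choice, the function name `embed_sequence`, and the two-dimensional toy vectors are assumptions for illustration and are not necessarily what Immune2vec does.

```python
def embed_sequence(sequence, ngram_vectors, n=3):
    """Average the vectors of all known n-grams found in a sequence.

    ngram_vectors maps an n-gram string to a list of floats (all of the
    same length). Returns None if no n-gram of the sequence is in the
    vocabulary, so callers can handle unseen sequences explicitly.
    """
    grams = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    known = [ngram_vectors[g] for g in grams if g in ngram_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy 2-dimensional vocabulary; real embeddings would be learned from
# repertoire data and have many more dimensions.
toy_vectors = {"CAR": [1.0, 0.0], "ARD": [0.0, 1.0]}
print(embed_sequence("CARD", toy_vectors))
```

Every sequence, whatever its length, maps to a vector of the same dimension, which is what allows standard classifiers and clustering methods to be applied to repertoire data.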
Affiliation(s)
- Miri Ostrovsky-Berman
  - Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel; Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel
- Boaz Frankel
  - Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel; Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel
- Pazit Polak
  - Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel; Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel
- Gur Yaari
  - Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel; Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel
|
50
|
Sharma R, Shrivastava S, Kumar Singh S, Kumar A, Saxena S, Kumar Singh R. AniAMPpred: artificial intelligence guided discovery of novel antimicrobial peptides in animal kingdom. Brief Bioinform 2021; 22:6320952. [PMID: 34259329 DOI: 10.1093/bib/bbab242] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 04/24/2021] [Revised: 06/02/2021] [Accepted: 06/21/2021] [Indexed: 12/12/2022]
Abstract
With advancements in genomics, there has been a substantial reduction in the cost and time of genome sequencing, which has resulted in a large amount of data in genome databases. Antimicrobial host defense proteins provide protection against invading microbes. However, confirming the antimicrobial function of host proteins by wet-lab experiments is expensive and time-consuming. Therefore, there is a need to develop an in silico tool to identify the antimicrobial function of proteins. In the current study, we developed a model, AniAMPpred, by considering all the available antimicrobial peptides (AMPs) of length $\in [10, 200]$ from the animal kingdom. The model utilizes a support vector machine algorithm with deep learning-based features and identifies probable antimicrobial proteins (PAPs) in the genome of animals. The results show that our proposed model outperforms other state-of-the-art classifiers, has very high confidence in its predictions, is not biased and can classify both AMPs and non-AMPs across diverse peptide lengths with high accuracy. By utilizing AniAMPpred, we identified 436 PAPs in the genome of Helobdella robusta. To further confirm the functional activity of PAPs, we performed BLAST analysis against known AMPs. On detailed analysis of five selected PAPs, we observed their similarity with antimicrobial proteins of several animal species. Thus, our proposed model can help researchers identify PAPs in the genome of animals and provide insight into the functional identity of different proteins. An online prediction server based on the proposed approach has also been developed and is freely accessible at https://aniamppred.anvil.app/.
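The length criterion stated in the abstract (peptides of 10 to 200 residues) can be sketched as a simple pre-filter. This is only an illustration, not the AniAMPpred pipeline: the function name is hypothetical, and treating both bounds as inclusive is an assumption based on the bracket notation in the abstract.

```python
def filter_by_length(peptides: list[str], min_len: int = 10, max_len: int = 200) -> list[str]:
    """Keep peptides whose length lies in [min_len, max_len].

    Assumption: both bounds are inclusive.
    """
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Synthetic placeholder sequences of lengths 9, 10, 200 and 201.
candidates = ["A" * 9, "A" * 10, "A" * 200, "A" * 201]
kept = filter_by_length(candidates)
print([len(p) for p in kept])
```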
Affiliation(s)
- Ritesh Sharma
  - Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, 221005, Uttar Pradesh, India
- Sameer Shrivastava
  - Division of Veterinary Biotechnology, ICAR-Indian Veterinary Research Institute, Izatnagar, 243122, Uttar Pradesh, India
- Sanjay Kumar Singh
  - Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, 221005, Uttar Pradesh, India
- Abhinav Kumar
  - Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, 221005, Uttar Pradesh, India
- Sonal Saxena
  - Division of Veterinary Biotechnology, ICAR-Indian Veterinary Research Institute, Izatnagar, 243122, Uttar Pradesh, India
- Raj Kumar Singh
  - Former Director & Vice Chancellor, ICAR-Indian Veterinary Research Institute, Izatnagar, 243122, Uttar Pradesh, India
|