1
González M, Durán RE, Seeger M, Araya M, Jara N. Negative dataset selection impacts machine learning-based predictors for multiple bacterial species promoters. Bioinformatics 2025; 41:btaf135. [PMID: 40152247 PMCID: PMC11993300 DOI: 10.1093/bioinformatics/btaf135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2024] [Revised: 03/13/2025] [Accepted: 03/25/2025] [Indexed: 03/29/2025] Open
Abstract
MOTIVATION: Advances in machine learning-based bacterial promoter predictors have greatly improved identification metrics. However, existing models have overlooked the impact of negative datasets, an effect previously identified as GC-content discrepancies between positive and negative datasets in single-species models. This study investigates whether multiple-species models for promoter classification are inherently biased by the selection criteria of negative datasets. We further explore whether generating synthetic random sequences (SRS) that mimic the GC-content distribution of promoters can partly reduce this bias.
RESULTS: Multiple-species predictors exhibited GC-content bias when coding sequences (CDS) were used as the negative dataset, as suggested by species-specific specificity and sensitivity metrics and investigated by dimensionality reduction. We demonstrated a reduction in this bias by using the SRS dataset, with less detection of background noise in real genomic data. In both scenarios, DNABERT showed the best metrics. These findings suggest that GC-balanced datasets can enhance the generalizability of promoter predictors across Bacteria.
AVAILABILITY AND IMPLEMENTATION: The source code of the experiments is freely available at https://github.com/maigonzalezh/MultispeciesPromoterClassifier.
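The idea of a GC-matched negative set can be illustrated with a small sketch. The abstract does not describe how the SRS were actually generated, so the procedure below is an assumption: for each promoter, draw a random sequence of the same length in which each base is G or C with probability equal to the promoter's GC fraction. The function names (`gc_content`, `synthetic_random_sequence`, `make_srs`) are hypothetical.

```python
import random


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in an uppercase DNA sequence (0.0 if empty)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def synthetic_random_sequence(template: str, rng: random.Random) -> str:
    """Random sequence with the template's length and expected GC fraction.

    Assumption: per-position sampling; the paper's own SRS procedure may differ.
    """
    gc = gc_content(template)
    # Each position is G or C with probability gc, otherwise A or T.
    return "".join(
        rng.choice("GC") if rng.random() < gc else rng.choice("AT")
        for _ in template
    )


def make_srs(promoters, seed=0):
    """Build one GC-matched random negative per promoter sequence."""
    rng = random.Random(seed)
    return [synthetic_random_sequence(p, rng) for p in promoters]


if __name__ == "__main__":
    promoters = ["TTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC", "GCGCGGCCGCATATGCGC"]
    for pos, neg in zip(promoters, make_srs(promoters)):
        print(f"{gc_content(pos):.2f} {gc_content(neg):.2f} {neg}")
```

Because sampling is per position, each negative matches its template's GC fraction only in expectation; shuffling the template's own bases would be an alternative that matches it exactly.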
Affiliation(s)
- Marcelo González
  - Departamento de Electrónica, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
- Roberto E Durán
  - Laboratorio de Microbiología Molecular y Biotecnología Ambiental, Department of Chemistry & Center of Biotechnology Daniel Alkalay Lowitt, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
  - Millennium Nucleus Bioproducts, Genomics and Environmental Microbiology (BioGEM), Avenida España 1680, Valparaíso 2390123, Chile
- Michael Seeger
  - Laboratorio de Microbiología Molecular y Biotecnología Ambiental, Department of Chemistry & Center of Biotechnology Daniel Alkalay Lowitt, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
  - Millennium Nucleus Bioproducts, Genomics and Environmental Microbiology (BioGEM), Avenida España 1680, Valparaíso 2390123, Chile
- Mauricio Araya
  - Departamento de Electrónica, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
- Nicolás Jara
  - Departamento de Electrónica, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
2
Du Q, Guo Y, Zhang J, Lu F, Peng C, Zhou C. Predicting Promoters in Multiple Prokaryotes with Prompt. Interdiscip Sci 2024; 16:814-828. [PMID: 39110340 DOI: 10.1007/s12539-024-00637-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/17/2024] [Accepted: 05/21/2024] [Indexed: 10/27/2024]
Abstract
Promoters are important cis-regulatory elements for the regulation of gene expression, and their accurate prediction is crucial for elucidating the biological functions and potential mechanisms of genes. Many previous prokaryotic promoter prediction methods show encouraging prediction performance, but most of them focus on recognizing promoters in only one or a few bacterial species. Moreover, because they ignore promoter sequence motifs, the interpretability of predictions from existing methods is limited. In this work, we present a generalized method, Prompt (Promoters in multiple prokaryotes), to predict promoters in 16 prokaryotes and improve the interpretability of prediction results. Prompt integrates three methods, RSK (Regression based on Selected k-mer), CL (Contrastive Learning), and MLP (Multilayer Perceptron), and employs a voting strategy to divide the datasets into high-confidence and low-confidence categories. Results on the promoter prediction tasks show that the performance of Prompt (accuracy and Matthews correlation coefficient) is greater than 80% on the high-confidence datasets of all 16 prokaryotes and greater than 90% in 12 of them, and that Prompt performs best compared with other existing methods. Moreover, by identifying promoter sequence motifs, Prompt can improve the interpretability of the predictions. Prompt is freely available at https://github.com/duqimeng/PromptPrompt and will contribute to the research of promoters in prokaryotes.
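A voting split of this kind can be sketched as follows. The abstract does not state the exact rule, so this sketch assumes that a sample is "high-confidence" when all three models agree and "low-confidence" otherwise, with the majority label used as the prediction. The function name `vote` and the 0/1 label encoding are hypothetical.

```python
from collections import Counter


def vote(predictions):
    """Combine binary labels (0 = non-promoter, 1 = promoter) from an odd number of models.

    Returns (majority_label, confidence). Assumption: unanimous agreement
    means "high" confidence; any disagreement means "low".
    """
    counts = Counter(predictions)
    label, n_votes = counts.most_common(1)[0]
    confidence = "high" if n_votes == len(predictions) else "low"
    return label, confidence


if __name__ == "__main__":
    # Each tuple holds hypothetical outputs from three models for one sequence.
    for preds in [(1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0)]:
        print(preds, vote(preds))
```

With an odd number of binary voters there are no ties, so the majority label is always well defined.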
Affiliation(s)
- Qimeng Du
  - School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China
- Yixue Guo
  - College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China
- Junpeng Zhang
  - School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China
- Fuping Lu
  - College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China
- Chong Peng
  - College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China.
- Chichun Zhou
  - School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China.
3
Paul S, Olymon K, Martinez GS, Sarkar S, Yella VR, Kumar A. MLDSPP: Bacterial Promoter Prediction Tool Using DNA Structural Properties with Machine Learning and Explainable AI. J Chem Inf Model 2024; 64:2705-2719. [PMID: 38258978 DOI: 10.1021/acs.jcim.3c02017] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 01/24/2024]
Abstract
Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge due to their diverse architecture and variations. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, and state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and MCC metrics. Our findings reveal that XGBoost outperformed other models and current state-of-the-art promoter prediction tools, namely Sigma70pred and iPromoter2L, achieving F1-scores >95% in most systems. Significantly, the use of one-hot encoding for representing nucleotide sequences complements these structural features, enhancing our XGBoost model's predictive capabilities. To address the challenge of model interpretability, we incorporated explainable AI techniques using Shapley values. This enhancement allows for a better understanding and interpretation of the predictions of our model. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, utilizing original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.
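The one-hot representation mentioned in the abstract can be illustrated with a short sketch. The column order (A, C, G, T) and the all-zero row for ambiguous bases are assumptions made here, not details taken from the paper, and `one_hot` and `flatten` are hypothetical helper names.

```python
NUCLEOTIDES = "ACGT"  # assumed column order


def one_hot(seq: str):
    """Encode a DNA sequence as a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def flatten(matrix):
    """Concatenate the rows into a single feature vector for a tabular model."""
    return [value for row in matrix for value in row]


if __name__ == "__main__":
    encoded = one_hot("TATAAT")
    print(encoded)
    print(len(flatten(encoded)))  # 4 features per position
```

A flattened vector like this could sit alongside structural features (such as duplex stability profiles) as input to a tree-based classifier.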
Affiliation(s)
- Subhojit Paul
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Kaushika Olymon
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Gustavo Sganzerla Martinez
  - Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
  - Pediatrics, Izaak Walton Killam (IWK) Health Center, Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- Sharmilee Sarkar
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Venkata Rajesh Yella
  - Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522302, Andhra Pradesh, India
- Aditya Kumar
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
4
Acero-Pimentel D, Romero-Sánchez DI, Fuentes-Curiel SN, Quirasco M. Study of an Enterococcus faecium strain isolated from an artisanal Mexican cheese, whole-genome sequencing, comparative genomics, and bacteriocin expression. Antonie Van Leeuwenhoek 2024; 117:40. [PMID: 38393447 PMCID: PMC10891205 DOI: 10.1007/s10482-024-01938-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Accepted: 01/28/2024] [Indexed: 02/25/2024]
Abstract
Enterococci are ubiquitous microorganisms in almost all environments, from the soil we step on to the food we eat. They are frequently found in naturally fermented foods, contributing to ripening through protein, lipid, and sugar metabolism. On the other hand, these organisms are also leading the current antibiotic resistance crisis. In this study, we performed whole-genome sequencing and comparative genomics of an Enterococcus faecium strain isolated from an artisanal Mexican Cotija cheese, namely QD-2. We found clear genomic differences between commensal and pathogenic strains, particularly in their carbohydrate metabolic pathways, resistance to vancomycin and other antibiotics, bacteriocin production, and bacteriophage and CRISPR content. Furthermore, a bacteriocin transcription analysis performed by RT-qPCR revealed that, at the end of the log phase, besides enterocins A and X, two putative bacteriocins not reported previously are also transcribed as a bicistronic operon in E. faecium QD-2, and are expressed 1.5 times higher than enterocin A when cultured in MRS broth.
Affiliation(s)
- Daniel Acero-Pimentel
  - Departamento de Alimentos y Biotecnología, Facultad de Química, Universidad Nacional Autónoma de México, Ciudad Universitaria, 04510, Mexico City, Mexico
- Diana I Romero-Sánchez
  - Departamento de Alimentos y Biotecnología, Facultad de Química, Universidad Nacional Autónoma de México, Ciudad Universitaria, 04510, Mexico City, Mexico
- Sac Nicté Fuentes-Curiel
  - Departamento de Alimentos y Biotecnología, Facultad de Química, Universidad Nacional Autónoma de México, Ciudad Universitaria, 04510, Mexico City, Mexico
- Maricarmen Quirasco
  - Departamento de Alimentos y Biotecnología, Facultad de Química, Universidad Nacional Autónoma de México, Ciudad Universitaria, 04510, Mexico City, Mexico.
5
Ligeti B, Szepesi-Nagy I, Bodnár B, Ligeti-Nagy N, Juhász J. ProkBERT family: genomic language models for microbiome applications. Front Microbiol 2024; 14:1331233. [PMID: 38282738 PMCID: PMC10810988 DOI: 10.3389/fmicb.2023.1331233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/11/2023] [Indexed: 01/30/2024] Open
Abstract
Background: In the evolving landscape of microbiology and microbiome analysis, the integration of machine learning is crucial for understanding complex microbial interactions, and predicting and recognizing novel functionalities within extensive datasets. However, the effectiveness of these methods in microbiology faces challenges due to the complex and heterogeneous nature of microbial data, further complicated by low signal-to-noise ratios, context-dependency, and a significant shortage of appropriately labeled datasets. This study introduces the ProkBERT model family, a collection of large language models designed for genomic tasks. It provides a generalizable sequence representation for nucleotide sequences, learned from unlabeled genome data. This approach helps overcome the above-mentioned limitations in the field, thereby improving our understanding of microbial ecosystems and their impact on health and disease.
Methods: ProkBERT models are based on transfer learning and self-supervised methodologies, enabling them to use the abundant yet complex microbial data effectively. The introduction of the novel Local Context-Aware (LCA) tokenization technique marks a significant advancement, allowing ProkBERT to overcome the contextual limitations of traditional transformer models. This methodology not only retains rich local context but also demonstrates remarkable adaptability across various bioinformatics tasks.
Results: In practical applications such as promoter prediction and phage identification, the ProkBERT models show superior performance. For promoter prediction tasks, the top-performing model achieved a Matthews Correlation Coefficient (MCC) of 0.74 for E. coli and 0.62 in mixed-species contexts. In phage identification, ProkBERT models consistently outperformed established tools like VirSorter2 and DeepVirFinder, achieving an MCC of 0.85. These results underscore the models' exceptional accuracy and generalizability in both supervised and unsupervised tasks.
Conclusions: The ProkBERT model family is a compact yet powerful tool in the field of microbiology and bioinformatics. Its capacity for rapid, accurate analyses and its adaptability across a spectrum of tasks mark a significant advancement in machine learning applications in microbiology. The models are available on GitHub (https://github.com/nbrg-ppcu/prokbert) and HuggingFace (https://huggingface.co/nerualbioinfo), providing an accessible tool for the community.
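Since several of the abstracts in this list report the Matthews Correlation Coefficient, a minimal sketch of how it is computed from a binary confusion matrix may help in reading those values. The counts in the example are invented for illustration and are not taken from any of the cited studies; `mcc` is a hypothetical helper name.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Invented counts: 90 promoters found, 10 missed, 15 false alarms.
    print(round(mcc(tp=90, tn=85, fp=15, fn=10), 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it stays informative when promoter and non-promoter classes are imbalanced.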
Affiliation(s)
- Balázs Ligeti
  - Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- István Szepesi-Nagy
  - Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Babett Bodnár
  - Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
- Noémi Ligeti-Nagy
  - Language Technology Research Group, HUN-REN Hungarian Research Centre for Linguistics, Budapest, Hungary
- János Juhász
  - Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
  - Institute of Medical Microbiology, Semmelweis University, Budapest, Hungary
6
Wang Y, Tai S, Zhang S, Sheng N, Xie X. PromGER: Promoter Prediction Based on Graph Embedding and Ensemble Learning for Eukaryotic Sequence. Genes (Basel) 2023; 14:1441. [PMID: 37510345 PMCID: PMC10379012 DOI: 10.3390/genes14071441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 07/04/2023] [Accepted: 07/10/2023] [Indexed: 07/30/2023] Open
Abstract
Promoters are non-coding DNA regions around the transcription start site and are responsible for regulating the gene transcription process. Due to their key role in gene function and transcriptional activity, the accurate prediction of promoter sequences and their core elements is a crucial research area in bioinformatics. At present, models based on machine learning and deep learning have been developed for promoter prediction. However, these models can neither mine the deeper biological information of promoter sequences nor consider the complex relationships among promoter sequences. In this work, we propose a novel prediction model called PromGER to predict eukaryotic promoter sequences. First, PromGER utilizes four types of feature-encoding methods to extract local information within promoter sequences. Second, according to the potential relationships among promoter sequences, the whole set of promoter sequences is constructed as a graph. Third, graph-embedding methods at three different scales are applied to obtain the global feature information in the graph more comprehensively. Finally, combining local features with global features of sequences, PromGER analyzes and predicts promoter sequences through a tree-based ensemble-learning framework. Compared with seven existing methods, PromGER improved average specificity by 13%, accuracy by 10%, Matthews correlation coefficient by 16%, precision by 4%, F1 score by 6%, and AUC by 9%. This study also interpreted PromGER using the t-distributed stochastic neighbor embedding (t-SNE) method and SHapley Additive exPlanations (SHAP) value analysis, which demonstrates the interpretability of the model.
Affiliation(s)
- Yan Wang
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Shiwen Tai
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Shuangquan Zhang
  - School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Nan Sheng
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Xuping Xie
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
7
Ni CE, Doan DP, Chiu YJ, Huang YH. TSSUNet-MB - ab initio identification of σ70 promoter transcription start sites in Escherichia coli using deep multitask learning. Comput Biol Chem 2023; 105:107904. [PMID: 37327560 DOI: 10.1016/j.compbiolchem.2023.107904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2022] [Revised: 03/22/2023] [Accepted: 06/09/2023] [Indexed: 06/18/2023]
Abstract
MOTIVATION: Computational promoter prediction (CPP) tools designed to classify prokaryotic promoter regions usually assume that a transcription start site (TSS) is located at a predefined position within each promoter region. Such CPP tools are sensitive to any positional shifting of the TSS in a windowed region, and they are unsuitable for determining the boundaries of prokaryotic promoters.
RESULTS: TSSUNet-MB is a deep learning model developed to identify the TSSs of σ70 promoters. Mononucleotide and bendability were used to encode input sequences. TSSUNet-MB outperforms other CPP tools when assessed using sequences obtained from the neighborhood of real promoters. TSSUNet-MB achieved a sensitivity of 0.839 and a specificity of 0.768 on sliding sequences, while other CPP tools cannot maintain both sensitivity and specificity in a comparable range. Furthermore, TSSUNet-MB can precisely predict the TSS position of σ70 promoter-containing regions with a 10-base accuracy of 77.6%. By leveraging the sliding window scanning approach, we further computed the confidence score of each predicted TSS, which allows TSS locations to be identified more accurately. Our results suggest that TSSUNet-MB is a robust tool for finding σ70 promoters and identifying TSSs.
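Two ideas from this abstract, sliding-window scanning and a tolerance-based position accuracy, can be sketched in a few lines. This is an illustration only: the abstract does not define "10-base accuracy", so the sketch assumes it is the fraction of predicted TSS positions that fall within 10 bases of the annotated position. The helper names `sliding_windows` and `within_k_accuracy`, and all positions in the example, are hypothetical.

```python
def sliding_windows(seq: str, width: int, step: int = 1):
    """Yield (start, window) pairs covering seq with fixed-width windows."""
    for start in range(0, len(seq) - width + 1, step):
        yield start, seq[start:start + width]


def within_k_accuracy(predicted, actual, k: int = 10) -> float:
    """Fraction of predictions within k bases of the true position.

    Assumption: this is one plausible reading of "10-base accuracy".
    """
    hits = sum(abs(p - a) <= k for p, a in zip(predicted, actual))
    return hits / len(actual)


if __name__ == "__main__":
    genome = "ACGTTGACATATAATGCGT"
    for start, window in sliding_windows(genome, width=8, step=4):
        print(start, window)
    # Invented positions, for illustration only.
    print(within_k_accuracy([100, 120, 90], [105, 100, 90]))
```

In a real scan, each window would be passed to a trained model, and overlapping windows that agree on a TSS position could be pooled into a confidence score.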
Affiliation(s)
- Chung-En Ni
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Duy-Phuong Doan
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Yen-Jung Chiu
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Yen-Hua Huang
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan; Center for Systems and Synthetic Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan.
8
Moller E, Britt M, Schams A, Cetuk H, Anishkin A, Sukharev S. Mechanosensitive channel MscS is critical for termination of the bacterial hypoosmotic permeability response. J Gen Physiol 2023; 155:e202213168. [PMID: 37022337 PMCID: PMC10082366 DOI: 10.1085/jgp.202213168] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/06/2022] [Revised: 02/06/2023] [Accepted: 03/20/2023] [Indexed: 04/07/2023] Open
Abstract
Free-living microorganisms are subjected to drastic changes in osmolarity. To avoid lysis under sudden osmotic down-shock, bacteria quickly expel small metabolites through the tension-activated channels MscL, MscS, and MscK. We examined five chromosomal knockout strains, ∆mscL, ∆mscS, a double knockout ∆mscS ∆mscK, and a triple knockout ∆mscL ∆mscS ∆mscK, in comparison to the wild-type parental strain. Stopped-flow experiments confirmed that both MscS and MscL mediate fast osmolyte release and curb cell swelling, but osmotic viability assays indicated that they are not equivalent. MscS alone was capable of rescuing the cell population, but in some strains, MscL did not rescue and additionally became toxic in the absence of both MscS and MscK. Furthermore, MscS was upregulated in the ∆mscL strain, suggesting either a crosstalk between the two genes/proteins or the influence of cell mechanics on mscS expression. The data shows that for the proper termination of the permeability response, the high-threshold (MscL) and the low-threshold (MscS/MscK) channels must act sequentially. In the absence of low-threshold channels, at the end of the release phase, MscL should stabilize membrane tension at around 10 mN/m. Patch-clamp protocols emulating the tension changes during the release phase indicated that the non-inactivating MscL, residing at its own tension threshold, flickers and produces a protracted leakage. The MscS/MscK population, when present, stays open at this stage to reduce tension below the MscL threshold and silence the large channel. When MscS reaches its own threshold, it inactivates and thus ensures proper termination of the hypoosmotic permeability response. This functional interplay between the high- and low-threshold channels is further supported by the compromised osmotic survival of bacteria expressing non-inactivating MscS mutants.
Affiliation(s)
- Elissa Moller
  - Department of Biology, University of Maryland, College Park, College Park, MD, USA
  - Biophysics Graduate Program, University of Maryland, College Park, College Park, MD, USA
- Madolyn Britt
  - Department of Biology, University of Maryland, College Park, College Park, MD, USA
  - Biophysics Graduate Program, University of Maryland, College Park, College Park, MD, USA
- Anthony Schams
  - Department of Biology, University of Maryland, College Park, College Park, MD, USA
- Hannah Cetuk
  - Department of Biology, University of Maryland, College Park, College Park, MD, USA
- Andriy Anishkin
  - Department of Biology, University of Maryland, College Park, College Park, MD, USA
- Sergei Sukharev
  - Department of Biology, University of Maryland, College Park, College Park, MD, USA
  - Institute for Physical Science and Technology, University of Maryland, College Park, College Park, MD, USA
9
Moller E, Britt M, Schams A, Cetuk H, Anishkin A, Sukharev S. Mechanosensitive channel MscS is critical for termination of the bacterial hypoosmotic permeability response. bioRxiv 2023:2023.02.27.530336. [PMID: 36909569 PMCID: PMC10002685 DOI: 10.1101/2023.02.27.530336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
Abstract
Free-living microorganisms are subjected to drastic changes in osmolarity. To avoid lysis under sudden osmotic down-shock, bacteria quickly expel small metabolites through the tension-activated channels MscL, MscS, and MscK. We examined five chromosomal knockout strains, ΔmscL, ΔmscS, a double knockout ΔmscS ΔmscK, and a triple knockout ΔmscL ΔmscS ΔmscK, in comparison to the wild-type parental strain. Stopped-flow experiments confirmed that both MscS and MscL mediate fast osmolyte release and curb cell swelling, but osmotic viability assays indicated that they are not equivalent. MscS alone was capable of rescuing the cell population, but in some strains MscL did not rescue and additionally became toxic in the absence of both MscS and MscK. Furthermore, MscS was upregulated in the ΔmscL strain, suggesting either a cross-talk between the two genes/proteins or the influence of cell mechanics on mscS expression. The data shows that for the proper termination of the permeability response, the high-threshold (MscL) and the low-threshold (MscS/MscK) channels must act sequentially. In the absence of low-threshold channels, at the end of the release phase, MscL should stabilize membrane tension at around 10 mN/m. Patch-clamp protocols emulating the tension changes during the release phase indicated that the non-inactivating MscL, residing at its own tension threshold, flickers and produces a protracted leakage. The MscS/MscK population, when present, stays open at this stage to reduce tension below the MscL threshold and silence the large channel. When MscS reaches its own threshold, it inactivates and thus ensures proper termination of the hypoosmotic permeability response. This functional interplay between the high- and low-threshold channels is further supported by the compromised osmotic survival of bacteria expressing non-inactivating MscS mutants.
Summary for the table of contents: The kinetics of hypotonic osmolyte release from E. coli is analyzed in conjunction with bacterial survival. It is shown that MscL, the high-threshold 'emergency release valve', rescues bacteria from down-shocks only in the presence of MscS, MscK or other low-threshold channels that are necessary to pacify MscL at the end of the release phase.
10
Kim NY, Kim OB. The ybcF Gene of Escherichia coli Encodes a Local Orphan Enzyme, Catabolic Carbamate Kinase. J Microbiol Biotechnol 2022; 32:1527-1536. [PMID: 36384810 PMCID: PMC9843812 DOI: 10.4014/jmb.2210.10037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Revised: 10/31/2022] [Accepted: 11/01/2022] [Indexed: 11/18/2022]
Abstract
Escherichia coli can use allantoin as its sole nitrogen source under anaerobic conditions. The ureidoglycolate produced by double release of ammonia from allantoin can flow into either the glyoxylate shunt or further catabolic transcarbamoylation. Although the former pathway is well studied, the genes of the latter (catabolic) pathway are not known. In the catabolic pathway, ureidoglycolate is finally converted to carbamoyl phosphate (CP) and oxamate, and then CP is dephosphorylated to carbamate by a catabolic carbamate kinase (CK), whereby ATP is formed. We identified the ybcF gene in a gene cluster containing fdrA-ylbE-ylbF-ybcF that is located downstream of the allDCE-operon. Reverse transcription PCR of total mRNA confirmed that the genes fdrA, ylbE, ylbF, and ybcF are co-transcribed. Deletion of ybcF caused only a slight increase in metabolic flow into the glyoxylate pathway, probably because CP was used to de novo synthesize pyrimidine and arginine. The activity of the catabolic CK was analyzed using purified YbcF protein. The Vmax is 1.82 U/mg YbcF for CP and 1.94 U/mg YbcF for ADP, and the KM value is 0.47 mM for CP and 0.43 mM for ADP. With these results, it was experimentally revealed that the ybcF gene of E. coli encodes catabolic CK, which completes anaerobic allantoin degradation through substrate-level phosphorylation. Therefore, we suggest renaming the ybcF gene as allK.
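The reported Vmax and KM values can be put to work in a small worked example. This sketch assumes simple single-substrate Michaelis-Menten kinetics with the second substrate saturating, which is a simplification for a two-substrate enzyme and is not stated in the abstract. Only the two constants for carbamoyl phosphate come from the text; the function name and the substrate concentrations are illustrative.

```python
def michaelis_menten_rate(vmax: float, km: float, substrate: float) -> float:
    """Initial rate v = Vmax * [S] / (KM + [S])."""
    return vmax * substrate / (km + substrate)


# Values reported in the abstract for carbamoyl phosphate (CP).
VMAX_CP = 1.82  # U/mg YbcF
KM_CP = 0.47    # mM

if __name__ == "__main__":
    # Illustrative CP concentrations in mM (not from the paper).
    for conc in (0.1, KM_CP, 5.0):
        rate = michaelis_menten_rate(VMAX_CP, KM_CP, conc)
        print(f"[CP] = {conc:.2f} mM -> v = {rate:.2f} U/mg")
```

By definition, the rate at a substrate concentration equal to KM is half of Vmax, so with these constants the enzyme would run at about 0.91 U/mg when CP is at 0.47 mM.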
Affiliation(s)
- Nam Yeun Kim
  - Department of Life Science, Division of EcoScience, Ewha Womans University, Seoul 03760, Republic of Korea
- Ok Bin Kim
  - Department of Life Science, Division of EcoScience, Ewha Womans University, Seoul 03760, Republic of Korea
11
Mai DHA, Nguyen LT, Lee EY. TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT. Front Genet 2022; 13:1067562. [PMID: 36523764 PMCID: PMC9745317 DOI: 10.3389/fgene.2022.1067562] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/12/2022] [Accepted: 11/17/2022] [Indexed: 07/30/2023] Open
Abstract
Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and feature analysis. The learning transfer ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.
|
12
|
Patiyal S, Singh N, Ali MZ, Pundir DS, Raghava GPS. Sigma70Pred: A highly accurate method for predicting sigma70 promoter in Escherichia coli K-12 strains. Front Microbiol 2022; 13:1042127. [PMID: 36452927 PMCID: PMC9701712 DOI: 10.3389/fmicb.2022.1042127] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/12/2022] [Accepted: 10/27/2022] [Indexed: 12/01/2023] Open
Abstract
Sigma70 factor plays a crucial role in prokaryotes and regulates the transcription of most housekeeping genes. One of the major challenges is to predict the sigma70 promoter, or sigma70 factor binding site, with high precision. In this study, we trained and evaluated our models on a dataset consisting of 741 sigma70 promoters and 1,400 non-promoters. We generated around 8,000 features, including Dinucleotide Auto-Correlation, Dinucleotide Cross-Correlation, Dinucleotide Auto Cross-Correlation, Moran Auto-Correlation, Normalized Moreau-Broto Auto-Correlation, and Parallel Correlation Pseudo Tri-Nucleotide Composition. Our SVM-based model achieved a maximum accuracy of 97.38% with an AUROC of 0.99 on the training dataset, using the 200 most relevant features. To check the robustness of the model, we tested it on an independent dataset built from RegulonDB10.8, which included 1,134 sigma70 promoters and 638 non-promoters, and achieved an accuracy of 90.41% with an AUROC of 0.95. Our model predicted constitutive promoters with an accuracy of 81.46% on an independent dataset. We have developed a method, Sigma70Pred, which is available as a webserver and as standalone packages at https://webs.iiitd.edu.in/raghava/sigma70pred/. The services are freely accessible.
Affiliation(s)
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Nitindeep Singh
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Mohd Zartab Ali
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Dhawal Singh Pundir
  - Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Gajendra P. S. Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India
|
13
|
Shujaat M, Jin JS, Tayara H, Chong KT. iProm-phage: A two-layer model to identify phage promoters and their types using a convolutional neural network. Front Microbiol 2022; 13:1061122. [PMID: 36406389 PMCID: PMC9672459 DOI: 10.3389/fmicb.2022.1061122] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/04/2022] [Accepted: 10/18/2022] [Indexed: 04/26/2024] Open
Abstract
The increased interest in phages as antibacterial agents has resulted in a rise in the number of sequenced phage genomes, necessitating the development of user-friendly bioinformatics tools for genome annotation. A promoter is a DNA sequence that is used in the annotation of phage genomes. In this study we propose a two-layer model called "iProm-phage" for the prediction and classification of phage promoters. The first layer identifies a query sequence as a promoter or non-promoter; if the query sequence is predicted to be a promoter, the second layer classifies it as a phage or host promoter. Furthermore, rather than using non-coding regions of the genome as a negative set, we created a more challenging negative dataset using promoter sequences. The presented approach improves discrimination while decreasing the frequency of false positive predictions. For feature selection, we investigated 10 distinct feature encoding approaches and used them with several machine-learning algorithms and a 1-D convolutional neural network (CNN) model. We found that the combination of one-hot encoding and the CNN model outperformed the alternatives on the performance metrics. Based on the results of 5-fold cross-validation, the proposed predictor has high potential. Furthermore, to make it easier for experimental scientists to obtain the results they require, we set up a freely accessible and user-friendly web server at http://nsclbio.jbnu.ac.kr/tools/iProm-phage/.
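The one-hot encoding mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the iProm-phage code: the function name `one_hot_encode`, the A/C/G/T channel order, and the all-zero row for ambiguous symbols are assumptions made here for illustration.

```python
# Illustrative sketch only; not taken from the iProm-phage implementation.
BASES = "ACGT"  # assumed channel order


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per nucleotide.

    Symbols outside A/C/G/T (for example N) become an all-zero row,
    which is an assumed convention rather than a documented one.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


# Encode the canonical -10 hexamer as a small usage example.
encoded = one_hot_encode("TATAAT")
for base, row in zip("TATAAT", encoded):
    print(base, row)
```

A matrix of this shape (sequence length by four channels) is the usual input to a 1-D convolutional layer, which slides its filters along the sequence axis.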
Affiliation(s)
- Muhammad Shujaat
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, South Korea
- Joe Sung Jin
  - Graduate School of Integrated Energy AI, Jeonbuk National University, Jeonju, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju, South Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, South Korea
  - Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju, South Korea
|
14
|
Bernardino M, Beiko R. Genome-scale prediction of bacterial promoters. Biosystems 2022; 221:104771. [PMID: 36099980 DOI: 10.1016/j.biosystems.2022.104771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Revised: 08/18/2022] [Accepted: 08/27/2022] [Indexed: 11/02/2022]
Abstract
A key step in the transcription of RNA is the binding of the RNA polymerase protein complex to a short promoter sequence that is typically upstream of the gene to be expressed. Automated identification of promoters would serve as a valuable complement to experimental validation in determining which genes are likely to be expressed and when; however, promoter sequences are short and highly variable, which makes them very difficult to accurately classify. The many tools developed to identify promoters in DNA have generally been tested on small and balanced subsets of genomic sequence, and the results may not reflect their expected performance on genomes with millions of DNA base pairs where promoters are likely to comprise less than ∼1% of the sequence. Here we introduce Expositor, a neural-network-based method that uses different types of DNA encodings and tunable sensitivity and specificity parameters. Expositor showed higher sensitivity and precision on the E. coli K-12 MG1655 chromosome than other tested approaches. Expositor predictions were more consistent in the homologous subset of sequence from a strain of Salmonella than they were with another strain of E. coli. We also examined the accuracy of Expositor in distinguishing different classes of promoters and found that misclassification between classes was consistent with the biological similarity between promoters.
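The point about balanced test sets versus whole genomes can be made concrete with a short calculation. The sketch below is not part of Expositor; the function name and the sensitivity and specificity values of 0.9 are assumed purely for illustration. It computes the precision expected from a classifier at a given fraction of true promoter windows.

```python
# Illustrative arithmetic only; the 0.9 values are assumptions, not results from the paper.
def expected_precision(sensitivity, specificity, prevalence):
    """Precision implied by sensitivity, specificity, and the positive fraction."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    if true_positive_rate + false_positive_rate == 0:
        return 0.0
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Same classifier, two evaluation settings.
balanced = expected_precision(0.9, 0.9, 0.5)       # balanced benchmark subset
genome_scale = expected_precision(0.9, 0.9, 0.01)  # promoters are 1% of windows

print(f"balanced: {balanced:.3f}, genome scale: {genome_scale:.3f}")
```

Under these assumed values the precision falls from 0.9 on the balanced subset to roughly 0.083 when promoters make up 1% of the windows, which is why tunable sensitivity and specificity matter at genome scale.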
Affiliation(s)
- Miria Bernardino
  - Faculty of Computer Science, Dalhousie University, Halifax, Canada.
- Robert Beiko
  - Faculty of Computer Science, Dalhousie University, Halifax, Canada.
|
15
|
PromoterLCNN: A Light CNN-Based Promoter Prediction and Classification Model. Genes (Basel) 2022; 13:genes13071126. [PMID: 35885909 PMCID: PMC9325283 DOI: 10.3390/genes13071126] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/23/2022] [Revised: 06/15/2022] [Accepted: 06/20/2022] [Indexed: 01/01/2023] Open
Abstract
Promoter identification is a fundamental step in understanding bacterial gene regulation mechanisms. However, accurate and fast classification of bacterial promoters continues to be challenging. New methods based on deep convolutional networks have been applied to identify and classify bacterial promoters recognized by sigma (σ) factors and RNA polymerase subunits which increase affinity to specific DNA sequences to modulate transcription and respond to nutritional or environmental changes. This work presents a new multiclass promoter prediction model by using convolutional neural networks (CNNs), denoted as PromoterLCNN, which classifies Escherichia coli promoters into subclasses σ70, σ24, σ32, σ38, σ28, and σ54. We present a light, fast, and simple two-stage multiclass CNN architecture for promoter identification and classification. Training and testing were performed on a benchmark dataset, part of RegulonDB. Comparative performance of PromoterLCNN against other CNN-based classifiers using four parameters (Acc, Sn, Sp, MCC) resulted in similar or better performance than those that commonly use cascade architecture, reducing time by approximately 30–90% for training, prediction, and hyperparameter optimization without compromising classification quality.
|
16
|
iProm-Zea: A two-layer model to identify plant promoters and their types using convolutional neural network. Genomics 2022; 114:110384. [PMID: 35533969 DOI: 10.1016/j.ygeno.2022.110384] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/14/2022] [Revised: 04/18/2022] [Accepted: 05/02/2022] [Indexed: 01/14/2023]
Abstract
A promoter is a short DNA sequence near the start codon, responsible for initiating the transcription of a specific gene in the genome. The accurate recognition of promoters is important for achieving a better understanding of transcriptional regulation. Because of their importance in transcriptional regulation, there is an urgent need to develop in silico tools to identify promoters and their types in a timely and accurate manner. A number of prediction methods have been developed in this regard; however, almost all of them are used only for identifying promoters and their strength or sigma types. The TATA box region in TATA promoters influences post-transcriptional processes; therefore, in the current study, we developed a two-layer predictor called "iProm-Zea" that uses a convolutional neural network (CNN) to identify TATA and TATA-less promoters. The first layer identifies a given DNA sequence as a promoter or non-promoter. The second layer identifies whether the recognized promoter is a TATA promoter. To find an optimal feature encoding scheme and model, we applied four feature encoding schemes to different machine learning and CNN algorithms, and based on the evaluation results, we selected a one-hot encoding scheme and a CNN model for iProm-Zea. The 5-fold cross-validation results demonstrated that the constructed predictor showed great potential for identifying promoters and classifying them as TATA and TATA-less promoters. Furthermore, we performed a cross-species analysis of iProm-Zea to evaluate its performance in other species. Moreover, to make it easier for experimental scientists to obtain the results they need, we established a freely accessible and user-friendly web server at http://nsclbio.jbnu.ac.kr/tools/iProm-Zea/.
|
17
|
Adaptation Potential of Three Psychrotolerant Aquatic Bacteria in the Pan-Okhotsk Region. Water 2022. [DOI: 10.3390/w14071107] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/04/2023]
Abstract
The Pan-Okhotsk region, which is part of the western North Pacific Ocean, is famous for its active volcanoes, which are part of the Pacific Ring of Fire and that enrich the surrounding waters with essential chemicals. Therefore, this region, including the Sea of Okhotsk and the Sea of Japan, is characterized by rich biota. Bacterioplankton plays a significant part in biological communities and is an indicator of ecosystem function. Analyzing the adaptability of three representatives of the microbiota of the Pan-Okhotsk region was the goal of our investigation. Marinomonas primoryensis KMM3633T (MP), Yersinia ruckeri KMM821 (YR), and Yersinia pseudotuberculosis 598 (YP) from the G.B. Elyakov Pacific Institute of Bioorganic Chemistry were studied by means of genomic and bioinformatic methods. The list of membrane translocator proteins, metabolism pathways, and cold shock and antifreeze proteins that were revealed in the genome of MP characterized this bacterium as being adaptable to free living in marine conditions, even at winter temperatures. The genomic potential of YR and YP makes not only survival in the environment of the Pan-Okhotsk region but also pathogenesis in eukaryotic organisms possible. The data obtained will serve as a basis for further ecosystem monitoring with the help of microbiota research.
|
18
|
Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023] Open
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Affiliation(s)
- Cangzhi Jia (corresponding author)
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Geoffrey I Webb
  - Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Quan Zou (corresponding author)
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Lachlan J M Coin (corresponding author)
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
- Jiangning Song (corresponding author)
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
|
19
|
Clauwaert J, Waegeman W. Novel Transformer Networks for Improved Sequence Labeling in genomics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:97-106. [PMID: 33125335 DOI: 10.1109/tcbb.2020.3035021] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/11/2023]
Abstract
In genomics, a wide range of machine learning methodologies have been investigated to annotate biological sequences for positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform previously-designed methods as a result of automated scanning for influential sequence motifs. However, those architectures do not allow for the efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited for processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for the calculation of attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks, and find it to achieve state-of-the-art performances when comparing it to specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.
|
20
|
Chevez-Guardado R, Peña-Castillo L. Promotech: a general tool for bacterial promoter recognition. Genome Biol 2021; 22:318. [PMID: 34789306 PMCID: PMC8597233 DOI: 10.1186/s13059-021-02514-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 10/31/2020] [Accepted: 10/11/2021] [Indexed: 12/14/2022] Open
Abstract
Promoters are genomic regions where the transcription machinery binds to initiate the transcription of specific genes. Computational tools for identifying bacterial promoters have been around for decades. However, most of these tools were designed to recognize promoters in one or few bacterial species. Here, we present Promotech, a machine-learning-based method for promoter recognition in a wide range of bacterial species. We compare Promotech's performance with the performance of five other promoter prediction methods. Promotech outperforms these other programs in terms of area under the precision-recall curve (AUPRC) or precision at the same level of recall. Promotech is available at https://github.com/BioinformaticsLabAtMUN/PromoTech.
Affiliation(s)
- Ruben Chevez-Guardado
  - Department of Computer Science, Memorial University of Newfoundland, 230 Elizabeth Ave, St. John's, Newfoundland, A1C 5S7, Canada
- Lourdes Peña-Castillo
  - Department of Computer Science, Memorial University of Newfoundland, 230 Elizabeth Ave, St. John's, Newfoundland, A1C 5S7, Canada
  - Department of Biology, Memorial University of Newfoundland, 230 Elizabeth Ave, St. John's, Newfoundland, A1C 5S7, Canada
|
21
|
Wilson EH, Groom JD, Sarfatis MC, Ford SM, Lidstrom ME, Beck DAC. A Computational Framework for Identifying Promoter Sequences in Nonmodel Organisms Using RNA-seq Data Sets. ACS Synth Biol 2021; 10:1394-1405. [PMID: 33988977 DOI: 10.1021/acssynbio.1c00017] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 01/30/2023]
Abstract
Engineering microorganisms into biological factories that convert renewable feedstocks into valuable materials is a major goal of synthetic biology; however, for many nonmodel organisms, we do not yet have the genetic tools, such as suites of strong promoters, necessary to effectively engineer them. In this work, we developed a computational framework that can leverage standard RNA-seq data sets to identify sets of constitutive, strongly expressed genes and predict strong promoter signals within their upstream regions. The framework was applied to a diverse collection of RNA-seq data measured for the methanotroph Methylotuvimicrobium buryatense 5GB1 and identified 25 genes that were constitutively, strongly expressed across 12 experimental conditions. For each gene, the framework predicted short (27-30 nucleotide) sequences as candidate promoters and derived -35 and -10 consensus promoter motifs (TTGACA and TATAAT, respectively) for strong expression in M. buryatense. This consensus closely matches the canonical E. coli sigma-70 motif and was found to be enriched in promoter regions of the genome. A subset of promoter predictions was experimentally validated in a XylE reporter assay, including the consensus promoter, which showed high expression. The pmoC, pqqA, and ssrA promoter predictions were additionally screened in an experiment that scrambled the -35 and -10 signal sequences, confirming that transcription initiation was disrupted when these specific regions of the predicted sequence were altered. These results indicate that the computational framework can make biologically meaningful promoter predictions and identify key pieces of regulatory systems that can serve as foundational tools for engineering diverse microorganisms for biomolecule production.
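As a rough illustration of how the -35 (TTGACA) and -10 (TATAAT) consensus motifs reported above could be used to scan an upstream region, the Python sketch below looks for both hexamers separated by a spacer. It is not the authors' framework: the function names, the one-mismatch tolerance, and the 15 to 19 nucleotide spacer range are assumptions chosen here for illustration.

```python
# Illustrative sketch only; thresholds and spacer range are assumptions.
MINUS_35 = "TTGACA"  # consensus reported in the abstract
MINUS_10 = "TATAAT"  # consensus reported in the abstract


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_promoter_candidates(sequence, max_mismatches=1, spacers=range(15, 20)):
    """Return (start of -35 box, spacer length, total mismatches) tuples."""
    sequence = sequence.upper()
    k = len(MINUS_35)
    hits = []
    for i in range(len(sequence) - k + 1):
        m35 = mismatches(sequence[i:i + k], MINUS_35)
        if m35 > max_mismatches:
            continue
        for spacer in spacers:
            j = i + k + spacer
            box = sequence[j:j + len(MINUS_10)]
            if len(box) < len(MINUS_10):
                break  # longer spacers would run past the end as well
            m10 = mismatches(box, MINUS_10)
            if m10 <= max_mismatches:
                hits.append((i, spacer, m35 + m10))
    return hits


# Synthetic test sequence: both consensus boxes with a 17-nucleotide spacer.
example = "GG" + MINUS_35 + "C" * 17 + MINUS_10 + "GG"
hits = find_promoter_candidates(example)
print(hits)
```

A simple consensus scan like this produces many false positives on real genomes, so it is best read as a way to visualize what the motifs mean rather than as a predictor.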
Affiliation(s)
- Erin H. Wilson
  - The Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, Washington 98195, United States
- Joseph D. Groom
  - Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- M. Claire Sarfatis
  - Department of Microbiology, University of Washington, Seattle, Washington 98195, United States
- Stephanie M. Ford
  - Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- Mary E. Lidstrom
  - Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
  - Department of Microbiology, University of Washington, Seattle, Washington 98195, United States
- David A. C. Beck
  - Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
  - eScience Institute, University of Washington, Seattle, Washington 98195, United States
|
22
|
Haque HMF, Rafsanjani M, Arifin F, Adilina S, Shatabda S. SubFeat: Feature subspacing ensemble classifier for function prediction of DNA, RNA and protein sequences. Comput Biol Chem 2021; 92:107489. [PMID: 33932779 DOI: 10.1016/j.compbiolchem.2021.107489] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/21/2020] [Revised: 03/07/2021] [Accepted: 04/19/2021] [Indexed: 11/16/2022]
Abstract
The information of a cell is primarily contained in deoxyribonucleic acid (DNA). DNA information flows to protein sequences via ribonucleic acids (RNA) through transcription and translation. These entities are vital for the genetic process. Recent developments in epigenetics also show the importance of the genetic material and of knowledge of its attributes and functions. However, the number of characterized features or functionalities of these entities still grows slowly because in vitro experimental methods are time-consuming and expensive. In this paper, we propose an ensemble classification algorithm called SubFeat to predict the functionalities of biological entities from different types of datasets. Our model uses a novel feature subspace-based ensemble method. It divides the feature space into sub-spaces, each of which is used to train an individual classifier model. The ensemble combines these base classifiers through a weighted majority voting mechanism. SubFeat was tested on four datasets comprising two DNA, one RNA, and one protein dataset, and it outperformed all the existing single classifiers and ensemble classifiers. SubFeat is available online as a Python-based package, along with a user manual, and is freely accessible at https://github.com/fazlulhaquejony/SubFeat.
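The two ingredients named in this abstract, splitting the feature space into sub-spaces and combining base classifiers by weighted majority voting, can be sketched in a few lines of Python. This is not the SubFeat code: the contiguous partitioning scheme, the function names, and the example weights (imagined here as validation accuracies) are all assumptions, since the abstract does not specify them.

```python
# Illustrative sketch only; partitioning scheme and weights are assumptions.
from collections import defaultdict


def split_feature_indices(n_features, n_subspaces):
    """Partition feature indices 0..n_features-1 into contiguous sub-spaces."""
    size, extra = divmod(n_features, n_subspaces)
    subspaces, start = [], 0
    for s in range(n_subspaces):
        end = start + size + (1 if s < extra else 0)
        subspaces.append(list(range(start, end)))
        start = end
    return subspaces


def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three hypothetical base classifiers, each trained on one sub-space.
subspaces = split_feature_indices(10, 3)
votes = ["promoter", "non-promoter", "non-promoter"]
weights = [0.9, 0.3, 0.4]  # hypothetical per-classifier weights
decision = weighted_majority_vote(votes, weights)
print(subspaces, decision)
```

In this example the single high-weight classifier outvotes the two low-weight ones, which is the behaviour that distinguishes weighted voting from a plain majority.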
Affiliation(s)
- H M Fazlul Haque
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
- Muhammod Rafsanjani
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
- Fariha Arifin
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
- Sheikh Adilina
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh
- Swakkhar Shatabda
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Badda, Dhaka 1212, Bangladesh.
|
23
|
Clauwaert J, Menschaert G, Waegeman W. Explainability in transformer models for functional genomics. Brief Bioinform 2021; 22:6214646. [PMID: 33834200 PMCID: PMC8425421 DOI: 10.1093/bib/bbab060] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/28/2020] [Revised: 01/28/2021] [Accepted: 02/05/2021] [Indexed: 11/16/2022] Open
Abstract
The effectiveness of deep learning methods can be largely attributed to the automated extraction of relevant features from raw data. In the field of functional genomics, this generally concerns the automatic selection of relevant nucleotide motifs from DNA sequences. To benefit from automated learning methods, new strategies are required that unveil the decision-making process of trained models. In this paper, we present a new approach that has been successful in gathering insights on the transcription process in Escherichia coli. This work builds upon a transformer-based neural network framework designed for prokaryotic genome annotation purposes. We find that the majority of subunits (attention heads) of the model are specialized towards identifying transcription factors and are able to successfully characterize both their binding sites and consensus sequences, uncovering both well-known and potentially novel elements involved in the initiation of the transcription process. With the specialization of the attention heads occurring automatically, we believe transformer models to be of high interest towards the creation of explainable neural networks in this field.
Affiliation(s)
- Jim Clauwaert
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
- Gerben Menschaert
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
- Willem Waegeman
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
|
24
|
Zhu Y, Li F, Xiang D, Akutsu T, Song J, Jia C. Computational identification of eukaryotic promoters based on cascaded deep capsule neural networks. Brief Bioinform 2020; 22:5998831. [PMID: 33227813 DOI: 10.1093/bib/bbaa299] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 08/24/2020] [Revised: 10/01/2020] [Accepted: 10/07/2020] [Indexed: 12/26/2022] Open
Abstract
A promoter is a region in the DNA sequence that defines where the transcription of a gene by RNA polymerase initiates, and it is typically located proximal to the transcription start site (TSS). Correctly identifying the gene TSS and the core promoter is essential for understanding the transcriptional regulation of genes. As a complement to conventional experimental methods, computational techniques with easy-to-use platforms can be effectively applied to annotate the functions and physiological roles of promoters. In this work, we propose a deep learning-based method termed Depicter (Deep learning for predicting promoter) for identifying three specific types of promoters, i.e. promoter sequences with the TATA-box (TATA model), promoter sequences without the TATA-box (non-TATA model), and indistinguishable promoters (TATA and non-TATA model). Depicter is developed based on an up-to-date, species-specific dataset which includes Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana promoters. A convolutional neural network coupled with capsule layers is proposed to train and optimize the prediction model of Depicter. Extensive benchmarking and independent tests demonstrate that Depicter achieves an improved predictive performance compared with several state-of-the-art methods. The webserver of Depicter is implemented and freely accessible at https://depicter.erc.monash.edu/.
Affiliation(s)
- Yan Zhu
  - School of Science, Dalian Maritime University, China
- Fuyi Li
  - Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Cangzhi Jia
  - College of Science, Dalian Maritime University
|
25
|
Ahmed S, Hossain Z, Uddin M, Taherzadeh G, Sharma A, Shatabda S, Dehzangi A. Accurate prediction of RNA 5-hydroxymethylcytosine modification by utilizing novel position-specific gapped k-mer descriptors. Comput Struct Biotechnol J 2020; 18:3528-3538. [PMID: 33304452 PMCID: PMC7701324 DOI: 10.1016/j.csbj.2020.10.032] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 07/24/2020] [Revised: 10/30/2020] [Accepted: 10/30/2020] [Indexed: 12/13/2022] Open
Abstract
RNA modification is an essential step towards the generation of new RNA structures. Such modification can potentially alter RNA function or stability. Among different modifications, 5-hydroxymethylcytosine (5hmC) modification of RNA exhibits significant potential for a series of biological processes. Understanding the distribution of 5hmC in RNA is essential to determine its biological functionality. Although conventional sequencing techniques allow broad identification of 5hmC, they are both time-consuming and resource-intensive. In this study, we propose a new computational tool called iRNA5hmC-PS to tackle this problem. To build iRNA5hmC-PS we extract a set of novel sequence-based features called Position-Specific Gapped k-mer (PSG k-mer) to obtain maximum sequential information. Our feature analysis shows that our proposed PSG k-mer features contain vital information for the identification of 5hmC sites. We also use a group-wise feature importance calculation strategy to select a small subset of features containing maximum discriminative information. Our experimental results demonstrate that iRNA5hmC-PS is able to dramatically enhance the prediction performance. iRNA5hmC-PS achieves 78.3% prediction performance, which is 12.8% better than that reported in previous studies. iRNA5hmC-PS is publicly available as an online tool at http://103.109.52.8:81/iRNA5hmC-PS. Its benchmark dataset, source code, and documentation are available at https://github.com/zahid6454/iRNA5hmC-PS.
Affiliation(s)
- Sajid Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Zahid Hossain
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Mahtab Uddin
- Department of Natural Science, United International University, Dhaka, Bangladesh
- Ghazaleh Taherzadeh
- Institute for Bioscience and Biotechnology Research, University of Maryland, College Park, MD 20742, USA
- Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD 4111, Australia; Department of Medical Science Mathematics, Tokyo Medical and Dental University (TMDU), Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan; School of Engineering and Physics, University of the South Pacific, Suva, Fiji
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Abdollah Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ 08102, USA; Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA
|
26
|
Abstract
The promoter region is a key element required for the production of RNA in bacteria. While new high-throughput technology allows massively parallel mapping of promoter elements, we still mainly rely on bioinformatics tools to predict such elements in bacterial genomes. Additionally, despite many different prediction tools having become popular to identify bacterial promoters, no systematic comparison of such tools has been performed. Here, we performed a systematic comparison between several widely used promoter prediction tools (BPROM, bTSSfinder, BacPP, CNNProm, IBBP, Virtual Footprint, iPro70-FMWin, 70ProPred, iPromoter-2L, and MULTiPly) using well-defined sequence data sets and standardized metrics to determine how well those tools performed relative to each other. For this, we used data sets of experimentally validated promoters from Escherichia coli and a control data set composed of randomly generated sequences with similar nucleotide distributions. We compared the performance of the tools using metrics such as specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC). We show that the widely used BPROM presented the worst performance among the compared tools, while four tools (CNNProm, iPro70-FMWin, 70ProPred, and iPromoter-2L) offered high predictive power. Of these tools, iPro70-FMWin exhibited the best results for most of the metrics used. We present here some potentials and limitations of available tools, and we hope that future work can build upon our effort to systematically characterize this useful class of bioinformatics tools. IMPORTANCE The correct mapping of promoter elements is a crucial step in microbial genomics. Also, when combining new DNA elements into synthetic sequences, predicting the potential generation of new promoter sequences is critical. Over the last years, many bioinformatics tools have been created to allow users to predict promoter elements in a sequence or genome of interest. Here, we assess the predictive power of some of the main prediction tools available using well-defined promoter data sets. Using Escherichia coli as a model organism, we demonstrated that while some tools are biased toward AT-rich sequences, others are very efficient in identifying real promoters with low false-negative rates. We hope the potentials and limitations presented here will help the microbiology community to choose promoter prediction tools among many available alternatives.
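As an illustration of the evaluation setup this abstract describes, the following minimal Python sketch builds a control set of random sequences whose base frequencies follow a given promoter set, and computes sensitivity, specificity, accuracy, and MCC from confusion-matrix counts. It is not the authors' code: the function names `random_control` and `confusion_metrics`, the pooled-frequency sampling scheme, and the example counts are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def random_control(sequences, n, length, seed=0):
    """Generate n random DNA sequences whose base frequencies follow the
    pooled A/C/G/T frequencies of `sequences` (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter("".join(sequences).upper())
    bases = "ACGT"
    # Weights are raw base counts; at least one must be non-zero.
    weights = [counts[b] for b in bases]
    return ["".join(rng.choices(bases, weights=weights, k=length))
            for _ in range(n)]


def confusion_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


if __name__ == "__main__":
    promoters = ["TTGACATATAAT", "TTGACATTTAAT"]  # toy, made-up sequences
    controls = random_control(promoters, n=3, length=12)
    print(controls)
    # Made-up counts: 80 promoters found, 20 missed, 10 false alarms.
    print(confusion_metrics(tp=80, fp=10, tn=90, fn=20))
```

MCC is useful alongside accuracy because it stays informative when the positive and negative sets differ in size, which is common in promoter benchmarks.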
|
27
|
Amin R, Rahman CR, Ahmed S, Sifat MHR, Liton MNK, Rahman MM, Khan MZH, Shatabda S. iPromoter-BnCNN: a novel branched CNN-based predictor for identifying and classifying sigma promoters. Bioinformatics 2020; 36:4869-4875. [DOI: 10.1093/bioinformatics/btaa609] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 12/25/2019] [Revised: 05/19/2020] [Accepted: 06/24/2020] [Indexed: 11/14/2022] Open
Abstract
Motivation
A promoter is a short region of DNA that is responsible for initiating transcription of specific genes. Computational tools for automatic identification of promoters are in high demand. Promoters can be divided into different types according to their functions. They may show both intra- and interclass variation and similarity in terms of consensus sequences. Accurate classification of the various types of sigma promoters remains a challenge.
Results
We present iPromoter-BnCNN for identification and accurate classification of six types of promoters: σ24, σ28, σ32, σ38, σ54 and σ70. It is a CNN-based classifier which combines local features related to monomer nucleotide sequence, trimer nucleotide sequence, dimer structural properties and trimer structural properties through the use of parallel branching. We conducted experiments on a benchmark dataset and compared our classifier with six state-of-the-art tools to show its superior performance under 5-fold cross-validation. Moreover, we tested our classifier on an independent test dataset.
Availability and implementation
The web server of our proposed tool, iPromoter-BnCNN, is freely available at http://103.109.52.8/iPromoter-BnCNN. The runnable source code can be found at https://colab.research.google.com/drive/1yWWh7BXhsm8U4PODgPqlQRy23QGjF2DZ.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ruhul Amin
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
- Chowdhury Rafeed Rahman
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
- Sajid Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
- Md Habibur Rahman Sifat
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
- Md Nazmul Khan Liton
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
- Md Moshiur Rahman
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
- Md Zahid Hossain Khan
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka 1207, Bangladesh
|
28
|
Chen YL, Guo DH, Li QZ. An energy model for recognizing the prokaryotic promoters based on molecular structure. Genomics 2019; 112:2072-2079. [PMID: 31809797 DOI: 10.1016/j.ygeno.2019.12.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/27/2019] [Revised: 11/06/2019] [Accepted: 12/01/2019] [Indexed: 11/19/2022]
Abstract
The promoter is an important functional element of DNA sequences that is in charge of gene transcription initiation. Recognizing promoters is important for understanding the related biological phenomena. Based on the concept that a promoter is mainly determined by its sequence and structure, a novel statistical physics model for predicting promoters in Escherichia coli K-12 is proposed. The total energies of the DNA local structure of sequence segments in three benchmark promoter sequence datasets, the sole prediction parameter, are calculated using principles from statistical physics and information theory. Improved results are obtained, and a web server, PhysMPrePro, for predicting promoters is established at http://202.207.14.87:8032/bioinformation/PhysMPrePro/index.asp, so that other scientists can easily obtain their desired results.
Affiliation(s)
- Ying-Li Chen
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; The State key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot 010070, China.
- Dong-Hua Guo
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; The State key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot 010070, China.
|
29
|
iN6-methylat (5-step): identifying DNA N6-methyladenine sites in rice genome using continuous bag of nucleobases via Chou’s 5-step rule. Mol Genet Genomics 2019; 294:1173-1182. [DOI: 10.1007/s00438-019-01570-y] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Received: 02/03/2019] [Accepted: 04/25/2019] [Indexed: 12/21/2022]
|