1
Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023] [Open access]
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Affiliation(s)
- Geoffrey I Webb
- Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Corresponding authors (Cangzhi Jia, Quan Zou, Lachlan J M Coin, Jiangning Song)
- Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China
2
Rahman MS, Aktar U, Jani MR, Shatabda S. iPro70-FMWin: identifying Sigma70 promoters using multiple windowing and minimal features. Mol Genet Genomics 2018; 294:69-84. [PMID: 30187132 DOI: 10.1007/s00438-018-1487-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 07/13/2018] [Accepted: 08/29/2018] [Indexed: 01/16/2023]
Abstract
In bacterial DNA, there are specific sequences of nucleotides called promoters that can bind to the RNA polymerase. Sigma70 (σ70) is one of the most important promoter sequences due to its presence in most of the DNA regulatory functions. In this paper, we identify the most effective and optimal sequence-based features for prediction of σ70 promoter sequences in a bacterial genome. We used both short-range and long-range DNA sequences in our proposed method. A very small number of effective features are selected from a large number of the extracted features using multiple windows of different sizes within the DNA sequences. We call our prediction method iPro70-FMWin and have made it freely accessible online via a web application at http://ipro70.pythonanywhere.com/server for the convenience of researchers. We have tested our method using a standard benchmark dataset. In the experiments, iPro70-FMWin achieved an area under the receiver operating characteristic curve of 0.959 and an accuracy of 90.57%, which significantly outperforms the state-of-the-art predictors.
Affiliation(s)
- Md Siddiqur Rahman
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
- Usma Aktar
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
- Md Rafsan Jani
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
3
Rahman MS, Aktar U, Jani MR, Shatabda S. iPromoter-FSEn: Identification of bacterial σ70 promoter sequences using feature subspace based ensemble classifier. Genomics 2018; 111:1160-1166. [PMID: 30059731 DOI: 10.1016/j.ygeno.2018.07.011] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 04/25/2018] [Revised: 07/07/2018] [Accepted: 07/12/2018] [Indexed: 10/28/2022]
Abstract
Sigma promoter sequences in bacterial genomes are important due to their role in transcription initiation. Sigma 70 is one of the most important sigma factors. In this paper, we address the problem of identifying σ70 promoter sequences in bacterial genomes. We propose iPromoter-FSEn, a novel predictor for identification of σ70 promoter sequences. Our proposed method is based on a feature subspace based ensemble classifier. A large set of features extracted from the nucleotide sequence is divided into subsets, and each subset is given to an individual classifier to learn. An ensemble voting classifier then aggregates the decisions of the individual classifiers into a final decision. We tested our method on a standard benchmark dataset extracted from experimentally validated results. Experimental results show that iPromoter-FSEn significantly improves over the state-of-the-art σ70 promoter sequence predictors. The accuracy and area under the receiver operating characteristic curve of iPromoter-FSEn are 86.32% and 0.9319, respectively. We have also made our method readily available as a web application at http://ipromoterfsen.pythonanywhere.com/server.
Affiliation(s)
- Md Siddiqur Rahman
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
- Usma Aktar
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
- Md Rafsan Jani
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
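The feature-subspace ensemble idea in the iPromoter-FSEn abstract (split a large feature set into subsets, train one classifier per subset, combine by voting) can be sketched roughly as follows. This is only an illustration under stated assumptions: the dinucleotide features, the interleaved split, and the nearest-centroid base learner are hypothetical stand-ins chosen to keep the example dependency-free, not the features or classifiers used by the authors.

```python
from collections import Counter
from itertools import product


def kmer_features(seq, k=2):
    """Normalized k-mer frequencies in a fixed lexicographic order (assumed feature set)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[m] / total for m in kmers]


def split_subspaces(n_features, n_subsets):
    """Partition feature indices into n_subsets interleaved groups."""
    return [list(range(i, n_features, n_subsets)) for i in range(n_subsets)]


class CentroidClassifier:
    """Toy base learner: predicts the class whose mean feature vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(sorted(self.centroids), key=sq_dist)


class SubspaceEnsemble:
    """One base learner per feature subset, combined by majority vote."""

    def __init__(self, n_subsets=5):
        # An odd number of voters avoids ties in a two-class problem.
        self.n_subsets = n_subsets

    def fit(self, X, y):
        self.subspaces = split_subspaces(len(X[0]), self.n_subsets)
        self.models = [
            CentroidClassifier().fit([[x[i] for i in idx] for x in X], y)
            for idx in self.subspaces
        ]
        return self

    def predict_one(self, x):
        votes = Counter(
            model.predict_one([x[i] for i in idx])
            for model, idx in zip(self.models, self.subspaces)
        )
        return votes.most_common(1)[0][0]


# Toy data: AT-rich sequences labelled 1, GC-rich sequences labelled 0.
train_seqs = ["TATAATTATAATTATA", "TTATAATATATAATAT",
              "GCGCGGCCGCGGCGCC", "CCGGCGCGCCGGGCGC"]
train_labels = [1, 1, 0, 0]
model = SubspaceEnsemble().fit([kmer_features(s) for s in train_seqs], train_labels)
```

A sequence is classified by computing its features and calling `model.predict_one(kmer_features(seq))`. In practice the base learners would be stronger classifiers and the feature set far larger; the voting structure is the part this sketch is meant to show.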
4
Abstract
Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codons can be misread by release factors and stop codons by tRNAs; in rare cases, such misreading also contributes to codon usage. This chapter outlines the conceptual framework of codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index of codon adaptation that accounts for background mutation bias (Index of Translation Elongation) is presented and contrasted with the codon adaptation index (CAI), which does not consider background mutation bias. Both are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production. The reanalysis disproves the claim.
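As a rough illustration of the codon adaptation index (CAI) mentioned in the abstract above, the sketch below computes each codon's relative adaptiveness (its count divided by the count of the most-used synonymous codon in a reference gene set) and then takes the geometric mean over a gene's codons. The function names and the pseudocount of 0.5 for codons absent from the reference set are assumptions made for this example; the Index of Translation Elongation from the chapter is not reproduced here.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(zip(CODONS, AMINO))


def codons_of(seq):
    """Split a coding sequence into complete codons (a trailing partial codon is dropped)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """w[codon] = codon count / count of the most-used synonymous codon."""
    counts = Counter(
        c for gene in reference_genes for c in codons_of(gene) if c in CODON_TO_AA
    )
    families = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        families[aa].append(codon)
    w = {}
    for aa, family in families.items():
        # Stop codons and single-codon amino acids (Met, Trp) carry no usage signal.
        if aa == "*" or len(family) == 1:
            continue
        # Assumption: unseen codons get a small pseudocount so log(w) stays finite.
        adjusted = {c: counts[c] or pseudocount for c in family}
        top = max(adjusted.values())
        for c in family:
            w[c] = adjusted[c] / top
    return w


def cai(gene, w):
    """Geometric mean of relative adaptiveness over the gene's informative codons."""
    logs = [math.log(w[c]) for c in codons_of(gene) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set: leucine encoded three times by CTG and once by CTA.
w = relative_adaptiveness(["CTGCTGCTGCTA"])
```

With this toy reference, a gene built only from CTG scores higher than one built from CTA, which is the behaviour CAI is designed to capture. Because the weights come from raw counts alone, nothing here corrects for background mutation bias, which is the limitation the abstract attributes to CAI.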
5
Cong Y, Gao L, Zhang Y, Xian Y, Hua Z, Elaasar H, Shen L. Quantifying promoter activity during the developmental cycle of Chlamydia trachomatis. Sci Rep 2016; 6:27244. [PMID: 27263495 PMCID: PMC4893696 DOI: 10.1038/srep27244] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/12/2016] [Accepted: 05/10/2016] [Indexed: 11/09/2022] [Open access]
Abstract
Chlamydia trachomatis is an important human pathogen that undergoes a characteristic development cycle correlating with stage-specific gene expression profiles. Taking advantage of recent developments in the genetic transformation in C. trachomatis, we constructed a versatile green fluorescent protein (GFP) reporter system to study the development-dependent function of C. trachomatis promoters in an attempt to elucidate the mechanism that controls C. trachomatis adaptability. We validated the use of the GFP reporter system by visualizing the activity of an early euo gene promoter. Additionally, we uncovered a new ompA promoter, which we named P3, utilizing the GFP reporter system combined with 5' rapid amplification of cDNA ends (RACE), in vitro transcription assays, real-time quantitative RT-PCR (RT-qPCR), and flow cytometry. Mutagenesis of the P3 region verifies that P3 is a new class of C. trachomatis σ(66)-dependent promoter, which requires an extended -10 TGn motif for transcription. These results corroborate complex developmentally controlled ompA expression in C. trachomatis. The exploitation of genetically labeled C. trachomatis organisms with P3-driven GFP allows for the observation of changes in ompA expression in response to developmental signals. The results of this study could be used to complement previous findings and to advance understanding of C. trachomatis genetic expression.
Affiliation(s)
- Yanguang Cong
- Department of Microbiology, Immunology, and Parasitology, Louisiana State University Health Sciences Center, New Orleans, LA 70112, USA; Department of Microbiology, Third Military Medical University, Chongqing, China, 400038
- Leiqiong Gao
- Department of Neonatology, Children's Hospital of Chongqing Medical University, Ministry of Education Key Laboratory of Child Development and Disorders, Chongqing Key Laboratory of Pediatrics, Chongqing, China, 400014
- Yan Zhang
- Department of Neonatology, Children's Hospital of Chongqing Medical University, Ministry of Education Key Laboratory of Child Development and Disorders, Chongqing Key Laboratory of Pediatrics, Chongqing, China, 400014
- Yuqi Xian
- Department of Neonatology, Children's Hospital of Chongqing Medical University, Ministry of Education Key Laboratory of Child Development and Disorders, Chongqing Key Laboratory of Pediatrics, Chongqing, China, 400014
- Ziyu Hua
- Department of Neonatology, Children's Hospital of Chongqing Medical University, Ministry of Education Key Laboratory of Child Development and Disorders, Chongqing Key Laboratory of Pediatrics, Chongqing, China, 400014
- Hiba Elaasar
- Department of Microbiology, Immunology, and Parasitology, Louisiana State University Health Sciences Center, New Orleans, LA 70112, USA
- Li Shen
- Department of Microbiology, Immunology, and Parasitology, Louisiana State University Health Sciences Center, New Orleans, LA 70112, USA
6
Xia X. Position weight matrix, Gibbs sampler, and the associated significance tests in motif characterization and prediction. Scientifica 2012; 2012:917540. [PMID: 24278755 PMCID: PMC3820676 DOI: 10.6064/2012/917540] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 08/22/2012] [Accepted: 10/11/2012] [Indexed: 05/31/2023]
Abstract
Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, 30 Marie Curie, Ottawa, ON, Canada K1N 6N5
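To make the position weight matrix (PWM) and PWM score (PWMS) from the abstract above concrete, here is a minimal sketch that builds a log-odds PWM from aligned sites and slides it along a sequence. The uniform background, the pseudocount of 1, and the function names are assumptions for illustration; the significance tests the review discusses (false discovery rate, resampling, extreme value distribution) are not implemented.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Log2-odds PWM from equal-length aligned sites, with pseudocounts."""
    background = background or {b: 0.25 for b in ALPHABET}  # assumed uniform
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in ALPHABET
        })
    return pwm


def pwm_score(pwm, segment):
    """PWM score of one segment: sum of per-position log-odds values."""
    return sum(column[base] for column, base in zip(pwm, segment))


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (pwm_score(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


# Toy aligned sites resembling a -10 element (hypothetical data).
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
score, start = best_hit(pwm, "GGCGCTATAATGCGC")
```

Here `start` is the offset of the best-matching window in the scanned sequence. Whether such a score is statistically meaningful is exactly the question the review addresses, so a real analysis would follow the scan with one of the tests it describes.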
7
Weber SDS, Sant'Anna FH, Schrank IS. Unveiling Mycoplasma hyopneumoniae promoters: sequence definition and genomic distribution. DNA Res 2012; 19:103-15. [PMID: 22334569 PMCID: PMC3325076 DOI: 10.1093/dnares/dsr045] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022] [Open access]
Abstract
Several Mycoplasma species have had their genomes completely sequenced, including four strains of the swine pathogen Mycoplasma hyopneumoniae. Nevertheless, little is known about the nucleotide sequences that control transcriptional initiation in these microorganisms. Therefore, with the objective of investigating the promoter sequences of M. hyopneumoniae, 23 transcriptional start sites (TSSs) of distinct genes were mapped. A pattern that resembles the σ70 promoter −10 element was found upstream of the TSSs. However, no −35 element was distinguished. Instead, an AT-rich periodic signal was identified. About half of the experimentally defined promoters contained the motif 5′-TRTGn-3′, which was identical to the −16 element usually found in Gram-positive bacteria. The defined promoters were utilized to build position-specific scoring matrices in order to scan putative promoters upstream of all coding sequences (CDSs) in the M. hyopneumoniae genome. Two hundred and one signals were found associated with 169 CDSs. Most of these sequences were located within 100 nucleotides of the start codons. This study has shown that promoter-like sequences occur in the M. hyopneumoniae genome more frequently than expected by chance, indicating that most of the sequences detected are probably biologically functional.
Affiliation(s)
- Shana de Souto Weber
- Centro de Biotecnologia, Programa de Pós-graduação em Biologia Celular e Molecular, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil
8
Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010; 130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]
9
Mallios RR, Ojcius DM, Ardell DH. An iterative strategy combining biophysical criteria and duration hidden Markov models for structural predictions of Chlamydia trachomatis sigma66 promoters. BMC Bioinformatics 2009; 10:271. [PMID: 19715597 PMCID: PMC2743672 DOI: 10.1186/1471-2105-10-271] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 02/11/2009] [Accepted: 08/28/2009] [Indexed: 11/10/2022] [Open access]
Abstract
BACKGROUND: Promoter identification is a first step in the quest to explain gene regulation in bacteria. It has been demonstrated that the initiation of bacterial transcription depends upon the stability and topology of DNA in the promoter region as well as the binding affinity between the RNA polymerase sigma-factor and promoter. However, promoter prediction algorithms to date have not explicitly used an ensemble of these factors as predictors. In addition, most promoter models have been trained on data from Escherichia coli. Although it has been shown that transcriptional mechanisms are similar among various bacteria, it is quite possible that the differences between Escherichia coli and Chlamydia trachomatis are large enough to recommend an organism-specific modeling effort.
RESULTS: Here we present an iterative stochastic model building procedure that combines such biophysical metrics as DNA stability, curvature, twist and stress-induced DNA duplex destabilization along with duration hidden Markov model parameters to model Chlamydia trachomatis sigma66 promoters from 29 experimentally verified sequences. Initially, iterative duration hidden Markov modeling of the training set sequences provides a scoring algorithm for Chlamydia trachomatis RNA polymerase sigma66/DNA binding. Subsequently, an iterative application of Stepwise Binary Logistic Regression selects multiple promoter predictors and deletes/replaces training set sequences to determine an optimal training set. The resulting model predicts the final training set with a high degree of accuracy and provides insights into the structure of the promoter region. Model-based genome-wide predictions are provided so that optimal promoter candidates can be experimentally evaluated and refined models developed. Co-predictions with three other algorithms are also supplied to enhance reliability.
CONCLUSION: This strategy and the resulting model support the conjecture that DNA biophysical properties and RNA polymerase sigma-factor/DNA binding collaboratively contribute to a sequence's ability to promote transcription. This work provides a baseline model that can evolve as new Chlamydia trachomatis sigma66 promoters are identified with assistance from the provided genome-wide predictions. The proposed methodology is ideal for organisms with few identified promoters and relatively small genomes.
Affiliation(s)
- Ronna R Mallios
- School of Natural Sciences, University of California, Merced, CA 95344, USA
10
Towsey M, Timms P, Hogan J, Mathews SA. The cross-species prediction of bacterial promoters using a support vector machine. Comput Biol Chem 2008; 32:359-66. [DOI: 10.1016/j.compbiolchem.2008.07.009] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 06/27/2007] [Revised: 05/01/2008] [Accepted: 07/06/2008] [Indexed: 10/21/2022]
11
Phylogenetic comparison of the known Chlamydia trachomatis sigma(66) promoters across to Chlamydia pneumoniae and Chlamydia caviae identifies seven poorly conserved promoters. Res Microbiol 2008; 159:550-6. [PMID: 18708139 DOI: 10.1016/j.resmic.2008.07.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/08/2008] [Revised: 07/08/2008] [Accepted: 07/10/2008] [Indexed: 11/20/2022]
Abstract
We used four different phylogenetic footprinting programs and the six chlamydial species with publicly available whole genome sequences to analyze the 12 known sigma(66) promoters of Chlamydia trachomatis that phylogenetically footprinted negative in our previous paper. The analysis showed that 7 of the 12 promoters were poorly conserved across C. trachomatis, Chlamydia pneumoniae and Chlamydia caviae. Interestingly, the associated gene sets for these seven promoters were homologs and the gene orders were well conserved across these three species. Additional phylogenetic footprinting, across different subsets from that used above, of the six publicly available whole chlamydial genome sequences and transcription initiation site mapping of chlamydial promoters was also performed. This analysis showed that two of the seven poorly conserved promoters, the promoters in the upstream regions of C. caviae ltuA and ltuB, were like Escherichia coli sigma(70) promoters. Therefore, these promoters are similar to the promoters of C. trachomatis ltuA and ltuB, as they are sigma(70)-like. Given the fact that 7 out of the 22 known sigma(66) promoters in C. trachomatis are poorly conserved across C. trachomatis, C. pneumoniae and C. caviae, we would like to suggest that many other chlamydial promoters are poorly conserved across these species.