1
Karp PD, Paley S, Caspi R, Kothari A, Krummenacker M, Midford PE, Moore LR, Subhraveti P, Gama-Castro S, Tierrafria VH, Lara P, Muñiz-Rascado L, Bonavides-Martinez C, Santos-Zavaleta A, Mackie A, Sun G, Ahn-Horst TA, Choi H, Juenemann R, Knudsen CNM, Covert MW, Collado-Vides J, Paulsen I. The EcoCyc database (2025). EcoSal Plus 2025:eesp00192024. [PMID: 40304522 DOI: 10.1128/ecosalplus.esp-0019-2024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2024] [Accepted: 03/18/2025] [Indexed: 05/02/2025]
Abstract
EcoCyc is a bioinformatics database (DB) available at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project was to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene product, metabolite, reaction, operon, and metabolic pathway. The database also includes information on the regulation of gene expression, E. coli gene essentiality, and nutrient conditions that do or do not support the growth of E. coli. The website and downloadable software contain tools for the analysis of high-throughput data sets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc and can be executed via EcoCyc.org. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. Data generated from a whole-cell model that is parameterized from the latest data on EcoCyc is also available. This review outlines the data content of EcoCyc and the procedures by which this content is generated.
Affiliation(s)
- Peter D Karp
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Suzanne Paley
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Ron Caspi
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Anamika Kothari
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Markus Krummenacker
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Peter E Midford
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Lisa R Moore
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Pallavi Subhraveti
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Socorro Gama-Castro
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Víctor H Tierrafria
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Paloma Lara
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Luis Muñiz-Rascado
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- César Bonavides-Martinez
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Alberto Santos-Zavaleta
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Amanda Mackie
  - School of Natural Sciences, Macquarie University, Sydney, New South Wales, Australia
- Gwanggyu Sun
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Travis A Ahn-Horst
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Heejo Choi
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Riley Juenemann
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Cyrus N M Knudsen
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Markus W Covert
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Ian Paulsen
  - School of Natural Sciences, Macquarie University, Sydney, New South Wales, Australia
2
Lei R, Jia J, Qin L, Wei X. iPro2L-DG: Hybrid network based on improved densenet and global attention mechanism for identifying promoter sequences. Heliyon 2024; 10:e27364. [PMID: 38510021 PMCID: PMC10950492 DOI: 10.1016/j.heliyon.2024.e27364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 02/24/2024] [Accepted: 02/28/2024] [Indexed: 03/22/2024] Open
Abstract
The promoter is a key DNA sequence whose primary function is to control the timing of transcription initiation and the level of gene expression. Accurate identification of promoters is essential for understanding gene expression. Traditional sequencing techniques for identifying promoters are costly and time-consuming. Therefore, the development of computational methods to identify promoters has become critical. Since deep learning methods show great potential in identifying promoters, this study proposes a new promoter prediction model, called iPro2L-DG. The iPro2L-DG predictor is built on an improved Densely Connected Convolutional Network (DenseNet) and a Global Attention Mechanism (GAM). The promoter sequences are encoded by combining C2 encoding with nucleotide chemical property (NCP) encoding. An improved DenseNet extracts high-level feature information from the combined feature encoding. GAM evaluates the importance of this feature information along the channel and spatial dimensions, and a fully connected neural network (FNN) then derives the prediction probabilities. The experimental results showed that the accuracy of iPro2L-DG in the first layer (promoter identification) was 94.10%, with a Matthews correlation coefficient of 0.8833. In the second layer (promoter strength prediction), the accuracy was 89.42%, with a Matthews correlation coefficient of 0.7915. The iPro2L-DG predictor significantly outperforms other existing predictors in promoter identification and promoter strength prediction. Therefore, our proposed model iPro2L-DG is the most advanced promoter prediction tool. The source code of the iPro2L-DG model can be found at https://github.com/leirufeng/iPro2L-DG.
Affiliation(s)
- Rufeng Lei
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Jianhua Jia
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Lulu Qin
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Xin Wei
  - Business School, Jiangxi Institute of Fashion Technology, Nanchang, 330044, China
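The nucleotide chemical property (NCP) encoding named in the iPro2L-DG abstract can be illustrated with a short sketch. The abstract does not list the mapping, so the table below is an assumption: it follows the convention commonly used in the sequence-encoding literature, where each base is described by ring structure, functional group, and hydrogen-bond strength. The function name `ncp_encode` is hypothetical and is not taken from the authors' repository.

```python
# Assumed NCP table (not given in the abstract): each base maps to
# (ring structure, functional group, hydrogen bonding), where
#   ring structure:   1 = purine (A, G),  0 = pyrimidine (C, T)
#   functional group: 1 = amino (A, C),   0 = keto (G, T)
#   hydrogen bonding: 1 = weak (A, T),    0 = strong (C, G)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def ncp_encode(seq):
    """Return one 3-tuple per base; unknown symbols map to (0, 0, 0)."""
    return [NCP.get(base, (0, 0, 0)) for base in seq.upper()]


if __name__ == "__main__":
    for base, code in zip("ACGT", ncp_encode("ACGT")):
        print(base, code)
```

An encoding like this yields a fixed-width numeric matrix (sequence length by 3) that a convolutional network can consume; how iPro2L-DG concatenates it with the C2 encoding is not described in the abstract.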
3
Karp PD, Paley S, Caspi R, Kothari A, Krummenacker M, Midford PE, Moore LR, Subhraveti P, Gama-Castro S, Tierrafria VH, Lara P, Muñiz-Rascado L, Bonavides-Martinez C, Santos-Zavaleta A, Mackie A, Sun G, Ahn-Horst TA, Choi H, Covert MW, Collado-Vides J, Paulsen I. The EcoCyc Database (2023). EcoSal Plus 2023; 11:eesp00022023. [PMID: 37220074 PMCID: PMC10729931 DOI: 10.1128/ecosalplus.esp-0002-2023] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/13/2023] [Accepted: 04/04/2023] [Indexed: 01/28/2024]
Abstract
EcoCyc is a bioinformatics database available online at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project is to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene product, metabolite, reaction, operon, and metabolic pathway. The database also includes information on the regulation of gene expression, E. coli gene essentiality, and nutrient conditions that do or do not support the growth of E. coli. The website and downloadable software contain tools for the analysis of high-throughput data sets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc and can be executed online. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. Data generated from a whole-cell model that is parameterized from the latest data on EcoCyc are also available. This review outlines the data content of EcoCyc and the procedures by which this content is generated.
Affiliation(s)
- Peter D. Karp
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Suzanne Paley
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Ron Caspi
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Anamika Kothari
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Markus Krummenacker
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Peter E. Midford
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Lisa R. Moore
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Pallavi Subhraveti
  - Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Socorro Gama-Castro
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Victor H. Tierrafria
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Paloma Lara
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Luis Muñiz-Rascado
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- César Bonavides-Martinez
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Alberto Santos-Zavaleta
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Amanda Mackie
  - Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, New South Wales, Australia
- Gwanggyu Sun
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Travis A. Ahn-Horst
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Heejo Choi
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Markus W. Covert
  - Department of Bioengineering, Stanford University, Stanford, California, USA
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Ian Paulsen
  - School of Natural Sciences, Macquarie University, Sydney, New South Wales, Australia
4
Mai DHA, Nguyen LT, Lee EY. TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT. Front Genet 2022; 13:1067562. [PMID: 36523764 PMCID: PMC9745317 DOI: 10.3389/fgene.2022.1067562] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/12/2022] [Accepted: 11/17/2022] [Indexed: 07/30/2023] Open
Abstract
Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and feature analysis. The learning transfer ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.
5
DeeProPre: A promoter predictor based on deep learning. Comput Biol Chem 2022; 101:107770. [PMID: 36116322 DOI: 10.1016/j.compbiolchem.2022.107770] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/09/2022] [Revised: 08/06/2022] [Accepted: 09/11/2022] [Indexed: 11/21/2022]
Abstract
The promoter is a DNA sequence recognized, bound and transcribed by RNA polymerase. It is usually located upstream of, or at the 5' end of, the transcription start site (TSS). Studies have shown that the structure of the promoter affects its affinity for RNA polymerase, thus affecting the level of gene expression. Therefore, the correct identification of core promoters and common structural genes is of great significance in the field of biomedicine. At present, many methods have been proposed to improve the accuracy of promoter recognition, but their performance still needs to be improved. In this study, a deep learning algorithm (DeeProPre) based on bidirectional long short-term memory (BiLSTM) and a convolutional neural network (CNN) was proposed. First, a supervised embedding layer was applied to map the sequence to a high-dimensional space. Second, two 1D convolutional layers, a BiLSTM layer and an attention mechanism layer were used to extract features. Finally, a fully connected layer with a sigmoid activation function was used to obtain the probability of classification into the target categories. This model can identify the promoter regions of eukaryotes with high accuracy, providing an analytical basis for further understanding of promoter physiological functions and studies of gene transcription mechanisms. The source code of DeeProPre is freely available at https://github.com/zzwwmmm/DeeProPre/tree/master.
6
Qiao H, Zhang S, Xue T, Wang J, Wang B. iPro-GAN: A novel model based on generative adversarial learning for identifying promoters and their strength. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 215:106625. [PMID: 35038653 DOI: 10.1016/j.cmpb.2022.106625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/11/2021] [Revised: 12/13/2021] [Accepted: 01/06/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVE: The promoter is a component of the gene that can specifically bind RNA polymerase, determines where transcription starts, and also determines the transcription efficiency of the gene. Promoters can be divided into strong promoters and weak promoters because their structures and the interaction time interval are quite different. Functional variation of the promoter can lead to a variety of diseases. Therefore, identifying promoters and their strength is necessary and has important biological significance. A novel and promising model based on deep learning is proposed to achieve this.
METHODS: In this work, we build a powerful model named iPro-GAN for identification of promoters and their strength. First, we collect benchmark datasets and independent datasets for training and testing. Then, a Moran-based spatial auto-cross correlation method is used for feature extraction. Finally, a deep convolutional generative adversarial network with 10-fold cross-validation is applied for classification. The first layer of the model is used to identify the promoter and the second layer is used to determine its type.
RESULTS: On the benchmark data set, the accuracy of the first layer predictor is 93.15%, and the accuracy of the second layer predictor is 92.30%. On the independent data set, the accuracy of the first layer predictor is 86.77%, and the accuracy of the second layer predictor is 91.66%. In particular, breakthrough progress has been made in the identification of promoter strength.
CONCLUSIONS: These results are far higher than those of the best existing predictor, which indicates that our model is serviceable and practicable for identifying promoters and their strength. Furthermore, the datasets and source code are available at https://github.com/Bovbene/iPro-GAN.
Affiliation(s)
- Huijuan Qiao
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Tian Xue
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Jinyue Wang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Bowei Wang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
7
Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023] Open
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Affiliation(s)
- Cangzhi Jia (corresponding author)
- Geoffrey I Webb
  - Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Quan Zou (corresponding author)
- Lachlan J M Coin (corresponding author)
- Jiangning Song (corresponding author)

Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China.
8
Mishra A, Dhanda S, Siwach P, Aggarwal S, Jayaram B. A novel method SEProm for prokaryotic promoter prediction based on DNA structure and energetics. Bioinformatics 2020; 36:2375-2384. [PMID: 31909789 DOI: 10.1093/bioinformatics/btz941] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/24/2019] [Revised: 11/08/2019] [Accepted: 01/02/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Despite conservation in general architecture of promoters and protein-DNA interaction interface of RNA polymerases among various prokaryotes, identification of promoter regions in the whole genome sequences remains a daunting challenge. The available tools for promoter prediction do not seem to address the problem satisfactorily, apparently because the biochemical nature of promoter signals is yet to be understood fully. Using 28 structural and 3 energetic parameters, we found that prokaryotic promoter regions have a unique structural and energy state, quite distinct from that of coding regions and the information for this signature state is in-built in their sequences. We developed a novel promoter prediction tool from these 31 parameters using various statistical techniques.
RESULTS: Here, we introduce SEProm, a novel tool that is developed by studying and utilizing the in-built structural and energy information of DNA sequences, which is applicable to all prokaryotes including archaea. Compared to five most recent, diverged and current best available tools, SEProm performs much better, predicting promoters with an 'F-value' of 82.04 and 'Precision' of 81.08. The next best 'F-value' was obtained with PromPredict (72.14) followed by BProm (68.37). On the basis of 'Precision' value, the next best 'Precision' was observed for Pepper (75.39) followed by PromPredict (72.01). SEProm maintained the lead even when comparison was done on two test organisms (not involved in training for SEProm).
AVAILABILITY AND IMPLEMENTATION: The software is freely available with easy to follow instructions (www.scfbio-iitd.res.in/software/TSS_Predict.jsp).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Akhilesh Mishra
  - Supercomputing Facility for Bioinformatics & Computational Biology; Kusuma School of Biological Sciences, Indian Institute of Technology, New Delhi 110016, India
- Sahil Dhanda
  - Supercomputing Facility for Bioinformatics & Computational Biology
- Priyanka Siwach
  - Supercomputing Facility for Bioinformatics & Computational Biology; Department of Biotechnology, Chaudhary Devi Lal University, Sirsa 125055, India
- Shruti Aggarwal
  - Supercomputing Facility for Bioinformatics & Computational Biology
- B Jayaram
  - Supercomputing Facility for Bioinformatics & Computational Biology; Kusuma School of Biological Sciences, Indian Institute of Technology, New Delhi 110016, India; Department of Chemistry, Indian Institute of Technology, New Delhi 110016, India
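The SEProm abstract reports an 'F-value' alongside 'Precision'. As a rough illustration of how the two quantities relate, the sketch below assumes that 'F-value' means the standard F1 score (the harmonic mean of precision and recall); the abstract does not define it, so this is an assumption. The function names are hypothetical.

```python
def f_measure(precision, recall):
    """Standard F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from_f(f_value, precision):
    """Solve F = 2PR / (P + R) for R, given F and P (same units)."""
    denominator = 2 * precision - f_value
    if denominator <= 0:
        raise ValueError("no valid recall: need 2 * precision > F")
    return f_value * precision / denominator


if __name__ == "__main__":
    # Values from the abstract, on a 0-100 scale; the recall printed here is
    # only what the F1 assumption implies, not a figure reported by the authors.
    implied = recall_from_f(82.04, 81.08)
    print(round(implied, 2), round(f_measure(81.08, implied), 2))
```

Because F1 is a harmonic mean, a tool cannot reach a high F-value through precision alone, which is why the abstract ranks competing tools separately on the two measures.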
9
Liu X, Guo Z, He T, Ren M. Prediction and analysis of prokaryotic promoters based on sequence features. Biosystems 2020; 197:104218. [PMID: 32755610 DOI: 10.1016/j.biosystems.2020.104218] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/25/2020] [Revised: 07/03/2020] [Accepted: 07/21/2020] [Indexed: 10/23/2022]
Abstract
Promoter recognition is an important but difficult part of functional genomic annotation. Many studies have been carried out to address this issue. However, they still cannot meet application needs. Most of the methods are species-specific, and the objects analyzed are relatively simple, especially for prokaryotes. Hence, more research on prokaryotic promoters is needed. In this study, the similarity between gene expression and the transmission of information inspired us to analyze promoter sequences by calculating the information content of the sequences and the correlation between sequences in the subregion. We also calculated other sequence features as supplements, such as the Hurst exponent, GC content, and sequence bending property. Then, we employed an artificial neural network to build a classifier and applied it to identify promoters in three organisms, Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa. The experiments on the benchmark test set indicate that our method has good capability to distinguish promoters from randomly selected nonpromoters. The maximal AUC for the classifier is 0.90, and the minimal AUC score is 0.80. Additionally, cross-species experiments were conducted. The cross-species experiments on the three organisms yielded an AUC of 0.8, suggesting that our approach has better generalization ability, which is conducive to revealing the more common characteristics of prokaryotic promoters.
Affiliation(s)
- Xiao Liu
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Zhirui Guo
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Ting He
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Meixiang Ren
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
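Two of the sequence features named in the Liu et al. abstract, GC content and information content, are simple to compute. The sketch below is illustrative only: it uses per-base Shannon entropy as a stand-in for "information content", which is an assumption, since the abstract does not give the authors' exact definition. The function names are hypothetical.

```python
import math
from collections import Counter


def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def shannon_entropy(seq):
    """Shannon entropy in bits per base of the symbol distribution in seq."""
    seq = seq.upper()
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    window = "TTGACAATATAATGCGC"  # arbitrary example window, not real data
    print(gc_content(window), shannon_entropy(window))
```

Features like these are usually computed over sliding windows around a candidate site and concatenated into the input vector of a classifier such as the neural network the abstract describes.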
10
Coelho RV, Dall'Alba G, de Avila E Silva S, Echeverrigaray S, Delamare APL. Toward Algorithms for Automation of Postgenomic Data Analyses: Bacillus subtilis Promoter Prediction with Artificial Neural Network. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2019; 24:300-309. [PMID: 31573385 DOI: 10.1089/omi.2019.0041] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/20/2022]
Abstract
In the present postgenomic era, the capacity to generate big data has far exceeded the capacity to analyze, contextualize, and make sense of the data in clinical, biological, and ecological applications. There is a great unmet need for automation and algorithms to aid in analyses of big data, in biology in particular. In this context, it is noteworthy that computational methods used to analyze the regulation of bacterial gene expression have in the past focused mainly on Escherichia coli promoters due to the large amount of data available. The challenge and prospects of automation in prediction and recognition of bacterial sequences as promoters have not been properly addressed due to the promoter size and degenerate pattern. We report here an original neural network approach for recognition and prediction of Bacillus subtilis promoters. The artificial neural network used 767 B. subtilis promoter sequences as input, and we also aimed to identify the architecture that provides the best prediction. Two multilayer perceptron neural network architectures offered the highest accuracy: one with five, and another with seven neurons in the hidden layer. These architectures achieved accuracies of 98.57% and 97.69%, respectively. The results collectively indicate the promise of the application of neural network approaches to the B. subtilis promoter recognition problem, while also suggesting the broader potential of algorithms for automation of data analyses in the postgenomic era.
Affiliation(s)
- Rafael Vieira Coelho
- Farroupilha Campus, Rio Grande do Sul Federal Institute of Education, Science and Technology (IFRS), Farroupilha, Brazil
- Gabriel Dall'Alba
- Biotechnology Institute, Caxias do Sul University (UCS), Caxias do Sul, Brazil
|
11
|
Lai HY, Zhang ZY, Su ZD, Su W, Ding H, Chen W, Lin H. iProEP: A Computational Predictor for Predicting Promoter. Molecular Therapy. Nucleic Acids 2019; 17:337-346. [PMID: 31299595 PMCID: PMC6616480 DOI: 10.1016/j.omtn.2019.05.028] [Citation(s) in RCA: 110] [Impact Index Per Article: 18.3] [Received: 02/23/2019] [Revised: 05/18/2019] [Accepted: 05/19/2019] [Indexed: 11/29/2022]
Abstract
A promoter is a fundamental DNA element located around the transcription start site (TSS) that regulates gene transcription. Promoter recognition is of great significance in determining transcription units, studying gene structure, analyzing gene regulation mechanisms, and annotating gene functional information. Many models have already been proposed to predict promoters. However, the performance of these methods still needs to be improved. In this work, we combined pseudo k-tuple nucleotide composition (PseKNC) with position-correlation scoring function (PCSF) to formulate promoter sequences of Homo sapiens (H. sapiens), Drosophila melanogaster (D. melanogaster), Caenorhabditis elegans (C. elegans), Bacillus subtilis (B. subtilis), and Escherichia coli (E. coli). The Minimum Redundancy Maximum Relevance (mRMR) algorithm and an incremental feature selection strategy were then adopted to find optimal feature subsets. A support vector machine (SVM) was used to distinguish between promoters and non-promoters. In the 10-fold cross-validation test, accuracies of 93.3%, 93.9%, 95.7%, 95.2%, and 93.1% were obtained for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli, with areas under the receiver operating characteristic curves (AUCs) of 0.974, 0.975, 0.981, 0.988, and 0.976, respectively. Comparative results demonstrated that our method outperforms existing methods for identifying promoters. An online web server was established that can be freely accessed (http://lin-group.cn/server/iProEP/).
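To make the sequence-encoding idea in this abstract concrete, the following is a minimal sketch of plain k-tuple nucleotide composition, the simplest building block that PseKNC extends. It is not the iProEP feature set: the pseudo (physicochemical correlation) components, PCSF, mRMR selection, and the SVM are all omitted, and the function name `kmer_composition` and the example sequence are hypothetical.

```python
from itertools import product
from typing import Dict


def kmer_composition(seq: str, k: int = 2) -> Dict[str, float]:
    """Return the normalized frequency of every k-tuple over A/C/G/T.

    Simplified illustration only: PseKNC adds correlation terms based on
    physicochemical properties, which are not modeled here.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    total = 0
    # Slide a window of width k over the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


# Hypothetical example sequence, used only to show the call.
features = kmer_composition("TTGACAATTAATCATCGAACTAGTTAACT", k=2)
```

The resulting fixed-length vector (16 values for k = 2) is the kind of input a classifier such as an SVM can consume, regardless of the length of the original sequence.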
Affiliation(s)
- Hong-Yan Lai
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhao-Yue Zhang
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhen-Dong Su
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Su
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Ding
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China; Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China
- Hao Lin
- Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
12
|
Shelyakin PV, Bochkareva OO, Karan AA, Gelfand MS. Micro-evolution of three Streptococcus species: selection, antigenic variation, and horizontal gene inflow. BMC Evol Biol 2019; 19:83. [PMID: 30917781 PMCID: PMC6437910 DOI: 10.1186/s12862-019-1403-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 12/22/2017] [Accepted: 02/25/2019] [Indexed: 02/07/2023] Open
Abstract
Background: The genus Streptococcus comprises pathogens that strongly influence the health of humans and animals. Genome sequencing of multiple Streptococcus strains demonstrated high variability in gene content and order even in closely related strains of the same species and created a newly emerged object for genomic analysis, the pan-genome. Here we analysed the genome evolution of 25 strains of Streptococcus suis, 50 strains of Streptococcus pyogenes and 28 strains of Streptococcus pneumoniae.
Results: Fractions of the pan-genome, unique, periphery, and universal genes differ in size, functional composition, the level of nucleotide substitutions, and predisposition to horizontal gene transfer and genomic rearrangements. The density of substitutions in intergenic regions appears to be correlated with selection acting on adjacent genes, implying that more conserved genes tend to have more conserved regulatory regions. The total pan-genome of the genus is open, but only due to strain-specific genes, whereas other pan-genome fractions reach saturation. We have identified the set of genes with phylogenies inconsistent with species and non-conserved location in the chromosome; these genes are rare in at least one species and have likely experienced recent horizontal transfer between species. The strain-specific fraction is enriched with mobile elements and hypothetical proteins, but also contains a number of candidate virulence-related genes, so it may have a strong impact on adaptability and pathogenicity. Mapping the rearrangements to the phylogenetic tree revealed large parallel inversions in all species. A parallel inversion of length 15 kB with breakpoints formed by genes encoding surface antigen proteins PhtD and PhtB in S. pneumoniae leads to replacement of gene fragments that likely indicates the action of an antigen variation mechanism.
Conclusions: Members of genus Streptococcus have a highly dynamic, open pan-genome, that potentially confers them with the ability to adapt to changing environmental conditions, i.e. antibiotic resistance or transmission between different hosts. Hence, integrated analysis of all aspects of genome evolution is important for the identification of potential pathogens and design of drugs and vaccines.
Electronic supplementary material: The online version of this article (10.1186/s12862-019-1403-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Pavel V Shelyakin
- Vavilov Institute of General Genetics Russian Academy of Sciences, Gubkina str. 3, Moscow, 119991, Russia; Kharkevich Institute for Information Transmission Problems, 19, Bolshoy Karetny per., Moscow, 127051, Russia; Center of Life Sciences, Skolkovo Institute of Science and Technology, Moscow, Russia
- Olga O Bochkareva
- Kharkevich Institute for Information Transmission Problems, 19, Bolshoy Karetny per., Moscow, 127051, Russia; Center of Life Sciences, Skolkovo Institute of Science and Technology, Moscow, Russia
- Anna A Karan
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
- Mikhail S Gelfand
- Kharkevich Institute for Information Transmission Problems, 19, Bolshoy Karetny per., Moscow, 127051, Russia; Center of Life Sciences, Skolkovo Institute of Science and Technology, Moscow, Russia; Faculty of Computer Science, Higher School of Economics, Moscow, Russia
|
13
|
Image-based promoter prediction: a promoter prediction method based on evolutionarily generated patterns. Sci Rep 2018; 8:17695. [PMID: 30523308 PMCID: PMC6283834 DOI: 10.1038/s41598-018-36308-0] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 10/26/2017] [Accepted: 11/12/2018] [Indexed: 11/18/2022] Open
Abstract
Prediction of promoter regions is crucial for studying gene function and regulation. The well-accepted position weight matrix method for this purpose relies on predefined motifs, which would hinder application across different species. Here, we introduce image-based promoter prediction (IBPP) as a method that creates an “image” from training promoter sequences using an evolutionary approach and predicts promoters by matching with the “image”. We used Escherichia coli σ70 promoter sequences to test the performance of IBPP and the combination of IBPP and a support vector machine algorithm (IBPP-SVM). The “images” generated with IBPP could effectively distinguish promoter from non-promoter sequences. Compared with IBPP, IBPP-SVM showed a substantial improvement in sensitivity. Furthermore, both methods showed good performance for sequences of up to 2,000 nt in length. The performances of IBPP and IBPP-SVM were largely affected by the threshold and dimension of vectors, respectively. The source code and documentation are freely available at https://github.com/hahatcdg/IBPP.
|
14
|
Xiao X, Xu ZC, Qiu WR, Wang P, Ge HT, Chou KC. iPSW(2L)-PseKNC: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics 2018; 111:1785-1793. [PMID: 30529532 DOI: 10.1016/j.ygeno.2018.12.001] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 10/31/2018] [Revised: 11/20/2018] [Accepted: 12/04/2018] [Indexed: 12/20/2022]
Abstract
The promoter is a regulatory DNA region about 81-1000 base pairs long, usually located near the transcription start site (TSS) upstream of a given gene. By binding proteins called transcription factors, the promoter provides the starting point for regulated gene transcription, and hence plays a vitally important role in gene transcriptional regulation. With the explosive growth of DNA sequences in the post-genomic age, it has become an urgent challenge to develop computational methods for effectively identifying promoters, because the information thus obtained is very useful for both basic research and drug development. Although some prediction methods have been developed for this purpose, most of them are limited to identifying whether a query DNA sequence is a promoter or not. However, based on their distinct levels of strength for transcriptional activation and expression, promoters should be divided into two categories: strong and weak. Here a new two-layer predictor, called "iPSW(2L)-PseKNC", was developed by fusing the physicochemical properties of nucleotides and their nucleotide density into PseKNC (pseudo K-tuple nucleotide composition). Its 1st layer serves to predict whether a query DNA sequence is a promoter or not, while its 2nd layer is able to predict the strength of promoters. It has been observed through rigorous cross-validations that the 1st-layer sub-predictor is remarkably superior to the existing state-of-the-art predictors in identifying the promoters and non-promoters, and that the 2nd-layer sub-predictor can do what is beyond the reach of the existing predictors. Moreover, the web-server for iPSW(2L)-PseKNC has been established at http://www.jci-bioinfo.cn/iPSW(2L)-PseKNC, by which the majority of experimental scientists can easily get the results they need.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA
- Zhao-Chun Xu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
- Wang-Ren Qiu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA
- Peng Wang
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
- Hui-Ting Ge
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
- Kuo-Chen Chou
- The Gordon Life Science Institute, Boston, MA 02478, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
15
|
Karp PD, Ong WK, Paley S, Billington R, Caspi R, Fulcher C, Kothari A, Krummenacker M, Latendresse M, Midford PE, Subhraveti P, Gama-Castro S, Muñiz-Rascado L, Bonavides-Martinez C, Santos-Zavaleta A, Mackie A, Collado-Vides J, Keseler IM, Paulsen I. The EcoCyc Database. EcoSal Plus 2018; 8:10.1128/ecosalplus.ESP-0006-2018. [PMID: 30406744 PMCID: PMC6504970 DOI: 10.1128/ecosalplus.esp-0006-2018] [Citation(s) in RCA: 63] [Impact Index Per Article: 9.0] [Received: 06/05/2018] [Indexed: 01/28/2023]
Abstract
EcoCyc is a bioinformatics database available at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project is to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene product, metabolite, reaction, operon, and metabolic pathway. The database also includes information on E. coli gene essentiality and on nutrient conditions that do or do not support the growth of E. coli. The website and downloadable software contain tools for analysis of high-throughput data sets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc and can be executed via EcoCyc.org. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. This review outlines the data content of EcoCyc and of the procedures by which this content is generated.
Affiliation(s)
- Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Wai Kit Ong
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Suzanne Paley
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Ron Caspi
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Carol Fulcher
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Anamika Kothari
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Mario Latendresse
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Peter E Midford
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Luis Muñiz-Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- César Bonavides-Martinez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Amanda Mackie
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
- Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Ingrid M Keseler
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Ian Paulsen
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
|
16
|
Chang SC, Lee CY. OpaR and RpoS are positive regulators of a virulence factor PrtA in Vibrio parahaemolyticus. Microbiology (Reading) 2018; 164:221-231. [DOI: 10.1099/mic.0.000591] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 01/14/2023] Open
Affiliation(s)
- San-Chi Chang
- Department of Agricultural Chemistry, Microbiology Laboratory, National Taiwan University, Taipei, Taiwan, ROC
- Chia-Yin Lee
- Department of Agricultural Chemistry, Microbiology Laboratory, National Taiwan University, Taipei, Taiwan, ROC
|
17
|
Shahmuradov IA, Mohamad Razali R, Bougouffa S, Radovanovic A, Bajic VB. bTSSfinder: a novel tool for the prediction of promoters in cyanobacteria and Escherichia coli. Bioinformatics 2017; 33:334-340. [PMID: 27694198 PMCID: PMC5408793 DOI: 10.1093/bioinformatics/btw629] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 07/21/2016] [Accepted: 09/27/2016] [Indexed: 12/01/2022] Open
Abstract
Motivation: The computational search for promoters in prokaryotes remains an attractive problem in bioinformatics. Despite the attention it has received for many years, the problem has not been addressed satisfactorily. In any bacterial genome, the transcription start site is chosen mostly by the sigma (σ) factor proteins, which control the gene activation. The majority of published bacterial promoter prediction tools target σ70 promoters in Escherichia coli. Moreover, no σ-specific classification of promoters is available for prokaryotes other than for E. coli.
Results: Here, we introduce bTSSfinder, a novel tool that predicts putative promoters for five classes of σ factors in Cyanobacteria (σA, σC, σH, σG and σF) and for five classes of sigma factors in E. coli (σ70, σ38, σ32, σ28 and σ24). Compared with currently available tools, bTSSfinder achieves higher accuracy (MCC = 0.86, F1-score = 0.93, versus MCC = 0.59, F1-score = 0.79 for the next best tool) and covers multiple classes of promoters.
Availability and Implementation: bTSSfinder is available standalone and online at http://www.cbrc.kaust.edu.sa/btssfinder.
Supplementary information: Supplementary data are available at Bioinformatics online.
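For readers unfamiliar with the two metrics quoted in this abstract, the sketch below computes the Matthews correlation coefficient (MCC) and the F1-score from a binary confusion matrix using their standard definitions. The function name `mcc_and_f1` and the counts in the example are hypothetical and are not taken from the bTSSfinder evaluation.

```python
import math
from typing import Tuple


def mcc_and_f1(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Compute (MCC, F1) from true/false positive and negative counts."""
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return mcc, f1


# Hypothetical counts: 90 promoters found, 10 missed, 10 false alarms,
# 90 non-promoters correctly rejected.
mcc, f1 = mcc_and_f1(tp=90, fp=10, tn=90, fn=10)
```

Unlike F1, MCC also rewards correct rejection of non-promoters, which is why the two metrics can rank tools differently when promoter and non-promoter classes are unbalanced.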
Affiliation(s)
- Ilham Ayub Shahmuradov
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Rozaimi Mohamad Razali
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Salim Bougouffa
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Aleksandar Radovanovic
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Vladimir B Bajic
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
|
18
|
Abbas MM, Mohie-Eldin MM, EL-Manzalawy Y. Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One 2015; 10:e0119721. [PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/12/2014] [Accepted: 01/26/2015] [Indexed: 11/27/2022] Open
Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) grows. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of the promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well on cross-validation experiments while its performance on independent test data is drastically poorer. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments.
Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
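The evaluation point made in this abstract (cross-validation alone can be over-optimistic, so an independent test set is needed) can be illustrated with a small index-splitting sketch. It only shows how to reserve a held-out test set before building k-fold cross-validation splits; the function name `holdout_then_kfold`, the 20% test fraction, and the fixed seed are assumptions for illustration and do not reproduce the data sets or protocol of the paper.

```python
import random
from typing import List, Tuple

Fold = Tuple[List[int], List[int]]  # (training indices, validation indices)


def holdout_then_kfold(n_samples: int, k: int = 10,
                       test_fraction: float = 0.2,
                       seed: int = 0) -> Tuple[List[int], List[Fold]]:
    """Reserve an independent test set, then split the rest into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(n_samples * test_fraction))
    test_idx = indices[:n_test]   # never used for model selection
    dev_idx = indices[n_test:]    # used only for cross-validation
    folds: List[Fold] = []
    for f in range(k):
        val = dev_idx[f::k]       # every k-th development sample
        val_set = set(val)
        train = [i for i in dev_idx if i not in val_set]
        folds.append((train, val))
    return test_idx, folds


# Hypothetical data set of 100 sequences.
test_idx, folds = holdout_then_kfold(n_samples=100, k=10)
```

Model and feature choices would be made using only `folds`; the final model is then scored once on `test_idx`, giving the kind of independent estimate the abstract argues for.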
Affiliation(s)
- Mostafa M. Abbas
- KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar
- Yasser EL-Manzalawy
- Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt
- College of Information Sciences, Penn State University, University Park, United States of America
|
19
|
Lloréns-Rico V, Lluch-Senar M, Serrano L. Distinguishing between productive and abortive promoters using a random forest classifier in Mycoplasma pneumoniae. Nucleic Acids Res 2015; 43:3442-53. [PMID: 25779052 PMCID: PMC4402517 DOI: 10.1093/nar/gkv170] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 11/26/2014] [Accepted: 02/22/2015] [Indexed: 12/01/2022] Open
Abstract
Distinguishing between promoter-like sequences in bacteria that belong to true or abortive promoters, or to those that do not initiate transcription at all, is one of the important challenges in transcriptomics. To address this problem, we have studied the genome-reduced bacterium Mycoplasma pneumoniae, for which the RNAs associated with transcriptional start sites have been recently experimentally identified. We determined the contribution to transcription events of different genomic features: the –10, extended –10 and –35 boxes, the UP element, the bases surrounding the –10 box and the nearest-neighbor free energy of the promoter region. Using a random forest classifier and the aforementioned features transformed into scores, we could distinguish between true, abortive promoters and non-promoters with good –10 box sequences. The methods used in this characterization of promoters can be extended to other bacteria and have important applications for promoter design in bacterial genome engineering.
Affiliation(s)
- Verónica Lloréns-Rico
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation (CRG), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Dr Aiguader 88, 08003 Barcelona, Spain
- Maria Lluch-Senar
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation (CRG), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Dr Aiguader 88, 08003 Barcelona, Spain
- Luis Serrano
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation (CRG), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Dr Aiguader 88, 08003 Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Pg. Lluis Companys 23, 08010 Barcelona, Spain
|
20
|
Karp PD, Weaver D, Paley S, Fulcher C, Kubo A, Kothari A, Krummenacker M, Subhraveti P, Weerasinghe D, Gama-Castro S, Huerta AM, Muñiz-Rascado L, Bonavides-Martinez C, Weiss V, Peralta-Gil M, Santos-Zavaleta A, Schröder I, Mackie A, Gunsalus R, Collado-Vides J, Keseler IM, Paulsen I. The EcoCyc Database. EcoSal Plus 2014; 6:10.1128/ecosalplus.ESP-0009-2013. [PMID: 26442933 PMCID: PMC4243172 DOI: 10.1128/ecosalplus.esp-0009-2013] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Received: 01/16/2014] [Indexed: 11/20/2022]
Abstract
EcoCyc is a bioinformatics database available at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project is to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene, metabolite, reaction, operon, and metabolic pathway. The database also includes information on E. coli gene essentiality and on nutrient conditions that do or do not support the growth of E. coli. The website and downloadable software contain tools for analysis of high-throughput data sets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. This review provides a detailed description of the data content of EcoCyc and of the procedures by which this content is generated.
Affiliation(s)
- Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Daniel Weaver
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Suzanne Paley
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Carol Fulcher
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Aya Kubo
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Anamika Kothari
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Araceli M Huerta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Luis Muñiz-Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- César Bonavides-Martinez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Verena Weiss
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Martin Peralta-Gil
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Imke Schröder
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, CA 90095
- UCLA Institute of Genomics and Proteomics, University of California, Los Angeles, CA 90095
- Amanda Mackie
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
- Robert Gunsalus
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, CA 90095
- Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Ingrid M Keseler
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Ian Paulsen
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
|
21
|
Bansal M, Kumar A, Yella VR. Role of DNA sequence based structural features of promoters in transcription initiation and gene expression. Curr Opin Struct Biol 2014; 25:77-85. [PMID: 24503515 DOI: 10.1016/j.sbi.2014.01.007] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.9] [Received: 12/05/2013] [Accepted: 01/07/2014] [Indexed: 11/18/2022]
Abstract
Regulatory information for transcription initiation is present in a stretch of genomic DNA called the promoter region, which is located upstream of the transcription start site (TSS) of the gene. The promoter region interacts with different transcription factors and RNA polymerase to initiate transcription and contains short stretches of transcription factor binding sites (TFBSs), as well as structurally unique elements. Recent experimental and computational analyses of promoter sequences show that they often have non-B-DNA structural motifs, as well as some conserved structural properties, such as stability, bendability, nucleosome positioning preference and curvature, across a class of organisms. Here, we briefly describe these structural features, the differences observed in various organisms and their possible role in regulation of gene expression.
Affiliation(s)
- Manju Bansal
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India.
- Aditya Kumar
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
|
22
|
Djordjevic M. Efficient transcription initiation in bacteria: an interplay of protein-DNA interaction parameters. Integr Biol (Camb) 2013; 5:796-806. [PMID: 23511241 DOI: 10.1039/c3ib20221f] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
As the first, and usually rate-limiting, step of transcription initiation, bacterial RNA polymerase (RNAP) binds to double stranded DNA (dsDNA) and subsequently opens the two strands of DNA (the open complex formation). The rate determining step in the open complex formation is opening of a short (6 bp) DNA called the -10 region, which interacts with RNAP in both dsDNA and single stranded (ssDNA) forms. Accordingly, formation of the open complex depends on (physically independent) domains of RNAP that interact with ssDNA and dsDNA, as well as on parameters of DNA melting and sequences of -10 regions. We here aim to understand how these different interactions are mutually related to ensure efficient open complex formation. To achieve this, we use a recently developed biophysical model of transcription initiation, which allows the calculation of the kinetic parameters of transcription initiation on the scale of the whole genome. We consequently investigate kinetic properties of sequences derived from all E. coli intergenic regions, and from more than 300 experimentally confirmed E. coli σ(70) promoters. We find that interaction specificities of σ(70) DNA binding domains reduce the number of sequences where RNAP binds strongly, but forms the open complex too slowly to achieve functional transcription (so-called poised promoters). However, we find that, despite this reduction, there is still a significant number of such poised promoters in the intergenic regions, which may provide a major source of false positives in genome-wide searches of transcription start sites. Furthermore, we surprisingly find that sequences of -10 regions of the functional promoters increase the extent of RNAP poising, which we interpret in terms of an extension of a recently proposed model of promoter recognition ('mix-and-match model') to kinetic parameters. Overall, our results allow better understanding of the design of σ(70) DNA binding domains and promoter sequences, and place a fundamental limit on accuracy of methods for promoter detection that are based on strong RNAP binding (e.g. ChIP-chip).
Affiliation(s)
- Marko Djordjevic
- Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia.
|
23
|
de Avila e Silva S, Forte F, T S Sartor I, Andrighetti T, J L Gerhardt G, Longaray Delamare AP, Echeverrigaray S. DNA duplex stability as discriminative characteristic for Escherichia coli σ(54)- and σ(28)- dependent promoter sequences. Biologicals 2013; 42:22-8. [PMID: 24172230 DOI: 10.1016/j.biologicals.2013.10.001] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 08/28/2013] [Accepted: 10/01/2013] [Indexed: 11/17/2022] Open
Abstract
The advent of modern high-throughput sequencing has made it possible to generate vast quantities of genomic sequence data. However, the processing of this volume of information, including prediction of gene-coding and regulatory sequences, remains an important bottleneck in bioinformatics research. In this work, we integrated DNA duplex stability into the repertoire of a Neural Network (NN) capable of predicting promoter regions with augmented accuracy, specificity and sensitivity. We took our method beyond a simplistic analysis based on a single sigma subunit of RNA polymerase, incorporating the six main sigma-subunits of Escherichia coli. The methodology employed successfully rediscovered known promoter sequences recognized by E. coli RNA polymerase subunits σ(24), σ(28), σ(32), σ(38), σ(54) and σ(70), with notable accuracies for σ(28)- and σ(54)-dependent promoter sequences (80% and 78.8%, respectively). Furthermore, the discrimination of promoters according to the σ factor made it possible to extract functional commonalities for the genes expressed by each type of promoter. DNA duplex stability emerges as a distinctive feature that improves the recognition and classification of σ(28)- and σ(54)-dependent promoter sequences. The findings presented in this report underscore the usefulness of including DNA biophysical parameters in NN learning algorithms to increase accuracy, specificity and sensitivity in promoter prediction beyond what is accomplished based on sequence alone.
Affiliation(s)
- Scheila de Avila e Silva
- Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil.
- Franciele Forte
- Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil.
- Ivaine T S Sartor
- Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil.
- Tahila Andrighetti
- Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil.
- Günther J L Gerhardt
- Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil.
- Ana Paula Longaray Delamare
- Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil.
- Sergio Echeverrigaray
- Universidade de Caxias do Sul, Instituto de Biotecnologia, Rua Francisco Getúlio Vargas, 1130, CEP 95070-560 Caxias do Sul, RS, Brazil.
|
24
|
The glyceraldehyde-3-phosphate dehydrogenase promoter of the food yeast Candida utilis strain NRRL Y-660 is functional in Agrobacterium tumefaciens. J Appl Genet 2013; 54:495-9. [PMID: 23873160 DOI: 10.1007/s13353-013-0162-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2013] [Revised: 06/18/2013] [Accepted: 07/02/2013] [Indexed: 10/26/2022]
Abstract
The glyceraldehyde-3-phosphate dehydrogenase promoter of the food yeast Candida utilis strain NRRL Y-660 was cloned to create a novel integrative vector for Agrobacterium tumefaciens-mediated transformation. The new binary vector harbors β-glucuronidase activity as reporter and kanamycin/geneticin resistance as selection marker. Recombinant clones of A. tumefaciens show kanamycin resistance and high β-glucuronidase activity under the control of the C. utilis promoter. This finding can be explained by the presence of a prokaryotic core in the yeast promoter, predicted by in silico analysis of the sequence. This is the first report of the functionality of a yeast promoter in A. tumefaciens.
|
25
|
Datta S, Mukhopadhyay S. A composite method based on formal grammar and DNA structural features in detecting human polymerase II promoter region. PLoS One 2013; 8:e54843. [PMID: 23437045 PMCID: PMC3577817 DOI: 10.1371/journal.pone.0054843] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 09/06/2012] [Accepted: 12/17/2012] [Indexed: 11/25/2022] Open
Abstract
An important step in understanding gene regulation is to identify the promoter regions where transcription factor binding takes place. Predicting a promoter region de novo has been a theoretical goal for many researchers for a long time. There exist a number of in silico methods to predict the promoter region de novo, but most of these methods still suffer from various shortcomings, a major one being the selection of appropriate features that distinguish promoter regions from non-promoters. In this communication, we have proposed a new composite method that predicts promoter sequences based on the interrelationship between structural profiles of DNA and primary sequence elements of the promoter regions. We have shown that a Context Free Grammar (CFG) can formalize the relationships between different primary sequence features and, by utilizing the CFG, we demonstrate that an efficient parser can be constructed for extracting these relationships from DNA sequences to distinguish the true promoter sequences from non-promoter sequences. Along with the CFG, we have extracted the structural features of the promoter region to improve the efficiency of our prediction system. Extensive experiments performed on different datasets reveal that our method is effective in predicting promoter sequences on a genome-wide scale and performs satisfactorily compared with other promoter prediction techniques.
Affiliation(s)
- Sutapa Datta
- Department of Biophysics, Molecular Biology and Bioinformatics and Distributed Information Centre for Bioinformatics, University of Calcutta, Kolkata, West Bengal, India.
|
26
|
Zhou X, Li Z, Dai Z, Zou X. Predicting promoters by pseudo-trinucleotide compositions based on discrete wavelets transform. J Theor Biol 2013; 319:1-7. [DOI: 10.1016/j.jtbi.2012.11.024] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 07/18/2012] [Revised: 11/20/2012] [Accepted: 11/21/2012] [Indexed: 10/27/2022]
|
27
|
Fine tuning the transcription of ldhA for d-lactate production. J Ind Microbiol Biotechnol 2012; 39:1209-17. [DOI: 10.1007/s10295-012-1116-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/10/2011] [Accepted: 02/27/2012] [Indexed: 11/25/2022]
Abstract
Fine tuning of the key enzymes to moderate rather than high expression levels could overproduce the desired metabolic products without inhibiting cell growth. The aims of this investigation were to regulate rates of lactate production and cell growth in recombinant Escherichia coli through promoter engineering and to evaluate the transcriptional function of the upstream region of ldhA (encoding fermentative lactate dehydrogenase in E. coli). Twelve ldhA genes with sequentially shortened chromosomal upstream regions were cloned in an ldhA deletion, E. coli CICIM B0013-080C (ack-pta pps pflB dld poxB adhE frdA ldhA). The varied ldhA upstream regions were further analyzed using program NNPP2.2 (Neural Network Promoter Prediction 2.2) to predict the possible promoter regions. Two-phase fermentations (aerobic growth and oxygen-limited production) of these strains showed that shortening the ldhA upstream sequence from 291 to 106 bp successively reduced aerobic lactate synthesis and the inhibition effect on cell growth during the first phase. Simultaneously, oxygen-limited lactate productivity was increased during the second phase. The putative promoter downstream of the −96 site of ldhA could function as a transcriptional promoter or regulator. B0013-080C/pTH-rrnB-ldhA8, with the 72-bp upstream segment of ldhA, could be grown at a high rate and achieve a high oxygen-limited lactate productivity of 1.09 g g−1 h−1. No transcriptional promoting region was apparent downstream of the −61 site of ldhA. We identified the latent transcription regions in the ldhA upstream sequence, which will help to understand regulation of ldhA expression.
|
28
|
Lee TY, Chang WC, Hsu JBK, Chang TH, Shien DM. GPMiner: an integrated system for mining combinatorial cis-regulatory elements in mammalian gene group. BMC Genomics 2012; 13 Suppl 1:S3. [PMID: 22369687 PMCID: PMC3587379 DOI: 10.1186/1471-2164-13-s1-s3] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Indexed: 12/02/2022] Open
Abstract
Background: Sequence features in promoter regions are involved in regulating gene transcription initiation. Although numerous computational methods have been developed for predicting transcriptional start sites (TSSs) or transcription factor (TF) binding sites (TFBSs), they do not consider or annotate some important regulatory features such as CpG islands, tandem repeats, the TATA box, CCAAT box, GC box, over-represented oligonucleotides, DNA stability, and GC content. Additionally, the combinatorial interaction of TFs regulates the gene group that is associated with the same expression pattern. To investigate gene transcriptional regulation, an integrated system that annotates regulatory features in a promoter sequence and detects co-regulation of TFs in a group of genes is needed. Results: This work identifies TSSs and regulatory features in a promoter sequence, and recognizes co-occurrence of cis-regulatory elements in co-expressed genes using a novel system. Three well-known TSS prediction tools are incorporated with orthologous conserved features, such as CpG islands, nucleotide composition, over-represented hexamer nucleotides, and DNA stability, to construct the novel Gene Promoter Miner (GPMiner) using a support vector machine (SVM). According to five-fold cross-validation results, the predictive sensitivity and specificity are both roughly 80%. The proposed system allows users to input a group of gene names/symbols, enabling the co-occurrence of TFBSs to be determined. Additionally, an input sequence can also be analyzed for homogeneity of experimental mammalian promoter sequences, and conserved regulatory features between homologous promoters can be observed through cross-species analysis. After identifying promoter regions, regulatory features are visualized graphically to facilitate gene promoter observations. Conclusions: The GPMiner, which has a user-friendly input/output interface, has numerous benefits in analyzing human and mouse promoters. The proposed system is freely available at http://GPMiner.mbc.nctu.edu.tw/.
Affiliation(s)
- Tzong-Yi Lee
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan 320, Taiwan.
|
29
|
de Avila e Silva S, Echeverrigaray S, Gerhardt GJ. BacPP: Bacterial promoter prediction—A tool for accurate sigma-factor specific assignment in enterobacteria. J Theor Biol 2011; 287:92-9. [DOI: 10.1016/j.jtbi.2011.07.017] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Received: 10/29/2010] [Revised: 05/20/2011] [Accepted: 07/21/2011] [Indexed: 10/17/2022]
|
30
|
Song K. Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. Nucleic Acids Res 2011; 40:963-71. [PMID: 21954440 PMCID: PMC3273801 DOI: 10.1093/nar/gkr795] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.7] [Indexed: 11/21/2022] Open
Abstract
Transcription is the first step in gene expression, and it is the step at which most of the regulation of expression occurs. Although sequenced prokaryotic genomes provide a wealth of information, transcriptional regulatory networks are still poorly understood using the available genomic information, largely because accurate prediction of promoters is difficult. To improve promoter recognition performance, a novel variable-window Z-curve method is developed to extract general features of prokaryotic promoters. The features are used for further classification by the partial least squares technique. To verify the prediction performance, the proposed method is applied to predict promoter fragments of two representative prokaryotic model organisms (Escherichia coli and Bacillus subtilis). Depending on the feature extraction and selection power of the proposed method, the promoter prediction accuracies are improved markedly over most existing approaches: for E. coli, the accuracies are 96.05% (σ70 promoters, coding negative samples), 90.44% (σ70 promoters, non-coding negative samples), 92.13% (known sigma-factor promoters, coding negative samples), 92.50% (known sigma-factor promoters, non-coding negative samples), respectively; for B. subtilis, the accuracies are 95.83% (known sigma-factor promoters, coding negative samples) and 99.09% (known sigma-factor promoters, non-coding negative samples). Additionally, being a linear technique, the computational simplicity of the proposed method makes it easy to run in a matter of minutes on ordinary personal computers or even laptops. More importantly, there is no need to optimize parameters, so it is very practical for predicting other species promoters without any prior knowledge or prior information of the statistical properties of the samples.
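As a rough illustration of the kind of feature this abstract refers to, the sketch below computes the standard cumulative Z-curve of a DNA sequence. It does not reproduce the variable-window scheme or the partial least squares classifier described in the paper; the function name `z_curve` and the example input are hypothetical.

```python
def z_curve(seq):
    """Return the cumulative Z-curve coordinates (x, y, z) of a DNA sequence.

    x: purine (A, G) count minus pyrimidine (C, T) count
    y: amino (A, C) count minus keto (G, T) count
    z: weak H-bond (A, T) count minus strong H-bond (G, C) count
    """
    # Per-base increments along the three axes (standard Z-curve definition).
    steps = {
        "A": (1, 1, 1),
        "C": (-1, 1, -1),
        "G": (1, -1, -1),
        "T": (-1, -1, 1),
    }
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = steps[base]  # raises KeyError on non-ACGT symbols
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    # Hypothetical toy input; real use would pass a promoter fragment.
    for point in z_curve("ACGT"):
        print(point)
```

Each coordinate triple could then serve as an input feature for a classifier; how windows are chosen and how features are selected is specific to the cited work and is not modeled here.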
Affiliation(s)
- Kai Song
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, 300072, China.
|
31
|
de Avila E Silva S, Gerhardt GJL, Echeverrigaray S. Rules extraction from neural networks applied to the prediction and recognition of prokaryotic promoters. Genet Mol Biol 2011; 34:353-60. [PMID: 21734842 PMCID: PMC3115335 DOI: 10.1590/s1415-47572011000200031] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 03/26/2010] [Accepted: 01/11/2011] [Indexed: 11/21/2022] Open
Abstract
Promoters are DNA sequences located upstream of the gene region and play a central role in gene expression. Computational techniques show good accuracy in gene prediction but are less successful in predicting promoters, primarily because of the high number of false positives that reflect characteristics of the promoter sequences. Many machine learning methods have been used to address this issue. Neural Networks (NN) have been successfully used in this field because of their ability to recognize imprecise and incomplete patterns characteristic of promoter sequences. In this paper, NN was used to predict and recognize promoter sequences in two data sets: (i) one based on nucleotide sequence information and (ii) another based on stability sequence information. The accuracy was approximately 80% for simulation (i) and 68% for simulation (ii). In the rules extracted, biological consensus motifs were important parts of the NN learning process in both simulations.
Affiliation(s)
- Scheila de Avila E Silva
- Programa de Pós-Graduação em Biotecnologia, Universidade de Caxias do Sul, Caxias do Sul, RS, Brazil
|
32
|
Identification of TATA and TATA-less promoters in plant genomes by integrating diversity measure, GC-Skew and DNA geometric flexibility. Genomics 2010; 97:112-20. [PMID: 21112384 DOI: 10.1016/j.ygeno.2010.11.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Received: 03/09/2010] [Revised: 11/05/2010] [Accepted: 11/12/2010] [Indexed: 11/20/2022]
Abstract
Accurate identification of core promoters is important for understanding eukaryotic transcription regulation. In this study, the authors focused on the biologically realistic promoter prediction of plant genomes. By analyzing the correlative conservation, GC-compositional bias and specific structural patterns of TATA and TATA-less promoters in PlantPromDB, a hybrid multi-feature approach based on a support vector machine (SVM) for predicting the two types of promoters was developed by integrating local word content, GC-Skew and DNA geometric flexibility. Compared with the TSSP-TCM program on the same test dataset, better prediction results were obtained. Especially for TATA-less promoters, the accuracy is 10% higher than that of the TSSP-TCM program. The good performance of the hybrid promoters and the experimental data also indicate that our method has the ability to locate the promoter region of the plant genome.
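One of the features named in this abstract, GC-skew, is commonly defined as (G - C) / (G + C) over a sequence window. The sketch below computes it over sliding windows, assuming that common definition; the function name `gc_skew` and the `window` and `step` defaults are illustrative choices, not values taken from the paper.

```python
def gc_skew(seq, window=100, step=50):
    """Return GC-skew values, (G - C) / (G + C), over sliding windows.

    Windows with no G or C get a skew of 0.0. If the sequence is shorter
    than `window`, a single value is computed over the whole sequence.
    """
    seq = seq.upper()
    skews = []
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


if __name__ == "__main__":
    # Hypothetical toy sequence with a G-rich first half and C-rich second half.
    print(gc_skew("GGGGGGGGAT" + "CCCCCCCCAT", window=10, step=10))
```

A vector of such window values could be concatenated with other sequence features before being passed to a classifier such as an SVM.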
|
33
|
Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010; 130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]
|
34
|
Cheng CH, Yang CH, Chiu HT, Lu CL. Reconstructing genome trees of prokaryotes using overlapping genes. BMC Bioinformatics 2010; 11:102. [PMID: 20181237 PMCID: PMC2845580 DOI: 10.1186/1471-2105-11-102] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 10/23/2009] [Accepted: 02/24/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Overlapping genes (OGs) are defined as adjacent genes whose coding sequences overlap partially or entirely. In fact, they are ubiquitous in microbial genomes and more conserved between species than non-overlapping genes. Based on this property, we have previously implemented a web server, named OGtree, that allows the user to reconstruct genome trees of some prokaryotes according to their pairwise OG distances. By analogy to the analyses of gene content and gene order, the OG distance between two genomes we defined was based on a measure of combining OG content (i.e., the normalized number of shared orthologous OG pairs) and OG order (i.e., the normalized OG breakpoint distance) in their whole genomes. A shortcoming of using the concept of breakpoints to define the OG distance is its inability to analyze the OG distance of multi-chromosomal genomes. In addition, the amount of overlapping coding sequences between some distantly related prokaryotic genomes may be limited so that it is hard to find enough OGs to properly evaluate their pairwise OG distances. RESULTS In this study, we therefore define a new OG order distance that is based on more biologically accurate rearrangements (e.g., reversals, transpositions and translocations) rather than breakpoints and that is applicable to both uni-chromosomal and multi-chromosomal genomes. In addition, we expand the term "gene" to include both its coding sequence and regulatory regions so that two adjacent genes whose coding sequences or regulatory regions overlap with each other are considered as a pair of overlapping genes. This is because overlapping of regulatory regions of distinct genes suggests that the regulation of expression for these genes should be more or less interrelated. Based on these modifications, we have reimplemented our OGtree as a new web server, named OGtree2, and have also evaluated its accuracy of genome tree reconstruction on a testing dataset consisting of 21 Proteobacteria genomes. Our experimental results have finally shown that our current OGtree2 indeed outperforms its previous version OGtree, as well as another similar server, called BPhyOG, significantly in the quality of genome tree reconstruction, because the phylogenetic tree obtained by OGtree2 is greatly congruent with the reference tree that coincides with the taxonomy accepted by biologists for these Proteobacteria. CONCLUSIONS In this study, we have introduced a new web server OGtree2 at http://bioalgorithm.life.nctu.edu.tw/OGtree2.0/ that can serve as a useful tool for reconstructing more precise and robust genome trees of prokaryotes according to their overlapping genes.
Affiliation(s)
- Chih-Hsien Cheng
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Chung-Han Yang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Hsien-Tai Chiu
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
- Chin Lung Lu
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
|
35
|
Eng C, Asthana C, Aigle B, Hergalant S, Mari JF, Leblond P. A New Data Mining Approach for the Detection of Bacterial Promoters Combining Stochastic and Combinatorial Methods. J Comput Biol 2009; 16:1211-25. [DOI: 10.1089/cmb.2008.0122] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/20/2022] Open
Affiliation(s)
- Catherine Eng
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Laboratoire de Génétique et Microbiologie, UMR UHP-INRA 1128, IFR 110, Nancy Université, Faculté des Sciences et Techniques, Vandœuvre-lès-Nancy, France
- Charu Asthana
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Bertrand Aigle
- Laboratoire de Génétique et Microbiologie, UMR UHP-INRA 1128, IFR 110, Nancy Université, Faculté des Sciences et Techniques, Vandœuvre-lès-Nancy, France
- Sébastien Hergalant
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Jean-François Mari
- LORIA, UMR CNRS 7503 et INRIA Grand Est, Campus Scientifique, Vandœuvre-lès-Nancy, France
- Pierre Leblond
- Laboratoire de Génétique et Microbiologie, UMR UHP-INRA 1128, IFR 110, Nancy Université, Faculté des Sciences et Techniques, Vandœuvre-lès-Nancy, France
|
36
|
Mallios RR, Ojcius DM, Ardell DH. An iterative strategy combining biophysical criteria and duration hidden Markov models for structural predictions of Chlamydia trachomatis sigma66 promoters. BMC Bioinformatics 2009; 10:271. [PMID: 19715597 PMCID: PMC2743672 DOI: 10.1186/1471-2105-10-271] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 02/11/2009] [Accepted: 08/28/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Promoter identification is a first step in the quest to explain gene regulation in bacteria. It has been demonstrated that the initiation of bacterial transcription depends upon the stability and topology of DNA in the promoter region as well as the binding affinity between the RNA polymerase sigma-factor and promoter. However, promoter prediction algorithms to date have not explicitly used an ensemble of these factors as predictors. In addition, most promoter models have been trained on data from Escherichia coli. Although it has been shown that transcriptional mechanisms are similar among various bacteria, it is quite possible that the differences between Escherichia coli and Chlamydia trachomatis are large enough to recommend an organism-specific modeling effort. RESULTS Here we present an iterative stochastic model building procedure that combines such biophysical metrics as DNA stability, curvature, twist and stress-induced DNA duplex destabilization along with duration hidden Markov model parameters to model Chlamydia trachomatis sigma66 promoters from 29 experimentally verified sequences. Initially, iterative duration hidden Markov modeling of the training set sequences provides a scoring algorithm for Chlamydia trachomatis RNA polymerase sigma66/DNA binding. Subsequently, an iterative application of Stepwise Binary Logistic Regression selects multiple promoter predictors and deletes/replaces training set sequences to determine an optimal training set. The resulting model predicts the final training set with a high degree of accuracy and provides insights into the structure of the promoter region. Model based genome-wide predictions are provided so that optimal promoter candidates can be experimentally evaluated, and refined models developed. Co-predictions with three other algorithms are also supplied to enhance reliability. CONCLUSION This strategy and resulting model support the conjecture that DNA biophysical properties, along with RNA polymerase sigma-factor/DNA binding collaboratively, contribute to a sequence's ability to promote transcription. This work provides a baseline model that can evolve as new Chlamydia trachomatis sigma66 promoters are identified with assistance from the provided genome-wide predictions. The proposed methodology is ideal for organisms with few identified promoters and relatively small genomes.
Affiliation(s)
- Ronna R Mallios
- School of Natural Sciences, University of California, Merced, CA 95344, USA.
|
37
|
Zeng J, Zhu S, Yan H. Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Brief Bioinform 2009; 10:498-508. [PMID: 19531545 DOI: 10.1093/bib/bbp027] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022] Open
Abstract
This review describes important advances that have been made during the past decade for genome-wide human promoter recognition. Interest in promoter recognition algorithms on a genome-wide scale is worldwide and touches on a number of practical systems that are important in analysis of gene regulation and in genome annotation without experimental support of ESTs, cDNAs or mRNAs. The main focus of this review is on feature extraction and model selection for accurate human promoter recognition, with descriptions of what they are, what has been accomplished, and what remains to be done.
Affiliation(s)
- Jia Zeng
- Department of Computer Science, Hong Kong Baptist University, Kowloon, Hong Kong.
|
38
|
Zhang J, Li E, Olsen GJ. Protein-coding gene promoters in Methanocaldococcus (Methanococcus) jannaschii. Nucleic Acids Res 2009; 37:3588-601. [PMID: 19359364 PMCID: PMC2699501 DOI: 10.1093/nar/gkp213] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022] Open
Abstract
Although Methanocaldococcus (Methanococcus) jannaschii was the first archaeon to have its genome sequenced, little is known about the promoters of its protein-coding genes. To expand our knowledge, we have experimentally identified 131 promoters for 107 protein-coding genes in this genome by mapping their transcription start sites. Compared to previously identified promoters, more than half of which are from genes for stable RNAs, the protein-coding gene promoters are qualitatively similar in overall sequence pattern, but statistically different at several positions due to greater variation among their sequences. Relative binding affinity for general transcription factors was measured for 12 of these promoters by competition electrophoretic mobility shift assays. These promoters bind the factors less tightly than do most tRNA gene promoters. When a position weight matrix (PWM) was constructed from the protein gene promoters, factor binding affinities correlated with corresponding promoter PWM scores. We show that the PWM based on our data more accurately predicts promoters in the genome and transcription start sites than could be done with the previously available data. We also introduce a PWM logo, which visually displays the implications of observing a given base at a position in a sequence.
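The position weight matrix (PWM) scoring mentioned in this abstract can be sketched generically as follows. This is a minimal log-odds PWM with a pseudocount and a uniform background, not the matrix or parameters used in the paper; the function names, the pseudocount, and the toy binding sites are all hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sites (A/C/G/T only)."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds scores of `seq` (same width as the PWM)."""
    return sum(column[base] for column, base in zip(pwm, seq.upper()))


def best_hit(pwm, seq):
    """Return (score, offset) of the highest-scoring window in `seq`.

    Raises ValueError if `seq` is shorter than the PWM.
    """
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


if __name__ == "__main__":
    # Hypothetical aligned sites; a real matrix would be built from mapped promoters.
    pwm = build_pwm(["TATA", "TATA", "TTTA"])
    print(best_hit(pwm, "GGTATAGG"))
```

Scanning a genome with `best_hit` over upstream regions and comparing scores against measured binding affinities is the general idea behind correlating PWM scores with factor binding, though the actual analysis in the cited work may differ in its details.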
Affiliation(s)
- Jian Zhang
- Department of Microbiology, University of Illinois at Urbana-Champaign, 601 South Goodwin Avenue, Urbana, IL 61801, USA
|
39
|
Askary A, Masoudi-Nejad A, Sharafi R, Mizbani A, Parizi SN, Purmasjedi M. N4: A precise and highly sensitive promoter predictor using neural network fed by nearest neighbors. Genes Genet Syst 2009; 84:425-30. [DOI: 10.1266/ggs.84.425] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022]
Affiliation(s)
- Amjad Askary
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics and COE in Biomathematics, University of Tehran
- Department of Biotechnology, College of Science, University of Tehran
- Ali Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics and COE in Biomathematics, University of Tehran
- Roozbeh Sharafi
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics and COE in Biomathematics, University of Tehran
- Amir Mizbani
- Department of Biotechnology, College of Science, University of Tehran
- Malihe Purmasjedi
- Department of Biotechnology, College of Science, University of Tehran
|
41
|
Rhodius VA, Wade JT. Technical considerations in using DNA microarrays to define regulons. Methods 2008; 47:63-72. [PMID: 18955146 DOI: 10.1016/j.ymeth.2008.10.017] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 05/09/2008] [Revised: 10/15/2008] [Accepted: 10/17/2008] [Indexed: 11/20/2022]
Abstract
Transcription is the major regulatory target of gene expression in bacteria, and is controlled by many regulatory proteins and RNAs. Microarrays are a powerful tool to study the regulation of transcription on a genomic scale. Here we describe the use of transcription profiling and ChIP-chip to study transcriptional regulation in bacteria. Transcription profiling determines the outcome of regulatory events whereas ChIP-chip identifies the protein-DNA interactions that determine these events. Together they can provide detailed information on transcriptional regulatory systems.
Affiliation(s)
- Virgil A Rhodius
- Department of Microbiology and Immunology, University of California at San Francisco, San Francisco, CA 94143, USA.
|
42
|
Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y. Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008; 18:310-23. [PMID: 18096745 PMCID: PMC2203629 DOI: 10.1101/gr.6991408] [Citation(s) in RCA: 133] [Impact Index Per Article: 7.8] [Received: 08/03/2007] [Accepted: 11/14/2007] [Indexed: 11/24/2022]
Abstract
Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts of high-quality training data and often behave like black box models that output predictions that are difficult to interpret. Here, we present a novel approach for predicting promoters in whole-genome sequences by using large-scale structural properties of DNA. Our technique requires no training, is applicable to many eukaryotic genomes, and performs extremely well in comparison with the best available promoter prediction programs. Moreover, it is fast, simple in design, and has no size constraints, and the results are easily interpretable. We compared our approach with 14 current state-of-the-art implementations using human gene and transcription start site data and analyzed the ENCODE region in more detail. We also validated our method on 12 additional eukaryotic genomes, including vertebrates, invertebrates, plants, fungi, and protists.
Affiliation(s)
- Thomas Abeel
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Yvan Saeys
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Eric Bonnet
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Pierre Rouzé
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Laboratoire Associé de l’INRA (France), Ghent University, 9052 Gent, Belgium
- Yves Van de Peer
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
|
43
|
SIGffRid: a tool to search for sigma factor binding sites in bacterial genomes using comparative approach and biologically driven statistics. BMC Bioinformatics 2008; 9:73. [PMID: 18237374 PMCID: PMC2375139 DOI: 10.1186/1471-2105-9-73] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Received: 06/01/2007] [Accepted: 01/31/2008] [Indexed: 11/10/2022]
Abstract
Background Many programs have been developed to identify transcription factor binding sites. However, most of them are not able to infer two-word motifs with variable spacer lengths. This case is encountered for RNA polymerase Sigma (σ) Factor Binding Sites (SFBSs), usually composed of two boxes, called -35 and -10 in reference to the transcription initiation point. Our goal is to design an algorithm detecting SFBSs by using combinational and statistical constraints deduced from biological observations.
Results We describe a new approach to identify SFBSs by comparing two related bacterial genomes. The method, named SIGffRid (SIGma Factor binding sites Finder using R'MES to select Input Data), performs a simultaneous analysis of pairs of promoter regions of orthologous genes. SIGffRid uses a prior identification of over-represented patterns in whole genomes as selection criteria for potential -35 and -10 boxes. These patterns are then grouped using pairs of short seeds (of which one is possibly gapped), allowing a variable-length spacer between them. Next, the motifs are extended guided by statistical considerations, a feature that ensures a selection of motifs with statistically relevant properties. We applied our method to the pair of related bacterial genomes of Streptomyces coelicolor and Streptomyces avermitilis. Cross-check with the well-defined SFBSs of the SigR regulon in S. coelicolor is detailed, validating the algorithm. SFBSs for HrdB and BldN were also found, and the results suggested some new targets for these σ factors. In addition, consensus motifs for BldD and new SFBSs were defined, overlapping previously proposed consensuses. Relevant tests were also carried out on bacteria with moderate GC content (i.e. Escherichia coli/Salmonella typhimurium and Bacillus subtilis/Bacillus licheniformis pairs). Motifs of house-keeping σ factors were found as well as other SFBSs such as that of SigW in Bacillus strains.
Conclusion We demonstrate that our approach combining statistical and biological criteria was successful in predicting SFBSs. The method's versatility allows the recognition of other kinds of two-box regulatory sites.
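To make the "two boxes separated by a variable-length spacer" idea concrete, here is a minimal sketch of a naive two-box scan. It is not the SIGffRid algorithm, which relies on over-represented patterns and comparative genomics. The function names, the mismatch threshold, the 15 to 19 bp spacer range, and the use of the textbook E. coli σ70 consensus boxes (TTGACA and TATAAT) as defaults are assumptions made for illustration.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_two_box_sites(seq, box35="TTGACA", box10="TATAAT",
                       spacers=range(15, 20), max_mismatch=1):
    """Return (start35, spacer, start10) for every two-box match.

    A match requires each box to be within `max_mismatch` substitutions of
    its consensus, with the gap between the boxes taken from `spacers`
    (which must be in ascending order for the early exit below).
    """
    hits = []
    n35, n10 = len(box35), len(box10)
    for i in range(len(seq) - n35 + 1):
        if mismatches(seq[i:i + n35], box35) > max_mismatch:
            continue
        for spacer in spacers:
            j = i + n35 + spacer
            if j + n10 > len(seq):
                break  # longer spacers would run past the end of the sequence
            if mismatches(seq[j:j + n10], box10) <= max_mismatch:
                hits.append((i, spacer, j))
    return hits


# Hypothetical sequence: a -35 box, a 17 bp spacer, then a -10 box.
example_seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
example_hits = find_two_box_sites(example_seq)
```

A direct scan like this reports many spurious hits on real genomes, which is why the abstract stresses statistical selection of candidate boxes before pairing them.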
|
44
|
Abnizova I, Subhankulova T, Gilks WR. Recent computational approaches to understand gene regulation: mining gene regulation in silico. Curr Genomics 2007; 8:79-91. [PMID: 18660846 PMCID: PMC2435357 DOI: 10.2174/138920207780368150] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 11/30/2006] [Revised: 12/13/2006] [Accepted: 12/15/2006] [Indexed: 01/03/2023]
Abstract
This paper reviews recent computational approaches to the understanding of gene regulation in eukaryotes. Cis-regulation of gene expression by the binding of transcription factors is a critical component of cellular physiology. In eukaryotes, a number of transcription factors often work together in a combinatorial fashion to enable cells to respond to a wide spectrum of environmental and developmental signals. Integration of genome sequences and/or Chromatin Immunoprecipitation on chip data with gene-expression data has facilitated in silico discovery of how the combinatorics and positioning of transcription factor binding sites underlie gene activation in a variety of cellular processes. The process of gene regulation is extremely complex and intriguing; therefore, all possible points of view and related links should be carefully considered. Here we attempt to collect an inventory, not claiming it to be comprehensive and complete, of related computational biological topics covering gene regulation, which may enlighten the process, and briefly review what is currently occurring in these areas. We will consider the following computational areas:
- gene regulatory network construction;
- evolution of regulatory DNA;
- studies of its structural and statistical informational properties;
- and finally, regulatory RNA.
Affiliation(s)
- T Subhankulova
- Wellcome Trust/Cancer Research UK Gurdon Institute of Cancer and Developmental Biology, Cambridge, UK
|
45
|
Hefty PS, Stephens RS. Chlamydial type III secretion system is encoded on ten operons preceded by sigma 70-like promoter elements. J Bacteriol 2006; 189:198-206. [PMID: 17056752 PMCID: PMC1797217 DOI: 10.1128/jb.01034-06] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.2] [Indexed: 12/19/2022]
Abstract
Many gram-negative bacterial pathogens employ type III secretion systems for infectious processes. Chlamydiae are obligate intracellular bacteria that encode a conserved type III secretion system that is likely requisite for growth. Typically, genes encoding type III secretion systems are located in a single locus; however, for chlamydiae these genes are scattered throughout the genome. Little is known regarding the gene regulatory mechanisms for this essential virulence determinant. To facilitate identification of cis-acting transcriptional regulatory elements, the operon structure was determined. This analysis revealed 10 operons that contained 37 genes associated with the type III secretion system. Linkage within these operons suggests a role in type III secretion for each of these genes, including 13 genes encoding proteins with unknown function. The transcriptional start site for each operon was determined. In conjunction with promoter activity assays, this analysis revealed that the type III secretion system operons encode sigma(70)-like promoter elements. Transcriptional initiation by a sigma factor responsible for constitutive gene expression indicates that undefined activators or repressors regulate developmental stage-specific expression of chlamydial type III secretion system genes.
Affiliation(s)
- P Scott Hefty
- Division of Infectious Diseases, School of Public Health, 140 Earl Warren Hall, University of California, Berkeley, Berkeley, CA 94720, USA
|
46
|
Jacques PÉ, Rodrigue S, Gaudreau L, Goulet J, Brzezinski R. Detection of prokaryotic promoters from the genomic distribution of hexanucleotide pairs. BMC Bioinformatics 2006; 7:423. [PMID: 17014715 PMCID: PMC1615881 DOI: 10.1186/1471-2105-7-423] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 05/05/2006] [Accepted: 10/02/2006] [Indexed: 12/03/2022]
Abstract
Background In bacteria, sigma factors and other transcriptional regulatory proteins recognize DNA patterns upstream of their target genes and interact with RNA polymerase to control transcription. As a consequence of evolution, DNA sequences recognized by transcription factors are thought to be enriched in intergenic regions (IRs) and depleted from coding regions of prokaryotic genomes.
Results In this work, we report that the genomic distribution of transcription factor binding sites is biased towards IRs, and that this bias is conserved amongst bacterial species. We further take advantage of this observation to develop an algorithm that can efficiently identify promoter boxes by a distribution-dependent approach rather than a direct sequence comparison approach. This strategy, which can easily be combined with other methodologies, allowed the identification of promoter sequences in ten species and can be used with any annotated bacterial genome, with results that rival current methodologies. Experimental validations of predicted promoters also support our approach.
Conclusion Considering that complete genomic sequences of over 1000 bacteria will soon be available and that little transcriptional information is available for most of them, our algorithm constitutes a promising tool for the prediction of promoter sequences. Importantly, our methodology could also be adapted to identify DNA sequences recognized by other regulatory proteins.
Affiliation(s)
- Pierre-Étienne Jacques
- Département de biologie, Université de Sherbrooke, Sherbrooke, Québec, Canada
- Département d'informatique, Université de Sherbrooke, Sherbrooke, Québec, Canada
- Sébastien Rodrigue
- Département de biologie, Université de Sherbrooke, Sherbrooke, Québec, Canada
- Luc Gaudreau
- Département de biologie, Université de Sherbrooke, Sherbrooke, Québec, Canada
- Jean Goulet
- Département d'informatique, Université de Sherbrooke, Sherbrooke, Québec, Canada
- Ryszard Brzezinski
- Département de biologie, Université de Sherbrooke, Sherbrooke, Québec, Canada
- Centre d'étude et de valorisation de la diversité microbienne, Université de Sherbrooke, Sherbrooke, Québec, Canada
|
47
|
Li QZ, Lin H. The recognition and prediction of σ70 promoters in Escherichia coli K-12. J Theor Biol 2006; 242:135-41. [PMID: 16603195 DOI: 10.1016/j.jtbi.2006.02.007] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.3] [Received: 09/09/2005] [Revised: 02/05/2006] [Accepted: 02/10/2006] [Indexed: 10/24/2022]
Abstract
Based on a conservation analysis of the 683 latest experimentally verified sigma(70) promoter sequences of Escherichia coli K-12, conserved hexamer segments at different sites are found to play a key role in promoter regions, and a novel position-correlation scoring matrix (PCSM) algorithm for predicting sigma(70) promoters is presented. The predictive capacity of the algorithm is tested by 10-fold cross-validation. The results show that the sensitivity and specificity are 91% and 81%, respectively. The 683 experimentally verified sigma(70) promoters were then used as the training set to search the complete E. coli K-12 sequence of 4,639,221 bp. The results show that 100% of the 683 experimentally verified sigma(70) promoters are identified and that some possible new promoters are predicted.
Affiliation(s)
- Qian-Zhong Li
- Department of Physics, Laboratory of Theoretical Biophysics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, China.
|
48
|
Nonaka G, Blankschien M, Herman C, Gross CA, Rhodius VA. Regulon and promoter analysis of the E. coli heat-shock factor, sigma32, reveals a multifaceted cellular response to heat stress. Genes Dev 2006; 20:1776-89. [PMID: 16818608 PMCID: PMC1522074 DOI: 10.1101/gad.1428206] [Citation(s) in RCA: 246] [Impact Index Per Article: 12.9] [Indexed: 11/25/2022]
Abstract
The heat-shock response (HSR), a universal cellular response to heat, is crucial for cellular adaptation. In Escherichia coli, the HSR is mediated by the alternative sigma factor, sigma32. To determine its role, we used genome-wide expression analysis and promoter validation to identify genes directly regulated by sigma32 and screened ORF overexpression libraries to identify sigma32 inducers. We triple the number of genes validated to be transcribed by sigma32 and provide new insights into the cellular role of this response. Our work indicates that the response is propagated as the regulon encodes numerous global transcriptional regulators, reveals that sigma70 holoenzyme initiates from 12% of sigma32 promoters, which has important implications for global transcriptional wiring, and identifies a new role for the response in protein homeostasis, that of protecting complex proteins. Finally, this study suggests that the response protects the cell membrane and responds to its status: Fully 25% of sigma32 regulon members reside in the membrane and alter its functionality; moreover, a disproportionate fraction of overexpressed proteins that induce the response are membrane localized. The intimate connection of the response to the membrane rationalizes why a major regulator of the response resides in that cellular compartment.
Affiliation(s)
- Gen Nonaka
- Department of Microbiology and Immunology, University of California at San Francisco, San Francisco, California 94143, USA
|
49
|
Rhodius VA, Suh WC, Nonaka G, West J, Gross CA. Conserved and variable functions of the sigmaE stress response in related genomes. PLoS Biol 2006; 4:e2. [PMID: 16336047 PMCID: PMC1312014 DOI: 10.1371/journal.pbio.0040002] [Citation(s) in RCA: 415] [Impact Index Per Article: 21.8] [Received: 03/17/2005] [Accepted: 10/13/2005] [Indexed: 11/19/2022]
Abstract
Bacteria often cope with environmental stress by inducing alternative sigma (σ) factors, which direct RNA polymerase to specific promoters, thereby inducing a set of genes called a regulon to combat the stress. To understand the conserved and organism-specific functions of each σ, it is necessary to be able to predict their promoters, so that their regulons can be followed across species. However, the variability of promoter sequences and motif spacing makes their prediction difficult. We developed and validated an accurate promoter prediction model for Escherichia coli σE, which enabled us to predict a total of 89 unique σE-controlled transcription units in E. coli K-12 and eight related genomes. σE controls the envelope stress response in E. coli K-12. The portion of the regulon conserved across genomes is functionally coherent, ensuring the synthesis, assembly, and homeostasis of lipopolysaccharide and outer membrane porins, the key constituents of the outer membrane of Gram-negative bacteria. The larger variable portion is predicted to perform pathogenesis-associated functions, suggesting that σE provides organism-specific functions necessary for optimal host interaction. The success of our promoter prediction model for σE suggests that it will be applicable for the prediction of promoter elements for many alternative σ factors. A model for predicting the variable promoter sequences associated with the bacterial stress response is developed and used to identify constituents of the transcriptional response to σE.
Affiliation(s)
- Virgil A Rhodius
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Won Chul Suh
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Gen Nonaka
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Joyce West
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Carol A Gross
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Department of Cell and Tissue Biology, University of California, San Francisco, California, United States of America
|
50
|
Abstract
An unsupervised clustering of 4541 DNA sequences containing active promoter regions from vertebrate and arthropod classes (including their viral genes) was performed. All necessary information was solely gathered a priori from the DNA sequences by measuring frequencies of tri-nucleotides and tetra-nucleotides. We employed Super Paramagnetic Clustering, a novel clustering algorithm based on physical properties of an inhomogeneous granular ferromagnet. This method utilizes Swendsen-Wang cluster Monte Carlo simulations to distinguish clusters by measuring pairs of correlation functions from different resolutions. We identified two strongly separated clusters of human viral genes corresponding to the Epstein-Barr virus and the Herpes Simplex virus type 1. In addition, vertebrate and arthropod sequences were successfully separated into two different classes with merely 9.25% of arthropod sequences being misclassified. From a functional perspective, these sequences have high gene function correlations with sequences from the vertebrate cluster. By tuning a clustering parameter, Super Paramagnetic Clustering was able to classify vertebrate class further into two major clusters, from where a large number of housekeeping genes and tissue-specific genes were found respectively. The indications came from observation of gene expression function and consensus transcription factors which were found grouped together in specific positions of the DNA sequences.
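The feature-extraction step described in this abstract, measuring tri-nucleotide and tetra-nucleotide frequencies, can be sketched as follows. This covers only the frequency vectors, not Super Paramagnetic Clustering itself. The function name `kmer_frequencies` and the choice to skip windows containing non-ACGT characters are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return the relative frequency of every k-mer over ACGT, in lexicographic order.

    Windows containing characters other than A, C, G, T are ignored
    (an assumption for this sketch), so the vector sums to 1 whenever
    at least one valid window exists.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Toy sequence: the 3-mer windows are ACG, CGT, GTA, TAC, ACG, CGT.
example_vector = kmer_frequencies("ACGTACGT", k=3)
```

Each sequence then becomes a fixed-length vector (64 entries for k=3, 256 for k=4) that any clustering method can consume without prior annotation.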
Affiliation(s)
- Sugiarto Radjiman
- Department of Computational Science, National University of Singapore, 117543 Singapore, Republic of Singapore.
|