1
Zhao M, Yuan Z, Wu L, Zhou S, Deng Y. Precise Prediction of Promoter Strength Based on a De Novo Synthetic Promoter Library Coupled with Machine Learning. ACS Synth Biol 2022; 11:92-102. [PMID: 34927418 DOI: 10.1021/acssynbio.1c00117] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Indexed: 01/24/2023]
Abstract
Promoters are one of the most critical regulatory elements controlling metabolic pathways. However, the fast and accurate prediction of promoter strength remains challenging, leading to time- and labor-consuming promoter construction and characterization processes. This dilemma is caused by the lack of a big promoter library that has gradient strengths, broad dynamic ranges, and clear sequence profiles that can be used to train an artificial intelligence model of promoter strength prediction. To overcome this challenge, we constructed and characterized a mutant library of Trc promoters (Ptrc) using 83 rounds of mutation-construction-screening-characterization engineering cycles. After excluding invalid mutation sites, we established a synthetic promoter library that consisted of 3665 different variants, displaying an intensity range of more than two orders of magnitude. The strongest variant was ∼69-fold stronger than the original Ptrc and 1.52-fold stronger than a 1 mM isopropyl-β-d-thiogalactoside-driven PT7 promoter, with an ∼454-fold difference between the strongest and weakest expression levels. Using this synthetic promoter library, different machine learning models were built and optimized to explore the relationships between promoter sequences and transcriptional strength. Finally, our XGBoost model exhibited optimal performance, and we utilized this approach to precisely predict the strength of artificially designed promoter sequences (R² = 0.88, mean absolute error = 0.15, and Pearson correlation coefficient = 0.94). Our work provides a powerful platform that enables the predictable tuning of promoters to achieve optimal transcriptional strength.
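The three figures of merit quoted in this abstract (coefficient of determination, mean absolute error, Pearson correlation) can be computed from paired measured and predicted strengths. The sketch below is a minimal illustration in plain Python; the function name `regression_metrics` and the toy numbers are assumptions for demonstration and are not taken from the paper or its evaluation code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, mean absolute error and Pearson r for paired values.

    Hypothetical helper for illustration; not the authors' code.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally long sequences with >= 2 values")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Residual and total sums of squares for R^2.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    # Spread of the predictions and their covariance with the targets.
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    if ss_tot == 0 or ss_pred == 0:
        raise ValueError("metrics are undefined for constant inputs")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "pearson": cov / math.sqrt(ss_tot * ss_pred),
    }


# Toy, made-up strengths purely to exercise the function.
toy_measured = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.1, 1.9, 3.2, 3.8]
toy_metrics = regression_metrics(toy_measured, toy_predicted)
```

Note that R² and the squared Pearson coefficient coincide only for a least-squares linear fit evaluated on its own training data, which is why an abstract can report both values separately.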
Affiliation(s)
- Mei Zhao
  - National Engineering Laboratory for Cereal Fermentation Technology (NELCF), Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
  - School of Food and Biological Engineering, Jiangsu University, 301 Xuefu Road, Zhenjiang, Jiangsu 212013, China
- Zhenqi Yuan
  - School of Artificial Intelligence and Computer Science, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Longtao Wu
  - College of Physics and Optoelectronics, Taiyuan University of Technology, Taiyuan 030024, China
- Shenghu Zhou
  - National Engineering Laboratory for Cereal Fermentation Technology (NELCF), Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Yu Deng
  - National Engineering Laboratory for Cereal Fermentation Technology (NELCF), Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
  - Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China

2
Mey F, Clauwaert J, Van Brempt M, Stock M, Maertens J, Waegeman W, De Mey M. ProD: A Tool for Predictive Design of Tailored Promoters in Escherichia coli. Methods Mol Biol 2022; 2516:51-59. [PMID: 35922621 DOI: 10.1007/978-1-0716-2413-5_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
A major goal in synthetic biology is the engineering of synthetic gene circuits with a predictable, controlled and designed outcome. This creates a need for building blocks that can modulate gene expression without interference with the native cell system. A tool allowing forward engineering of promoters with predictable transcription initiation frequency is still lacking. Promoter libraries specific for σ70 to ensure the orthogonality of gene expression were built in Escherichia coli and labeled using fluorescence-activated cell sorting to obtain high-throughput DNA sequencing data to train a convolutional neural network. We were able to confirm in vivo that the model is able to predict the promoter transcription initiation frequency (TIF) of new promoter sequences. Here, we provide an online tool for promoter design (ProD) in E. coli, which can be used to tailor output sequences of desired promoter TIF or predict the TIF of a custom sequence.
Affiliation(s)
- Friederike Mey
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, Ghent, Belgium
- Jim Clauwaert
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Maarten Van Brempt
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, Ghent, Belgium
- Michiel Stock
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Jo Maertens
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, Ghent, Belgium
- Willem Waegeman
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Marjan De Mey
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, Ghent, Belgium

3
Nascent RNA sequencing identifies a widespread sigma70-dependent pausing regulated by Gre factors in bacteria. Nat Commun 2021; 12:906. [PMID: 33568644 PMCID: PMC7876045 DOI: 10.1038/s41467-021-21150-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/18/2020] [Accepted: 01/14/2021] [Indexed: 01/29/2023]
Abstract
Promoter-proximal pausing regulates eukaryotic gene expression and serves as checkpoints to assemble elongation/splicing machinery. Little is known how broadly this type of pausing regulates transcription in bacteria. We apply nascent elongating transcript sequencing combined with RNase I footprinting for genome-wide analysis of σ70-dependent transcription pauses in Escherichia coli. Retention of σ70 induces strong backtracked pauses at a 10−20-bp distance from many promoters. The pauses in the 10−15-bp register of the promoter are dictated by the canonical −10 element, 6−7 nt spacer and “YR+1Y” motif centered at the transcription start site. The promoters for the pauses in the 16−20-bp register contain an additional −10-like sequence recognized by σ70. Our in vitro analysis reveals that DNA scrunching is involved in these pauses relieved by Gre cleavage factors. The genes coding for transcription factors are enriched in these pauses, suggesting that σ70 and Gre proteins regulate transcription in response to changing environmental cues. Transcription by bacterial RNA polymerase is interrupted by pausing events that play diverse regulatory roles. Here, the authors find that a large number of E. coli sigma70-dependent pauses, clustered at a 10−20-bp distance from promoters, are regulated by Gre cleavage factors constituting a mechanism for rapid response to changing environmental cues.

4
Van Brempt M, Clauwaert J, Mey F, Stock M, Maertens J, Waegeman W, De Mey M. Predictive design of sigma factor-specific promoters. Nat Commun 2020; 11:5822. [PMID: 33199691 PMCID: PMC7670410 DOI: 10.1038/s41467-020-19446-w] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 01/07/2020] [Accepted: 10/13/2020] [Indexed: 02/07/2023]
Abstract
To engineer synthetic gene circuits, molecular building blocks are developed which can modulate gene expression without interference, mutually or with the host's cell machinery. As the complexity of gene circuits increases, automated design tools and tailored building blocks to ensure perfect tuning of all components in the network are required. Despite the efforts to develop prediction tools that allow forward engineering of promoter transcription initiation frequency (TIF), such a tool is still lacking. Here, we use promoter libraries of E. coli sigma factor 70 (σ70)- and B. subtilis σB-, σF- and σW-dependent promoters to construct prediction models, capable of both predicting promoter TIF and orthogonality of the σ-specific promoters. This is achieved by training a convolutional neural network with high-throughput DNA sequencing data from fluorescence-activated cell sorted promoter libraries. This model functions as the base of the online promoter design tool (ProD), providing tailored promoters for tailored genetic systems.
Affiliation(s)
- Maarten Van Brempt
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, 9000, Ghent, Belgium
- Jim Clauwaert
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, 9000, Ghent, Belgium
- Friederike Mey
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, 9000, Ghent, Belgium
- Michiel Stock
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, 9000, Ghent, Belgium
- Jo Maertens
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, 9000, Ghent, Belgium
- Willem Waegeman
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, 9000, Ghent, Belgium
- Marjan De Mey
  - Centre for Synthetic Biology (CSB), Department of Biotechnology, Ghent University, 9000, Ghent, Belgium

5
Wang Y, Wang H, Wei L, Li S, Liu L, Wang X. Synthetic promoter design in Escherichia coli based on a deep generative network. Nucleic Acids Res 2020; 48:6403-6412. [PMID: 32424410 PMCID: PMC7337522 DOI: 10.1093/nar/gkaa325] [Citation(s) in RCA: 106] [Impact Index Per Article: 21.2] [Received: 12/29/2019] [Revised: 04/05/2020] [Accepted: 04/22/2020] [Indexed: 01/11/2023]
Abstract
Promoter design remains one of the most important considerations in metabolic engineering and synthetic biology applications. Theoretically, there are 4^50 possible sequences for a 50-nt promoter, of which naturally occurring promoters make up only a small subset. To explore the vast number of potential sequences, we report a novel AI-based framework for de novo promoter design in Escherichia coli. The model, which was guided by sequence features learned from natural promoters, could capture interactions between nucleotides at different positions and design novel synthetic promoters in silico. We combined a deep generative model that guides the search for artificial sequences with a predictive model to preselect the most promising promoters. The AI-designed promoters were optimized based on the promoter activity in E. coli and the predictive model. After two rounds of optimization, up to 70.8% of the AI-designed promoters were experimentally demonstrated to be functional, and few of them shared significant sequence similarity with the E. coli genome. Our work provided an end-to-end approach to the de novo design of novel promoter elements, indicating the potential to apply deep learning methods to de novo genetic element design.
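The size of the search space mentioned in this abstract follows from four possible nucleotides at each of 50 positions, that is, four raised to the 50th power. A short sketch of that arithmetic is given below; the function name `sequence_space_size` is a hypothetical helper introduced only for illustration.

```python
ALPHABET_SIZE = 4  # A, C, G, T


def sequence_space_size(length: int) -> int:
    """Number of distinct DNA sequences of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # Each position is an independent choice among four nucleotides.
    return ALPHABET_SIZE ** length


# For a 50-nt promoter this is 4**50 == 2**100, roughly 1.27e30.
promoter_space = sequence_space_size(50)
```

Python integers have arbitrary precision, so the exact value is available without overflow; exhaustive screening of a space this large is clearly infeasible, which motivates guided generative search.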
Affiliation(s)
- Ye Wang
  - Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing 100084, China
- Haochen Wang
  - Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing 100084, China
- Lei Wei
  - Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing 100084, China
- Shuailin Li
  - School of Life Sciences, Tsinghua University, Beijing 100084, China
- Liyang Liu
  - Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing 100084, China
- Xiaowo Wang
  - Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing 100084, China

6
Xu N, Wei L, Liu J. Recent advances in the applications of promoter engineering for the optimization of metabolite biosynthesis. World J Microbiol Biotechnol 2019; 35:33. [DOI: 10.1007/s11274-019-2606-0] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 10/23/2018] [Accepted: 01/23/2019] [Indexed: 01/24/2023]

7
Liu Q, Liu M, Wu W. Strong/Weak Feature Recognition of Promoters Based on Position Weight Matrix and Ensemble Set-Valued Models. J Comput Biol 2018; 25:1152-1160. [PMID: 29993261 DOI: 10.1089/cmb.2018.0067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
In this article, we propose a method to recognize the strong/weak property of the promoters based on the nucleotide sequence. To the best of our knowledge, it is the first time to predict the strong/weak property of the promoters. First, position weight matrix (PWM) is used to evaluate the contributions of the nucleotides to the promoter strength. Then, the set-valued model is used to describe the relation between the nucleotide sequence and the strength. Considering the small-sample and imbalance features of the promoter data, we propose an ensemble approach to predict the strong/weak property of the promoters. The proposed method is used to recognize 60 [Formula: see text] promoters of Escherichia coli. The results show the effectiveness of the proposed method. This article provides a simple way for a biologist to evaluate the strong/weak feature of promoters from the nucleotide sequence.
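As a generic illustration of the position weight matrix (PWM) idea named in this abstract, the sketch below builds a log-odds PWM from a handful of aligned sequences and scores a candidate against it. It assumes a uniform background, a pseudocount of one, and toy hexamers; the function names `build_pwm` and `pwm_score` are hypothetical, and the code does not reproduce the set-valued or ensemble models of the article.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Log2-odds PWM (one dict per position) from equal-length sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    total = len(aligned) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        # Smoothed frequency of each base, relative to the background.
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def pwm_score(pwm, seq):
    """Sum of per-position log-odds for a sequence of matching length."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the PWM length")
    return sum(column[base] for column, base in zip(pwm, seq))


# Toy -10-like hexamers, invented for illustration only.
toy_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
toy_pwm = build_pwm(toy_sites)
```

A sequence close to the consensus (here `TATAAT`) receives a higher score than an unrelated one, which is the per-nucleotide contribution that a strength classifier can then build on.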
Affiliation(s)
- Qie Liu
  - Department of Automation, Tsinghua University, Beijing, China
- Min Liu
  - Department of Automation, Tsinghua University, Beijing, China
- Wenfa Wu
  - Department of Automation, Tsinghua University, Beijing, China

8
Matos IMN, Coelho MM, Schartl M. Gene copy silencing and DNA methylation in natural and artificially produced allopolyploid fish. J Exp Biol 2016; 219:3072-3081. [PMID: 27445349 DOI: 10.1242/jeb.140418] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 03/13/2016] [Accepted: 07/19/2016] [Indexed: 12/28/2022]
Abstract
Allelic silencing is an important mechanism for coping with gene dosage changes in polyploid organisms that is well known in allopolyploid plants. Only recently, it was shown in the allotriploid fish Squalius alburnoides that this process also occurs in vertebrates. However, it is still unknown whether this silencing mechanism is common to other allopolyploid fish, and which mechanisms might be responsible for allelic silencing. We addressed these questions in a comparative study between Squalius alburnoides and another allopolyploid complex, the Amazon molly (Poecilia formosa). We examined the allelic expression patterns for three target genes in four somatic tissues of natural allo-anorthoploids and laboratory-produced tri-genomic hybrids of S. alburnoides and P. formosa. Also, for both complexes, we evaluated the correlation between total DNA methylation level and the ploidy status and genomic composition of the individuals. We found that allelic silencing also occurs in other allopolyploid organisms besides the single one that was previously known. We found and discuss disparities within and between the two considered complexes concerning the pattern of allele-specific expression and DNA methylation levels. Disparities might be due to intrinsic characteristics of each genome involved in the hybridization process. Our findings also support the idea that long-term evolutionary processes have an effect on the allele expression patterns and possibly also on DNA methylation levels.
Affiliation(s)
- Isa M N Matos
  - Centre for Ecology, Evolution and Environmental Changes, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
  - Department of Physiological Chemistry, Biocenter, University of Würzburg, Würzburg 97078, Germany
- Maria M Coelho
  - Centre for Ecology, Evolution and Environmental Changes, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Manfred Schartl
  - Department of Physiological Chemistry, Biocenter, University of Würzburg, Würzburg 97078, Germany
  - Comprehensive Cancer Center Mainfranken, University Clinic Würzburg, Würzburg 97078, Germany
  - Texas Institute for Advanced Study and Department of Biology, Texas A&M University, College Station, TX 77843, USA

9
Beier R, Labudde D. Numeric promoter description - A comparative view on concepts and general application. J Mol Graph Model 2015; 63:65-77. [PMID: 26655334 DOI: 10.1016/j.jmgm.2015.11.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2015] [Revised: 11/12/2015] [Accepted: 11/17/2015] [Indexed: 11/25/2022]
Abstract
Nucleic acid molecules play a key role in a variety of biological processes. Starting from storage and transfer tasks, this also comprises the triggering of biological processes, regulatory effects and the active influence gained by target binding. Based on the experimental output (in this case promoter sequences), further in silico analyses aid in gaining new insights into these processes and interactions. The numerical description of nucleic acids thereby constitutes a bridge between the concrete biological issues and the analytical methods. Hence, this study compares 26 descriptor sets obtained by applying well-known numerical description concepts to an established dataset of 38 DNA promoter sequences. The suitability of the description sets was evaluated by computing partial least squares regression models and assessing the model accuracy. We conclude that the major importance regarding the descriptive power is attached to positional information rather than to explicitly incorporated physico-chemical information, since a sufficient amount of implicit physico-chemical information is already encoded in the nucleobase classification. The regression models especially benefited from employing the information that is encoded in the sequential and structural neighborhood of the nucleobases. Thus, the analyses of n-grams (short fragments of length n) suggested that they are valuable descriptors for DNA target interactions. A mixed n-gram descriptor set thereby yielded the best description of the promoter sequences. The corresponding regression model was checked and found to be plausible as it was able to reproduce the characteristic binding motifs of promoter sequences in a reasonable degree. As most functional nucleic acids are based on the principle of molecular recognition, the findings are not restricted to promoter sequences, but can rather be transferred to other kinds of functional nucleic acids. Thus, the concepts presented in this study could provide advantages for future nucleic acid-based technologies, like biosensoring, therapeutics and molecular imaging.
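To make the notion of an n-gram descriptor concrete, the sketch below counts every overlapping fragment of length n in a DNA sequence and concatenates the counts for several n into one "mixed" feature set. This is only an assumed, simplified encoding: plain counts discard the positional information that the abstract singles out as important, and the function names `ngram_descriptor` and `mixed_ngram_descriptor` are hypothetical rather than taken from the study.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def ngram_descriptor(seq, n):
    """Counts of all 4**n possible n-grams, including those that are absent."""
    seq = seq.upper()
    observed = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return {
        "".join(gram): observed.get("".join(gram), 0)
        for gram in product(BASES, repeat=n)
    }


def mixed_ngram_descriptor(seq, sizes=(1, 2, 3)):
    """Merge descriptors for several n into one feature dictionary."""
    features = {}
    for n in sizes:
        # Keys of different lengths cannot collide, so update() is safe.
        features.update(ngram_descriptor(seq, n))
    return features


# Toy hexamer, invented for illustration only.
toy_bigrams = ngram_descriptor("TATAAT", 2)
toy_mixed = mixed_ngram_descriptor("TATAAT")
```

The resulting fixed-length vectors (4 + 16 + 64 = 84 features for n = 1 to 3) could then serve as the input matrix for a regression method such as partial least squares.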
Affiliation(s)
- Rico Beier
  - University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.
- Dirk Labudde
  - University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.

10
Li J, Zhang Y. Relationship between promoter sequence and its strength in gene expression. Eur Phys J E Soft Matter 2014; 37:44. [PMID: 25260329 DOI: 10.1140/epje/i2014-14086-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 06/10/2014] [Accepted: 09/05/2014] [Indexed: 06/03/2023]
Abstract
Promoter strength, or activity, is important in genetic engineering and synthetic biology. A constitutive promoter with a certain strength for one given RNA can often be reused for other RNAs. Therefore, the strength of one promoter is mainly determined by its nucleotide sequence. One of the main difficulties in genetic engineering and synthetic biology is how to control the expression of a certain protein at a given level. One usually used way to achieve this goal is to choose one promoter with a suitable strength which can be employed to regulate the rate of transcription, which then leads to the required level of protein expression. For this purpose, so far, many promoter libraries have been established experimentally. However, theoretical methods to predict the strength of one promoter from its nucleotide sequence are desirable. Such methods are not only valuable in the design of promoter with specified strength, but also meaningful to understand the mechanism of promoter in gene transcription. In this study, through various tests, a theoretical model is presented to describe the relationship between promoter strength and nucleotide sequence. Our analysis shows that promoter strength is greatly influenced by nucleotide groups with three adjacent nucleotides in their sequences. Meanwhile, nucleotides in different regions of promoter sequence have different effects on promoter strength. Based on experimental data for E. coli promoters, our calculations indicate that nucleotides in the -10 region, the -35 region, and the discriminator region of a promoter sequence are more important for determining promoter strength than those in the spacing region. With model parameter values obtained by fitting to experimental data, four promoter libraries are theoretically built for the corresponding experimental environments under which data for promoter strength in gene expression has been measured previously.
Affiliation(s)
- Jingwei Li
  - Shanghai Key Laboratory for Contemporary Applied Mathematics, Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433, Shanghai, China

11
Meng H, Wang J, Xiong Z, Xu F, Zhao G, Wang Y. Quantitative design of regulatory elements based on high-precision strength prediction using artificial neural network. PLoS One 2013; 8:e60288. [PMID: 23560087 PMCID: PMC3613377 DOI: 10.1371/journal.pone.0060288] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 01/16/2013] [Accepted: 02/25/2013] [Indexed: 01/31/2023]
Abstract
Accurate and controllable regulatory elements such as promoters and ribosome binding sites (RBSs) are indispensable tools to quantitatively regulate gene expression for rational pathway engineering. Therefore, de novo designing regulatory elements is brought back to the forefront of synthetic biology research. Here we developed a quantitative design method for regulatory elements based on strength prediction using artificial neural network (ANN). One hundred mutated Trc promoter & RBS sequences, which were finely characterized with a strength distribution from 0 to 3.559 (relative to the strength of the original sequence which was defined as 1), were used for model training and test. A precise strength prediction model, NET90_19_576, was finally constructed with high regression correlation coefficients of 0.98 for both model training and test. Sixteen artificial elements were in silico designed using this model. All of them were proved to have good consistency between the measured strength and our desired strength. The functional reliability of the designed elements was validated in two different genetic contexts. The designed parts were successfully utilized to improve the expression of BmK1 peptide toxin and fine-tune deoxy-xylulose phosphate pathway in Escherichia coli. Our results demonstrate that the methodology based on ANN model can de novo and quantitatively design regulatory elements with desired strengths, which are of great importance for synthetic biology applications.
Affiliation(s)
- Hailin Meng
  - Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Jianfeng Wang
  - Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
- Zhiqiang Xiong
  - Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Feng Xu
  - Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Guoping Zhao
  - Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Yong Wang
  - Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China

12
Mishra H, Singh N, Misra K, Lahiri T. An ANN-GA model based promoter prediction in Arabidopsis thaliana using tilling microarray data. Bioinformation 2011; 6:240-3. [PMID: 21887014 PMCID: PMC3159145 DOI: 10.6026/97320630006240] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/10/2011] [Accepted: 05/09/2011] [Indexed: 11/23/2022]
Abstract
Identification of promoter region is an important part of gene annotation. Identification of promoters in eukaryotes is important as promoters modulate various metabolic functions and cellular stress responses. In this work, a novel approach utilizing intensity values of tilling microarray data for a model eukaryotic plant Arabidopsis thaliana, was used to specify promoter region from non-promoter region. A feed-forward back propagation neural network model supported by genetic algorithm was employed to predict the class of data with a window size of 41. A dataset comprising of 2992 data vectors representing both promoter and non-promoter regions, chosen randomly from probe intensity vectors for whole genome of Arabidopsis thaliana generated through tilling microarray technique was used. The classifier model shows prediction accuracy of 69.73% and 65.36% on training and validation sets, respectively. Further, a concept of distance based class membership was used to validate reliability of classifier, which showed promising results. The study shows the usability of micro-array probe intensities to predict the promoter regions in eukaryotic genomes.
Affiliation(s)
- Hrishikesh Mishra
  - Division of Applied Sciences and Indo-Russian Centre for Biotechnology, Indian Institute of Information Technology, Allahabad, India

13
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 98] [Impact Index Per Article: 6.1] [Indexed: 11/20/2022]
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Affiliation(s)
- Sacha A F T van Hijum
  - Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.

14
Nov Klaiman T, Hosid S, Bolshoy A. Upstream curved sequences in E. coli are related to the regulation of transcription initiation. Comput Biol Chem 2009; 33:275-82. [PMID: 19646927 DOI: 10.1016/j.compbiolchem.2009.06.007] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 04/01/2009] [Accepted: 06/17/2009] [Indexed: 01/03/2023]
Abstract
The advancement in Escherichia coli genome research has made the information regarding transcription start sites of many genes available. A study relying on the availability of transcription start locations was performed. The first question addressed was what an average DNA curvature profile upstream of genes would look like when these genes are aligned by transcription start sites in comparison to alignment by translation start sites. Since it was hypothesized that curvature plays a role in transcription regulation, the expectation was that curvature measurements relative to transcription starts, rather than translation, should strengthen the signal. Our study justified this expectation. The second question aimed to clarify the relation between DNA curvature and promoter strength. Through clustering based on DNA curvature profiles along promoter regions, a strong positive correlation between the promoter strength and the curved DNA was found. The third question dealt with dinucleotide periodicity in E. coli to see whether a periodicity pattern specific to promoter regions exists. Such unknown pattern might shed new light on transcription regulation mechanisms in E. coli. A sequence periodicity of about 11 bp is characteristic to the whole E. coli genome, and is especially well-expressed in intergenic regions. Here it was shown that regions of the size of about 100-150 bp centered 70-100 bp upstream to transcription starts carry hidden periodicity with a period of about 10.3 bp.
Affiliation(s)
- Tamar Nov Klaiman
- Department of Evolutionary and Environmental Biology, University of Haifa, Haifa 31905, Israel
|
15
|
Weindl J, Dawy Z, Hanus P, Zech J, Mueller JC. Modeling promoter search by E. coli RNA polymerase: one-dimensional diffusion in a sequence-dependent energy landscape. J Theor Biol 2009; 259:628-34. [PMID: 19463831 DOI: 10.1016/j.jtbi.2009.05.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 02/04/2009] [Revised: 05/12/2009] [Accepted: 05/12/2009] [Indexed: 10/20/2022]
Abstract
We present a biophysical model of promoter search by Escherichia coli RNA polymerase. We use an unconventional weight matrix derived from promoter strength data to extract the energy landscape common to a large set of known promoters. This exhibits a continuous strengthening of the binding energy when approaching the transcription start site from either side. During promoter search, the RNA polymerase slides along the DNA double helix (one-dimensional diffusion) after randomly binding to it. We discuss the possibility that the sliding has a sequence-dependent component, which implies that the energy landscape influences the movement with respect to speed, direction and efficiency. Based on this assumption, we relate the obtained energy landscape around the promoters to the one-dimensional diffusion of the RNA polymerase. Our analytical results suggest that the sequence-dependent random walk slows down and gets directed upon entering a region of 500 bp around the transcription start site, which significantly increases the efficiency of promoter search. These results may explain how the RNA polymerase is able to find the promoter in biologically relevant times out of a vast excess of non-target sites. Moreover, they provide evidence for a sequence-dependent component of one-dimensional diffusion.
Affiliation(s)
- Johanna Weindl
- Institute for Communications Engineering, Technische Universität München, Arcisstrasse 21, 80290 München, Germany.
|
16
|
Gaussian process: an alternative approach for QSAM modeling of peptides. Amino Acids 2009; 38:199-212. [DOI: 10.1007/s00726-008-0228-1] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.1] [Received: 08/01/2008] [Accepted: 12/18/2008] [Indexed: 10/21/2022]
|
17
|
Weindl J, Hanus P, Dawy Z, Zech J, Hagenauer J, Mueller JC. Modeling DNA-binding of Escherichia coli sigma70 exhibits a characteristic energy landscape around strong promoters. Nucleic Acids Res 2007; 35:7003-10. [PMID: 17940097 PMCID: PMC2175306 DOI: 10.1093/nar/gkm720] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/24/2022]
Abstract
We present a computational model of DNA-binding by σ70 in Escherichia coli which allows us to extract the functional characteristics of the wider promoter environment. Our model is based on a measure for the binding energy of σ70 to the DNA, which is derived from promoter strength data and used to build a non-standard weight matrix. In contrast to conventional approaches, we apply the matrix to the environment of 3765 known promoters and consider the average matrix scores to extract the common features. In addition to the expected minimum of the average binding energy at the exact promoter site, we detect two minima shortly upstream and downstream of the promoter. These are likely to occur due to correlation between the two binding sites of σ70. Moreover, we observe a characteristic energy landscape in the 500 bp surrounding the transcription start sites, which is more pronounced in groups of strong promoters than in groups of weak promoters. Our subsequent analysis suggests that the characteristic energy landscape is more likely an influence on target search by the RNA polymerase than a result of nucleotide biases in transcription factor binding sites.
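The general idea of sliding a weight matrix along aligned sequences and averaging the scores per position can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' model: the 3-position `toy_matrix`, its energy-like values (lower means stronger binding), the function names, and the two toy sequences are all hypothetical.

```python
def score_window(window, matrix):
    """Sum the per-position matrix entries for one window of bases."""
    return sum(matrix[i][base] for i, base in enumerate(window))


def scan(seq, matrix):
    """Score every window of len(matrix) bases along the sequence."""
    width = len(matrix)
    return [score_window(seq[i:i + width], matrix)
            for i in range(len(seq) - width + 1)]


def average_profile(seqs, matrix):
    """Average the score profiles of equally long, pre-aligned sequences."""
    profiles = [scan(s, matrix) for s in seqs]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


# Hypothetical energy-like matrix that favours the motif 'ACG'.
toy_matrix = [
    {"A": -1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": -1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": -1.0, "T": 0.0},
]

scores = scan("TTACGTT", toy_matrix)
best = min(range(len(scores)), key=scores.__getitem__)  # lowest energy window
avg = average_profile(["TTACGTT", "GGACGCC"], toy_matrix)
print(scores, best, avg)
```

In this toy setting the minimum of the averaged profile sits where the motif is aligned, which is the kind of feature an averaged energy landscape is meant to expose.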
Affiliation(s)
- Johanna Weindl
- Institute for Communications Engineering, Technische Universität München, Arcisstrasse 21, 80290 München, Germany
|
18
|
Liang G, Li Z. Scores of generalized base properties for quantitative sequence-activity modelings for E. coli promoters based on support vector machine. J Mol Graph Model 2007; 26:269-81. [PMID: 17291800 DOI: 10.1016/j.jmgm.2006.12.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 06/30/2006] [Revised: 11/18/2006] [Accepted: 12/10/2006] [Indexed: 10/23/2022]
Abstract
A novel base sequence representation technique, SGBP (scores of generalized base properties), was derived from principal component analysis of a matrix of 1209 property parameters, including 0D, 1D, 2D and 3D information, for the five bases A, C, G, T and U. It was then employed to represent the sequence structures of E. coli promoters. Variables used as inputs to partial least squares (PLS) and support vector machine (SVM) models were selected by genetic algorithm-partial least squares. According to a D-optimal design, all samples were divided into a training set, used to develop quantitative sequence-activity models (QSAMs), and a test set, used to validate the predictive power of the resulting models. QSAM by PLS showed that the base properties at positions -42, -34, -31, -33, -41, -46 and -29 may have more influence on promoter strength, which points toward the design of strong promoters. The SVM parameters were determined by response surface methodology. The results indicated that the fitting and predictive abilities of QSAM by SVM for the internal and external samples were better than those of PLS. These results showed that SGBP is a useful structural representation methodology for QSAMs owing to its rich structural information, easy manipulation, and high characterization competence. Moreover, the SGBP-GA-SVM route can be applied further to sequence design and activity prediction for DNA or RNA.
Affiliation(s)
- Guizhao Liang
- College of Bioengineering, Chongqing University, Chongqing 400030, PR China
|
19
|
Rani TS, Bhavani SD, Bapi RS. Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics 2007; 23:582-8. [PMID: 17237059 DOI: 10.1093/bioinformatics/btl670] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Patterns in the promoter sequences within a species are known to be conserved, but there are many exceptions to this rule, which makes promoter recognition a complex problem. Although many complex feature extraction schemes coupled with several classifiers have been proposed for promoter recognition in the current literature, the problem is still open. RESULTS: A dinucleotide global feature extraction method is proposed for the recognition of sigma-70 promoters in Escherichia coli in this article. The positive data set consists of sigma-70 promoters with known transcription starting points which are part of the RegulonDB and PromEC databases. Four different kinds of negative data sets are considered, two of them biological sets (Gordon et al., 2003) and the other two synthetic data sets. Our results reveal that a single-layer perceptron using dinucleotide features is able to achieve an accuracy of 80% against a background of biological non-promoters and 96% for random data sets. A scheme for locating the promoter regions in a given genome sequence is proposed. A deeper analysis of the data set shows that there is a bifurcation of the data set into two distinct classes, a majority class and a minority class. Our results indicate that the majority class, constituting the majority promoter and the majority non-promoter signal, is linearly separable. The minority class is also linearly separable. We further show that the feature extraction and classification methods proposed in the paper are generic enough to be applied to the more complex problem of eucaryotic promoter recognition. We present Drosophila promoter recognition as a case study. AVAILABILITY: http://202.41.85.117/htmfiles/faculty/tsr/tsr.html.
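A global dinucleotide representation of a sequence can be sketched as a 16-element frequency vector fed to a single linear threshold unit. The code below is an illustrative assumption rather than the paper's implementation: the function names, the feature ordering, and the hand-picked weights and bias are hypothetical, and no training is shown.

```python
from itertools import product

# All 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCS = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_features(seq):
    """Return the 16 dinucleotide frequencies of a sequence."""
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping dinucleotide windows
    if total <= 0:
        return [0.0] * len(DINUCS)
    counts = {d: 0 for d in DINUCS}
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip windows containing non-ACGT symbols
            counts[pair] += 1
    return [counts[d] / total for d in DINUCS]


def perceptron_predict(features, weights, bias):
    """Single linear threshold unit: 1 if w.x + b > 0, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


feats = dinucleotide_features("AACG")
# Hypothetical, untrained weights that respond only to the 'AA' frequency.
weights = [1.0 if d == "AA" else 0.0 for d in DINUCS]
label = perceptron_predict(feats, weights, bias=-0.2)
print(feats[:2], label)
```

Because the feature vector has a fixed length regardless of sequence length, the same classifier can be applied to windows slid along a genome.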
Affiliation(s)
- T Sobha Rani
- Computational Intelligence Lab, Department of Computer and Information Sciences, University of Hyderabad, Hyderabad 500046, India.
|
20
|
Fujita A, Sato JR, Rodrigues LDO, Ferreira CE, Sogayar MC. Evaluating different methods of microarray data normalization. BMC Bioinformatics 2006; 7:469. [PMID: 17059609 PMCID: PMC1636075 DOI: 10.1186/1471-2105-7-469] [Citation(s) in RCA: 185] [Impact Index Per Article: 9.7] [Received: 05/12/2006] [Accepted: 10/23/2006] [Indexed: 11/10/2022]
Abstract
Background: With the development of DNA hybridization microarray technologies, it is now possible to simultaneously assess the expression levels of thousands to tens of thousands of genes. Quantitative comparison of microarrays uncovers distinct patterns of gene expression, which define different cellular phenotypes or cellular responses to drugs. Due to technical biases, normalization of the intensity levels is a prerequisite to performing further statistical analyses. Therefore, choosing a suitable approach for normalization can be critical and deserves judicious consideration. Results: Here, we considered three commonly used normalization approaches, namely Loess, Splines and Wavelets, and two non-parametric regression methods that have yet to be used for normalization, namely Kernel smoothing and Support Vector Regression. The results obtained were compared using artificial microarray data and benchmark studies. The results indicate that Support Vector Regression is the most robust to outliers and that Kernel smoothing is the worst normalization technique, while no practical differences were observed between Loess, Splines and Wavelets. Conclusion: In light of our results, Support Vector Regression is favored for microarray normalization because it is more robust than the other methods in estimating the normalization curve.
Affiliation(s)
- André Fujita
- Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 – São Paulo, 05508-090 SP, Brazil
- Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 – São Paulo, 05513-970 SP, Brazil
- João Ricardo Sato
- Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 – São Paulo, 05508-090 SP, Brazil
- Carlos Eduardo Ferreira
- Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 – São Paulo, 05508-090 SP, Brazil
- Mari Cleide Sogayar
- Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 – São Paulo, 05513-970 SP, Brazil
|
21
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2005. [PMCID: PMC2447491 DOI: 10.1002/cfg.425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
|