1
Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023; 9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]
Abstract
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have proven exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of papers published on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied to related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited than SVMs to working with sequences. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two aspects of data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models was tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices for transcription start site identification, as well as a method to generate transcription start site datasets, including negative instances, for any species available in Ensembl. We found that deep learning methods are better suited than SVMs to this problem, being more efficient and better adapted to long sequences and large amounts of data.
We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
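As a rough illustration of what generating TSS training examples with negative instances can involve, the following Python sketch cuts fixed-length windows around annotated TSS positions (positives) and around randomly drawn positions away from any TSS (negatives), then offers a one-hot encoder for the resulting sequences. The function names, the window size, the `min_gap` exclusion rule and the negative-to-positive ratio are all assumptions made for illustration; they are not the procedure used in the article.

```python
import random

ALPHABET = "ACGT"


def one_hot(seq):
    """One-hot encode a DNA string; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


def window(genome, center, flank):
    """Return `flank` bases on each side of `center`, or None if out of range."""
    start, end = center - flank, center + flank + 1
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


def build_examples(genome, tss_positions, flank=5, n_negatives=2,
                   min_gap=3, seed=0):
    """Build (sequence, label) pairs: 1 for TSS-centred windows, 0 otherwise.

    Hypothetical scheme: negatives are centred on random positions that lie
    at least `min_gap` bases away from every annotated TSS.
    """
    rng = random.Random(seed)
    examples = []
    for pos in tss_positions:
        seq = window(genome, pos, flank)
        if seq is not None:
            examples.append((seq, 1))
    candidates = [p for p in range(flank, len(genome) - flank)
                  if all(abs(p - t) >= min_gap for t in tss_positions)]
    for pos in rng.sample(candidates, min(n_negatives, len(candidates))):
        examples.append((window(genome, pos, flank), 0))
    return examples
```

Raising `n_negatives` relative to the number of annotated sites is one simple way to reproduce the class imbalance the abstract mentions when experimenting with a classifier.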
Affiliation(s)
- Alicia Olivares-Gil
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- José F. Díez-Pastor
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- César García-Osorio
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
2
Prokaryotic and eukaryotic promoters identification based on residual network transfer learning. Bioprocess Biosyst Eng 2022; 45:955-967. [DOI: 10.1007/s00449-022-02716-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2022] [Accepted: 02/27/2022] [Indexed: 11/26/2022]
3
Review of the Estimation Methods of Energy Consumption for Battery Electric Buses. Energies 2021. [DOI: 10.3390/en14227578] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/17/2022]
Abstract
In the transportation sector, electric battery bus (EBB) deployment is considered to be a potential solution to reduce global warming because no greenhouse gas (GHG) emissions are directly produced by EBBs. In addition to the required charging infrastructure, estimating the energy consumption of buses has become a crucial precondition for the deployment and planning of electric bus fleets. Policy and decision-makers may not have the specific tools needed to estimate the energy consumption of a particular bus network. Therefore, many state-of-the-art studies have proposed models to determine the energy demand of electric buses. However, these studies have not critically reviewed, classified and discussed the challenges of the approaches that are applied to estimate EBBs’ energy demands. Thus, this manuscript provides a detailed review of the forecasting models used to estimate the energy consumption of EBBs. Furthermore, this work fills the gap by classifying the models for estimating EBBs’ energy consumption into small-town depot and big-city depot networks. In brief, this review explains and discusses the models and formulations of networks associated with well-to-wheel (WTW) assessment, which can determine the total energy demand of a bus network. This work also reviews a survey of the most recent optimization methods that could be applied to achieve the optimal pattern parameters of EBB fleet systems, such as the bus battery capacity, charger rated power and the total number of installed chargers in the charging station. This paper highlights the issues and challenges, such as the impact of external factors, replicating real-world data, big data analytics, validity index, and bus routes’ topography, with recommendations on each issue. 
Also, the paper proposes a generic framework based on optimization algorithms, namely artificial neural networks (ANN) and particle swarm optimization (PSO), which will be significant for future development in implementing new energy consumption estimation approaches. Finally, the main findings of this manuscript further our understanding of the determinants that contribute to managing the energy demand of EBB networks.
4
Chong LC, Gandhi G, Lee JM, Yeo WWY, Choi SB. Drug Discovery of Spinal Muscular Atrophy (SMA) from the Computational Perspective: A Comprehensive Review. Int J Mol Sci 2021; 22:8962. [PMID: 34445667 PMCID: PMC8396480 DOI: 10.3390/ijms22168962] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/15/2021] [Accepted: 01/27/2021] [Indexed: 01/02/2023]
Abstract
Spinal muscular atrophy (SMA), one of the leading inherited causes of child mortality, is a rare neuromuscular disease arising from loss-of-function mutations of the survival motor neuron 1 (SMN1) gene, which encodes the SMN protein. When neurons lack the SMN protein, patients suffer from muscle weakness and atrophy and, in severe cases, respiratory failure and death. Several therapeutic approaches show promise in human testing, and three medications have been approved by the U.S. Food and Drug Administration (FDA) to date. Despite the promise of these approved therapies, there are some crucial limitations, one of the most important being cost. The FDA-approved drugs are high-priced and rank among the most expensive treatments in the world. The price is still far from affordable and may be a burden for patients. The growth of biomedical data and the advancement of computational approaches have opened new possibilities for SMA therapeutic development. This article highlights the present status of computationally aided approaches, including in silico drug repurposing, network-driven drug discovery and artificial intelligence (AI)-assisted drug discovery, and discusses future prospects.
Affiliation(s)
- Li Chuin Chong
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Suite 9.2, 9th Floor, Wisma Chase Perdana, Changkat Semantan, Kuala Lumpur 50490, Malaysia
- Gayatri Gandhi
- Perdana University Graduate School of Medicine, Perdana University, Suite 9.2, 9th Floor, Wisma Chase Perdana, Changkat Semantan, Kuala Lumpur 50490, Malaysia
- Jian Ming Lee
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Suite 9.2, 9th Floor, Wisma Chase Perdana, Changkat Semantan, Kuala Lumpur 50490, Malaysia
- Wendy Wai Yeng Yeo
- Perdana University Graduate School of Medicine, Perdana University, Suite 9.2, 9th Floor, Wisma Chase Perdana, Changkat Semantan, Kuala Lumpur 50490, Malaysia
- Sy-Bing Choi
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Suite 9.2, 9th Floor, Wisma Chase Perdana, Changkat Semantan, Kuala Lumpur 50490, Malaysia
5
Idris AB, Idris EB, Ataelmanan AE, Mohamed AEA, Osman Arbab BM, Ibrahim EAM, Hassan MA. First insights into the molecular basis association between promoter polymorphisms of the IL1B gene and Helicobacter pylori infection in the Sudanese population: computational approach. BMC Microbiol 2021; 21:16. [PMID: 33413117 PMCID: PMC7792167 DOI: 10.1186/s12866-020-02072-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/19/2020] [Accepted: 12/15/2020] [Indexed: 12/14/2022]
Abstract
BACKGROUND Helicobacter pylori (H. pylori) infects nearly half of the world's population, with incidence varying among geographic regions. Genetic variants in the promoter regions of the IL1B gene can affect cytokine expression and create a condition of hypoacidity that favors the survival and colonization of H. pylori. Therefore, the aim of this study was to characterize the polymorphic sites in the 5' region [-687_+297] of IL1B in H. pylori infection using in silico tools. RESULTS A total of five nucleotide variations were detected in the 5' regulatory region [-687_+297] of IL1B, which led to the addition or alteration of transcription factor binding sites (TFBSs) or composite regulatory elements (CEs). Genotyping of IL1B -31 C>T revealed a significant association between -31 T and susceptibility to H. pylori infection in the studied population (P = 0.0363). Comparative analysis showed conservation rates of the IL1B upstream [-368_+10] region above 70% in chimpanzee, rhesus monkey, domesticated dog, cow and rat. CONCLUSIONS In H. pylori-infected patients, three detected SNPs (-338, -155 and -31) located in the IL1B promoter were predicted to alter TFBSs and CEs, which might affect gene expression. These in silico predictions provide insight for further experimental in vitro and in vivo studies of the regulation of IL1B expression and its relationship to H. pylori infection. However, the recognition of regulatory motifs by computer algorithms is fundamental for understanding gene expression patterns.
Affiliation(s)
- Abeer Babiker Idris
- Department of Medical Microbiology, Faculty of Medical Laboratory Sciences, University of Khartoum, Khartoum, Sudan
- Einas Babiker Idris
- Medical Laboratory Specialist, Department of Medical Microbiology, Rashid Medical Complex, Riyadh, Saudi Arabia
- Amany Eltayib Ataelmanan
- Department of Medical Microbiology, Faculty of Medical Laboratory Sciences, University of Al-Gazirah, Wad Madani, Sudan
- El-Amin Mohamed Ibrahim
- Department of Medical Microbiology, Faculty of Medical Laboratory Sciences, University of Khartoum, Khartoum, Sudan
- Mohamed A. Hassan
- Department of Bioinformatics, Africa City of Technology, Khartoum, Sudan
- Department of Bioinformatics, DETAGEN Genetic Diagnostics Center, Kayseri, Turkey
- Department of Translation Bioinformatics, Detavax Biotech, Kayseri, Turkey
6
Raman Kumar M, Vaegae NK. A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
7
Chen YL, Guo DH, Li QZ. An energy model for recognizing the prokaryotic promoters based on molecular structure. Genomics 2019; 112:2072-2079. [PMID: 31809797 DOI: 10.1016/j.ygeno.2019.12.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/27/2019] [Revised: 11/06/2019] [Accepted: 12/01/2019] [Indexed: 11/19/2022]
Abstract
Promoters are important functional elements of DNA sequences that are in charge of gene transcription initiation. Recognizing promoters is important for understanding the related life phenomena. Based on the concept that a promoter is mainly determined by its sequence and structure, a novel statistical physics model for predicting promoters in Escherichia coli K-12 is proposed. The total energies of the DNA local structure of sequence segments in three benchmark promoter sequence datasets, the sole prediction parameter, are calculated using principles from statistical physics and information theory. Better results are obtained. A web server, PhysMPrePro, for predicting promoters is available at http://202.207.14.87:8032/bioinformation/PhysMPrePro/index.asp, so that other scientists can easily obtain their desired results.
Affiliation(s)
- Ying-Li Chen
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot 010070, China
- Dong-Hua Guo
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot 010070, China
8
Lenzini L, Di Patti F, Livi R, Fondi M, Fani R, Mengoni A. A Method for the Structure-Based, Genome-Wide Analysis of Bacterial Intergenic Sequences Identifies Shared Compositional and Functional Features. Genes (Basel) 2019; 10:genes10100834. [PMID: 31652625 PMCID: PMC6826451 DOI: 10.3390/genes10100834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2019] [Revised: 10/07/2019] [Accepted: 10/16/2019] [Indexed: 11/16/2022]
Abstract
In this paper, we propose a computational strategy for performing genome-wide analyses of intergenic sequences in bacterial genomes. Following the direction of a previous paper, in which a method for genome-wide analysis of eukaryotic intergenic sequences was proposed, we developed a tool that implements similar concepts for bacterial genomes. This allows us to (i) classify intergenic sequences into clusters characterized by specific global structural features and (ii) draw possible relations with their functional features.
Affiliation(s)
- Leonardo Lenzini
- Dipartimento di Fisica e Astronomia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Istituto Nazionale di Fisica Nucleare, Sesto Fiorentino, 50019, Italy.
- Francesca Di Patti
- Dipartimento di Fisica e Astronomia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Centro Interdipartimentale per lo Studio delle Dinamiche Complesse, Sesto Fiorentino, 50019, Italy.
- Roberto Livi
- Dipartimento di Fisica e Astronomia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Istituto Nazionale di Fisica Nucleare, Sesto Fiorentino, 50019, Italy.
- Centro Interdipartimentale per lo Studio delle Dinamiche Complesse, Sesto Fiorentino, 50019, Italy.
- Istituto dei Sistemi Complessi, Consiglio Nazionale delle Ricerche, Sesto Fiorentino, 50019, Italy.
- Marco Fondi
- Dipartimento di Biologia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Renato Fani
- Istituto dei Sistemi Complessi, Consiglio Nazionale delle Ricerche, Sesto Fiorentino, 50019, Italy.
- Dipartimento di Biologia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Alessio Mengoni
- Dipartimento di Biologia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
9
Coelho RV, Dall'Alba G, de Avila E Silva S, Echeverrigaray S, Delamare APL. Toward Algorithms for Automation of Postgenomic Data Analyses: Bacillus subtilis Promoter Prediction with Artificial Neural Network. OMICS: A Journal of Integrative Biology 2019; 24:300-309. [PMID: 31573385 DOI: 10.1089/omi.2019.0041] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/20/2022]
Abstract
In the present postgenomic era, the capacity to generate big data has far exceeded the capacity to analyze, contextualize, and make sense of the data in clinical, biological, and ecological applications. There is a great unmet need for automation and algorithms to aid in analyses of big data, in biology in particular. In this context, it is noteworthy that computational methods used to analyze the regulation of bacterial gene expression have in the past focused mainly on Escherichia coli promoters because of the large amount of data available. The challenge and prospects of automation in the prediction and recognition of bacterial sequences as promoters have not been properly addressed because of promoter size and degenerate patterns. We report here an original neural network approach for the recognition and prediction of Bacillus subtilis promoters. The artificial neural network used 767 B. subtilis promoter sequences as input, while also aiming to identify the architecture that provides the best prediction. Two multilayer perceptron neural network architectures offered the highest accuracy: one with five and another with seven neurons in the hidden layer. These architectures achieved accuracies of 98.57% and 97.69%, respectively. The results collectively indicate the promise of applying neural network approaches to the B. subtilis promoter recognition problem, while also suggesting the broader potential of algorithms for automation of data analyses in the postgenomic era.
Affiliation(s)
- Rafael Vieira Coelho
- Farroupilha Campus, Rio Grande do Sul Federal Institute of Education, Science and Technology (IFRS), Farroupilha, Brazil
- Gabriel Dall'Alba
- Biotechnology Institute, Caxias do Sul University (UCS), Caxias do Sul, Brazil
10
Yang X, Wang Y, Byrne R, Schneider G, Yang S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem Rev 2019; 119:10520-10594. [PMID: 31294972 DOI: 10.1021/acs.chemrev.8b00728] [Citation(s) in RCA: 413] [Impact Index Per Article: 68.8] [Indexed: 02/05/2023]
Abstract
Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI that have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state of the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.
Affiliation(s)
- Xin Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Yifei Wang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Ryan Byrne
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Gisbert Schneider
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Shengyong Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
11
Lin H, Liang ZY, Tang H, Chen W. Identifying Sigma70 Promoters with Novel Pseudo Nucleotide Composition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1316-1321. [PMID: 28186907 DOI: 10.1109/tcbb.2017.2666141] [Citation(s) in RCA: 108] [Impact Index Per Article: 18.0] [Indexed: 06/06/2023]
Abstract
Promoters are DNA regulatory elements located directly upstream or at the 5' end of the transcription initiation site (TSS), which are in charge of gene transcription initiation. With the completion of a large number of microbial genomes, accurate computational prediction of promoters in bacteria has become an urgent need. In this work, a sequence-based predictor named "iPro70-PseZNC" was designed for identifying sigma70 promoters in prokaryotes. In the predictor, DNA sequence samples are formulated by a novel pseudo nucleotide composition, called PseZNC, into which the multi-window Z-curve composition and six local DNA structural properties are incorporated. In 5-fold cross-validation, an area under the receiver operating characteristic curve of 0.909 was obtained on our benchmark dataset, indicating that the proposed predictor is promising and will provide an important guide in this area. Further studies showed that the performance of PseZNC is better than that of the multi-window Z-curve composition alone. For the convenience of researchers, a user-friendly online service was established and is freely accessible at http://lin.uestc.edu.cn/server/iPro70-PseZNC. The PseZNC approach can also be extended to other DNA-related problems.
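To make the idea of a multi-window Z-curve composition more concrete, here is a minimal Python sketch. It computes the three classical Z-curve components from base frequencies (purine versus pyrimidine, amino versus keto, weak versus strong hydrogen bonding) and concatenates them over non-overlapping windows of several sizes. The function names, the window sizes and the non-overlapping tiling are assumptions for illustration; the sketch does not reproduce PseZNC itself and omits the six local DNA structural properties.

```python
def z_curve(seq):
    """Return the (x, y, z) Z-curve components from the base frequencies of `seq`."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return (0.0, 0.0, 0.0)
    a, c, g, t = (seq.count(base) / n for base in "ACGT")
    x = (a + g) - (c + t)  # purine vs pyrimidine
    y = (a + c) - (g + t)  # amino vs keto
    z = (a + t) - (g + c)  # weak vs strong hydrogen bonding
    return (x, y, z)


def multi_window_z_curve(seq, window_sizes=(4, 8)):
    """Concatenate Z-curve components over non-overlapping windows of each size.

    The window sizes and the tiling scheme are hypothetical choices.
    """
    features = []
    for size in window_sizes:
        for start in range(0, len(seq) - size + 1, size):
            features.extend(z_curve(seq[start:start + size]))
    return features
```

For a fixed-length input, the feature vector has a fixed length too, so it can be passed directly to a standard classifier.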
12
A signal processing method for alignment-free metagenomic binning: multi-resolution genomic binary patterns. Sci Rep 2019; 9:2159. [PMID: 30770850 PMCID: PMC6377666 DOI: 10.1038/s41598-018-38197-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/19/2017] [Accepted: 12/21/2018] [Indexed: 11/08/2022]
Abstract
Algorithms in bioinformatics use textual representations of genetic information, sequences of the characters A, T, G and C represented computationally as strings or sub-strings. Signal and related image processing methods offer a rich source of alternative descriptors as they are designed to work in the presence of noisy data without the need for exact matching. Here we introduce a method, multi-resolution local binary patterns (MLBP) adapted from image processing to extract local ‘texture’ changes from nucleotide sequence data. We apply this feature space to the alignment-free binning of metagenomic data. The effectiveness of MLBP is demonstrated using both simulated and real human gut microbial communities. Sequence reads or contigs can be represented as vectors and their ‘texture’ compared efficiently using machine learning algorithms to perform dimensionality reduction to capture eigengenome information and perform clustering (here using randomized singular value decomposition and BH-tSNE). The intuition behind our method is the MLBP feature vectors permit sequence comparisons without the need for explicit pairwise matching. We demonstrate this approach outperforms existing methods based on k-mer frequencies. The signal processing method, MLBP, thus offers a viable alternative feature space to textual representations of sequence data. The source code for our Multi-resolution Genomic Binary Patterns method can be found at https://github.com/skouchaki/MrGBP.
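The following Python sketch illustrates the general idea of a one-dimensional local binary pattern applied to a nucleotide sequence: each position is compared with its neighbours within a given radius, the comparisons form a binary code, and the normalized histogram of codes over several radii gives a fixed-length feature vector. The integer mapping of nucleotides, the `>=` comparison rule and the chosen radii are assumptions made for illustration and may differ from the published MLBP method, whose reference implementation is in the repository linked above.

```python
MAPPING = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical numeric encoding


def to_signal(seq):
    """Map a DNA string to integers, skipping characters outside A/C/G/T."""
    return [MAPPING[base] for base in seq.upper() if base in MAPPING]


def lbp_codes(signal, radius):
    """Binary code per position: bit k is set if neighbour k >= centre value."""
    codes = []
    for i in range(radius, len(signal) - radius):
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        code = 0
        for bit, value in enumerate(neighbours):
            if value >= signal[i]:
                code |= 1 << bit
        codes.append(code)
    return codes


def lbp_histogram(signal, radius):
    """Normalized histogram of the 2**(2*radius) possible codes."""
    n_bins = 2 ** (2 * radius)
    codes = lbp_codes(signal, radius)
    hist = [0] * n_bins
    for code in codes:
        hist[code] += 1
    total = len(codes)
    return [count / total for count in hist] if total else [0.0] * n_bins


def multi_resolution_lbp(seq, radii=(1, 2)):
    """Concatenate LBP histograms computed at several radii (resolutions)."""
    signal = to_signal(seq)
    features = []
    for radius in radii:
        features.extend(lbp_histogram(signal, radius))
    return features
```

Because every read or contig maps to a vector of the same length regardless of its own length, such vectors can be fed to dimensionality reduction and clustering without pairwise alignment.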
13
Rahman MS, Aktar U, Jani MR, Shatabda S. iPro70-FMWin: identifying Sigma70 promoters using multiple windowing and minimal features. Mol Genet Genomics 2018; 294:69-84. [PMID: 30187132 DOI: 10.1007/s00438-018-1487-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 07/13/2018] [Accepted: 08/29/2018] [Indexed: 01/16/2023]
Abstract
In bacterial DNA, there are specific sequences of nucleotides called promoters that can bind to the RNA polymerase. Sigma70 (σ70) is one of the most important promoter sequences because of its presence in most DNA regulatory functions. In this paper, we identify the most effective and optimal sequence-based features for the prediction of σ70 promoter sequences in a bacterial genome. We used both short-range and long-range DNA sequences in our proposed method. A very small number of effective features are selected from a large number of extracted features using multiple windows of different sizes within the DNA sequences. We call our prediction method iPro70-FMWin and have made it freely accessible online via a web application at http://ipro70.pythonanywhere.com/server for the convenience of researchers. We have tested our method using a standard benchmark dataset. In the experiments, iPro70-FMWin achieved an area under the receiver operating characteristic curve of 0.959 and an accuracy of 90.57%, which significantly outperforms the state-of-the-art predictors.
Affiliation(s)
- Md Siddiqur Rahman
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
- Usma Aktar
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
- Md Rafsan Jani
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka, 1212, Bangladesh
14
Rahman MS, Aktar U, Jani MR, Shatabda S. iPromoter-FSEn: Identification of bacterial σ70 promoter sequences using feature subspace based ensemble classifier. Genomics 2018; 111:1160-1166. [PMID: 30059731 DOI: 10.1016/j.ygeno.2018.07.011] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 04/25/2018] [Revised: 07/07/2018] [Accepted: 07/12/2018] [Indexed: 10/28/2022]
Abstract
Sigma promoter sequences in bacterial genomes are important because of their role in transcription initiation. Sigma 70 is one of the most important and crucial sigma factors. In this paper, we address the problem of identifying σ70 promoter sequences in bacterial genomes. We propose iPromoter-FSEn, a novel predictor for the identification of σ70 promoter sequences. Our proposed method is based on a feature subspace based ensemble classifier. A large set of features extracted from the nucleotide sequence is divided into subsets, and each subset is given to an individual classifier to learn. An aggregate decision is then made from the individual decisions by the ensemble voting classifier. We tested our method on a standard benchmark dataset extracted from experimentally validated results. Experimental results show that iPromoter-FSEn significantly improves over the state-of-the-art σ70 promoter sequence predictors. The accuracy and area under the receiver operating characteristic curve of iPromoter-FSEn are 86.32% and 0.9319, respectively. We have also made our method readily available as a web application at http://ipromoterfsen.pythonanywhere.com/server.
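The feature-subspace ensemble idea described in this abstract can be sketched in a few lines of Python: the feature indices are partitioned into subsets, one base classifier is trained per subset, and the final label is chosen by majority vote. The nearest-centroid base learner, the interleaved partitioning and all class and function names below are stand-ins chosen for illustration; the abstract does not specify which base classifiers or partitioning iPromoter-FSEn uses.

```python
from collections import Counter


def split_features(n_features, n_subsets):
    """Partition feature indices into interleaved subsets (hypothetical scheme)."""
    return [list(range(i, n_features, n_subsets)) for i in range(n_subsets)]


class NearestCentroid:
    """Minimal stand-in base learner: predict the class with the closest mean."""

    def fit(self, X, y):
        counts = Counter(y)
        sums = {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
        self.centroids = {label: [s / counts[label] for s in acc]
                          for label, acc in sums.items()}
        return self

    def predict_one(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


class SubspaceEnsemble:
    """Train one base learner per feature subset and combine by majority vote."""

    def __init__(self, n_subsets=3):
        self.n_subsets = n_subsets

    def fit(self, X, y):
        self.subsets = split_features(len(X[0]), self.n_subsets)
        self.models = [
            NearestCentroid().fit([[row[j] for j in idx] for row in X], y)
            for idx in self.subsets
        ]
        return self

    def predict_one(self, row):
        votes = [model.predict_one([row[j] for j in idx])
                 for model, idx in zip(self.models, self.subsets)]
        return Counter(votes).most_common(1)[0][0]
```

Using an odd number of subsets avoids tied votes in the two-class case (promoter versus non-promoter).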
Affiliation(s)
- Md Siddiqur Rahman
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
- Usma Aktar
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
- Md Rafsan Jani
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
- Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh
15
Yus E, Yang JS, Sogues A, Serrano L. A reporter system coupled with high-throughput sequencing unveils key bacterial transcription and translation determinants. Nat Commun 2017; 8:368. [PMID: 28848232 PMCID: PMC5573727 DOI: 10.1038/s41467-017-00239-7] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 01/18/2017] [Accepted: 06/09/2017] [Indexed: 12/24/2022]
Abstract
Quantitative analysis of the sequence determinants of transcription and translation regulation is relevant for systems and synthetic biology. To identify these determinants, researchers have developed different methods of screening random libraries using fluorescent reporters or antibiotic resistance genes. Here, we have implemented a generic approach called ELM-seq (expression level monitoring by DNA methylation) that overcomes the technical limitations of such classic reporters. ELM-seq uses DamID (Escherichia coli DNA adenine methylase as a reporter coupled with methylation-sensitive restriction enzyme digestion and high-throughput sequencing) to enable in vivo quantitative analyses of upstream regulatory sequences. Using the genome-reduced bacterium Mycoplasma pneumoniae, we show that ELM-seq has a large dynamic range and causes minimal toxicity. We use ELM-seq to determine key sequences (known and putatively novel) of promoter and untranslated regions that influence transcription and translation efficiency. Applying ELM-seq to other organisms will help us to further understand gene expression and guide synthetic biology. Quantitative analysis of how DNA sequence determines transcription and translation regulation is of interest to systems and synthetic biologists. Here the authors present ELM-seq, which uses Dam activity as reporter for high-throughput analysis of promoter and 5’-UTR regions.
Affiliation(s)
- Eva Yus
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Doctor Aiguader 88, Barcelona, 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Jae-Seong Yang
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Doctor Aiguader 88, Barcelona, 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Adrià Sogues
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Doctor Aiguader 88, Barcelona, 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institut Pasteur, Unité de Microbiologie Structurale (CNRS) UMR 3528, Université Paris Diderot, 25 rue du Docteur Roux, Paris, 75724, France
- Luis Serrano
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Doctor Aiguader 88, Barcelona, 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Pg. Lluis Companys 23, Barcelona, 08010, Spain
16
|
Ahmad M, Jung LT, Bhuiyan AA. From DNA to protein: Why genetic code context of nucleotides for DNA signal processing? A review. Biomed Signal Process Control 2017. [DOI: 10.1016/j.bspc.2017.01.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 10/20/2022]
|
17
|
Ahmad M, Jung LT, Bhuiyan MAA. On fuzzy semantic similarity measure for DNA coding. Comput Biol Med 2015; 69:144-51. [PMID: 26773936 DOI: 10.1016/j.compbiomed.2015.12.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/14/2015] [Revised: 12/22/2015] [Accepted: 12/23/2015] [Indexed: 11/28/2022]
Abstract
A coding measure scheme numerically translates the DNA sequence to a time domain signal for protein coding regions identification. A number of coding measure schemes based on numerology, geometry, fixed mapping, statistical characteristics and chemical attributes of nucleotides have been proposed in recent decades. Such coding measure schemes lack the biologically meaningful aspects of nucleotide data and hence do not significantly discriminate coding regions from non-coding regions. This paper presents a novel fuzzy semantic similarity measure (FSSM) coding scheme centering on FSSM codons' clustering and genetic code context of nucleotides. Certain natural characteristics of nucleotides i.e. appearance as a unique combination of triplets, preserving special structure and occurrence, and ability to own and share density distributions in codons have been exploited in FSSM. The nucleotides' fuzzy behaviors, semantic similarities and defuzzification based on the center of gravity of nucleotides revealed a strong correlation between nucleotides in codons. The proposed FSSM coding scheme attains a significant enhancement in coding regions identification i.e. 36-133% as compared to other existing coding measure schemes tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms.
Affiliation(s)
- Muneer Ahmad
- College of Computer Sciences, King Faisal University, Saudi Arabia.
- Low Tang Jung
- Department of Computer Sciences, University Technology PETRONAS, Malaysia.
|
18
|
Beier R, Labudde D. Numeric promoter description - A comparative view on concepts and general application. J Mol Graph Model 2015; 63:65-77. [PMID: 26655334 DOI: 10.1016/j.jmgm.2015.11.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2015] [Revised: 11/12/2015] [Accepted: 11/17/2015] [Indexed: 11/25/2022]
Abstract
Nucleic acid molecules play a key role in a variety of biological processes. Starting from storage and transfer tasks, this also comprises the triggering of biological processes, regulatory effects and the active influence gained by target binding. Based on the experimental output (in this case promoter sequences), further in silico analyses aid in gaining new insights into these processes and interactions. The numerical description of nucleic acids thereby constitutes a bridge between the concrete biological issues and the analytical methods. Hence, this study compares 26 descriptor sets obtained by applying well-known numerical description concepts to an established dataset of 38 DNA promoter sequences. The suitability of the description sets was evaluated by computing partial least squares regression models and assessing the model accuracy. We conclude that the major importance regarding the descriptive power is attached to positional information rather than to explicitly incorporated physico-chemical information, since a sufficient amount of implicit physico-chemical information is already encoded in the nucleobase classification. The regression models especially benefited from employing the information that is encoded in the sequential and structural neighborhood of the nucleobases. Thus, the analyses of n-grams (short fragments of length n) suggested that they are valuable descriptors for DNA target interactions. A mixed n-gram descriptor set thereby yielded the best description of the promoter sequences. The corresponding regression model was checked and found to be plausible as it was able to reproduce the characteristic binding motifs of promoter sequences in a reasonable degree. As most functional nucleic acids are based on the principle of molecular recognition, the findings are not restricted to promoter sequences, but can rather be transferred to other kinds of functional nucleic acids. 
Thus, the concepts presented in this study could provide advantages for future nucleic acid-based technologies, like biosensoring, therapeutics and molecular imaging.
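To make the n-gram idea in this abstract concrete, the following is a minimal sketch of a mixed n-gram descriptor: relative frequencies of all overlapping fragments of length n, for several values of n. The function name `ngram_descriptor`, the choice of n = 1 to 3, and the normalisation by the number of windows are illustrative assumptions, not the descriptor sets evaluated in the study.

```python
from collections import Counter
from itertools import product


def ngram_descriptor(seq, n_values=(1, 2, 3)):
    """Relative frequency of every possible DNA n-gram, for each n in n_values.

    Hypothetical helper for illustration; not the descriptors used by the authors.
    """
    seq = seq.upper()
    features = {}
    for n in n_values:
        n_windows = max(len(seq) - n + 1, 0)
        # Count overlapping fragments of length n.
        counts = Counter(seq[i:i + n] for i in range(n_windows))
        # Emit a fixed-order feature for every possible n-gram, including absent ones.
        for gram in map("".join, product("ACGT", repeat=n)):
            features[gram] = counts.get(gram, 0) / n_windows if n_windows else 0.0
    return features


if __name__ == "__main__":
    descriptor = ngram_descriptor("ACGTAC", n_values=(1, 2))
    print(len(descriptor), descriptor["AC"])
```

Because every possible n-gram gets an entry, sequences of different lengths map to vectors of the same size, which is what a regression model such as partial least squares needs as input.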
Affiliation(s)
- Rico Beier
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.
- Dirk Labudde
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.
|
19
|
Abbas MM, Mohie-Eldin MM, EL-Manzalawy Y. Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One 2015; 10:e0119721. [PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/12/2014] [Accepted: 01/26/2015] [Indexed: 11/27/2022]
Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of the promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well on cross-validation experiments while its performance on independent test data is very poor. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments. 
Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
Affiliation(s)
- Mostafa M. Abbas
- KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar
- Yasser EL-Manzalawy
- Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt
- College of Information Sciences, Penn State University, University Park, United States of America
|
20
|
Lloréns-Rico V, Lluch-Senar M, Serrano L. Distinguishing between productive and abortive promoters using a random forest classifier in Mycoplasma pneumoniae. Nucleic Acids Res 2015; 43:3442-53. [PMID: 25779052 PMCID: PMC4402517 DOI: 10.1093/nar/gkv170] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 11/26/2014] [Accepted: 02/22/2015] [Indexed: 12/01/2022]
Abstract
Distinguishing between promoter-like sequences in bacteria that belong to true or abortive promoters, or to those that do not initiate transcription at all, is one of the important challenges in transcriptomics. To address this problem, we have studied the genome-reduced bacterium Mycoplasma pneumoniae, for which the RNAs associated with transcriptional start sites have been recently experimentally identified. We determined the contribution to transcription events of different genomic features: the –10, extended –10 and –35 boxes, the UP element, the bases surrounding the –10 box and the nearest-neighbor free energy of the promoter region. Using a random forest classifier and the aforementioned features transformed into scores, we could distinguish between true, abortive promoters and non-promoters with good –10 box sequences. The methods used in this characterization of promoters can be extended to other bacteria and have important applications for promoter design in bacterial genome engineering.
Affiliation(s)
- Verónica Lloréns-Rico
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation (CRG), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Dr Aiguader 88, 08003 Barcelona, Spain
- Maria Lluch-Senar
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation (CRG), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Dr Aiguader 88, 08003 Barcelona, Spain
- Luis Serrano
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation (CRG), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Dr Aiguader 88, 08003 Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Pg. Lluis Companys 23, 08010 Barcelona, Spain
|
21
|
Borrayo E, Mendizabal-Ruiz EG, Vélez-Pérez H, Romo-Vázquez R, Mendizabal AP, Morales JA. Genomic signal processing methods for computation of alignment-free distances from DNA sequences. PLoS One 2014; 9:e110954. [PMID: 25393409 PMCID: PMC4230918 DOI: 10.1371/journal.pone.0110954] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 08/30/2012] [Accepted: 09/26/2014] [Indexed: 11/19/2022]
Abstract
Genomic signal processing (GSP) refers to the use of digital signal processing (DSP) tools for analyzing genomic data such as DNA sequences. A possible application of GSP that has not been fully explored is the computation of the distance between a pair of sequences. In this work we present GAFD, a novel GSP alignment-free distance computation method. We introduce a DNA sequence-to-signal mapping function based on the employment of doublet values, which increases the number of possible amplitude values for the generated signal. Additionally, we explore the use of three DSP distance metrics as descriptors for categorizing DNA signal fragments. Our results indicate the feasibility of employing GAFD for computing sequence distances and the use of descriptors for characterizing DNA fragments.
Affiliation(s)
- Ernesto Borrayo
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
- Hugo Vélez-Pérez
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
- Rebeca Romo-Vázquez
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
- Adriana P. Mendizabal
- Molecular Biology Laboratory, Farmacobiology Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
- J. Alejandro Morales
- Computer Sciences Department, CUCEI - Universidad de Guadalajara, Guadalajara, México
- Center for Theoretical Research and High Performance Computing, CUCEI - Universidad de Guadalajara, Guadalajara, México
|
22
|
Lin H, Deng EZ, Ding H, Chen W, Chou KC. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res 2014; 42:12961-72. [PMID: 25361964 PMCID: PMC4245931 DOI: 10.1093/nar/gku1019] [Citation(s) in RCA: 413] [Impact Index Per Article: 37.5] [Indexed: 11/14/2022]
Abstract
The σ54 promoters are unique in prokaryotic genomes and responsible for transcribing carbon and nitrogen-related genes. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapidly and effectively identifying the σ54 promoters. Here, a predictor called ‘iPro54-PseKNC’ was developed. In the predictor, the samples of DNA sequences were formulated by a novel feature vector called ‘pseudo k-tuple nucleotide composition’, which was further optimized by the incremental feature selection procedure. The performance of iPro54-PseKNC was examined by the rigorous jackknife cross-validation tests on a stringent benchmark data set. As a user-friendly web-server, iPro54-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iPro54-PseKNC. For the convenience of the vast majority of experimental scientists, a step-by-step protocol guide was provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented in this paper just for its integrity. Meanwhile, we also discovered through an in-depth statistical analysis that the distribution of distances between the transcription start sites and the translation initiation sites was governed by the gamma distribution, which may provide a fundamental physical principle for studying the σ54 promoters.
Affiliation(s)
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Belmont, MA, USA
- En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China; Gordon Life Science Institute, Belmont, MA, USA
- Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA, USA; Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
|
23
|
Genome-wide analysis of promoters: clustering by alignment and analysis of regular patterns. PLoS One 2014; 9:e85260. [PMID: 24465517 PMCID: PMC3898993 DOI: 10.1371/journal.pone.0085260] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 07/19/2013] [Accepted: 11/26/2013] [Indexed: 01/08/2023]
Abstract
In this paper we perform a genome-wide analysis of H. sapiens promoters. To this aim, we developed and combined two mathematical methods that allow us to (i) classify promoters into groups characterized by specific global structural features, and (ii) recover, in full generality, any regular sequence in the different classes of promoters. One of the main findings of this analysis is that H. sapiens promoters can be classified into three main groups. Two of them are distinguished by the prevalence of weak or strong nucleotides and are characterized by short compositionally biased sequences, while the most frequent regular sequences in the third group are strongly correlated with transposons. Taking advantage of the generality of these mathematical procedures, we have compared the promoter database of H. sapiens with those of other species. We have found that the above-mentioned features characterize also the evolutionary content appearing in mammalian promoters, at variance with ancestral species in the phylogenetic tree, that exhibit a definitely lower level of differentiation among promoters.
|
24
|
Meng H, Wang J, Xiong Z, Xu F, Zhao G, Wang Y. Quantitative design of regulatory elements based on high-precision strength prediction using artificial neural network. PLoS One 2013; 8:e60288. [PMID: 23560087 PMCID: PMC3613377 DOI: 10.1371/journal.pone.0060288] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 01/16/2013] [Accepted: 02/25/2013] [Indexed: 01/31/2023]
Abstract
Accurate and controllable regulatory elements such as promoters and ribosome binding sites (RBSs) are indispensable tools to quantitatively regulate gene expression for rational pathway engineering. Therefore, de novo designing regulatory elements is brought back to the forefront of synthetic biology research. Here we developed a quantitative design method for regulatory elements based on strength prediction using artificial neural network (ANN). One hundred mutated Trc promoter & RBS sequences, which were finely characterized with a strength distribution from 0 to 3.559 (relative to the strength of the original sequence which was defined as 1), were used for model training and test. A precise strength prediction model, NET90_19_576, was finally constructed with high regression correlation coefficients of 0.98 for both model training and test. Sixteen artificial elements were in silico designed using this model. All of them were proved to have good consistency between the measured strength and our desired strength. The functional reliability of the designed elements was validated in two different genetic contexts. The designed parts were successfully utilized to improve the expression of BmK1 peptide toxin and fine-tune deoxy-xylulose phosphate pathway in Escherichia coli. Our results demonstrate that the methodology based on ANN model can de novo and quantitatively design regulatory elements with desired strengths, which are of great importance for synthetic biology applications.
Affiliation(s)
- Hailin Meng
- Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Jianfeng Wang
- Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
- Zhiqiang Xiong
- Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Feng Xu
- Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Guoping Zhao
- Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Yong Wang
- Key Laboratory of Synthetic Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
|
25
|
Fine tuning the transcription of ldhA for d-lactate production. J Ind Microbiol Biotechnol 2012; 39:1209-17. [DOI: 10.1007/s10295-012-1116-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/10/2011] [Accepted: 02/27/2012] [Indexed: 11/25/2022]
Abstract
Fine tuning of the key enzymes to moderate rather than high expression levels could overproduce the desired metabolic products without inhibiting cell growth. The aims of this investigation were to regulate rates of lactate production and cell growth in recombinant Escherichia coli through promoter engineering and to evaluate the transcriptional function of the upstream region of ldhA (encoding fermentative lactate dehydrogenase in E. coli). Twelve ldhA genes with sequentially shortened chromosomal upstream regions were cloned in an ldhA deletion, E. coli CICIM B0013-080C (ack-pta pps pflB dld poxB adhE frdA ldhA). The varied ldhA upstream regions were further analyzed using program NNPP2.2 (Neural Network Promoter Prediction 2.2) to predict the possible promoter regions. Two-phase fermentations (aerobic growth and oxygen-limited production) of these strains showed that shortening the ldhA upstream sequence from 291 to 106 bp successively reduced aerobic lactate synthesis and the inhibition effect on cell growth during the first phase. Simultaneously, oxygen-limited lactate productivity was increased during the second phase. The putative promoter downstream of the −96 site of ldhA could function as a transcriptional promoter or regulator. B0013-080C/pTH-rrnB-ldhA8, with the 72-bp upstream segment of ldhA, could be grown at a high rate and achieve a high oxygen-limited lactate productivity of 1.09 g g−1 h−1. No transcriptional promoting region was apparent downstream of the −61 site of ldhA. We identified the latent transcription regions in the ldhA upstream sequence, which will help to understand regulation of ldhA expression.
|
26
|
de Avila e Silva S, Echeverrigaray S, Gerhardt GJ. BacPP: Bacterial promoter prediction—A tool for accurate sigma-factor specific assignment in enterobacteria. J Theor Biol 2011; 287:92-9. [DOI: 10.1016/j.jtbi.2011.07.017] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Received: 10/29/2010] [Revised: 05/20/2011] [Accepted: 07/21/2011] [Indexed: 10/17/2022]
|
27
|
de Avila E Silva S, Gerhardt GJL, Echeverrigaray S. Rules extraction from neural networks applied to the prediction and recognition of prokaryotic promoters. Genet Mol Biol 2011; 34:353-60. [PMID: 21734842 PMCID: PMC3115335 DOI: 10.1590/s1415-47572011000200031] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 03/26/2010] [Accepted: 01/11/2011] [Indexed: 11/21/2022]
Abstract
Promoters are DNA sequences located upstream of the gene region and play a central role in gene expression. Computational techniques show good accuracy in gene prediction but are less successful in predicting promoters, primarily because of the high number of false positives that reflect characteristics of the promoter sequences. Many machine learning methods have been used to address this issue. Neural Networks (NN) have been successfully used in this field because of their ability to recognize imprecise and incomplete patterns characteristic of promoter sequences. In this paper, NN was used to predict and recognize promoter sequences in two data sets: (i) one based on nucleotide sequence information and (ii) another based on stability sequence information. The accuracy was approximately 80% for simulation (i) and 68% for simulation (ii). In the rules extracted, biological consensus motifs were important parts of the NN learning process in both simulations.
Affiliation(s)
- Scheila de Avila E Silva
- Programa de Pós-Graduação em Biotecnologia, Universidade de Caxias do Sul, Caxias do Sul, RS, Brazil
|
28
|
Abstract
Both supervised and unsupervised neural networks have been applied to the prediction of protein structure and function. Here, we focus on feedforward neural networks and describe how these learning machines can be applied to protein prediction. We discuss how to select an appropriate data set, how to choose and encode protein features into the neural network input, and how to assess the predictor's performance.
Affiliation(s)
- Marco Punta
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
|
29
|
Askary A, Masoudi-Nejad A, Sharafi R, Mizbani A, Parizi SN, Purmasjedi M. N4: A precise and highly sensitive promoter predictor using neural network fed by nearest neighbors. Genes Genet Syst 2009; 84:425-30. [DOI: 10.1266/ggs.84.425] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022]
Affiliation(s)
- Amjad Askary
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics and COE in Biomathematics, University of Tehran
- Department of Biotechnology, College of Science, University of Tehran
- Ali Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics and COE in Biomathematics, University of Tehran
- Roozbeh Sharafi
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics and COE in Biomathematics, University of Tehran
- Amir Mizbani
- Department of Biotechnology, College of Science, University of Tehran
- Malihe Purmasjedi
- Department of Biotechnology, College of Science, University of Tehran
|
30
|
Platt M, Rowe W, Knowles J, Day PJ, Kell DB. Analysis of aptamer sequence activity relationships. Integr Biol (Camb) 2008; 1:116-22. [PMID: 20023798 DOI: 10.1039/b814892a] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 01/03/2023]
Abstract
DNA sequences that can bind selectively and specifically to target molecules are known as aptamers. Normally such binding analyses are performed using soluble aptamers. However, there is much to be gained by using an on-chip or microarray format, where a large number of aptameric DNA sequences can be interrogated simultaneously. To calibrate the system, known thrombin binding aptamers (TBAs) have been mutated systematically, producing large populations that allow exploration of key structural aspects of the overall binding motif. The ability to discriminate between background noise and low affinity binding aptamers can be problematic on arrays, and we use the mutated sequences to establish appropriate experimental conditions and their limitations for two commonly used fluorescence-based detection methods. Having optimized experimental conditions, high-density oligonucleotide microarrays were used to explore the entire loop-sequence-functionality relationship creating a detailed model based on over 40 000 analyses, describing key features for quadruplex-forming sequences.
Affiliation(s)
- Mark Platt
- Manchester Interdisciplinary Biocentre, The University of Manchester, UK
|
31
|
Li Y, Zhu Y, Liu Y, Shu Y, Meng F, Lu Y, Bai X, Liu B, Guo D. Genome-wide identification of osmotic stress response gene in Arabidopsis thaliana. Genomics 2008; 92:488-93. [PMID: 18804526 DOI: 10.1016/j.ygeno.2008.08.011] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 05/05/2008] [Revised: 08/14/2008] [Accepted: 08/18/2008] [Indexed: 11/18/2022]
Abstract
In this paper, we present a cis-regulatory element based computational approach to genome-wide identification of genes putatively responding to various osmotic stresses in Arabidopsis thaliana. The rationale of our method is that gene expression is largely controlled at the transcriptional level through the interactions between transcription factors and cis-regulatory elements. Using cis-regulatory motifs known to regulate osmotic stress response, we therefore built an artificial neural network model to identify other functionally relevant genes involved in the same process. We performed Gene Ontology enrichment analysis on the 500 top-scoring predictions and found that, except for un-annotated ORFs (approximately 40%), 91.3% of the enriched GO classification was related to stress response and ABA response. Publicly available gene expression profiling data of Arabidopsis under various stresses were used for cross validation. We also conducted RT-PCR analysis to experimentally verify selected predictions. According to our results, transcript levels of 27 out of 41 top-ranked genes (65.8%) altered under various osmotic stress treatments. We believe that a similar approach could be extensively adopted elsewhere to infer gene function in various cellular processes from different species.
Affiliation(s)
- Yong Li
- Plant Bioengineering Laboratory, Northeast Agricultural University, Harbin, China
|
32
|
Banerjee AK, Kiran K, Murty USN, Venkateswarlu C. Classification and identification of mosquito species using artificial neural networks. Comput Biol Chem 2008; 32:442-7. [PMID: 18838305 DOI: 10.1016/j.compbiolchem.2008.07.020] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 08/07/2007] [Revised: 07/10/2008] [Accepted: 07/10/2008] [Indexed: 11/30/2022]
Abstract
An artificial neural network method is presented for classification and identification of Anopheles mosquito species based on the internal transcribed spacer2 (ITS2) data of ribosomal DNA string. The method is implemented in two different multi-layered feed-forward neural network model forms, namely, multi-input single-output neural network (MISONN) and multi-input multi-output neural network (MIMONN). A number of data sequences in varying sizes of different Anopheline malarial vectors and their corresponding species coding are employed to develop the neural network models. The classification efficiency of the network models for untrained data sequences is evaluated in terms of quantitative performance criteria. The results demonstrate the efficiency of the neural network models to extract the genetic information in ITS2 sequences and to adapt to new data. The method of MISONN is found to exhibit superior performance over MIMONN in distinguishing and identification of the mosquito vectors.
Affiliation(s)
- Amit Kumar Banerjee
- Bioinformatics Group, Biology Division, Indian Institute of Chemical Technology, Andhra Pradesh, India
|
33
|
Dekhtyar M, Morin A, Sakanyan V. Triad pattern algorithm for predicting strong promoter candidates in bacterial genomes. BMC Bioinformatics 2008; 9:233. [PMID: 18471287 PMCID: PMC2412878 DOI: 10.1186/1471-2105-9-233] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2007] [Accepted: 05/09/2008] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Bacterial promoters, which increase the efficiency of gene expression, differ from other promoters by several characteristics. This difference, not yet widely exploited in bioinformatics, looks promising for the development of relevant computational tools to search for strong promoters in bacterial genomes. RESULTS We describe a new triad pattern algorithm that predicts strong promoter candidates in annotated bacterial genomes by matching specific patterns for the group I sigma70 factors of Escherichia coli RNA polymerase. It detects promoter-specific motifs by consecutively matching three patterns, consisting of an UP-element, required for interaction with the alpha subunit, and then optimally-separated patterns of -35 and -10 boxes, required for interaction with the sigma70 subunit of RNA polymerase. Analysis of 43 bacterial genomes revealed that the frequency of candidate sequences depends on the A+T content of the DNA under examination. The accuracy of in silico prediction was experimentally validated for the genome of a hyperthermophilic bacterium, Thermotoga maritima, by applying a cell-free expression assay using the predicted strong promoters. In this organism, the strong promoters govern genes for translation, energy metabolism, transport, cell movement, and other as-yet unidentified functions. CONCLUSION The triad pattern algorithm developed for predicting strong bacterial promoters is well suited for analyzing bacterial genomes with an A+T content of less than 62%. This computational tool opens new prospects for investigating global gene expression, and individual strong promoters in bacteria of medical and/or economic significance.
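The consecutive three-pattern matching described in this abstract can be illustrated with a minimal sketch. It is not the authors' triad pattern algorithm: the hexamers `TTGACA` and `TATAAT` are the textbook sigma70 consensus boxes, and the A+T-rich window used as a stand-in for the UP-element, the 16 to 18 bp spacer range, the mismatch limit, and the function name `find_triads` are all assumptions made for illustration.

```python
# Assumed textbook sigma70 consensus hexamers (not taken from the abstract).
BOX_35 = "TTGACA"
BOX_10 = "TATAAT"


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def at_fraction(s: str) -> float:
    """Fraction of A/T bases in s (0.0 for an empty string)."""
    return sum(c in "AT" for c in s) / len(s) if s else 0.0


def find_triads(seq, max_mm=1, spacer=(16, 18), up_len=20, up_min_at=0.7):
    """Return (pos_35, pos_10) pairs where all three patterns match in order.

    Hypothetical parameters: an A+T-rich window of up_len bases stands in
    for the UP-element; spacer is the allowed gap between the two boxes.
    """
    seq = seq.upper()
    hits = []
    for i in range(up_len, len(seq) - 6 + 1):
        # Pattern 1: A+T-rich stretch directly upstream of the -35 box.
        if at_fraction(seq[i - up_len:i]) < up_min_at:
            continue
        # Pattern 2: -35 box with at most max_mm mismatches.
        if mismatches(seq[i:i + 6], BOX_35) > max_mm:
            continue
        # Pattern 3: -10 box at an allowed spacer distance.
        for gap in range(spacer[0], spacer[1] + 1):
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            if mismatches(seq[j:j + 6], BOX_10) <= max_mm:
                hits.append((i, j))
    return hits
```

Checking the patterns in sequence means the cheap A+T test filters most windows before any box comparison is made.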
Affiliation(s)
- Amelie Morin
- Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
- Vehary Sakanyan
- Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
- ProtNeteomix, 2 rue de la Houssinière, 44322 Nantes, France
|
34
|
Zhang AB, Sikes DS, Muster C, Li SQ. Inferring Species Membership Using DNA Sequences with Back-Propagation Neural Networks. Syst Biol 2008; 57:202-15. [DOI: 10.1080/10635150802032982] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022] Open
Affiliation(s)
- A. B. Zhang
- Institute of Zoology, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Current Address: Albanova University Center, Royal Institute of Biotechnology, SE-106 91 Stockholm, Sweden
- D. S. Sikes
- University of Alaska Museum, 907 Yukon Drive, Fairbanks, Alaska 99775-6960, USA
- C. Muster
- Molecular Evolution and Animal Systematics, University of Leipzig, Talstrasse 33, D-04103 Leipzig, Germany
- S. Q. Li
- Institute of Zoology, Chinese Academy of Sciences, Beijing 100080, P. R. China
35
|
Liang G, Li Z. Scores of generalized base properties for quantitative sequence-activity modelings for E. coli promoters based on support vector machine. J Mol Graph Model 2007; 26:269-81. [PMID: 17291800 DOI: 10.1016/j.jmgm.2006.12.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2006] [Revised: 11/18/2006] [Accepted: 12/10/2006] [Indexed: 10/23/2022]
Abstract
A novel base sequence representation technique, namely SGBP (scores of generalized base properties), was derived from principal component analysis of a matrix of 1209 property parameters including 0D, 1D, 2D and 3D information for five bases such as A, C, G, T and U. It was then employed to represent sequence structures of E. coli promoters. Variables which were used as inputs of partial least square (PLS) and support vector machine (SVM) were selected by genetic arithmetic-partial least square. All samples were divided into train set which was applied to develop quantitative sequence-activity modelings (QSAMs) and test set which was used to validate the predictive power of the resulting models according to D-optimal design. Investigation on QSAM by PLS showed properties of base of position -42, -34, -31, -33, -41, -46 and -29 may yield more influence on strengths, which has thus pointed us further into the direction of strong promoters. Parameters of SVM were determined by response surface methodology. Satisfactory results indicated that the simulative and the predictive abilities for the internal and external samples of QSAM by SVM were better than those of PLS. Those results showed that SGBP is a useful structural representation methodology in QSAMs due to its many advantages including plentiful structural information, easy manipulation, and high characterization competence. Moreover, SGBP-GA-SVM route for sequences design and activities prediction of DNA or RNA can further be applied.
Affiliation(s)
- Guizhao Liang
- College of Bioengineering, Chongqing University, Chongqing 400030, PR China
|
36
|
González-Díaz H, Agüero-Chapin G, Varona J, Molina R, Delogu G, Santana L, Uriarte E, Podda G. 2D-RNA-coupling numbers: A new computational chemistry approach to link secondary structure topology with biological function. J Comput Chem 2007; 28:1049-56. [PMID: 17279496 DOI: 10.1002/jcc.20576] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Methods for prediction of proteins, DNA, or RNA function and mapping it onto sequence often rely on bioinformatics alignment approach instead of chemical structure. Consequently, it is interesting to develop computational chemistry approaches based on molecular descriptors. In this sense, many researchers used sequence-coupling numbers and our group extended them to 2D proteins representations. However, no coupling numbers have been reported for 2D-RNA topology graphs, which are highly branched and contain useful information. Here, we use a computational chemistry scheme: (a) transforming sequences into RNA secondary structures, (b) defining and calculating new 2D-RNA-coupling numbers, (c) seeking a structure-function model, and (d) mapping biological function onto the folded RNA. We studied as example 1-aminocyclopropane-1-carboxylic acid (ACC) oxidases known as ACO, which control fruit ripening having importance for biotechnology industry. First, we calculated tau(k)(2D-RNA) values to a set of 90 folded RNAs, including 28 transcripts of ACO and control sequences. Afterwards, we compared the classification performance of 10 different classifiers implemented in the software WEKA. In particular, the logistic equation ACO = 23.8 · tau(1)(2D-RNA) + 41.4 predicts ACOs with 98.9%, 98.0%, and 97.8% of accuracy in training, leave-one-out and 10-fold cross-validation, respectively. Afterwards, with this equation we predict ACO function to a sequence isolated in this work from Coffea arabica (GenBank accession DQ218452). The tau(1)(2D-RNA) also favorably compares with other descriptors. This equation allows us to map the codification of ACO activity on different mRNA topology features. The present computational-chemistry approach is general and could be extended to connect RNA secondary structure topology to other functions.
Affiliation(s)
- Humberto González-Díaz
- Department of Organic Chemistry, University of Santiago de Compostela, Santiago de Compostela 15782, Spain.
|
37
|
Mann S, Li J, Chen YPP. A pHMM-ANN based discriminative approach to promoter identification in prokaryote genomic contexts. Nucleic Acids Res 2006; 35:e12. [PMID: 17170007 PMCID: PMC1802591 DOI: 10.1093/nar/gkl1024] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2006] [Revised: 10/25/2006] [Accepted: 11/14/2006] [Indexed: 11/14/2022] Open
Abstract
The computational approach for identifying promoters on increasingly large genomic sequences has led to many false positives. The biological significance of promoter identification lies in the ability to locate true promoters with and without prior sequence contextual knowledge. Prior approaches to promoter modelling have involved artificial neural networks (ANNs) or hidden Markov models (HMMs), each producing adequate results on small scale identification tasks, i.e. narrow upstream regions. In this work, we present an architecture to support prokaryote promoter identification on large scale genomic sequences, i.e. not limited to narrow upstream regions. The significant contribution involved the hybrid formed via aggregation of the profile HMM with the ANN, via Viterbi scoring optimizations. The benefit obtained using this architecture includes the modelling ability of the profile HMM with the ability of the ANN to associate elements composing the promoter. We present the high effectiveness of the hybrid approach in comparison to profile HMMs and ANNs when used separately. The contribution of Viterbi optimizations is also highlighted for supporting the hybrid architecture in which gains in sensitivity (+0.3), specificity (+0.65) and precision (+0.54) are achieved over existing approaches.
Affiliation(s)
- Scott Mann
- School of Engineering and Information Technology, Deakin University, Victoria, Australia
- Jinyan Li
- Institute for Infocomm Research, Singapore 119613
- Yi-Ping Phoebe Chen
- School of Engineering and Information Technology, Deakin University, Victoria, Australia
- Australian Research Council Centre in Bioinformatics, Melbourne, Australia
|
38
|
Xie X, Wu S, Lam KM, Yan H. PromoterExplorer: an effective promoter identification method based on the AdaBoost algorithm. Bioinformatics 2006; 22:2722-8. [PMID: 17000749 DOI: 10.1093/bioinformatics/btl482] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Promoter prediction is important for the analysis of gene regulations. Although a number of promoter prediction algorithms have been reported in literature, significant improvement in prediction accuracy remains a challenge. In this paper, an effective promoter identification algorithm, which is called PromoterExplorer, is proposed. In our approach, we analyze the different roles of various features, that is, local distribution of pentamers, positional CpG island features and digitized DNA sequence, and then combine them to build a high-dimensional input vector. A cascade AdaBoost-based learning procedure is adopted to select the most 'informative' or 'discriminating' features to build a sequence of weak classifiers, which are combined to form a strong classifier so as to achieve a better performance. The cascade structure used for identification can also reduce the false positive. RESULTS PromoterExplorer is tested based on large-scale DNA sequences from different databases, including the EPD, DBTSS, GenBank and human chromosome 22. Experimental results show that consistent and promising performance can be achieved.
Affiliation(s)
- Xudong Xie
- Department of Electronic Engineering, City University of Hong Kong, Hong Kong
|
39
|
Li QZ, Lin H. The recognition and prediction of σ70 promoters in Escherichia coli K-12. J Theor Biol 2006; 242:135-41. [PMID: 16603195 DOI: 10.1016/j.jtbi.2006.02.007] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2005] [Revised: 02/05/2006] [Accepted: 02/10/2006] [Indexed: 10/24/2022]
Abstract
Based on a conservation analysis of the 683 latest experimentally verified σ70 promoter sequences of Escherichia coli K-12, it is found that conserved hexamer segments at different sites play a key role in promoter regions, and a novel position-correlation scoring matrix (PCSM) algorithm for predicting σ70 promoters is presented. The predictive capacity of the algorithm is tested by a 10-fold cross-validation test. The results show that the overall prediction accuracy (sensitivity) and specificity are 91% and 81%, respectively. With the 683 experimentally verified σ70 promoters as the training set, the complete E. coli K-12 sequence (4,639,221 bp) was searched; the results show that 100% of the 683 experimentally verified σ70 promoters were identified and some possible promoters were predicted.
Affiliation(s)
- Qian-Zhong Li
- Department of Physics, Laboratory of Theoretical Biophysics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, China.
|
40
|
Gunewardena S, Jeavons P, Zhang Z. Enhancing the prediction of transcription factor binding sites by incorporating structural properties and nucleotide covariations. J Comput Biol 2006; 13:929-45. [PMID: 16761919 DOI: 10.1089/cmb.2006.13.929] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A problem faced by many algorithms for finding transcription factor (TF) binding sites is the high number of false positive hits that result with the increased sensitivity of their prediction. A main contributing factor to this is the short and degenerate nature of these sites which results in a low signal-to-noise ratio. In order to counter this problem, one needs to look beyond the assumption that individual bases of a TF binding site act independently from each other when binding to a transcription factor. In this paper, we present a new method based on templates, designed to exploit the discriminatory features, nucleotide polymorphism, and structural homology present in TF binding sites for distinguishing them from nonbinding sites.
Affiliation(s)
- Sumedha Gunewardena
- Banting and Best Department of Medical Research, Donnelly CCBR, University of Toronto, Ontario, Canada.
|
41
|
Rhodius VA, Suh WC, Nonaka G, West J, Gross CA. Conserved and variable functions of the sigmaE stress response in related genomes. PLoS Biol 2006; 4:e2. [PMID: 16336047 PMCID: PMC1312014 DOI: 10.1371/journal.pbio.0040002] [Citation(s) in RCA: 415] [Impact Index Per Article: 21.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2005] [Accepted: 10/13/2005] [Indexed: 11/19/2022] Open
Abstract
Bacteria often cope with environmental stress by inducing alternative sigma (σ) factors, which direct RNA polymerase to specific promoters, thereby inducing a set of genes called a regulon to combat the stress. To understand the conserved and organism-specific functions of each σ, it is necessary to be able to predict their promoters, so that their regulons can be followed across species. However, the variability of promoter sequences and motif spacing makes their prediction difficult. We developed and validated an accurate promoter prediction model for Escherichia coli σE, which enabled us to predict a total of 89 unique σE-controlled transcription units in E. coli K-12 and eight related genomes. σE controls the envelope stress response in E. coli K-12. The portion of the regulon conserved across genomes is functionally coherent, ensuring the synthesis, assembly, and homeostasis of lipopolysaccharide and outer membrane porins, the key constituents of the outer membrane of Gram-negative bacteria. The larger variable portion is predicted to perform pathogenesis-associated functions, suggesting that σE provides organism-specific functions necessary for optimal host interaction. The success of our promoter prediction model for σE suggests that it will be applicable for the prediction of promoter elements for many alternative σ factors. A model for predicting the variable promoter sequences associated with the bacterial stress response is developed and used to identify constituents of the transcriptional response to σE.
Affiliation(s)
- Virgil A Rhodius
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Won Chul Suh
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Gen Nonaka
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Joyce West
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Carol A Gross
- Department of Microbiology and Immunology, University of California, San Francisco, California, United States of America
- Department of Cell and Tissue Biology, University of California, San Francisco, California, United States of America
|
42
|
Zhang F, Kuo MD, Brunkhors A. E. coli promoter prediction using feed-forward neural networks. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2006; 2006:2025-2027. [PMID: 17946085 DOI: 10.1109/iembs.2006.260365] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
E. coli promoter recognition is an area of great interest in bioinformatics. In this paper, we describe the implementation of a feed forward neural network to predict the E. coli promoter. According to the sequence conservation, some sequences with 60 bases are selected as positive samples and some corresponding non-promoters from E. coli coding areas are selected as negative samples, and a classifier based on feed forward neural network is trained. Results show that feed forward neural networks can extract the statistical characteristics of promoters more effectively, and that coding with four dimensions for nucleic acid data is superior to two dimensions. Another result demonstrated here is that the number of hidden layers seems to have no significant effect on E. coli promoter prediction precision. The research results in this paper can provide reference for promoter recognition research.
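The comparison between four-dimensional and two-dimensional coding of nucleotides can be made concrete with a small sketch. The four-dimensional table below is ordinary one-hot encoding; the two-dimensional table is a hypothetical stand-in, since the abstract does not say which two-dimensional scheme was used. The names `ONE_HOT`, `TWO_BIT`, and `encode` are illustrative only.

```python
# Four dimensions per base: standard one-hot encoding.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

# Two dimensions per base: a hypothetical binary code, assumed here only
# for comparison; the abstract does not specify the scheme.
TWO_BIT = {
    "A": (0, 0),
    "C": (0, 1),
    "G": (1, 0),
    "T": (1, 1),
}


def encode(seq, table):
    """Flatten a DNA string into a list of numbers using a per-base table."""
    out = []
    for base in seq.upper():
        out.extend(table[base])
    return out
```

With these tables, a 60-base sample becomes a 240-element input vector under one-hot coding and a 120-element vector under the two-bit code. One-hot coding keeps every pair of distinct bases equally far apart, whereas a two-bit code makes some pairs appear more similar than others.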
Affiliation(s)
- Fan Zhang
- Dept. of Radiol., California Univ., San Diego, CA, USA
|
43
|
Lewin A, Mayer M, Chusainow J, Jacob D, Appel B. Viral promoters can initiate expression of toxin genes introduced into Escherichia coli. BMC Biotechnol 2005; 5:19. [PMID: 15967027 PMCID: PMC1181807 DOI: 10.1186/1472-6750-5-19] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2005] [Accepted: 06/20/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The expression of recombinant proteins in eukaryotic cells requires the fusion of the coding region to a promoter functional in the eukaryotic cell line. Viral promoters are very often used for this purpose. The preceding cloning procedures are usually performed in Escherichia coli and it is therefore of interest if the foreign promoter results in an expression of the gene in bacteria. In the case molecules toxic for humans are to be expressed, this knowledge is indispensable for the specification of safety measures. RESULTS We selected five frequently used viral promoters and quantified their activity in E. coli with a reporter system. Only the promoter from the thymidine kinase gene from HSV1 showed no activity, while the polyhedrin promoter from baculovirus, the early immediate CMV promoter, the early SV40 promoter and the 5' LTR promoter from HIV-1 directed gene expression in E. coli. The determination of transcription start sites in the immediate early CMV promoter and the polyhedrin promoter confirmed the existence of bacterial -10 and -35 consensus sequences. The importance of this heterologous gene expression for safety considerations was further supported by analysing fusions between the aforementioned promoters and a promoter-less cytotoxin gene. CONCLUSION According to our results a high percentage of viral promoters have the ability of initiating gene expression in E. coli. The degree of such heterologous gene expression can be sufficient for the expression of toxin genes and must therefore be considered when defining safety measures for the handling of corresponding genetically modified organisms.
Affiliation(s)
- Astrid Lewin
- Robert Koch-Institut, Nordufer 20, 13353 Berlin, Germany
- Martin Mayer
- HU-Berlin, Abt. Bakterienphysiologie, Chausseestr. 117, 10115 Berlin, Germany
- Janet Chusainow
- Bioprocessing Technology Institute (BTI), 20 Biopolis Way, Centros, Singapore 138668
- Daniela Jacob
- Robert Koch-Institut, Nordufer 20, 13353 Berlin, Germany
- Bernd Appel
- Bundesinstitut für Risikobewertung, Diedersdorfer Weg 1, 12277 Berlin, Germany
|
44
|
A neural network based multi-classifier system for gene identification in DNA sequences. Neural Comput Appl 2004. [DOI: 10.1007/s00521-004-0447-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
45
|
Burden S, Lin YX, Zhang R. Improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences. Bioinformatics 2004; 21:601-7. [PMID: 15454410 DOI: 10.1093/bioinformatics/bti047] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Although a great deal of research has been undertaken in the area of promoter prediction, prediction techniques are still not fully developed. Many algorithms tend to exhibit poor specificity, generating many false positives, or poor sensitivity. The neural network prediction program NNPP2.2 is one such example. RESULTS To improve the NNPP2.2 prediction technique, the distance between the transcription start site (TSS) associated with the promoter and the translation start site (TLS) of the subsequent gene coding region has been studied for Escherichia coli K12 bacteria. An empirical probability distribution that is consistent for all E. coli promoters has been established. This information is combined with the results from NNPP2.2 to create a new technique called TLS-NNPP, which improves the specificity of promoter prediction. The technique is shown to be effective using E. coli DNA sequences; however, it is applicable to any organism for which a set of promoters has been experimentally defined. AVAILABILITY The data used in this project and the prediction results for the tested sequences can be obtained from http://www.uow.edu.au/~yanxia/E_Coli_paper/SBurden_Results.xls CONTACT alh98@uow.edu.au.
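The idea of combining a predictor's score with an empirical TSS-to-TLS distance distribution might be sketched as follows. The abstract does not state how TLS-NNPP combines the two quantities, so the simple product used here, the 50 bp bin width, and the function names `empirical_distribution` and `rescore` are assumptions for illustration, not the published method.

```python
from collections import Counter


def empirical_distribution(distances, bin_width=50):
    """Estimate P(distance bin) from known TSS-to-TLS distances (in bp)."""
    counts = Counter(d // bin_width for d in distances)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def rescore(score, tss_pos, tls_pos, dist, bin_width=50):
    """Weight a promoter score by how plausible its TSS-to-TLS distance is.

    Assumed combination rule: score times the empirical probability of the
    distance bin. A TSS downstream of the TLS gets a weight of zero.
    """
    distance = tls_pos - tss_pos
    if distance < 0:
        return 0.0
    return score * dist.get(distance // bin_width, 0.0)
```

For example, with hypothetical training distances `[10, 20, 30, 120]`, a candidate TSS 30 bp upstream of the TLS keeps 75% of its score, while one at a distance never seen in the training set is suppressed entirely. This is how a distance prior can remove false positives without changing the underlying predictor.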
Affiliation(s)
- S Burden
- Department of Mathematics and Applied Statistics, University of Wollongong Wollongong, NSW 2522, Australia.
|
46
|
Lewin A, Tran TT, Jacob D, Mayer M, Freytag B, Appel B. Yeast DNA sequences initiating gene expression in Escherichia coli. Microbiol Res 2004; 159:19-28. [PMID: 15160603 DOI: 10.1016/j.micres.2004.01.006] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
Abstract
DNA transfer between pro- and eukaryotes occurs either during natural horizontal gene transfer or as a result of the employment of gene technology. We analysed the capacity of DNA sequences from a eukaryotic donor organism (Saccharomyces cerevisiae) to serve as promoter region in a prokaryotic recipient (Escherichia coli) by creating fusions between promoterless luxAB genes from Vibrio harveyi and random DNA sequences from S. cerevisiae and measuring the luminescence of transformed E. coli. Fifty-four out of 100 randomly analysed S. cerevisiae DNA sequences caused considerable gene expression in E. coli. Determination of transcription start sites within six selected yeast sequences in E. coli confirmed the existence of bacterial -10 and -35 consensus sequences at appropriate distances upstream from transcription initiation sites. Our results demonstrate that the probability of transcription of transferred eukaryotic DNA in bacteria is extremely high and does not require the insertion of the transferred DNA behind a promoter of the recipient genome.
Affiliation(s)
- Astrid Lewin
- Robert Koch-Institut, Nordufer 20, Berlin 13353, Germany.
|
47
|
Kalate RN, Tambe SS, Kulkarni BD. Artificial neural networks for prediction of mycobacterial promoter sequences. Comput Biol Chem 2004; 27:555-64. [PMID: 14667783 DOI: 10.1016/j.compbiolchem.2003.09.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
A multilayered feed-forward ANN architecture trained using the error-back-propagation (EBP) algorithm has been developed for predicting whether a given nucleotide sequence is a mycobacterial promoter sequence. Owing to the high prediction capability (≅97%) of the developed network model, it has been further used in conjunction with the caliper randomization (CR) approach for determining the structurally/functionally important regions in the promoter sequences. The results obtained thereby indicate that: (i) upstream region of -35 box, (ii) -35 region, (iii) spacer region, and (iv) -10 box are important for mycobacterial promoters. The CR approach also suggests that the -38 to -29 region plays a significant role in determining whether a given sequence is a mycobacterial promoter. In essence, the present study establishes ANNs as a tool for predicting mycobacterial promoter sequences and determining structurally/functionally important sub-regions therein.
Affiliation(s)
- Rupali N Kalate
- Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
|
48
|
Miyazaki S, Kuroda Y, Yokoyama S. Characterization and prediction of linker sequences of multi-domain proteins by a neural network. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS 2003; 2:37-51. [PMID: 12836673 DOI: 10.1023/a:1014418700858] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
In this paper, we describe a neural network analysis of sequences connecting two protein domains (domain linkers). The neural network was trained to distinguish between domain linker sequences and non-linker sequences, using a SCOP-defined domain library. The analysis indicated that a significant difference existed between domain linkers and non-linker regions, including intra-domain loop regions. Moreover, the resulting Hinton diagram showed a position-dependent amino acid preference of the domain linker sequences, and implied their non-random nature. We then applied the neural network to predict domain linkers in multi-domain protein sequences. As the result of a Jack-knife test, 58% of the predicted regions matched actual linker regions (specificity), and 36% of the SCOP-derived domain linkers were predicted (sensitivity). This prediction efficiency is superior to simpler methods derived from secondary structure prediction that assume that long loop regions are putative domain linkers. Altogether, these results suggest that domain linkers possess local characteristics different from those of loop regions.
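The two figures reported in this abstract use definitions worth spelling out: "specificity" here is the fraction of predicted regions that match actual linkers (what is often called precision), and "sensitivity" is the fraction of actual linkers that were predicted (recall). A minimal sketch follows; the function name `linker_metrics` and the counts in the usage note are hypothetical and not taken from the paper.

```python
def linker_metrics(n_predicted, n_actual, n_matched):
    """Return (specificity, sensitivity) as defined in the abstract.

    specificity = matched predictions / all predicted regions
    sensitivity = matched predictions / all actual linker regions
    """
    if n_predicted <= 0 or n_actual <= 0:
        raise ValueError("counts of predicted and actual regions must be positive")
    return n_matched / n_predicted, n_matched / n_actual
```

With illustrative counts of 50 predicted regions, 80 actual linkers, and 29 matches, the function returns a specificity of 0.58 and a sensitivity of about 0.36. Note that this "specificity" does not involve true negatives, so it is not comparable to the specificity reported in binary classification studies.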
Affiliation(s)
- Satoshi Miyazaki
- Department of Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
|
49
|
Maddouri M, Elloumi M. A data mining approach based on machine learning techniques to classify biological sequences. Knowl Based Syst 2002. [DOI: 10.1016/s0950-7051(01)00143-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
50
|
Parbhane RV, Tambe SS, Kulkarni BD. ANN modeling of DNA sequences: new strategies using DNA shape code. COMPUTERS & CHEMISTRY 2000; 24:699-711. [PMID: 10966128 DOI: 10.1016/s0097-8485(00)00072-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
Abstract
Two new encoding strategies, namely, wedge and twist codes, which are based on the DNA helical parameters, are introduced to represent DNA sequences in artificial neural network (ANN)-based modeling of biological systems. The performance of the new coding strategies has been evaluated by conducting three case studies involving mapping (modeling) and classification applications of ANNs. The proposed coding schemes have been compared rigorously and shown to outperform the existing coding strategies especially in situations wherein limited data are available for building the ANN models.
Affiliation(s)
- R V Parbhane
- Chemical Engineering Division, National Chemical Laboratory, Pune, India
|