1
|
Paul S, Olymon K, Martinez GS, Sarkar S, Yella VR, Kumar A. MLDSPP: Bacterial Promoter Prediction Tool Using DNA Structural Properties with Machine Learning and Explainable AI. J Chem Inf Model 2024; 64:2705-2719. [PMID: 38258978 DOI: 10.1021/acs.jcim.3c02017] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2024]
Abstract
Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge due to their diverse architecture and variations. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, and state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and MCC metrics. Our findings reveal that XGBoost outperformed other models and current state-of-the-art promoter prediction tools, namely Sigma70pred and iPromoter2L, achieving F1-scores >95% in most systems. Significantly, the use of one-hot encoding for representing nucleotide sequences complements these structural features, enhancing our XGBoost model's predictive capabilities. To address the challenge of model interpretability, we incorporated explainable AI techniques using Shapley values. This enhancement allows for a better understanding and interpretation of the predictions of our model. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, utilizing original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.
Collapse
Affiliation(s)
- Subhojit Paul
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
| | - Kaushika Olymon
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
| | - Gustavo Sganzerla Martinez
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center, Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
| | - Sharmilee Sarkar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
| | - Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522302, Andhra Pradesh, India
| | - Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
| |
Collapse
|
2
|
Uemura K, Ohyama T. Physical Peculiarity of Two Sites in Human Promoters: Universality and Diverse Usage in Gene Function. Int J Mol Sci 2024; 25:1487. [PMID: 38338773 PMCID: PMC10855393 DOI: 10.3390/ijms25031487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2023] [Revised: 01/15/2024] [Accepted: 01/18/2024] [Indexed: 02/12/2024] Open
Abstract
Since the discovery of physical peculiarities around transcription start sites (TSSs) and a site corresponding to the TATA box, research has revealed only the average features of these sites. Unsettled enigmas include the individual genes with these features and whether they relate to gene function. Herein, using 10 physical properties of DNA, including duplex DNA free energy, base stacking energy, protein-induced deformability, and stabilizing energy of Z-DNA, we clarified for the first time that approximately 97% of the promoters of 21,056 human protein-coding genes have distinctive physical properties around the TSS and/or position -27; of these, nearly 65% exhibited such properties at both sites. Furthermore, about 55% of the 21,056 genes had a minimum value of regional duplex DNA free energy within TSS-centered ±300 bp regions. Notably, distinctive physical properties within the promoters and free energies of the surrounding regions separated human protein-coding genes into five groups; each contained specific gene ontology (GO) terms. The group represented by immune response genes differed distinctly from the other four regarding the parameter of the free energies of the surrounding regions. A vital suggestion from this study is that physical-feature-based analyses of genomes may reveal new aspects of the organization and regulation of genes.
Collapse
Affiliation(s)
- Kohei Uemura
- Major in Integrative Bioscience and Biomedical Engineering, Graduate School of Science and Engineering, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan;
| | - Takashi Ohyama
- Major in Integrative Bioscience and Biomedical Engineering, Graduate School of Science and Engineering, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan;
- Department of Biology, Faculty of Education and Integrated Arts and Sciences, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
| |
Collapse
|
3
|
Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics 2022; 16:7. [PMID: 35180894 PMCID: PMC8855580 DOI: 10.1186/s40246-022-00376-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2021] [Accepted: 01/02/2022] [Indexed: 11/25/2022] Open
Abstract
Identification of genomic signals as indicators for functional genomic elements is one of the areas that received early and widespread application of machine learning methods. With time, the methods applied grew in variety and generally exhibited a tendency to improve their ability to identify some major genomic and transcriptomics signals. The evolution of machine learning in genomics followed a similar path to applications of machine learning in other fields. These were impacted in a major way by three dominant developments, namely an enormous increase in availability and quality of data, a significant increase in computational power available to machine learning applications, and finally, new machine learning paradigms, of which deep learning is the most well-known example. It is not easy in general to distinguish factors leading to improvements in results of applications of machine learning. This is even more so in the field of genomics, where the advent of next-generation sequencing and the increased ability to perform functional analysis of raw data have had a major effect on the applicability of machine learning in OMICS fields. In this paper, we survey the results from a subset of published work in application of machine learning in the recognition of genomic signals and regions in human genome and summarize some lessons learnt from this endeavor. There is no doubt that a significant progress has been made both in terms of accuracy and reliability of models. Questions remain however whether the progress has been sufficient and what these developments bring to the field of genomics in general and human genomics in particular. Improving usability, interpretability and accuracy of models remains an important open challenge for current and future research in application of machine learning and more generally of artificial intelligence methods in genomics.
Collapse
Affiliation(s)
- Boris Jankovic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Takashi Gojobori
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. .,Division of Biological and Environmental Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
| |
Collapse
|
4
|
de Medeiros Oliveira M, Bonadio I, Lie de Melo A, Mendes Souza G, Durham AM. TSSFinder-fast and accurate ab initio prediction of the core promoter in eukaryotic genomes. Brief Bioinform 2021; 22:bbab198. [PMID: 34050351 PMCID: PMC8574697 DOI: 10.1093/bib/bbab198] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Revised: 02/14/2021] [Accepted: 02/23/2021] [Indexed: 12/02/2022] Open
Abstract
Promoter annotation is an important task in the analysis of a genome. One of the main challenges for this task is locating the border between the promoter region and the transcribing region of the gene, the transcription start site (TSS). The TSS is the reference point to delimit the DNA sequence responsible for the assembly of the transcribing complex. As the same gene can have more than one TSS, so to delimit the promoter region, it is important to locate the closest TSS to the site of the beginning of the translation. This paper presents TSSFinder, a new software for the prediction of the TSS signal of eukaryotic genes that is significantly more accurate than other available software. We currently are the only application to offer pre-trained models for six different eukaryotic organisms: Arabidopsis thaliana, Drosophila melanogaster, Gallus gallus, Homo sapiens, Oryza sativa and Saccharomyces cerevisiae. Additionally, our software can be easily customized for specific organisms using only 125 DNA sequences with a validated TSS signal and corresponding genomic locations as a training set. TSSFinder is a valuable new tool for the annotation of genomes. TSSFinder source code and docker container can be downloaded from http://tssfinder.github.io. Alternatively, TSSFinder is also available as a web service at http://sucest-fun.org/wsapp/tssfinder/.
Collapse
Affiliation(s)
| | - Igor Bonadio
- Data Science, Elo7 Research Lab, São Paulo, Brazil
| | | | | | | |
Collapse
|
5
|
Aman Beshir J, Kebede M. In silico analysis of promoter regions and regulatory elements (motifs and CpG islands) of the genes encoding for alcohol production in Saccharomyces cerevisiaea S288C and Schizosaccharomyces pombe 972h. J Genet Eng Biotechnol 2021; 19:8. [PMID: 33428031 PMCID: PMC7801573 DOI: 10.1186/s43141-020-00097-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2020] [Accepted: 11/17/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND The crucial factor in the production of bio-fuels is the choice of potent microorganisms used in fermentation processes. Despite the evolving trend of using bacteria, yeast is still the primary choice for fermentation. Molecular characterization of many genes from baker's yeast (Saccharomyces cerevisiaea), and fission yeast (Schizosaccharomyces pombe), have improved our understanding in gene structure and the regulation of its expression. This in silico study was done with the aim of analyzing the promoter regions, transcription start site (TSS), and CpG islands of genes encoding for alcohol production in S. cerevisiaea S288C and S. pombe 972h-. RESULTS The analysis revealed the highest promoter prediction scores (1.0) were obtained in five sequences (AAD4, SFA1, GRE3, YKL071W, and YPR127W) for S. cerevisiaea S288C TSS while the lowest (0.8) were found in three sequences (AAD6, ADH5, and BDH2). Similarly, in S. pombe 972h-, the highest (0.99) and lowest (0.88) prediction scores were obtained in five (Adh1, SPBC8E4.04, SPBC215.11c, SPAP32A8.02, and SPAC19G12.09) and one (erg27) sequences, respectively. Determination of common motifs revealed that S. cerevisiaea S288C had 100% coverage at MSc1 with an E value of 3.7e-007 while S. pombe 972h- had 95.23% at MSp1 with an E value of 2.6e+002. Furthermore, comparison of identified transcription factor proteins indicated that 88.88% of MSp1 were exactly similar to MSc1. It also revealed that only 21.73% in S. cerevisiaea S288C and 28% in S. pombe 972h- of the gene body regions had CpG islands. A combined phylogenetic analysis indicated that all sequences from both S. cerevisiaea S288C and S. pombe 972h- were divided into four subgroups (I, II, III, and IV). The four clades are respectively colored in blue, red, green, and violet. CONCLUSION This in silico analysis of gene promoter regions and transcription factors through the actions of regulatory structure such as motifs and CpG islands of genes encoding alcohol production could be used to predict gene expression profiles in yeast species.
Collapse
Affiliation(s)
- Jemal Aman Beshir
- Department of Applied Biology, School of Applied Natural Science, Adama Science and Technology University, P.O. Box 1888, Adama, Ethiopia
- Ethiopian Sugar Corporation, Sugar Academy, Wonji, Ethiopia
| | - Mulugeta Kebede
- Department of Applied Biology, School of Applied Natural Science, Adama Science and Technology University, P.O. Box 1888, Adama, Ethiopia
| |
Collapse
|
6
|
Identification of Regulatory SNPs Associated with Vicine and Convicine Content of Vicia faba Based on Genotyping by Sequencing Data Using Deep Learning. Genes (Basel) 2020; 11:genes11060614. [PMID: 32516876 PMCID: PMC7349281 DOI: 10.3390/genes11060614] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 05/26/2020] [Accepted: 05/28/2020] [Indexed: 12/15/2022] Open
Abstract
Faba bean (Vicia faba) is a grain legume, which is globally grown for both human consumption as well as feed for livestock. Despite its agro-ecological importance the usage of Vicia faba is severely hampered by its anti-nutritive seed-compounds vicine and convicine (V+C). The genes responsible for a low V+C content have not yet been identified. In this study, we aim to computationally identify regulatory SNPs (rSNPs), i.e., SNPs in promoter regions of genes that are deemed to govern the V+C content of Vicia faba. For this purpose we first trained a deep learning model with the gene annotations of seven related species of the Leguminosae family. Applying our model, we predicted putative promoters in a partial genome of Vicia faba that we assembled from genotyping-by-sequencing (GBS) data. Exploiting the synteny between Medicago truncatula and Vicia faba, we identified two rSNPs which are statistically significantly associated with V+C content. In particular, the allele substitutions regarding these rSNPs result in dramatic changes of the binding sites of the transcription factors (TFs) MYB4, MYB61, and SQUA. The knowledge about TFs and their rSNPs may enhance our understanding of the regulatory programs controlling V+C content of Vicia faba and could provide new hypotheses for future breeding programs.
Collapse
|
7
|
Liu B, Han L, Liu X, Wu J, Ma Q. Computational Prediction of Sigma-54 Promoters in Bacterial Genomes by Integrating Motif Finding and Machine Learning Strategies. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1211-1218. [PMID: 29993815 DOI: 10.1109/tcbb.2018.2816032] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Sigma factor, as a unit of RNA polymerase holoenzyme, is a critical factor in the process of gene transcriptional regulation. It recognizes the specific DNA sites and brings the core enzyme of RNA polymerase to the upstream regions of target genes. Therefore, the prediction of the promoters for a particular sigma factor is essential for interpreting functional genomic data and observation. This paper develops a new method to predict sigma-54 promoters in bacterial genomes. The new method organically integrates motif finding and machine learning strategies to capture the intrinsic features of sigma-54 promoters. The experiments on E. coli benchmark test set show that our method has good capability to distinguish sigma-54 promoters from surrounding or randomly selected DNA sequences. The applications of the other three bacterial genomes indicate the potential robustness and applicative power of our method on a large number of bacterial genomes. The source code of our method can be freely downloaded at https://github.com/maqin2001/PromotePredictor.
Collapse
|
8
|
Karami K, Zerehdaran S, Javadmanesh A, Shariati MM, Fallahi H. Characterization of bovine (Bos taurus) imprinted genes from genomic to amino acid attributes by data mining approaches. PLoS One 2019; 14:e0217813. [PMID: 31170205 PMCID: PMC6553745 DOI: 10.1371/journal.pone.0217813] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2018] [Accepted: 05/21/2019] [Indexed: 01/05/2023] Open
Abstract
Genomic imprinting results in monoallelic expression of genes in mammals and flowering plants. Understanding the function of imprinted genes improves our knowledge of the regulatory processes in the genome. In this study, we have employed classification and clustering algorithms with attribute weighting to specify the unique attributes of both imprinted (monoallelic) and biallelic expressed genes. We have obtained characteristics of 22 known monoallelically expressed (imprinted) and 8 biallelic expressed genes that have been experimentally validated alongside 208 randomly selected genes in bovine (Bos taurus). Attribute weighting methods and various supervised and unsupervised algorithms in machine learning were applied. Unique characteristics were discovered and used to distinguish mono and biallelic expressed genes from each other in bovine. To obtain the accuracy of classification, 10-fold cross-validation with concerning each combination of attribute weighting (feature selection) and machine learning algorithms, was used. Our approach was able to accurately predict mono and biallelic genes using the genomics and proteomics attributes.
Collapse
Affiliation(s)
- Keyvan Karami
- Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Saeed Zerehdaran
- Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Ali Javadmanesh
- Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Mohammad Mahdi Shariati
- Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Hossein Fallahi
- Department of Biology, School of Sciences, Razi University, Kermanshah, Iran
| |
Collapse
|
9
|
Kalkatawi M, Magana-Mora A, Jankovic B, Bajic VB. DeepGSR: an optimized deep-learning structure for the recognition of genomic signals and regions. Bioinformatics 2019; 35:1125-1132. [PMID: 30184052 PMCID: PMC6449759 DOI: 10.1093/bioinformatics/bty752] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2017] [Revised: 07/15/2018] [Accepted: 08/31/2018] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Recognition of different genomic signals and regions (GSRs) in DNA is crucial for understanding genome organization, gene regulation, and gene function, which in turn generate better genome and gene annotations. Although many methods have been developed to recognize GSRs, their pure computational identification remains challenging. Moreover, various GSRs usually require a specialized set of features for developing robust recognition models. Recently, deep-learning (DL) methods have been shown to generate more accurate prediction models than 'shallow' methods without the need to develop specialized features for the problems in question. Here, we explore the potential use of DL for the recognition of GSRs. RESULTS We developed DeepGSR, an optimized DL architecture for the prediction of different types of GSRs. The performance of the DeepGSR structure is evaluated on the recognition of polyadenylation signals (PAS) and translation initiation sites (TIS) of different organisms: human, mouse, bovine and fruit fly. The results show that DeepGSR outperformed the state-of-the-art methods, reducing the classification error rate of the PAS and TIS prediction in the human genome by up to 29% and 86%, respectively. Moreover, the cross-organisms and genome-wide analyses we performed, confirmed the robustness of DeepGSR and provided new insights into the conservation of examined GSRs across species. AVAILABILITY AND IMPLEMENTATION DeepGSR is implemented in Python using Keras API; it is available as open-source software and can be obtained at https://doi.org/10.5281/zenodo.1117159. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Manal Kalkatawi
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Arturo Magana-Mora
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Drilling Technology Team, EXPEC-ARC, Saudi Aramco, Dhahran, Saudi Arabia
| | - Boris Jankovic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Vladimir B Bajic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
Collapse
|
10
|
He W, Jia C, Duan Y, Zou Q. 70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features. BMC SYSTEMS BIOLOGY 2018; 12:44. [PMID: 29745856 PMCID: PMC5998878 DOI: 10.1186/s12918-018-0570-1] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
BACKGROUND Promoter is an important sequence regulation element, which is in charge of gene transcription initiation. In prokaryotes, σ70 promoters regulate the transcription of most genes. The promoter recognition has been a crucial part of gene structure recognition. It's also the core issue of constructing gene transcriptional regulation network. With the successfully completion of genome sequencing from an increasing number of microbe species, the accurate identification of σ70 promoter regions in DNA sequence is not easy. RESULTS In order to improve the prediction accuracy of sigma70 promoters in prokaryote, a promoter recognition model 70ProPred was established. In this work, two sequence-based features, including position-specific trinucleotide propensity based on single-stranded characteristic (PSTNPss) and electron-ion potential values for trinucleotides (PseEIIP), were assessed to build the best prediction model. It was found that 79 features of PSTNPSS combined with 64 features of PseEIIP obtained the best performance for sigma70 promoter identification, with a promising accuracy and the Matthews correlation coefficient (MCC) at 95.56% and 0.90, respectively. CONCLUSION The jackknife tests showed that 70ProPred outperforms the existing sigma70 promoter prediction approaches in terms of accuracy and stability. Additionally, this approach can also be extended to predict promoters of other species. In order to facilitate experimental biologists, an online web server for the proposed method was established, which is freely available at http://server.malab.cn/70ProPred/ .
Collapse
Affiliation(s)
- Wenying He
- School of Computer Science and Technology, Tianjin University, Tianjin, 300072 China
| | - Cangzhi Jia
- Department of Mathematics, Dalian Maritime University, Dalian, 116026 China
| | - Yucong Duan
- College of Information and Technology, Hainan University, Haikou, 570228 China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, 300072 China
| |
Collapse
|
11
|
Ryasik A, Orlov M, Zykova E, Ermak T, Sorokin A. Bacterial promoter prediction: Selection of dynamic and static physical properties of DNA for reliable sequence classification. J Bioinform Comput Biol 2018; 16:1840003. [DOI: 10.1142/s0219720018400036] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Predicting promoter activity of DNA fragment is an important task for computational biology. Approaches using physical properties of DNA to predict bacterial promoters have recently gained a lot of attention. To select an adequate set of physical properties for training a classifier, various characteristics of DNA molecule should be taken into consideration. Here, we present a systematic approach that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices. To prove this concept, we have developed the first classifier that uses not only sequence and static physical properties of DNA fragment, but also dynamic properties of DNA open states. Therefore, the best performing models with accuracy values up to 90% for all types of sequences were obtained. Furthermore, we have demonstrated that the classifier can serve as a reliable tool enabling promoter DNA fragments to be distinguished from promoter islands despite the similarity of their nucleotide sequences.
Collapse
Affiliation(s)
- Artem Ryasik
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| | - Mikhail Orlov
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| | - Evgenia Zykova
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
- Department of Applied Research Informatization, State Institute of Information Technologies and Telecommunications (SIIT&T Informika), per. Brusov 21 st.2, Moscow, 125009, Russia
| | - Timofei Ermak
- Laboratory of Molecular Genetics Systems, Institute of Cytology and Genetics, pr. Akademika Lavrentyeva 10, Novosibirsk 630090, Russia
| | - Anatoly Sorokin
- Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
| |
Collapse
|
12
|
Evolution of Brain Active Gene Promoters in Human Lineage Towards the Increased Plasticity of Gene Regulation. Mol Neurobiol 2017; 55:1871-1904. [PMID: 28233272 DOI: 10.1007/s12035-017-0427-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2016] [Accepted: 01/26/2017] [Indexed: 01/31/2023]
Abstract
Adaptability to a variety of environmental conditions is a prominent feature of Homo sapiens. We hypothesize that this feature can be explained by evolutionary changes in gene promoters active in the brain prefrontal cortex leading to a more flexible gene regulation network. The genotype-dependent range of gene expression can be broader in humans than in other higher primates. Thus, we searched for specific signatures of evolutionary changes in promoter architectures of multiple hominid genes, including the genes active in human cortical neurons that may indicate an increase of variability of gene expression rather than just changes in the level of expression, such as downregulation or upregulation of the genes. We performed a whole-genome search for genetic-based alterations that may impact gene regulation "flexibility" in a process of hominids evolution, such as (i) CpG dinucleotide content, (ii) predicted nucleosome-DNA dissociation constant, and (iii) predicted affinities for TATA-binding protein (TBP) in gene promoters. We tested all putative promoter regions across the human genome and especially gene promoters in active chromatin state in neurons of prefrontal cortex, the brain region critical for abstract thinking and social and behavioral adaptation. Our data imply that the origin of modern man has been associated with an increase of flexibility of promoter-driven gene regulation in brain. In contrast, after splitting from the ancestral lineages of H. sapiens, the evolution of ape species is characterized by reduced flexibility of gene promoter functioning, underlying reduced variability of the gene expression.
Collapse
|
13
|
Sloutskin A, Danino YM, Orenstein Y, Zehavi Y, Doniger T, Shamir R, Juven-Gershon T. ElemeNT: a computational tool for detecting core promoter elements. Transcription 2016. [PMID: 26226151 PMCID: PMC4581360 DOI: 10.1080/21541264.2015.1067286] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open
Abstract
Core promoter elements play a pivotal role in the transcriptional output, yet they are often detected manually within sequences of interest. Here, we present 2 contributions to the detection and curation of core promoter elements within given sequences. First, the Elements Navigation Tool (ElemeNT) is a user-friendly web-based, interactive tool for prediction and display of putative core promoter elements and their biologically-relevant combinations. Second, the CORE database summarizes ElemeNT-predicted core promoter elements near CAGE and RNA-seq-defined Drosophila melanogaster transcription start sites (TSSs). ElemeNT's predictions are based on biologically-functional core promoter elements, and can be used to infer core promoter compositions. ElemeNT does not assume prior knowledge of the actual TSS position, and can therefore assist in annotation of any given sequence. These resources, freely accessible at http://lifefaculty.biu.ac.il/gershon-tamar/index.php/resources, facilitate the identification of core promoter elements as active contributors to gene expression.
Collapse
Affiliation(s)
- Anna Sloutskin
- a The Mina and Everard Goodman Faculty of Life Sciences ; Bar-Ilan University ; Ramat Gan , Israel
| | | | | | | | | | | | | |
Collapse
|
14
|
Carvalho SG, Guerra-Sá R, de C Merschmann LH. The impact of sequence length and number of sequences on promoter prediction performance. BMC Bioinformatics 2015; 16 Suppl 19:S5. [PMID: 26695879 PMCID: PMC4686783 DOI: 10.1186/1471-2105-16-s19-s5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND The advent of rapid evolution on sequencing capacity of new genomes has evidenced the need for data analysis automation aiming at speeding up the genomic annotation process and reducing its cost. Given that one important step for functional genomic annotation is the promoter identification, several studies have been taken in order to propose computational approaches to predict promoters. Different classifiers and characteristics of the promoter sequences have been used to deal with this prediction problem. However, several works in literature have addressed the promoter prediction problem using datasets containing sequences of 250 nucleotides or more. As the sequence length defines the amount of dataset attributes, even considering a limited number of properties to characterize the sequences, datasets with a high number of attributes are generated for training classifiers. Once high-dimensional datasets can degrade the classifiers predictive performance or even require an infeasible processing time, predicting promoters by training classifiers from datasets with a reduced number of attributes, it is essential to obtain good predictive performance with low computational cost. To the best of our knowledge, there is no work in literature that verified in a systematic way the relation between the sequences length and the predictive performance of classifiers. Thus, in this work, we have evaluated the impact of sequence length variation and training dataset size (number of sequences) on the predictive performance of classifiers. RESULTS We have built sixteen datasets composed of different sized sequences (ranging in length from 12 to 301 nucleotides) and evaluated them using the SVM, Random Forest and k-NN classifiers. The best predictive performances reached by SVM and Random Forest remained relatively stable for datasets composed of sequences varying in length from 301 to 41 nucleotides, while k-NN achieved its best performance for the dataset composed of 101 nucleotides. We have also analyzed, using sequences composed of only 41 nucleotides, the impact of increasing the number of sequences in a dataset on the predictive performance of the same three classifiers. Datasets containing 14,000, 80,000, 100,000 and 120,000 sequences were built and evaluated. All classifiers achieved better predictive performance for datasets containing 80,000 sequences or more. CONCLUSION The experimental results show that several datasets composed of shorter sequences achieved better predictive performance when compared with datasets composed of longer sequences, and also consumed a significantly shorter processing time. Furthermore, increasing the number of sequences in a dataset proved to be beneficial to the predictive power of classifiers.
Collapse
Affiliation(s)
- Sávio G Carvalho
- Federal University of Ouro Preto, Morro do Cruzeiro, Ouro Preto-MG, Brazil
| | - Renata Guerra-Sá
- Federal University of Ouro Preto, Morro do Cruzeiro, Ouro Preto-MG, Brazil
| | | |
Collapse
|
15
|
Yella VR, Bansal M. In silico Identification of Eukaryotic Promoters. SYSTEMS AND SYNTHETIC BIOLOGY 2015. [DOI: 10.1007/978-94-017-9514-2_4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
16
|
Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 2014; 9:e112963. [PMID: 25409509 PMCID: PMC4237348 DOI: 10.1371/journal.pone.0112963] [Citation(s) in RCA: 6047] [Impact Index Per Article: 549.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2014] [Accepted: 10/16/2014] [Indexed: 02/06/2023] Open
Abstract
Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired end data from two Illumina libraries with small e.g., 180 bp and large e.g., 3–5 Kb inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants including duplications and resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.
Collapse
Affiliation(s)
- Bruce J. Walker
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- * E-mail: (BJW); (AME)
| | - Thomas Abeel
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- VIB Department of Plant Systems Biology, Ghent University, Ghent, Belgium
| | - Terrance Shea
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Margaret Priest
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Amr Abouelliel
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Sharadha Sakthikumar
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Christina A. Cuomo
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Qiandong Zeng
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Jennifer Wortman
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Sarah K. Young
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Ashlee M. Earl
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- * E-mail: (BJW); (AME)
| |
Collapse
|
17
|
Lin H, Deng EZ, Ding H, Chen W, Chou KC. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res 2014; 42:12961-72. [PMID: 25361964 PMCID: PMC4245931 DOI: 10.1093/nar/gku1019] [Citation(s) in RCA: 413] [Impact Index Per Article: 37.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The σ54 promoters are unique in prokaryotic genome and responsible for transcripting carbon and nitrogen-related genes. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapidly and effectively identifying the σ54 promoters. Here, a predictor called ‘iPro54-PseKNC’ was developed. In the predictor, the samples of DNA sequences were formulated by a novel feature vector called ‘pseudo k-tuple nucleotide composition’, which was further optimized by the incremental feature selection procedure. The performance of iPro54-PseKNC was examined by the rigorous jackknife cross-validation tests on a stringent benchmark data set. As a user-friendly web-server, iPro54-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iPro54-PseKNC. For the convenience of the vast majority of experimental scientists, a step-by-step protocol guide was provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented in this paper just for its integrity. Meanwhile, we also discovered through an in-depth statistical analysis that the distribution of distances between the transcription start sites and the translation initiation sites were governed by the gamma distribution, which may provide a fundamental physical principle for studying the σ54 promoters.
Collapse
Affiliation(s)
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China Gordon Life Science Institute, Belmont, MA, USA
| | - En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China Gordon Life Science Institute, Belmont, MA, USA
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA, USA Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| |
Collapse
|
18
|
Carter JR, Keith JH, Fraser TS, Dawson JL, Kucharski CA, Horne KM, Higgs S, Fraser MJ. Effective suppression of dengue virus using a novel group-I intron that induces apoptotic cell death upon infection through conditional expression of the Bax C-terminal domain. Virol J 2014; 11:111. [PMID: 24927852 PMCID: PMC4104402 DOI: 10.1186/1743-422x-11-111] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2014] [Accepted: 05/20/2014] [Indexed: 11/10/2022] Open
Abstract
INTRODUCTION Approximately 100 million confirmed infections and 20,000 deaths are caused by Dengue virus (DENV) outbreaks annually. Global warming and rapid dispersal have resulted in DENV epidemics in formally non-endemic regions. Currently no consistently effective preventive measures for DENV exist, prompting development of transgenic and paratransgenic vector control approaches. Production of transgenic mosquitoes refractory for virus infection and/or transmission is contingent upon defining antiviral genes that have low probability for allowing escape mutations, and are equally effective against multiple serotypes. Previously we demonstrated the effectiveness of an anti-viral group I intron targeting U143 of the DENV genome in mediating trans-splicing and expression of a marker gene with the capsid coding domain. In this report we examine the effectiveness of coupling expression of ΔN Bax to trans-splicing U143 intron activity as a means of suppressing DENV infection of mosquito cells. RESULTS Targeting the conserved DENV circularization sequence (CS) by U143 intron trans-splicing activity appends a 3' exon RNA encoding ΔN Bax to the capsid coding region of the genomic RNA, resulting in a chimeric protein that induces premature cell death upon infection. TCID50-IFA analyses demonstrate an enhancement of DENV suppression for all DENV serotypes tested over the identical group I intron coupled with the non-apoptotic inducing firefly luciferase as the 3' exon. These cumulative results confirm the increased effectiveness of this αDENV-U143-ΔN Bax group I intron as a sequence specific antiviral that should be useful for suppression of DENV in transgenic mosquitoes. Annexin V staining, caspase 3 assays, and DNA ladder observations confirm DCA-ΔN Bax fusion protein expression induces apoptotic cell death. CONCLUSION This report confirms the relative effectiveness of an anti-DENV group I intron coupled to an apoptosis-inducing ΔN Bax 3' exon that trans-splices conserved sequences of the 5' CS region of all DENV serotypes and induces apoptotic cell death upon infection. Our results confirm coupling the targeted ribozyme capabilities of the group I intron with the generation of an apoptosis-inducing transcript increases the effectiveness of infection suppression, improving the prospects of this unique approach as a means of inducing transgenic refractoriness in mosquitoes for all serotypes of this important disease.
Collapse
Affiliation(s)
| | | | | | | | | | | | | | - Malcolm J Fraser
- Department of Biological Sciences, Eck Institute of Global Health, University of Notre Dame, Notre Dame, Indiana 46556, USA.
| |
Collapse
|
19
|
Xiong D, Liu R, Xiao F, Gao X. ProMT: effective human promoter prediction using Markov chain model based on DNA structural properties. IEEE Trans Nanobioscience 2014; 13:374-83. [PMID: 24919203 DOI: 10.1109/tnb.2014.2327586] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
The core promoters play significant and extensive roles for the initiation and regulation of DNA transcription. The identification of core promoters is one of the most challenging problems yet. Due to the diverse nature of core promoters, the results obtained through existing computational approaches are not satisfactory. None of them considered the potential influence on performance of predictive approach resulted by the interference between neighboring TSSs in TSS clusters. In this paper, we sufficiently considered this main factor and proposed an approach to locate potential TSS clusters according to the correlation of regional profiles of DNA and TSS clusters. On this basis, we further presented a novel computational approach (ProMT) for promoter prediction using Markov chain model and predictive TSS clusters based on structural properties of DNA. Extensive experiments demonstrated that ProMT can significantly improve the predictive performance. Therefore, considering interference between neighboring TSSs is essential for a wider range of promoter prediction.
Collapse
|
20
|
Grid topologies for the self-organizing map. Neural Netw 2014; 56:35-48. [PMID: 24861385 DOI: 10.1016/j.neunet.2014.05.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2014] [Revised: 04/28/2014] [Accepted: 05/01/2014] [Indexed: 11/20/2022]
Abstract
The original Self-Organizing Feature Map (SOFM) has been extended in many ways to suit different goals and application domains. However, the topologies of the map lattice that we can found in literature are nearly always square or, more rarely, hexagonal. In this paper we study alternative grid topologies, which are derived from the geometrical theory of tessellations. Experimental results are presented for unsupervised clustering, color image segmentation and classification tasks, which show that the differences among the topologies are statistically significant in most cases, and that the optimal topology depends on the problem at hand. A theoretical interpretation of these results is also developed.
Collapse
|
21
|
Meysman P, Collado-Vides J, Morett E, Viola R, Engelen K, Laukens K. Structural properties of prokaryotic promoter regions correlate with functional features. PLoS One 2014; 9:e88717. [PMID: 24516674 PMCID: PMC3918002 DOI: 10.1371/journal.pone.0088717] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2013] [Accepted: 01/10/2014] [Indexed: 12/31/2022] Open
Abstract
The structural properties of the DNA molecule are known to play a critical role in transcription. In this paper, the structural profiles of promoter regions were studied within the context of their diversity and their function for eleven prokaryotic species; Escherichia coli, Klebsiella pneumoniae, Salmonella Typhimurium, Pseudomonas auroginosa, Geobacter sulfurreducens Helicobacter pylori, Chlamydophila pneumoniae, Synechocystis sp., Synechoccocus elongates, Bacillus anthracis, and the archaea Sulfolobus solfataricus. The main anchor point for these promoter regions were transcription start sites identified through high-throughput experiments or collected within large curated databases. Prokaryotic promoter regions were found to be less stable and less flexible than the genomic mean across all studied species. However, direct comparison between species revealed differences in their structural profiles that can not solely be explained by the difference in genomic GC content. In addition, comparison with functional data revealed that there are patterns in the promoter structural profiles that can be linked to specific functional loci, such as sigma factor regulation or transcription factor binding. Interestingly, a novel structural element clearly visible near the transcription start site was found in genes associated with essential cellular functions and growth in several species. Our analyses reveals the great diversity in promoter structural profiles both between and within prokaryotic species. We observed relationships between structural diversity and functional features that are interesting prospects for further research to yet uncharacterized functional loci defined by DNA structural properties.
Collapse
Affiliation(s)
- Pieter Meysman
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
| | - Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
| | - Enrique Morett
- Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Instituto Nacional de Medicina Genómica, Mexico City, Mexico
| | - Roberto Viola
- Department of Computational Biology, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
| | - Kristof Engelen
- Department of Computational Biology, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- * E-mail: (KE); (KL)
| | - Kris Laukens
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- * E-mail: (KE); (KL)
| |
Collapse
|
22
|
Huang WL, Tung CW, Liaw C, Huang HL, Ho SY. Rule-based knowledge acquisition method for promoter prediction in human and Drosophila species. ScientificWorldJournal 2014; 2014:327306. [PMID: 24955394 PMCID: PMC3927563 DOI: 10.1155/2014/327306] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2013] [Accepted: 10/10/2013] [Indexed: 01/08/2023] Open
Abstract
The rapid and reliable identification of promoter regions is important when the number of genomes to be sequenced is increasing very speedily. Various methods have been developed but few methods investigate the effectiveness of sequence-based features in promoter prediction. This study proposes a knowledge acquisition method (named PromHD) based on if-then rules for promoter prediction in human and Drosophila species. PromHD utilizes an effective feature-mining algorithm and a reference feature set of 167 DNA sequence descriptors (DNASDs), comprising three descriptors of physicochemical properties (absorption maxima, molecular weight, and molar absorption coefficient), 128 top-ranked descriptors of 4-mer motifs, and 36 global sequence descriptors. PromHD identifies two feature subsets with 99 and 74 DNASDs and yields test accuracies of 96.4% and 97.5% in human and Drosophila species, respectively. Based on the 99- and 74-dimensional feature vectors, PromHD generates several if-then rules by using the decision tree mechanism for promoter prediction. The top-ranked informative rules with high certainty grades reveal that the global sequence descriptor, the length of nucleotide A at the first position of the sequence, and two physicochemical properties, absorption maxima and molecular weight, are effective in distinguishing promoters from non-promoters in human and Drosophila species, respectively.
Collapse
Affiliation(s)
- Wen-Lin Huang
- Department of Management Information System, Asia Pacific Institute of Creativity, Miaoli 351, Taiwan
| | - Chun-Wei Tung
- School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung 807, Taiwan
| | - Chyn Liaw
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
| | - Hui-Ling Huang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
| | - Shinn-Ying Ho
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
| |
Collapse
|
23
|
Datta S, Mukhopadhyay S. A composite method based on formal grammar and DNA structural features in detecting human polymerase II promoter region. PLoS One 2013; 8:e54843. [PMID: 23437045 PMCID: PMC3577817 DOI: 10.1371/journal.pone.0054843] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2012] [Accepted: 12/17/2012] [Indexed: 11/25/2022] Open
Abstract
An important step in understanding gene regulation is to identify the promoter regions where the transcription factor binding takes place. Predicting a promoter region de novo has been a theoretical goal for many researchers for a long time. There exists a number of in silico methods to predict the promoter region de novo but most of these methods are still suffering from various shortcomings, a major one being the selection of appropriate features of promoter region distinguishing them from non-promoters. In this communication, we have proposed a new composite method that predicts promoter sequences based on the interrelationship between structural profiles of DNA and primary sequence elements of the promoter regions. We have shown that a Context Free Grammar (CFG) can formalize the relationships between different primary sequence features and by utilizing the CFG, we demonstrate that an efficient parser can be constructed for extracting these relationships from DNA sequences to distinguish the true promoter sequences from non-promoter sequences. Along with CFG, we have extracted the structural features of the promoter region to improve upon the efficiency of our prediction system. Extensive experiments performed on different datasets reveals that our method is effective in predicting promoter sequences on a genome-wide scale and performs satisfactorily as compared to other promoter prediction techniques.
Collapse
Affiliation(s)
- Sutapa Datta
- Department of Biophysics, Molecular Biology and Bioinformatics and Distributed Information Centre for Bioinformatics, University of Calcutta, Kolkata, West Bengal, India.
| | | |
Collapse
|
24
|
Zhou X, Li Z, Dai Z, Zou X. Predicting promoters by pseudo-trinucleotide compositions based on discrete wavelets transform. J Theor Biol 2013; 319:1-7. [DOI: 10.1016/j.jtbi.2012.11.024] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2012] [Revised: 11/20/2012] [Accepted: 11/21/2012] [Indexed: 10/27/2022]
|
25
|
Abstract
We present here a novel methodology for predicting new genes in prokaryotic genomes on the basis of inherent energetics of DNA. Regions of higher thermodynamic stability were identified, which were filtered based on already known annotations to yield a set of potentially new genes. These were then processed for their compatibility with the stereo-chemical properties of proteins and tripeptide frequencies of proteins in Swissprot data, which results in a reliable set of new genes in a genome. Quite surprisingly, the methodology identifies new genes even in well-annotated genomes. Also, the methodology can handle genomes of any GC-content, size and number of annotated genes.
Collapse
|
26
|
A new avenue for classification and prediction of olive cultivars using supervised and unsupervised algorithms. PLoS One 2012; 7:e44164. [PMID: 22957050 PMCID: PMC3434224 DOI: 10.1371/journal.pone.0044164] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2012] [Accepted: 07/30/2012] [Indexed: 11/19/2022] Open
Abstract
Various methods have been used to identify cultivares of olive trees; herein we used different bioinformatics algorithms to propose new tools to classify 10 cultivares of olive based on RAPD and ISSR genetic markers datasets generated from PCR reactions. Five RAPD markers (OPA0a21, OPD16a, OP01a1, OPD16a1 and OPA0a8) and five ISSR markers (UBC841a4, UBC868a7, UBC841a14, U12BC807a and UBC810a13) selected as the most important markers by all attribute weighting models. K-Medoids unsupervised clustering run on SVM dataset was fully able to cluster each olive cultivar to the right classes. All trees (176) induced by decision tree models generated meaningful trees and UBC841a4 attribute clearly distinguished between foreign and domestic olive cultivars with 100% accuracy. Predictive machine learning algorithms (SVM and Naïve Bayes) were also able to predict the right class of olive cultivares with 100% accuracy. For the first time, our results showed data mining techniques can be effectively used to distinguish between plant cultivares and proposed machine learning based systems in this study can predict new olive cultivars with the best possible accuracy.
Collapse
|
27
|
Osypov AA, Krutinin GG, Krutinina EA, Kamzolova SG. DEPPDB - DNA electrostatic potential properties database. Electrostatic properties of genome DNA elements. J Bioinform Comput Biol 2012; 10:1241004. [PMID: 22809340 DOI: 10.1142/s0219720012410041] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Electrostatic properties of genome DNA are important to its interactions with different proteins, in particular, related to transcription. DEPPDB - DNA Electrostatic Potential (and other Physical) Properties Database - provides information on the electrostatic and other physical properties of genome DNA combined with its sequence and annotation of biological and structural properties of genomes and their elements. Genomes are organized on taxonomical basis, supporting comparative and evolutionary studies. Currently, DEPPDB contains all completely sequenced bacterial, viral, mitochondrial, and plastids genomes according to the NCBI RefSeq, and some model eukaryotic genomes. Data for promoters, regulation sites, binding proteins, etc., are incorporated from established DBs and literature. The database is complemented by analytical tools. User sequences calculations are available. Case studies discovered electrostatics complementing DNA bending in E.coli plasmid BNT2 promoter functioning, possibly affecting host-environment metabolic switch. Transcription factors binding sites gravitate to high potential regions, confirming the electrostatics universal importance in protein-DNA interactions beyond the classical promoter-RNA polymerase recognition and regulation. Other genome elements, such as terminators, also show electrostatic peculiarities. Most intriguing are gene starts, exhibiting taxonomic correlations. The necessity of the genome electrostatic properties studies is discussed.
Collapse
Affiliation(s)
- Alexander A Osypov
- Laboratory of Mechanisms of the Cell Genome Functioning, Institute of Cell Biophysics RAS, Pushchino, 142290, Russia.
| | | | | | | |
Collapse
|
28
|
Hosseinzadeh F, Ebrahimi M, Goliaei B, Shamabadi N. Classification of lung cancer tumors based on structural and physicochemical properties of proteins by bioinformatics models. PLoS One 2012; 7:e40017. [PMID: 22829872 PMCID: PMC3400626 DOI: 10.1371/journal.pone.0040017] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2012] [Accepted: 05/30/2012] [Indexed: 12/03/2022] Open
Abstract
Rapid distinction between small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) tumors is very important in diagnosis of this disease. Furthermore sequence-derived structural and physicochemical descriptors are very useful for machine learning prediction of protein structural and functional classes, classifying proteins and the prediction performance. Herein, in this study is the classification of lung tumors based on 1497 attributes derived from structural and physicochemical properties of protein sequences (based on genes defined by microarray analysis) investigated through a combination of attribute weighting, supervised and unsupervised clustering algorithms. Eighty percent of the weighting methods selected features such as autocorrelation, dipeptide composition and distribution of hydrophobicity as the most important protein attributes in classification of SCLC, NSCLC and COMMON classes of lung tumors. The same results were observed by most tree induction algorithms while descriptors of hydrophobicity distribution were high in protein sequences COMMON in both groups and distribution of charge in these proteins was very low; showing COMMON proteins were very hydrophobic. Furthermore, compositions of polar dipeptide in SCLC proteins were higher than NSCLC proteins. Some clustering models (alone or in combination with attribute weighting algorithms) were able to nearly classify SCLC and NSCLC proteins. Random Forest tree induction algorithm, calculated on leaves one-out and 10-fold cross validation) shows more than 86% accuracy in clustering and predicting three different lung cancer tumors. Here for the first time the application of data mining tools to effectively classify three classes of lung cancer tumors regarding the importance of dipeptide composition, autocorrelation and distribution descriptor has been reported.
Collapse
Affiliation(s)
- Faezeh Hosseinzadeh
- Student at Laboratory of Biophysics and Molecular Biology, Institute of Biophysics and Biochemistry, University of Tehran, Tehran, Iran
| | - Mansour Ebrahimi
- Department of Biology at Basic science School & Bioinformatics Research Group, Green Research Center, University of Qom, Qom, Iran
| | - Bahram Goliaei
- Department of Medical Physics, Iran University of Medical Science, Tehran, Iran
| | - Narges Shamabadi
- Bioinformatics Research Group, Green Research Center, University of Qom, Qom, Iran
| |
Collapse
|
29
|
Abstract
Transcription factors and the short, often degenerate DNA sequences they recognize are central regulators of gene expression, but their regulatory code is challenging to dissect experimentally. Thus, computational approaches have long been used to identify putative regulatory elements from the patterns in promoter sequences. Here we present a new algorithm “POWRS” (POsition-sensitive WoRd Set) for identifying regulatory sequence motifs, specifically developed to address two common shortcomings of existing algorithms. First, POWRS uses the position-specific enrichment of regulatory elements near transcription start sites to significantly increase sensitivity, while providing new information about the preferred localization of those elements. Second, POWRS forgoes position weight matrices for a discrete motif representation that appears more resistant to over-generalization. We apply this algorithm to discover sequences related to constitutive, high-level gene expression in the model plant Arabidopsis thaliana, and then experimentally validate the importance of those elements by systematically mutating two endogenous promoters and measuring the effect on gene expression levels. This provides a foundation for future efforts to rationally engineer gene expression in plants, a problem of great importance in developing biotech crop varieties. Availability: BSD-licensed Python code at http://grassrootsbio.com/papers/powrs/.
Collapse
|
30
|
Meysman P, Marchal K, Engelen K. DNA structural properties in the classification of genomic transcription regulation elements. Bioinform Biol Insights 2012; 6:155-68. [PMID: 22837642 PMCID: PMC3399529 DOI: 10.4137/bbi.s9426] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
It has been long known that DNA molecules encode information at various levels. The most basic level comprises the base sequence itself and is primarily important for the encoding of proteins and direct base recognition by DNA-binding proteins. A more elusive level consists of the local structural properties of the DNA molecule wherein the DNA sequence only plays an indirect supportive role. These properties are nevertheless an important factor in a large number of biomolecular processes and can be considered as informative signals for the presence of a variety of genomic features. Several recent studies have unequivocally shown the benefit of relying on such DNA properties for modeling and predicting genomic features as diverse as transcription start sites, transcription factor binding sites, or nucleosome occupancy. This review is meant to provide an overview of the key aspects of these DNA conformational and physicochemical properties. To illustrate their potential added value compared to relying solely on the nucleotide sequence in genomics studies, we discuss their application in research on transcription regulation mechanisms as representative cases.
Collapse
Affiliation(s)
- Pieter Meysman
- Department of Molecular and Microbial Systems, KULeuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
| | | | | |
Collapse
|
31
|
Gan Y, Guan J, Zhou S. A comparison study on feature selection of DNA structural properties for promoter prediction. BMC Bioinformatics 2012; 13:4. [PMID: 22226192 PMCID: PMC3280155 DOI: 10.1186/1471-2105-13-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2011] [Accepted: 01/07/2012] [Indexed: 01/27/2023] Open
Abstract
Background Promoter prediction is an integrant step for understanding gene regulation and annotating genomes. Traditional promoter analysis is mainly based on sequence compositional features. Recently, many kinds of structural features have been employed in promoter prediction. However, considering the high-dimensionality and overfitting problems, it is unfeasible to utilize all available features for promoter prediction. Thus it is necessary to choose some appropriate features for the prediction task. Results This paper conducts an extensive comparison study on feature selection of DNA structural properties for promoter prediction. Firstly, to examine whether promoters possess some special structures, we carry out a systematical comparison among the profiles of thirteen structural features on promoter and non-promoter sequences. Secondly, we investigate the correlations between these structural features and promoter sequences. Thirdly, both filter and wrapper methods are utilized to select appropriate feature subsets from thirteen different kinds of structural features for promoter prediction, and the predictive power of the selected feature subsets is evaluated. Finally, we compare the prediction performance of the feature subsets selected in this paper with nine existing promoter prediction approaches. Conclusions Experimental results show that the structural features are differentially correlated to promoters. Specifically, DNA-bending stiffness, DNA denaturation and energy-related features are highly correlated with promoters. The predictive power for promoter sequences differentiates greatly among different structural features. Selecting the relevant features can significantly improve the accuracy of promoter prediction.
Collapse
Affiliation(s)
- Yanglan Gan
- Department of Computer Science and Technology, Tongji University, Shanghai, China
| | | | | |
Collapse
|
32
|
Prediction of thermostability from amino acid attributes by combination of clustering with attribute weighting: a new vista in engineering enzymes. PLoS One 2011; 6:e23146. [PMID: 21853079 PMCID: PMC3154288 DOI: 10.1371/journal.pone.0023146] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2011] [Accepted: 07/06/2011] [Indexed: 11/19/2022] Open
Abstract
The engineering of thermostable enzymes is receiving increased attention. The paper, detergent, and biofuel industries, in particular, seek to use environmentally friendly enzymes instead of toxic chlorine chemicals. Enzymes typically function at temperatures below 60°C and denature if exposed to higher temperatures. In contrast, a small portion of enzymes can withstand higher temperatures as a result of various structural adaptations. Understanding the protein attributes that are involved in this adaptation is the first step toward engineering thermostable enzymes. We employed various supervised and unsupervised machine learning algorithms as well as attribute weighting approaches to find amino acid composition attributes that contribute to enzyme thermostability. Specifically, we compared two groups of enzymes: mesostable and thermostable enzymes. Furthermore, a combination of attribute weighting with supervised and unsupervised clustering algorithms was used for prediction and modelling of protein thermostability from amino acid composition properties. Mining a large number of protein sequences (2090) through a variety of machine learning algorithms, which were based on the analysis of more than 800 amino acid attributes, increased the accuracy of this study. Moreover, these models were successful in predicting thermostability from the primary structure of proteins. The results showed that expectation maximization clustering in combination with uncertainly and correlation attribute weighting algorithms can effectively (100%) classify thermostable and mesostable proteins. Seventy per cent of the weighting methods selected Gln content and frequency of hydrophilic residues as the most important protein attributes. On the dipeptide level, the frequency of Asn-Glu was the key factor in distinguishing mesostable from thermostable enzymes. This study demonstrates the feasibility of predicting thermostability irrespective of sequence similarity and will serve as a basis for engineering thermostable enzymes in the laboratory.
Collapse
|
33
|
Morey C, Mookherjee S, Rajasekaran G, Bansal M. DNA free energy-based promoter prediction and comparative analysis of Arabidopsis and rice genomes. PLANT PHYSIOLOGY 2011; 156:1300-15. [PMID: 21531900 PMCID: PMC3135951 DOI: 10.1104/pp.110.167809] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/19/2010] [Accepted: 04/21/2011] [Indexed: 05/06/2023]
Abstract
The cis-regulatory regions on DNA serve as binding sites for proteins such as transcription factors and RNA polymerase. The combinatorial interaction of these proteins plays a crucial role in transcription initiation, which is an important point of control in the regulation of gene expression. We present here an analysis of the performance of an in silico method for predicting cis-regulatory regions in the plant genomes of Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) on the basis of free energy of DNA melting. For protein-coding genes, we achieve recall and precision of 96% and 42% for Arabidopsis and 97% and 31% for rice, respectively. For noncoding RNA genes, the program gives recall and precision of 94% and 75% for Arabidopsis and 95% and 90% for rice, respectively. Moreover, 96% of the false-positive predictions were located in noncoding regions of primary transcripts, out of which 20% were found in the first intron alone, indicating possible regulatory roles. The predictions for orthologous genes from the two genomes showed a good correlation with respect to prediction scores and promoter organization. Comparison of our results with an existing program for promoter prediction in plant genomes indicates that our method shows improved prediction capability.
Collapse
Affiliation(s)
| | | | | | - Manju Bansal
- Indian Institute of Science, Bangalore 560 012, India
| |
Collapse
|
34
|
Bedo J, Kowalczyk A. Genome annotation test with validation on transcription start site and ChIP-Seq for Pol-II binding data. Bioinformatics 2011; 27:1610-7. [DOI: 10.1093/bioinformatics/btr263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
35
|
Irie T, Park SJ, Yamashita R, Seki M, Yada T, Sugano S, Nakai K, Suzuki Y. Predicting promoter activities of primary human DNA sequences. Nucleic Acids Res 2011; 39:e75. [PMID: 21486745 PMCID: PMC3113590 DOI: 10.1093/nar/gkr173] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
We developed a computer program that can predict the intrinsic promoter activities of primary human DNA sequences. We observed promoter activity using a quantitative luciferase assay and generated a prediction model using multiple linear regression. Our program achieved a prediction accuracy correlation coefficient of 0.87 between the predicted and observed promoter activities. We evaluated the prediction accuracy of the program using massive sequencing analysis of transcriptional start sites in vivo. We found that it is still difficult to predict transcript levels in a strictly quantitative manner in vivo; however, it was possible to select active promoters in a given cell from the other silent promoters. Using this program, we analyzed the transcriptional landscape of the entire human genome. We demonstrate that many human genomic regions have potential promoter activity, and the expression of some previously uncharacterized putatively non-protein-coding transcripts can be explained by our prediction model. Furthermore, we found that nucleosomes occasionally formed open chromatin structures with RNA polymerase II recruitment where the program predicted significant promoter activities, although no transcripts were observed.
Collapse
Affiliation(s)
- Takuma Irie
- Department of Medical Genome Sciences, Graduate School of Frontier Sciences, the University of Tokyo, 5-1-5 Kashiwanoha, Kashiwashi, Chiba 277-8562, Japan
| | | | | | | | | | | | | | | |
Collapse
|
36
|
Kantorovitz MR, Rapti Z, Gelev V, Usheva A. Computing DNA duplex instability profiles efficiently with a two-state model: trends of promoters and binding sites. BMC Bioinformatics 2010; 11:604. [PMID: 21172036 PMCID: PMC3018474 DOI: 10.1186/1471-2105-11-604] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2010] [Accepted: 12/21/2010] [Indexed: 11/30/2022] Open
Abstract
Background DNA instability profiles have been used recently for predicting the transcriptional start site and the location of core promoters, and to gain insight into promoter action. It was also shown that the use of these profiles can significantly improve the performance of motif finding programs. Results In this work we introduce a new method for computing DNA instability profiles. The model that we use is a modified Ising-type model and it is implemented via statistical mechanics. Our linear time algorithm computes the profile of a 10,000 base-pair long sequence in less than one second. The method we use also allows the computation of the probability that several consecutive bases are unpaired simultaneously. This is a feature that is not available in other linear-time algorithms. We use the model to compare the thermodynamic trends of promoter sequences of several genomes. In addition, we report results that associate the location of local extrema in the instability profiles with the presence of core promoter elements at these locations and with the location of the transcription start sites (TSS). We also analyzed the instability scores of binding sites of several human core promoter elements. We show that the instability scores of functional binding sites of a given core promoter element are significantly different than the scores of sites with the same motif occurring outside the functional range (relative to the TSS). Conclusions The time efficiency of the algorithm and its genome-wide applications makes this work of broad interest to scientists interested in transcriptional regulation, motif discovery, and comparative genomics.
Collapse
Affiliation(s)
- Miriam R Kantorovitz
- Department of Mathematics, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
| | | | | | | |
Collapse
|
37
|
Dineen DG, Schröder M, Higgins DG, Cunningham P. Ensemble approach combining multiple methods improves human transcription start site prediction. BMC Genomics 2010; 11:677. [PMID: 21118509 PMCID: PMC3053590 DOI: 10.1186/1471-2164-11-677] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2010] [Accepted: 11/30/2010] [Indexed: 11/20/2022] Open
Abstract
Background The computational prediction of transcription start sites is an important unsolved problem. Some recent progress has been made, but many promoters, particularly those not associated with CpG islands, are still difficult to locate using current methods. These methods use different features and training sets, along with a variety of machine learning techniques and result in different prediction sets. Results We demonstrate the heterogeneity of current prediction sets, and take advantage of this heterogeneity to construct a two-level classifier ('Profisi Ensemble') using predictions from 7 programs, along with 2 other data sources. Support vector machines using 'full' and 'reduced' data sets are combined in an either/or approach. We achieve a 14% increase in performance over the current state-of-the-art, as benchmarked by a third-party tool. Conclusions Supervised learning methods are a useful way to combine predictions from diverse sources.
Collapse
Affiliation(s)
- David G Dineen
- Complex and Adaptive Systems Laboratory (CASL), University College Dublin, Belfield, Dublin 4, Ireland.
| | | | | | | |
Collapse
|
38
|
Identification of TATA and TATA-less promoters in plant genomes by integrating diversity measure, GC-Skew and DNA geometric flexibility. Genomics 2010; 97:112-20. [PMID: 21112384 DOI: 10.1016/j.ygeno.2010.11.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2010] [Revised: 11/05/2010] [Accepted: 11/12/2010] [Indexed: 11/20/2022]
Abstract
Accurate identification of core promoters is important for gaining more insight about the understanding of the eukaryotic transcription regulation. In this study, the authors focused on the biologically realistic promoter prediction of plant genomes. By analyzing the correlative conservation, GC-compositional bias and specific structural patterns of TATA and TATA-less promoters in PlantPromDB, a hybrid multi-feature approach based on support vector machine (SVM) for predicting the two types of promoters were developed by integrating local word content, GC-Skew and DNA geometric flexibility. Compared with the TSSP-TCM program on the same test dataset, better prediction results were obtained. Especially for the TATA-less promoter, the accuracy is 10% higher than the result of TSSP-TCM program. The good performance of the hybrid promoters and the experimental data also indicate that our method has the ability to locate the promoter region of the plant genome.
Collapse
|
39
|
Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010; 130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]
|
40
|
Rangannan V, Bansal M. High-quality annotation of promoter regions for 913 bacterial genomes. ACTA ACUST UNITED AC 2010; 26:3043-50. [PMID: 20956245 DOI: 10.1093/bioinformatics/btq577] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
Abstract
MOTIVATION The number of bacterial genomes being sequenced is increasing very rapidly and hence, it is crucial to have procedures for rapid and reliable annotation of their functional elements such as promoter regions, which control the expression of each gene or each transcription unit of the genome. The present work addresses this requirement and presents a generic method applicable across organisms. RESULTS Relative stability of the DNA double helical sequences has been used to discriminate promoter regions from non-promoter regions. Based on the difference in stability between neighboring regions, an algorithm has been implemented to predict promoter regions on a large scale over 913 microbial genome sequences. The average free energy values for the promoter regions as well as their downstream regions are found to differ, depending on their GC content. Threshold values to identify promoter regions have been derived using sequences flanking a subset of translation start sites from all microbial genomes and then used to predict promoters over the complete genome sequences. An average recall value of 72% (which indicates the percentage of protein and RNA coding genes with predicted promoter regions assigned to them) and precision of 56% is achieved over the 913 microbial genome dataset. AVAILABILITY The binary executable for 'PromPredict' algorithm (implemented in PERL and supported on Linux and MS Windows) and the predicted promoter data for all 913 microbial genomes are available at http://nucleix.mbu.iisc.ernet.in/prombase/.
Collapse
|
41
|
Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:550-562. [PMID: 20671324 DOI: 10.1109/tcbb.2008.95] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
This paper integrates the signal, context, and structure features for genome-wide human promoter recognition, which is important in improving genome annotation and analyzing transcriptional regulation without experimental supports of ESTs, cDNAs, or mRNAs. First, CpG islands are salient biological signals associated with approximately 50 percent of mammalian promoters. Second, the genomic context of promoters may have biological significance, which is based on n-mers (sequences of n bases long) and their statistics estimated from training samples. Third, sequence-dependent DNA flexibility originates from DNA 3D structures and plays an important role in guiding transcription factors to the target site in promoters. Employing decision trees, we combine above signal, context, and structure features to build a hierarchical promoter recognition system called SCS. Experimental results on controlled data sets and the entire human genome demonstrate that SCS is significantly superior in terms of sensitivity and specificity as compared to other state-of-the-art methods. The SCS promoter recognition system is available online as supplemental materials for academic use and can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.95.
Collapse
Affiliation(s)
- Jia Zeng
- School of Computer Science and Technology, Soochow University, Suzhou, China.
| | | | | | | |
Collapse
|
42
|
Osypov AA, Krutinin GG, Kamzolova SG. Deppdb--DNA electrostatic potential properties database: electrostatic properties of genome DNA. J Bioinform Comput Biol 2010; 8:413-25. [PMID: 20556853 DOI: 10.1142/s0219720010004811] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2009] [Revised: 01/28/2010] [Accepted: 02/12/2010] [Indexed: 11/18/2022]
Abstract
The electrostatic properties of genome DNA influence its interactions with different proteins, in particular, the regulation of transcription by RNA-polymerases. DEPPDB--DNA Electrostatic Potential Properties Database--was developed to hold and provide all available information on the electrostatic properties of genome DNA combined with its sequence and annotation of biological and structural properties of genome elements and whole genomes. Genomes in DEPPDB are organized on a taxonomical basis. Currently, the database contains all the completely sequenced bacterial and viral genomes according to NCBI RefSeq. General properties of the genome DNA electrostatic potential profile and principles of its formation are revealed. This potential correlates with the GC content but does not correspond to it exactly and strongly depends on both the sequence arrangement and its context (flanking regions). Analysis of the promoter regions for bacterial and viral RNA polymerases revealed a correspondence between the scale of these proteins' physical properties and electrostatic profile patterns. We also discovered a direct correlation between the potential value and the binding frequency of RNA polymerase to DNA, supporting the idea of the role of electrostatics in these interactions. This matches a pronounced tendency of the promoter regions to possess higher values of the electrostatic potential.
Collapse
Affiliation(s)
- Alexander A Osypov
- Laboratory of Mechanisms of the Cell Genome Functioning, Institute of Cell Biophysics RAS, Pushchino 142290, Russia.
| | | | | |
Collapse
|
43
|
Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S, Markenscoff-Papadimitriou E, Kuhl D, Bito H, Worley PF, Kreiman G, Greenberg ME. Widespread transcription at neuronal activity-regulated enhancers. Nature 2010; 465:182-7. [PMID: 20393465 PMCID: PMC3020079 DOI: 10.1038/nature09033] [Citation(s) in RCA: 1838] [Impact Index Per Article: 122.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2009] [Accepted: 03/25/2010] [Indexed: 01/12/2023]
Abstract
We used genome-wide sequencing methods to study stimulus-dependent enhancer function in mouse cortical neurons. We identified approximately 12,000 neuronal activity-regulated enhancers that are bound by the general transcriptional co-activator CBP in an activity-dependent manner. A function of CBP at enhancers may be to recruit RNA polymerase II (RNAPII), as we also observed activity-regulated RNAPII binding to thousands of enhancers. Notably, RNAPII at enhancers transcribes bi-directionally a novel class of enhancer RNAs (eRNAs) within enhancer domains defined by the presence of histone H3 monomethylated at lysine 4. The level of eRNA expression at neuronal enhancers positively correlates with the level of messenger RNA synthesis at nearby genes, suggesting that eRNA synthesis occurs specifically at enhancers that are actively engaged in promoting mRNA synthesis. These findings reveal that a widespread mechanism of enhancer activation involves RNAPII binding and eRNA synthesis.
Collapse
Affiliation(s)
- Tae-Kyung Kim
- Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
| | - Martin Hemberg
- Department of Ophthalmology, Children's Hospital Boston, Center for Brain Science and Swartz Center for Theoretical Neuroscience, Harvard University, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Jesse M. Gray
- Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
| | - Allen M. Costa
- Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
| | - Daniel M. Bear
- Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
| | - Jing Wu
- The Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, 725 North Wolfe St., Baltimore, MD 21205, USA
| | - David A. Harmin
- Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
- Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Mike Laptewicz
- Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
| | - Kellie Barbara-Haley
- Molecular Genetics Core facility, Children's Hospital Boston, 300 Longwood Ave, Boston, MA 02115, USA
| | - Scott Kuersten
- Epicentre Biotechnologies, 726 Post Road, Madison, WI 53713, USA
| | | | - Dietmar Kuhl
- Institute for Molecular and Cellular Cognition (IMCC), Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Falkenried 94, 20251 Hamburg, Germany
| | - Haruhiko Bito
- Department of Neurochemistry, Graduate School of Medicine, University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
| | - Paul F. Worley
- The Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, 725 North Wolfe St., Baltimore, MD 21205, USA
| | - Gabriel Kreiman
- Department of Ophthalmology, Children's Hospital Boston, Center for Brain Science and Swartz Center for Theoretical Neuroscience, Harvard University, 300 Longwood Avenue, Boston, MA 02115, USA
| | - Michael E. Greenberg
- Department of Neurobiology, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115, USA
| |
Collapse
|
44
|
Dineen DG, Wilm A, Cunningham P, Higgins DG. High DNA melting temperature predicts transcription start site location in human and mouse. Nucleic Acids Res 2010; 37:7360-7. [PMID: 19820114 PMCID: PMC2794178 DOI: 10.1093/nar/gkp821] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
The accurate computational prediction of transcription start sites (TSS) in vertebrate genomes is a difficult problem. The physicochemical properties of DNA can be computed in various ways and a many combinations of DNA features have been tested in the past for use as predictors of transcription. We looked in detail at melting temperature, which measures the temperature, at which two strands of DNA separate, considering the cooperative nature of this process. We find that peaks in melting temperature correspond closely to experimentally determined transcription start sites in human and mouse chromosomes. Using melting temperature alone, and with simple thresholding, we can predict TSS with accuracy that is competitive with the most accurate state-of-the-art TSS prediction methods. Accuracy is measured using both experimentally and manually determined TSS. The method works especially well with CpG island containing promoters, but also works when CpG islands are absent. This result is clear evidence of the important role of the physical properties of DNA in the process of transcription. It also points to the importance for TSS prediction methods to include melting temperature as prior information.
Collapse
Affiliation(s)
- David G Dineen
- Complex and Adaptive Systems Laboratory (CASL), University College Dublin, Belfield, Dublin 4, Ireland.
| | | | | | | |
Collapse
|
45
|
Gupta R, Wikramasinghe P, Bhattacharyya A, Perez FA, Pal S, Davuluri RV. Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data. BMC Bioinformatics 2010; 11 Suppl 1:S65. [PMID: 20122241 PMCID: PMC3009539 DOI: 10.1186/1471-2105-11-s1-s65] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. Chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) or massive parallel sequencing (ChIP-seq) are enabling genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near the gene promoters but also inside the genes and other genomic regions due to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cellular type and context. Methods We trained and tested different state-of-art ensemble and meta classification methods for identification of Pol-II enriched promoter and Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a bench-mark dataset, using a set of 39 different feature variables that are based on chromatin modification signatures and various DNA sequence features. The best performing model was applied on seven published ChIP-seq Pol-II datasets to provide genome wide annotation of mouse gene promoters. Results We present a novel algorithm based on supervised learning methods to discriminate promoter associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We accumulated a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (Brain, Kidney, Liver, Lung and Spleen). We evaluated the classification models in building the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We implemented the algorithm on seven different published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 (4,747) protein-coding (non-coding) genes with single promoters and 9,929 (1,858) protein-coding (non-coding) genes with two or more alternative promoters, and a significant number of unassigned novel promoters. Conclusion Our new algorithm can successfully predict the promoters from the genome wide profile of Pol-II bound regions. In addition, our algorithm performs significantly better than existing promoter prediction methods and can be applied for genome-wide predictions of Pol-II promoters.
Collapse
Affiliation(s)
- Ravi Gupta
- Center for Systems and Computational Biology, Molecular and Cellular Oncogenesis Program, The Wistar Institute, Philadelphia, PA, USA.
| | | | | | | | | | | |
Collapse
|
46
|
Abstract
Motivation: Promoter prediction is an important task in genome annotation projects, and during the past years many new promoter prediction programs (PPPs) have emerged. However, many of these programs are compared inadequately to other programs. In most cases, only a small portion of the genome is used to evaluate the program, which is not a realistic setting for whole genome annotation projects. In addition, a common evaluation design to properly compare PPPs is still lacking. Results: We present a large-scale benchmarking study of 17 state-of-the-art PPPs. A multi-faceted evaluation strategy is proposed that can be used as a gold standard for promoter prediction evaluation, allowing authors of promoter prediction software to compare their method to existing methods in a proper way. This evaluation strategy is subsequently used to compare the chosen promoter predictors, and an in-depth analysis on predictive performance, promoter class specificity, overlap between predictors and positional bias of the predictions is conducted. Availability: We provide the implementations of the four protocols, as well as the datasets required to perform the benchmarks to the academic community free of charge on request. Contact:yves.vandepeer@psb.ugent.be Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Thomas Abeel
- Department of Plant Systems Biology, VIB, Ghent University, Gent, Belgium
| | | | | |
Collapse
|
47
|
Zeng J, Zhu S, Yan H. Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Brief Bioinform 2009; 10:498-508. [PMID: 19531545 DOI: 10.1093/bib/bbp027] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
This review describes important advances that have been made during the past decade for genome-wide human promoter recognition. Interest in promoter recognition algorithms on a genome-wide scale is worldwide and touches on a number of practical systems that are important in analysis of gene regulation and in genome annotation without experimental support of ESTs, cDNAs or mRNAs. The main focus of this review is on feature extraction and model selection for accurate human promoter recognition, with descriptions of what they are, what has been accomplished, and what remains to be done.
Collapse
Affiliation(s)
- Jia Zeng
- Department of Computer Science, Hong Kong Baptist University, Kowloon, Hong Kong.
| | | | | |
Collapse
|
48
|
Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS 2009; 8:215-30. [PMID: 19498043 DOI: 10.1093/bfgp/elp014] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
Proper development and functioning of an organism depends on precise spatial and temporal expression of all its genes. These coordinated expression-patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. Available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information like gene expression data, chromatin structure, DNA-binding specificities of transcription factors, cooperativity of transcription factors, etc. are highlighted. We also discuss the relevance of regulatory elements in the context of human health through examples of mutations in some of these regions having serious implications in misregulation of genes and being strongly associated with human disorders.
Collapse
Affiliation(s)
- Leelavati Narlikar
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | | |
Collapse
|
49
|
Auger H, Lamy C, Haeussler M, Khoueiry P, Lemaire P, Joly JS. Similar regulatory logic in Ciona intestinalis for two Wnt pathway modulators, ROR and SFRP-1/5. Dev Biol 2009; 329:364-73. [PMID: 19248777 DOI: 10.1016/j.ydbio.2009.02.018] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2008] [Revised: 01/22/2009] [Accepted: 02/03/2009] [Indexed: 10/21/2022]
Abstract
Anteroposterior patterning of the ectoderm in the invertebrate chordate Ciona intestinalis first relies on key zygotic activators, such as FoxA, and later on the coordinated responses to inducing signals from the underlying mesendoderm. Here, we focus on a mechanism of coordination of these responses by looking at the cis-regulatory logics of Ci-Rora and Ci-Rorb, which are coding for putative non-canonical transmembrane Wnt receptors, and are present in tandem along the C. intestinalis chromosome 08q. We showed that during cleavage stages, both Ci-Rora and Ci-Rorb genes are initially expressed in all blastomeres of the anterior ectoderm (a-line), as sFRP1/5 (Lamy, C., Rothbächer, U., Caillol, D., Lemaire, P., 2006. Ci-FoxA-a is the earliest zygotic determinant of the ascidian anterior ectoderm and directly activates Ci-sFRP1/5. Development 133, 2835-2844.). We then carried out a functional analysis of cis-regulatory regions and showed that both genes have elements enriched in Ci-FoxA-a binding sites. We dissected one of these early enhancers, and showed that it is directly activated by Ci-FoxA-a, as one sFRP1/5 cis-regulatory element. We also showed that although FoxA binding sites are abundant in genomes, dense clusters of these sites are found upstream from very few genes, and have a high predictive value of a-line expression. These data indicate an important role for FoxA in anterior specification, via the transcriptional regulation of target genes belonging to various signalling pathways.
Collapse
Affiliation(s)
- Hélène Auger
- INRA "Morphogenèse du Système Nerveux des Chordés" Group, DEPSN, UPR2197, Institut Fessard, CNRS, 1 Avenue de la Terrasse, 91198 GIF SUR YVETTE, France
| | | | | | | | | | | |
Collapse
|
50
|
Megraw M, Pereira F, Jensen ST, Ohler U, Hatzigeorgiou AG. A transcription factor affinity-based code for mammalian transcription initiation. Genome Res 2009; 19:644-56. [PMID: 19141595 DOI: 10.1101/gr.085449.108] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023]
Abstract
The recent arrival of large-scale cap analysis of gene expression (CAGE) data sets in mammals provides a wealth of quantitative information on coding and noncoding RNA polymerase II transcription start sites (TSS). Genome-wide CAGE studies reveal that a large fraction of TSS exhibit peaks where the vast majority of associated tags map to a particular location ( approximately 45%), whereas other active regions contain a broader distribution of initiation events. The presence of a strong single peak suggests that transcription at these locations may be mediated by position-specific sequence features. We therefore propose a new model for single-peaked TSS based solely on known transcription factors (TFs) and their respective regions of positional enrichment. This probabilistic model leads to near-perfect classification results in cross-validation (auROC = 0.98), and performance in genomic scans demonstrates that TSS prediction with both high accuracy and spatial resolution is achievable for a specific but large subgroup of mammalian promoters. The interpretable model structure suggests a DNA code in which canonical sequence features such as TATA-box, Initiator, and GC content do play a significant role, but many additional TFs show distinct spatial biases with respect to TSS location and are important contributors to the accurate prediction of single-peak transcription initiation sites. The model structure also reveals that CAGE tag clusters distal from annotated gene starts have distinct characteristics compared to those close to gene 5'-ends. Using this high-resolution single-peak model, we predict TSS for approximately 70% of mammalian microRNAs based on currently available data.
Collapse
Affiliation(s)
- Molly Megraw
- Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina 27708, USA
| | | | | | | | | |
Collapse
|