Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Bajic VB, Brent MR, Brown RH, Frankish A, Harrow J, Ohler U, Solovyev VV, Tan SL. Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment. Genome Biol 2006;7 Suppl 1:S3.1-13. [PMID: 16925837 PMCID: PMC1810552 DOI: 10.1186/gb-2006-7-s1-s3] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open



Cited by Other Article(s)

Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023;9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]

Abstract

Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have proven exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited than SVMs to working with sequences. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.


BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection. Comput Biol Chem 2022;99:107732. [PMID: 35863177 DOI: 10.1016/j.compbiolchem.2022.107732] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Accepted: 07/12/2022] [Indexed: 02/01/2023]

Perez Martell RI, Ziesel A, Jabbari H, Stege U. Supervised promoter recognition: a benchmark framework. BMC Bioinformatics 2022;23:118. [PMID: 35366794 PMCID: PMC8976979 DOI: 10.1186/s12859-022-04647-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2022] [Accepted: 03/16/2022] [Indexed: 11/10/2022] Open

Senthilkumar S, Vinod KK, Parthiban S, Thirugnanasambandam P, Lakshmi Pathy T, Banerjee N, Sarath Padmanabhan TS, Govindaraj P. Identification of potential MTAs and candidate genes for juice quality- and yield-related traits in Saccharum clones: a genome-wide association and comparative genomic study. Mol Genet Genomics 2022;297:635-654. [PMID: 35257240 DOI: 10.1007/s00438-022-01870-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2021] [Accepted: 02/06/2022] [Indexed: 11/30/2022]

Abstract

Sugarcane is an economically important commercial crop that provides raw material for the production of sugar, jaggery, bioethanol, biomass and other by-products. Sugarcane breeding still relies heavily on conventional breeding approaches, which are time-consuming, laborious and costly. Integration of marker-assisted selection (MAS) into sugarcane genetic improvement programs for difficult-to-select traits such as sucrose content, resistance to pests and diseases, and tolerance to abiotic stresses will accelerate varietal development. In the present study, an association mapping approach was used to identify QTLs and genes associated with sucrose and other important yield-contributing traits. A mapping panel of 110 diverse sugarcane genotypes and 148 microsatellite primers was used for a structured association mapping study. An optimal subpopulation number (ΔK) of 5 was identified by structure analysis. GWAS analysis using TASSEL identified a total of 110 MTAs, which were localized into 27 QTLs by GLM and MLM (Q + K, PC + K) approaches. Among the 24 QTLs sequenced, 12 were able to identify potential candidate genes, viz., starch branching enzyme, starch synthase 4, sugar transporters and G3P-DH related to carbohydrate metabolism and hormone pathway-related genes ethylene insensitive 3-like 1, reversion to ethylene sensitive1-like, and auxin response factor associated with juice quality- and yield-related traits. Six markers, NKS 5_185, SCB 270_144, SCB 370_256, NKS 46_176 and UGSM 648_245, associated with juice quality traits, and marker SMC31CUQ_304, associated with NMC, were validated and identified as significantly associated with the traits by one-way ANOVA. In conclusion, 24 potential QTLs identified in the present study could be used in sugarcane breeding programs after further validation in a larger population. The candidate genes from carbohydrate and hormone response pathways presented in this study could be manipulated with genome editing approaches to further improve the sugarcane crop.


Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022;23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023] Open

Affiliation(s)

Meng Zhang
Cangzhi Jia
Fuyi Li
Chen Li
Yan Zhu
Tatsuya Akutsu
Geoffrey I Webb, Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
Quan Zou
Lachlan J M Coin
Jiangning Song

Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China.


Chhabra R, Muthusamy V, Gain N, Katral A, Prakash NR, Zunjare RU, Hossain F. Allelic variation in sugary1 gene affecting kernel sweetness among diverse-mutant and -wild-type maize inbreds. Mol Genet Genomics 2021;296:1085-1102. [PMID: 34159441 DOI: 10.1007/s00438-021-01807-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2021] [Accepted: 06/16/2021] [Indexed: 12/01/2022]

Singh S, Singh A. A prescient evolutionary model for genesis, duplication and differentiation of MIR160 homologs in Brassicaceae. Mol Genet Genomics 2021;296:985-1003. [PMID: 34052911 DOI: 10.1007/s00438-021-01797-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2020] [Accepted: 05/21/2021] [Indexed: 12/18/2022]

Abstract

MicroRNA160 is a class of nitrogen-starvation responsive genes which governs establishment of root system architecture by down-regulating AUXIN RESPONSE FACTOR genes (ARF10, ARF16 and ARF17) in plants. The high copy number of MIR160 variants discovered by us from land plants, especially polyploid crop Brassicas, posed questions regarding genesis, duplication, evolution and function. The absence of studies on the impact of whole genome and segmental duplication on retention and evolution of MIR160 homologs in descendent plant lineages prompted us to undertake the current study. Herein, we describe the ancestry and fate of MIR160 homologs in Brassicaceae in the context of polyploidy-driven genome re-organization, copy number and differentiation. Paralogy amongst Brassicaceae MIR160a, MIR160b and MIR160c was inferred using phylogenetic analysis of 468 MIR160 homologs from land plants. The evolutionarily distinct MIR160a was found to represent the ancestral form and progenitor of MIR160b and MIR160c. The chronology of evolutionary events resulting in the origin and diversification of genomic loci containing MIR160 homologs was delineated using derivatives of comparative synteny. A prescient model for causality of segmental duplications in establishment of paralogy in Brassicaceae MIR160, with whole genome duplication accentuating the copy number increase, is posited, in which post-segmental duplication events, viz. differential gene fractionation, gene duplications and inversions, are shown to drive divergence of chromosome segments. While mutations caused the diversification of MIR160a, MIR160b and MIR160c, duplicated segments containing these diversified genes suffered gene rearrangements via gene loss, duplications and inversions. Yet the topologies of the phylogenetic and phenetic trees were found to be congruent, suggesting a similar evolutionary trajectory. Over 80% of Brassicaceae genomes and subgenomes showed a preferential retention of a single copy each of MIR160a, MIR160b and MIR160c, suggesting functional relevance. Thus, our study provides a blueprint for reconstructing ancestry and phylogeny of MIRNA gene families at the genomics level and analyzing the impact of polyploidy on organismal complexity. Such studies are critical for understanding the molecular basis of agronomic traits and deploying appropriate candidates for crop improvement.


Liu B, Han L, Liu X, Wu J, Ma Q. Computational Prediction of Sigma-54 Promoters in Bacterial Genomes by Integrating Motif Finding and Machine Learning Strategies. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019;16:1211-1218. [PMID: 29993815 DOI: 10.1109/tcbb.2018.2816032] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]

Promoter analysis and prediction in the human genome using sequence-based deep learning models. Bioinformatics 2019;35:2730-2737. [DOI: 10.1093/bioinformatics/bty1068] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2018] [Revised: 12/03/2018] [Accepted: 12/27/2018] [Indexed: 12/14/2022] Open

Triska M, Solovyev V, Baranova A, Kel A, Tatarinova TV. Nucleotide patterns aiding in prediction of eukaryotic promoters. PLoS One 2017;12:e0187243. [PMID: 29141011 PMCID: PMC5687710 DOI: 10.1371/journal.pone.0187243] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2017] [Accepted: 09/05/2017] [Indexed: 01/09/2023] Open

Umarov RK, Solovyev VV. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS One 2017;12:e0171410. [PMID: 28158264 PMCID: PMC5291440 DOI: 10.1371/journal.pone.0171410] [Citation(s) in RCA: 142] [Impact Index Per Article: 17.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2016] [Accepted: 01/20/2017] [Indexed: 11/18/2022] Open

Abstract

Accurate computational identification of promoters remains a challenge as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we utilize Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build their predictive models. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). We found that CNN trained on sigma70 subclass of Escherichia coli promoter gives an excellent classification of promoters and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The Bacillus subtilis promoters identification CNN model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse and Arabidopsis promoters we employed CNNs for identification of two well-known promoter classes (TATA and non-TATA promoters). CNN models nicely recognize these complex functional regions. For human promoters Sn/Sp/CC accuracy of prediction reached 0.95/0.98/0.90 on TATA and 0.90/0.98/0.89 for non-TATA promoter sequences, respectively. For Arabidopsis we observed Sn/Sp/CC 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA) promoters. Thus, the developed CNN models, implemented in CNNProm program, demonstrated the ability of deep learning approach to grasp complex promoter sequence characteristics and achieve significantly higher accuracy compared to the previously developed promoter prediction programs. We also propose random substitution procedure to discover positionally conserved promoter functional elements. As the suggested approach does not require knowledge of any specific promoter features, it can be easily extended to identify promoters and other complex functional regions in sequences of many other and especially newly sequenced genomes. The CNNProm program is available to run at web server http://www.softberry.com.
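
The abstract above reports Sn, Sp and CC without defining them. As a minimal sketch, assuming Sn is sensitivity (TP / (TP + FN)), Sp is specificity (TN / (TN + FP)) and CC is the Matthews correlation coefficient, the three values can be computed from a confusion matrix as follows. The function name and the counts are hypothetical and are not taken from the paper; some promoter-prediction studies use "Sp" for positive predictive value instead, so the definition should be checked against the article itself.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return Sn, Sp and CC for a binary promoter/non-promoter classifier.

    ASSUMPTION: Sn = sensitivity, Sp = specificity, CC = Matthews
    correlation coefficient. The source abstract does not define them.
    """
    sn = tp / (tp + fn)  # fraction of true promoters recovered
    sp = tn / (tn + fp)  # fraction of non-promoters correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # CC is undefined when any marginal total is zero; report 0.0 in that case
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "CC": cc}


# Hypothetical counts for illustration only (100 promoters, 100 non-promoters)
metrics = classification_metrics(tp=90, fp=4, tn=96, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because CC depends on all four cells of the confusion matrix, the same Sn and Sp can yield different CC values when the ratio of promoter to non-promoter sequences changes.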


Lacadie SA, Ibrahim MM, Gokhale SA, Ohler U. Divergent transcription and epigenetic directionality of human promoters. FEBS J 2016;283:4214-4222. [DOI: 10.1111/febs.13747] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2015] [Revised: 02/08/2016] [Accepted: 04/25/2016] [Indexed: 11/26/2022]

Barson G, Griffiths E. SeqTools: visual tools for manual analysis of sequence alignments. BMC Res Notes 2016;9:39. [PMID: 26801397 PMCID: PMC4724122 DOI: 10.1186/s13104-016-1847-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2015] [Accepted: 01/08/2016] [Indexed: 11/23/2022] Open

Abstract

Background

Manual annotation is essential to create high-quality reference alignments and annotation. Annotators need to be able to view sequence alignments in detail. The SeqTools package provides three tools for viewing different types of sequence alignment: Blixem is a many-to-one browser of pairwise alignments, displaying multiple match sequences aligned against a single reference sequence; Dotter provides a graphical dot-plot view of a single pairwise alignment; and Belvu is a multiple sequence alignment viewer, editor, and phylogenetic tool. These tools were originally part of the AceDB genome database system but have been completely rewritten to make them generally available as a standalone package of greatly improved function.

Findings

Blixem is used by annotators to give a detailed view of the evidence for particular gene models. Blixem displays the gene model positions and the match sequences aligned against the genomic reference sequence. Annotators use this for many reasons, including to check the quality of an alignment, to find missing/misaligned sequence and to identify splice sites and polyA sites and signals. Dotter is used to give a dot-plot representation of a particular pairwise alignment. This is used to identify sequence that is not represented (or is misrepresented) and to quickly compare annotated gene models with transcriptional and protein evidence that putatively supports them. Belvu is used to analyse conservation patterns in multiple sequence alignments and to perform a combination of manual and automatic processing of the alignment. High-quality reference alignments are essential if they are to be used as a starting point for further automatic alignment generation.

Conclusions

While there are many different alignment tools available, the SeqTools package provides unique functionality that annotators have found to be essential for analysing sequence alignments as part of the manual annotation process.
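
To make the dot-plot idea mentioned for Dotter concrete, here is a minimal sketch of an exact-match dot plot. It is not Dotter's algorithm or interface: the function names are hypothetical, and exact window matching is a simplifying assumption standing in for the scored comparisons that a real alignment viewer would use.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1) -> list[list[bool]]:
    """Return a matrix where cell [i][j] is True when the window starting at
    position i of seq_a exactly equals the window starting at j of seq_b."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix: list[list[bool]]) -> str:
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# Comparing a sequence with itself puts matches on the main diagonal
plot = dot_plot("GATTACA", "GATTACA", window=3)
print(render(plot))
```

A larger window suppresses chance single-base matches, so shared regions stand out as diagonal runs while unmatched sequence appears as gaps in the diagonal.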

Electronic supplementary material

The online version of this article (doi:10.1186/s13104-016-1847-3) contains supplementary material, which is available to authorized users.


Yella VR, Bansal M. In silico Identification of Eukaryotic Promoters. SYSTEMS AND SYNTHETIC BIOLOGY 2015. [DOI: 10.1007/978-94-017-9514-2_4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]

Mesoscopic model and free energy landscape for protein-DNA binding sites: analysis of cyanobacterial promoters. PLoS Comput Biol 2014;10:e1003835. [PMID: 25275384 PMCID: PMC4183373 DOI: 10.1371/journal.pcbi.1003835] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2014] [Accepted: 07/26/2014] [Indexed: 01/23/2023] Open

Abstract

The identification of protein binding sites in promoter sequences is a key problem for understanding and controlling regulation in biochemistry and biotechnological processes. We use a computational method to analyze promoters from a given genome. Our approach is based on a mesoscopic physical model of protein-DNA interaction that captures the influence of DNA local conformation on the dynamics of a general particle along the chain. Following the proposed model, the joint dynamics of the protein particle and the DNA portion of interest, characterized only by its base pair sequence, is simulated. The simulation output is analyzed by generating and analyzing the Free Energy Landscape of the system. To test the predictive capacity of our computational method, we analyzed nine promoters of Anabaena PCC 7120. We are able to identify the transcription start site of each of the promoters as the most populated macrostate in the dynamics. The developed procedure also allows promoter macrostates to be characterized in terms of thermo-statistical magnitudes (free energy and entropy), with valuable biological implications. Our results agree with independent previous experimental results. Thus, our method appears to be a powerful complementary tool for identifying protein binding sites in promoter sequences.

Binding of specific proteins to particular sites in the DNA sequence is a fundamental issue for gene regulation in molecular biology and genetic engineering. A deep understanding of cell physiology requires the analysis of a plethora of genes, involving characterization of the promoter architectures that determine their regulation and gene transcription. In order to locate the promoter elements of a given gene, experimental determination of its transcription start site (TSS) is required. This is an expensive, time-consuming task that, depending on our requirements, could be simplified using computational analysis as a first approach. Nevertheless, most computational methods lack a physical basis in the protein-DNA interaction mechanism. Here we adopt this strategy, using a simple model for protein-DNA interaction to find TSSs in a set of cyanobacterial promoters. We make use of physical tools to characterize these TSSs and to relate them to biological properties such as the relative strength of the promoter. Our study shows how a model based on a coarse-grained description of a biomolecule can give valuable insight into its biological function.


Kamath U, De Jong K, Shehu A. Effective automated feature construction and selection for classification of biological sequences. PLoS One 2014;9:e99982. [PMID: 25033270 PMCID: PMC4102475 DOI: 10.1371/journal.pone.0099982] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2013] [Accepted: 05/21/2014] [Indexed: 11/25/2022] Open

Abstract

BACKGROUND

Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features.

METHODOLOGY

We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences, which state-of-the-art work in machine learning shows to be challenging and to involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select the most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not.

RESULTS

To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retainment or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.


Eisenhaber F. Unix interfaces, Kleisli, bucandin structure, etc. -- the heroic beginning of bioinformatics in Singapore. J Bioinform Comput Biol 2014;12:1471002. [PMID: 24969753 DOI: 10.1142/s0219720014710024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]

Grolmusz VK, Ács OD, Feldman-Kovács K, Szappanos Á, Stenczer B, Fekete T, Szendei G, Reismann P, Rácz K, Patócs A. Genetic variants of the HSD11B1 gene promoter may be protective against polycystic ovary syndrome. Mol Biol Rep 2014;41:5961-9. [DOI: 10.1007/s11033-014-3473-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2014] [Accepted: 06/14/2014] [Indexed: 01/08/2023]

Lin Z, Guo Z, Xu Y, Zhao X. Identification of a secondary promoter of CASP8 and its related transcription factor PURα. Int J Oncol 2014;45:57-66. [PMID: 24819879 PMCID: PMC4079158 DOI: 10.3892/ijo.2014.2436] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2014] [Accepted: 04/11/2014] [Indexed: 01/18/2023] Open

Bansal M, Kumar A, Yella VR. Role of DNA sequence based structural features of promoters in transcription initiation and gene expression. Curr Opin Struct Biol 2014;25:77-85. [PMID: 24503515 DOI: 10.1016/j.sbi.2014.01.007] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2013] [Accepted: 01/07/2014] [Indexed: 11/18/2022]

Durán E, Djebali S, González S, Flores O, Mercader JM, Guigó R, Torrents D, Soler-López M, Orozco M. Unravelling the hidden DNA structural/physical code provides novel insights on promoter location. Nucleic Acids Res 2013;41:7220-30. [PMID: 23761436 PMCID: PMC3753636 DOI: 10.1093/nar/gkt511] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open

Datta S, Mukhopadhyay S. A composite method based on formal grammar and DNA structural features in detecting human polymerase II promoter region. PLoS One 2013;8:e54843. [PMID: 23437045 PMCID: PMC3577817 DOI: 10.1371/journal.pone.0054843] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2012] [Accepted: 12/17/2012] [Indexed: 11/25/2022] Open

Dineen DG, Schröder M, Higgins DG, Cunningham P. Ensemble approach combining multiple methods improves human transcription start site prediction. BMC Genomics 2010;11:677. [PMID: 21118509 PMCID: PMC3053590 DOI: 10.1186/1471-2164-11-677] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2010] [Accepted: 11/30/2010] [Indexed: 11/20/2022] Open

Schaefer U, Kodzius R, Kai C, Kawai J, Carninci P, Hayashizaki Y, Bajic VB. High sensitivity TSS prediction: estimates of locations where TSS cannot occur. PLoS One 2010;5:e13934. [PMID: 21085627 PMCID: PMC2981523 DOI: 10.1371/journal.pone.0013934] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2010] [Accepted: 10/19/2010] [Indexed: 11/26/2022] Open

Abstract

Background

Although transcription in mammalian genomes can initiate from various genomic positions (e.g., 3′UTR, coding exons, etc.), most locations on genomes are not prone to transcription initiation. It is of practical and theoretical interest to be able to estimate such collections of non-TSS locations (NTLs). The identification of large portions of NTLs can contribute to better focusing the search for TSS locations and thus contribute to promoter and gene finding. It can help in the assessment of 5′ completeness of expressed sequences, contribute to more successful experimental designs, as well as more accurate gene annotation.

Methodology

Using comprehensive collections of Cap Analysis of Gene Expression (CAGE) and other transcript data from mouse and human genomes, we developed a methodology that allows us, by performing computational TSS prediction with very high sensitivity, to annotate, with a high accuracy in a strand specific manner, locations of mammalian genomes that are highly unlikely to harbor transcription start sites (TSSs). The properties of the immediate genomic neighborhood of 98,682 accurately determined mouse and 113,814 human TSSs are used to determine features that distinguish genomic transcription initiation locations from those that are not likely to initiate transcription. In our algorithm we utilize various constraining properties of features identified in the upstream and downstream regions around TSSs, as well as statistical analyses of these surrounding regions.

Conclusions

Our analysis of human chromosomes 4, 21 and 22 estimates ∼46%, ∼41% and ∼27% of these chromosomes, respectively, as being NTLs. This suggests that on average more than 40% of the human genome can be expected to be highly unlikely to initiate transcription. Our method is the first to use high-sensitivity TSS prediction to identify, with high accuracy, large portions of mammalian genomes as NTLs. The server implementing our algorithm is available at http://cbrc.kaust.edu.sa/ddm/.

Zeng J, Zhao XY, Cao XQ, Yan H. SCS: signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010;7:550-562. [PMID: 20671324 DOI: 10.1109/tcbb.2008.95] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]

Dineen DG, Wilm A, Cunningham P, Higgins DG. High DNA melting temperature predicts transcription start site location in human and mouse. Nucleic Acids Res 2010;37:7360-7. [PMID: 19820114 PMCID: PMC2794178 DOI: 10.1093/nar/gkp821] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open

Stanke M. Computational Gene Prediction in Eukaryotic Genomes. CELLULAR ORIGIN, LIFE IN EXTREME HABITATS AND ASTROBIOLOGY 2010:291-306. [DOI: 10.1007/978-90-481-3795-4_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/02/2023]

Rocha AA, Morais FV, Puccia R. Polymorphism in the flanking regions of the PbGP43 gene from the human pathogen Paracoccidioides brasiliensis: search for protein binding sequences and poly(A) cleavage sites. BMC Microbiol 2009;9:277. [PMID: 20042084 PMCID: PMC2809070 DOI: 10.1186/1471-2180-9-277] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2009] [Accepted: 12/30/2009] [Indexed: 11/24/2022] Open

Abstract

Background

Paracoccidioides brasiliensis is a thermo-dimorphic fungus that causes paracoccidioidomycosis (PCM). Glycoprotein gp43 is the main fungal diagnostic antigen; it can also protect against murine PCM and interact with extracellular matrix proteins. It is structurally related to glucanases but is not active, and its expression varies considerably. We studied polymorphisms in the PbGP43 flanking regions to help understand such variations.

Results

We tested the protein-binding capacity of oligonucleotides covering the PbGP43 proximal 5' flanking region, including overlapping and mutated probes. We used electrophoretic mobility shift assays and found DNA binding regions between positions -134 to -103 and -255 to -215. Only the mutation at -230, characteristic of P. brasiliensis phylogenetic species PS2, altered binding affinity. Next, we cloned and sequenced the 5' intergenic region up to position -2,047 from P. brasiliensis Pb339 and observed that it is composed of three tandem repetitive regions of about 500 bp preceded upstream by 442 bp. Corresponding PCR fragments of about 2,000 bp were found in eight out of fourteen isolates; in PS2 samples they were 1,500 bp long due to the absence of one repetitive region, as detected in Pb3. We also compared fifty-six PbGP43 3' UTR sequences from ten isolates and observed no polymorphisms; however, we detected two main poly(A) clusters (1,420 to 1,441 and 1,451 to 1,457) of multiple cleavage sites. In a single isolate we found one to seven sites.

Conclusions

We observed that the amount of PbGP43 transcripts accumulated in P. brasiliensis Pb339 grown in defined medium was about 1,000-fold higher than in Pb18 and 120-fold higher than in Pb3. We have described a series of features in the gene flanking regions and differences among isolates, including DNA-binding sequences, which might impact gene regulation. Little is known about regulatory sequences in thermo-dimorphic fungi. The peculiar structure of tandem repetitive fragments in the 5' intergenic region of PbGP43 and their characteristic sequences, together with the presence of multiple poly(A) cleavage sites in the 3' UTR, will certainly guide future studies.

PromoterSweep: a tool for identification of transcription factor binding sites. Theor Chem Acc 2009. [DOI: 10.1007/s00214-009-0643-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]

Abeel T, Van de Peer Y, Saeys Y. Toward a gold standard for promoter prediction evaluation. Bioinformatics 2009;25:i313-20. [PMID: 19478005 PMCID: PMC2687945 DOI: 10.1093/bioinformatics/btp191] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]

Zeng J, Zhu S, Yan H. Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Brief Bioinform 2009;10:498-508. [PMID: 19531545 DOI: 10.1093/bib/bbp027] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS 2009;8:215-30. [PMID: 19498043 DOI: 10.1093/bfgp/elp014] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]

Ladunga I. Finding Homologs in Amino Acid Sequences Using Network BLAST Searches. Curr Protoc Bioinformatics 2009;Chapter 3:3.4.1-3.4.34. [DOI: 10.1002/0471250953.bi0304s25] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Vingron M, Brazma A, Coulson R, van Helden J, Manke T, Palin K, Sand O, Ukkonen E. Integrating sequence, evolution and functional genomics in regulatory genomics. Genome Biol 2009;10:202. [PMID: 19226437 PMCID: PMC2687781 DOI: 10.1186/gb-2009-10-1-202] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Brick K, Watanabe J, Pizzi E. Core promoters are predicted by their distinct physicochemical properties in the genome of Plasmodium falciparum. Genome Biol 2008;9:R178. [PMID: 19094208 PMCID: PMC2646282 DOI: 10.1186/gb-2008-9-12-r178] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2008] [Revised: 11/03/2008] [Accepted: 12/18/2008] [Indexed: 11/23/2022] Open

Wang X, Xuan Z, Zhao X, Li Y, Zhang MQ. High-resolution human core-promoter prediction with CoreBoost_HM. Genome Res 2008;19:266-75. [PMID: 18997002 DOI: 10.1101/gr.081638.108] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]

Gissot M, Kim K. How epigenomics contributes to the understanding of gene regulation in Toxoplasma gondii. J Eukaryot Microbiol 2008;55:476-80. [PMID: 19120792 PMCID: PMC2667958 DOI: 10.1111/j.1550-7408.2008.00366.x] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]

Abeel T, Saeys Y, Rouzé P, Van de Peer Y. ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles. Bioinformatics 2008;24:i24-31. [PMID: 18586720 PMCID: PMC2718650 DOI: 10.1093/bioinformatics/btn172] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open

Profiling the thermodynamic softness of adenoviral promoters. Biophys J 2008;95:597-608. [PMID: 18390611 DOI: 10.1529/biophysj.107.123471] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Goñi JR, Pérez A, Torrents D, Orozco M. Determining promoter location based on DNA structure first-principles calculations. Genome Biol 2008;8:R263. [PMID: 18072969 PMCID: PMC2246265 DOI: 10.1186/gb-2007-8-12-r263] [Citation(s) in RCA: 105] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2007] [Revised: 11/24/2007] [Accepted: 12/11/2007] [Indexed: 11/25/2022] Open

Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y. Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008;18:310-23. [PMID: 18096745 PMCID: PMC2203629 DOI: 10.1101/gr.6991408] [Citation(s) in RCA: 133] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2007] [Accepted: 11/14/2007] [Indexed: 11/24/2022]

Frith MC, Valen E, Krogh A, Hayashizaki Y, Carninci P, Sandelin A. A code for transcription initiation in mammalian genomes. Genome Res 2008;18:1-12. [PMID: 18032727 PMCID: PMC2134772 DOI: 10.1101/gr.6831208] [Citation(s) in RCA: 196] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2007] [Accepted: 10/14/2007] [Indexed: 11/24/2022]

Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G. Accurate splice site prediction using support vector machines. BMC Bioinformatics 2007;8 Suppl 10:S7. [PMID: 18269701 PMCID: PMC2230508 DOI: 10.1186/1471-2105-8-s10-s7] [Citation(s) in RCA: 118] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open

Wang J, Ungar LH, Tseng H, Hannenhalli S. MetaProm: a neural network based meta-predictor for alternative human promoter prediction. BMC Genomics 2007;8:374. [PMID: 17941982 PMCID: PMC2194789 DOI: 10.1186/1471-2164-8-374] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2007] [Accepted: 10/17/2007] [Indexed: 01/21/2023] Open

Zhao X, Xuan Z, Zhang MQ. Boosting with stumps for predicting transcription start sites. Genome Biol 2007;8:R17. [PMID: 17274821 PMCID: PMC1852414 DOI: 10.1186/gb-2007-8-2-r17] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2006] [Revised: 12/01/2006] [Accepted: 02/02/2007] [Indexed: 12/05/2022] Open

ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SCJ, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA,
Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, NISC Comparative Sequencing Program, Baylor College of Medicine Human Genome Sequencing Center, Washington University Genome Sequencing Center, Broad Institute, Children's Hospital Oakland Research Institute, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CWH, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JNS, Yu Y, Ruan Y, Iyer VR, Green RD, 
Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PIW, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrímsdóttir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VVB, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B, de Jong PJ. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007;447:799-816. [PMID: 17571346 PMCID: PMC2212820 DOI: 10.1038/nature05874] [Show More Authors] [Citation(s) in RCA: 3870] [Impact Index Per Article: 215.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]

Liu F, Tøstesen E, Sundet JK, Jenssen TK, Bock C, Jerstad GI, Thilly WG, Hovig E. The human genomic melting map. PLoS Comput Biol 2007;3:e93. [PMID: 17511513 PMCID: PMC1868775 DOI: 10.1371/journal.pcbi.0030093] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2006] [Accepted: 04/11/2007] [Indexed: 11/19/2022] Open

Yamamoto YY, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, Abe T. Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics 2007;8:67. [PMID: 17346352 PMCID: PMC1832190 DOI: 10.1186/1471-2164-8-67] [Citation(s) in RCA: 115] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2006] [Accepted: 03/08/2007] [Indexed: 11/20/2022] Open

Abstract

Background

Plant promoter architecture is important for understanding the regulation and evolution of promoters, but our current knowledge about plant promoter structure, especially with respect to the core promoter, is insufficient. Several promoter elements, including the TATA box and several types of transcriptional regulatory elements, have been found to show local distribution within promoters, and this feature has been successfully utilized for extraction of promoter constituents from the human genome.

Results

LDSS (Local Distribution of Short Sequences) profiles of short sequences along the plant promoter have been analyzed in silico, and hundreds of hexamer and octamer sequences have been identified as having localized distributions within promoters of Arabidopsis thaliana and rice. Based on their localization patterns, the identified sequences could be classified into three groups, pyrimidine patch (Y Patch), TATA box, and REG (Regulatory Element Group). Sequences of the TATA box group are consistent with the ones reported in previous studies. The REG group includes more than 200 sequences, and half of them correspond to known cis-elements. The other REG subgroups, together with about a hundred uncategorized sequences, are suggested to be novel cis-regulatory elements. Comparison of LDSS-positive sequences between Arabidopsis and rice has revealed moderate conservation of elements and common promoter architecture. In addition, a dimer motif named the YR Rule (C/T A/G) has been identified at the transcription start site (-1/+1). This rule also fits both Arabidopsis and rice promoters.

Conclusion

LDSS was successfully applied to plant genomes, and hundreds of putative promoter elements have been extracted as LDSS-positive octamers. The identified promoter architectures of monocots and dicots are well conserved, but there are moderate variations in the utilized sequences.
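
The YR Rule described in the Results above (a pyrimidine, C or T, at position -1 followed by a purine, A or G, at position +1) can be checked with a few lines of code. The following is only an illustrative sketch, not the authors' implementation: the function name `follows_yr_rule`, the 0-based indexing convention for the +1 base, and the example sequence are all assumptions made for this example.

```python
import re

# Pyrimidine (C or T) at -1 followed by purine (A or G) at +1.
YR_PATTERN = re.compile(r"[CT][AG]")


def follows_yr_rule(sequence: str, tss_index: int) -> bool:
    """Return True if the dimer spanning -1/+1 is pyrimidine-purine.

    `tss_index` is the 0-based index of the +1 base in `sequence`
    (a hypothetical convention chosen for this sketch).
    """
    if tss_index < 1 or tss_index >= len(sequence):
        return False  # no -1 base available, or index out of range
    dimer = sequence[tss_index - 1 : tss_index + 1].upper()
    return YR_PATTERN.fullmatch(dimer) is not None


# Made-up sequence: index 4 is 'A', preceded by 'C' at index 3.
example = "GGTCATTC"
print(follows_yr_rule(example, 4))  # dimer "CA"
print(follows_yr_rule(example, 5))  # dimer "AT"
```

With the made-up sequence, the first call inspects the dimer "CA" and the second inspects "AT", so only the first satisfies the rule.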

Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 2006;7 Suppl 1:S2.1-31. [PMID: 16925836 PMCID: PMC1810551 DOI: 10.1186/gb-2006-7-s1-s2] [Citation(s) in RCA: 175] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open

Abstract

BACKGROUND

We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.

RESULTS

The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, which takes alternative splicing into account, reached only approximately 40% to 50%. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.

CONCLUSION

This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
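
The nucleotide-level sensitivity and specificity mentioned in the Results above can be illustrated with a short sketch. The abstract does not state the formulas; the code assumes the convention common in gene-prediction evaluation, where sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP), that is, the fraction of predicted coding nucleotides that are annotated as coding. The function name `nucleotide_accuracy` and the coordinates are hypothetical.

```python
from __future__ import annotations


def nucleotide_accuracy(predicted: set[int], annotated: set[int]) -> tuple[float, float]:
    """Return (sensitivity, specificity) over sets of coding positions.

    Assumed definitions (not taken from the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    """
    tp = len(predicted & annotated)  # coding in both prediction and annotation
    fp = len(predicted - annotated)  # predicted coding, not annotated
    fn = len(annotated - predicted)  # annotated coding, missed by prediction
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Made-up example: a 100-nt annotated exon and a prediction shifted by 10 nt.
annotated = set(range(100, 200))
predicted = set(range(110, 210))
sn, sp = nucleotide_accuracy(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

In this made-up case 90 of the 100 annotated positions are recovered and 90 of the 100 predicted positions are annotated, so both measures come out at 0.90.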

Affiliation(s)

Roderic Guigó Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain Member of the EGASP Organizing Committee
Paul Flicek European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Josep F Abril Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
Alexandre Reymond Center for Integrative Genomics, University of Lausanne, Switzerland
Julien Lagarde Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
France Denoeud Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
Stylianos Antonarakis University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
Michael Ashburner Department of Genetics, University of Cambridge, Cambridge CB3 2EH, UK Member of the EGASP Advisory Board
Vladimir B Bajic South African National Bioinformatics Institute (SANBI), University of Western Cape, Bellville 7535, South Africa Member of the EGASP Advisory Board
Ewan Birney European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK Member of the EGASP Organizing Committee
Robert Castelo Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
Eduardo Eyras Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
Catherine Ucla University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
Thomas R Gingeras Affymetrix Inc., Santa Clara, California 95051, USA Member of the EGASP Advisory Board
Jennifer Harrow Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK Member of the EGASP Organizing Committee
Tim Hubbard Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK Member of the EGASP Organizing Committee
Suzanna E Lewis Department of Molecular and Cellular Biology, University of California, Berkeley, California 94792, USA Member of the EGASP Advisory Board
Martin G Reese Omicia Inc., Christie Ave., Emeryville, California 94608, USA Member of the EGASP Advisory Board
