1
Wu J. Testing the coding potential of conserved short genomic sequences. Adv Bioinformatics 2010; 2010:287070. [PMID: 20224812 PMCID: PMC2834954 DOI: 10.1155/2010/287070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2009] [Accepted: 01/02/2010] [Indexed: 11/25/2022] Open
Abstract
A procedure is proposed to test whether a genomic sequence contains coding DNA, in which case it is called a coding potential region. The procedure tests the coding potential of conserved short genomic sequences while relaxing the assumptions on the probability models of gene structures. Thus, it is expected to add candidate regions that contain coding DNA to current genomic databases. The procedure was applied to the set of highly conserved human-mouse sequences in the genome database at the University of California at Santa Cruz. For sequences containing RefSeq coding exons, the procedure detected 91.3% of the regions in this set as having coding potential, which covers 83% of the human RefSeq coding exons, at a 2.6% false positive rate. The procedure detected 12,688 novel short regions with coding potential at a false discovery rate <0.05; 65.7% of the novel regions lie between annotated genes.
Affiliation(s)
- Jing Wu
- Department of Statistics, Carnegie Mellon University, PA 15213, USA
2
Madupu R, Brinkac LM, Harrow J, Wilming LG, Böhme U, Lamesch P, Hannick LI. Meeting report: a workshop on Best Practices in Genome Annotation. Database: The Journal of Biological Databases and Curation 2010; 2010:baq001. [PMID: 20428316 PMCID: PMC2860899 DOI: 10.1093/database/baq001] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 10/12/2009] [Revised: 01/08/2010] [Accepted: 01/11/2010] [Indexed: 01/28/2023]
Abstract
Efforts to annotate the genomes of a wide variety of model organisms are currently carried out by sequencing centers, model organism databases and academic/institutional laboratories around the world. Different annotation methods and tools have been developed over time to meet the needs of biologists faced with the task of annotating biological data. While standardized methods are essential for consistent curation within each annotation group, methods and tools can differ between groups, especially when the groups are curating different organisms. Biocurators from several institutes met at the Third International Biocuration Conference in Berlin, Germany, April 2009 and hosted the ‘Best Practices in Genome Annotation: Inference from Evidence’ workshop to share their strategies, pipelines, standards and tools. This article documents the material presented in the workshop.
Affiliation(s)
- Ramana Madupu
- Informatics, J. Craig Venter Institute, Rockville, MD 20850 USA, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK and The Arabidopsis Information Resource, Carnegie Institution of Washington, Stanford, CA 94305 USA
3
Abstract
BACKGROUND Computational gene prediction tools routinely generate large volumes of predicted coding exons (putative exons). One common limitation of these tools is their relatively low specificity, owing to the large number of non-coding regions. METHODS A statistical approach is developed that substantially improves gene prediction specificity. The key idea is to apply the evolutionary conservation principle to coding exons. First, by exploiting the homology between the genomes of two related species, a probability model for the evolutionary conservation pattern of codons across different genomes is developed. A probability model for the dependency between adjacent codons/triplets is then added to differentiate coding exons from random sequences. Finally, a log odds ratio is developed to classify putative exons into coding exons and non-coding regions. RESULTS The method was tested on pre-aligned human-mouse sequences where the putative exons are predicted by GENSCAN and TWINSCAN. The proposed method improves the exon specificity by 73% and 32%, respectively, while the loss of sensitivity is ≤1%. The method also keeps 98% of the RefSeq gene structures that are correctly predicted by TWINSCAN while removing 26% of the predicted genes that are in non-coding regions. The estimated number of true exons in TWINSCAN's predictions is 157,070. The results and the executable code can be downloaded from http://www.stat.purdue.edu/~jingwu/codon/. CONCLUSION The proposed method demonstrates an application of the evolutionary conservation principle to coding exons. It is a complementary method that can be used as an additional criterion to refine many existing gene predictions.
Affiliation(s)
- Jing Wu
- Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47906, USA.
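The log odds ratio step in this entry's abstract can be illustrated with a minimal sketch. It is not the author's implementation: it keeps only the adjacent-triplet dependency (a first-order Markov chain over in-frame triplets) and omits the cross-species conservation model entirely. The function names, the pseudocount, the threshold of 0, and the toy training sequences are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def codons(seq):
    """Split a sequence into consecutive in-frame triplets, dropping any remainder."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train_transitions(sequences, pseudocount=1.0):
    """Estimate P(next triplet | current triplet) with additive smoothing.

    Returns a function prob(a, b). The 64 in the denominator is the number
    of possible DNA triplets.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        trip = codons(seq)
        for a, b in zip(trip, trip[1:]):
            counts[a][b] += 1.0

    def prob(a, b):
        row = counts.get(a, {})
        total = sum(row.values()) + pseudocount * 64
        return (row.get(b, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, background_prob):
    """Sum of log-likelihood ratios over adjacent triplet pairs."""
    trip = codons(seq)
    return sum(
        math.log(coding_prob(a, b) / background_prob(a, b))
        for a, b in zip(trip, trip[1:])
    )


def classify(seq, coding_prob, background_prob, threshold=0.0):
    """Label a putative exon by comparing its log odds to a threshold (assumed 0)."""
    score = log_odds(seq, coding_prob, background_prob)
    return "coding" if score > threshold else "non-coding"


# Toy training data (hypothetical, far too small for real use).
coding_model = train_transitions(["ATGGCCATGGCCATGGCC"])
background_model = train_transitions(["TTTAAATTTAAATTTAAA"])
print(classify("ATGGCCATGGCC", coding_model, background_model))
print(classify("TTTAAATTTAAA", coding_model, background_model))
```

In practice both models would be trained on large sets of known exons and non-coding sequence, and the threshold would be chosen to meet a target sensitivity.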
4
Rozowsky J, Wu J, Lian Z, Nagalakshmi U, Korbel JO, Kapranov P, Zheng D, Dyke S, Newburger P, Miller P, Gingeras TR, Weissman S, Gerstein M, Snyder M. Novel transcribed regions in the human genome. Cold Spring Harbor Symposia on Quantitative Biology 2007; 71:111-6. [PMID: 17381286 DOI: 10.1101/sqb.2006.71.054] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
Abstract
We have used genomic tiling arrays to identify transcribed regions throughout the human genome. Analysis of the mapping results of RNA isolated from five cell/tissue types, NB4 cells, NB4 cells treated with retinoic acid (RA), NB4 cells treated with 12-O-tetradecanoylphorbol-13 acetate (TPA), neutrophils, and placenta, throughout the ENCODE region reveals a large number of novel transcribed regions. Interestingly, neutrophils exhibit a great deal of novel expression in several intronic regions. Comparison of the hybridization results of NB4 cells treated with different stimuli relative to untreated cells reveals that many new regions are expressed upon cell differentiation. One such region is the Hox locus, which contains a large number of novel regions expressed in a number of cell types. Analysis of the trinucleotide composition of the novel transcribed regions reveals that it is similar to that of known exons. These results suggest that many of the novel transcribed regions may have a functional role.
Affiliation(s)
- J Rozowsky
- Molecular Biophysics & Biochemistry Department, Yale University, New Haven, Connecticut 06520, USA
5
Abstract
We introduce a new system, called shortHMM, which predicts individual exons using two related genomes. In this system, we build a hidden semi-Markov model to identify exons. In the hidden Markov model, we propose joint probability models of nucleotides in introns, splice sites, 5'UTR, 3'UTR, and intergenic regions by exploiting the homology between related genomes. To reduce the false positive rate of the hidden Markov model, we develop a screening process that identifies intergenic regions. We then build a classifier by combining the statistics from the hidden Markov model and the screening process. We implement shortHMM on human-mouse sequence alignments. The source code is available at www.stat.purdue.edu/~jingwu/hmm. Compared to TWINSCAN and SLAM, shortHMM is substantially more powerful in identifying AT-rich RefSeq exons (8% more AT-rich RefSeq exons were predicted) and slightly more powerful in identifying RefSeq exons (3-10% more RefSeq exons were predicted), at a similar or lower false positive rate, with less computing time and less memory usage. Finally, shortHMM is also capable of finding new potential exons.
Affiliation(s)
- Jing Wu
- Department of Statistics, Purdue University, West Lafayette, Indiana 47906, USA.
6
Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 2006; 7 Suppl 1:S2.1-31. [PMID: 16925836 PMCID: PMC1810551 DOI: 10.1186/gb-2006-7-s1-s2] [Citation(s) in RCA: 175] [Impact Index Per Article: 9.2] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. RESULTS The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. CONCLUSION This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
Affiliation(s)
- Roderic Guigó
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Member of the EGASP Organizing Committee
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Josep F Abril
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, Switzerland
- Julien Lagarde
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- France Denoeud
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Stylianos Antonarakis
- University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
- Michael Ashburner
- Department of Genetics, University of Cambridge, Cambridge CB3 2EH, UK
- Member of the EGASP Advisory Board
- Vladimir B Bajic
- South African National Bioinformatics Institute (SANBI), University of Western Cape, Bellville 7535, South Africa
- Member of the EGASP Advisory Board
- Ewan Birney
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Member of the EGASP Organizing Committee
- Robert Castelo
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Eduardo Eyras
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Catherine Ucla
- University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
- Thomas R Gingeras
- Affymetrix Inc., Santa Clara, California 95051, USA
- Member of the EGASP Advisory Board
- Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Member of the EGASP Organizing Committee
- Tim Hubbard
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Member of the EGASP Organizing Committee
- Suzanna E Lewis
- Department of Molecular and Cellular Biology, University of California, Berkeley, California 94792, USA
- Member of the EGASP Advisory Board
- Martin G Reese
- Omicia Inc., Christie Ave., Emeryville, California 94608, USA
- Member of the EGASP Advisory Board
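The coding nucleotide level accuracy figures in this entry's abstract can be made concrete with a short sketch. The abstract does not define the measures, so this assumes the convention common in gene-prediction evaluation: sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP), which other fields call precision. The interval representation (half-open, 0-based) and the function names are hypothetical.

```python
def covered_positions(intervals):
    """Expand half-open (start, end) intervals into the set of covered positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the coding nucleotide level.

    sensitivity = TP / (TP + FN): fraction of annotated coding bases predicted.
    specificity = TP / (TP + FP): fraction of predicted coding bases annotated
    (the gene-prediction convention assumed here).
    """
    truth = covered_positions(annotated)
    pred = covered_positions(predicted)
    tp = len(truth & pred)
    fn = len(truth - pred)
    fp = len(pred - truth)
    sensitivity = tp / (tp + fn) if truth else float("nan")
    specificity = tp / (tp + fp) if pred else float("nan")
    return sensitivity, specificity


# Hypothetical example: one annotated exon and one prediction shifted by 10 bases.
print(nucleotide_accuracy([(0, 100)], [(10, 110)]))
```

Expanding intervals into sets is simple but memory-hungry; for genome-scale annotations an interval-sweep computation would be the more practical choice.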
7
Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res 2004; 32:W309-12. [PMID: 15215400 PMCID: PMC441517 DOI: 10.1093/nar/gkh379] [Citation(s) in RCA: 913] [Impact Index Per Article: 43.5] [Indexed: 11/15/2022] Open
Abstract
We present a www server for AUGUSTUS, a novel software program for ab initio gene prediction in eukaryotic genomic sequences. Our method is based on a generalized Hidden Markov Model with a new method for modeling the intron length distribution. This method allows approximation of the true intron length distribution more accurately than do existing programs. For genomic sequence data from human and Drosophila melanogaster, the accuracy of AUGUSTUS is superior to existing gene-finding approaches. The advantage of our program becomes apparent especially for larger input sequences containing more than one gene. The server is available at http://augustus.gobics.de.
Affiliation(s)
- Mario Stanke
- University of Göttingen, Institut für Mikrobiologie und Genetik, Goldschmidtstrasse 1, 37077 Göttingen, Germany.
8
Abstract
Advances in high throughput sequencing technologies have led to an explosion of sequence information available for today's researchers. Efforts in the emerging next phase of the genomic era are focusing on the assignment of function to genes uncovered by genome sequencing programs. The main approaches include high throughput mutagenesis, predictions based on homology in primary sequence, microarray and proteomics. Despite the variety of strategies applied, only 30% of predicted human genes have any function assigned. There is a need, therefore, for additional tools to overcome some of the limitations of existing techniques. In this review we discuss some recent developments and their impact on gene function annotation, especially as they relate to the elucidation of signalling cascades activated by cytokines and growth factors.
Affiliation(s)
- Endre Kiss-Toth
- Cardiovascular Research Unit, Division of Clinical Sciences (North), University of Sheffield, Northern General Hospital, Sheffield S5 7AU, UK.