101
Abstract
The accurate prediction of higher eukaryotic gene structures and regulatory elements directly from genomic sequences is an important early step in the understanding of newly assembled contigs and finished genomes. As more new genomes are sequenced, comparative approaches are becoming increasingly practical and valuable for predicting genes and regulatory elements. We demonstrate the effectiveness of a comparative method called pattern filtering; it utilizes synteny between two or more genomic segments for the annotation of genomic sequences. Pattern filtering optimally detects the signatures of conserved functional elements despite the stochastic noise inherent in evolutionary processes, allowing more accurate annotation of gene models. We anticipate that pattern filtering will facilitate sequence annotation and the discovery of new functional elements by the genetics and genomics communities.
Affiliation(s)
- Jonathan E Moore
- Molecular Biology Institute, University of California Los Angeles, Los Angeles, CA 90095, USA
102
Kotlar D, Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Res 2003; 13:1930-7. [PMID: 12869578 PMCID: PMC403785 DOI: 10.1101/gr.1261703] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.3] [Received: 02/12/2003] [Accepted: 05/21/2003] [Indexed: 11/24/2022]
Abstract
A new measure for gene prediction in eukaryotes is presented. The measure is based on the Discrete Fourier Transform (DFT) phase at a frequency of 1/3, computed for the four binary sequences for A, T, C, and G. Analysis of all the experimental genes of S. cerevisiae revealed distribution of the phase in a bell-like curve around a central value, in all four nucleotides, whereas the distribution of the phase in the noncoding regions was found to be close to uniform. Similar findings were obtained for other organisms. Several measures based on the phase property are proposed. The measures are computed by clockwise rotation of the vectors, obtained by DFT for each analysis frame, by an angle equal to the corresponding central value. In protein coding regions, this rotation is assumed to closely align all vectors in the complex plane, thereby amplifying the magnitude of the vector sum. In noncoding regions, this operation does not significantly change this magnitude. Computing the measures with one chromosome and applying them on sequences of others reveals improved performance compared with other algorithms that use the 1/3 frequency feature, especially in short exons. The phase property is also used to find the reading frame of the sequence.
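The phase idea in this abstract can be illustrated with a minimal sketch. It assumes the "four binary sequences" are per-base indicator sequences, and it takes the central phase values as a caller-supplied dictionary; in the paper they are estimated from known genes. The function names are hypothetical, and the paper's actual measures may differ in detail.

```python
import cmath
import math

BASES = "ACGT"

def dft_one_third(seq, base):
    """DFT coefficient at frequency 1/3 of the binary indicator sequence of `base`."""
    w = -2j * math.pi / 3
    return sum(cmath.exp(w * n) for n, ch in enumerate(seq.upper()) if ch == base)

def phases(seq):
    """Phase (radians) of the 1/3-frequency coefficient for each of the four bases."""
    return {b: cmath.phase(dft_one_third(seq, b)) for b in BASES}

def rotated_sum(seq, central_phase):
    """Rotate each base's vector clockwise by its central phase, then return
    the magnitude of the vector sum (large when the vectors align)."""
    total = sum(dft_one_third(seq, b) * cmath.exp(-1j * central_phase[b])
                for b in BASES)
    return abs(total)

# Toy frame with perfect period-3 structure; its own phases serve as the
# assumed central values, so the rotation aligns all vectors exactly.
frame = "ATG" * 10
central = phases(frame)
score = rotated_sum(frame, central)
```

In a frame with no period-3 structure, the four vectors are short and their phases are unrelated to the central values, so the same rotation leaves the magnitude of the sum small.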
Affiliation(s)
- Daniel Kotlar
- Department of Computer Science, Tel-Hai Academic College, Upper Galilee 12210, Israel
103
Worthey EA, Martinez-Calvillo S, Schnaufer A, Aggarwal G, Cawthra J, Fazelinia G, Fong C, Fu G, Hassebrock M, Hixson G, Ivens AC, Kiser P, Marsolini F, Rickel E, Rickell E, Salavati R, Sisk E, Sunkin SM, Stuart KD, Myler PJ. Leishmania major chromosome 3 contains two long convergent polycistronic gene clusters separated by a tRNA gene. Nucleic Acids Res 2003; 31:4201-10. [PMID: 12853638 PMCID: PMC167632 DOI: 10.1093/nar/gkg469] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.7] [Indexed: 11/12/2022]
Abstract
Leishmania parasites (order Kinetoplastida, family Trypanosomatidae) cause a spectrum of human diseases ranging from asymptomatic to lethal. The approximately 33.6 Mb genome is distributed among 36 chromosome pairs that range in size from approximately 0.3 to 2.8 Mb. The complete nucleotide sequence of Leishmania major Friedlin chromosome 1 revealed 79 protein-coding genes organized into two divergent polycistronic gene clusters with the mRNAs transcribed towards the telomeres. We report here the complete nucleotide sequence of chromosome 3 (384 518 bp) and an analysis revealing 95 putative protein-coding ORFs. The ORFs are primarily organized into two large convergent polycistronic gene clusters (i.e. transcribed from the telomeres). In addition, a single gene at the left end is transcribed divergently towards the telomere, and a tRNA gene separates the two convergent gene clusters. Numerous genes have been identified, including those for metabolic enzymes, kinases, transporters, ribosomal proteins, spliceosome components, helicases, an RNA-binding protein and a DNA primase subunit.
Affiliation(s)
- E A Worthey
- Seattle Biomedical Research Institute, 4 Nickerson Street, Seattle, WA 98109-1651, USA
104
Aggarwal G, Worthey EA, McDonagh PD, Myler PJ. Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project. BMC Bioinformatics 2003; 4:23. [PMID: 12793912 PMCID: PMC165441 DOI: 10.1186/1471-2105-4-23] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Received: 02/19/2003] [Accepted: 06/07/2003] [Indexed: 11/17/2022]
Abstract
BACKGROUND: Seattle Biomedical Research Institute (SBRI), as part of the Leishmania Genome Network (LGN), is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces. RESULTS: Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODONUSAGE algorithm built into ARTEMIS, shows the importance of combining methods to more accurately annotate the L. major genomic sequence. CONCLUSION: An improvised and powerful tool for gene prediction has been developed by importing data from widely used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project, where there is a large proportion of novel genes requiring manual annotation.
Affiliation(s)
- Gautam Aggarwal
- Seattle Biomedical Research Institute, 4 Nickerson Street, Seattle, WA 98109, USA
- EA Worthey
- Seattle Biomedical Research Institute, 4 Nickerson Street, Seattle, WA 98109, USA
- Paul D McDonagh
- Immunex Corporation, 51 University Street, Seattle, WA 98101, USA
- Peter J Myler
- Seattle Biomedical Research Institute, 4 Nickerson Street, Seattle, WA 98109, USA
- Departments of Pathobiology and Medical Education and Biomedical Informatics, University of Washington, Seattle, WA 98195, USA
105
Aggarwal G, Ramaswamy R. Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER. J Biosci 2002; 27:7-14. [PMID: 11927773 DOI: 10.1007/bf02703679] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.2] [Indexed: 11/25/2022]
Abstract
We compare the annotation of three complete genomes using the ab initio methods of gene identification GeneScan and GLIMMER. The annotation given in GenBank, the standard against which these are compared, has been made using GeneMark. We find a number of novel genes which are predicted by both methods used here, as well as a number of genes that are predicted by GeneMark, but are not identified by either of the nonconsensus methods that we have used. The three organisms studied here are all prokaryotic species with fairly compact genomes. The Fourier measure forms the basis for an efficient non-consensus method for gene prediction, and the algorithm GeneScan exploits this measure. We have bench-marked this program as well as GLIMMER using 3 complete prokaryotic genomes. An effort has also been made to study the limitations of these techniques for complete genome analysis. GeneScan and GLIMMER are of comparable accuracy insofar as gene-identification is concerned, with sensitivities and specificities typically greater than 0.9. The number of false predictions (both positive and negative) is higher for GeneScan as compared to GLIMMER, but in a significant number of cases, similar results are provided by the two techniques. This suggests that there could be some as-yet unidentified additional genes in these three genomes, and also that some of the putative identifications made hitherto might require re-evaluation. All these cases are discussed in detail.
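The sensitivity and specificity figures quoted in this abstract can be made concrete with a small sketch. The abstract does not define either term, so the definitions below are an assumption based on the common gene-finding convention: sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP). Representing each gene as a `(start, end, strand)` tuple and counting only exact matches are further simplifications for illustration, and the function name is hypothetical.

```python
def sensitivity_specificity(predicted, annotated):
    """Compare predicted genes with a reference annotation.

    Genes are hashable records such as (start, end, strand); only exact
    matches count as true positives (an illustrative simplification).
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fn = len(annotated - predicted)   # annotated but missed
    fp = len(predicted - annotated)   # predicted but not annotated
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity

# Made-up coordinates, not data from the paper.
annotated = [(1, 100, "+"), (200, 300, "-"), (400, 500, "+"), (600, 700, "+")]
predicted = [(1, 100, "+"), (200, 300, "-"), (400, 500, "+"), (800, 900, "-")]
sens, spec = sensitivity_specificity(predicted, annotated)
```

As the abstract notes, a gene predicted by both programs but absent from the reference counts as a false positive under this scheme even though it may be a real, previously unannotated gene.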
Affiliation(s)
- Gautam Aggarwal
- School of Physical Sciences, Jawaharlal Nehru University, New Delhi 110 067, India
106
Hoskins RA, Smith CD, Carlson JW, Carvalho AB, Halpern A, Kaminker JS, Kennedy C, Mungall CJ, Sullivan BA, Sutton GG, Yasuhara JC, Wakimoto BT, Myers EW, Celniker SE, Rubin GM, Karpen GH. Heterochromatic sequences in a Drosophila whole-genome shotgun assembly. Genome Biol 2002; 3:RESEARCH0085. [PMID: 12537574 PMCID: PMC151187 DOI: 10.1186/gb-2002-3-12-research0085] [Citation(s) in RCA: 186] [Impact Index Per Article: 8.1] [Received: 10/30/2002] [Revised: 11/28/2002] [Accepted: 12/05/2002] [Indexed: 11/25/2022]
Abstract
BACKGROUND: Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly. RESULTS: WGS3, an improved whole-genome shotgun assembly, includes 20.7 Mb of draft-quality sequence not represented in the Release 3 sequence spanning the euchromatin. We annotated this sequence using the methods employed in the re-annotation of the Release 3 euchromatic sequence. This analysis predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of our cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm. CONCLUSIONS: Whole-genome shotgun assembly produced a reliable draft-quality sequence of a significant part of the Drosophila heterochromatin. Annotation of this sequence defined the intron-exon structures of 30 known protein-coding genes and 267 protein-coding gene models. The cytogenetic mapping suggests that an additional 150 predicted genes are located in heterochromatin at the base of the Release 3 euchromatic sequence. Our analysis suggests strategies for improving the sequence and annotation of the heterochromatic portions of the Drosophila and other complex genomes.
Affiliation(s)
- Roger A Hoskins
- Department of Genome Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
107
Lynn AM, Jain CK, Kosalai K, Barman P, Thakur N, Batra H, Bhattacharya A. An automated annotation tool for genomic DNA sequences using GeneScan and BLAST. J Genet 2001; 80:9-16. [PMID: 11910119 DOI: 10.1007/bf02811413] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 02/26/2001] [Indexed: 10/22/2022]
Abstract
Genomic sequence data are often available well before the annotated sequence is published. We present a method for analysis of genomic DNA to identify coding sequences using the GeneScan algorithm and characterize these resultant sequences by BLAST. The routines are used to develop a system for automated annotation of genome DNA sequences.
Affiliation(s)
- A M Lynn
- Bioinformatics Centre, Jawaharlal Nehru University, New Delhi 110 067, India
108
Bhattacharya A, Bhattacharya S, Joshi A, Ramachandran S, Ramaswamy R. Identification of parasitic genes by computational methods. Parasitol Today 2000; 16:127-31. [PMID: 10689334 DOI: 10.1016/s0169-4758(99)01600-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Abstract
A number of parasite genome projects are under way, and large amounts of nucleotide sequence data are becoming available for analysis. There is an urgent need for development of theoretical tools to analyze the genome data, including identification of protein-coding sequences. The majority of the methods developed to date require prior information about the genome before accurate predictions can be made. Because such information is not available for many parasites, these methods cannot be directly applied. In this article, Alok Bhattacharya and colleagues describe some of the gene-prediction methods commonly in use, and a new method, GeneScan, that they have developed for the analysis of parasite genomes.
Affiliation(s)
- A Bhattacharya
- School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
109
Nelson PS, Gan L, Ferguson C, Moss P, Gelinas R, Hood L, Wang K. Molecular cloning and characterization of prostase, an androgen-regulated serine protease with prostate-restricted expression. Proc Natl Acad Sci U S A 1999; 96:3114-9. [PMID: 10077646 PMCID: PMC15904 DOI: 10.1073/pnas.96.6.3114] [Citation(s) in RCA: 164] [Impact Index Per Article: 6.3] [Indexed: 11/18/2022]
Abstract
The identification of genes with selective expression in specific organs or cell types provides an entry point for understanding biological processes that occur uniquely within a particular tissue. Using a subtraction approach designed to identify genes preferentially expressed in specific tissues, we have identified prostase, a human serine protease with prostate-restricted expression. The prostase cDNA encodes a putative 254-aa polypeptide with a conserved serine protease catalytic triad and an amino-terminal pre-propeptide sequence, indicating a potential secretory function. The genomic sequence comprises five exons and four introns and contains multiple copies of a chromosome 19q-specific minisatellite repeat. Northern analysis indicates that prostase mRNA is expressed in hormonally responsive normal and neoplastic prostate epithelial tissues, but not in prostate stromal constituents. Prostase shares 35% amino acid identity with prostate-specific antigen (PSA) and 78% identity with the porcine enamel matrix serine proteinase 1, an enzyme involved in enamel matrix degradation and with a putative role in the disruption of intercellular junctions. Radiation-hybrid-panel mapping localized prostase to chromosome 19q13, a region containing several other serine proteases, including protease M, pancreatic/renal kallikrein hK1, and the prostate-specific kallikreins hK2 and hK3 (PSA). The sequence homology between prostase and other well-characterized serine proteases suggests several potential functional roles for the prostase protein that include the degradation of extracellular matrix and the activation of PSA and other proteases.
Affiliation(s)
- P S Nelson
- Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195, USA.
110
Audic S, Claverie JM. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A 1998; 95:10026-31. [PMID: 9707594 PMCID: PMC21455 DOI: 10.1073/pnas.95.17.10026] [Citation(s) in RCA: 44] [Impact Index Per Article: 1.6] [Indexed: 11/18/2022]
Abstract
A new method for predicting protein-coding regions in microbial genomic DNA sequences is presented. It uses an ab initio iterative Markov modeling procedure to automatically perform the partition of genomic sequences into three subsets shown to correspond to coding, coding on the opposite strand, and noncoding segments. In contrast to current methods, such as GENEMARK [Borodovsky, M. & McIninch, J. D. (1993) Comput. Chem. 17, 123-133], no training set or prior knowledge of the statistical properties of the studied genome are required. This new method tolerates error rates of 1-2% and can process unassembled sequences. It is thus ideal for the analysis of genome survey and/or fragmented sequence data from uncharacterized microorganisms. The method was validated on 10 complete bacterial genomes (from four major phylogenetic lineages). The results show that protein-coding regions can be identified with an accuracy of up to 90% with a totally automated and objective procedure.
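The iterative partitioning this abstract describes can be sketched generically. This is not the authors' implementation: the window representation, the Markov order, the add-one smoothing, the caller-supplied initial labels, and all function names are assumptions made for illustration. The sketch fits one Markov model per class, reassigns each window to its best-scoring class, and repeats until the labels stop changing.

```python
import math

BASES = "ACGT"

def train_markov(windows, order=2):
    """Count context -> next-base transitions, with add-one smoothing."""
    counts = {}
    for w in windows:
        for i in range(order, len(w)):
            row = counts.setdefault(w[i - order:i], {b: 1 for b in BASES})
            if w[i] in row:
                row[w[i]] += 1
    return counts

def log_likelihood(window, model, order=2):
    """Log-probability of a window under a model; unseen contexts are uniform."""
    ll = 0.0
    for i in range(order, len(window)):
        row = model.get(window[i - order:i], {b: 1 for b in BASES})
        ll += math.log(row.get(window[i], 1) / sum(row.values()))
    return ll

def iterate_partition(windows, labels, n_classes=3, order=2, max_iter=20):
    """Retrain one model per class and reassign windows until labels are stable."""
    labels = list(labels)
    for _ in range(max_iter):
        models = [train_markov([w for w, l in zip(windows, labels) if l == c], order)
                  for c in range(n_classes)]
        new = [max(range(n_classes),
                   key=lambda c: log_likelihood(w, models[c], order))
               for w in windows]
        if new == labels:
            break
        labels = new
    return labels

# Toy windows with three distinct compositions; the third window starts
# in the wrong class and is moved by the first reassignment.
windows = ["ACAC" * 5] * 3 + ["GTGT" * 5] * 3 + ["AATT" * 5] * 3
final = iterate_partition(windows, [0, 0, 1, 1, 1, 1, 2, 2, 2])
```

In the abstract, the three resulting subsets are found to correspond to coding, coding on the opposite strand, and noncoding segments. In this sketch the class indices carry no such meaning on their own.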
Affiliation(s)
- S Audic
- Structural and Genetic Information Laboratory, Centre National de la Recherche Scientifique-EP.91, 31 rue Joseph Aiguier, Marseille F-13402, France.