1
Uberbacher EC, Hyatt D, Shah M. GrailEXP and Genome Analysis Pipeline for genome annotation. Curr Protoc Hum Genet 2008; Chapter 6:Unit 6.5. [PMID: 18428363 DOI: 10.1002/0471142905.hg0605s39] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
The Gene Recognition and Analysis Internet Link (GRAIL) is one of the most widely used systems for evaluating the protein-coding potential of anonymous DNA sequences. This unit describes the use of the XGRAIL and genQuest client-server applications to locate exons in DNA sequences, to develop gene models, and to search databases for homologs. A support protocol describes how to obtain the GRAIL and genQuest client software by anonymous FTP.
2
Abstract
The gene identification problem is the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes. This problem is of self-evident importance, and is far from being fully solved, particularly for higher eukaryotes. Thus it is not surprising that the number of algorithm and software developers working in the area is rapidly increasing. The present paper is an overview of the field, with an emphasis on eukaryotes, for such developers.
Affiliation(s)
- J W Fickett
- Theoretical Biology and Biophysics Group, MS K710, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
3
Bissonnette N, Gilbert I, Levesque-Sergerie JP, Lacasse P, Petitclerc D. In vivo expression of the antimicrobial defensin and lactoferrin proteins allowed by the strategic insertion of introns adequately spliced. Gene 2006; 372:142-52. [PMID: 16516411 DOI: 10.1016/j.gene.2005.12.030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 09/12/2005] [Revised: 12/14/2005] [Accepted: 12/21/2005] [Indexed: 11/22/2022]
Abstract
A major limitation of conventional shuttle expression system, when cloning a bactericidal gene, is the basal expression level in bacteria, which is lethal. Although the expression level is low, the bactericidal feature inherent to the molecule leads to subsequent failure to recover intact transformants when the related gene is cloned into a conventional expression vector. Contrary to popular belief, the human cytomegalovirus immediate-early region 1 promoter (CMV), which is to date one of the most powerful promoters for eukaryotic expression, is active in bacteria. In this study, bactericidal genes were cloned into a conventional shuttle eukaryote expression vector harbouring the CMV promoter, but were interrupted with a sequence independent splicing element (SISE), thus inhibiting lethal gene expression in bacteria. The insertion strategy of the intron uses a universal restriction site-free cloning approach, which has been developed to insert a DNA fragment into a specific location of a gene, through a PCR-based cloning technique. We have found that one intervening sequence, which derives from an adenovirus, can be spliced in a mammalian system without respect to its location, thus the bactericidal protein is synthesized only when transfected into mammalian cells. Therein, lactoferrin and defensin proteins were produced in vivo without the necessity of complex expression systems. By introducing the adeno SISE within the coding sequence of the bactericidal genes, such genes can be easily synthesized in vitro through cloning into bacteria and still are able to express biologically active proteins when introduced into mammalian cells.
Affiliation(s)
- Nathalie Bissonnette
- Dairy and Swine Research and Development Centre, Agriculture and Agri-Food Canada, P.O. Box 90, Lennoxville, Quebec Canada J1M 1Z3.
4
Uberbacher EC, Hyatt D, Shah M. GrailEXP and Genome Analysis Pipeline for genome annotation. Curr Protoc Bioinformatics 2004; Chapter 4:Unit 4.9. [PMID: 18428726 DOI: 10.1002/0471250953.bi0409s04] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2023]
Abstract
The Basic Protocol describes the use of GrailEXP, the latest version of the gene finding system from Oak Ridge National Laboratory. GrailEXP provides gene models, by making use of sequence similarity with Expressed Sequence Tags (ESTs) and known genes. GrailEXP also provides alternatively spliced constructs for each gene based on the available EST evidence. The Support Protocol describes the use of the Genome Analysis Pipeline, a web application which allows users to perform comprehensive sequence analysis by offering a selection from a wide choice of supported gene finders, other biological feature finders, and database searches.
5
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.5] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022]
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
6
Affiliation(s)
- R J Mural
- Computational Biology Section, Oak Ridge National Laboratory, Tennessee 37831, USA
7
Rogozin IB, D'Angelo D, Milanesi L. Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences. Gene 1999; 226:129-37. [PMID: 9889348 DOI: 10.1016/s0378-1119(98)00509-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
The gene identification procedure in a completely new gene with no good homology with protein sequences can be a very complex task. In order to identify the protein-coding region, a new method, 'SYNCOD', based on the analysis of conservative evolutionary properties of coding regions, has been realized. This program is able to identify and use the coding region homologies of the non-annotated (unknown) protein-coding sequences already present in the nucleotide sequence databases by using the alignment produced by BLASTN. The ratio of number mismatches resulting in synonymous codons to the number of mismatches resulting in non-synonymous codons is estimated for each open reading frame. Monte Carlo simulations are then used to estimate the significance of the ratio deviation from random behavior. The SYNCOD program has been tested on generated random sequences and on different control sets. The high accuracy of predicting protein-coding regions (the correlation coefficient, CC, varies from 0.67 to 0.79) and the high specificity (the portion of wrong exons, WE, varies from 0.06 to 0.07) have proved to be important features of the suggested approach. The SYNCOD program is resident on the ITBA-CNR Web Server and can be used via the Internet (URL: www.itba.mi.cnr.it/webgene).
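The statistic described in this abstract can be illustrated with a minimal sketch. It is not the SYNCOD program: the function names, the use of the synonymous fraction as the test statistic, and the null model (the same number of nucleotide substitutions placed at random) are all assumptions made for illustration.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_nonsyn_counts(seq_a, seq_b, frame=0):
    """Count codon mismatches that keep (syn) or change (nonsyn) the amino acid."""
    syn = nonsyn = 0
    usable = min(len(seq_a), len(seq_b))
    for i in range(frame, usable - 2, 3):
        ca, cb = seq_a[i:i + 3], seq_b[i:i + 3]
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


def syn_fraction(seq_a, seq_b, frame=0):
    """Fraction of mismatching codons that are synonymous (0.0 if none differ)."""
    syn, nonsyn = syn_nonsyn_counts(seq_a, seq_b, frame)
    total = syn + nonsyn
    return syn / total if total else 0.0


def monte_carlo_pvalue(seq_a, seq_b, frame=0, trials=1000, seed=0):
    """Estimate how often random substitutions look at least as 'coding-like'.

    Hypothetical null model: the observed number of nucleotide mismatches is
    scattered uniformly over seq_a, and the synonymous fraction is recomputed.
    """
    rng = random.Random(seed)
    usable = min(len(seq_a), len(seq_b))
    n_mismatch = sum(1 for i in range(usable) if seq_a[i] != seq_b[i])
    observed = syn_fraction(seq_a, seq_b, frame)
    hits = 0
    for _ in range(trials):
        mutated = list(seq_a[:usable])
        for p in rng.sample(range(usable), n_mismatch):
            mutated[p] = rng.choice([b for b in BASES if b != mutated[p]])
        if syn_fraction(seq_a, "".join(mutated), frame) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

A small p-value for one reading frame suggests that mismatches concentrate at synonymous positions in that frame, which is the kind of signal the abstract attributes to coding regions.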
Affiliation(s)
- I B Rogozin
- Istituto di Tecnologie Biomediche Avanzate CNR, via Fratelli Cervi 93, 20090 Segrate, Milan, Italy
8
Roytberg MA, Astakhova TV, Gelfand MS. Combinatorial approaches to gene recognition. Comput Chem 1998; 21:229-35. [PMID: 9440930 DOI: 10.1016/s0097-8485(96)00034-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Indexed: 02/05/2023]
Abstract
Recognition of genes via exon assembly approaches leads naturally to the use of dynamic programming. We consider the general graph-theoretical formulation of the exon assembly problem and analyze in detail some specific variants: multicriterial optimization in the case of non-linear gene-scoring functions; context-dependent schemes for scoring exons and related procedures for exon filtering; and highly specific recognition of arbitrary gene segments, oligonucleotide probes and polymerase chain reaction (PCR) primers.
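To make the link between exon assembly and dynamic programming concrete, here is a minimal sketch of the simplest linear-scoring case: choose the best-scoring chain of non-overlapping candidate exons. The function name, the half-open `(start, end, score)` tuples, and the additive score are illustrative assumptions; reading-frame compatibility and the non-linear scoring functions discussed in the paper are not modelled.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Return (best_total_score, chain) for non-overlapping candidate exons.

    exons: iterable of (start, end, score) with half-open coordinates and
    start < end. Two exons are compatible when one ends at or before the
    start of the next (a simplification: no minimum intron length).
    """
    ordered = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in ordered]
    # best[k] = best score achievable using only the first k exons.
    best = [0.0] * (len(ordered) + 1)
    prev = [0] * len(ordered)
    for i, (start, _end, score) in enumerate(ordered):
        # Number of exons that end at or before this exon's start.
        prev[i] = bisect_right(ends, start)
        best[i + 1] = max(best[i], score + best[prev[i]])

    # Trace back which exons were taken.
    chain = []
    k = len(ordered)
    while k > 0:
        _start, _end, score = ordered[k - 1]
        if score + best[prev[k - 1]] > best[k - 1]:
            chain.append(ordered[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    chain.reverse()
    return best[-1], chain
```

For example, with candidates `[(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (160, 260, 2.5)]` the sketch selects the first and third exons for a total score of 9.0.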
Affiliation(s)
- M A Roytberg
- Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino
9
Abstract
As the Human Genome Project enters the large-scale sequencing phase, computational gene identification methods are becoming essential for the automatic analysis and annotation of large uncharacterized genomic sequences. Currently available computer programs relying mainly on sequence coding statistics are of great use in pin-pointing regions in genomic sequences containing exons. Such programs perform rather poorly, however, when the problem is to fully elucidate gene structure. For this problem, the DNA sequence signals involved in the specification of the genes--start sites and splice sites--carry a lot of information, and simple methods relying on such information can predict gene structure with an accuracy to some extent comparable to that of other more sophisticated computational methods.
Affiliation(s)
- R Guigó
- Departament d'Informàtica Mèdica, Institut Municipal d'Investigació Mèdica (IMIM), Barcelona, Spain.
10
Sze SH, Pevzner PA. Las Vegas algorithms for gene recognition: suboptimal and error-tolerant spliced alignment. J Comput Biol 1997; 4:297-309. [PMID: 9278061 DOI: 10.1089/cmb.1997.4.297] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.7] [Indexed: 02/05/2023]
Abstract
Recently, Gelfand, Mironov and Pevzner (1996) proposed a spliced alignment approach to gene recognition that provides 99% accurate recognition of human genes if a related mammalian protein is available. However, even 99% accurate gene predictions are insufficient for automated sequence annotation in large-scale sequencing projects and therefore have to be complemented by experimental gene verification. One hundred percent accurate gene predictions would lead to a substantial reduction of experimental work on gene identification. Our goal is to develop an algorithm that either predicts an exon assembly with accuracy sufficient for sequence annotation or warns a biologist that the accuracy of a prediction is insufficient and further experimental work is required. We study suboptimal and error-tolerant spliced alignment problems as the first steps towards such an algorithm, and report an algorithm which provides 100% accurate recognition of human genes in 37% of cases (if a related mammalian protein is available). In 52% of genes, the algorithm predicts at least one exon with 100% accuracy.
Affiliation(s)
- S H Sze
- Department of Computer Science, University of Southern California, Los Angeles 90089-1113, USA.
11
Gelfand MS, Mironov AA, Pevzner PA. Gene recognition via spliced sequence alignment. Proc Natl Acad Sci U S A 1996; 93:9061-6. [PMID: 8799154 PMCID: PMC38595 DOI: 10.1073/pnas.93.17.9061] [Citation(s) in RCA: 192] [Impact Index Per Article: 6.9] [Indexed: 02/02/2023]
Abstract
Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics, and applications of combinatorial methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way toward a new approach to gene recognition that uses previously sequenced genes as a clue for recognition of newly sequenced genes. This paper describes a spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of genes and the rare discrepancies between the predicted and real exon-intron structures were caused either by short (less than 5 amino acids) initial/terminal exons or by alternative splicing. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is nonvertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins at 160 accepted point mutations (PAM) (25% similarity), the correlation between the predicted and actual genes was still as high as 95%.
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow, Russia
12
Gelfand MS, Podolsky LI, Astakhova TV, Roytberg MA. Recognition of genes in human DNA sequences. J Comput Biol 1996; 3:223-34. [PMID: 8811484 DOI: 10.1089/cmb.1996.3.223] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Indexed: 02/02/2023]
Abstract
A new approach to computer-assisted gene recognition in higher eukaryote DNA is suggested. It allows one to use not only linear functions for scoring structures, but all functions satisfying natural monotonicity conditions. The algorithm constructs the set of structures guaranteed to contain an optimal structure for every function. So, it uncouples the time-consuming step of generation of this set from the fast step of structure scoring, thus making it simple to experiment with different functions. One particular scoring function, taking into account only codon usage and positional nucleotide frequencies of the splicing sites, has been implemented in the Genome Recognition and Exon Assembly Tool program, and has been tested on an independent sample of human genes, yielding 88% sensitivity and 79% specificity.
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Moscow Region, Russia
13
Reddy BV, Pandit MW. A statistical analytical approach to decipher information from biological sequences: application to murine splice-site analysis and prediction. J Biomol Struct Dyn 1995; 12:785-801. [PMID: 7779300 DOI: 10.1080/07391102.1995.10508776] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/27/2023]
Abstract
A simple statistical approach for the analysis of biological sequences, such as splice-sites, promoter regions, helices and extended structure forming regions or any other sequence dependent functional entities in proteins, is presented. The approach has been proved useful to develop a method for prediction of such entities in newly available sequences. We first search for invariant sequence features of each functional entity from the experimentally available sequences and identify a set of 'like' sequences with similar sequence features. In the next step, concrete features of sequence entities in terms of occurrences of smaller subsequences are identified at various positions which are used as a knowledge base to select potential functional entities from the identified 'like' sequences. The third step consists of refinement of this pattern learning, statistical improvements of the knowledge base weight matrices, and finally its application to predict functional entities in newly available sequences. Such an analysis is operationally described for murine splice-site predictions. Regions comprising -30 to +30 nucleotides from the splice-junction at the murine splice-sites (donors and acceptors), reported earlier, were analyzed. Invariant sequence-specific features in terms of monomer frequency average were used to identify splice-site-like sequences in the EMBL murine DNA sequence data base. The frequencies of occurrence of mono-, di-, tri- and tetranucleotides in the known splice-sites were studied in comparison with the splice-site-like sequences; the significant differences in their occurrences were extracted as statistical knowledge coded in weight matrices for computer to identify potential splice-sites. The algorithm was refined and a method was developed to predict potential splice-sites in a given murine DNA; the analysis was also extended to human DNA. 
The success rate of the method to predict correct splice-sites in these species is found to be 80% and 85%, respectively. The major strength of this method lies in reducing significantly the number of false positives which are normally picked up in such analysis.
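The idea of encoding splice-site statistics in a weight matrix can be sketched as follows. This is a generic mononucleotide log-odds matrix scored against a uniform background, not the authors' method, which also uses di-, tri- and tetranucleotide frequencies and a comparison set of splice-site-like sequences. The function names and the pseudocount are assumptions.

```python
import math

BASES = "ACGT"


def build_weight_matrix(sites, background=None, pseudocount=1.0):
    """Build per-position log2-odds weights from equal-length aligned sites."""
    background = background or {b: 0.25 for b in BASES}
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return matrix


def score_site(matrix, candidate):
    """Sum the position-specific weights for a candidate of matching length."""
    return sum(col[base] for col, base in zip(matrix, candidate))
```

Candidates whose score exceeds a chosen threshold would be reported as potential sites; the threshold itself has to be calibrated on known data.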
Affiliation(s)
- B V Reddy
- Centre for Cellular and Molecular Biology, Hyderabad, India
14
Abstract
Recognition of function of newly sequenced DNA fragments is an important area of computational molecular biology. Here we present an extensive review of methods for prediction of functional sites, tRNA, and protein-coding genes and discuss possible further directions of research in this area.
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow region, Russia
15
Affiliation(s)
- V V Solovyev
- Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk
16
Gelfand MS, Roytberg MA. Prediction of the exon-intron structure by a dynamic programming approach. Biosystems 1993; 30:173-82. [PMID: 8374074 DOI: 10.1016/0303-2647(93)90069-o] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.2] [Indexed: 01/30/2023]
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow region
17
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow region
18
Stephens RM, Schneider TD. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J Mol Biol 1992; 228:1124-36. [PMID: 1474582 DOI: 10.1016/0022-2836(92)90320-j] [Citation(s) in RCA: 220] [Impact Index Per Article: 6.9] [Indexed: 12/27/2022]
Abstract
An information analysis of the 5' (donor) and 3' (acceptor) sequences spanning the ends of nearly 1800 human introns has provided evidence for structural features of splice sites that bear upon spliceosome evolution and function: (1) 82% of the sequence information (i.e. sequence conservation) at donor junctions and 97% of the sequence information at acceptor junctions is confined to the introns, allowing codon choices throughout exons to be largely unrestricted. The distribution of information at intron-exon junctions is also described in detail and compared with footprints. (2) Acceptor sites are found to possess enough information to be located in the transcribed portion of the human genome, whereas donor sites possess about one bit less than the information needed to locate them independently. This difference suggests that acceptor sites are located first in humans and, having been located, reduce by a factor of two the number of alternative sites available as donors. Direct experimental evidence exists to support this conclusion. (3) The sequences of donor and acceptor splice sites exhibit a striking similarity. This suggests that the two junctions derive from a common ancestor and that during evolution the information of both sites shifted onto the intron. If so, the protein and RNA components that are found in contemporary spliceosomes, and which are responsible for recognizing donor and acceptor sequences, should also be related. This conclusion is supported by the common structures found in different parts of the spliceosome.
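The "sequence information" measured in bits in this abstract can be illustrated with a short sketch that computes, for each position of a set of aligned sites, 2 bits minus the Shannon entropy of the observed bases. This is a simplified illustration: the function name is an assumption, a uniform four-base background is assumed, and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_per_position(sites):
    """Return information content (bits) at each column of aligned DNA sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    info = []
    for column in zip(*sites):
        n = len(column)
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        # Maximum uncertainty for four equiprobable bases is 2 bits.
        info.append(2.0 - entropy)
    return info
```

A fully conserved position contributes 2 bits and a position with two equally frequent bases contributes 1 bit. Each bit halves the number of candidate positions, which is why a deficit of about one bit at donor sites corresponds to the factor of two mentioned in the abstract.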
Affiliation(s)
- R M Stephens
- National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, MD 21702-1201
19
Abstract
Nonhomologous fully sequenced human protein-coding genes were studied. Three sets of exon-exon junctions were formed defined by the intron (shadow) position relative to the reading frame. For the analysis of intron shadow signals in exons, information content and discrimination energy approaches were used with the correction allowing one to ignore the influence of a protein-coding message. The corrected formulas allow one to define the consensuses for the three types of intron shadow signals as a AG/guwn, cAG/GUnn, and cAG/gunU, and provide better recognition than the original formulas. The analysis of the codon usage in the signal positions leads to the conclusion that the prevalence of some amino acids in corresponding protein sites is caused by the signal requirements and not vice versa. The distribution of potential intron shadow signals in exons contradicts the hypothesis of intron insertion into suitable preexisting sites. There exists a correlation between the intron types and/or the exon length modulo 3.
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow Region
20
Hutchinson GB, Hayden MR. The prediction of exons through an analysis of spliceable open reading frames. Nucleic Acids Res 1992; 20:3453-62. [PMID: 1321415 PMCID: PMC312502 DOI: 10.1093/nar/20.13.3453] [Citation(s) in RCA: 55] [Impact Index Per Article: 1.7] [Indexed: 12/26/2022]
Abstract
We have developed a computer program which predicts internal exons from naive genomic sequence data and which will run on any IBM-compatible 80286 (or higher) computer. The algorithm searches a sequence for 'spliceable open reading frames' (SORFs), which are open reading frames bracketed by suitable splice-recognition sequences, and then analyzes the region for codon usage. Potential exons are stratified according to the reliability of their prediction, from confidence levels 1 to 5. The program is designed to predict internal exons of length greater than 60 nucleotides. In an analysis of 116 genes of a training set, 384 out of 441 such exons (87.1%) are identified, with 280 (63.5%) of predictions matching the true exon exactly (at both 5' and 3' splice junctions and in the correct reading frame), and with 104 (23.6%) exons matching partially. In a similar analysis of 14 genes in a test set unrelated to the genes used to generate the parameters of the program, 70 out of 80 internal exons greater than 60 bp in length are identified (87.5%), with 47 completely and 23 partially matched. SORFs that partially match true internal exons share at least one splice junction with the exon, or share both splice junctions but are interpreted in an incorrect reading frame. Specificity (the percentage of SORFs that correspond to true exons) varies from 91% at confidence level 1 to 16% at confidence level 5, with an overall specificity of 35-40%. The output displays nucleotide position, confidence level, reading frame phase at the 5' and 3' ends, acceptor and donor sequences and scoring statistics and also gives an amino acid translation of the potential exon. SORFIND compares favourably with other programs currently used to predict protein-coding regions.
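The notion of a spliceable open reading frame can be sketched in a few lines. This is not SORFIND: splice-recognition sequences are reduced here to the bare AG (acceptor) and GT (donor) dinucleotides, codon usage and confidence levels are ignored, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(segment):
    """Return the reading frames (0, 1, 2) of segment that contain no stop codon."""
    frames = []
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(c in STOP_CODONS for c in codons):
            frames.append(frame)
    return frames


def find_sorfs(seq, min_len=60):
    """List (start, end, frames) for stop-free regions between an AG and a GT.

    Coordinates are half-open; start is the base after an AG dinucleotide and
    end is the first base of a GT dinucleotide. Only regions longer than
    min_len nucleotides are reported, mirroring the length cut-off above.
    """
    seq = seq.upper()
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    sorfs = []
    for s in starts:
        for e in ends:
            if e - s > min_len:
                frames = open_frames(seq[s:e])
                if frames:
                    sorfs.append((s, e, frames))
    return sorfs
```

The nested loop is quadratic in the number of candidate sites, which is acceptable for a sketch but would need pruning on long genomic sequences.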
Affiliation(s)
- G B Hutchinson
- Department of Medical Genetics, University of British Columbia, Vancouver, Canada
21
Abstract
We have developed a hierarchical rule base system for identifying genes in DNA sequences. Atomic sites (such as initiation codons, stop codons, acceptor sites and donor sites) are identified by a number of different methods and evaluated by a set of filters and rules chosen to maximize sensitivity; these are combined into higher-order gene elements (such as exons), evaluated, filtered and combined as equivalence classes into probable genes, which are evaluated and ranked. The system has been tested on an extensive collection of vertebrate genes smaller than 15,000 bases. Results obtained show that, on average, 88% of the predicted coding region for a transcription unit is actually coding, and 80% of the actual coding is correctly predicted. This will, in most applications, be sufficient for a search against protein sequence databases for the identification of probable gene function. In addition, the system provides a general test platform for both gene atomic site identification and the rules for their evaluation and assembly.
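The two accuracy figures quoted in this abstract (the fraction of predicted coding sequence that is truly coding, and the fraction of true coding sequence that is predicted) can be computed at the nucleotide level as in the sketch below. The function names and the half-open interval representation are assumptions for illustration, not the authors' evaluation code.

```python
def coding_positions(intervals):
    """Expand half-open (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(predicted, actual):
    """Compare predicted and annotated coding intervals base by base."""
    pred = coding_positions(predicted)
    true = coding_positions(actual)
    overlap = len(pred & true)
    return {
        # Share of predicted coding bases that are annotated as coding.
        "predicted_that_is_coding": overlap / len(pred) if pred else 0.0,
        # Share of annotated coding bases that were predicted.
        "coding_that_is_predicted": overlap / len(true) if true else 0.0,
    }
```

For instance, a prediction of `[(0, 100)]` against an annotation of `[(20, 100), (150, 170)]` gives 0.8 for both measures.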
Affiliation(s)
- R Guigó
- Molecular Biology Computer Research Resource, Dana-Farber Cancer Institute, Boston, MA
22
Uberbacher EC, Mural RJ. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci U S A 1991; 88:11261-5. [PMID: 1763041 PMCID: PMC53114 DOI: 10.1073/pnas.88.24.11261] [Citation(s) in RCA: 440] [Impact Index Per Article: 13.3] [Indexed: 12/28/2022]
Abstract
Genes in higher eukaryotes may span tens or hundreds of kilobases with the protein-coding regions accounting for only a few percent of the total sequence. Identifying genes within large regions of uncharacterized DNA is a difficult undertaking and is currently the focus of many research efforts. We describe a reliable computational approach for locating protein-coding portions of genes in anonymous DNA sequence. Using a concept suggested by robotic environmental sensing, our method combines a set of sensor algorithms and a neural network to localize the coding regions. Several algorithms that report local characteristics of the DNA sequence, and therefore act as sensors, are also described. In its current configuration the "coding recognition module" identifies 90% of coding exons of length 100 bases or greater with less than one false positive coding exon indicated per five coding exons indicated. This is a significantly lower false positive rate than any method of which we are aware. This module demonstrates a method with general applicability to sequence-pattern recognition problems and is available for current research efforts.