1
|
Abstract
Evolutionary genomics is a field that relies heavily upon comparing genomes, that is, the full complement of genes of one species with another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation as manual annotation is extremely time-consuming and costly. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.
Collapse
Affiliation(s)
- Tyler Alioto
- Centro Nacional de Análisis Genómico, Barcelona, Spain.
| |
Collapse
|
2
|
|
3
|
Abstract
Accurate gene prediction in eukaryotes is a difficult and subtle problem. Here we point out a useful feature of expected distributions of spliceosomal intron lengths. Since introns are removed from transcripts prior to translation, intron lengths are not expected to respect coding frame, thus the number of genomic introns that are a multiple of three bases (‘3n introns’) should be similar to the number that are a multiple of three plus one bases (or plus two bases). Skewed predicted intron length distributions thus suggest systematic errors in intron prediction. For instance, a genome-wide excess of 3n introns suggests that many internal exonic sequences have been incorrectly called introns, whereas a deficit of 3n introns suggests that many 3n introns that lack stop codons have been mistaken for exonic sequence. A survey of genomic annotations for 29 diverse eukaryotic species showed that skew in intron length distributions is a common problem. We discuss several examples of skews in genome-wide intron length distributions that indicate systematic problems with gene prediction. We suggest that evaluation of length distributions of predicted introns is a fast and simple method for detecting a variety of possible systematic biases in gene prediction or even problems with genome assemblies, and discuss ways in which these insights could be incorporated into genome annotation protocols.
Collapse
Affiliation(s)
- Scott William Roy
- Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand.
| | | |
Collapse
|
4
|
Farrar M. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics 2006; 23:156-61. [PMID: 17110365 DOI: 10.1093/bioinformatics/btl582] [Citation(s) in RCA: 112] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The only algorithm guaranteed to find the optimal local alignment is the Smith-Waterman. It is also one of the slowest due to the number of computations required for the search. To speed up the algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm at the instruction level. RESULTS A faster implementation of the Smith-Waterman algorithm is presented. This algorithm achieved 2-8 times performance improvement over other SIMD based Smith-Waterman implementations. On a 2.0 GHz Xeon Core 2 Duo processor, speeds of >3.0 billion cell updates/s were achieved. AVAILABILITY http://farrar.michael.googlepages.com/Smith-waterman
Collapse
|
5
|
Abstract
We introduce a new system, called shortHMM, for predicting exons, which predicts individual exons using two related genomes. In this system, we build a hidden semi-Markov model to identify exons. In the hidden Markov model, we propose joint probability models of nucleotides in introns, splice sites, 5'UTR, 3'UTR, and intergenic regions by exploiting the homology between related genomes. In order to reduce the false positive rate of the hidden Markov model, we develop a screening process which is able to identify intergenic regions. We then build a classifier by combining the statistics from the hidden Markov model and the screening process. We implement shortHMM on human-mouse sequence alignments. The source codes are available at < www.stat.purdue.edu/ jingwu/hmm >. Compared to TWINSCAN and SLAM, shortHMM is substantially more powerful in identifying AT-rich RefSeq exons (8% more AT-rich RefSeq exons were predicted), as well as slightly more powerful in identifying RefSeq exons (3-10% more RefSeq exons were predicted), at a similar or lower false positive rate, with less computing time and with less memory usage. Last, shortHMM is also capable of finding new potential exons.
Collapse
Affiliation(s)
- Jing Wu
- Department of Statistics, Purdue University, West Lafayette, Indiana 47906, USA.
| | | |
Collapse
|
6
|
Bertone P, Trifonov V, Rozowsky JS, Schubert F, Emanuelsson O, Karro J, Kao MY, Snyder M, Gerstein M. Design optimization methods for genomic DNA tiling arrays. Genome Res 2005; 16:271-81. [PMID: 16365382 PMCID: PMC1361723 DOI: 10.1101/gr.4452906] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]
Abstract
A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.
Collapse
Affiliation(s)
- Paul Bertone
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CN 06520, USA. P50 HG02357
| | | | | | | | | | | | | | | | | |
Collapse
|
7
|
Abstract
The human genome sequence is the book of our life. Buried in this large volume are our genes, which are scattered as small DNA fragments throughout the genome and comprise a small percentage of the total text. Finding these indistinct 'needles' in a vast genomic 'haystack' can be extremely challenging. In response to this challenge, computational prediction approaches have proliferated in recent years that predict the location and structure of genes. Here, I discuss these approaches and explain why they have become essential for the analyses of newly sequenced genomes.
Collapse
Affiliation(s)
- Michael Q Zhang
- Watson School of Biological Sciences, Cold Spring Harbor Laboratory, 1 Bungtown Road, PO Box 100, Cold Spring Harbor, New York 11724, USA.
| |
Collapse
|
8
|
Abstract
The advent of whole-genome data resources--not only sequence but also other genome-scale data collections such as gene expression, protein interaction, and genetic variation--is having two marked, complementary effects on the relatively new discipline of bioinformatics. First, the veritable flood of data is creating a need and demand for new tools for dealing adequately with the deluge, and, second, the unprecedented extent, diversity, and impending completeness of the data sets are creating opportunities for new approaches to discovery based on computational methods.
Collapse
Affiliation(s)
- D B Searls
- Bioinformatics Department, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania 19406, USA.
| |
Collapse
|
9
|
Grosse I, Herzel H, Buldyrev SV, Stanley HE. Species independence of mutual information in coding and noncoding DNA. PHYSICAL REVIEW. E, STATISTICAL PHYSICS, PLASMAS, FLUIDS, AND RELATED INTERDISCIPLINARY TOPICS 2000; 61:5624-5629. [PMID: 11031617 DOI: 10.1103/physreve.61.5624] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/29/1999] [Indexed: 05/23/2023]
Abstract
We explore if there exist universal statistical patterns that are different in coding and noncoding DNA and can be found in all living organisms, regardless of their phylogenetic origin. We find that (i) the mutual information function [symbol: see text] has a significantly different functional form in coding and noncoding DNA. We further find that (ii) the probability distributions of the average mutual information [symbol: see text] are significantly different in coding and noncoding DNA, while (iii) they are almost the same for organisms of all taxonomic classes. Surprisingly, we find that [symbol: see text] is capable of predicting coding regions as accurately as organism-specific coding measures.
Collapse
Affiliation(s)
- I Grosse
- Center for Polymer Studies, Boston University, Massachusetts 02215, USA
| | | | | | | |
Collapse
|
10
|
Affiliation(s)
- G D Stormo
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110-8232, USA.
| |
Collapse
|
11
|
Abstract
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
Collapse
Affiliation(s)
- R Guigó
- Institut Municipal d'Investigació Mèdica, Departament d'Estadística, Universitat de Barcelona, Spain.
| |
Collapse
|
12
|
Roytberg MA, Astakhova TV, Gelfand MS. Combinatorial approaches to gene recognition. COMPUTERS & CHEMISTRY 1998; 21:229-35. [PMID: 9440930 DOI: 10.1016/s0097-8485(96)00034-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
Recognition of genes via exon assembly approaches leads naturally to the use of dynamic programming. We consider the general graph-theoretical formulation of the exon assembly problem and analyze in detail some specific variants: multicriterial optimization in the case of non-linear gene-scoring functions; context-dependent schemes for scoring exons and related procedures for exon filtering; and highly specific recognition of arbitrary gene segments, oligonucleotide probes and polymerase chain reaction (PCR) primers.
Collapse
Affiliation(s)
- M A Roytberg
- Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino
| | | | | |
Collapse
|
13
|
Sze SH, Pevzner PA. Las Vegas algorithms for gene recognition: suboptimal and error-tolerant spliced alignment. J Comput Biol 1997; 4:297-309. [PMID: 9278061 DOI: 10.1089/cmb.1997.4.297] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
Recently, Gelfand, Mironov and Pevzner (1996) proposed a spliced alignment approach to gene recognition that provides 99% accurate recognition of human genes if a related mammalian protein is available. However, even 99% accurate gene predictions are insufficient for automated sequence annotation in large-scale sequencing projects and therefore have to be complemented by experimental gene verification. One hundred percent accurate gene predictions would lead to a substantial reduction of experimental work on gene identification. Our goal is to develop an algorithm that either predicts an exon assembly with accuracy sufficient for sequence annotation or warns a biologist that the accuracy of a prediction is insufficient and further experimental work is required. We study suboptimal and error-tolerant spliced alignment problems as the first steps towards such an algorithm, and report an algorithm which provides 100% accurate recognition of human genes in 37% of cases (if a related mammalian protein is available). In 52% of genes, the algorithm predicts at least one exon with 100% accuracy.
Collapse
Affiliation(s)
- S H Sze
- Department of Computer Science, University of Southern California, Los Angeles 90089-1113, USA.
| | | |
Collapse
|
14
|
Abstract
We present an improved splice site predictor for the genefinding program Genie. Genie is based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. In Genie, probabilities are estimated for gene features by using dynamic programming to combine information from multiple content and signal sensors, including sensors that integrate matches to homologous sequences from a database. One of the hardest problems in genefinding is to determine the complete gene structure correctly. The splice site sensors are the key signal sensors that address this problem. We replaced the existing splice site sensors in Genie with two novel neural networks based on dinucleotide frequencies. Using these novel sensors, Genie shows significant improvements in the sensitivity and specificity of gene structure identification. Experimental results in tests using a standard set of annotated genes showed that Genie identified 86% of coding nucleotides correctly with a specificity of 85%, versus 80% and 84% in the older system. In further splice site experiments, we also looked at correlations between splice site scores and intron and exon lengths, as well as at the effect of distance to the nearest splice site on false positive rates.
Collapse
Affiliation(s)
- M G Reese
- Human Genome Informatics Group, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA.
| | | | | | | |
Collapse
|
15
|
Abstract
Computational methods for gene identification in genomic sequences typically have two phases: coding region recognition and gene parsing. While there are a number of effective methods for recognizing coding regions (exons), parsing the recognized exons into proper gene structures, to a large extent, remains an unsolved problem. We have developed a computer program which can automatically parse the recognized exons into gene models that are most consistent with the available Expressed Sequence Tags (ESTs) and a set of biological heuristics, derived empirically. The gene modeling algorithm used in this program provides a general framework for applying EST information so the modeling accuracy improves as the amount of available EST information increases. Based on preliminary tests on a number of large DNA sequences, using the dbEST database, we have observed that the algorithm can (1) accurately model complicated multiple gene structures, including embedded genes, (2) identify falsely-recognized exons and locate missed exons by the initial exon recognition phase, and (3) make more accurate exon boundary predictions, if the necessary EST information is available. We have extended this EST-based gene modeling algorithm to model genes on unfinished DNA contigs at the end of the shotgun sequencing. This extended version can automatically determine the orientations and the relative order of the DNA contigs (with gaps between them) using the available ESTs as reference models, before the gene modeling phase.
Collapse
Affiliation(s)
- Y Xu
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Tennessee 37831-6364, USA.
| | | |
Collapse
|
16
|
Abstract
We introduce a general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived to account for the many substantial differences in gene density and structure observed in distinct C + G compositional regions of the human genome. In addition, new models of the donor and acceptor splice signals are described which capture potentially important dependencies between signal positions. The model is applied to the problem of gene identification in a computer program, GENSCAN, which identifies complete exon/intron structures of genes in genomic DNA. Novel features of the program include the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. GENSCAN is shown to have substantially higher accuracy than existing methods when tested on standardized sets of human and vertebrate genes, with 75 to 80% of exons identified exactly. The program is also capable of indicating fairly accurately the reliability of each predicted exon. Consistently high levels of accuracy are observed for sequences of differing C + G content and for distinct groups of vertebrates.
Collapse
Affiliation(s)
- C Burge
- Department of Mathematics, Stanford University, CA 94305, USA
| | | |
Collapse
|
17
|
Gelfand MS, Mironov AA, Pevzner PA. Gene recognition via spliced sequence alignment. Proc Natl Acad Sci U S A 1996; 93:9061-6. [PMID: 8799154 PMCID: PMC38595 DOI: 10.1073/pnas.93.17.9061] [Citation(s) in RCA: 192] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
Abstract
Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics, and applications of combinatorial methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way toward a new approach to gene recognition that uses previously sequenced genes as a clue for recognition of newly sequenced genes. This paper describes a spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of genes and the rare discrepancies between the predicted and real exon-intron structures were caused either by short (less than 5 amino acids) initial/terminal exons or by alternative splicing. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is nonvertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins at 160 accepted point mutations (PAM) (25% similarity), the correlation between the predicted and actual genes was still as high as 95%.
Collapse
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Puschino, Moscow, Russia
| | | | | |
Collapse
|
18
|
Abstract
An algorithm called segment-based dynamic programming is described for predicting gene structure from a sequence of genomic DNA. The algorithm explores the space of gene structures that satisfy junctional and frame constraints and finds the gene structure that optimizes the sum of junctional and segmental scoring functions. Junctional constraints specify acceptable sites of initiation, termination, and splicing, whereas frame constraints ensure that the total exon length is a multiple of three and that no in-frame stop codons occur within exons or at exon-exon junctions. By computing over segments, segment-based dynamic programming maintains reading frame and phase information for each segment, it can assemble exons in-frame as well as score them in-frame. The algorithm is used to quantify the computational power of constraints. Experimental results show that frame constraints reduce the size of the search space by several orders of magnitude and that cardinality constraints place an asymptotic limit on the size of the search space. The algorithm is also used to compare the accuracy of different methods for assembly and scoring. A scoring scheme based on fifth-order Markov hexamer frequencies is presented and used in three objective functions, corresponding to in-frame, frame-independent, and frame-maximal scoring strategies. Experimental results show that in-frame assembly improves specificity only slightly over frame-independent assembly, whereas in-frame scoring improves specificity substantially over frame-independent and frame-maximal scoring.
Collapse
Affiliation(s)
- T D Wu
- Beckman Center for Molecular and Genetic Medicine, Stanford University Medical Center, California 94305, USA.
| |
Collapse
|
19
|
Gelfand MS, Podolsky LI, Astakhova TV, Roytberg MA. Recognition of genes in human DNA sequences. J Comput Biol 1996; 3:223-34. [PMID: 8811484 DOI: 10.1089/cmb.1996.3.223] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
Abstract
A new approach to computer-assisted gene recognition in higher eukaryote DNA is suggested. It allows one to use not only linear functions for scoring structures, but all functions satisfying natural monotonicity conditions. The algorithm constructs the set of structures guaranteed to contain an optimal structure for every function. So, it uncouples the time-consuming step of generation of this set from the fast step of structure scoring, thus making it simple to experiment with different functions. One particular scoring function, taking into account only codon usage and positional nucleotide frequencies of the splicing sites, has been implemented in the Genome Recognition and Exon Assembly Tool program, and has been tested on an independent sample of human genes, yielding 88% sensitivity and 79% specificity.
Collapse
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Moscow Region, Russia
| | | | | | | |
Collapse
|
20
|
Reddy BV, Pandit MW. A statistical analytical approach to decipher information from biological sequences: application to murine splice-site analysis and prediction. J Biomol Struct Dyn 1995; 12:785-801. [PMID: 7779300 DOI: 10.1080/07391102.1995.10508776] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
A simple statistical approach for the analysis of biological sequences, such as splice-sites, promoter regions, helices and extended structure forming regions or any other sequence dependent functional entities in proteins, is presented. The approach has been proved useful to develop a method for prediction of such entities in newly available sequences. We first search for invariant sequence features of each functional entity from the experimentally available sequences and identify a set of 'like' sequences with similar sequence features. In the next step, concrete features of sequence entities in terms of occurrences of smaller subsequences are identified at various positions which are used as a knowledge base to select potential functional entities from the identified 'like' sequences. The third step consists of refinement of this pattern learning, statistical improvements of the knowledge base weight matrices, and finally its application to predict functional entities in newly available sequences. Such an analysis is operationally described for murine splice-site predictions. Regions comprising -30 to +30 nucleotides from the splice-junction at the murine splice-sites (donors and acceptors), reported earlier, were analyzed. Invariant sequence-specific features in terms of monomer frequency average were used to identify splice-site-like sequences in the EMBL murine DNA sequence data base. The frequencies of occurrence of mono-, di-, tri- and tetranucleotides in the known splice-sites were studied in comparison with the splice-site-like sequences; the significant differences in their occurrences were extracted as statistical knowledge coded in weight matrices for computer to identify potential splice-sites. The algorithm was refined and a method was developed to predict potential splice-sites in a given murine DNA; the analysis was also extended to human DNA. The success rate of the method to predict correct splice-sites in these species is found to be 80% and 85%, respectively. The major strength of this method lies in reducing significantly the number of false positives which are normally picked up in such analysis.
Collapse
Affiliation(s)
- B V Reddy
- Centre for Cellular and Molecular Biology, Hyderabad, India
| | | |
Collapse
|
21
|
Abstract
Recognition of function of newly sequenced DNA fragments is an important area of computational molecular biology. Here we present an extensive review of methods for prediction of functional sites, tRNA, and protein-coding genes and discuss possible further directions of research in this area.
Collapse
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow region, Russia
| |
Collapse
|