101
Revilla-Fernández S, Wallner B, Truschner K, Benczak A, Brem G, Schmoll F, Mueller M, Steinborn R. The use of endogenous and exogenous reference RNAs for qualitative and quantitative detection of PRRSV in porcine semen. J Virol Methods 2005; 126:21-30. [PMID: 15847915 PMCID: PMC7112884 DOI: 10.1016/j.jviromet.2005.01.018] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.2] [Received: 08/12/2004] [Revised: 01/17/2005] [Accepted: 01/25/2005] [Indexed: 11/25/2022]
Abstract
Semen is known to be a route of porcine reproductive and respiratory syndrome virus (PRRSV) transmission. A method was developed for qualitative and quantitative detection of the seminal cell-associated PRRSV RNA in relation to endogenous and exogenous reference RNAs. UBE2D2 mRNA was selected as the endogenous control for one-step real-time reverse transcription (RT)-PCR. Particularly for the analysis of persistent infections associated with low copy numbers of PRRSV RNA, UBE2D2 mRNA is an ideal control due to its low expression in seminal cells and its detection in all samples analysed (n = 36). However, the amount of UBE2D2 mRNA in porcine semen varied (up to 10^6-fold), thus its use is limited to qualitative detection of PRRSV RNA. For quantitation, a synthetic, non-metazoan RNA was added to the RNA isolation reaction at an exact copy number. The photosynthesis gene ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) from Arabidopsis thaliana was used as an exogenous spike. Unexpectedly, PRRSV RNA was detected in a herd of specific pathogen-free (SPF) boars that had tested ELISA-negative for anti-PRRSV antibodies. Therefore, RT-PCR for seminal cell-associated PRRSV is a powerful tool for managing the SPF status during quarantine programs and for routine outbreak investigations.
Affiliation(s)
- Sandra Revilla-Fernández
- Institute of Animal Breeding and Genetics, Department for Animal Breeding and Reproduction, University of Veterinary Medicine, Veterinaerplatz 1, A-1210 Vienna, Austria
- Barbara Wallner
- Institute of Animal Breeding and Genetics, Department for Animal Breeding and Reproduction, University of Veterinary Medicine, Veterinaerplatz 1, A-1210 Vienna, Austria
- Klaus Truschner
- Traunkreis Vet Clinic, A-4551 Ried im Traunkreis, Helmbergerstrasse 10, Austria
- Alexandra Benczak
- Institute of Animal Breeding and Genetics, Department for Animal Breeding and Reproduction, University of Veterinary Medicine, Veterinaerplatz 1, A-1210 Vienna, Austria
- Gottfried Brem
- Institute of Animal Breeding and Genetics, Department for Animal Breeding and Reproduction, University of Veterinary Medicine, Veterinaerplatz 1, A-1210 Vienna, Austria
- Agrobiogen, D-86567 Hilgertshausen, Germany
- Ludwig-Boltzmann Institute for Immuno-, Cyto- and Molecular Genetic Research, Veterinaerplatz 1, A-1210 Vienna, Austria
- Friedrich Schmoll
- Clinic for Swine, Clinical Department for Farm Animals and Herd Management, University of Veterinary Medicine, Veterinaerplatz 1, A-1210 Vienna, Austria
- Mathias Mueller
- Institute of Animal Breeding and Genetics, Department for Animal Breeding and Reproduction, University of Veterinary Medicine, Veterinaerplatz 1, A-1210 Vienna, Austria
- Ralf Steinborn
- Institute of Animal Breeding and Genetics, Department for Animal Breeding and Reproduction, University of Veterinary Medicine, Veterinaerplatz 1, A-1210 Vienna, Austria
- Ludwig-Boltzmann Institute for Immuno-, Cyto- and Molecular Genetic Research, Veterinaerplatz 1, A-1210 Vienna, Austria
- Corresponding author. Tel.: +43 1 25077 5625; fax: +43 1 25077 5693.
102
Choi JH, Cho HG, Kim S. GAME: a simple and efficient whole genome alignment method using maximal exact match filtering. Comput Biol Chem 2005; 29:244-53. [PMID: 15979044 DOI: 10.1016/j.compbiolchem.2005.04.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 02/16/2005] [Revised: 04/17/2005] [Accepted: 04/18/2005] [Indexed: 11/30/2022]
Abstract
In this paper, we present a simple and efficient whole genome alignment method using maximal exact match (MEM). The major problem with the use of MEM anchor is that the number of hits in non-homologous regions increases exponentially when shorter MEM anchors are used to detect more homologous regions. To deal with this problem, we have developed a fast and accurate anchor filtering scheme based on simple match extension with minimum percent identity and extension length criteria. Due to its simplicity and accuracy, all MEM anchors in a pair of genomes can be exhaustively tested and filtered. In addition, by incorporating the translation technique, the alignment quality and speed of our genome alignment algorithm have been further improved. As a result, our genome alignment algorithm, GAME (Genome Alignment by Match Extension), performs competitively over existing algorithms and can align large whole genomes, e.g., A. thaliana, without the requirement of typical large memory and parallel processors. This is shown using an experiment which compares the performance of BLAST, BLASTZ, PatternHunter, MUMmer and our algorithm in aligning all 45 pairs of 10 microbial genomes. The scalability of our algorithm is shown in another experiment where all pairs of five chromosomes in A. thaliana were compared.
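The anchor-filtering idea in this abstract (extend each exact match and keep it only if the extension meets minimum identity and length criteria) can be illustrated with a minimal Python sketch. This is not the GAME implementation: the function names `flank_identity` and `filter_anchors`, the ungapped extension, and the default values of `ext_len` and `min_identity` are all assumptions chosen for illustration.

```python
def flank_identity(a, b, i, j, length, ext_len):
    """Ungapped extension of an exact match a[i:i+length] == b[j:j+length].

    Compares up to ext_len bases on each side of the match and returns
    (matching_bases, compared_bases). Hypothetical helper, not from GAME.
    """
    matches = compared = 0
    for k in range(1, ext_len + 1):          # extend to the left
        if i - k < 0 or j - k < 0:
            break
        compared += 1
        matches += a[i - k] == b[j - k]
    for k in range(ext_len):                 # extend to the right
        p, q = i + length + k, j + length + k
        if p >= len(a) or q >= len(b):
            break
        compared += 1
        matches += a[p] == b[q]
    return matches, compared


def filter_anchors(a, b, anchors, ext_len=8, min_identity=0.7):
    """Keep anchors (i, j, length) whose flanks are long and similar enough.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for i, j, length in anchors:
        matches, compared = flank_identity(a, b, i, j, length, ext_len)
        if compared >= ext_len and matches / compared >= min_identity:
            kept.append((i, j, length))
    return kept
```

For example, with `a = "GATTACATACGTACGTCCATGGAA"` and `b = "GATTACACACGTACGTGCATGGAA"`, the anchor `(8, 8, 8)` has 14 matching bases out of 16 compared in its flanks and is kept, whereas the same 8-mer embedded in unrelated flanks is discarded.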
Affiliation(s)
- Jeong-Hyeon Choi
- School of Informatics, Indiana University, Bloomington, IN 47408, USA.
103
Flannick J, Batzoglou S. Using multiple alignments to improve seeded local alignment algorithms. Nucleic Acids Res 2005; 33:4563-77. [PMID: 16100379 PMCID: PMC1185574 DOI: 10.1093/nar/gki767] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 06/08/2005] [Revised: 07/06/2005] [Accepted: 07/27/2005] [Indexed: 11/23/2022]
Abstract
Multiple alignments among genomes are becoming increasingly prevalent. This trend motivates the development of tools for efficient homology search between a query sequence and a database of multiple alignments. In this paper, we present an algorithm that uses the information implicit in a multiple alignment to dynamically build an index that is weighted most heavily towards the promising regions of the multiple alignment. We have implemented Typhon, a local alignment tool that incorporates our indexing algorithm, which our test results show to be more sensitive than algorithms that index only a sequence. This suggests that when applied on a whole-genome scale, Typhon should provide improved homology searches in time comparable to existing algorithms.
Affiliation(s)
- Jason Flannick
- Department of Computer Science, Stanford University, Stanford, CA 94304, USA.
104
Pöhler D, Werner N, Steinkamp R, Morgenstern B. Multiple alignment of genomic sequences using CHAOS, DIALIGN and ABC. Nucleic Acids Res 2005; 33:W532-4. [PMID: 15980528 PMCID: PMC1160147 DOI: 10.1093/nar/gki386] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/25/2022]
Abstract
Comparative analysis of genomic sequences is a powerful approach to discover functional sites in these sequences. Herein, we present a WWW-based software system for multiple alignment of genomic sequences. We use the local alignment tool CHAOS to rapidly identify chains of pairwise similarities. These similarities are used as anchor points to speed up the DIALIGN multiple-alignment program. Finally, the visualization tool ABC is used for interactive graphical representation of the resulting multiple alignments. Our software is available at Göttingen Bioinformatics Compute Server (GOBICS) at
Affiliation(s)
- Burkhard Morgenstern
- To whom correspondence should be addressed. Tel: +49 551 39 14628; Fax: +49 551 39 14929;
105
Sharan R, Ideker T, Kelley B, Shamir R, Karp RM. Identification of Protein Complexes by Comparative Analysis of Yeast and Bacterial Protein Interaction Data. J Comput Biol 2005; 12:835-46. [PMID: 16108720 DOI: 10.1089/cmb.2005.12.835] [Citation(s) in RCA: 76] [Impact Index Per Article: 3.8] [Indexed: 10/25/2022]
Abstract
Mounting evidence shows that many protein complexes are conserved in evolution. Here we use conservation to find complexes that are common to the yeast S. cerevisiae and the bacterium H. pylori. Our analysis combines protein interaction data that are available for each of the two species and orthology information based on protein sequence comparison. We develop a detailed probabilistic model for protein complexes in a single species and a model for the conservation of complexes between two species. Using these models, one can recast the question of finding conserved complexes as a problem of searching for heavy subgraphs in an edge- and node-weighted graph, whose nodes are orthologous protein pairs. We tested this approach on the data currently available for yeast and bacteria and detected 11 significantly conserved complexes. Several of these complexes match very well with prior experimental knowledge on complexes in yeast only and serve for validation of our methodology. The complexes suggest new functions for a variety of uncharacterized proteins. By identifying a conserved complex whose yeast proteins function predominantly in the nuclear pore complex, we propose that the corresponding bacterial proteins function as a coherent cellular membrane transport system. We also compare our results to two alternative methods for detecting complexes and demonstrate that our methodology obtains a much higher specificity.
Affiliation(s)
- Roded Sharan
- School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel.
106
Choo KH, Tong JC, Zhang L. Recent applications of Hidden Markov Models in computational biology. GENOMICS PROTEOMICS & BIOINFORMATICS 2005; 2:84-96. [PMID: 15629048 PMCID: PMC5172443 DOI: 10.1016/s1672-0229(04)02014-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/17/2022]
Abstract
This paper examines recent developments and applications of Hidden Markov Models (HMMs) to various problems in computational biology, including multiple sequence alignment, homology detection, protein sequences classification, and genomic annotation.
Affiliation(s)
- Khar Heng Choo
- Department of Biochemistry, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
- Joo Chuan Tong
- Department of Biochemistry, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
- Louxin Zhang
- Department of Mathematics, National University of Singapore, 2 Science Drive 2, Singapore 117543
- Corresponding author.
107
Dadzie AS, Burger A. Providing visualisation support for the analysis of anatomy ontology data. BMC Bioinformatics 2005; 6:74. [PMID: 15790390 PMCID: PMC1087473 DOI: 10.1186/1471-2105-6-74] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 08/18/2004] [Accepted: 03/24/2005] [Indexed: 11/19/2022]
Abstract
BACKGROUND Improvements in technology have been accompanied by the generation of large amounts of complex data. This same technology must be harnessed effectively if the knowledge stored within the data is to be retrieved. Storing data in ontologies aids its management; ontologies serve as controlled vocabularies that promote data exchange and re-use, improving analysis. The Edinburgh Mouse Atlas Project stores the developmental stages of the mouse embryo in anatomy ontologies. This project is looking at the use of visual data overviews for intuitive analysis of the ontology data. RESULTS A prototype has been developed that visualises the ontologies using directed acyclic graphs in two dimensions, with the ability to study detail in regions of interest in isolation or within the context of the overview. This is followed by the development of a technique that layers individual anatomy ontologies in three-dimensional space, so that relationships across multiple data sets may be mapped using physical links drawn along the third axis. CONCLUSION Usability evaluations of the applications confirmed advantages in visual analysis of complex data. This project will look next at data input from multiple sources, and continue to develop the techniques presented to provide intuitive identification of relationships that span multiple ontologies.
Affiliation(s)
- Aba-Sah Dadzie
- School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, Scotland
- Albert Burger
- School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, Scotland
- Medical Research Council, Human Genetics Unit, Western General Hospital, Edinburgh EH4 2XU, Scotland
108
Stover CM, Lynch NJ, Hanson SJ, Windbichler M, Gregory SG, Schwaeble WJ. Organization of the MASP2 locus and its expression profile in mouse and rat. Mamm Genome 2005; 15:887-900. [PMID: 15672593 DOI: 10.1007/s00335-004-3006-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 10/26/2022]
Abstract
The mouse, rat, and human MASP2 loci are situated on syntenic chromosome regions and are highly conserved. They comprise the genes for MASP-2/MAp19, TAR DNA binding protein of 43 kDa, FRAP kinase, CDT6, Polymyositis-Scleroderma 100-kDa autoantigen, spermidine synthase, and TERE, which were analyzed by annotation of available gene transcript data and cross-species comparison of available genomic sequences. The human and rat genes for spermidine synthase have an additional intron compared to the mouse gene. The mouse and rat genes for Polymyositis-Scleroderma 100-kDa autoantigen have an additional exon compared to the human gene. We find support for the hypothesis that the MAp19-specific exon within the MASP2 gene may have originated in a transposable element. Blocks of highly conserved intronic sequences were found in the MASP2 gene and the TARDBP gene. The expression of all genes within the MASP2 locus was analyzed in mouse and rat. The restricted expression of MASP-2 and MAp19 mRNA in liver contrasts with the ubiquitous expression of all neighboring genes studied.
Affiliation(s)
- Cordula M Stover
- Department of Infection, Immunity and Inflammation, University of Leicester, Leicester LE1 9HN, United Kingdom.
109
Hobolth A, Jensen JL. Applications of Hidden Markov Models for Characterization of Homologous DNA Sequences with a Common Gene. J Comput Biol 2005; 12:186-203. [PMID: 15767776 DOI: 10.1089/cmb.2005.12.186] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
Identifying and characterizing the structure in genome sequences is one of the principal challenges in modern molecular biology, and comparative genomics offers a powerful tool. In this paper, we introduce a hidden Markov model that allows a comparative analysis of multiple sequences related by a phylogenetic tree, and we present an efficient method for estimating the parameters of the model. The model integrates structure prediction methods for one sequence, statistical multiple alignment methods, and phylogenetic information. This unified model is particularly useful for a detailed characterization of DNA sequences with a common gene. We illustrate the model on a variety of homologous sequences.
Affiliation(s)
- Asger Hobolth
- Bioinformatics Research Center, University of Aarhus, Aarhus, Denmark.
110
Regan MR, Lin DDM, Emerick MC, Agnew WS. The effect of higher order RNA processes on changing patterns of protein domain selection: A developmentally regulated transcriptome of type 1 inositol 1,4,5-trisphosphate receptors. Proteins 2005; 59:312-31. [PMID: 15739177 DOI: 10.1002/prot.20225] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/11/2022]
Abstract
The domain structure of proteins synthesized from a single gene can be remodeled during tissue development by activities at the RNA level of gene expression. The impact of higher order RNA processing on changing patterns of protein domain selection may be explored by systematically profiling single-gene transcriptomes. itpr1 is one of three mammalian genes encoding receptors for the second messenger inositol 1,4,5-trisphosphate (InsP3). Some phenotypic variations of InsP3 receptors have been attributed to hetero-oligomers of subunit isoforms from itpr1, itpr2, and itpr3. However, itpr1 itself is subject to alternative RNA splicing, with 7 sites of transcript variation, 6 within the ORF. We have identified 17 itpr1 subunit species expressed in mammalian brain in ensembles that change with tissue differentiation. Statistical analyses of populations comprising >1,300 full-length clones suggest that subunit variation arises from a variably biased stochastic splicing mechanism. Surprisingly, the protein domains of this highly allosteric receptor appear to be assembled in a partially randomized way, yielding stochastic arrays of subunit species that form tetrameric complexes in single cells. Nevertheless, functional expression studies of selected subunits confirm that splicing regulation is connected to phenotypic variation. The potential for itpr1 subunits to form hetero-tetramers in single cells suggests the expression of a developmentally regulated continuum of molecular forms that could display diverse properties, including incremental sensitivities to agonist activation and varying patterns of Ca2+ mobilization. These studies illuminate the extent to which itpr1 molecular phenotype is induced by higher order RNA processing.
Affiliation(s)
- Melissa R Regan
- Department of Physiology, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
111
Abstract
MOTIVATION We introduce GMAP, a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich DP for splice site detection, and microexon identification with statistical significance testing. RESULTS On a set of human messenger RNAs with random mutations at a 1 and 3% rate, GMAP identified all splice sites accurately in over 99.3% of the sequences, which was one-tenth the error rate of existing programs. On a large set of human expressed sequence tags, GMAP provided higher-quality alignments more often than blat did. On a set of Arabidopsis cDNAs, GMAP performed comparably with GeneSeqer. In these experiments, GMAP demonstrated a several-fold increase in speed over existing programs. AVAILABILITY Source code for gmap and associated programs is available at http://www.gene.com/share/gmap SUPPLEMENTARY INFORMATION http://www.gene.com/share/gmap.
Affiliation(s)
- Thomas D Wu
- Department of Bioinformatics Genentech, Inc., South San Francisco, CA 94080, USA.
112
Majoros WH, Pertea M, Salzberg SL. Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics 2005; 21:1782-8. [PMID: 15691859 DOI: 10.1093/bioinformatics/bti297] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION The increased availability of genome sequences of closely related organisms has generated much interest in utilizing homology to improve the accuracy of gene prediction programs. Generalized pair hidden Markov models (GPHMMs) have been proposed as one means to address this need. However, all GPHMM implementations currently available are either closed-source or the details of their operation are not fully described in the literature, leaving a significant hurdle for others wishing to advance the state of the art in GPHMM design. RESULTS We have developed an open-source GPHMM gene finder, TWAIN, which performs very well on two related Aspergillus species, A. fumigatus and A. nidulans, finding 89% of the exons and predicting 74% of the gene models exactly correctly in a test set of 147 conserved gene pairs. We describe the implementation of this GPHMM and we explicitly address the assumptions and limitations of the system. We suggest possible ways of relaxing those assumptions to improve the utility of the system without sacrificing efficiency beyond what is practical. AVAILABILITY Available at http://www.tigr.org/software/pirate/twain/twain.html under the open-source Artistic License.
Affiliation(s)
- W H Majoros
- Bioinformatics Department, The Institute for Genomic Research, Rockville, MD, USA.
113
Abstract
We describe a multiple alignment program named MAP2 based on a generalized pairwise global alignment algorithm for handling long, different intergenic and intragenic regions in genomic sequences. The MAP2 program produces an ordered list of local multiple alignments of similar regions among sequences, where different regions between local alignments are indicated by reporting only similar regions. We propose two similarity measures for the evaluation of the performance of MAP2 and existing multiple alignment programs. Experimental results produced by MAP2 on four real sets of orthologous genomic sequences show that MAP2 rarely missed a block of transitively similar regions and that MAP2 never produced a block of regions that are not transitively similar. Experimental results by MAP2 on six simulated data sets show that MAP2 found the boundaries between similar and different regions precisely. This feature is useful for finding conserved functional elements in genomic sequences. The MAP2 program is freely available in source code form at http://bioinformatics.iastate.edu/aat/sas.html for academic use.
Affiliation(s)
- Xiaoqiu Huang
- To whom correspondence should be addressed. Tel: +1 515 294 2432; Fax: +1 515 294 0258;
114
Abstract
Genome comparisons are behind the powerful new annotation methods being developed to find all human genes, as well as genes from other genomes. Genomes are now frequently being studied in pairs to provide cross-comparison datasets. This 'Noah's Ark' approach often reveals unsuspected genes and may support the deletion of false-positive predictions. Joining mouse and human as the cross-comparison dataset for the first two mammals are: two Drosophila species, D. melanogaster and D. pseudoobscura; two sea squirts, Ciona intestinalis and Ciona savignyi; four yeast (Saccharomyces) species; two nematodes, Caenorhabditis elegans and Caenorhabditis briggsae; and two pufferfish (Takifugu rubripes and Tetraodon nigroviridis). Even genomes like yeast and C. elegans, which have been known for more than five years, are now being significantly improved. Methods developed for yeast or nematodes will now be applied to mouse and human, and soon to additional mammals such as rat and dog, to identify all the mammalian protein-coding genes. Current large disparities between human Unigene predictions (127,835 genes) and gene-scanning methods (45,000 genes) still need to be resolved. This will be the challenge during the next few years.
Affiliation(s)
- David R Nelson
- Department of Molecular Sciences and The UT Center of Excellence in Genomics and Bioinformatics, University of Tennessee, Memphis, Tennessee 38163, USA
- Daniel W Nebert
- Department of Environmental Health and Center for Environmental Genetics (CEG), University of Cincinnati Medical Center, Cincinnati, Ohio 45267-0056, USA
115
Lenhard B, Sandelin A, Mendoza L, Engström P, Jareborg N, Wasserman WW. Identification of conserved regulatory elements by comparative genome analysis. J Biol 2004; 2:13. [PMID: 12760745 PMCID: PMC193685 DOI: 10.1186/1475-4924-2-13] [Citation(s) in RCA: 198] [Impact Index Per Article: 9.4] [Received: 12/12/2002] [Revised: 03/21/2003] [Accepted: 04/08/2003] [Indexed: 12/04/2022]
Abstract
BACKGROUND For genes that have been successfully delineated within the human genome sequence, most regulatory sequences remain to be elucidated. The annotation and interpretation process requires additional data resources and significant improvements in computational methods for the detection of regulatory regions. One approach of growing popularity is based on the preferential conservation of functional sequences over the course of evolution by selective pressure, termed 'phylogenetic footprinting'. Mutations are more likely to be disruptive if they appear in functional sites, resulting in a measurable difference in evolution rates between functional and non-functional genomic segments. RESULTS We have devised a flexible suite of methods for the identification and visualization of conserved transcription-factor-binding sites. The system reports those putative transcription-factor-binding sites that are both situated in conserved regions and located as pairs of sites in equivalent positions in alignments between two orthologous sequences. An underlying collection of metazoan transcription-factor-binding profiles was assembled to facilitate the study. This approach results in a significant improvement in the detection of transcription-factor-binding sites because of an increased signal-to-noise ratio, as demonstrated with two sets of promoter sequences. The method is implemented as a graphical web application, ConSite, which is at the disposal of the scientific community at http://www.phylofoot.org/. CONCLUSIONS Phylogenetic footprinting dramatically improves the predictive selectivity of bioinformatic approaches to the analysis of promoter sequences. ConSite delivers unparalleled performance using a novel database of high-quality binding models for metazoan transcription factors. With a dynamic interface, this bioinformatics tool provides broad access to promoter analysis with phylogenetic footprinting.
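The filtering rule described in this abstract (report only sites that lie in conserved regions and occur as pairs at equivalent alignment positions in two orthologous sequences) can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the ConSite implementation: the names `seq_to_col` and `conserved_site_pairs`, the use of one fixed site `width`, scoring conservation only over the site's own alignment columns, and the 0.7 identity threshold are all hypothetical.

```python
def seq_to_col(aligned):
    """Map each ungapped sequence index to its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def conserved_site_pairs(aln1, aln2, sites1, sites2, width, min_identity=0.7):
    """Pair predicted sites that share a factor name and alignment column.

    aln1, aln2: gapped rows of a pairwise alignment (equal length).
    sites1, sites2: lists of (factor_name, start) in ungapped coordinates.
    A pair is kept only if the aligned columns spanned by the site in
    sequence 1 reach min_identity (an illustrative threshold).
    """
    col1, col2 = seq_to_col(aln1), seq_to_col(aln2)
    # Index sequence-2 sites by (factor, alignment column of the start).
    lookup = {(tf, col2[start]): start for tf, start in sites2}
    pairs = []
    for tf, start1 in sites1:
        first = col1[start1]
        if (tf, first) not in lookup:
            continue                      # no site at the equivalent position
        last = col1[start1 + width - 1]
        cols = range(first, last + 1)
        identical = sum(aln1[c] == aln2[c] and aln1[c] != "-" for c in cols)
        if identical / len(cols) >= min_identity:
            pairs.append((tf, start1, lookup[(tf, first)]))
    return pairs
```

With the toy alignment rows `"ACGT-TGACGTCA"` and `"ACGTATGACGTAA"`, a width-8 site starting at ungapped position 4 in the first sequence and position 5 in the second falls on the same alignment column with 7 of 8 identical columns, so it is reported; sites of the same factor at non-equivalent columns are dropped.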
Affiliation(s)
- Boris Lenhard
- Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden
- Albin Sandelin
- Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden
- Luis Mendoza
- Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden
- Current address: Serono Research and Development, CH-1121 Geneva 20, Switzerland
- Pär Engström
- Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden
- Niclas Jareborg
- Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden
- Current address: AstraZeneca Research and Development, S-151 85 Södertälje, Sweden
- Wyeth W Wasserman
- Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden
- Current address: Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
116
GuhaThakurta D, Schriefer LA, Waterston RH, Stormo GD. Novel transcription regulatory elements in Caenorhabditis elegans muscle genes. Genome Res 2004; 14:2457-68. [PMID: 15574824 PMCID: PMC534670 DOI: 10.1101/gr.2961104] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.1] [Received: 07/01/2004] [Accepted: 10/04/2004] [Indexed: 11/24/2022]
Abstract
We report the identification of three new transcription regulatory elements that are associated with muscle gene expression in the nematode Caenorhabditis elegans. Starting from a subset of well-characterized nematode muscle genes, we identified conserved DNA motifs in the promoter regions using computational DNA pattern-recognition algorithms. These were considered to be putative muscle transcription regulatory motifs. Using the green-fluorescent protein (GFP) as a reporter, experiments were done to determine the biological activity of these motifs in driving muscle gene expression. Prediction accuracy of muscle expression based on the presence of these three motifs was encouraging; nine of 10 previously uncharacterized genes that were predicted to have muscle expression were shown to be expressed either specifically or selectively in the muscle tissues, whereas only one of the nine that scored low for these motifs expressed in muscle. Knockouts of putative regulatory elements in the promoter of the mlc-2 and unc-89 genes show that they significantly contribute to muscle expression and act in a synergistic manner. We find that these DNA motifs are also present in the muscle promoters of C. briggsae, indicating that they are functionally conserved in the nematodes.
Affiliation(s)
- Debraj GuhaThakurta
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
117
Abstract
The genomes from three mammals (human, mouse, and rat), two worms, and several yeasts have been sequenced, and more genomes will be completed in the near future for comparison with those of the major model organisms. Scientists have used various methods to align and compare the sequenced genomes to address critical issues in genome function and evolution. This review covers some of the major new insights about gene content, gene regulation, and the fraction of mammalian genomes that are under purifying selection and presumed functional. We review the evolutionary processes that shape genomes, with particular attention to variation in rates within genomes and along different lineages. Internet resources for accessing and analyzing the treasure trove of sequence alignments and annotations are reviewed, and we discuss critical problems to address in new bioinformatic developments in comparative genomics.
Affiliation(s)
- Webb Miller
- The Center for Comparative Genomics and Bioinformatics, The Huck Institutes of Life Sciences, Department of Biology, Pennsylvania State University, University Park, Pennsylvania, USA.
|
118
|
Margulies EH, Green ED. Detecting highly conserved regions of the human genome by multispecies sequence comparisons. Cold Spring Harb Symp Quant Biol 2004; 68:255-63. [PMID: 15338625 DOI: 10.1101/sqb.2003.68.255] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 11/25/2022]
Affiliation(s)
- E H Margulies
- Genome Technology Branch and NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|
119
|
Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004; 14:1394-403. [PMID: 15231754 PMCID: PMC442156 DOI: 10.1101/gr.2289704] [Citation(s) in RCA: 3508] [Impact Index Per Article: 167.0] [Indexed: 11/25/2022]
Abstract
As genomes evolve, they undergo large-scale evolutionary processes that present a challenge to sequence comparison not posed by short sequences. Recombination causes frequent genome rearrangements, horizontal transfer introduces new sequences into bacterial chromosomes, and deletions remove segments of the genome. Consequently, each genome is a mosaic of unique lineage-specific segments, regions shared with a subset of other genomes and segments conserved among all the genomes under consideration. Furthermore, the linear order of these segments may be shuffled among genomes. We present methods for identification and alignment of conserved genomic DNA in the presence of rearrangements and horizontal transfer. Our methods have been implemented in a software package called Mauve. Mauve has been applied to align nine enterobacterial genomes and to determine global rearrangement structure in three mammalian genomes. We have evaluated the quality of Mauve alignments and drawn comparison to other methods through extensive simulations of genome evolution.
Affiliation(s)
- Aaron C E Darling
- Department of Computer Science, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA
|
120
|
Kellis M, Patterson N, Birren B, Berger B, Lander ES. Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. J Comput Biol 2004; 11:319-55. [PMID: 15285895 DOI: 10.1089/1066527041410319] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.9] [Indexed: 11/12/2022]
Abstract
In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genomewide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes. (3) We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on genomewide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. 
The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs. Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast and will be invaluable in the study of complex genomes like that of the human.
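The reading-frame idea in point (2) can be illustrated with a small sketch. This is not the authors' published test: the function name `frame_conservation` and the scoring rule (fraction of gap-free alignment columns at which both sequences sit in the same codon phase) are assumptions chosen only to show why gaps whose lengths are multiples of three leave a coding region "in frame" while other gaps do not.

```python
def frame_conservation(ref_aln: str, other_aln: str) -> float:
    """Fraction of gap-free columns where both sequences share a codon phase.

    Hypothetical illustration only; not the test described in the paper.
    Inputs are two rows of a pairwise alignment, with '-' marking gaps.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_pos = other_pos = 0      # ungapped coordinates in each sequence
    in_frame = compared = 0
    for a, b in zip(ref_aln, other_aln):
        if a != "-" and b != "-":
            compared += 1
            # Same phase means the two bases occupy the same codon position.
            if ref_pos % 3 == other_pos % 3:
                in_frame += 1
        if a != "-":
            ref_pos += 1
        if b != "-":
            other_pos += 1
    return in_frame / compared if compared else 0.0


# A 3-base deletion keeps the frame; a 1-base deletion shifts it.
print(frame_conservation("ATGAAACCCGGG", "ATG---CCCGGG"))
print(frame_conservation("ATGAAACCCGGG", "ATG-AACCCGGG"))
```

Under this rule a protein-coding region would be expected to score close to 1 across related species, whereas a non-coding region accumulates frame-shifting gaps and scores lower.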
Affiliation(s)
- Manolis Kellis
- Whitehead Institute Center for Genome Research, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
|
121
|
Taher L, Rinner O, Garg S, Sczyrba A, Morgenstern B. AGenDA: gene prediction by cross-species sequence comparison. Nucleic Acids Res 2004; 32:W305-8. [PMID: 15215399 PMCID: PMC441524 DOI: 10.1093/nar/gkh386] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
Automatic gene prediction is one of the major challenges in computational sequence analysis. Traditional approaches to gene finding rely on statistical models derived from previously known genes. By contrast, a new class of comparative methods relies on comparing genomic sequences from evolutionarily related organisms to each other. These methods are based on the concept of phylogenetic footprinting: they exploit the fact that functionally important regions in genomic sequences are usually more conserved than non-functional regions. We created a WWW-based software program for homology-based gene prediction at BiBiServ (Bielefeld Bioinformatics Server). Our tool takes pairs of evolutionarily related genomic sequences as input data, e.g. from human and mouse. The server runs CHAOS and DIALIGN to create an alignment of the input sequences and subsequently searches for conserved splicing signals and start/stop codons near regions of local sequence conservation. Genes are predicted based on local homology information and splice signals. The server returns predicted genes together with a graphical representation of the underlying alignment. The program is available at http://bibiserv.TechFak.Uni-Bielefeld.DE/agenda/.
Affiliation(s)
- Leila Taher
- International Graduate School for Bioinformatics and Genome Research, University of Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany
|
122
|
Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res 2004; 32:W309-12. [PMID: 15215400 PMCID: PMC441517 DOI: 10.1093/nar/gkh379] [Citation(s) in RCA: 911] [Impact Index Per Article: 43.4] [Indexed: 11/15/2022]
Abstract
We present a www server for AUGUSTUS, a novel software program for ab initio gene prediction in eukaryotic genomic sequences. Our method is based on a generalized Hidden Markov Model with a new method for modeling the intron length distribution. This method allows approximation of the true intron length distribution more accurately than do existing programs. For genomic sequence data from human and Drosophila melanogaster, the accuracy of AUGUSTUS is superior to existing gene-finding approaches. The advantage of our program becomes apparent especially for larger input sequences containing more than one gene. The server is available at http://augustus.gobics.de.
Affiliation(s)
- Mario Stanke
- University of Göttingen, Institut für Mikrobiologie und Genetik, Goldschmidtstrasse 1, 37077 Göttingen, Germany.
|
123
|
Brudno M, Steinkamp R, Morgenstern B. The CHAOS/DIALIGN WWW server for multiple alignment of genomic sequences. Nucleic Acids Res 2004; 32:W41-4. [PMID: 15215346 PMCID: PMC441499 DOI: 10.1093/nar/gkh361] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022]
Abstract
Cross-species sequence comparison is a powerful approach to analyze functional sites in genomic sequences and many discoveries have been made based on genomic alignments. Herein, we present a WWW-based software system for multiple alignment of large genomic sequences. Our server utilizes the previously developed combination of CHAOS and DIALIGN to achieve both speed and alignment accuracy. CHAOS is a fast database search tool that creates a list of local sequence similarities. These are used by DIALIGN as anchor points to speed up the final alignment procedure. The resulting alignment is returned to the user in different formats together with a list of anchor points found by CHAOS. The CHAOS/DIALIGN software is freely available at http://dialign.gobics.de/chaos-dialign-submission.
Affiliation(s)
- Michael Brudno
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
|
124
|
Coventry A, Kleitman DJ, Berger B. MSARI: multiple sequence alignments for statistical detection of RNA secondary structure. Proc Natl Acad Sci U S A 2004; 101:12102-7. [PMID: 15304649 PMCID: PMC514400 DOI: 10.1073/pnas.0404193101] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.2] [Received: 07/02/2003] [Indexed: 11/18/2022]
Abstract
We present a highly accurate method for identifying genes with conserved RNA secondary structure by searching multiple sequence alignments of a large set of candidate orthologs for correlated arrangements of reverse-complementary regions. This approach is growing increasingly feasible as the genomes of ever more organisms are sequenced. A program called msari implements this method and is significantly more accurate than existing methods in the context of automatically generated alignments, making it particularly applicable to high-throughput scans. In our tests, it discerned clustalw-generated multiple sequence alignments of signal recognition particle or RNaseP orthologs from controls with 89.1% sensitivity at 97.5% specificity and with 74.4% sensitivity with no false positives in 494 controls. We used msari to conduct a comprehensive scan for secondary structure in mRNAs of coding genes, and we found many genes with known mRNA secondary structure and compelling evidence for secondary structure in other genes. msari uses a method for coping with sequence redundancy that is likely to have applications in a large set of other comparison-based search methods. The program is available for download from http://theory.csail.mit.edu/MSARi.
Affiliation(s)
- Alex Coventry
- Department of Mathematics and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
|
125
|
Sogayar MC, Camargo AA, Bettoni F, Carraro DM, Pires LC, Parmigiani RB, Ferreira EN, de Sá Moreira E, do Rosário D de O Latorre M, Simpson AJG, Cruz LO, Degaki TL, Festa F, Massirer KB, Sogayar MC, Filho FC, Camargo LP, Cunha MAV, De Souza SJ, Faria M, Giuliatti S, Kopp L, de Oliveira PSL, Paiva PB, Pereira AA, Pinheiro DG, Puga RD, S de Souza JE, Albuquerque DM, Andrade LEC, Baia GS, Briones MRS, Cavaleiro-Luna AMS, Cerutti JM, Costa FF, Costanzi-Strauss E, Espreafico EM, Ferrasi AC, Ferro ES, Fortes MAHZ, Furchi JRF, Giannella-Neto D, Goldman GH, Goldman MHS, Gruber A, Guimarães GS, Hackel C, Henrique-Silva F, Kimura ET, Leoni SG, Macedo C, Malnic B, Manzini B CV, Marie SKN, Martinez-Rossi NM, Menossi M, Miracca EC, Nagai MA, Nobrega FG, Nobrega MP, Oba-Shinjo SM, Oliveira MK, Orabona GM, Otsuka AY, Paço-Larson ML, Paixão BMC, Pandolfi JRC, Pardini MIMC, Passos Bueno MR, Passos GAS, Pesquero JB, Pessoa JG, Rahal P, Rainho CA, Reis CP, Ricca TI, Rodrigues V, Rogatto SR, Romano CM, Romeiro JG, Rossi A, Sá RG, Sales MM, Sant'Anna SC, Santarosa PL, Segato F, Silva WA, Silva IDCG, Silva NP, Soares-Costa A, Sonati MF, Strauss BE, Tajara EH, Valentini SR, Villanova FE, Ward LS, Zanette DL. A transcript finishing initiative for closing gaps in the human transcriptome. Genome Res 2004; 14:1413-23. [PMID: 15197164 PMCID: PMC442158 DOI: 10.1101/gr.2111304] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2003] [Accepted: 03/12/2004] [Indexed: 11/24/2022]
Abstract
We report the results of a transcript finishing initiative, undertaken for the purpose of identifying and characterizing novel human transcripts, in which RT-PCR was used to bridge gaps between paired EST clusters, mapped against the genomic sequence. Each pair of EST clusters selected for experimental validation was designated a transcript finishing unit (TFU). A total of 489 TFUs were selected for validation, and an overall efficiency of 43.1% was achieved. We generated a total of 59,975 bp of transcribed sequences organized into 432 exons, contributing to the definition of the structure of 211 human transcripts. The structure of several transcripts reported here was confirmed during the course of this project, through the generation of their corresponding full-length cDNA sequences. Nevertheless, for 21% of the validated TFUs, a full-length cDNA sequence is not yet available in public databases, and the structure of 69.2% of these TFUs was not correctly predicted by computer programs. The TF strategy provides a significant contribution to the definition of the complete catalog of human genes and transcripts, because it appears to be particularly useful for identification of low abundance transcripts expressed in a restricted set of tissues as well as for the delineation of gene boundaries and alternatively spliced isoforms.
|
126
|
Abstract
Alternative splicing is a critical post-transcriptional event leading to an increase in the transcriptome diversity. Recent bioinformatics studies revealed a high frequency of alternative splicing. Although the extent of AS conservation among mammals is still being discussed, it has been argued that major forms of alternatively spliced transcripts are much better conserved than minor forms. It suggests that alternative splicing plays a major role in genome evolution allowing new exons to evolve with less constraint.
|
127
|
Li LH, Li JC, Lin YF, Lin CY, Chen CY, Tsai SF. Genomic shotgun array: a procedure linking large-scale DNA sequencing with regional transcript mapping. Nucleic Acids Res 2004; 32:e27. [PMID: 14960710 PMCID: PMC373421 DOI: 10.1093/nar/gnh025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 01/30/2023]
Abstract
To facilitate transcript mapping and to investigate alterations in genomic structure and gene expression in a defined genomic target, we developed a novel microarray-based method to detect transcriptional activity of the human chromosome 4q22-24 region. Loss of heterozygosity of human 4q22-24 is frequently observed in hepatocellular carcinoma (HCC). One hundred and eighteen well-characterized genes have been identified from this region. We took previously sequenced shotgun subclones as templates to amplify overlapping sequences for the genomic segment and constructed a chromosome-region-specific microarray. Using genomic DNA fragments as probes, we detected transcriptional activity from within this region among five different tissues. The hybridization results indicate that there are new transcripts that have not yet been identified by other methods. The existence of new transcripts encoded by genes in this region was confirmed by PCR cloning or cDNA library screening. The procedure reported here allows coupling of shotgun sequencing with transcript mapping and, potentially, detailed analysis of gene expression and chromosomal copy of the genomic sequence for the putative HCC tumor suppressor gene(s) in the 4q candidate region.
Affiliation(s)
- Ling-Hui Li
- Division of Molecular and Genomic Medicine, National Health Research Institutes, Taipei, Taiwan
|
128
|
Zhou Y, Yang L, Wang H, Lu F, Wan H. Prediction of eukaryotic gene structures based on multilevel optimization. ACTA ACUST UNITED AC 2004. [DOI: 10.1007/bf02900313] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Indexed: 11/24/2022]
|
129
|
Xie T, Rowen L, Aguado B, Ahearn ME, Madan A, Qin S, Campbell RD, Hood L. Analysis of the gene-dense major histocompatibility complex class III region and its comparison to mouse. Genome Res 2004; 13:2621-36. [PMID: 14656967 PMCID: PMC403804 DOI: 10.1101/gr.1736803] [Citation(s) in RCA: 80] [Impact Index Per Article: 3.8] [Indexed: 11/25/2022]
Abstract
In mammals, the Major Histocompatibility Complex class I and II gene clusters are separated by an approximately 700-kb stretch of sequence called the MHC class III region, which has been associated with susceptibility to numerous diseases. To facilitate understanding of this medically important and architecturally interesting portion of the genome, we have sequenced and analyzed both the human and mouse class III regions. The cross-species comparison has facilitated the identification of 60 genes in human and 61 in mouse, including a potential RNA gene for which the introns are more conserved across species than the exons. Delineation of global organization, gene structure, alternative splice forms, protein similarities, and potential cis-regulatory elements leads to several conclusions: (1) The human MHC class III region is the most gene-dense region of the human genome: >14% of the sequence is coding, approximately 72% of the region is transcribed, and there is an average of 8.5 genes per 100 kb. (2) Gene sizes, number of exons, and intergenic distances are for the most part similar in both species, implying that interspersed repeats have had little impact in disrupting the tight organization of this densely packed set of genes. (3) The region contains a heterogeneous mixture of genes, only a few of which have a clearly defined and proven function. Although many of the genes are of ancient origin, some appear to exist only in mammals and fish, implying they might be specific to vertebrates. (4) Conserved noncoding sequences are found primarily in or near the 5'-UTR or the first intron of genes, and seldom in the intergenic regions. Many of these conserved blocks are likely to be cis-regulatory elements.
Affiliation(s)
- Tao Xie
- Institute for Systems Biology, Seattle, Washington 98103, USA
|
130
|
Margulies EH, Blanchette M, Haussler D, Green ED. Identification and characterization of multi-species conserved sequences. Genome Res 2004; 13:2507-18. [PMID: 14656959 PMCID: PMC403793 DOI: 10.1101/gr.1602203] [Citation(s) in RCA: 242] [Impact Index Per Article: 11.5] [Indexed: 12/21/2022]
Abstract
Comparative sequence analysis has become an essential component of studies aiming to elucidate genome function. The increasing availability of genomic sequences from multiple vertebrates is creating the need for computational methods that can detect highly conserved regions in a robust fashion. Towards that end, we are developing approaches for identifying sequences that are conserved across multiple species; we call these "Multi-species Conserved Sequences" (or MCSs). Here we report two strategies for MCS identification, demonstrating their ability to detect virtually all known actively conserved sequences (specifically, coding sequences) but very little neutrally evolving sequence (specifically, ancestral repeats). Importantly, we find that a substantial fraction of the bases within MCSs (approximately 70%) resides within non-coding regions; thus, the majority of sequences conserved across multiple vertebrate species has no known function. Initial characterization of these MCSs has revealed sequences that correspond to clusters of transcription factor-binding sites, non-coding RNA transcripts, and other candidate functional elements. Finally, the ability to detect MCSs represents a valuable metric for assessing the relative contribution of a species' sequence to identifying genomic regions of interest, and our results indicate that the currently available genome sequences are insufficient for the comprehensive identification of MCSs in the human genome.
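The abstract does not spell out the two MCS strategies, so the following is only a generic sketch of how conserved regions might be flagged in a multi-species alignment, not the authors' method. It marks a column as conserved when every species carries the same base, then reports merged windows in which at least a chosen fraction of columns is conserved. The function name `conserved_windows`, the window size, and the threshold are all assumptions.

```python
from typing import List, Tuple


def conserved_windows(alignment: List[str], window: int = 5,
                      min_fraction: float = 0.8) -> List[Tuple[int, int]]:
    """Return merged [start, end) column intervals that are mostly conserved.

    Hypothetical sketch: a column is 'conserved' when all rows hold the
    same non-gap character. Rows are equal-length aligned sequences.
    """
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and equal length")
    ncols = len(alignment[0])
    conserved = [
        len({row[i] for row in alignment}) == 1 and alignment[0][i] != "-"
        for i in range(ncols)
    ]
    intervals: List[Tuple[int, int]] = []
    for start in range(ncols - window + 1):
        end = start + window
        if sum(conserved[start:end]) / window >= min_fraction:
            if intervals and start <= intervals[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                intervals[-1] = (intervals[-1][0], end)
            else:
                intervals.append((start, end))
    return intervals


rows = ["ACGTACGTAA",
        "ACGTACTTCC",
        "ACGTACGTGG"]
print(conserved_windows(rows))
```

A realistic scan would also need to weight species by evolutionary distance, which this sketch ignores; identical columns among closely related species carry much less evidence of constraint than identical columns among distant ones.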
Affiliation(s)
- Elliott H Margulies
- Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|
131
|
Abstract
The accurate prediction of higher eukaryotic gene structures and regulatory elements directly from genomic sequences is an important early step in the understanding of newly assembled contigs and finished genomes. As more new genomes are sequenced, comparative approaches are becoming increasingly practical and valuable for predicting genes and regulatory elements. We demonstrate the effectiveness of a comparative method called pattern filtering; it utilizes synteny between two or more genomic segments for the annotation of genomic sequences. Pattern filtering optimally detects the signatures of conserved functional elements despite the stochastic noise inherent in evolutionary processes, allowing more accurate annotation of gene models. We anticipate that pattern filtering will facilitate sequence annotation and the discovery of new functional elements by the genetics and genomics communities.
Affiliation(s)
- Jonathan E Moore
- Molecular Biology Institute, University of California Los Angeles, Los Angeles, CA 90095, USA
|
132
|
Seneff S, Wang C, Burge CB. Gene Structure Prediction Using an Orthologous Gene of Known Exon-Intron Structure. ACTA ACUST UNITED AC 2004; 3:81-90. [PMID: 15693733 DOI: 10.2165/00822942-200403020-00002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/02/2022]
Abstract
Given the availability of complete genome sequences from related organisms, sequence conservation can provide important clues for predicting gene structure. In particular, one should be able to leverage information about known genes in one species to help determine the structures of related genes in another. Such an approach is appealing in that high-quality gene prediction can be achieved for newly sequenced species, such as mouse and puffer fish, using the extensive knowledge that has been accumulated about human genes. This article reports a novel approach to predicting the exon-intron structures of mouse genes by incorporating constraints from orthologous human genes using techniques that have previously been exploited in speech and natural language processing applications. The approach uses a context-free grammar to parse a training corpus of annotated human genes. A statistical training procedure produces a weighted recursive transition network (RTN) intended to capture the general features of a mammalian gene. This RTN is expanded into a finite state transducer (FST) and composed with an FST capturing the specific features of the human orthologue. This model includes a trigram language model on the amino acid sequence as well as exon length constraints. A final stage uses the free software package ClustalW to align the top n candidates in the search space. For a set of 98 orthologous human-mouse pairs, we achieved 96% sensitivity and 97% specificity at the exon level on the mouse genes, given only knowledge gleaned from the annotated human genome.
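The exon-level figures quoted above can be made concrete with a short sketch. The abstract does not define its metrics, so this assumes the convention common in gene-finding evaluations: an exon counts as correct only if both boundaries match exactly, sensitivity is the fraction of annotated exons recovered, and "specificity" is the fraction of predicted exons that are correct. The function name `exon_level_accuracy` and the coordinates are invented for illustration.

```python
from typing import Iterable, Tuple

Exon = Tuple[int, int]  # (start, end) coordinates on the genomic sequence


def exon_level_accuracy(predicted: Iterable[Exon],
                        annotated: Iterable[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) with exact-boundary exon matching.

    Assumed convention, not taken from the paper:
      sensitivity = correct exons / annotated exons
      specificity = correct exons / predicted exons
    """
    pred, true = set(predicted), set(annotated)
    correct = len(pred & true)
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity


# Made-up coordinates: the third prediction misses its 3' boundary by 10 bp,
# and the fourth annotated exon is not predicted at all.
annotated = [(100, 200), (300, 400), (500, 650), (700, 800)]
predicted = [(100, 200), (300, 400), (500, 640)]
print(exon_level_accuracy(predicted, annotated))
```

Because matching is exact, a prediction that overlaps a real exon but misplaces one splice site lowers both numbers, which is why exon-level scores are stricter than nucleotide-level ones.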
Affiliation(s)
- Stephanie Seneff
- Computer Science and Artificial Intelligence Laboratory, 32-G438, Spoken Language Systems Group, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA.
|
133
|
|
134
|
Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B. Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 2003; 4:66. [PMID: 14693042 PMCID: PMC521198 DOI: 10.1186/1471-2105-4-66] [Citation(s) in RCA: 120] [Impact Index Per Article: 5.5] [Received: 09/01/2003] [Accepted: 12/23/2003] [Indexed: 11/10/2022]
Abstract
BACKGROUND Genomic sequence alignment is a powerful method for genome analysis and annotation, as alignments are routinely used to identify functional sites such as genes or regulatory elements. With a growing number of partially or completely sequenced genomes, multiple alignment is playing an increasingly important role in these studies. In recent years, various tools for pair-wise and multiple genomic alignment have been proposed. Some of them are extremely fast, but often efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast search program identifies a chain of strong local sequence similarities. In a second step, regions between these anchor points are aligned using a slower but more accurate method. RESULTS Herein, we present CHAOS, a novel algorithm for rapid identification of chains of local pair-wise sequence similarities. Local alignments calculated by CHAOS are used as anchor points to improve the running time of DIALIGN, a slow but sensitive multiple-alignment tool. We show that this way, the running time of DIALIGN can be reduced by more than 95% for BAC-sized and longer sequences, without affecting the quality of the resulting alignments. We apply our approach to a set of five genomic sequences around the stem-cell-leukemia (SCL) gene and demonstrate that exons and small regulatory elements can be identified by our multiple-alignment procedure. CONCLUSION We conclude that the novel CHAOS local alignment tool is an effective way to significantly speed up global alignment tools such as DIALIGN without reducing the alignment quality. We likewise demonstrate that the DIALIGN/CHAOS combination is able to accurately align short regulatory sequences in distant orthologues.
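The two-step anchored-alignment approach described in the BACKGROUND can be sketched as follows. This is a toy stand-in, not CHAOS or DIALIGN: step 1 uses a greedy chain of exact k-mer matches that are unique in the second sequence, and step 2 uses plain Needleman-Wunsch on the segments between anchors. All function names and scoring parameters are assumptions.

```python
from typing import List, Tuple


def find_anchors(a: str, b: str, k: int = 4) -> List[Tuple[int, int]]:
    """Step 1 (toy): greedy colinear chain of exact k-mer matches unique in b."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    chain: List[Tuple[int, int]] = []
    next_a = next_b = 0  # first positions not yet covered by an anchor
    for i in range(len(a) - k + 1):
        hits = index.get(a[i:i + k], [])
        if i >= next_a and len(hits) == 1 and hits[0] >= next_b:
            chain.append((i, hits[0]))
            next_a, next_b = i + k, hits[0] + k
    return chain


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -1) -> Tuple[str, str]:
    """Step 2 (toy): Needleman-Wunsch on a short inter-anchor segment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def anchored_align(a: str, b: str, k: int = 4) -> Tuple[str, str]:
    """Align only the gaps between anchors; anchors are copied verbatim."""
    parts_a, parts_b = [], []
    pa = pb = 0
    for i, j in find_anchors(a, b, k):
        seg_a, seg_b = global_align(a[pa:i], b[pb:j])
        parts_a += [seg_a, a[i:i + k]]
        parts_b += [seg_b, b[j:j + k]]
        pa, pb = i + k, j + k
    tail_a, tail_b = global_align(a[pa:], b[pb:])
    return "".join(parts_a + [tail_a]), "".join(parts_b + [tail_b])


for row in anchored_align("ACGTTTGACCA", "ACGTGACCA"):
    print(row)
```

The saving comes from the quadratic step running only on the short stretches between anchors rather than on the full sequences, which matches the abstract's point that anchoring shortens running time when the anchors are reliable.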
Affiliation(s)
- Michael Brudno
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Michael Chapman
- Department of Haematology, University of Cambridge, Cambridge Institute for Medical Research, Hills Road, Cambridge CB2 2XY, United Kingdom
- Berthold Göttgens
- Department of Haematology, University of Cambridge, Cambridge Institute for Medical Research, Hills Road, Cambridge CB2 2XY, United Kingdom
- Serafim Batzoglou
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Burkhard Morgenstern
- International Graduate School in Bioinformatics and Genome Research, Universität Bielefeld, Postfach 100131, 33501 Bielefeld, Germany
- University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
|
135
|
Abstract
Multi-species comparisons of DNA sequences are more powerful for discovering functional sequences than pairwise DNA sequence comparisons. Most current computational tools have been designed for pairwise comparisons, and efficient extension of these tools to multiple species will require knowledge of the ideal evolutionary distance to choose and the development of new algorithms for alignment, analysis of conservation, and visualization of results.
Affiliation(s)
- Inna Dubchak
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
|
136
|
Abstract
Comparing the genomes of two different species allows the exploration of a host of intriguing evolutionary and genetic questions.
|
137
|
Abstract
Fifty years after the publication of DNA structure, the whole human genome sequence will be officially finished. This achievement marks the beginning of the task to catalogue every human gene and identify the function and expression pattern of each. Currently, researchers estimate that there are about 30,000 human genes and approximately 70% of these can be automatically predicted using a combination of ab initio and similarity-based programs. However, to experimentally investigate every gene's function, the research community requires a high-quality annotation of alternative splicing, pseudogenes, and promoter regions that can only be provided by manual intervention. Manual curation of the human genome will be a long-term project as experimental data are continually produced to confirm or refine the predictions, and new features such as noncoding RNAs and enhancers have not been fully identified. Such a highly curated human gene-set made publicly available will be a great asset for the experimental community and for future comparative genome projects.
Affiliation(s)
- Jennifer L Ashurst
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
|
138
|
Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VVB, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 2003; 424:788-93. [PMID: 12917688 DOI: 10.1038/nature01858] [Citation(s) in RCA: 421] [Impact Index Per Article: 19.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2003] [Accepted: 06/16/2003] [Indexed: 11/08/2022]
Abstract
The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.
Affiliation(s)
- J W Thomas
- Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|
139
|
Zhao A, Lew JL, Huang L, Yu J, Zhang T, Hrywna Y, Thompson JR, de Pedro N, Blevins RA, Peláez F, Wright SD, Cui J. Human kininogen gene is transactivated by the farnesoid X receptor. J Biol Chem 2003; 278:28765-70. [PMID: 12761213 DOI: 10.1074/jbc.m304568200] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.5] [Indexed: 01/15/2023] Open
Abstract
Human kininogen belongs to the plasma kallikrein-kinin system. High molecular weight kininogen is the precursor for two-chain kinin-free kininogen and bradykinin. It has been shown that the two-chain kinin-free kininogen has the properties of anti-adhesion, anti-platelet aggregation, and anti-thrombosis, whereas bradykinin is a potent vasodilator and mediator of inflammation. In this study we show that the human kininogen gene is strongly up-regulated by agonists of the farnesoid X receptor (FXR), a nuclear receptor for bile acids. In primary human hepatocytes, both the endogenous FXR agonist chenodeoxycholate and the synthetic FXR agonist GW4064 increased kininogen mRNA with a maximum induction of 8-10-fold. A more robust induction of kininogen expression was observed in HepG2 cells, where kininogen mRNA was increased by chenodeoxycholate or GW4064 up to 130-140-fold as shown by real-time PCR. Northern blot analysis confirmed the up-regulation of kininogen expression by FXR agonists. To determine whether kininogen is a direct target of FXR, we examined the sequence of the kininogen promoter and identified a highly conserved FXR response element (inverted repeat, IR-1) in the proximity of the kininogen promoter (-66/-54). FXR/RXRα heterodimers specifically bind to this IR-1. A construct of a minimal promoter with the luciferase reporter containing this IR-1 was transactivated by FXR. Deletion or mutation of this IR-1 abolished FXR-mediated promoter activation, indicating that this IR-1 element is responsible for the promoter transactivation by FXR. We conclude that kininogen is a novel and direct target of FXR, and bile acids may play a role in the vasodilation and anti-coagulation processes.
Affiliation(s)
- Annie Zhao
- Department of Atherosclerosis and Endocrinology, Bioinformatics, and Molecular Profiling, Merck Research Laboratories, Rahway, New Jersey 07065, USA
|
140
|
Cawley S, Pachter L, Alexandersson M. SLAM web server for comparative gene finding and alignment. Nucleic Acids Res 2003; 31:3507-9. [PMID: 12824355 PMCID: PMC168989 DOI: 10.1093/nar/gkg583] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 02/14/2003] [Revised: 04/03/2003] [Accepted: 04/03/2003] [Indexed: 11/14/2022] Open
Abstract
SLAM is a program that simultaneously aligns and annotates pairs of homologous sequences. The SLAM web server integrates SLAM with repeat masking tools and the AVID alignment program to allow for rapid alignment and gene prediction in user submitted sequences. Along with annotations and alignments for the submitted sequences, users obtain a list of predicted conserved non-coding sequences (and their associated alignments). The web site also links to whole genome annotations of the human, mouse and rat genomes produced with the SLAM program. The server can be accessed at http://bio.math.berkeley.edu/slam.
Affiliation(s)
- Simon Cawley
- Affymetrix Inc., 6550 Vallejo St, Suite 100, Emeryville, CA 94608, USA.
|
141
|
Zhang L, Pavlovic V, Cantor CR, Kasif S. Human-mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res 2003; 13:1190-202. [PMID: 12743024 PMCID: PMC403647 DOI: 10.1101/gr.703903] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Received: 08/09/2002] [Accepted: 02/03/2003] [Indexed: 11/24/2022]
Abstract
The identification of genes in the human genome remains a challenge, as the actual predictions appear to disagree tremendously and vary dramatically on the basis of the specific gene-finding methodology used. Because the pattern of conservation in coding regions is expected to be different from that in intronic or intergenic regions, a comparative computational analysis can lead, in principle, to an improved computational identification of genes in the human genome by using a reference, such as the mouse genome. However, this comparative methodology critically depends on three important factors: (1) the selection of the most appropriate reference genome (in particular, it is not clear whether the mouse is at the correct evolutionary distance from the human to provide sufficiently distinctive conservation levels in different genomic regions); (2) the selection of comparative features that provide the most benefit to gene recognition; and (3) the selection of an evidence integration architecture that effectively interprets the comparative features. We address the first question by a novel evolutionary analysis that allows us to explicitly correlate the performance of the gene recognition system with the evolutionary distance (time) between the two genomes. Our simulation results indicate that there is a wide range of reference genomes at different evolutionary time points that appear to deliver reasonable comparative prediction of human genes. In particular, the evolutionary time between human and mouse generally falls in the region of good performance; however, better accuracy might be achieved with a reference genome more distant than mouse. To address the second question, we propose several natural comparative measures of conservation for identifying exons and exon boundaries. Finally, we experiment with Bayesian networks for the integration of comparative and compositional evidence.
Affiliation(s)
- Lingang Zhang
- Center for Advanced Biotechnology, Boston University, Boston, Massachusetts 02215, USA
|
142
|
Modrek B, Lee CJ. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 2003; 34:177-80. [PMID: 12730695 DOI: 10.1038/ng1159] [Citation(s) in RCA: 399] [Impact Index Per Article: 18.1] [Received: 03/10/2003] [Accepted: 03/28/2003] [Indexed: 12/31/2022]
Abstract
One of the most interesting opportunities in comparative genomics is to compare not only genome sequences but additional phenomena, such as alternative splicing, using orthologous genes in different genomes to find similarities and differences between organisms. Recently, genomics studies have suggested that 40-60% of human genes are alternatively spliced and have catalogued up to 30,000 alternative splice relationships in human genes. Here we report an analysis of 9,434 orthologous genes in human and mouse, which indicates that alternative splicing is associated with a large increase in frequency of recent exon creation and/or loss. Whereas most exons in the mouse and human genomes are strongly conserved in both genomes, exons that are only included in alternative splice forms (as opposed to the constitutive or major transcript form) are mostly not conserved and thus are the product of recent exon creation or loss events. A similar comparison of orthologous exons in rat and human validates this pattern. Although this says nothing about the complex question of adaptive benefit, it does indicate that alternative splicing in these genomes has been associated with increased evolutionary change.
Affiliation(s)
- Barmak Modrek
- Molecular Biology Institute, Center for Genomics and Proteomics and Dept. of Chemistry & Biochemistry, University of California Los Angeles, Los Angeles, California 90095, USA
|
143
|
Bogue CW. Genetic Models in Applied Physiology. Functional genomics in the mouse: powerful techniques for unraveling the basis of human development and disease. J Appl Physiol (1985) 2003; 94:2502-9. [PMID: 12736192 DOI: 10.1152/japplphysiol.00209.2003] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022] Open
Abstract
Now that near-complete DNA sequences of both the mouse and human genomes are available, the next major challenge will be to determine how each of these genes functions, both alone and in combination with other genes in the genome. The mouse has a long and rich history in biological research, and many consider it a model organism for the study of human development and disease. Over the past few years, exciting progress has been made in developing techniques for chromosome engineering, mutagenesis, mapping and maintenance of mutations, and identification of mutant genes in the mouse. In this mini-review, many of these powerful techniques will be presented along with their application to the study of development, physiology, and disease.
Affiliation(s)
- Clifford W Bogue
- Yale Child Health Research Center, Section of Critical Care and Applied Physiology, Department of Pediatrics, Yale University School of Medicine, New Haven, Connecticut 06519, USA.
|
144
|
Abstract
Pattern formation in the mouse preimplantation embryo is tightly regulated and essential for successful development. Wnt genes are known to regulate cell interactions and cell fate in invertebrates and vertebrates and, therefore, may play a role in the specification of cell lineages and cellular interactions that occur in preimplantation development. Using degenerate primers based on conserved protein sequences in Wnt coding regions, we have found evidence for Wnt gene expression at the blastocyst stage of mouse preimplantation development. We have identified sequences encoding Wnt3a and Wnt4 and confirmed that these are present as transcripts in early development by using reverse transcriptase-polymerase chain reaction (RT-PCR) with specific primers located in the 5' half of these Wnt genes. Studies on the timing of expression showed that Wnt3a transcripts were present in 2-cell embryos; these may represent maternally or embryonically derived transcripts, since the major transition from maternal to zygotic gene expression occurs during the late 2-cell stage. Both Wnt3a and Wnt4 transcripts were detected in some precompact 4/8-cell stages, with consistent expression detected in all compact 8-cell, 16-cell and blastocyst stages. To our knowledge, expression of Wnt genes has not been previously described at such an early stage of mammalian development.
Affiliation(s)
- Susan Lloyd
- School of Medicine, Mailpoint 813, Southampton General Hospital, SO16 6YD Southampton, UK
|
145
|
Thanaraj TA, Clark F, Muilu J. Conservation of human alternative splice events in mouse. Nucleic Acids Res 2003; 31:2544-52. [PMID: 12736303 PMCID: PMC156037 DOI: 10.1093/nar/gkg355] [Citation(s) in RCA: 101] [Impact Index Per Article: 4.6] [Indexed: 12/25/2022] Open
Abstract
Human and mouse genomes share similar long-range sequence organization, and most of their genes are homologous. As alternative splicing is a frequent and important aspect of gene regulation, it is of interest to assess the level of conservation of alternative splicing. We examined mouse transcript data sets (EST and mRNA) for the presence of transcripts that both form spliced alignments with the draft mouse genome sequence and demonstrate conservation of human transcript-confirmed alternative and constitutive splice junctions. This revealed 15% of alternative and 67% of constitutive splice junctions as conserved; however, these numbers are patently dependent on the extent of transcript coverage. Transcript coverage of conserved splice patterns is found to correlate well between human and mouse. A model, which extrapolates from observed levels of conservation at increasing levels of transcript support, estimates overall conservation of 61% of alternative and 74% of constitutive splice junctions, albeit with broad confidence intervals. Observed numbers of conserved alternative splicing events agreed with those expected on the basis of the model. Thus, it is apparent that many, and probably most, alternative splicing events are conserved between human and mouse. This, combined with the preservation of alternative frame stop codons in conserved frame-breaking events, indicates a high level of commonality in patterns of gene expression between these two species.
Affiliation(s)
- T A Thanaraj
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.
|
146
|
Pozzoli U, Elgar G, Cagliani R, Riva L, Comi GP, Bresolin N, Bardoni A, Sironi M. Comparative analysis of vertebrate dystrophin loci indicate intron gigantism as a common feature. Genome Res 2003; 13:764-72. [PMID: 12727896 PMCID: PMC430921 DOI: 10.1101/gr.776503] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
Abstract
The human DMD gene is the largest known to date, spanning > 2000 kb on the X chromosome. The gene size is mainly accounted for by huge intronic regions. We sequenced 190 kb of Fugu rubripes (pufferfish) genomic DNA corresponding to the complete dystrophin gene (FrDMD) and provide the first report of gene structure and sequence comparison among dystrophin genomic sequences from different vertebrate organisms. Almost all intron positions and phases are conserved between FrDMD and its mammalian counterparts, and the predicted protein product of the Fugu gene displays 55% identity and 71% similarity to human dystrophin. In analogy to the human gene, FrDMD presents several-fold longer than average intronic regions. Analysis of intron sequences of the human and murine genes revealed that they are extremely conserved in size and that a similar fraction of total intron length is represented by repetitive elements; moreover, our data indicate that intron expansion through repeat accumulation in the two orthologs is the result of independent insertional events. The hypothesis that intron length might be functionally relevant to the DMD gene regulation is proposed and substantiated by the finding that dystrophin intron gigantism is common to the three vertebrate genes.
Affiliation(s)
- Uberto Pozzoli
- IRCCS E. Medea, Associazione La Nostra Famiglia, 23842 Bosisio Parini (LC), Italy.
|
147
|
Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 2003; 13:721-31. [PMID: 12654723 PMCID: PMC430158 DOI: 10.1101/gr.926603] [Citation(s) in RCA: 770] [Impact Index Per Article: 35.0] [Received: 10/23/2002] [Accepted: 12/11/2002] [Indexed: 11/25/2022]
Abstract
To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. We present LAGAN, a system for rapid global alignment of two homologous genomic sequences, and Multi-LAGAN, a system for multiple global alignment of genomic sequences. We tested our systems on a data set consisting of greater than 12 Mb of high-quality sequence from 12 vertebrate species. All the sequence was derived from the genomic region orthologous to an approximately 1.5-Mb region on human chromosome 7q31.3. We found that both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu. Multi-LAGAN produced the most accurate alignments, while requiring just 75 minutes on a personal computer to obtain the multiple alignment of all 12 sequences. Multi-LAGAN is a practical method for generating multiple alignments of long genomic sequences at any evolutionary distance. Our systems are publicly available at http://lagan.stanford.edu.
Affiliation(s)
- Michael Brudno
- Department of Computer Science, Stanford University, Stanford, California 94305-9010, USA
|
148
|
Ureta-Vidal A, Ettwiller L, Birney E. Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet 2003; 4:251-62. [PMID: 12671656 DOI: 10.1038/nrg1043] [Citation(s) in RCA: 143] [Impact Index Per Article: 6.5] [Indexed: 11/10/2022]
Abstract
The increasing number of complete and nearly complete metazoan genome sequences provides a significant amount of material for large-scale comparative genomic analysis. Finding new effective methods to analyse such enormous datasets has been the object of intense research. Three main areas in comparative genomics have recently shown important developments: whole-genome alignment, gene prediction and regulatory-region prediction. Each of these areas improves the methods of deciphering long genomic sequences and uncovering what lies hidden in them.
Affiliation(s)
- Abel Ureta-Vidal
- EnsEMBL Project, Room A2-06, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
149
|
Alexandersson M, Cawley S, Pachter L. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res 2003; 13:496-502. [PMID: 12618381 PMCID: PMC430255 DOI: 10.1101/gr.424203] [Citation(s) in RCA: 120] [Impact Index Per Article: 5.5] [Received: 05/13/2002] [Accepted: 12/03/2002] [Indexed: 11/25/2022]
Abstract
Comparative-based gene recognition is driven by the principle that conserved regions between related organisms are more likely than divergent regions to be coding. We describe a probabilistic framework for gene structure and alignment that can be used to simultaneously find both the gene structure and alignment of two syntenic genomic regions. A key feature of the method is the ability to enhance gene predictions by finding the best alignment between two syntenic sequences, while at the same time finding biologically meaningful alignments that preserve the correspondence between coding exons. Our probabilistic framework is the generalized pair hidden Markov model, a hybrid of (1) generalized hidden Markov models, which have been used previously for gene finding, and (2) pair hidden Markov models, which have applications to sequence alignment. We have built a gene finding and alignment program called SLAM, which aligns and identifies complete exon/intron structures of genes in two related but unannotated sequences of DNA. SLAM is able to reliably predict gene structures for any suitably related pair of organisms, most notably with fewer false-positive predictions compared to previous methods (examples are provided for Homo sapiens/Mus musculus and Plasmodium falciparum/Plasmodium vivax comparisons). Accuracy is obtained by distinguishing conserved noncoding sequence (CNS) from conserved coding sequence. CNS annotation is a novel feature of SLAM and may be useful for the annotation of UTRs, regulatory elements, and other noncoding features.
Affiliation(s)
- Marina Alexandersson
- Department of Statistics, University of California, Berkeley, California 94720, USA
|
150
|
Guigo R, Dermitzakis ET, Agarwal P, Ponting CP, Parra G, Reymond A, Abril JF, Keibler E, Lyle R, Ucla C, Antonarakis SE, Brent MR. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 2003; 100:1140-5. [PMID: 12552088 PMCID: PMC298740 DOI: 10.1073/pnas.0337561100] [Citation(s) in RCA: 88] [Impact Index Per Article: 4.0] [Received: 10/21/2002] [Accepted: 12/11/2002] [Indexed: 11/18/2022] Open
Abstract
A primary motivation for sequencing the mouse genome was to accelerate the discovery of mammalian genes by using sequence conservation between mouse and human to identify coding exons. Achieving this goal proved challenging because of the large proportion of the mouse and human genomes that is apparently conserved but apparently does not code for protein. We developed a two-stage procedure that exploits the mouse and human genome sequences to produce a set of genes with a much higher rate of experimental verification than previously reported prediction methods. RT-PCR amplification and direct sequencing applied to an initial sample of mouse predictions that do not overlap previously known genes verified the regions flanking one intron in 139 predictions, with verification rates reaching 76%. On average, the confirmed predictions show more restricted expression patterns than the mouse orthologs of known human genes, and two-thirds lack homologs in fish genomes, demonstrating the sensitivity of this dual-genome approach to hard-to-find genes. We verified 112 previously unknown homologs of known proteins, including two homeobox proteins relevant to developmental biology, an aquaporin, and a homolog of dystrophin. We estimate that transcription and splicing can be verified for >1,000 gene predictions identified by this method that do not overlap known genes. This is likely to constitute a significant fraction of the previously unknown, multiexon mammalian genes.
Affiliation(s)
- Roderic Guigo
- Research Group in Biomedical Informatics, Institut Municipal d'Investigació Mèdica/Universitat Pompeu Fabra/Centre de Regulació Genòmica, E08003 Barcelona, Catalonia, Spain
|