151
Wakefield MJ, Graves JAM. The kangaroo genome. Leaps and bounds in comparative genomics. EMBO Rep 2003; 4:143-7. [PMID: 12612602 PMCID: PMC1315837 DOI: 10.1038/sj.embor.embor739] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.5] [Received: 08/07/2002] [Accepted: 12/18/2002] [Indexed: 11/09/2022] Open
Abstract
The kangaroo genome is a rich and unique resource for comparative genomics. Marsupial genetics and cytology have made significant contributions to the understanding of gene function and evolution, and increasing the availability of kangaroo DNA sequence information would provide these benefits on a genomic scale. Here we summarize the contributions from cytogenetic and genetic studies of marsupials, describe the genomic resources currently available and those being developed, and explore the benefits of a kangaroo genome project.
Affiliation(s)
- Matthew J Wakefield
- Research School of Biological Sciences, The Australian National University, Canberra, ACT 0200, Australia.
152
Abstract
Phylogenetic footprinting is an approach to finding functionally important sequences in the genome that relies on detecting their high degrees of conservation across different species. A new study shows how much it improves the prediction of gene-regulatory elements in the human genome.
Affiliation(s)
- Zhaolei Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
- Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520-8114, USA
153
Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW, Guigó R. Comparative gene prediction in human and mouse. Genome Res 2003; 13:108-17. [PMID: 12529313 PMCID: PMC430976 DOI: 10.1101/gr.871403] [Citation(s) in RCA: 155] [Impact Index Per Article: 7.0] [Received: 11/04/2002] [Accepted: 11/15/2002] [Indexed: 11/25/2022]
Abstract
The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions on both the human and mouse by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.
Affiliation(s)
- Genís Parra
- Grup de Recerca en Informàtica Biomèdica. Institut Municipal d'Investigació Medica / Universitat Pompeu Fabra / Centre de Regulació Genòmica 08003 Barcelona, Catalonia, Spain
154
155
Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I. Strategies and tools for whole-genome alignments. Genome Res 2003; 13:73-80. [PMID: 12529308 PMCID: PMC430965 DOI: 10.1101/gr.762503] [Citation(s) in RCA: 159] [Impact Index Per Article: 7.2] [Received: 09/04/2002] [Accepted: 11/06/2002] [Indexed: 11/25/2022]
Abstract
The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We investigated different strategies of alignment for the subsequent analysis of conservation of genomes that are effective for assemblies of different quality. These strategies were applied to the comparison of the working draft of the human genome with the Mouse Genome Sequencing Consortium assembly, as well as other intermediate mouse assemblies. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90% of known coding exons in the human genome. We obtained such coverage while preserving specificity. With a view towards the end user, we developed a suite of tools and Web sites for automatically aligning and subsequently browsing and working with whole-genome comparisons. We describe the use of these tools to identify conserved non-coding regions between the human and mouse genomes, some of which have not been identified by other methods.
Affiliation(s)
- Olivier Couronne
- Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
156
Bray N, Dubchak I, Pachter L. AVID: A global alignment program. Genome Res 2003; 13:97-102. [PMID: 12529311 PMCID: PMC430967 DOI: 10.1101/gr.789803] [Citation(s) in RCA: 286] [Impact Index Per Article: 13.0] [Received: 09/09/2002] [Accepted: 11/07/2002] [Indexed: 11/25/2022]
Abstract
In this paper we describe a new global alignment method called AVID. The method is designed to be fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabases long. We present numerous applications of the method, ranging from the comparison of assemblies to alignment of large syntenic genomic regions and whole genome human/mouse alignments. We have also performed a quantitative comparison of AVID with other popular alignment tools. To this end, we have established a format for the representation of alignments and methods for their comparison. These formats and methods should be useful for future studies. The tools we have developed for the alignment comparisons, as well as the AVID program, are publicly available. See Web Site References section for AVID Web address and Web addresses for other programs discussed in this paper.
Affiliation(s)
- Nick Bray
- Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
157
Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC. Cross-species sequence comparisons: a review of methods and available resources. Genome Res 2003; 13:1-12. [PMID: 12529301 PMCID: PMC430969 DOI: 10.1101/gr.222003] [Citation(s) in RCA: 155] [Impact Index Per Article: 7.0] [Indexed: 11/25/2022]
Abstract
With the availability of whole-genome sequences for an increasing number of species, we are now faced with the challenge of decoding the information contained within these DNA sequences. Comparative analysis of DNA sequences from multiple species at varying evolutionary distances is a powerful approach for identifying coding and functional noncoding sequences, as well as sequences that are unique for a given organism. In this review, we outline the strategy for choosing DNA sequences from different species for comparative analyses and describe the methods used and the resources publicly available for these studies.
Affiliation(s)
- Kelly A Frazer
- Perlegen Sciences, Mountain View, California 94043, USA.
158
Flicek P, Keibler E, Hu P, Korf I, Brent MR. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res 2003; 13:46-54. [PMID: 12529305 PMCID: PMC430948 DOI: 10.1101/gr.830003] [Citation(s) in RCA: 80] [Impact Index Per Article: 3.6] [Indexed: 11/24/2022]
Abstract
The availability of draft sequences for both the mouse and human genomes makes it possible, for the first time, to annotate whole mammalian genomes using comparative methods. TWINSCAN is a gene-prediction system that combines the methods of single-genome predictors like GENSCAN with information derived from genome comparison, thereby improving accuracy. Because TWINSCAN uses genomic sequence only, it is less biased toward highly and/or ubiquitously expressed genes than GENEWISE, GENOMESCAN, and other methods based on evidence derived from transcripts. We show that TWINSCAN improves gene prediction in human using intermediate products from various stages of the sequencing and analysis of the mouse genome, from low-redundancy, whole-genome shotgun reads to the draft assembly and the synteny map. TWINSCAN improves on the prior state of the art even when alignments from only 1X coverage of the mouse genome are available. Gene prediction accuracy improves steadily from 1X through 3X, more slowly from 3X to 4X, and relatively little thereafter. The assembly and the synteny map greatly speed the computations, however. Our human annotation using the mouse assembly is conservative, predicting only 25,622 genes, and appears to be one of the best de novo annotations of the human genome to date.
Affiliation(s)
- Paul Flicek
- Department of Computer Science and Engineering, Washington University, St. Louis, Missouri 63130, USA
159
Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, O'Connor MJ, Schwartz S, Miller W, Chiaromonte F. Distinguishing regulatory DNA from neutral sites. Genome Res 2003; 13:64-72. [PMID: 12529307 PMCID: PMC430974 DOI: 10.1101/gr.817703] [Citation(s) in RCA: 104] [Impact Index Per Article: 4.7] [Indexed: 01/08/2023]
Abstract
We explore several computational approaches to analyzing interspecies genomic sequence alignments, aiming to distinguish regulatory regions from neutrally evolving DNA. Human-mouse genomic alignments were collected for three sets of human regions: (1) experimentally defined gene regulatory regions, (2) well-characterized exons (coding sequences, as a positive control), and (3) interspersed repeats thought to have inserted before the human-mouse split (a good model for neutrally evolving DNA). Models that potentially could distinguish functional noncoding sequences from neutral DNA were evaluated on these three data sets, as well as bulk genome alignments. Our analyses show that discrimination based on frequencies of individual nucleotide pairs or gaps (i.e., of possible alignment columns) is only partially successful. In contrast, scoring procedures that include the alignment context, based on frequencies of short runs of alignment columns, dramatically improve separation between regulatory and neutral features. Such scoring functions should aid in the identification of putative regulatory regions throughout the human genome.
Affiliation(s)
- Laura Elnitski
- Departments of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
160
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigó R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES. Initial sequencing and comparative analysis of the mouse genome. Nature 2002; 420:520-62. [PMID: 12466850 DOI: 10.1038/nature01262] [Citation(s) in RCA: 4941] [Impact Index Per Article: 214.8] [Received: 09/18/2002] [Accepted: 10/31/2002] [Indexed: 12/18/2022]
Abstract
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
MESH Headings
- Animals
- Base Composition
- Chromosomes, Mammalian/genetics
- Conserved Sequence/genetics
- CpG Islands/genetics
- Evolution, Molecular
- Gene Expression Regulation
- Genes/genetics
- Genetic Variation/genetics
- Genome
- Genome, Human
- Genomics
- Humans
- Mice/classification
- Mice/genetics
- Mice, Knockout
- Mice, Transgenic
- Models, Animal
- Multigene Family/genetics
- Mutagenesis
- Neoplasms/genetics
- Physical Chromosome Mapping
- Proteome/genetics
- Pseudogenes/genetics
- Quantitative Trait Loci/genetics
- RNA, Untranslated/genetics
- Repetitive Sequences, Nucleic Acid/genetics
- Selection, Genetic
- Sequence Analysis, DNA
- Sex Chromosomes/genetics
- Species Specificity
- Synteny
161
Kan Z, States D, Gish W. Selecting for functional alternative splices in ESTs. Genome Res 2002; 12:1837-45. [PMID: 12466287 PMCID: PMC187565 DOI: 10.1101/gr.764102] [Citation(s) in RCA: 141] [Impact Index Per Article: 6.1] [Received: 01/25/2002] [Accepted: 09/30/2002] [Indexed: 11/24/2022]
Abstract
The expressed sequence tag (EST) collection in dbEST provides an extensive resource for detecting alternative splicing on a genomic scale. Using genomically aligned ESTs, a computational tool (TAP) was used to identify alternative splice patterns for 6400 known human genes from the RefSeq database. With sufficient EST coverage, one or more alternatively spliced forms could be detected for nearly all genes examined. To identify high (>95%) confidence observations of alternative splicing, splice variants were clustered on the basis of having mutually exclusive structures, and sample statistics were then applied. Through this selection, alternative splices expected at a frequency of >5% within their respective clusters were seen for only 17%-28% of genes. Although intron retention events (potentially unspliced messages) had been seen for 36% of the genes overall, the same statistical selection yielded reliable cases of intron retention for <5% of genes. For high-confidence alternative splices in the human ESTs, we also noted significantly higher rates both of cross-species conservation in mouse ESTs and of validation in the GenBank mRNA collection. We suggest quantitative analytical approaches such as these can aid in selecting useful targets for further experimental characterization and in so doing may help elucidate the mechanisms and biological implications of alternative splicing.
Affiliation(s)
- Zhengyan Kan
- Department of Genetics, Washington University, St. Louis, Missouri 63110, USA
162
Elnitski L, Riemer C, Petrykowska H, Florea L, Schwartz S, Miller W, Hardison R. PipTools: a computational toolkit to annotate and analyze pairwise comparisons of genomic sequences. Genomics 2002; 80:681-90. [PMID: 12504859 DOI: 10.1006/geno.2002.7018] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
Sequence conservation between species is useful both for locating coding regions of genes and for identifying functional noncoding segments. Hence interspecies alignment of genomic sequences is an important computational technique. However, its utility is limited without extensive annotation. We describe a suite of software tools, PipTools, and related programs that facilitate the annotation of genes and putative regulatory elements in pairwise alignments. The alignment server PipMaker uses the output of these tools to display detailed information needed to interpret alignments. These programs are provided in a portable format for use on common desktop computers and both the toolkit and the PipMaker server can be found at our Web site (http://bio.cse.psu.edu/). We illustrate the utility of the toolkit using annotation of a pairwise comparison of the mouse MHC class II and class III regions with orthologous human sequences and subsequently identify conserved, noncoding sequences that are DNase I hypersensitive sites in chromatin of mouse cells.
Affiliation(s)
- Laura Elnitski
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
163
Pachter L, Alexandersson M, Cawley S. Applications of generalized pair hidden Markov models to alignment and gene finding problems. J Comput Biol 2002; 9:389-99. [PMID: 12015888 DOI: 10.1089/10665270252935520] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022] Open
Abstract
Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene finding and annotation. Alignment problems can be solved with pair HMMs, while gene finding programs rely on generalized HMMs in order to model exon lengths. In this paper, we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding and describe applications to DNA-cDNA and DNA-protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems.
Affiliation(s)
- Lior Pachter
- Department of Mathematics, University of California Berkeley, Berkeley, CA 94720, USA.
164
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.1] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022] Open
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
165
Toyoda A, Noguchi H, Taylor TD, Ito T, Pletcher MT, Sakaki Y, Reeves RH, Hattori M. Comparative genomic sequence analysis of the human chromosome 21 Down syndrome critical region. Genome Res 2002; 12:1323-32. [PMID: 12213769 PMCID: PMC186650 DOI: 10.1101/gr.153702] [Citation(s) in RCA: 42] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
Abstract
Comprehensive knowledge of the gene content of human chromosome 21 (HSA21) is essential for understanding the etiology of Down syndrome (DS). Here we report the largest comparison of finished mouse and human sequence to date for a 1.35-Mb region of mouse chromosome 16 (MMU16) that corresponds to human chromosome 21q22.2. This includes a portion of the commonly described "DS critical region," thought to contain a gene or genes whose dosage imbalance contributes to a number of phenotypes associated with DS. We used comparative sequence analysis to construct a DNA feature map of this region that includes all known genes, plus 144 conserved sequences ≥100 bp long that show ≥80% identity between mouse and human but do not match known exons. Twenty of these have matches to expressed sequence tag and cDNA databases, indicating that they may be transcribed sequences from chromosome 21. Eight putative CpG islands are found at conserved positions. Models for two human genes, DSCR4 and DSCR8, are not supported by conserved sequence, and close examination indicates that low-level transcripts from these loci are unlikely to encode proteins. Gene prediction programs give different results when used to analyze the well-conserved regions between mouse and human sequences. Our findings have implications for evolution and for modeling the genetic basis of DS in mice.
Affiliation(s)
- Atsushi Toyoda
- Human Genome Research Group, Genomic Sciences Center, RIKEN Yokohama Institute, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, Japan
166
Abstract
The human genome sequence is the book of our life. Buried in this large volume are our genes, which are scattered as small DNA fragments throughout the genome and comprise a small percentage of the total text. Finding these indistinct 'needles' in a vast genomic 'haystack' can be extremely challenging. In response to this challenge, computational prediction approaches have proliferated in recent years that predict the location and structure of genes. Here, I discuss these approaches and explain why they have become essential for the analyses of newly sequenced genomes.
Affiliation(s)
- Michael Q Zhang
- Watson School of Biological Sciences, Cold Spring Harbor Laboratory, 1 Bungtown Road, PO Box 100, Cold Spring Harbor, New York 11724, USA.
167
Gregory SG, Sekhon M, Schein J, Zhao S, Osoegawa K, Scott CE, Evans RS, Burridge PW, Cox TV, Fox CA, Hutton RD, Mullenger IR, Phillips KJ, Smith J, Stalker J, Threadgold GJ, Birney E, Wylie K, Chinwalla A, Wallis J, Hillier L, Carter J, Gaige T, Jaeger S, Kremitzki C, Layman D, Maas J, McGrane R, Mead K, Walker R, Jones S, Smith M, Asano J, Bosdet I, Chan S, Chittaranjan S, Chiu R, Fjell C, Fuhrmann D, Girn N, Gray C, Guin R, Hsiao L, Krzywinski M, Kutsche R, Lee SS, Mathewson C, McLeavy C, Messervier S, Ness S, Pandoh P, Prabhu AL, Saeedi P, Smailus D, Spence L, Stott J, Taylor S, Terpstra W, Tsai M, Vardy J, Wye N, Yang G, Shatsman S, Ayodeji B, Geer K, Tsegaye G, Shvartsbeyn A, Gebregeorgis E, Krol M, Russell D, Overton L, Malek JA, Holmes M, Heaney M, Shetty J, Feldblyum T, Nierman WC, Catanese JJ, Hubbard T, Waterston RH, Rogers J, de Jong PJ, Fraser CM, Marra M, McPherson JD, Bentley DR. A physical map of the mouse genome. Nature 2002; 418:743-50. [PMID: 12181558 DOI: 10.1038/nature00957] [Citation(s) in RCA: 205] [Impact Index Per Article: 8.9] [Indexed: 11/08/2022]
Abstract
A physical map of a genome is an essential guide for navigation, allowing the location of any gene or other landmark in the chromosomal DNA. We have constructed a physical map of the mouse genome that contains 296 contigs of overlapping bacterial clones and 16,992 unique markers. The mouse contigs were aligned to the human genome sequence on the basis of 51,486 homology matches, thus enabling use of the conserved synteny (correspondence between chromosome blocks) of the two genomes to accelerate construction of the mouse map. The map provides a framework for assembly of whole-genome shotgun sequence data, and a tile path of clones for generation of the reference sequence. Definition of the human-mouse alignment at this level of resolution enables identification of a mouse clone that corresponds to almost any position in the human genome. The human sequence may be used to facilitate construction of other mammalian genome maps using the same strategy.
Affiliation(s)
- Simon G Gregory
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
168
Peterson KA, King BL, Hagge-Greenberg A, Roix JJ, Bult CJ, O'Brien TP. Functional and comparative genomic analysis of the piebald deletion region of mouse chromosome 14. Genomics 2002; 80:172-84. [PMID: 12160731 DOI: 10.1006/geno.2002.6818] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
Several developmentally important genomic regions map within the piebald deletion complex on distal mouse chromosome 14. We have combined computational gene prediction and comparative sequence analysis to characterize an approximately 4.3-Mb segment of the piebald region to identify candidate genes for the phenotypes presented by homozygous deletion mice. As a result we have ordered 13 deletion breakpoints, integrated the sequence with markers from a bacterial artificial chromosome (BAC) physical map, and identified 16 known or predicted genes and >1500 conserved sequence elements (CSEs) across the region. The candidate genes identified include Phr1 (formerly Pam) and Spry2, which are mouse homologs of genes required for development in Drosophila melanogaster. Gene content, order, and position are highly conserved between mouse chromosome 14 and the orthologous region of human chromosome 13. Our studies combining computational gene prediction with genetic and comparative genomic analyses provide insight regarding the functional composition and organization of this defined chromosomal region.
169
van der Leij FR, Cox KB, Jackson VN, Huijkman NCA, Bartelds B, Kuipers JRG, Dijkhuizen T, Terpstra P, Wood PA, Zammit VA, Price NT. Structural and functional genomics of the CPT1B gene for muscle-type carnitine palmitoyltransferase I in mammals. J Biol Chem 2002; 277:26994-7005. [PMID: 12015320 DOI: 10.1074/jbc.m203189200] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.6] [Indexed: 01/14/2023] Open
Abstract
Muscle-type carnitine palmitoyltransferase I (M-CPT I) is a key enzyme in the control of beta-oxidation of long-chain fatty acids in the heart and skeletal muscle. Because knowledge of the mammalian genes encoding M-CPT I may aid in studies of disturbed energy metabolism, we obtained new genomic and cDNA data for M-CPT I for the human, mouse, rat, and sheep. The introns of these compact genes are 80% (mouse versus rat) and 60% (mouse versus human) identical. Sheep and goat, but not cow, pig, rodent, or human promoter sequences contain a short interspersed repeated sequence (SINE) upstream of highly conserved regulatory elements. These elements constitute two promoters in humans, sheep, and mice, and, contrary to previous reports, there is a second promoter in rats as well. Thus, the transcriptional organization of these genes is more uniform than previously supposed, with interspecies differences in the 5'-ends of the mRNAs reflecting differences in splicing; only in humans extensive splicing and splice variation is found in the 5'- and 3'-untranslated regions. In the mouse, intron retention was detected in heart, muscle, and testes and may indicate an additional mechanism of regulation of M-CPT I expression. Splice variation in the coding region was previously proposed to lead to expression of CPT I enzymes with altered malonyl-CoA sensitivity (Yu, G. S., Lu, Y. C., and Gulick, T. (1998) Biochem. J. 334, 225-231). However, when expressed in the yeast Pichia pastoris, none of three earlier described splice variants had CPT I activity. Therefore, the involvement of splice variation of M-CPT I in the modulation of malonyl-CoA inhibition of fatty acid oxidation may be less relevant than hitherto assumed.
Affiliation(s)
- Feike R van der Leij
- Department of Pediatrics, Groningen University Institute for Drug Exploration, University of Groningen and Beatrix Children's Hospital, Groningen 9700RB, The Netherlands.
|
170
|
Walker M, Pavlovic V, Kasif S. A comparative genomic method for computational identification of prokaryotic translation initiation sites. Nucleic Acids Res 2002; 30:3181-91. [PMID: 12136100 PMCID: PMC135744 DOI: 10.1093/nar/gkf423] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
The ever-growing number of completely sequenced prokaryotic genomes facilitates cross-species comparisons by genomic annotation algorithms. This paper introduces a new probabilistic framework for comparative genomic analysis and demonstrates its utility in the context of improving the accuracy of prokaryotic gene start site detection. Our framework employs a product hidden Markov model (PROD-HMM) with state architecture to model the species-specific trinucleotide frequency patterns in sequences immediately upstream and downstream of a translation start site and to detect the contrasting non-synonymous (amino acid changing) and synonymous (silent) substitution rates that differentiate prokaryotic coding from intergenic regions. Depending on the intricacy of the features modeled by the hidden state architecture, intergenic, regulatory, promoter and coding regions can be delimited by this method. The new system is evaluated using a preliminary set of orthologous Pyrococcus gene pairs, for which it demonstrates an improved accuracy of detection. Its robustness is confirmed by analysis with cross-validation of an experimentally verified set of Escherichia coli K-12 and Salmonella typhimurium LT2 orthologs. The novel architecture has a number of attractive features that distinguish it from previous comparative models such as pair-HMMs.
Affiliation(s)
- Megon Walker
- Bioinformatics Program, Boston University, Boston, MA 02215, USA
|
171
|
Chureau C, Prissette M, Bourdet A, Barbe V, Cattolico L, Jones L, Eggen A, Avner P, Duret L. Comparative sequence analysis of the X-inactivation center region in mouse, human, and bovine. Genome Res 2002; 12:894-908. [PMID: 12045143 PMCID: PMC1383731 DOI: 10.1101/gr.152902] [Citation(s) in RCA: 125] [Impact Index Per Article: 5.4] [Indexed: 01/24/2023]
Abstract
We have sequenced to high levels of accuracy 714-kb and 233-kb regions of the mouse and bovine X-inactivation centers (Xic), respectively, centered on the Xist gene. This has provided the basis for a fully annotated comparative analysis of the mouse Xic with the 2.3-Mb orthologous region in human and has allowed a three-way species comparison of the core central region, including the Xist gene. These comparisons have revealed conserved genes, both coding and noncoding, conserved CpG islands and, more surprisingly, conserved pseudogenes. The distribution of repeated elements, especially LINE repeats, in the mouse Xic region when compared to the rest of the genome does not support the hypothesis of a role for these repeat elements in the spreading of X inactivation. Interestingly, an asymmetric distribution of LINE elements on the two DNA strands was observed in the three species, not only within introns but also in intergenic regions. This feature is suggestive of important transcriptional activity within these intergenic regions. In silico prediction followed by experimental analysis has allowed four new genes, Cnbp2, Ftx, Jpx, and Ppnx, to be identified and novel, widespread, complex, and apparently noncoding transcriptional activity to be characterized in a region 5' of Xist that was recently shown to attract histone modification early after the onset of X inactivation.
Affiliation(s)
- Corinne Chureau
- Unité de Génétique Moléculaire Murine, URA CNRS 1947, Institut Pasteur, Paris, France
|
172
|
Burgess HA, Reiner O. Alternative splice variants of doublecortin-like kinase are differentially expressed and have different kinase activities. J Biol Chem 2002; 277:17696-705. [PMID: 11884394 DOI: 10.1074/jbc.m111981200] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.3] [Indexed: 01/06/2023] Open
Abstract
Alternative splicing of mRNA transcripts expands the range of protein products from a single gene locus. Several splice variants of DCLK (doublecortin-like kinase) have previously been reported. Here, we report the genomic organization underlying the splice variants of DCLK and examine the expression profile of two splice variants affecting the kinase domain of DCLK and CPG16 (candidate plasticity gene 16), one containing an Arg-rich domain and the other affecting the C terminus of the protein. These splice alternatives were differentially expressed in embryonic and adult brain. Both splice variants disrupted DCLK PEST domains; however, all splice variants remained sensitive to proteolysis by calpain. The adult-specific C-terminal splice variant of DCLK had reduced autophosphorylation activity, but similar kinase activity for myelin basic protein relative to the embryonic splice variant. The splice variant adding an Arg-rich domain gained an autophosphorylation site at Ser-382. Although this protein isoform was expressed mainly in the adult brain, the phosphorylated form was strongly enriched in embryonic brain and adult olfactory bulb, suggesting a possible role in migrating neurons.
Affiliation(s)
- Harold A Burgess
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 76100, Israel
|
173
|
Wei L, Liu Y, Dubchak I, Shon J, Park J. Comparative genomics approaches to study organism similarities and differences. J Biomed Inform 2002; 35:142-50. [PMID: 12474427 DOI: 10.1016/s1532-0464(02)00506-3] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.8] [Indexed: 10/27/2022]
Abstract
Comparative genomics is a large-scale, holistic approach that compares two or more genomes to discover the similarities and differences between the genomes and to study the biology of the individual genomes. Comparative studies can be performed at different levels of the genomes to obtain multiple perspectives about the organisms. We discuss in detail the type of analyses that offer significant biological insights in the comparisons of (1) genome structure including overall genome statistics, repeats, genome rearrangement at both DNA and gene level, synteny, and breakpoints; (2) coding regions including gene content, protein content, orthologs, and paralogs; and (3) noncoding regions including the prediction of regulatory elements. We also briefly review the currently available computational tools in comparative genomics such as algorithms for genome-scale sequence alignment, gene identification, and nonhomology-based function prediction.
Affiliation(s)
- Liping Wei
- Nexus Genomics, Inc., 229 Polaris Ave., Suite 6, Mountain View, CA 94043, USA.
|
174
|
Camargo AA, de Souza SJ, Brentani RR, Simpson AJG. Human gene discovery through experimental definition of transcribed regions of the human genome. Curr Opin Chem Biol 2002; 6:13-6. [PMID: 11827817 DOI: 10.1016/s1367-5931(01)00279-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 10/17/2022]
Abstract
The sequencing of the human genome has failed to realize its primary goal: the identification of all human genes. We have learned that genes can only be identified with certainty within this vast and information-sparse structure by comparison with transcript sequences. Significantly more sequence data of this kind is required before we can claim to have deciphered our genetic blueprint.
Affiliation(s)
- Anamaria A Camargo
- The Ludwig Institute for Cancer Research, Rua Professor Antonio Prudente, 109, 4th floor, São Paulo, 01509-010, SP, Brazil
|
175
|
Abstract
The human genome sequence provides a reference point from which we can compare ourselves with other organisms. Interspecies comparison is a powerful tool for inferring function from genomic sequence and could ultimately lead to the discovery of what makes humans unique. To date, most comparative sequencing has focused on pair-wise comparisons between human and a limited number of other vertebrates, such as mouse. Targeted approaches now exist for mapping and sequencing vertebrate bacterial artificial chromosomes (BACs) from numerous species, allowing rapid and detailed molecular and phylogenetic investigation of multi-megabase loci. Such targeted sequencing is complementary to current whole-genome sequencing projects, and would benefit greatly from the creation of BAC libraries from a diverse range of vertebrates.
Affiliation(s)
- James W Thomas
- Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
|
176
|
DeSilva U, Elnitski L, Idol JR, Doyle JL, Gan W, Thomas JW, Schwartz S, Dietrich NL, Beckstrom-Sternberg SM, McDowell JC, Blakesley RW, Bouffard GG, Thomas PJ, Touchman JW, Miller W, Green ED. Generation and comparative analysis of approximately 3.3 Mb of mouse genomic sequence orthologous to the region of human chromosome 7q11.23 implicated in Williams syndrome. Genome Res 2002; 12:3-15. [PMID: 11779826 PMCID: PMC155257 DOI: 10.1101/gr.214802] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.3] [Indexed: 01/19/2023]
Abstract
Williams syndrome is a complex developmental disorder that results from the heterozygous deletion of an approximately 1.6-Mb segment of human chromosome 7q11.23. These deletions are mediated by large (approximately 300 kb) duplicated blocks of DNA of near-identical sequence. Previously, we showed that the orthologous region of the mouse genome is devoid of such duplicated segments. Here, we extend our studies to include the generation of approximately 3.3 Mb of genomic sequence from the mouse Williams syndrome region, of which just over 1.4 Mb is finished to high accuracy. Comparative analyses of the mouse and human sequences within and immediately flanking the interval commonly deleted in Williams syndrome have facilitated the identification of nine previously unreported genes, provided detailed sequence-based information regarding 30 genes residing in the region, and revealed a number of potentially interesting conserved noncoding sequences. Finally, to facilitate comparative sequence analysis, we implemented several enhancements to the program, including the addition of links from annotated features within a generated percent-identity plot to specific records in public databases. Taken together, the results reported here provide an important comparative sequence resource that should catalyze additional studies of Williams syndrome, including those that aim to characterize genes within the commonly deleted interval and to develop mouse models of the disorder.
Affiliation(s)
- Udaya DeSilva
- Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
|
177
|
Nekrutenko A, Makova KD, Li WH. The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res 2002; 12:198-202. [PMID: 11779845 PMCID: PMC155263 DOI: 10.1101/gr.200901] [Citation(s) in RCA: 204] [Impact Index Per Article: 8.9] [Indexed: 11/24/2022]
Abstract
Comparative genomics is a simple, powerful way to increase the accuracy of gene prediction. In this study, we show the utility of a simple test for the identification of protein-coding exons using human/mouse sequence comparisons. The test takes advantage of the fact that in the vast majority of coding regions, synonymous substitutions (K(S)) occur much more frequently than nonsynonymous ones (K(A)) and uses the K(A)/K(S) ratio as the criterion. We show the following: (1) most of the human and mouse exons are sufficiently long and have a suitable degree of sequence divergence for the test to perform reliably; (2) the test is suited for the identification of long exons and single-exon genes, which are difficult to predict by current methods; (3) the test has a false-negative rate lower than that of most current gene-prediction methods and a false-positive rate lower than that of all current methods; (4) the test has been automated and can be used in combination with other existing gene-prediction methods.
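Illustrative sketch (not the authors' implementation): the counting step behind a K(A)/K(S)-style comparison can be shown on two aligned, in-frame coding sequences. The function name `count_codon_differences` and the example sequences are hypothetical. The sketch only tallies codon pairs that differ synonymously or nonsynonymously under the standard genetic code; a real K(A)/K(S) estimate would additionally normalize by the number of synonymous and nonsynonymous sites and correct for multiple substitutions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_differences(seq_a, seq_b):
    """Count codon pairs that differ synonymously vs. nonsynonymously.

    Both sequences must be aligned, in frame, and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    Returns (synonymous_count, nonsynonymous_count).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue  # identical codons carry no substitution
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: not scored in this sketch
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Hypothetical example: GCT/GCC both encode Ala; AAA (Lys) vs AGA (Arg) differ.
syn, nonsyn = count_codon_differences("ATGGCTAAA", "ATGGCCAGA")
print(syn, nonsyn)
```

In a coding region, the synonymous count would be expected to dominate, which is the signal the abstract describes; how the counts are turned into a decision threshold is not specified here and is left open.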
Affiliation(s)
- Anton Nekrutenko
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA
|
178
|
Abstract
In the post-genomic era, the new discipline of functional genomics is now facing the challenge of associating a function (as well as estimating its relevance to industrial applications) to about 100,000 microbial, plant or animal genes of known sequence but unknown function. Besides the design of databases, computational methods are increasingly becoming intimately linked with the various experimental approaches. Consequently, bioinformatics is rapidly evolving into independent fields addressing the specific problems of interpreting i) genomic sequences, ii) protein sequences and 3D-structures, as well as iii) transcriptome and macromolecular interaction data. It is thus increasingly difficult for the biologist to choose the computational approaches that perform best in these various areas. This paper attempts to review the most useful developments of the last 2 years.
Affiliation(s)
- J M Claverie
- Structural and Genetic Information Laboratory, UMR 1889 CNRS-AVENTIS, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France.
|
179
|
Rivas E, Eddy SR. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001; 2:8. [PMID: 11801179 PMCID: PMC64605 DOI: 10.1186/1471-2105-2-8] [Citation(s) in RCA: 310] [Impact Index Per Article: 12.9] [Received: 07/18/2001] [Accepted: 10/10/2001] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND: Noncoding RNA genes produce transcripts that exert their function without ever producing proteins. Noncoding RNA gene sequences do not have strong statistical signals, unlike protein-coding genes. A reliable general-purpose computational genefinder for noncoding RNA genes has been elusive.
RESULTS: We describe a comparative sequence analysis algorithm for detecting novel structural RNA genes. The key idea is to test the pattern of substitutions observed in a pairwise alignment of two homologous sequences. A conserved coding region tends to show a pattern of synonymous substitutions, whereas a conserved structural RNA tends to show a pattern of compensatory mutations consistent with some base-paired secondary structure. We formalize this intuition using three probabilistic "pair-grammars": a pair stochastic context-free grammar modeling alignments constrained by structural RNA evolution, a pair hidden Markov model modeling alignments constrained by coding sequence evolution, and a pair hidden Markov model modeling a null hypothesis of position-independent evolution. Given an input pairwise sequence alignment (e.g. from a BLASTN comparison of two related genomes) we classify the alignment into the coding, RNA, or null class according to the posterior probability of each class.
CONCLUSIONS: We have implemented this approach as a program, QRNA, which we consider to be a prototype structural noncoding RNA genefinder. Tests suggest that this approach detects noncoding RNA genes with a fair degree of reliability.
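Illustrative sketch (not QRNA itself): the final classification step described in the abstract, choosing among the coding, RNA, and null classes by posterior probability, can be shown with Bayes' rule. The function name `classify_alignment`, the class labels, the uniform prior, and the log-likelihood values are all assumptions for illustration; computing the actual model likelihoods requires the pair-SCFG and pair-HMM algorithms, which are not shown.

```python
import math


def classify_alignment(log_likelihoods, priors=None):
    """Pick the class with the highest posterior probability.

    log_likelihoods: dict mapping class name -> log P(alignment | class model).
    priors: optional dict mapping class name -> prior probability;
            a uniform prior is assumed when omitted.
    Returns (best_class, posteriors).
    """
    names = list(log_likelihoods)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    # Work in log space and subtract the maximum to avoid underflow.
    log_joint = {n: log_likelihoods[n] + math.log(priors[n]) for n in names}
    peak = max(log_joint.values())
    total = sum(math.exp(v - peak) for v in log_joint.values())
    posteriors = {n: math.exp(v - peak) / total for n, v in log_joint.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical log-likelihoods for one alignment under the three models.
best, post = classify_alignment({"RNA": -100.0, "coding": -105.0, "null": -110.0})
print(best, round(post[best], 4))
```

Subtracting the largest log value before exponentiating keeps the computation stable for long alignments, where raw likelihoods would underflow to zero.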
Affiliation(s)
- Elena Rivas
- Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri, USA
- Sean R Eddy
- Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri, USA
|
180
|
Camargo AA, Samaia HP, Dias-Neto E, Simão DF, Migotto IA, Briones MR, Costa FF, Nagai MA, Verjovski-Almeida S, Zago MA, Andrade LE, Carrer H, El-Dorry HF, Espreafico EM, Habr-Gama A, Giannella-Neto D, Goldman GH, Gruber A, Hackel C, Kimura ET, Maciel RM, Marie SK, Martins EA, Nobrega MP, Paco-Larson ML, Pardini MI, Pereira GG, Pesquero JB, Rodrigues V, Rogatto SR, da Silva ID, Sogayar MC, Sonati MF, Tajara EH, Valentini SR, Alberto FL, Amaral ME, Aneas I, Arnaldi LA, de Assis AM, Bengtson MH, Bergamo NA, Bombonato V, de Camargo ME, Canevari RA, Carraro DM, Cerutti JM, Correa ML, Correa RF, Costa MC, Curcio C, Hokama PO, Ferreira AJ, Furuzawa GK, Gushiken T, Ho PL, Kimura E, Krieger JE, Leite LC, Majumder P, Marins M, Marques ER, Melo AS, Melo MB, Mestriner CA, Miracca EC, Miranda DC, Nascimento AL, Nobrega FG, Ojopi EP, Pandolfi JR, Pessoa LG, Prevedel AC, Rahal P, Rainho CA, Reis EM, Ribeiro ML, da Ros N, de Sa RG, Sales MM, Sant'anna SC, dos Santos ML, da Silva AM, da Silva NP, Silva WA, da Silveira RA, Sousa JF, Stecconi D, Tsukumo F, Valente V, Soares F, Moreira ES, Nunes DN, Correa RG, Zalcberg H, Carvalho AF, Reis LF, Brentani RR, Simpson AJ, de Souza SJ, Melo M. The contribution of 700,000 ORF sequence tags to the definition of the human transcriptome. Proc Natl Acad Sci U S A 2001; 98:12103-8. [PMID: 11593022 PMCID: PMC59775 DOI: 10.1073/pnas.201182798] [Citation(s) in RCA: 93] [Impact Index Per Article: 3.9] [Indexed: 11/18/2022] Open
Abstract
Open reading frame expressed sequence tags (ORESTES) differ from conventional ESTs by providing sequence data from the central protein-coding portion of transcripts. We generated a total of 696,745 ORESTES sequences from 24 human tissues and used a subset of the data that correspond to a set of 15,095 full-length mRNAs as a means of assessing the efficiency of the strategy and its potential contribution to the definition of the human transcriptome. We estimate that ORESTES sampled over 80% of all highly and moderately expressed, and between 40% and 50% of rarely expressed, human genes. In our most thoroughly sequenced tissue, the breast, the 130,000 ORESTES generated are derived from transcripts from an estimated 70% of all genes expressed in that tissue, with an equally efficient representation of both highly and poorly expressed genes. In this respect, we find that the capacity of the ORESTES strategy both for gene discovery and shotgun transcript sequence generation significantly exceeds that of conventional ESTs. The distribution of ORESTES is such that many human transcripts are now represented by a scaffold of partial sequences distributed along the length of each gene product. The experimental joining of the scaffold components, by reverse transcription-PCR, represents a direct route to transcript finishing that may represent a useful alternative to full-length cDNA cloning.
Affiliation(s)
- A A Camargo
- Ludwig Institute for Cancer Research, 01509-010, São Paulo, Brazil
|
181
|
Wiehe T, Gebauer-Jung S, Mitchell-Olds T, Guigó R. SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res 2001; 11:1574-83. [PMID: 11544202 PMCID: PMC311140 DOI: 10.1101/gr.177401] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.4] [Received: 01/01/2001] [Accepted: 06/05/2001] [Indexed: 11/24/2022]
Abstract
Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.
Affiliation(s)
- T Wiehe
- Max Planck Institute for Chemical Ecology, Jena, Germany.
|
182
|
Qiu Y, Cavelier L, Chiu S, Yang X, Rubin E, Cheng JF. Human and mouse ABCA1 comparative sequencing and transgenesis studies revealing novel regulatory sequences. Genomics 2001; 73:66-76. [PMID: 11352567 DOI: 10.1006/geno.2000.6467] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.2] [Indexed: 12/27/2022]
Abstract
The expression of ABCA1, a major participant in apolipoprotein-mediated cholesterol efflux, is regulated by a variety of factors, including intracellular cholesterol concentration. To identify sequences involved in its regulation, we sequenced and compared approximately 200 kb of mouse and human DNA containing the ABCA1 gene. Furthermore, expression of the human gene containing different 5' ends was examined in transgenic mice. Sequence comparison revealed multiple conserved noncoding sequences. The two most highly conserved noncoding elements (CNS1, 88% identity over 498 bp; CNS2, 81% identity over 214 bp) were also highly conserved in other organisms. Mice containing the human ABCA1 gene, 70 kb of upstream DNA, and 35 kb of downstream DNA expressed the transgene similarly to endogenous Abca1. A second transgene beginning 3' to exon 1 was expressed only in liver, providing strong evidence of an unsuspected liver-specific promoter. The identified conserved noncoding sequences invite further investigation to elucidate ABCA1 regulation.
Affiliation(s)
- Y Qiu
- Genome Science Department, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
|
183
|
|
184
|
Bergman CM, Kreitman M. Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res 2001; 11:1335-45. [PMID: 11483574 DOI: 10.1101/gr.178701] [Citation(s) in RCA: 124] [Impact Index Per Article: 5.2] [Indexed: 11/24/2022]
Abstract
Comparative genomic approaches to gene and cis-regulatory prediction are based on the principle that differential DNA sequence conservation reflects variation in functional constraint. Using this principle, we analyze noncoding sequence conservation in Drosophila for 40 loci with known or suspected cis-regulatory function encompassing >100 kb of DNA. We estimate the fraction of noncoding DNA conserved in both intergenic and intronic regions and describe the length distribution of ungapped conserved noncoding blocks. On average, 22%-26% of noncoding sequences surveyed are conserved in Drosophila, with median block length approximately 19 bp. We show that point substitution in conserved noncoding blocks exhibits transition bias as well as lineage effects in base composition, and occurs more than an order of magnitude more frequently than insertion/deletion (indel) substitution. Overall, patterns of noncoding DNA structure and evolution differ remarkably little between intergenic and intronic conserved blocks, suggesting that the effects of transcription per se contribute minimally to the constraints operating on these sequences. The results of this study have implications for the development of alignment and prediction algorithms specific to noncoding DNA, as well as for models of cis-regulatory DNA sequence evolution.
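Illustrative sketch (not the authors' pipeline): the two summary statistics mentioned in the abstract, the fraction of noncoding sequence that is conserved and the length distribution of ungapped conserved blocks, can be computed from a pairwise alignment. The definition used here is an assumption: a conserved block is a maximal run of consecutive alignment columns with identical bases and no gap. The function names, the `min_len` parameter, and the example alignment are hypothetical.

```python
from statistics import median


def conserved_blocks(aln_a, aln_b, min_len=1):
    """Return lengths of maximal ungapped runs of identical columns.

    aln_a, aln_b: aligned sequences of equal length, with '-' for gaps.
    min_len: shortest run reported (assumed threshold, default 1).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    lengths = []
    run = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == y and x != "-":
            run += 1
        else:
            if run >= min_len:
                lengths.append(run)
            run = 0
    if run >= min_len:  # close a run that reaches the alignment end
        lengths.append(run)
    return lengths


def conserved_fraction(aln_a, aln_b, min_len=1):
    """Fraction of non-gap positions of aln_a that lie in conserved blocks."""
    sites = sum(1 for x in aln_a if x != "-")
    if sites == 0:
        return 0.0
    return sum(conserved_blocks(aln_a, aln_b, min_len)) / sites


# Hypothetical alignment: blocks ACG, AC, TAC separated by a mismatch and gaps.
a = "ACGT-ACGTAC"
b = "ACGAAAC-TAC"
blocks = conserved_blocks(a, b)
print(blocks, median(blocks), conserved_fraction(a, b))
```

Raising `min_len` discards short runs that could match by chance; which threshold is appropriate depends on the divergence of the species compared and is not fixed by this sketch.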
Affiliation(s)
- C M Bergman
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637, USA.
|
185
|
Wilihoeft U, Campos-Góngora E, Touzni S, Bruchhaus I, Tannich E. Introns of Entamoeba histolytica and Entamoeba dispar. Protist 2001; 152:149-56. [PMID: 11545438 DOI: 10.1078/1434-4610-00053] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
The genome of Entamoeba histolytica is considered to possess very few intervening sequences (introns), as only 5 intron-containing genes from this protozoan parasite have been reported so far. However, while sequencing a number of genomic contigs as well as three independent genes coding for ribosomal protein L27a, we have identified 9 additional intron-containing genes of E. histolytica and the closely related species Entamoeba dispar, indicating that introns are more common in these organisms than previously suggested. The various amoeba introns are relatively short comprising between 46 and 115 nucleotides only and have a higher AT-content compared to the corresponding exon sequences. In contrast to higher eukaryotes, amoeba introns do not contain a well-conserved branch point consensus, and have extended donor and acceptor splice sites of the sequences G
Affiliation(s)
- U Wilihoeft
- Bernhard Nocht Institute for Tropical Medicine, Hamburg, Germany
|
186
|
Abstract
In its early days, the entire field of computational biology revolved almost entirely around biological sequence analysis. Over the past few years, however, a number of new non-sequence-based areas of investigation have become mainstream, from the analysis of gene expression data from microarrays, to whole-genome association discovery, and to the reverse engineering of gene regulatory pathways. Nonetheless, with the completion of private and public efforts to map the human genome, as well as those of other organisms, sequence data continue to be a veritable mother lode of valuable biological information that can be mined in a variety of contexts. Furthermore, the integration of sequence data with a variety of alternative information is providing valuable and fundamentally new insight into biological processes, as well as an array of new computational methodologies for the analysis of biological data.
Affiliation(s)
- A Califano
- First Genetic Trust Inc., 9 Polito Avenue, Suite 930, Lyndhurst, NJ 07071, USA.
|
187
|
Shiraishi T, Druck T, Mimori K, Flomenberg J, Berk L, Alder H, Miller W, Huebner K, Croce CM. Sequence conservation at human and mouse orthologous common fragile regions, FRA3B/FHIT and Fra14A2/Fhit. Proc Natl Acad Sci U S A 2001; 98:5722-7. [PMID: 11320209 PMCID: PMC33280 DOI: 10.1073/pnas.091095898] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.3] [Indexed: 11/18/2022] Open
Abstract
It has been suggested that delayed DNA replication underlies fragility at common human fragile sites, but specific sequences responsible for expression of these inducible fragile sites have not been identified. One approach to identify such cis-acting sequences within the large nonexonic regions of fragile sites would be to identify conserved functional elements within orthologous fragile sites by interspecies sequence comparison. This study describes a comparison of orthologous fragile regions, the human FRA3B/FHIT and the murine Fra14A2/Fhit locus. We sequenced over 600 kbp of the mouse Fra14A2, covering the region orthologous to the fragile epicenter of FRA3B, and determined the Fhit deletion break points in a mouse kidney cancer cell line (RENCA). The murine Fra14A2 locus, like the human FRA3B, was characterized by a high AT content. Alignment of the two sequences showed that this fragile region was stable in evolution despite its susceptibility to mitotic recombination on inhibition of DNA replication. There were also several unusual highly conserved regions (HCRs). The positions of predicted matrix attachment regions (MARs), possibly related to replication origins, were not conserved. Of known fragile region landmarks, five cancer cell break points, one viral integration site, and one aphidicolin break cluster were located within or near HCRs. Thus, comparison of orthologous fragile regions has identified highly conserved sequences with possible functional roles in maintenance of fragility.
Affiliation(s)
- T Shiraishi
- Kimmel Cancer Center, Jefferson Medical College, 233 South 10th Street, Philadelphia, PA 19107, USA
|
188
|
Pletcher MT, Wiltshire T, Cabin DE, Villanueva M, Reeves RH. Use of Comparative Physical and Sequence Mapping to Annotate Mouse Chromosome 16 and Human Chromosome 21. Genomics 2001; 74:45-54. [PMID: 11374901 DOI: 10.1006/geno.2001.6533] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
Distal mouse chromosome 16 (MMU16) shares conserved linkage with human chromosome 21 (HSA21), trisomy for which causes Down syndrome (DS). A 4.5-Mb physical map extending from Cbr1 to Tmprss2 on MMU16 provides a minimal tiling path of P1 artificial chromosomes (PACs) for comparative mapping and genomic sequencing. Thirty-four expressed sequences were positioned on the mouse map, including 19 that were not physically mapped previously. This region of the mouse:human comparative map shows a high degree of evolutionary conservation of gene order and content, which differs only by insertion of one gene (in mouse) and a small inversion involving two adjacent genes. "Low-pass" (2.2x) mouse sequence from a portion of the contig was ordered and oriented along 510 kb of finished HSA21 sequence. In combination with 68 kb of unique PAC end sequence, the comparison provided confirmation of genes predicted by comparative mapping, indicated gene predictions that are likely to be incorrect, and identified three candidate genes in mouse and human that were not observed in the initial HSA21 sequence annotation. This comparative map and sequence derived from it are powerful tools for identifying genes and regulatory regions, information that will in turn provide insights into the genetic mechanisms by which trisomy 21 results in DS.
Affiliation(s)
- M T Pletcher
- Department of Physiology, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
|
189
|
Yeh RF, Lim LP, Burge CB. Computational inference of homologous gene structures in the human genome. Genome Res 2001; 11:803-16. [PMID: 11337476 PMCID: PMC311055 DOI: 10.1101/gr.175701] [Citation(s) in RCA: 264] [Impact Index Per Article: 11.0] [Indexed: 11/25/2022]
Abstract
With the human genome sequence approaching completion, a major challenge is to identify the locations and encoded protein sequences of all human genes. To address this problem we have developed a new gene identification algorithm, GenomeScan, which combines exon-intron and splice signal models with similarity to known protein sequences in an integrated model. Extensive testing shows that GenomeScan can accurately identify the exon-intron structures of genes in finished or draft human genome sequence with a low rate of false-positives. Application of GenomeScan to 2.7 billion bases of human genomic DNA identified at least 20,000-25,000 human genes out of an estimated 30,000-40,000 present in the genome. The results show an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.
Affiliation(s)
- R F Yeh
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
|
190
|
Kan Z, Rouchka EC, Gish WR, States DJ. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res 2001; 11:889-900. [PMID: 11337482 PMCID: PMC311065 DOI: 10.1101/gr.155001] [Citation(s) in RCA: 255] [Impact Index Per Article: 10.6] [Indexed: 11/24/2022]
Abstract
With the availability of a nearly complete sequence of the human genome, aligning expressed sequence tags (ESTs) to the genomic sequence has become a practical and powerful strategy for gene prediction. Elucidating gene structure is a complex problem requiring the identification of splice junctions, gene boundaries, and alternative splicing variants. We have developed a software tool, Transcript Assembly Program (TAP), to delineate gene structures using genomically aligned EST sequences. TAP assembles the joint gene structure of the entire genomic region from individual splice junction pairs, using a novel algorithm that uses the EST-encoded connectivity and redundancy information to sort out the complex alternative splicing patterns. A method called polyadenylation site scan (PASS) has been developed to detect poly-A sites in the genome. TAP uses these predictions to identify gene boundaries by segmenting the joint gene structure at polyadenylated terminal exons. Reconstructing 1007 known transcripts, TAP scored a sensitivity (Sn) of 60% and a specificity (Sp) of 92% at the exon level. The gene boundary identification process was found to be accurate 78% of the time. TAP also reports alternative splicing patterns in EST alignments. An analysis of alternative splicing in 1124 genic regions suggested that more than half of human genes undergo alternative splicing. Surprisingly, we saw an absolute majority of the detected alternative splicing events affect the coding region. Furthermore, the evolutionary conservation of alternative splicing between human and mouse was analyzed using an EST-based approach. (See http://stl.wustl.edu/~zkan/TAP/)
Affiliation(s)
- Z Kan
- Center for Computational Biology, Washington University, St. Louis, Missouri 63110, USA
|
191
|
|
192
|
Dubcovsky J, Ramakrishna W, SanMiguel PJ, Busso CS, Yan L, Shiloff BA, Bennetzen JL. Comparative sequence analysis of colinear barley and rice bacterial artificial chromosomes. PLANT PHYSIOLOGY 2001; 125:1342-53. [PMID: 11244114 PMCID: PMC65613 DOI: 10.1104/pp.125.3.1342] [Citation(s) in RCA: 104] [Impact Index Per Article: 4.3] [Received: 11/15/2000] [Revised: 12/14/2000] [Accepted: 12/18/2000] [Indexed: 05/18/2023]
Abstract
Colinearity of a large region from barley (Hordeum vulgare) chromosome 5H and rice (Oryza sativa) chromosome 3 has been demonstrated by mapping of several common restriction fragment-length polymorphism clones on both regions. One of these clones, WG644, was hybridized to rice and barley bacterial artificial chromosome (BAC) libraries to select homologous clones. One BAC from each species with the largest overlapping segment was selected by fingerprinting and blot hybridization with three additional restriction fragment-length polymorphism clones. The complete barley BAC 635P2 and a 50-kb segment of the rice BAC 36I5 were completely sequenced. A comparison of the rice and barley DNA sequences revealed the presence of four conserved regions, containing four predicted genes. The four genes are in the same orientation in rice, but the second gene is in inverted orientation in barley. The fourth gene is duplicated in tandem in barley but not in rice. Comparison of the homeologous barley and rice sequences assisted the gene identification process and helped determine individual gene structures. General gene structure (exon number, size, and location) was largely conserved between rice and barley and to a lesser extent with homologous genes in Arabidopsis. Colinearity of these four genes is not conserved in Arabidopsis compared with the two grass species. Extensive similarity was not found between the rice and barley sequences other than within the exons of the structural genes, and short stretches of homology in the promoters and 3' untranslated regions. The larger distances between the first three genes in barley compared with rice are explained by the insertion of different transposable retroelements.
Affiliation(s)
- J Dubcovsky
- Department of Agronomy and Range Science, University of California, Davis, CA 95616, USA
|
193
|
Shoemaker DD, Schadt EE, Armour CD, He YD, Garrett-Engele P, McDonagh PD, Loerch PM, Leonardson A, Lum PY, Cavet G, Wu LF, Altschuler SJ, Edwards S, King J, Tsang JS, Schimmack G, Schelter JM, Koch J, Ziman M, Marton MJ, Li B, Cundiff P, Ward T, Castle J, Krolewski M, Meyer MR, Mao M, Burchard J, Kidd MJ, Dai H, Phillips JW, Linsley PS, Stoughton R, Scherer S, Boguski MS. Experimental annotation of the human genome using microarray technology. Nature 2001; 409:922-7. [PMID: 11237012 DOI: 10.1038/35057141] [Citation(s) in RCA: 276] [Impact Index Per Article: 11.5] [Indexed: 11/09/2022]
Abstract
The most important product of the sequencing of a genome is a complete, accurate catalogue of genes and their products, primarily messenger RNA transcripts and their cognate proteins. Such a catalogue cannot be constructed by computational annotation alone; it requires experimental validation on a genome scale. Using 'exon' and 'tiling' arrays fabricated by ink-jet oligonucleotide synthesis, we devised an experimental approach to validate and refine computational gene predictions and define full-length transcripts on the basis of co-regulated expression of their exons. These methods can provide more accurate gene numbers and allow the detection of mRNA splice variants and identification of the tissue- and disease-specific conditions under which genes are expressed. We apply our technique to chromosome 22q under 69 experimental condition pairs, and to the entire human genome under two experimental conditions. We discuss implications for more comprehensive, consistent and reliable genome annotation, more efficient, full-length complementary DNA cloning strategies and application to complex diseases.
Affiliation(s)
- D D Shoemaker
- Rosetta Inpharmatics, Inc., Kirkland, Washington 98034, USA
|
194
|
Kawai J, Shinagawa A, Shibata K, Yoshino M, Itoh M, Ishii Y, Arakawa T, Hara A, Fukunishi Y, Konno H, Adachi J, Fukuda S, Aizawa K, Izawa M, Nishi K, Kiyosawa H, Kondo S, Yamanaka I, Saito T, Okazaki Y, Gojobori T, Bono H, Kasukawa T, Saito R, Kadota K, Matsuda H, Ashburner M, Batalov S, Casavant T, Fleischmann W, Gaasterland T, Gissi C, King B, Kochiwa H, Kuehl P, Lewis S, Matsuo Y, Nikaido I, Pesole G, Quackenbush J, Schriml LM, Staubli F, Suzuki R, Tomita M, Wagner L, Washio T, Sakai K, Okido T, Furuno M, Aono H, Baldarelli R, Barsh G, Blake J, Boffelli D, Bojunga N, Carninci P, de Bonaldo MF, Brownstein MJ, Bult C, Fletcher C, Fujita M, Gariboldi M, Gustincich S, Hill D, Hofmann M, Hume DA, Kamiya M, Lee NH, Lyons P, Marchionni L, Mashima J, Mazzarelli J, Mombaerts P, Nordone P, Ring B, Ringwald M, Rodriguez I, Sakamoto N, Sasaki H, Sato K, Schönbach C, Seya T, Shibata Y, Storch KF, Suzuki H, Toyo-oka K, Wang KH, Weitz C, Whittaker C, Wilming L, Wynshaw-Boris A, Yoshida K, Hasegawa Y, Kawaji H, Kohtsuki S, Hayashizaki Y. Functional annotation of a full-length mouse cDNA collection. Nature 2001; 409:685-90. [PMID: 11217851 DOI: 10.1038/35055500] [Citation(s) in RCA: 488] [Impact Index Per Article: 20.3] [Indexed: 01/31/2023]
Abstract
The RIKEN Mouse Gene Encyclopaedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analysed in this project. Here we describe the first RIKEN clone collection, which is one of the largest described for any organism. Analysis of these cDNAs extends known gene families and identifies new ones.
Affiliation(s)
- J Kawai
- Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama Institute, Kanagawa, Japan
|
195
|
Ball CA, Cherry JM. Genome comparisons highlight similarity and diversity within the eukaryotic kingdoms. Curr Opin Chem Biol 2001; 5:86-9. [PMID: 11166654 PMCID: PMC3040119 DOI: 10.1016/s1367-5931(00)00172-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 11/17/2022]
Abstract
In 2000, the number of completely sequenced eukaryotic genomes increased to four. The addition of Drosophila and Arabidopsis into this cohort permits additional insights into the processes that have shaped evolution. Analysis and comparisons of both completed genomes and partially sequenced genomes have already shed light on mechanisms such as gene duplication and gene loss that have long been hypothesized to be major forces in speciation. Indeed, duplicate gene pairs in Saccharomyces, Arabidopsis, Caenorhabditis and Drosophila are high: 30%, 60%, 48% and 40%, respectively. Evidence of horizontal gene-transfer, thought to be a major evolutionary force in bacteria, has been found in Arabidopsis. The release of the 'first draft' of the human genome sequence in 2000 heralds a new stage of biological study. Understanding the as-yet-unannotated human genome will be largely based on conclusions, techniques and tools developed during the analysis and comparison of the genome of these four model organisms.
|
196
|
Abstract
With the continuing accomplishments of the human genome project, high-throughput strategies to identify DNA sequences that are important in mammalian gene regulation are becoming increasingly feasible. In contrast to the historic, labour-intensive, wet-laboratory methods for identifying regulatory sequences, many modern approaches are heavily focused on the computational analysis of large genomic data sets. Data from inter-species genomic sequence comparisons and genome-wide expression profiling, integrated with various computational tools, are poised to contribute to the decoding of genomic sequence and to the identification of those sequences that orchestrate gene regulation. In this review, we highlight several genomic approaches that are being used to identify regulatory sequences in mammalian genomes.
Affiliation(s)
- L A Pennacchio
- Genome Sciences Department, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, USA
|
197
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447185 DOI: 10.1002/cfg.55] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
|
198
|
Göttgens B, Gilbert JG, Barton LM, Grafham D, Rogers J, Bentley DR, Green AR. Long-range comparison of human and mouse SCL loci: localized regions of sensitivity to restriction endonucleases correspond precisely with peaks of conserved noncoding sequences. Genome Res 2001; 11:87-97. [PMID: 11156618 PMCID: PMC311011 DOI: 10.1101/gr.153001] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.0] [Received: 06/20/2000] [Accepted: 10/12/2000] [Indexed: 11/24/2022]
Abstract
Long-range comparative sequence analysis provides a powerful strategy for identifying conserved regulatory elements. The stem cell leukemia (SCL) gene encodes a bHLH transcription factor with a pivotal role in hemopoiesis and vasculogenesis, and it displays a highly conserved expression pattern. We present here a detailed sequence comparison of 193 kb of the human SCL locus to 234 kb of the mouse SCL locus. Four new genes have been identified together with an ancient mitochondrial insertion in the human locus. The SCL gene is flanked upstream by the SIL gene and downstream by the MAP17 gene in both species, but the gene order is not collinear downstream from MAP17. To facilitate rapid identification of candidate regulatory elements, we have developed a new sequence analysis tool (SynPlot) that automates the graphical display of large-scale sequence alignments. Unlike existing programs, SynPlot can display the locus features of more than one sequence, thereby indicating the position of homology peaks relative to the structure of all sequences in the alignment. In addition, high-resolution analysis of the chromatin structure of the mouse SCL gene permitted the accurate positioning of localized zones accessible to restriction endonucleases. Zones known to be associated with functional regulatory regions were found to correspond precisely with peaks of human/mouse homology, thus demonstrating that long-range human/mouse sequence comparisons allow accurate prediction of the extent of accessible DNA associated with active regulatory regions.
Affiliation(s)
- B Göttgens
- The Wellcome Trust Centre for Molecular Mechanisms in Disease, Cambridge Institute for Medical Research, Addenbrooke's Hospital Site, Cambridge CB2 2XY, UK.
|
199
|
Touchman JW, Dehejia A, Chiba-Falek O, Cabin DE, Schwartz JR, Orrison BM, Polymeropoulos MH, Nussbaum RL. Human and mouse alpha-synuclein genes: comparative genomic sequence analysis and identification of a novel gene regulatory element. Genome Res 2001; 11:78-86. [PMID: 11156617 PMCID: PMC311023 DOI: 10.1101/gr.165801] [Citation(s) in RCA: 90] [Impact Index Per Article: 3.8] [Indexed: 11/25/2022]
Abstract
The human alpha-synuclein gene (SNCA) encodes a presynaptic nerve terminal protein that was originally identified as a precursor of the non-beta-amyloid component of Alzheimer's disease plaques. More recently, mutations in SNCA have been identified in some cases of familial Parkinson's disease, presenting numerous new areas of investigation for this important disease. Molecular studies would benefit from detailed information about the long-range sequence context of SNCA. To that end, we have established the complete genomic sequence of the chromosomal regions containing the human and mouse alpha-synuclein genes, with the objective of using the resulting sequence information to identify conserved regions of biological importance through comparative sequence analysis. These efforts have yielded approximately 146 and approximately 119 kb of high-accuracy human and mouse genomic sequence, respectively, revealing the precise genetic architecture of the alpha-synuclein gene in both species. A simple repeat element upstream of SNCA/Snca has been identified and shown to be necessary for normal expression in transient transfection assays using a luciferase reporter construct. Together, these studies provide valuable data that should facilitate more detailed analysis of this medically important gene.
Affiliation(s)
- J W Touchman
- NIH Intramural Sequencing Center, National Institutes of Health, Gaithersburg, Maryland 20877, USA
|
200
|
Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM, Frazer KA. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res 2000; 10:1304-6. [PMID: 10984448 PMCID: PMC310906 DOI: 10.1101/gr.142200] [Citation(s) in RCA: 238] [Impact Index Per Article: 9.5] [Indexed: 11/25/2022]
Abstract
Human and mouse genomic sequence comparisons are being increasingly used to search for evolutionarily conserved gene regulatory elements. Large-scale human-mouse DNA comparison studies have discovered numerous conserved noncoding sequences, of which only a fraction has been functionally investigated. A question therefore remains as to whether most of these noncoding sequences are conserved because of functional constraints or are the result of a lack of divergence time.
Affiliation(s)
- I Dubchak
- Center for Bioinformatics and Computational Genomics, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
|