1
Marchet C, Lecompte L, Silva CD, Cruaud C, Aury JM, Nicolas J, Peterlongo P. De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res 2019; 47:e2. [PMID: 30260405 PMCID: PMC6326815 DOI: 10.1093/nar/gky834] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 04/06/2018] [Revised: 09/04/2018] [Accepted: 09/10/2018] [Indexed: 02/07/2023] Open
Abstract
Long-read sequencing currently provides sequences of several thousand base pairs. It is therefore possible to obtain complete transcripts, offering an unprecedented vision of the cellular transcriptome. However, the literature lacks tools for de novo clustering of such data, in particular for Oxford Nanopore Technologies reads, because of their inherently high error rate compared to short reads. Our goal is to process reads from whole transcriptome sequencing data accurately and without a reference genome in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution is both a new algorithm adapted to clustering reads by gene and a practical, freely accessible tool that scales to the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device. This dataset is used to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is the best approach for transcriptomics long reads. When a reference is available to enable mapping, we show that it stands as an alternative method that predicts complementary clusters.
Affiliation(s)
- Corinne Da Silva
- Commissariat à l’Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
- Corinne Cruaud
- Commissariat à l’Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
- Jean-Marc Aury
- Commissariat à l’Énergie Atomique (CEA), Institut de Biologie François Jacob, Genoscope, 91000 Evry, France
2
Scheibye-Alsing K, Hoffmann S, Frankel A, Jensen P, Stadler PF, Mang Y, Tommerup N, Gilchrist MJ, Nygård AB, Cirera S, Jørgensen CB, Fredholm M, Gorodkin J. Sequence assembly. Comput Biol Chem 2008; 33:121-36. [PMID: 19152793 DOI: 10.1016/j.compbiolchem.2008.11.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Received: 10/24/2008] [Revised: 11/28/2008] [Accepted: 11/28/2008] [Indexed: 01/20/2023]
Abstract
Despite the rapidly increasing number of sequenced and re-sequenced genomes, many issues regarding the computational assembly of large-scale sequencing data have remained unresolved. Computational assembly is crucial in large genome projects as well as for the evolving high-throughput technologies and plays an important role in processing the information generated by these methods. Here, we provide a comprehensive overview of the current publicly available sequence assembly programs. We describe the basic principles of computational assembly along with the main concerns, such as repetitive sequences in genomic DNA, highly expressed genes and alternative transcripts in EST sequences. We summarize existing comparisons of different assemblers and provide detailed descriptions and directions for download of assembly programs at: http://genome.ku.dk/resources/assembly/methods.html.
Affiliation(s)
- K Scheibye-Alsing
- Division of Genetics and Bioinformatics, IBHV, University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C, Denmark
3
Abstract
The "conventional" isoform of myosin that polymerizes into filaments (myosin II) is the molecular motor powering contraction in all three types of muscle. Considerable attention has been paid to the developmental progression, isoform distribution, and mutations that affect myocardial development, function, and adaptation. Optical trap (laser tweezer) experiments and various types of high-resolution fluorescence microscopy, capable of interrogating individual protein motors, are revealing novel and detailed information about their functionally relevant nanometer motions and pico-Newton forces. Single-molecule laser tweezer studies of cardiac myosin isoforms and their mutants have helped to elucidate the pathogenesis of familial hypertrophic cardiomyopathies. Surprisingly, some disease mutations seem to enhance myosin function. More broadly, the myosin superfamily includes more than 20 nonfilamentous members with myriad cellular functions, including targeted organelle transport, endocytosis, chemotaxis, cytokinesis, modulation of sensory systems, and signal transduction. Widely varying genetic, developmental and functional disorders of the nervous, pigmentation, and immune systems have been described in accordance with these many roles. Compared to the collective nature of myosin II, some myosin family members operate with only a few partners or even alone. Individual myosin V and VI molecules can carry cellular vesicular cargoes much farther distances than their own size. Laser tweezer mechanics, single-molecule fluorescence polarization, and imaging with nanometer precision have elucidated the very different mechano-chemical properties of these isoforms. Critical contributions of nonsarcomeric myosins to myocardial development and adaptation are likely to be discovered in future studies, so these techniques and concepts may become important in cardiovascular research.
Affiliation(s)
- Jody A Dantzig
- University of Pennsylvania School of Medicine, Pennsylvania Muscle Institute, 3700 Hamilton Walk, D700 Richards Building, Philadelphia, PA 19104-6083, USA
4
Zhu T, Zhou J, An Y, Zhou J, Li H, Xu G, Ma D. Construction and characterization of a rock-cluster-based EST analysis pipeline. Comput Biol Chem 2006; 30:81-6. [PMID: 16321574 DOI: 10.1016/j.compbiolchem.2005.10.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/04/2005] [Revised: 10/04/2005] [Accepted: 10/04/2005] [Indexed: 10/25/2022]
Abstract
Open access to the vast amount of expressed sequence tag (EST) data in the public databases has provided a powerful platform for gene identification, gene expression studies and comparative/functional genomic studies. To facilitate management of large-scale EST data, high performance cluster and analysis software, especially parallel software, is essential. We report herein a convenient approach to construct a high performance computing (HPC) cluster based on the popular Rocks and a Perl-scripted analysis pipeline for EST pre-processing, clustering, assembling and annotation and any other desired analysis modules through parallel computing. We tested the system using different datasets on an increasing number of nodes. Our present results showed that the cluster and pipeline accelerate the EST analysis without artificial interference.
Affiliation(s)
- Tao Zhu
- Cancer Biology Research Center, TongJi Hospital, TongJi Medical School, Huazhong University of Science and Technology, WuHan, Hubei 430030, PR China
5
Abstract
Alternative splicing and gene duplication are two major sources of proteomic function diversity. Here, we study the evolutionary trend of alternative splicing after gene duplication by analyzing the alternative splicing differences between duplicate genes. We observed that duplicate genes have fewer alternative splice (AS) forms than single-copy genes, and that a negative correlation exists between the mean number of AS forms and the gene family size. Interestingly, we found that the loss of alternative splicing in duplicate genes may occur shortly after the gene duplication. These results support the subfunctionalization model of alternative splicing in the early stage after gene duplication. Further analysis of the alternative splicing distribution in human duplicate pairs showed the asymmetric evolution of alternative splicing after gene duplications; i.e., the AS forms between duplicates may differ dramatically. We therefore conclude that alternative splicing and gene duplication may not evolve independently. In the early stage after gene duplication, young duplicates may take over a certain amount of protein function diversity that previously was carried out by the alternative splicing mechanism. In the late stage, the gain and loss of alternative splicing seem to be independent between duplicates.
Affiliation(s)
- Zhixi Su
- James D. Watson Institute of Genome Sciences, Zhejiang University, Hangzhou 310008, China
6
Eyras E, Caccamo M, Curwen V, Clamp M. ESTGenes: alternative splicing from ESTs in Ensembl. Genome Res 2004; 14:976-87. [PMID: 15123595 PMCID: PMC479129 DOI: 10.1101/gr.1862204] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.0] [Received: 08/08/2003] [Accepted: 12/18/2003] [Indexed: 11/24/2022]
Abstract
We describe a novel algorithm for deriving the minimal set of nonredundant transcripts compatible with the splicing structure of a set of ESTs mapped on a genome. Sets of ESTs with compatible splicing are represented by a special type of graph. We describe the algorithms for building the graphs and for deriving the minimal set of transcripts from the graphs that are compatible with the evidence. These algorithms are part of the Ensembl automatic gene annotation system, and its results, using ESTs, are provided at www.ensembl.org as ESTgenes for the mosquito, Caenorhabditis briggsae, C. elegans, zebrafish, human, mouse, and rat genomes. Here we also report on the results of this method applied to the human and mouse genomes.
Affiliation(s)
- Eduardo Eyras
- The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
7
Tan PK, Downey TJ, Spitznagel EL, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 2003; 31:5676-84. [PMID: 14500831 PMCID: PMC206463 DOI: 10.1093/nar/gkg763] [Citation(s) in RCA: 492] [Impact Index Per Article: 22.4] [Indexed: 11/12/2022] Open
Abstract
Multiple commercial microarrays for measuring genome-wide gene expression levels are currently available, including oligonucleotide and cDNA, single- and two-channel formats. This study reports on the results of gene expression measurements generated from identical RNA preparations that were obtained using three commercially available microarray platforms. RNA was collected from PANC-1 cells grown in serum-rich medium and at 24 h following the removal of serum. Three biological replicates were prepared for each condition, and three experimental replicates were produced for the first biological replicate. RNA was labeled and hybridized to microarrays from three major suppliers according to manufacturers' protocols, and gene expression measurements were obtained using each platform's standard software. For each platform, gene targets from a subset of 2009 common genes were compared. Correlations in gene expression levels and comparisons for significant gene expression changes in this subset were calculated, and showed considerable divergence across the different platforms, suggesting the need for establishing industrial manufacturing standards, and further independent and thorough validation of the technology.
Affiliation(s)
- Paul K Tan
- Microarray Core Laboratory, National Institute of Diabetes and Digestive and Kidney Disorders, National Institutes of Health, USA
8
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 2003; 31:5654-66. [PMID: 14500829 PMCID: PMC206470 DOI: 10.1093/nar/gkg770] [Citation(s) in RCA: 1490] [Impact Index Per Article: 67.7] [Indexed: 11/13/2022] Open
Abstract
The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.
Affiliation(s)
- Brian J Haas
- The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
9
Kalyanaraman A, Aluru S, Kothari S, Brendel V. Efficient clustering of large EST data sets on parallel computers. Nucleic Acids Res 2003; 31:2963-74. [PMID: 12771222 PMCID: PMC156714 DOI: 10.1093/nar/gkg379] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.5] [Indexed: 11/12/2022] Open
Abstract
Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory efficient algorithms to reduce the memory required to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.
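As a rough illustration of what single-linkage EST clustering means, the sketch below merges sequences that share at least one exact k-mer, using a union-find structure. This is not the PaCE algorithm and does not reproduce its memory-efficient or parallel design; the shared k-mer criterion, the function name `cluster_by_shared_kmers` and the default `k=8` are all assumptions chosen for brevity.

```python
from collections import defaultdict

def cluster_by_shared_kmers(seqs, k=8):
    """Group sequence indices into clusters; two sequences are linked
    when they share at least one exact k-mer (a simplifying assumption)."""
    parent = list(range(len(seqs)))

    def find(i):
        # Find the cluster representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first sequence containing it
    for idx, s in enumerate(seqs):
        for j in range(len(s) - k + 1):
            kmer = s[j:j + k]
            if kmer in first_seen:
                a, b = find(idx), find(first_seen[kmer])
                if a != b:
                    parent[a] = b  # merge the two clusters
            else:
                first_seen[kmer] = idx

    clusters = defaultdict(list)
    for idx in range(len(seqs)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

A real EST clustering tool would confirm each candidate pair by alignment before merging; the sketch skips that step, so it tends to over-merge sequences that share repeats.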
10
Zhu W, Schlueter SD, Brendel V. Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping. Plant Physiol 2003; 132:469-84. [PMID: 12805580 PMCID: PMC166990 DOI: 10.1104/pp.102.018101] [Citation(s) in RCA: 76] [Impact Index Per Article: 3.5] [Received: 11/21/2002] [Revised: 01/06/2003] [Accepted: 02/20/2003] [Indexed: 05/18/2023]
Abstract
Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.
Affiliation(s)
- Wei Zhu
- Department of Zoology and Genetics, Iowa State University, Ames 50011-3260, USA
11
Unneberg P, Wennborg A, Larsson M. Transcript identification by analysis of short sequence tags--influence of tag length, restriction site and transcript database. Nucleic Acids Res 2003; 31:2217-26. [PMID: 12682372 PMCID: PMC153741 DOI: 10.1093/nar/gkg313] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022] Open
Abstract
There exist a number of gene expression profiling techniques that utilize restriction enzymes for generation of short expressed sequence tags. We have studied how the choice of restriction enzyme influences various characteristics of tags generated in an experiment. We have also investigated various aspects of in silico transcript identification that these profiling methods rely on. First, analysis of 14 248 mRNA sequences derived from the RefSeq transcript database showed that 1-30% of the sequences lack a given restriction enzyme recognition site. Moreover, 1-5% of the transcripts have recognition sites located less than 10 bases from the poly(A) tail. The uniqueness of 10 bp tags lies in the range 90-95%, which increases only slightly with longer tags, due to the existence of closely related transcripts. Furthermore, 3-30% of upstream 10 bp tags are identical to 3' tags, introducing a risk of misclassification if upstream tags are present in a sample. Second, we found that a sequence length of 16-17 bp, including the recognition site, is sufficient for unique transcript identification by BLAST based sequence alignment to the UniGene Human non-redundant database. Third, we constructed a tag-to-gene mapping for UniGene and compared it to an existing mapping database. The mappings agreed to 79-83%, where the selection of representative sequences in the UniGene clusters is the main cause of the disagreement. The results of this study may serve to improve the interpretation of sequence-based expression studies and the design of hybridization arrays, by identifying short tags that have a high reliability and separating them from tags that carry an inherent ambiguity in their capacity to discriminate between genes. To this end, supplementary information in the form of a web companion to this paper is located at http://biobase.biotech.kth.se/tagseq.
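The tag-uniqueness figure in this abstract can be made concrete with a small sketch. It takes the tag to be the bases immediately downstream of the 3'-most recognition site, which is an assumption about the tag definition rather than the authors' exact procedure. The helper names, the default site `CATG` and the 10 bp tag length are illustrative choices.

```python
from collections import Counter

def three_prime_tag(transcript, site="CATG", tag_len=10):
    """Return the tag following the 3'-most occurrence of `site`, or None."""
    pos = transcript.rfind(site)
    if pos == -1:
        return None  # transcript lacks the recognition site
    start = pos + len(site)
    tag = transcript[start:start + tag_len]
    # A site too close to the 3' end cannot yield a full-length tag.
    return tag if len(tag) == tag_len else None

def tag_uniqueness(transcripts, site="CATG", tag_len=10):
    """Fraction of tag-bearing transcripts whose tag occurs exactly once."""
    tags = [three_prime_tag(t, site, tag_len) for t in transcripts]
    tags = [t for t in tags if t is not None]
    counts = Counter(tags)
    unique = sum(1 for t in tags if counts[t] == 1)
    return unique / len(tags) if tags else 0.0
```

Transcripts that lack the site, or whose site lies too close to the 3' end, are excluded from the denominator here. A study could equally report them as a separate category, as the abstract does.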
Affiliation(s)
- Per Unneberg
- Department of Biotechnology, Royal Institute of Technology (KTH), Roslagsvägen 30B, S-106 91 Stockholm, Sweden.
12
Osato N, Itoh M, Konno H, Kondo S, Shibata K, Carninci P, Shiraki T, Shinagawa A, Arakawa T, Kikuchi S, Sato K, Kawai J, Hayashizaki Y. A computer-based method of selecting clones for a full-length cDNA project: simultaneous collection of negligibly redundant and variant cDNAs. Genome Res 2002; 12:1127-34. [PMID: 12097351 PMCID: PMC186622 DOI: 10.1101/gr.75202] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
Abstract
We describe a computer-based method that selects representative clones for full-length sequencing in a full-length cDNA project. Our method classifies end sequences using two kinds of criteria: grouping and clustering. Grouping places together variant cDNAs, family genes, and cDNAs with sequencing errors. Clustering separates those cDNA clones into distinct clusters. The full-length sequences of the clones selected by grouping are determined preferentially, and then the sequences selected by clustering are determined. Grouping reduced the number of rice cDNA clones for full-length sequencing to 21% and mouse cDNA clones to 25%. Rice full-length sequences selected by grouping showed a 1.07-fold redundancy. Mouse full-length sequences showed a 1.04-fold redundancy, which can be reduced by approximately 30% from the selection using our previous method. To estimate the coverage of unique genes, we used FANTOM (Functional Annotation of RIKEN Mouse cDNA Clones) clusters. Grouping covered almost all unique genes (93% of FANTOM clusters), and clustering covered all genes. Therefore, our method is useful for the selection of appropriate representative clones for full-length sequencing, thereby greatly reducing the cost, labor, and time necessary for this process.
Affiliation(s)
- Naoki Osato
- Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama, 230-0045, Japan
13
Abstract
The recent release of the draft sequence and the eventual completion of the human genome present the scientific community with a rich source of data to mine. Yet, these data are content poor in the absence of additional correlative information. Expressed sequence tag (EST) datasets and their associated gene indices have existed for many years, and represent the first attempt at understanding the complexity of the genome. These datasets remain extremely important as information sources and, in particular, as tools for analyzing the completed genomes. Here, we discuss the nature of ESTs and their associated tools and gene-indexing databases. In particular, we will compare three EST gene indices (UNIGENE, Merck Gene Index Version 2.0 and Doubletwist CAT), discuss how these gene indices are applied for both genome analysis and drug discovery, and demonstrate their importance as a complementary dataset to the annotated human genome.
Affiliation(s)
- J Yuan
- Department of Bioinformatics, Merck & Co., Inc., P.O. Box 2000-RY80-A1, Rahway, NJ 07065, USA.
14
Kan Z, Rouchka EC, Gish WR, States DJ. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res 2001; 11:889-900. [PMID: 11337482 PMCID: PMC311065 DOI: 10.1101/gr.155001] [Citation(s) in RCA: 255] [Impact Index Per Article: 10.6] [Indexed: 11/24/2022]
Abstract
With the availability of a nearly complete sequence of the human genome, aligning expressed sequence tags (ESTs) to the genomic sequence has become a practical and powerful strategy for gene prediction. Elucidating gene structure is a complex problem requiring the identification of splice junctions, gene boundaries, and alternative splicing variants. We have developed a software tool, Transcript Assembly Program (TAP), to delineate gene structures using genomically aligned EST sequences. TAP assembles the joint gene structure of the entire genomic region from individual splice junction pairs, using a novel algorithm that uses the EST-encoded connectivity and redundancy information to sort out the complex alternative splicing patterns. A method called polyadenylation site scan (PASS) has been developed to detect poly-A sites in the genome. TAP uses these predictions to identify gene boundaries by segmenting the joint gene structure at polyadenylated terminal exons. Reconstructing 1007 known transcripts, TAP scored a sensitivity (Sn) of 60% and a specificity (Sp) of 92% at the exon level. The gene boundary identification process was found to be accurate 78% of the time. TAP also reports alternative splicing patterns in EST alignments. An analysis of alternative splicing in 1124 genic regions suggested that more than half of human genes undergo alternative splicing. Surprisingly, we saw an absolute majority of the detected alternative splicing events affect the coding region. Furthermore, the evolutionary conservation of alternative splicing between human and mouse was analyzed using an EST-based approach. (See http://stl.wustl.edu/~zkan/TAP/)
Affiliation(s)
- Z Kan
- Center for Computational Biology, Washington University, St. Louis, Missouri 63110, USA
15
Haas SA, Beissbarth T, Rivals E, Krause A, Vingron M. GeneNest: automated generation and visualization of gene indices. Trends Genet 2000; 16:521-3. [PMID: 12199289 DOI: 10.1016/s0168-9525(00)02116-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.0] [Indexed: 10/18/2022]
Affiliation(s)
- S A Haas
- Deutsches Krebsforschungszentrum, Department of Theoretical Bioinformatics, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.
16
Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J. An optimized protocol for analysis of EST sequences. Nucleic Acids Res 2000; 28:3657-65. [PMID: 10982889 PMCID: PMC110731 DOI: 10.1093/nar/28.18.3657] [Citation(s) in RCA: 99] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022] Open
Abstract
The vast body of Expressed Sequence Tag (EST) data in the public databases provides an important resource for comparative and functional genomics studies and an invaluable tool for the annotation of genomic sequences. We have developed a rigorous protocol for reconstructing the sequences of transcribed genes from EST and gene sequence fragments. A key element in developing this protocol has been the evaluation of a number of sequence assembly programs to determine which most faithfully reproduce transcript sequences from EST data. The TIGR Gene Indices constructed using this protocol for human, mouse, rat and a variety of other plant and animal models have demonstrated their utility in a variety of applications and are freely available to the scientific research community.
Affiliation(s)
- F Liang
- The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
17
Bortoluzzi S, d'Alessi F, Romualdi C, Danieli GA. The human adult skeletal muscle transcriptional profile reconstructed by a novel computational approach. Genome Res 2000; 10:344-9. [PMID: 10720575 PMCID: PMC311426 DOI: 10.1101/gr.10.3.344] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.6] [Indexed: 10/14/2022]
Abstract
By applying a novel software tool, information on 4080 UniGene clusters was retrieved from three adult human skeletal muscle cDNA libraries, which were selected for being neither normalized nor subtracted. Reconstruction of a transcriptional profile of the corresponding tissue was attempted by a computational approach, classifying each transcript according to its level of expression. About 25% of the transcripts accounted for about 80% of the detected transcriptional activity, whereas most genes showed a low level of expression. This in silico transcriptional profile was then compared with data obtained by a SAGE study. A fairly good agreement between the two methods was observed. About 400 genes, highly expressed in skeletal muscle or putatively skeletal muscle-specific, may represent the minimal set of genes needed to determine the tissue specificity. These genes could be used as a convenient reference to monitor major changes in the transcriptional profile of adult human skeletal muscle in response to different physiological or pathological conditions, thus providing a framework for designing DNA microarrays and initiating biological studies.
Affiliation(s)
- S Bortoluzzi
- Department of Biology, University of Padua, 35131 Padua, Italy
18
Quackenbush J, Liang F, Holt I, Pertea G, Upton J. The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res 2000; 28:141-5. [PMID: 10592205 PMCID: PMC102391 DOI: 10.1093/nar/28.1.141] [Citation(s) in RCA: 219] [Impact Index Per Article: 8.8] [Indexed: 11/15/2022] Open
Abstract
Expressed sequence tags (ESTs) have provided a first glimpse of the collection of transcribed sequences in a variety of organisms. However, a careful analysis of this sequence data can provide significant additional functional, structural and evolutionary information. Our analysis of the public EST sequences, available through the TIGR Gene Indices (TGI; http://www.tigr.org/tdb/tdb.html), is an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed for selected organisms by first clustering, then assembling EST and annotated gene sequences from GenBank. This process produces a set of unique, high-fidelity virtual transcripts, or tentative consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, and to provide links between orthologous and paralogous genes.
Affiliation(s)
- J Quackenbush
- The Institute for Genomic Research, Rockville, MD 20850, USA.
19
Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA. A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Res 1999; 9:1143-55. [PMID: 10568754 PMCID: PMC310831 DOI: 10.1101/gr.9.11.1143] [Citation(s) in RCA: 106] [Impact Index Per Article: 4.1] [Received: 03/10/1999] [Accepted: 09/20/1999] [Indexed: 11/24/2022]
Abstract
The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented.
Affiliation(s)
- R T Miller
- South African National Bioinformatics Institute, Private Bag X17, Bellville 7535, University of the Western Cape, South Africa