1
Luo J, Ding Y, Peng Z, Chen K, Zhang X, Xiao T, Chen J. Molecular diversity and evolutionary trends of cysteine-rich peptides from the venom glands of Chinese spider Heteropoda venatoria. Sci Rep 2021; 11:3211. [PMID: 33547373 PMCID: PMC7865051 DOI: 10.1038/s41598-021-82668-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/07/2020] [Accepted: 01/20/2021] [Indexed: 11/14/2022]
Abstract
Heteropoda venatoria, in the family Sparassidae, is highly valued in pantropical countries because the species feeds on domestic insect pests. Unlike most other species of Araneomorphae, H. venatoria captures insects with its great speed and strong chelicerae (mouthparts) equipped with toxin glands rather than with a web. H. venatoria therefore provides unique opportunities for venom evolution research. The venom of H. venatoria was explored by matrix-assisted laser desorption/ionization tandem time-of-flight mass spectrometry and by analysis of expressed sequence tags. The 154 sequences encoding cysteine-rich peptides (CRPs) were grouped into 24 families based on phylogenetic analyses of the precursors and on the cysteine frameworks in the putative mature regions. Intriguingly, four kinds of motifs are described here for the first time in spider venom. Furthermore, by combining the diverse CRPs of H. venatoria with previous spider venom peptidomics data, the structures of the precursors and the patterns of the cysteine frameworks were analyzed. This work revealed the dynamic evolutionary trends of venom CRPs in H. venatoria: the precursor has evolved an extended mature peptide with more cysteines and a diminished or even absent propeptide between the signal and mature peptides; and the CRPs evolved by multiple duplications of an ancestral ICK gene as well as by recruitment of non-toxin genes.
Affiliation(s)
- Jie Luo
- College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, People's Republic of China
- Yiying Ding
- College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, People's Republic of China
- Zhihao Peng
- College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, People's Republic of China
- Kezhi Chen
- College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, People's Republic of China
- Xuewen Zhang
- College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, People's Republic of China
- Tiaoyi Xiao
- College of Animal Science and Technology, Hunan Agricultural University, Changsha, 410128, People's Republic of China
- Jinjun Chen
- College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, People's Republic of China; Hunan Provincial Engineering Technology Research Center for Cell Mechanics and Function Analysis, Changsha, 410128, People's Republic of China
2
Wijayawardena BK, Minchella DJ, DeWoody JA. Horizontal gene transfer in schistosomes: A critical assessment. Mol Biochem Parasitol 2015; 201:57-65. [DOI: 10.1016/j.molbiopara.2015.05.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/19/2015] [Revised: 05/27/2015] [Accepted: 05/29/2015] [Indexed: 02/04/2023]
3
Parada GE, Munita R, Cerda CA, Gysling K. A comprehensive survey of non-canonical splice sites in the human transcriptome. Nucleic Acids Res 2014; 42:10564-78. [PMID: 25123659 PMCID: PMC4176328 DOI: 10.1093/nar/gku744] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Indexed: 01/01/2023]
Abstract
We uncovered the diversity of non-canonical splice sites in the human transcriptome using deep transcriptome profiling. We mapped a total of 3.7 billion human RNA-seq reads and developed a set of stringent filters to avoid false non-canonical splice site detections. We identified 184 splice sites with non-canonical dinucleotides and U2/U12-like consensus sequences. We selected 10 of these U2/U12-like non-canonical splice site events and successfully validated 9 of them via reverse transcriptase-polymerase chain reaction and Sanger sequencing. Analyses of the 184 U2/U12-like non-canonical splice sites indicate that 51% of them are not annotated in GENCODE. In addition, 28% of them are conserved in mouse and 76% are involved in alternative splicing events, some of them with tissue-specific alternative splicing patterns. Interestingly, our analysis identified some U2/U12-like non-canonical splice sites that are converted into canonical splice sites by RNA A-to-I editing. Moreover, the U2/U12-like non-canonical splice sites have a differential distribution of splicing regulatory sequences, which may contribute to their recognition and regulation. Our analysis provides a high-confidence group of U2/U12-like non-canonical splice sites, which exhibit distinctive features among the total human splice sites.
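As a rough illustration of the first-pass labelling that a survey like this depends on, the sketch below classifies an intron by its terminal dinucleotides. It assumes that GT-AG, GC-AG and AT-AC are treated as the canonical pairs and everything else as non-canonical; the function name and the toy sequences are hypothetical, and the stringent read-mapping filters mentioned in the abstract are not modelled.

```python
# Dinucleotide pairs treated as canonical here (an assumption of this sketch).
CANONICAL_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def classify_intron(intron_seq: str) -> str:
    """Label an intron by its first two (donor) and last two (acceptor) bases."""
    # Normalise case and accept RNA input by mapping U to T.
    seq = intron_seq.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to have donor and acceptor dinucleotides")
    donor, acceptor = seq[:2], seq[-2:]
    return "canonical" if (donor, acceptor) in CANONICAL_PAIRS else "non-canonical"


if __name__ == "__main__":
    # Toy intron sequences, invented for illustration only.
    for intron in ("GTAAGTCCCTTTCAG", "GCAAGTCCCTTTCAG",
                   "ATATCCTTTTTTAC", "GAAAGTCCCTTTCAG"):
        print(intron[:2], intron[-2:], classify_intron(intron))
```

A real pipeline would derive the intron coordinates from spliced read alignments and would need to handle strand orientation, which this sketch leaves out.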
Affiliation(s)
- Guillermo E Parada
- Nucleus Millennium in Stress and Addiction, Department of Cellular and Molecular Biology, Faculty of Biological Sciences, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Roberto Munita
- Nucleus Millennium in Stress and Addiction, Department of Cellular and Molecular Biology, Faculty of Biological Sciences, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Cledi A Cerda
- Nucleus Millennium in Stress and Addiction, Department of Cellular and Molecular Biology, Faculty of Biological Sciences, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Katia Gysling
- Nucleus Millennium in Stress and Addiction, Department of Cellular and Molecular Biology, Faculty of Biological Sciences, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
4
Characterisation of full-length mitochondrial copies and partial nuclear copies (numts) of the cytochrome b and cytochrome c oxidase subunit I genes of Toxoplasma gondii, Neospora caninum, Hammondia heydorni and Hammondia triffittae (Apicomplexa: Sarcocystidae). Parasitol Res 2013; 112:1493-511. [PMID: 23358734 DOI: 10.1007/s00436-013-3296-4] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 11/04/2012] [Accepted: 01/11/2013] [Indexed: 10/27/2022]
Abstract
Genomic DNA was extracted from three oocyst isolates of Hammondia triffittae from foxes and two oocyst isolates of Hammondia heydorni from dogs, as well as from cell culture-derived tachyzoites of Toxoplasma gondii (RH strain) and Neospora caninum (NC-Liverpool strain), and examined by PCR with primers targeting the cytochrome b (cytb) and the cytochrome c oxidase subunit I (cox1) genes in order to characterise both genes and, if possible, the remainder of the mitochondrial genome of these species. Several primers were designed and used in various combinations to amplify regions within and between both genes and to determine gene order. When certain forward primers targeting cytb were used in combination with certain reverse primers targeting cox1, two overlapping sequences were obtained for each species and isolate studied, which showed that a full-length copy of cytb was followed 36-37 bp downstream by a full-length copy of cox1, and these sequences are believed to represent the true mitochondrial genes and the gene order in the mitochondrial genome of the four species examined. The cytb of T. gondii, N. caninum, H. heydorni and H. triffittae comprised a total of 1,080 bp (359 amino acids) and used ATG and TAA as start and stop codon, respectively. The cox1 of these species also used TAA as stop codon, whereas the most likely start codon was ATG, resulting in a gene comprising 1,491 bp (496 amino acids). Pair-wise sequence comparisons based on either cytb or cox1 clearly separated T. gondii from N. caninum and both of these species from the two Hammondia species, whereas the latter two species were 100 % identical at cytb and shared 99.3 % identity at cox1. Phylogenetic analyses using the maximum-likelihood method confirmed these findings and placed T. gondii in a clade separate from the three other species and all four Toxoplasmatinae in a sister clade to Eimeria spp. 
PCR with primers and/or primer pairs other than those used to obtain the full-length mitochondrial genes yielded several types of about 1-1.5 kb long sequences, which comprised stretches of the primer-targeted genes at both ends and an intervening non-coding sequence of various length and composition. Thus, portions of cytb could be found both upstream and downstream from portions of cox1 and portions of the same gene could be found adjacent to each other (cytb→cox1; cox1→cytb; cytb→cytb; cox1→cox1). Sequence comparisons revealed that some of these gene fragments were truncated genes, whereas others included the putative start or stop codon of the full-length mitochondrial genes. From the nature of the gene fragments and/or their flanking sequences, they are assumed to be located on the chromosomes of the nuclear genome and to represent nuclear mitochondrial DNA segments (numts) or pseudogenes. In the four species examined, there were no nucleotide differences between the full-length mitochondrial copies of cytb and cox1 and their various incomplete nuclear counterparts. With a few exceptions, identical numt types and closely similar flanking sequences were obtained for all four species, which would indicate that the original transfer of these mitochondrial genes to the nuclear genome and/or the majority of any subsequent rearrangements of these gene fragments within the nuclear genome happened before the four species diverged. Yet, there were species-specific differences in the nucleotide composition of the nuclear gene fragments, identical to the differences in the mitochondrial genes, which would indicate that the incomplete nuclear copies of cytb and cox1 have been continuously updated during evolution to conform to their mitochondrial parent genes. The PCR-based findings of numts were further supported by Basic Local Alignment Search Tool (BLAST) searches against genome sequences of T. gondii and N. caninum using the concatenated mitochondrial cytb/cox1 sequences as queries. These searches revealed the presence of numerous numts of eight distinct types in both species, with each one having a fixed starting and end point with respect to the nucleotide positions in the full-length mitochondrial genes. Four numt types were completely homologous between both species, whereas four other types differed with respect to their end point and/or the absence/presence of a 96-bp deletion. Each starting and end point was associated with a unique 100-200-bp long flanking sequence, which further revealed the presence of numts. For both species, the numt types and their various arrangements with respect to each other were identical or similar to those obtained by PCR in all four species examined. None of the identified numts covered a full-length gene, but together, the various numts covered the entire mitochondrial cytb and cox1 genes in an overlapping manner. In addition, they were fairly closely spaced on the chromosomes, and these features may explain why the nuclear copies were preferentially amplified to the exclusion of the true mitochondrial genes with most primers and primer pairs used in the present study. The possibility of a similar high prevalence of numts occurring in the nuclear genome of dinoflagellates is discussed.
5
Li S, Heermann DW. Using chimaeric expression sequence tag as the reference to identify three-dimensional chromosome contacts. DNA Res 2012; 20:45-53. [PMID: 23213109 PMCID: PMC3576657 DOI: 10.1093/dnares/dss032] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
Transcription-induced chimaeric transcripts, the potential post-transcriptional processing products, might reflect the spatial proximity of actively transcribed genes co-localized in transcription factories. A growing body of expression data deposited in databases provides us with the raw material for screening such chimaeric transcripts and using them as probes to identify interactions between genes in cis or in trans. Based on high-quality chimaeric transcripts gleaned from human expression sequence tag data with selection criteria, we identified the patterns of inter- and intrachromosomal gene–gene interactions. On top of the contact pattern from interchromosomal interactions, we also observed an exponential behaviour of the intrachromosomal interactions within a certain length scale, which is consistent with the independent experimental results from Hi-C screening and with the Random Loop Model. A compatible result is found for mouse. Transcription-induced chimaeric transcripts, most of which might be accidental products with trivial functions, shed light on the spatial organization of chromosomes. These inter- and intrachromosomal interactions might contribute to the compaction of chromosomes, their segregation and formation of the chromosome territories, and their spatial distribution within the nucleus.
Affiliation(s)
- Songling Li
- Theoretical Biophysics Group, Institute for Theoretical Physics, University of Heidelberg, Heidelberg, Germany
6
Lu J, Li C, Shi C, Balducci J, Huang H, Ji HL, Chang Y, Huang Y. Identification of novel splice variants and exons of human endothelial cell-specific chemotaxic regulator (ECSCR) by bioinformatics analysis. Comput Biol Chem 2012; 41:41-50. [PMID: 23147565 DOI: 10.1016/j.compbiolchem.2012.10.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/16/2012] [Revised: 10/10/2012] [Accepted: 10/11/2012] [Indexed: 01/01/2023]
Abstract
The recent discovery of the biological function of endothelial cell-specific chemotaxic regulator (ECSCR), previously known as endothelial cell-specific molecule 2 (ECSM2), in modulating endothelial cell migration, apoptosis, and angiogenesis has made it an attractive molecule in vascular research. Thus, identification of splice variants of ECSCR could provide new strategies for better understanding its roles in health and disease. In this study, we performed a series of BLAST searches on the human EST database with the known ECSCR cDNA sequence (Variant 1), and identified three additional splice variants (Variants 2-4). When examining the ECSCR gene in the human genome assemblies, we found a large unknown region between Exons 9 and 11. By PCR amplification and sequencing, we partially mapped Exon 10 within this previously unknown region of the ECSCR gene. Taken together, in addition to the previously reported human ECSCR, we identified three novel full-length splice variants potentially encoding different protein isoforms. We further defined a total of twelve exons and nearly all exon-intron boundaries of the gene, of which only eight are annotated in current public databases. Our work provides new information on gene structure and alternative splicing of the human ECSCR, which may imply its functional complexity. This undoubtedly opens new opportunities for future investigation of the biological and pathological significance of these ECSCR splice variants.
Affiliation(s)
- Jia Lu
- Department of Obstetrics and Gynecology, Barrow Neurological Institute, St Joseph's Hospital and Medical Center, Phoenix, AZ 85013, USA
7
Grunau C, Boissier J. No evidence for lateral gene transfer between salmonids and schistosomes. Nat Genet 2010; 42:918-9. [PMID: 20980981 DOI: 10.1038/ng1110-918] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/09/2022]
8
Ribosomal protein genes of holometabolan insects reject the Halteria, instead revealing a close affinity of Strepsiptera with Coleoptera. Mol Phylogenet Evol 2010; 55:846-59. [PMID: 20348001 DOI: 10.1016/j.ympev.2010.03.024] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 06/09/2009] [Revised: 02/18/2010] [Accepted: 03/19/2010] [Indexed: 11/23/2022]
Abstract
The phylogenetic relationships among holometabolan insect orders remain poorly known, despite a wealth of previous studies. In particular, past attempts to clarify the sister-group of the enigmatic order Strepsiptera with rRNA genes have led to intense debate about long-branch attraction (the 'Strepsiptera problem'), without resolving the taxonomic question at hand. Here, we appealed to alternative nuclear sequences of 27 ribosomal proteins (RPs) to generate a data matrix of 10,731 nucleotides for 22 holometabolan taxa, including two strepsipteran species. Phylogenetic relationships among holometabolan insects were analyzed under several nucleotide-coding schemes to explore differences in signal and systematic biases. Saturation and compositional bias particularly affected third positions, which greatly differed in AT content (18-72%). Such confounding factors were best reduced by R-Y coding and removal of third codon positions, resulting in more strongly supported topologies, whereas amino acid coding gave poor resolution. The placement of Strepsiptera with Coleoptera (the Coleopterida) was recovered under most coding schemes and analytical methods, if often with modest support and ambiguity. In contrast, an alternative sister-group with Diptera (the Halteria) was only found in one analysis using parsimony, and was weakly supported. The topologies here generally support Coleoptera+Strepsiptera as the sister-group to Mecopterida (Siphonaptera+Mecoptera+Diptera+Lepidoptera+Trichoptera), while Hymenoptera were always recovered as sister-group to the remaining Holometabola.
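The two recoding steps named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes an in-frame coding sequence that starts at codon position 1, and the function names are hypothetical. R-Y coding collapses purines (A, G) to R and pyrimidines (C, T) to Y, so only transversions remain visible in the recoded data.

```python
# Purines (A, G) become R; pyrimidines (C, T, and U for RNA input) become Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_code(seq: str) -> str:
    """Recode a nucleotide sequence as purines (R) and pyrimidines (Y).

    Gaps and ambiguity codes are passed through unchanged.
    """
    return "".join(RY_MAP.get(base, base) for base in seq.upper())


def drop_third_positions(seq: str) -> str:
    """Remove every third base, assuming the sequence starts at codon position 1."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


if __name__ == "__main__":
    cds = "ATGGCC"  # toy in-frame sequence, invented for illustration
    print(ry_code(cds))
    print(drop_third_positions(cds))
```

Both functions work on a single alignment row; applying them to every row of an alignment keeps the columns in register, since each transformation depends only on position and base identity.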
9
Hittinger CT, Johnston M, Tossberg JT, Rokas A. Leveraging skewed transcript abundance by RNA-Seq to increase the genomic depth of the tree of life. Proc Natl Acad Sci U S A 2010; 107:1476-81. [PMID: 20080632 PMCID: PMC2824393 DOI: 10.1073/pnas.0910449107] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.1] [Indexed: 11/18/2022]
Abstract
Assembling the tree of life is a major goal of biology, but progress has been hindered by the difficulty and expense of obtaining the orthologous DNA required for accurate and fully resolved phylogenies. Next-generation DNA sequencing technologies promise to accelerate progress, but sequencing the genomes of hundreds of thousands of eukaryotic species remains impractical. Eukaryotic transcriptomes, which are smaller than genomes and biased toward highly expressed genes that tend to be conserved, could potentially provide a rich set of phylogenetic characters. We sampled the transcriptomes of 10 mosquito species by assembling 36-bp sequence reads into phylogenomic data matrices containing hundreds of thousands of orthologous nucleotides from hundreds of genes. Analysis of these data matrices yielded robust phylogenetic inferences, even with data matrices constructed from surprisingly few sequence reads. This approach is more efficient, data-rich, and economical than traditional PCR-based and EST-based methods and provides a scalable strategy for generating phylogenomic data matrices to infer the branches and twigs of the tree of life.
Affiliation(s)
- Chris Todd Hittinger
- Department of Biochemistry and Molecular Genetics, University of Colorado Denver Health Sciences Center, Aurora, CO 80045
- Center for Genome Sciences, Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108
- Mark Johnston
- Department of Biochemistry and Molecular Genetics, University of Colorado Denver Health Sciences Center, Aurora, CO 80045
- Center for Genome Sciences, Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108
- John T. Tossberg
- Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235
- Antonis Rokas
- Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235
10
Funari VA, Voevodski K, Leyfer D, Yerkes L, Cramer D, Tolan DR. Quantitative gene expression profiles in real time from expressed sequence tag databases. Gene Expr 2010; 14:321-36. [PMID: 20635574 PMCID: PMC2954622 DOI: 10.3727/105221610x12717040569820] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022]
Abstract
An accumulation of expressed sequence tag (EST) data in the public domain and the availability of bioinformatic programs have made EST gene expression profiling a common practice. However, the utility and validity of using EST databases (e.g., dbEST) have been criticized, particularly for quantitative assessment of gene expression. Problems with EST sequencing errors, library construction, EST annotation, and multiple paralogs make generation of specific and sensitive qualitative and quantitative expression profiles a concern. In addition, most EST-derived expression data exist in previously assembled databases. The Virtual Northern Blot (VNB) (http://tlab.bu.edu/vnb.html) allows generation, evaluation, and optimization of expression profiles in real time, which is especially important for alternatively spliced, novel, or poorly characterized genes. Representative gene families with variable nucleotide sequence identity, tissue specificity, and levels of expression (bcl-xl, aldoA, and cyp2d9) are used to assess the quality of VNB's output. The profiles generated by VNB are more sensitive and specific than those constructed with ESTs listed in preindexed databases at UCSC and NCBI. Moreover, quantitative expression profiles produced by VNB are comparable to quantification obtained from Northern blots and qPCR. The VNB pipeline generates real-time gene expression profiles for single-gene queries that are both qualitatively and quantitatively reliable.
Affiliation(s)
- Dimitry Leyfer
- †Bioinformatics Program, Boston University, Boston, MA, USA
- Laura Yerkes
- *Biology Department, Boston University, Boston, MA, USA
- Donald Cramer
- *Biology Department, Boston University, Boston, MA, USA
- Dean R. Tolan
- *Biology Department, Boston University, Boston, MA, USA
- †Bioinformatics Program, Boston University, Boston, MA, USA
11
Barash U, Cohen-Kaplan V, Arvatz G, Gingis-Velitski S, Levy-Adam F, Nativ O, Shemesh R, Ayalon-Sofer M, Ilan N, Vlodavsky I. A novel human heparanase splice variant, T5, endowed with protumorigenic characteristics. FASEB J 2009; 24:1239-48. [PMID: 20007507 DOI: 10.1096/fj.09-147074] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Indexed: 11/11/2022]
Abstract
Heparanase is a mammalian endo-β-D-glucuronidase that can cleave heparan sulfate side chains, an activity strongly implicated in tumor cell dissemination. The current study aimed to identify and characterize heparanase splice variants. LEADS, Compugen's alternative splicing modeling platform (Compugen, Tel Aviv, Israel), was used to search for splice variants in silico; tumor-derived cell lines (i.e., CAG myeloma) and tumor biopsies were utilized to validate T5 expression in vivo; signaling (i.e., Src phosphorylation) was evaluated following T5 gene silencing or overexpression and correlated with cell proliferation, colony formation, and tumor xenograft development. A novel spliced form of human heparanase, termed T5, was identified. In this splice variant, 144 bp of intron 5 are joined with exon 4, which results in a truncated, enzymatically inactive protein. T5 overexpression resulted in increased cell proliferation and larger colonies in soft agar, mediated by Src activation. Furthermore, T5 overexpression markedly enhanced tumor xenograft development. T5 expression is up-regulated in 75% of human renal cell carcinoma biopsies examined, which suggests that this splice variant is clinically relevant. Controls included cells overexpressing wild-type heparanase or an empty plasmid and normal-looking tissue adjacent to the carcinoma lesion. T5 is a novel functional splice variant of human heparanase endowed with protumorigenic characteristics.
Affiliation(s)
- Uri Barash
- Cancer and Vascular Biology Research Center, Faculty of Medicine, Technion, P.O. Box 9649, Haifa 31096, Israel
12
Bragg LM, Stone G. k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage. Bioinformatics 2009; 25:2302-8. [PMID: 19570806 PMCID: PMC2735666 DOI: 10.1093/bioinformatics/btp410] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm. Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (i.e., links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared. Availability: k-link is written in C++, is licensed under the GNU General Public License, and can be downloaded from http://www.bioinformatics.csiro.au/products.shtml. Contact: lauren.bragg@csiro.au Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lauren M Bragg
- CSIRO Mathematical and Information Sciences, North Ryde, NSW, Australia.
13
Zhang Z, Xin D, Wang P, Zhou L, Hu L, Kong X, Hurst LD. Noisy splicing, more than expression regulation, explains why some exons are subject to nonsense-mediated mRNA decay. BMC Biol 2009; 7:23. [PMID: 19442261 PMCID: PMC2697156 DOI: 10.1186/1741-7007-7-23] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.6] [Received: 04/21/2009] [Accepted: 05/14/2009] [Indexed: 01/23/2023]
Abstract
Background: Nonsense-mediated decay is a mechanism that degrades mRNAs with a premature termination codon. That some exons have premature termination codons at fixation is paradoxical: why make a transcript if it is only to be destroyed? One model supposes that splicing is inherently noisy and spurious transcripts are common. The evolution of a premature termination codon in a regularly made unwanted transcript can be a means to prevent costly translation. Alternatively, nonsense-mediated decay can be regulated under certain conditions so the presence of a premature termination codon can be a means to up-regulate transcripts needed when nonsense-mediated decay is suppressed. Results: To resolve this issue we examined the properties of putative nonsense-mediated decay targets in humans and mice. We started with a well-annotated set of protein coding genes and found that 2 to 4% of genes are probably subject to nonsense-mediated decay, and that the premature termination codon reflects neither rare mutations nor sequencing artefacts. Several lines of evidence suggested that the noisy splicing model has considerable relevance: 1) exons that are uniquely found in nonsense-mediated decay transcripts (nonsense-mediated decay-specific exons) tend to be newly created; 2) have low-inclusion level; 3) tend not to be a multiple of three long; 4) belong to genes with multiple splice isoforms more often than expected; and 5) these genes are not obviously enriched for any functional class nor conserved as nonsense-mediated decay candidates in other species. However, nonsense-mediated decay-specific exons for which distant orthologous exons can be found tend to have been under purifying selection, consistent with the regulation model. Conclusion: We conclude that for recently evolved exons the noisy splicing model is the better explanation of their properties, while for ancient exons the nonsense-mediated decay regulated gene expression is a viable explanation.
Affiliation(s)
- Zhenguo Zhang
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS) & Shanghai Jiao Tong University School of Medicine (SJTUSM), Shanghai, PR China.
14
van Hooff SR, Koster J, Hulsen T, van Schaik BDC, Roos M, van Batenburg MF, Versteeg R, van Kampen AHC. The construction of genome-based transcriptional units. OMICS: A Journal of Integrative Biology 2009; 13:105-114. [PMID: 19320556 DOI: 10.1089/omi.2008.0036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
Gene-oriented sequence clusters (transcriptional units) have found many applications in genomics research, including the construction of transcriptome maps and the identification of splice variants. We developed a new method to construct transcriptional units that uses the genomic sequence as a template. We present and discuss our method in detail together with an evaluation of the transcriptional units for human. We constructed 33,007 and 27,792 transcriptional units for human and mouse, respectively. The sensitivity (81%) and specificity (90%) of our method compare favorably to other established methods. We evaluated the representation of experimentally validated and predicted intergenic spliced transcripts in humans and show that we correctly represent a large fraction of these cases by single transcriptional units. Our method performs well, but the evaluation of the final set of transcriptional units shows that improvements to the algorithm are still possible. However, because the precise number and types of errors are difficult to track, it is not obvious how to significantly improve the algorithm. We believe that ongoing research efforts are necessary to further improve current methods. This should include detailed documentation, comparison, and evaluation of current methods.
Affiliation(s)
- Sander R van Hooff
- Bioinformatics Laboratory, Academic Medical Center, Meibergdreef 9, Amsterdam, The Netherlands
|
15
|
Mello BP, Abrantes EF, Torres CH, Machado-Lima A, Fonseca RDS, Carraro DM, Brentani RR, Reis LFL, Brentani H. No-match ORESTES explored as tumor markers. Nucleic Acids Res 2009; 37:2607-17. [PMID: 19270067 PMCID: PMC2677862 DOI: 10.1093/nar/gkp074] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022]
Abstract
Sequencing technologies and new bioinformatics tools have led to the complete sequencing of various genomes. However, information regarding the human transcriptome and its annotation is yet to be completed. The Human Cancer Genome Project, using ORESTES (open reading frame EST sequences) methodology, contributed to this objective by generating data from about 1.2 million expressed sequence tags. Approximately 30% of these sequences did not align to ESTs in the public databases and were considered no-match ORESTES. On the basis that a set of these ESTs could represent new transcripts, we constructed a cDNA microarray. This platform was hybridized against 12 different normal or tumor tissues. We identified 3421 transcribed regions not associated with annotated transcripts, representing 83.3% of the platform. The total number of differentially expressed sequences was 1007. Also, 28% of the analyzed sequences could represent noncoding RNAs. Our data reinforce the view that the human genome is pervasively transcribed and point out molecular marker candidates for different cancers. To reinforce our data, we confirmed, by real-time PCR, the differential expression of three out of eight potential tumor markers in prostate tissues. Lists of the 1007 differentially expressed sequences and the 291 potentially noncoding tumor markers are provided.
Affiliation(s)
- Barbara P Mello
- Hospital A. C. Camargo, Rua Prof. Antônio Prudente 211, São Paulo, SP, Brazil
|
16
|
Koscielny G, Texier VL, Gopalakrishnan C, Kumanduri V, Riethoven JJ, Nardone F, Stanley E, Fallsehr C, Hofmann O, Kull M, Harrington E, Boué S, Eyras E, Plass M, Lopez F, Ritchie W, Moucadel V, Ara T, Pospisil H, Herrmann A, Reich JG, Guigó R, Bork P, Doeberitz MVK, Vilo J, Hide W, Apweiler R, Thanaraj TA, Gautheret D. ASTD: The Alternative Splicing and Transcript Diversity database. Genomics 2009; 93:213-20. [DOI: 10.1016/j.ygeno.2008.11.003] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.4] [Received: 04/24/2008] [Revised: 11/03/2008] [Accepted: 11/05/2008] [Indexed: 10/21/2022]
|
17
|
Scheibye-Alsing K, Hoffmann S, Frankel A, Jensen P, Stadler PF, Mang Y, Tommerup N, Gilchrist MJ, Nygård AB, Cirera S, Jørgensen CB, Fredholm M, Gorodkin J. Sequence assembly. Comput Biol Chem 2008; 33:121-36. [PMID: 19152793 DOI: 10.1016/j.compbiolchem.2008.11.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Received: 10/24/2008] [Revised: 11/28/2008] [Accepted: 11/28/2008] [Indexed: 01/20/2023]
Abstract
Despite the rapidly increasing number of sequenced and re-sequenced genomes, many issues regarding the computational assembly of large-scale sequencing data remain unresolved. Computational assembly is crucial in large genome projects as well as for the evolving high-throughput technologies, and it plays an important role in processing the information generated by these methods. Here, we provide a comprehensive overview of the current publicly available sequence assembly programs. We describe the basic principles of computational assembly along with the main concerns, such as repetitive sequences in genomic DNA, highly expressed genes and alternative transcripts in EST sequences. We summarize existing comparisons of different assemblers and provide detailed descriptions and directions for download of assembly programs at: http://genome.ku.dk/resources/assembly/methods.html.
Affiliation(s)
- K Scheibye-Alsing
- Division of Genetics and Bioinformatics, IBHV, University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C, Denmark
|
18
|
Abstract
The EST division of GenBank, dbEST, is widely used in many applications such as gene discovery and verification of exon–intron structure. However, the use of EST sequences in the dbEST libraries is often hampered by inconsistent terminology used to describe the library sources and by the presence of contaminated sequences. Here, we describe CleanEST, a novel database server that classifies dbEST libraries and removes contaminants. We classified all dbEST libraries according to species and sequencing center. In addition, we further classified human EST libraries by anatomical and pathological systems according to eVOC ontologies. For each dbEST library, we provide two different sets of cleansed sequences: ‘pre-cleansed’ and ‘user-cleansed’. To generate pre-cleansed sequences, we cleansed sequences in dbEST by aligning EST sequences against well-known contamination sources: UniVec, Escherichia coli, mitochondria and chloroplast (for plants). To provide user-cleansed sequences, we built an automatic user-cleansing pipeline, in which sequences of a user-selected library are cleansed on the fly according to user-selected options. The server is available at http://cleanest.kobic.re.kr/ and the database is updated monthly.
Affiliation(s)
- Byungwook Lee
- Korean BioInformation Center, KRIBB, Daejeon 305-817, Korea.
|
19
|
Tiran Z, Oren A, Hermesh C, Rotman G, Levine Z, Amitai H, Handelsman T, Beiman M, Chen A, Landesman-Milo D, Dassa L, Peres Y, Koifman C, Glezer S, Vidal-Finkelstein R, Bahat K, Pergam T, Israel C, Horev J, Tsarfaty I, Ayalon-Soffer M. A Novel Recombinant Soluble Splice Variant of Met Is a Potent Antagonist of the Hepatocyte Growth Factor/Scatter Factor-Met Pathway. Clin Cancer Res 2008; 14:4612-21. [DOI: 10.1158/1078-0432.ccr-08-0108] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/16/2022]
|
20
|
Helftenbein G, Koslowski M, Dhaene K, Seitz G, Sahin U, Türeci O. In silico strategy for detection of target candidates for antibody therapy of solid tumors. Gene 2008; 414:76-84. [PMID: 18358640 DOI: 10.1016/j.gene.2008.02.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2007] [Revised: 02/05/2008] [Accepted: 02/13/2008] [Indexed: 10/22/2022]
Abstract
In contrast to earlier attempts to identify target candidates suitable for monoclonal antibody (mAb)-based cancer therapies, we concentrated on highly selective lineage-specific genes that are additionally preserved or even overexpressed in orthotopic cancers. In a script-aided workflow, we reduced all human entries of the RefSeq mRNA database to those encoding transmembrane-domain-bearing gene products and subjected them to BLAST analysis against the human EST database. All BLAST results were validated in a gene-centric way, allowing two types of data curation prior to expression profiling of matching ESTs in selected healthy tissues: (i) exclusion of questionable ESTs arising, e.g., from genomic contamination and (ii) elimination of erroneously predicted mRNAs as well as transcripts with only weak EST coverage. The impact of such stringent input control on the accuracy of prediction is underlined by RT-PCR confirmation of predicted tissue distribution patterns for a number of selected candidates.
|
21
|
Zhu J, He F, Wang J, Yu J. Modeling transcriptome based on transcript-sampling data. PLoS One 2008; 3:e1659. [PMID: 18286206 PMCID: PMC2243018 DOI: 10.1371/journal.pone.0001659] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 12/10/2007] [Accepted: 01/21/2008] [Indexed: 01/10/2023]
Abstract
Background: Newly evolved multiplex sequencing technology has brought transcriptome sequencing to an unprecedented depth. Millions of transcript tags can now be acquired in a single experiment through parallelization. The significant increase in throughput and reduction in cost required us to address some fundamental questions: How many transcript tags do we have to sequence for a given transcriptome? How could we estimate the total number of unique transcripts for different cell types (transcriptome diversity) and the distribution of their copy numbers (transcriptome dynamics)? What is the probability that a transcript with a given expression level will be detected at a certain sampling depth? Methodology/Principal Findings: We developed a statistical model to evaluate these parameters based on transcriptome-sampling data. Three mixture models were exploited for their potential to model the sampling frequencies. We demonstrated that relative abundances of all transcripts in a transcriptome follow the generalized inverse Gaussian distribution. The widely known beta and gamma distributions failed to fulfill the singular characteristics of the relative abundance distribution, i.e., highly skewed toward zero and with a long tail. An estimator of transcriptome diversity and an analytical form of the sampling growth curve were proposed in a coherent framework. Experimental data fitted this model very well, and Monte Carlo simulations based on this model replicated sampling experiments with remarkable precision. Conclusions: Taking human embryonic stem cells as a prototype, we demonstrated that sequencing tens of thousands of transcript tags in an ordinary EST/SAGE experiment was far from sufficient. In order to fully characterize a human transcriptome, millions of transcript tags had to be sequenced. This model lays a statistical basis for transcriptome-sampling experiments and in essence can be used for all sampling-based data.
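The sampling-depth question in this abstract can be illustrated with a much simpler calculation than the authors' model. The sketch below is not the generalized inverse Gaussian mixture model described above; it assumes only that tags are drawn independently, so a transcript with relative abundance p appears at least once among n tags with probability 1 − (1 − p)^n. The function names are hypothetical.

```python
import math


def detection_probability(rel_abundance: float, n_tags: int) -> float:
    """Probability of seeing a transcript at least once among n_tags,
    assuming independent draws (a simplification, not the paper's model)."""
    if not 0.0 <= rel_abundance <= 1.0:
        raise ValueError("rel_abundance must lie in [0, 1]")
    if n_tags < 0:
        raise ValueError("n_tags must be non-negative")
    return 1.0 - (1.0 - rel_abundance) ** n_tags


def tags_needed(rel_abundance: float, target_prob: float) -> int:
    """Smallest number of tags giving at least target_prob of detection."""
    if not 0.0 < rel_abundance < 1.0:
        raise ValueError("rel_abundance must lie strictly between 0 and 1")
    if not 0.0 < target_prob < 1.0:
        raise ValueError("target_prob must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= target for n.
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - rel_abundance))


if __name__ == "__main__":
    # A transcript making up one millionth of the transcriptome.
    print(detection_probability(1e-6, 10_000))
    print(tags_needed(1e-6, 0.95))
```

Even under this crude assumption, a transcript at one part per million needs on the order of millions of tags for reliable detection, which is consistent with the abstract's conclusion that tens of thousands of tags are far from sufficient.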
Affiliation(s)
- Jiang Zhu
- Chinese Academy of Sciences (CAS) Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
- Fuhong He
- Chinese Academy of Sciences (CAS) Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
- Jing Wang
- Chinese Academy of Sciences (CAS) Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- * To whom correspondence should be addressed (JW, JY)
- Jun Yu
- Chinese Academy of Sciences (CAS) Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- * To whom correspondence should be addressed (JW, JY)
|
22
|
Zou X, Chung T, Lin X, Malakhova ML, Pike HM, Brown RE. Human glycolipid transfer protein (GLTP) genes: organization, transcriptional status and evolution. BMC Genomics 2008; 9:72. [PMID: 18261224 PMCID: PMC2262070 DOI: 10.1186/1471-2164-9-72] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 10/01/2007] [Accepted: 02/08/2008] [Indexed: 12/31/2022]
Abstract
BACKGROUND: Glycolipid transfer protein is the prototypical and founding member of the new GLTP superfamily distinguished by a novel conformational fold and glycolipid binding motif. The present investigation provides the first insights into the organization, transcriptional status, and phylogenetic/evolutionary relationships of GLTP genes. RESULTS: In human cells, single-copy GLTP genes were found in chromosomes 11 and 12. The gene at locus 11p15.1 exhibited several features of a potentially active retrogene, including a highly homologous (approximately 94%), full-length coding sequence containing all key amino acid residues involved in glycolipid liganding. To establish the transcriptional activity of each human GLTP gene, in silico EST evaluations, RT-PCR amplifications of GLTP transcript(s), and methylation analyses of regulator CpG islands were performed using various human cells. Active transcription was found for 12q24.11 GLTP but 11p15.1 GLTP was transcriptionally silent. Heterologous expression and purification of the GLTP paralogs showed glycolipid intermembrane transfer activity only for 12q24.11 GLTP. Phylogenetic/evolutionary analyses indicated that the 5-exon/4-intron organizational pattern and encoded sequence of 12q24.11 GLTP were highly conserved in therian mammals and other vertebrates. Orthologs of the intronless GLTP gene were observed in primates but not in rodentiates, carnivorates, cetartiodactylates, or didelphimorphiates, consistent with recent evolutionary development. CONCLUSION: The results identify and characterize the gene responsible for GLTP expression in humans and provide the first evidence for the existence of a GLTP pseudogene, while demonstrating the rigorous approach needed to unequivocally distinguish transcriptionally-active retrogenes from silent pseudogenes.
The results also rectify errors in the Ensembl database regarding the organizational structure of the actively transcribed GLTP gene in Pan troglodytes and establish the intronless GLTP as a primate-specific, processed pseudogene marker. A solid foundation has been established for future identification of hereditary defects in human GLTP genes.
Affiliation(s)
- Xianqiong Zou
- The Hormel Institute, University of Minnesota, Austin, Minnesota 55912, USA.
|
23
|
Woolfe A, Elgar G. Organization of conserved elements near key developmental regulators in vertebrate genomes. Advances in Genetics 2008; 61:307-38. [PMID: 18282512 DOI: 10.1016/s0065-2660(07)00012-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 01/01/2023]
Abstract
Sequence conservation has traditionally been used as a means to target functional regions of complex genomes. In addition to its use in identifying coding regions of genes, the recent availability of whole genome data for a number of vertebrates has permitted high-resolution analyses of the noncoding "dark matter" of the genome. This has resulted in the identification of a large number of highly conserved sequence elements that appear to be preserved in all bony vertebrates. Further positional analysis of these conserved noncoding elements (CNEs) in the genome demonstrates that they cluster around genes involved in developmental regulation. This chapter describes the identification and characterization of these elements, with particular reference to their composition and organization.
Affiliation(s)
- Adam Woolfe
- School of Biological and Chemical Sciences, Queen Mary, University of London, London E1 4NS, United Kingdom
|
24
|
Abstract
In recent years, genome-wide detection of alternative splicing based on Expressed Sequence Tag (EST) sequence alignments with mRNA and genomic sequences has dramatically expanded our understanding of the role of alternative splicing in functional regulation. This chapter reviews the data, methodology, and technical challenges of these genome-wide analyses of alternative splicing, and briefly surveys some of the uses to which such alternative splicing databases have been put. For example, with proper alternative splicing database schema design, it is possible to query genome-wide for alternative splicing patterns that are specific to particular tissues, disease states (e.g., cancer), gender, or developmental stages. EST alignments can be used to estimate exon inclusion or exclusion level of alternatively spliced exons and evolutionary changes for various species can be inferred from exon inclusion level. Such databases can also help automate design of probes for RT-PCR and microarrays, enabling high throughput experimental measurement of alternative splicing.
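The exon inclusion estimate mentioned in this abstract can be sketched as a simple ratio of EST counts. The abstract does not give a formula, so the definition below (ESTs supporting inclusion divided by all ESTs informative for the exon) is an assumption, and the function name and the `min_ests` threshold are hypothetical.

```python
from typing import Optional


def exon_inclusion_level(inclusion_ests: int,
                         skipping_ests: int,
                         min_ests: int = 1) -> Optional[float]:
    """Fraction of informative ESTs that include the exon.

    Returns None when fewer than min_ests ESTs cover the event, because a
    ratio based on very few observations is not meaningful.
    """
    if inclusion_ests < 0 or skipping_ests < 0:
        raise ValueError("EST counts must be non-negative")
    total = inclusion_ests + skipping_ests
    if total < min_ests:
        return None
    return inclusion_ests / total


if __name__ == "__main__":
    # Three ESTs contain the exon and one skips it.
    print(exon_inclusion_level(3, 1))
```

Raising `min_ests` trades coverage for reliability: fewer exons receive an estimate, but each estimate rests on more observations.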
|
25
|
Liang C, Wang G, Liu L, Ji G, Fang L, Liu Y, Carter K, Webb JS, Dean JFD. ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs). BMC Genomics 2007; 8:134. [PMID: 17535431 PMCID: PMC1894976 DOI: 10.1186/1471-2164-8-134] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 11/02/2006] [Accepted: 05/29/2007] [Indexed: 11/30/2022]
Abstract
Background: With the advent of low-cost, high-throughput sequencing, the amount of public domain Expressed Sequence Tag (EST) sequence data available for both model and non-model organisms is growing exponentially. While these data are widely used for characterizing various genomes, they also present a serious challenge for data quality control and validation due to their inherent deficiencies, particularly for species without genome sequences. Description: ConiferEST is an integrated system for data reprocessing, visualization and mining of conifer ESTs. In its current release, Build 1.0, it houses 172,229 loblolly pine EST sequence reads, which were obtained from reprocessing raw DNA sequencer traces using our software, WebTraceMiner. The trace files were downloaded from NCBI Trace Archive. ConiferEST provides biologists unique, easy-to-use data visualization and mining tools for a variety of putative sequence features including cloning vector segments, adapter sequences, restriction endonuclease recognition sites, polyA and polyT runs, and their corresponding Phred quality values. Based on these putative features, verified sequence features such as 3' and/or 5' termini of cDNA inserts in either sense or non-sense strand have been identified in-silico. Interestingly, only 30.03% of the designated 3' ESTs were found to have an authenticated 5' terminus in the non-sense strand (i.e., polyT tails), while fewer than 5.34% of the designated 5' ESTs had a verified 5' terminus in the sense strand. Such previously ignored features provide valuable insight for data quality control and validation of error-prone ESTs, as well as the ability to identify novel functional motifs embedded in large EST datasets. We found that "double-termini adapters" were effective indicators of potential EST chimeras.
For all sequences with in-silico verified termini/terminus, we used InterProScan to assign protein domain signatures, results of which are available for in-depth exploration using our biologist-friendly web interfaces. Conclusion: ConiferEST represents a unique and complementary public resource for EST data integration and mining in conifers by reprocessing raw DNA traces, identifying putative sequence features and determining and annotating in-silico verified features. Seamlessly integrated with other public resources, ConiferEST provides biologists powerful tools to verify data, visualize abnormalities, including EST chimeras, and explore large EST datasets.
Affiliation(s)
- Chun Liang
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Gang Wang
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Lin Liu
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Guoli Ji
- Department of Automation, Xiamen University, Xiamen, Fujian, 361005, China
- Lin Fang
- Beijing Genomics Institute, Beijing 101300, China
- Yuansheng Liu
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Kikia Carter
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Jason S Webb
- Department of Botany, Miami University, Oxford, Ohio 45056, USA
- Jeffrey FD Dean
- Warnell School of Forestry and Natural Resources, University of Georgia, Athens, Georgia 30602, USA
|
26
|
Kleffe J, Möller F, Wittig B. Simultaneous identification of long similar substrings in large sets of sequences. BMC Bioinformatics 2007; 8 Suppl 5:S7. [PMID: 17570866 PMCID: PMC1892095 DOI: 10.1186/1471-2105-8-s5-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022]
Abstract
Background: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered. Results: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a considered maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at . Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogenous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
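The window criterion described in this abstract (a bounded number of errors within each overlapping window of a given size) can be illustrated with a small check. This is not ClustDB's algorithm: it assumes two already-aligned, equal-length sequences, counts mismatches only (no gaps), and does not perform seed generation or extension. The function names are hypothetical.

```python
from typing import List


def window_error_counts(a: str, b: str, window: int) -> List[int]:
    """Mismatch count for every overlapping window of the given size.

    If the sequences are no longer than one window, a single count for the
    whole alignment is returned.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0:
        raise ValueError("window must be positive")
    mismatches = [int(x != y) for x, y in zip(a, b)]
    if len(mismatches) <= window:
        return [sum(mismatches)]
    current = sum(mismatches[:window])
    counts = [current]
    # Slide the window one position at a time, updating the running count.
    for i in range(window, len(mismatches)):
        current += mismatches[i] - mismatches[i - window]
        counts.append(current)
    return counts


def passes_window_criterion(a: str, b: str, window: int, max_errors: int) -> bool:
    """True if no window contains more than max_errors mismatches."""
    return max(window_error_counts(a, b, window)) <= max_errors


if __name__ == "__main__":
    print(window_error_counts("ACGTACGTAC", "ACGTTCGTAA", 4))
    print(passes_window_criterion("ACGTACGTAC", "ACGTTCGTAA", 4, 1))
```

A per-window bound rejects alignments in which errors cluster locally, such as a contaminating insert, even when the overall error rate is low. This is the property the abstract contrasts with alignments controlled only by mismatch and gap penalties.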
Affiliation(s)
- Jürgen Kleffe, Friedrich Möller, Burghardt Wittig
- Institut für Molekularbiologie und Bioinformatik, Charite-Universitätsmedizin Berlin, Arnimallee 22, 14195 Berlin, Germany
|
27
|
Unneberg P, Claverie JM. Tentative mapping of transcription-induced interchromosomal interaction using chimeric EST and mRNA data. PLoS One 2007; 2:e254. [PMID: 17330142 PMCID: PMC1804257 DOI: 10.1371/journal.pone.0000254] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 12/15/2006] [Accepted: 02/06/2007] [Indexed: 11/18/2022]
Abstract
Recent studies on chromosome conformation show that chromosomes colocalize in the nucleus, bringing together active genes in transcription factories. This spatial proximity of actively transcribing genes could provide a means for RNA interaction at the transcript level. We have screened public databases for chimeric EST and mRNA sequences with the intent of mapping transcription-induced interchromosomal interactions. We suggest that chimeric transcripts may be the result of close encounters of active genes, either as functional products or "noise" in the transcription process, and that they could be used as probes for chromosome interactions. We have found a total of 5,614 chimeric ESTs and 587 chimeric mRNAs that meet our selection criteria. Due to their higher quality, the mRNA findings are of particular interest and we hope that they may serve as food for thought for specialists in diverse areas of molecular biology.
Affiliation(s)
- Per Unneberg
- Structural and Genomic Information Laboratory, Centre National de la Recherche Scientifique (CNRS) UPR-2589, Institut de Biologie Structurale et Microbiologie, Marseille, France.
|
28
|
Baek D, Davis C, Ewing B, Gordon D, Green P. Characterization and predictive discovery of evolutionarily conserved mammalian alternative promoters. Genome Res 2007; 17:145-55. [PMID: 17210929 PMCID: PMC1781346 DOI: 10.1101/gr.5872707] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.2] [Indexed: 11/24/2022]
Abstract
Recent studies suggest that surprisingly many mammalian genes have alternative promoters (APs); however, their biological roles, and the characteristics that distinguish them from single promoters (SPs), remain poorly understood. We constructed a large data set of evolutionarily conserved promoters, and used it to identify sequence features, functional associations, and expression patterns that differ by promoter type. The four promoter categories (CpG-rich APs, CpG-poor APs, CpG-rich SPs, and CpG-poor SPs) each show characteristic strengths and patterns of sequence conservation, frequencies of putative transcription-related motifs, and tissue and developmental stage expression preferences. APs display substantially higher sequence conservation than SPs, as do CpG-poor promoters relative to CpG-rich promoters. Among CpG-poor promoters, APs and SPs show sharply contrasting developmental stage preferences and TATA box frequencies. We developed a discriminator to computationally predict promoter type, verified its accuracy through experimental tests that incorporate a novel method for deconvolving mixed sequence traces, and used it to find several new APs. The discriminator predicts that almost half of all mammalian genes have evolutionarily conserved APs. This high frequency of APs, together with the strong purifying selection maintaining them, implies a crucial role in expanding the expression diversity of the mammalian genome.
Affiliation(s)
- Daehyun Baek
- Department of Bioengineering, University of Washington, Seattle, Washington 98195, USA
- Corresponding author; fax (206) 685-9720
- Colleen Davis
- Howard Hughes Medical Institute and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Brent Ewing
- Howard Hughes Medical Institute and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- David Gordon
- Howard Hughes Medical Institute and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Phil Green
- Howard Hughes Medical Institute and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Corresponding author; fax (206) 685-9720
|
29
|
Lu F, Li J, Jiang Z. Computational identification and analysis of G protein-coupled receptor targets. Drug Dev Res 2007. [DOI: 10.1002/ddr.20148] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
|
30
|
Pi C, Liu J, Peng C, Liu Y, Jiang X, Zhao Y, Tang S, Wang L, Dong M, Chen S, Xu A. Diversity and evolution of conotoxins based on gene expression profiling of Conus litteratus. Genomics 2006; 88:809-819. [PMID: 16908117 DOI: 10.1016/j.ygeno.2006.06.014] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.2] [Received: 04/22/2006] [Revised: 06/21/2006] [Accepted: 06/22/2006] [Indexed: 11/24/2022]
Abstract
Cone snails are attracting increasing scientific attention due to their unprecedented diversity of invaluable channel-targeted peptides. As arguably the largest and most successful evolutionary genus of invertebrates, Conus also may become the model system to study the evolution of multigene families and biodiversity. Here, a set of 897 expressed sequence tags (ESTs) derived from a Conus litteratus venom duct was analyzed to illuminate the diversity and evolution mechanism of conotoxins. Nearly half of these ESTs represent the coding sequences of conotoxins, which were grouped into 42 novel conotoxin cDNA sequences (seven superfamilies), with T-superfamily conotoxins being the dominant component. The gene expression profile of conotoxin revealed that transcripts are expressed with order-of-magnitude differences, sequence divergence within a superfamily increases from the N to the C terminus of the open reading frame, and even multiple scaffold-different mature peptides exist in a conotoxin gene superfamily. Most excitingly, we identified a novel conotoxin superfamily and three novel cysteine scaffolds. These results give an initial insight into the C. litteratus transcriptome that will contribute to a better understanding of conotoxin evolution and the study of the cone snail genome in the near future.
Affiliation(s)
- Canhui Pi
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
| | - Junliang Liu
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
| | - Can Peng
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
| | - Yun Liu
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
| | - Xiuhua Jiang
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
| | - Yu Zhao
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
- Shaojun Tang
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
- Lei Wang
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
- Meiling Dong
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
- Shangwu Chen
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China
- Anlong Xu
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, The Open Laboratory for Marine Functional Genomics of the State High-Tech Development Program, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, Guangzhou 510275, People's Republic of China.
|
31
|
Jeong SC, Yang K, Park JY, Han KS, Yu S, Hwang TY, Hur CG, Kim SH, Park PB, Kim HM, Park YI, Liu JR. Structure, expression, and mapping of two nodule-specific genes identified by mining public soybean EST databases. Gene 2006; 383:71-80. [PMID: 16973305 DOI: 10.1016/j.gene.2006.07.015] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2006] [Revised: 07/13/2006] [Accepted: 07/13/2006] [Indexed: 11/17/2022]
Abstract
Numerous nodule-specific genes, which are involved in root nodule development and function, are known and are still being discovered. Here, we report the structure, expression, and genetic map location of two novel nodule-specific genes. First, two EST groups, one obtained from a nodule library and the other from all aboveground tissue libraries, were clustered with regard to in silico expression profiles. We compiled a pool of 103 putative nodule-specific sequence clusters. Then, two representative ESTs were selected for further experimental analyses. According to the full-length cDNA sequences, one was an EST of a novel nodule-specific polygalacturonase gene, GmPGN, and the other was an EST of a new short nodule-specific gene, GmEKN. Expression analyses of the GmPGN cDNAs indicated that GmPGN expression was not detectable in any soybean tissue except the nodule and may be regulated via alternative splicing. GmEKN expression was detected most strongly in the nodule. The predicted GmEKN protein is both glutamic acid- and lysine-rich, and is also highly hydrophilic. Genetic mapping located GmPGN near the known quantitative trait locus conferring resistance to soybean cyst nematode on soybean molecular linkage group (MLG) B1, and GmEKN on MLG A2. These results provide useful information for the use of these genes in research on the orchestration of numerous genes in nodule development and function.
Affiliation(s)
- Soon-Chun Jeong
- BioEvaluation Center, Korea Research Institute of Bioscience and Biotechnology, #52, Oun-dong, Yuseong-gu, Daejeon 305-806, Republic of Korea.
|
32
|
Gilat R, Goncharov S, Esterman N, Shweiki D. Under-representation of PolyA/PolyT tailed ESTs in human ESTdb: an obstacle to alternative polyadenylation inference. Bioinformation 2006; 1:220-4. [PMID: 17597892 PMCID: PMC1891686] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2006] [Revised: 10/02/2006] [Accepted: 10/02/2006] [Indexed: 11/21/2022] Open
Abstract
Alternative polyadenylation is a key regulatory process which affects the 3' end formation of variants of the same transcription unit, thus altering the gene expression pattern and the transcripts' cellular behaviour and characteristics. The common methodology for computational analysis of alternative polyadenylation signal utilization is based on EST data, specifically on PolyA/PolyT tailed ESTs. Studying the human EST dataset, we detected a significant underrepresentation of PolyA/PolyT tailed ESTs, constituting only 10% of most libraries. Consequently, more than 50% of false-negative events are revealed in the analysis of alternatively polyadenylated variants' expression. We therefore argue that the ratios of PolyA/PolyT tailed ESTs, as represented in the human EST database, do not reflect the true picture of 3' end variant formation in a given physiological situation. Thus the EST database should not be considered a reliable source for alternative polyadenylation signal usage inference.
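The notion of a "PolyA/PolyT tailed EST" can be illustrated with a minimal sketch. The function name `has_poly_tail` and the 10-base run threshold are assumptions made for illustration only; the abstract does not state the criteria the authors used.

```python
import re


def has_poly_tail(est: str, min_run: int = 10) -> bool:
    """Return True if the EST ends in a poly(A) run or starts with a poly(T) run.

    min_run is a hypothetical threshold, not a value taken from the paper.
    """
    est = est.upper()
    # A 3' read of the sense strand ends in A's; a read of the
    # complementary strand starts with T's.
    ends_with_poly_a = re.search(r"A{%d,}$" % min_run, est) is not None
    starts_with_poly_t = re.match(r"T{%d,}" % min_run, est) is not None
    return ends_with_poly_a or starts_with_poly_t


def tailed_fraction(ests: list) -> float:
    """Fraction of ESTs in a library that carry a poly(A)/poly(T) tail."""
    if not ests:
        return 0.0
    return sum(has_poly_tail(e) for e in ests) / len(ests)
```

Applied per library, a fraction computed this way corresponds to the kind of ratio the abstract reports as being around 10% in most libraries.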
Affiliation(s)
- Dorit Shweiki
- Corresponding author. Phone: +972 3 5211853; Fax: +972 3 5211871
|
33
|
Gray TA, Wilson A, Fortin PJ, Nicholls RD. The putatively functional Mkrn1-p1 pseudogene is neither expressed nor imprinted, nor does it regulate its source gene in trans. Proc Natl Acad Sci U S A 2006; 103:12039-44. [PMID: 16882727 PMCID: PMC1567693 DOI: 10.1073/pnas.0602216103] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
A recently promoted genome evolution model posits that mammalian pseudogenes can regulate their founding source genes, and it thereby ascribes an important function to "junk DNA." This model arose from analysis of a serendipitous mouse mutant in which a transgene insertion/deletion caused severe polycystic kidney disease and osteogenesis imperfecta with approximately 80% perinatal lethality, when inherited paternally [Hirotsune, S., et al. (2003) Nature 423, 91-96]. The authors concluded that the transgene reduced the expression of a nearby transcribed and imprinted pseudogene, Mkrn1-p1. This reduction in chromosome 5-imprinted Mkrn1-p1 transcripts was proposed to destabilize the cognate chromosome 6 Mkrn1 source gene mRNA, with a partial reduction in one Mkrn1 isoform leading to the imprinted phenotype. Here, we show that 5' Mkrn1-p1 is fully methylated on both alleles, a pattern indicative of silenced chromatin, and that Mkrn1-p1 is not transcribed and therefore cannot stabilize Mkrn1 transcripts in trans. A small, truncated, rodent-specific Mkrn1 transcript explains the product erroneously attributed to Mkrn1-p1. Additionally, Mkrn1 expression is not imprinted, and 5' Mkrn1 is fully unmethylated. Finally, mice in which Mkrn1 has been directly disrupted show none of the phenotypes attributed to a partial reduction of Mkrn1. These data contradict the previous suggestions that Mkrn1-p1 is imprinted, and that either it or its source Mkrn1 gene relates to the original imprinted transgene phenotype. This study invalidates the data upon which the pseudogene trans-regulation model is based and therefore strongly supports the view that mammalian pseudogenes are evolutionary relics.
Affiliation(s)
- Todd A. Gray
- Wadsworth Center, David Axelrod Institute, Albany, NY 12208
- To whom correspondence may be addressed.
- Alison Wilson
- Wadsworth Center, David Axelrod Institute, Albany, NY 12208
- Robert D. Nicholls
- Birth Defects Laboratories, Division of Medical Genetics, Department of Pediatrics, Children's Hospital of Pittsburgh, and
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15213
- To whom correspondence may be addressed.
|
34
|
Ro S, Kang SH, Farrelly AM, Ordog T, Partain R, Fleming N, Sanders KM, Kenyon JL, Keef KD. Template switching within exons 3 and 4 of KV11.1 (HERG) gives rise to a 5' truncated cDNA. Biochem Biophys Res Commun 2006; 345:1342-1349. [PMID: 16723117 DOI: 10.1016/j.bbrc.2006.05.032] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2006] [Accepted: 05/02/2006] [Indexed: 10/24/2022]
Abstract
K(V)11.1 (HERG) channels contribute to membrane potential in a number of excitable cell types. We cloned a variant of K(V)11.1 from human jejunum containing a 171 bp deletion spanning exons 3 and 4. Expression of a full-length cDNA clone containing this deletion gave rise to protein that trafficked to the cell membrane and generated robust currents. The deletion occurred in a G/C-rich region, and identical UGGUGG sequence elements were located at the deletion boundaries. In recent studies these features have been implicated in causing deletions via template switching during cDNA synthesis. To examine this possibility we compared cDNAs from human brain, heart, and jejunum synthesized at a lower (42 °C) and a higher (70 °C) temperature. The 171 bp deletion was absent at the higher temperature. Our results suggest that the sequence and secondary structure of mRNA in the G/C-rich region lead to template switching, producing a cDNA product with a 171 bp deletion.
Affiliation(s)
- S Ro
- Department of Physiology and Cell Biology, University of Nevada School of Medicine, Reno, 89557, USA
|
35
|
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res 2006; 34:3150-60. [PMID: 16757580 PMCID: PMC1475746 DOI: 10.1093/nar/gkl396] [Citation(s) in RCA: 122] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2006] [Revised: 04/13/2006] [Accepted: 05/10/2006] [Indexed: 11/13/2022] Open
Abstract
Reconstructing full-length transcript isoforms from sequence fragments (such as ESTs) is a major interest and challenge for bioinformatic analysis of pre-mRNA alternative splicing. This problem has been formulated as finding traversals across the splice graph, which is a directed acyclic graph (DAG) representation of gene structure and alternative splicing. In this manuscript we introduce a probabilistic formulation of the isoform reconstruction problem, and provide an expectation-maximization (EM) algorithm for its maximum likelihood solution. Using a series of simulated data and expressed sequences from real human genes, we demonstrate that our EM algorithm can correctly handle various situations of fragmentation and coupling in the input data. Our work establishes a general probabilistic framework for splice graph-based reconstructions of full-length isoforms.
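As a rough illustration of the expectation-maximization idea summarized above, the sketch below estimates relative isoform abundances from fragments that are each compatible with one or more candidate isoforms. It is a simplified toy model, not the authors' splice-graph formulation: the input format, the function name `em_isoform_abundance`, and the omission of fragment and isoform lengths are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List


def em_isoform_abundance(
    fragments: List[FrozenSet[str]],
    isoforms: List[str],
    n_iter: int = 200,
) -> Dict[str, float]:
    """Estimate relative isoform abundances from ambiguous fragments.

    Each fragment is represented by the set of isoform names it is
    compatible with (a hypothetical input format).
    """
    # Start from a uniform guess over all candidate isoforms.
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        counts = {iso: 0.0 for iso in isoforms}
        # E-step: split each fragment across its compatible isoforms
        # in proportion to the current abundance estimates.
        for compatible in fragments:
            total = sum(theta[iso] for iso in compatible)
            if total == 0.0:
                continue
            for iso in compatible:
                counts[iso] += theta[iso] / total
        # M-step: renormalise the expected counts into abundances.
        norm = sum(counts.values())
        if norm == 0.0:
            break
        theta = {iso: c / norm for iso, c in counts.items()}
    return theta
```

With three fragments unique to isoform A, one unique to isoform B, and two compatible with both, the estimate settles at 0.75 for A and 0.25 for B, because the shared fragments are apportioned according to the evidence from the unique ones.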
Affiliation(s)
- Yi Xing
- Molecular Biology Institute, Center for Computational Biology, Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Tianwei Yu
- Department of Statistics, University of California, Los Angeles, USA
- Dental Research Institute, School of Dentistry, University of California, Los Angeles, USA
- Ying Nian Wu
- Department of Statistics, University of California, Los Angeles, USA
- Meenakshi Roy
- Molecular Biology Institute, Center for Computational Biology, Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Joseph Kim
- Molecular Biology Institute, Center for Computational Biology, Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Christopher Lee
- Molecular Biology Institute, Center for Computational Biology, Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
|
36
|
Liang C, Sun F, Wang H, Qu J, Freeman RM, Pratt LH, Cordonnier-Pratt MM. MAGIC-SPP: a database-driven DNA sequence processing package with associated management tools. BMC Bioinformatics 2006; 7:115. [PMID: 16522212 PMCID: PMC1421442 DOI: 10.1186/1471-2105-7-115] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2005] [Accepted: 03/07/2006] [Indexed: 11/29/2022] Open
Abstract
Background Processing raw DNA sequence data is an especially challenging task for relatively small laboratories and core facilities that produce as many as 5000 or more DNA sequences per week from multiple projects in widely differing species. To meet this challenge, we have developed the flexible, scalable, and automated sequence processing package described here. Results MAGIC-SPP is a DNA sequence processing package consisting of an Oracle 9i relational database, a Perl pipeline, and user interfaces implemented either as JavaServer Pages (JSP) or as a Java graphical user interface (GUI). The database not only serves as a data repository, but also controls processing of trace files. MAGIC-SPP includes an administrative interface, a laboratory information management system, and interfaces for exploring sequences, monitoring quality control, and troubleshooting problems related to sequencing activities. In the sequence trimming algorithm it employs new features designed to improve performance with respect to concerns such as concatenated linkers, identification of the expected start position of a vector insert, and extending the useful length of trimmed sequences by bridging short regions of low quality when the following high quality segment is sufficiently long to justify doing so. Conclusion MAGIC-SPP has been designed to minimize human error, while simultaneously being robust, versatile, flexible and automated. It offers a unique combination of features that permit administration by a biologist with little or no informatics background. It is well suited to both individual research programs and core facilities.
Affiliation(s)
- Chun Liang
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
- Department of Botany, Miami University, Oxford, OH 45056, USA
- Feng Sun
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
- Nanosphere, Inc., 4088 Commercial Avenue, Northbrook, IL 60062, USA
- Haiming Wang
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
- Junfeng Qu
- Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Robert M Freeman
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Lee H Pratt
- Laboratory for Genomics and Bioinformatics, Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
|
37
|
Urbanczyk-Wochniak E, Usadel B, Thimm O, Nunes-Nesi A, Carrari F, Davy M, Bläsing O, Kowalczyk M, Weicht D, Polinceusz A, Meyer S, Stitt M, Fernie AR. Conversion of MapMan to allow the analysis of transcript data from Solanaceous species: effects of genetic and environmental alterations in energy metabolism in the leaf. PLANT MOLECULAR BIOLOGY 2006; 60:773-92. [PMID: 16649112 DOI: 10.1007/s11103-005-5772-4] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/23/2005] [Accepted: 12/08/2005] [Indexed: 05/08/2023]
Abstract
The tomato microarray TOM1 offers the possibility to monitor the levels of several thousand transcripts in parallel. The microelements represented on this tomato microarray have been putatively assigned to unigenes, and organised in functional classes using the MapMan ontology (Thimm et al., 2004. Plant J. 37: 914-939). This ontology was initially developed for use with the Arabidopsis ATH1 array, has a low level of redundancy, and can be combined with the MapMan software to provide a biologically structured overview of changes of transcripts, metabolites and enzyme activities. Use of this application is illustrated using three case studies with published or novel TOM1 array data sets for Solanaceous species. Comparison of previously reported data on transcript levels in potato leaves in the middle of the day and the middle of the night identified coordinated changes in the levels of transcripts of genes involved in various metabolic pathways and cellular events. Comparison with diurnal changes of gene expression in Arabidopsis revealed common features, illustrating how MapMan can be used to compare responses in different organisms. Comparison of transcript levels in new experiments performed on the leaves of the cultivated tomato S. lycopersicum and the wild relative S. pennellii revealed a general decrease of levels of transcripts of genes involved in terpene and phenylpropanoid metabolism as well as chorismate biosynthesis in the crop compared to the wild relative. This matches the recently reported decrease of the levels of secondary metabolites in the latter. In the third case study, new expression array data for two genotypes deficient in TCA cycle enzymes is analysed to show that these genotypes have elevated levels of transcripts associated with photosynthesis. This in part explains the previously documented enhanced rates of photosynthesis in these genotypes.
Since the Solanaceous MapMan is intended to be a community resource it will be regularly updated on improvements in tomato gene annotation and transcript profiling resources.
Affiliation(s)
- Ewa Urbanczyk-Wochniak
- Max-Planck-Institut für Molekulare Pflanzenphysiologie, Am Mühlenberg 1, Golm, 14476, Germany
|
38
|
Liu D, Graber JH. Quantitative comparison of EST libraries requires compensation for systematic biases in cDNA generation. BMC Bioinformatics 2006; 7:77. [PMID: 16503995 PMCID: PMC1431573 DOI: 10.1186/1471-2105-7-77] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2005] [Accepted: 02/17/2006] [Indexed: 12/28/2022] Open
Abstract
Background Publicly accessible EST libraries contain valuable information that can be utilized for studies of tissue-specific gene expression and processing of individual genes. This information is, however, confounded by multiple systematic effects arising from the procedures used to generate these libraries. Results We used alignment of ESTs against a reference set of transcripts to estimate the size distributions of the cDNA inserts and sampled mRNA transcripts in individual EST libraries and show how these measurements can be used to inform quantitative comparisons of libraries. While significant attention has been paid to the effects of normalization and subtraction, we also find significant biases in transcript sampling introduced by the combined procedures of reverse transcription and selection of cDNA clones for sequencing. Using examples drawn from studies of mRNA 3'-processing (cleavage and polyadenylation), we demonstrate effects of the transcript sampling bias, and provide a method for identifying libraries that can be safely compared without bias. All data sets, supplemental data, and software are available at our supplemental web site [1]. Conclusion The biases we characterize in the transcript sampling of EST libraries represent a significant and heretofore under-appreciated source of false positive candidates for tissue-, cell type-, or developmental stage-specific activity or processing of genes. Uncorrected, quantitative comparison of dissimilar EST libraries will likely result in the identification of statistically significant, but biologically meaningless changes.
Affiliation(s)
- Donglin Liu
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
- Joel H Graber
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
|
39
|
Pi C, Liu Y, Peng C, Jiang X, Liu J, Xu B, Yu X, Yu Y, Jiang X, Wang L, Dong M, Chen S, Xu AL. Analysis of expressed sequence tags from the venom ducts of Conus striatus: focusing on the expression profile of conotoxins. Biochimie 2006; 88:131-40. [PMID: 16183187 DOI: 10.1016/j.biochi.2005.08.001] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2005] [Accepted: 08/16/2005] [Indexed: 11/19/2022]
Abstract
Cone snails (genus Conus) are predatory marine gastropods that use venom peptides for interacting with prey, predators and competitors. A majority of these peptides, generally known as conotoxins, demonstrate striking selectivity in targeting specific subtypes of ion channels and neurotransmitter receptors. They are therefore not only useful tools in neuroscience to characterize receptors and receptor subtypes, but also offer great potential in new drug research and development. Here, a cDNA library from the venom ducts of a fish-hunting cone snail species, Conus striatus, is described for the generation of expressed sequence tags (ESTs). A total of 429 ESTs were grouped into 137 clusters or singletons. Among these sequences, 221 were toxin sequences, accounting for 52.1% (corresponding to 19 clusters) of all transcripts. A-superfamily (132 ESTs) and O-superfamily conotoxins (80 ESTs) constitute the predominant toxin components. Some non-disulfide-rich Conus peptides were also found. The expression profile of conotoxins also explained to some extent the pharmacological and physiological reactions elicited by this typical piscivorous species. For the first time, a nonstop transcript of conotoxin was identified, suggesting that alternative polyadenylation may be a means of post-transcriptional regulation of conotoxin production. A comparison analysis of these conotoxins reveals the different variation and divergence patterns in these two superfamilies. Our investigations indicate that focal hyper-mutation, block substitution and exon shuffling are three main mechanisms leading to the conotoxin diversity in a species. The comprehensive set of Conus gene sequences allowed the identification of the representative classes of conotoxins and related components, which may lay the foundation for further research and development of conotoxins.
Affiliation(s)
- Canhui Pi
- State Key Laboratory of Biocontrol, Guangdong Province Key Laboratory of Therapeutic Functional Genes, Department of Biochemistry, College of Life Sciences, Sun Yat-sen (Zhongshan) University, 135 Xingangxi Road, 510275 Guangzhou, China
|
40
|
Bowman TV, McCooey AJ, Merchant AA, Ramos CA, Fonseca P, Poindexter A, Bradfute SB, Oliveira DM, Green R, Zheng Y, Jackson KA, Chambers SM, McKinney-Freeman SL, Norwood KG, Darlington G, Gunaratne PH, Steffen D, Goodell MA. Differential mRNA processing in hematopoietic stem cells. Stem Cells 2005; 24:662-70. [PMID: 16373690 DOI: 10.1634/stemcells.2005-0552] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
Abstract
Hematopoietic stem cells (HSCs) maintain tissue homeostasis by rapidly responding to environmental changes. Although this function is well understood, the molecular mechanisms governing this characteristic are largely unknown. We used a sequence-based strategy to explore the role of both transcriptional and post-transcriptional regulation in HSC biology. We characterized the gene expression differences between HSCs, both quiescent and proliferating, and their differentiated progeny. This analysis revealed a large fraction of sequence tags aligned to intronic sequences, which we showed were derived from unspliced transcripts. A comparison of the biological properties of the observed spliced versus unspliced transcripts in HSCs showed that the unspliced transcripts were enriched in genes involved in DNA binding and RNA processing. In addition, levels of unspliced message decreased in a transcript-specific fashion after HSC activation in vivo. This change in unspliced transcript level coordinated with increases in gene expression of splicing machinery components. Combined, these results suggest that post-transcriptional regulation is important in HSC activation in vivo.
Affiliation(s)
- Teresa V Bowman
- Cell and Gene Therapy Center, Baylor College of Medicine, N1030, One Baylor Plaza, Houston, Texas 77030, USA
|
41
|
Akiva P, Toporik A, Edelheit S, Peretz Y, Diber A, Shemesh R, Novik A, Sorek R. Transcription-mediated gene fusion in the human genome. Genome Res 2005; 16:30-6. [PMID: 16344562 PMCID: PMC1356126 DOI: 10.1101/gr.4137606] [Citation(s) in RCA: 214] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Transcription of a gene usually ends at a regulated termination point, preventing the RNA-polymerase from reading through the next gene. However, sporadic reports suggest that chimeric transcripts, formed by transcription of two consecutive genes into one RNA, can occur in human. The splicing and translation of such RNAs can lead to a new, fused protein, having domains from both original proteins. Here, we systematically identified over 200 cases of intergenic splicing in the human genome (involving 421 genes), and experimentally demonstrated that at least half of these fusions exist in human tissues. We showed that unique splicing patterns dominate the functional and regulatory nature of the resulting transcripts, and found intergenic distance bias in fused compared with nonfused genes. We demonstrate that the hundreds of fused genes we identified are only a subset of the actual number of fused genes in human. We describe a novel evolutionary mechanism where transcription-induced chimerism followed by retroposition results in a new, active fused gene. Finally, we provide evidence that transcription-induced chimerism can be a mechanism contributing to the evolution of protein complexes.
|
42
|
Zhang J, Zhang L, Coombes KR. Gene sequence signatures revealed by mining the UniGene affiliation network. Bioinformatics 2005; 22:385-91. [PMID: 16339286 DOI: 10.1093/bioinformatics/bti796] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND In the post-genomic era, developing tools to decode biological information from genomic sequences is important. Inspired by affiliation network theory, we investigated gene sequences of two kinds of UniGene clusters (UCs): narrowly expressed transcripts (NETs), whose expression is confined to a few tissues; and prevalently expressed transcripts (PETs) that are expressed in many tissues. RESULTS We explored the human and the mouse UniGene databases to compare NETs and PETs from different perspectives. We found that NETs were associated with smaller cluster size, shorter sequence length, a lower likelihood of having LocusLink annotations, and lower and more sporadic levels of expression. Significantly, the dinucleotide frequencies of NETs are similar to those of intergenic sequences in the genome, and they differ from those of PETs. We used these differences in dinucleotide frequencies to develop a discriminant analysis model to distinguish PETs from intergenic sequences. CONCLUSIONS Our results show that most NETs resemble intergenic sequences, casting doubts on the quality of such UniGene clusters. However, we also noted that a fraction of NETs resemble PETs in terms of dinucleotide frequencies and other features. Such NETs may have fewer quality problems. This work may be helpful in the studies of non-coding RNAs and in the validation of gene sequence databases.
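The dinucleotide-frequency feature that the abstract relies on can be computed as in the sketch below. The function name `dinucleotide_frequencies` and the handling of ambiguous bases (pairs containing anything other than A, C, G, T are skipped) are assumptions for illustration; the abstract does not describe the authors' preprocessing or their discriminant model.

```python
from collections import Counter
from itertools import product
from typing import Dict


def dinucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Return the relative frequency of each of the 16 dinucleotides.

    Overlapping pairs are counted; pairs containing a base other than
    A, C, G or T (for example N) are ignored.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if set(p) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product("ACGT", repeat=2)
    }
```

The resulting 16-element vector is the kind of input a discriminant analysis could use to separate transcript-like from intergenic-like sequences.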
Affiliation(s)
- Jiexin Zhang
- Department of Biostatistics and Applied Mathematics, The University of Texas M.D. Anderson Cancer Center, 1515 Holcombe Boulevard, Box 447, Houston, TX 77030-4009, USA
|
43
|
Dixon RJ, Eperon IC, Hall L, Samani NJ. A genome-wide survey demonstrates widespread non-linear mRNA in expressed sequences from multiple species. Nucleic Acids Res 2005; 33:5904-13. [PMID: 16237125 PMCID: PMC1258171 DOI: 10.1093/nar/gki893] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2005] [Revised: 09/26/2005] [Accepted: 09/26/2005] [Indexed: 01/11/2023] Open
Abstract
We describe here the results of the first genome-wide survey of candidate exon repetition events in expressed sequences from human, mouse, rat, chicken, zebrafish and fly. Exon repetition is a rare event, reported in <10 genes, in which one or more exons is tandemly duplicated in mRNA but not in the gene. To identify candidates, we analysed database sequences for mRNA transcripts in which the order of the spliced exons does not follow the linear genomic order of the individual gene [events we term rearrangements or repetition in exon order (RREO)]. Using a computational approach, we have identified 245 genes in mammals that produce RREO events. RREO in mRNA occurs predominantly in the coding regions of genes. However, exon 1 is never involved. Analysis of the open reading frames suggests that this process may increase protein diversity and regulate protein expression via nonsense-mediated RNA decay. The sizes of the exons and introns involved around these events suggest a gene model structure that may facilitate non-linear splicing. These findings imply that RREO affects a significant subset of genes within a genome and suggests that non-linear information encoded within the genomes of complex organisms could contribute to phenotypic variation.
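The RREO criterion described above, a transcript whose spliced exons do not follow the linear genomic order, can be expressed as a simple check. The sketch assumes each transcript has already been aligned to its gene and reduced to a list of genomic exon indices; that representation and the function name `has_rreo` are illustrative assumptions, not the authors' pipeline.

```python
from typing import Sequence


def has_rreo(exon_order: Sequence[int]) -> bool:
    """Return True if the exon order is not strictly increasing.

    exon_order lists the genomic index of each exon in the order it
    appears in the mRNA (hypothetical input format). A repeated or
    out-of-order exon breaks the strictly increasing sequence.
    """
    return any(b <= a for a, b in zip(exon_order, exon_order[1:]))
```

For example, the order 1, 2, 3, 2, 3, 4 (tandem repetition of exons 2 and 3) and the order 1, 3, 2, 4 (rearrangement) are both flagged, while 1, 2, 4 (ordinary exon skipping) is not.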
Affiliation(s)
- Richard J Dixon
- Department of Cardiovascular Sciences, University of Leicester, Clinical Sciences Wing, Glenfield Hospital, Leicester LE3 9QP, UK.
|
44
|
Reis EM, Louro R, Nakaya HI, Verjovski-Almeida S. As antisense RNA gets intronic. OMICS: A Journal of Integrative Biology 2005; 9:2-12. [PMID: 15805775 DOI: 10.1089/omi.2005.9.2] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 02/04/2023]
Abstract
Recent work describing the transcriptional output of the human genome points to the existence of a significant number of non-coding RNA transcripts coming from intronic regions, with a fraction of these being oriented antisense relative to the protein-coding mRNA of the known gene. In this article, we survey the main findings of the large-scale expression analysis projects that led to the identification of antisense intronic messages and which demonstrate their ubiquitous expression in the human genome. We review the current knowledge on long, unspliced, intronic antisense transcripts, a new class of non-coding RNAs, recently described by our group to be correlated with the degree of tumor differentiation in prostate cancer, which we postulate is involved in the fine tuning of gene expression in eukaryotes. Possible mechanisms of antisense intronic transcript biogenesis and function in gene expression regulation are discussed, as is their involvement in human diseases. While there is still no conclusive evidence demonstrating a functional role for these long, intronic antisense messages, the far-reaching implications of their existence for the mechanisms regulating gene expression certainly warrant further experimentation.
Affiliation(s)
- Eduardo M Reis
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, Brazil
|
45
|
Baek D, Green P. Sequence conservation, relative isoform frequencies, and nonsense-mediated decay in evolutionarily conserved alternative splicing. Proc Natl Acad Sci U S A 2005; 102:12813-8. [PMID: 16123126 PMCID: PMC1192826 DOI: 10.1073/pnas.0506139102] [Citation(s) in RCA: 112] [Impact Index Per Article: 5.6] [Indexed: 11/18/2022] Open
Abstract
Studies of expressed sequence tag data sets have revealed large numbers of splicing variants for human genes, but it remains challenging to distinguish functionally important variants from aberrant splicing, clarify the nature of the alternative functions, and understand the signals that regulate splicing choices. To help address these issues, we have constructed and analyzed a large data set of 1,478 exon-skipping alternative splicing (AS) variants evolutionarily conserved in human and mouse. In about one-fifth of cases, one isoform appears subject to nonsense-mediated mRNA decay (NMD), supporting the idea that a major role of AS is to regulate gene expression; one-quarter of these NMD-inducing cases involve a conserved exon whose apparent sole purpose is to mediate destruction of the message when included. We explore sequence conservation likely related to splicing regulation, using in part a measure of the overall amount of conserved information in a sequence, and find that the increased conservation that has been observed within AS exons primarily affects synonymous sites, suggesting that regulatory signals significantly constrain synonymous substitution rates. We show that a lower frequency of the inclusion isoform relative to the exclusion isoform tends to be associated with weaker splice site signals, smaller exon size, and higher intronic sequence conservation, and provide evidence that all of these factors are under selection to control relative isoform frequencies. Some conserved instances of AS appear to represent aberrant splicing events that by chance have occurred in both species, and we develop a nonparametric likelihood approach to identify these.
Affiliation(s)
- Daehyun Baek
- Department of Bioengineering, University of Washington, Box 357730, Seattle, WA 98195, USA.
|
46
|
Makalowska I, Lin CF, Makalowski W. Overlapping genes in vertebrate genomes. Comput Biol Chem 2005; 29:1-12. [PMID: 15680581 DOI: 10.1016/j.compbiolchem.2004.12.006] [Citation(s) in RCA: 78] [Impact Index Per Article: 3.9] [Received: 11/01/2004] [Revised: 12/15/2004] [Accepted: 12/15/2004] [Indexed: 11/19/2022]
Abstract
Overlapping genes in mammalian genomes remain an unexpected phenomenon, even though hundreds of pairs of overlapping protein-coding genes have been reported so far. Overlapping genes can be divided into different categories based on the direction of transcription as well as on the sequence segments shared between overlapping coding regions. The biological functions of natural antisense transcripts, and their involvement in physiological processes and gene regulation in living organisms, are not fully understood. A number of documented examples indicate that they may exert control at various levels of gene expression, such as transcription, mRNA processing, splicing, stability, transport, and translation. Similarly, the evolutionary origin of such genes is not known; existing hypotheses can explain only selected cases of mammalian gene overlaps, which could originate as a result of rearrangements, overprinting, and/or adoption of signals in the neighboring gene locus.
Affiliation(s)
- Izabela Makalowska
- The Huck Institute of the Life Sciences, The Pennsylvania State University, 502 Wartik Lab, University Park, PA 16802, USA.
|
47
|
Abstract
SUMMARY: Recently, the Ka/Ks ratio test, which assesses the protein-coding potentials of genomic regions based on their non-synonymous to synonymous divergence rates, has been proposed and successfully used in genome annotations of eukaryotes. We systematically performed the Ka/Ks ratio test on 925 transcript-confirmed alternatively spliced exons in the human genome, which we describe in this manuscript. We found that 22.3% of evolutionarily conserved alternatively spliced exons cannot pass the Ka/Ks ratio test, compared with 9.8% for constitutive exons. The false negative rate was the highest (85.7%) for exons with low frequencies of transcript inclusion. Analyses of alternatively spliced exons supported by full-length mRNA sequences yielded similar results, and nearly half of exons involved in ancestral alternative splicing events could not pass this test. Our analysis suggests a future direction to incorporate comparative genomics-based alternative splicing predictions with the Ka/Ks ratio test in higher eukaryotes with extensive RNA alternative splicing.
Affiliation(s)
- Yi Xing
- Department of Chemistry and Biochemistry, Molecular Biology Institute, Center for Genomics and Proteomics, University of California Los Angeles, Los Angeles, CA 90095-1570, USA
|
48
|
Kim N, Shin S, Lee S. ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res 2005; 15:566-76. [PMID: 15805497 PMCID: PMC1074371 DOI: 10.1101/gr.3030405] [Citation(s) in RCA: 78] [Impact Index Per Article: 3.9] [Indexed: 12/28/2022]
Abstract
With the availability of the human genome map and fast algorithms for sequence alignment, genome-based EST clustering became a viable method for gene modeling. We developed a novel gene-modeling method, ECgene (Gene modeling by EST Clustering), which combines genome-based EST clustering and the transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The position of splice sites (i.e., exon-intron boundaries) in the genome map is utilized as the critical information in the whole procedure. Sequences that share any splice sites are grouped together to define an EST cluster in a manner similar to that of the genome-based version of the UniGene algorithm. Transcript assembly is achieved using graph theory that represents the exon connectivity in each cluster as a directed acyclic graph (DAG). Distinct paths along exons correspond to possible gene models encompassing all alternative splicing events. EST sequences in each cluster are subclustered further according to the compatibility with gene structure of each splice variant, and they can be regarded as clone evidence for the corresponding isoform. The reliability of each isoform is assessed from the nature of cluster members and from the minimum number of clones required to reconstruct all exons in the transcript.
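The transcript-assembly idea described in this abstract, where exon connectivity forms a directed acyclic graph and each distinct path is a candidate gene model, can be sketched as follows. This is an illustrative sketch only, not the ECgene implementation: the function name `enumerate_isoforms`, the exon labels, and the junction-list input format are all assumptions, and the input is assumed to be acyclic.

```python
from collections import defaultdict


def enumerate_isoforms(junctions):
    """List every start-to-end exon path in an exon-connectivity DAG.

    junctions: iterable of (upstream_exon, downstream_exon) pairs, one per
    observed splice junction. Hypothetical input format; assumes no cycles.
    """
    graph = defaultdict(list)
    nodes, has_incoming = set(), set()
    for upstream, downstream in junctions:
        graph[upstream].append(downstream)
        nodes.update((upstream, downstream))
        has_incoming.add(downstream)

    paths = []

    def walk(exon, path):
        path = path + [exon]
        successors = graph.get(exon)
        if not successors:  # terminal exon: the path is a complete model
            paths.append(path)
            return
        for nxt in sorted(successors):
            walk(nxt, path)

    # Start from exons that no junction leads into (candidate first exons).
    for start in sorted(nodes - has_incoming):
        walk(start, [])
    return paths


# One cassette exon (E2) that can be included or skipped.
for model in enumerate_isoforms([("E1", "E2"), ("E2", "E3"), ("E1", "E3")]):
    print(" -> ".join(model))
```

For the cassette-exon example, the sketch yields two candidate models, one including E2 and one skipping it. Exhaustive path enumeration can grow exponentially with the number of alternative events, so a practical tool would need to prune paths using supporting clone evidence, as the abstract indicates.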
Affiliation(s)
- Namshin Kim
- Division of Molecular Life Sciences, Ewha Womans University, Seoul 120-750, Korea
|
49
|
Abstract
MOTIVATION: We introduce GMAP, a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich DP for splice site detection, and microexon identification with statistical significance testing. RESULTS: On a set of human messenger RNAs with random mutations at a 1 and 3% rate, GMAP identified all splice sites accurately in over 99.3% of the sequences, which was one-tenth the error rate of existing programs. On a large set of human expressed sequence tags, GMAP provided higher-quality alignments more often than blat did. On a set of Arabidopsis cDNAs, GMAP performed comparably with GeneSeqer. In these experiments, GMAP demonstrated a several-fold increase in speed over existing programs. AVAILABILITY: Source code for gmap and associated programs is available at http://www.gene.com/share/gmap SUPPLEMENTARY INFORMATION: http://www.gene.com/share/gmap.
Affiliation(s)
- Thomas D Wu
- Department of Bioinformatics Genentech, Inc., South San Francisco, CA 94080, USA.
|
50
|
|